SB/IV behavior of LDs, MOB and putting it together in code

perfwise · ‎08-11-2012

Hi,

I have found the perf events documented in previous emails to be very helpful, so thank you for providing information on their behavior.

I was looking at the stats on LD behavior and the memory ordering buffer. I have some questions on the behavior of the hardware and what the stats measured at the link below refer to:

http://redfort-software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/2011Update/lin/ug_docs/reference/index.htm#pmn/events/about_front_end_performance_tuning_events.html

1) the MOB is for STLF interactions, right?

2) how is the MOB used? Is it just for STLF?

3) I wasn't aware there was a reservation station in SB/IV. Is there one? I thought all results were sent from Sched -> EX -> LD buffer.

4) does unit mask 0x7 signify all loads executed from the scheduler?

4a) does unit mask 4 signify all loads performed from the MOB, i.e. they are getting their results from a previous STORE?

4b) does unit mask 2 signify the result of the STORE is not in the MOB yet, but waiting a cycle allows the uop to get it from the MOB?

4b*) why does 1 cycle make such a difference? what's the average STLF latency of writing the store to the MOB and then loading it back?

4c) unit mask 1, does this signify the general case of loads SC->EX which have no STORE dependency and simply get their data from the L1D?

Thanks for any clarifications.. looks like interesting stuff which can be performance eye opening.

Perfwise

Hussam_Mousa__Intel_ · ‎08-13-2012

Hi perfwise,

Thank you for your question. It will take me some time to assemble and confirm the answers for 1-3 since these are microarchitecture details and are extremely model and implementation specific.

For question 4, the link points to several Front End events and the questions are inquiring about unit masks. Can you clarify which event (Event Code) each question is referring to?

I am assuming you are referring to SNB/IVB platforms, but please confirm since I only notice the platform mentioned as part of question #3.

It would also be helpful if you let us know which collector and analysis tools you are using, and the overall goal you are pursuing, so as to help set some context for the questions.

Thanks,
Hussam

perfwise · ‎08-13-2012

I am using publicly available software as well as my personal software to measure the PMC events on various architectures. I've found it very useful for getting a perspective on how my code, dense linear algebra and other high performance codes, runs on Intel platforms.

I am also using SDE, which I've found to be very valuable in understanding the "workload" signature, independent of architecture, that my codes have. It was from this that I observed the large number of Store To Load Forwarding instances in my code and other codes.

I am currently on both SB and IB.

It's for this reason that I want to understand more clearly the MOB and its behavior as elucidated by the PMC events at the link below:

http://redfort-software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/2011Update/lin/ug_docs/reference/pmw_dp/events/load_dispatch.html

0) Looking at the micro-arch diagrams in the Opt Guide, where is the "reservation station" on SB/IB? By dispatch, is it implying from the micro-op Q?

0a) when a LD op is dispatched from the micro-op Q, does it have to look for tokens in the LdQ?

1) Does the MOB keep track of store data? How many stores can it track (equal to or fewer than the ST Q)?

2) Does it do this so as to forward ST data to loads which hit upon previous store data? If so, how much latency is saved by doing so?

3) IV doesn't have a reservation station, I believe. Right? It's a scheduler. So:

3a) LOAD DISPATCH:RS --- PMC 0x13: unit mask 0x01: does this count the number of LDs executed which bypassed the MOB and loaded data not tracked in the MOB?

3b) LOAD DISPATCH:RS_DELAY --- PMC 0x13: unit mask 0x02: what is stage 305, and what does this measure? Is it a false STLF caused by a partial match to a previous store? (I don't think so.) It's very confusing what this is measuring.

3c) LOAD DISPATCH: MOB --- I presume this is counting the # of LDs which obtained their data from the MOB because a previous store met the conditions outlined in the Opt Guide and was valid to be forwarded, right? This load still takes a LD buffer token, right?

3d) this must count all the types of load scenarios above.
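
To make the unit-mask questions concrete, here is a minimal sketch of how the event code and unit masks quoted above would be packed into a raw selector value. It assumes the usual Intel layout (event select in bits 0-7, unit mask in bits 8-15) and the Linux perf `rUUEE` raw-event syntax. The labels and the helper names `raw_config` and `perf_raw_name` are just illustrative, and the 0x04 and 0x07 values are inferred from the earlier questions rather than confirmed.

```python
# Sketch only: pack an event code and a unit mask into a raw PMC selector.
# Assumed layout: event select in bits 0-7, unit mask in bits 8-15.
EVENT_LOAD_DISPATCH = 0x13  # event code quoted in 3a/3b

# Illustrative labels. 0x01 and 0x02 are quoted in 3a/3b; 0x04 and 0x07
# are assumptions taken from the earlier unit-mask questions.
UMASKS = {
    "RS": 0x01,
    "RS_DELAY": 0x02,
    "MOB": 0x04,
    "ANY": 0x07,
}


def raw_config(event, umask):
    """Return (umask << 8) | event, each field limited to 8 bits."""
    return ((umask & 0xFF) << 8) | (event & 0xFF)


def perf_raw_name(event, umask):
    """Format the selector in the rUUEE style used for raw perf events."""
    return "r{:04x}".format(raw_config(event, umask))


if __name__ == "__main__":
    for name, umask in UMASKS.items():
        print(name, hex(raw_config(EVENT_LOAD_DISPATCH, umask)),
              perf_raw_name(EVENT_LOAD_DISPATCH, umask))
```

If the three single-bit masks really are independent sub-events, then 0x07 is just their bitwise OR, which is what question 3d is getting at.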

Thanks for any clarification..

Perfwise

perfwise · ‎08-27-2012

Is anybody going to respond to the questions above, or is it not known what the PMCs are measuring? Any help is appreciated.

I read more in the Opt Guide.. it seems that at rename you make sure you have enough tokens for the LDq, STq, ROB, etc. I've implemented those PMCs and they appear to work. I also implemented PMC 0x10.. but that doesn't appear to work on my DGEMM; I get rubbish in the results on IV. Likewise for PMC 0x12 (which looks quite useful, though).

I tried to measure some of the MOB stats, but none of the ones you document appear to work. Any help in understanding the behavior of a load, or the actions it goes through via interactions within the MOB, would be enlightening. It appears to me from the Opt Guide that all loads go through the MOB (it contains the LDq, STq, etc.), so how does a load bypass the MOB? The description of the PMC seems to need some clarification to be truly useful.

Perfwise