Do you mean you're trying to verify at run-time the limitations about cache bank conflicts, or whether load-pair helps overcome them? There are VTune events associated with those. Collecting raw PMU events sounds even more difficult to correlate with your stated goal, and of less than academic interest.
To the best of my knowledge, bundle pairs that contain load instructions can be classified into 3 categories, like this:
1. The bundle pair contains only one load instruction.
2. The bundle pair contains two load instructions, AND they do not conflict on an L2D cache bank, so the two requests are likely to be satisfied simultaneously; they are completely in parallel.
3. The bundle pair contains two load instructions, BUT they have a bank conflict, so the two requests would be partly in parallel or not in parallel at all.
What I would like to measure is the real degree of parallelism that the L2D cache achieves at run-time. I think an estimate would look like this: degree of parallelism = number of load insts / (number of bundle pairs that the loads are distributed over + number of bundle pairs with a load-pair conflict). This assumes that when a bank conflict occurs, we cannot get any parallelism at all.
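As a rough sketch of the estimate above, the calculation could be written as follows. The function name, the parameter names, and the sample counts are all hypothetical; they are not pfmon or VTune event names, and the three counts would have to come from whatever events turn out to be available.

```python
def degree_of_parallelism(num_loads, pairs_with_loads, pairs_with_conflict):
    """Estimate L2D load parallelism from three raw counts.

    num_loads           -- number of executed load instructions
    pairs_with_loads    -- number of bundle pairs the loads are distributed over
    pairs_with_conflict -- number of bundle pairs with a load-pair bank conflict

    Assumption (from the post): a bank conflict yields no parallelism at all,
    so each conflicting bundle pair is charged one extra access slot.
    """
    denominator = pairs_with_loads + pairs_with_conflict
    if denominator == 0:
        raise ValueError("no bundle pairs containing loads were counted")
    return num_loads / denominator


# Made-up counts, for illustration only (not measured data).
estimate = degree_of_parallelism(
    num_loads=150, pairs_with_loads=100, pairs_with_conflict=20
)
print(f"estimated degree of parallelism: {estimate:.2f}")
```

With no conflicts and two loads in every bundle pair the estimate is 2.0; with one load per bundle pair it is 1.0.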
I'm using the pfmon tool, so any related event name is OK. Maybe I will check the VTune feature first.
As I recall, when there is a bank conflict, the 2nd instruction is retried several cycles later, repeatedly, delaying the pipeline, until it succeeds. So, it is painfully evident, if one is counting instructions. When the compiler is able to see that data are in the same bank, it should schedule loads in different cycles, or combine them in a load-pair. Loads can't cross a bank boundary, so there is no such thing as partly in parallel, unless you mean when they are scheduled correctly to issue on consecutive cycles, so they overlap.
Hi all, I'm using a dual-core Itanium2 processor. The processor can issue multiple load instructions within 1 cycle (i.e. in one bundle pair). I would like to find out some numbers:
1. how many executed bundle pairs contain only one load instruction, and
2. how many executed bundle pairs contain more than one load instruction.
I do this because I want to find out the degree of parallelism of L2D cache. How can I get them using the PMU on my processor?
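To make the two numbers concrete, here is a minimal sketch of the tally. It assumes a hypothetical input: a list with the number of load instructions in each executed bundle pair (for example, from a simulator trace or from annotated disassembly). The PMU does not hand out such a list directly, so this only illustrates what is being counted.

```python
def tally_bundle_pairs(loads_per_pair):
    """Return (pairs with exactly one load, pairs with more than one load).

    loads_per_pair is a hypothetical sequence holding the load count of
    each executed bundle pair; pairs with zero loads are ignored.
    """
    one_load = 0
    multi_load = 0
    for n in loads_per_pair:
        if n == 1:
            one_load += 1
        elif n > 1:
            multi_load += 1
    return one_load, multi_load


# Made-up trace, for illustration only.
single, multiple = tally_bundle_pairs([0, 1, 2, 1, 2, 0])
print(single, multiple)
```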
I haven't worked on Itanium, but from going through the Itanium manual I would probably conclude that with 2 bundles/clock there would be 2 loads/clock, and a maximum of two loads or two stores per bundle.
Also, the document quotes: "The front-end, with two levels of branch prediction, two TLBs, and a 0 cycle branch predictor, feeds two bundles of three instructions each into the instruction buffer every cycle. This 8 entry queue decouples the front-end from the back-end and delivers up to two bundles, of any alignment, to the remaining 6 stages of the pipeline."
Could you check the L2D cache behavior for bundle pairs containing LOAD instructions using the "SB_BUNPAIRS_IN" event, by performing event-based sampling (EBS) with Intel VTune?
LOAD instructions make up roughly 20% - 22% of the instructions in a given program. I would suggest analyzing the hotspots with Intel VTune and interpreting the instruction-level parallelism (ILP), especially the flow when LOAD instructions are fetched & decoded from registers. Here I mean the IF & ID stages.
You can analyze both "executed bundle pairs containing only one load instruction" and "executed bundle pairs containing more than one load instruction" by interpreting the disassembly of the hotspots, and from that conclude the "degree of parallelism of L2D cache".