Scott1
Beginner
107 Views

Load instructions on dual-core Itanium2 Processor

Hi all, I'm using a dual-core Itanium2 processor. The processor can issue multiple load instructions within one cycle (i.e. in one bundle pair). I would like to find out two numbers:

1. how many executed bundle pairs contain only one load instruction, and
2. how many executed bundle pairs contain more than one load instruction

I want these numbers in order to estimate the degree of parallelism achieved by the L2D cache.
How can I get them using the PMU on my processor?
5 Replies
TimP
Black Belt

Do you mean you're trying to verify at run-time the limitations about cache bank conflicts, or whether load-pair helps overcome them? There are VTune events associated with those. Collecting raw PMU events sounds even more difficult to correlate with your stated goal, and of less than academic interest.
Scott1
Beginner

To the best of my knowledge, bundle pairs that contain load instructions can be classified into 3 categories:

1. The pair contains only one load instruction.
2. The pair contains two load instructions AND they do not conflict on an L2D cache bank, so the two requests are likely to be satisfied simultaneously, i.e. completely in parallel.
3. The pair contains two load instructions BUT they have a bank conflict, so the two requests would be served only partly in parallel or not in parallel at all.

What I would like to measure is the real degree of parallelism that the L2D cache achieves at run-time. I think an estimate would look like this:
degree of parallelism = number of load insts / (number of bundle pairs that the loads are distributed over + number of bundle pairs with a load-pair conflict)
Here we assume that when a bank conflict occurs, we do not get any parallelism at all.
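As a minimal sketch, the estimate above can be written out in a few lines of Python. The function names and the sample counts are hypothetical and only illustrate the arithmetic; they are not PMU event names or measured values. The second helper assumes each bundle pair in categories 2 and 3 holds exactly two loads.

```python
def l2d_parallelism(num_loads, pairs_with_loads, pairs_with_conflict):
    """Estimate from the post: loads / (pairs holding loads + conflicting pairs)."""
    denom = pairs_with_loads + pairs_with_conflict
    if denom == 0:
        raise ValueError("no bundle pairs containing loads were counted")
    return num_loads / denom


def l2d_parallelism_from_categories(single, dual_no_conflict, dual_conflict):
    """Same estimate, starting from the three bundle-pair categories.

    Assumption: a 'dual' bundle pair holds exactly two loads.
    """
    num_loads = single + 2 * (dual_no_conflict + dual_conflict)
    pairs_with_loads = single + dual_no_conflict + dual_conflict
    return l2d_parallelism(num_loads, pairs_with_loads, dual_conflict)


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    print(l2d_parallelism_from_categories(single=40,
                                          dual_no_conflict=50,
                                          dual_conflict=10))
```

Under this formula the result is 1.0 when every bundle pair has a single load or every dual-load pair conflicts, and 2.0 when every bundle pair has two non-conflicting loads.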

I'm using the pfmon tool, so any related event name would help.
Maybe I will check the VTune features first.
TimP
Black Belt

As I recall, when there is a bank conflict, the 2nd instruction is retried several cycles later, repeatedly, delaying the pipeline, until it succeeds. So, it is painfully evident, if one is counting instructions. When the compiler is able to see that data are in the same bank, it should schedule loads in different cycles, or combine them in a load-pair. Loads can't cross a bank boundary, so there is no such thing as partly in parallel, unless you mean when they are scheduled correctly to issue on consecutive cycles, so they overlap.
srimks
New Contributor II

Quoting - Scott
Hi all, I'm using a dual-core Itanium2 processor. The processor can issue multiple load instructions within one cycle (i.e. in one bundle pair). I would like to find out two numbers:

1. how many executed bundle pairs contain only one load instruction, and
2. how many executed bundle pairs contain more than one load instruction

I do this because I want to find out the degree of parallelism of L2D cache.
How can I get them using the PMU on my processor?
I haven't worked on Itanium, but going through the Itanium manual I would conclude that with 2 bundles/clock there would be 2 loads/clock, with a maximum of two loads or two stores per bundle.

Also, the document states: "The front-end, with two levels of branch prediction, two TLBs, and a 0 cycle branch predictor, feeds two bundles of three instructions each into the instruction buffer every cycle. This 8 entry queue decouples the front-end from the back-end and delivers up to two bundles, of any alignment, to the remaining 6 stages of the pipeline."

Could you check the L2D cache behavior of bundle pairs containing LOAD instructions by running event-based sampling (EBS) in Intel VTune with the "SB_BUNPAIRS_IN" event?

LOAD instructions make up roughly 20% to 22% of the instructions in a given program. I would suggest analyzing the hotspots with Intel VTune and interpreting the instruction-level parallelism (ILP), especially the flow when LOAD instructions are fetched and decoded from registers. By that I mean the IF and ID stages.

You can determine both "executed bundle pairs containing only one load instruction" and "executed bundle pairs containing more than one load instruction" by interpreting the disassembly of the hotspots, and from that conclude the "degree of parallelism of L2D cache".
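A rough sketch of that disassembly-based counting, in Python, might look like the following. It assumes the disassembly has already been parsed into bundle pairs, each given as a list of bare mnemonic strings (a hypothetical input format, not the output of any particular tool), and it treats any mnemonic starting with "ld" (ld1/ld2/ld4/ld8, ldf, ldfp and their completers) as a load. Note that this counts static bundle pairs; to get executed counts each pair would still have to be weighted by how often it runs, for example from sampling data.

```python
from collections import Counter


def count_loads(bundle_pair):
    """Count mnemonics in one bundle pair that look like IA-64 loads."""
    return sum(1 for mnemonic in bundle_pair
               if mnemonic.lower().startswith("ld"))


def classify_bundle_pairs(bundle_pairs):
    """Tally bundle pairs by whether they hold zero, one, or several loads."""
    tally = Counter()
    for pair in bundle_pairs:
        loads = count_loads(pair)
        if loads == 0:
            tally["no_load"] += 1
        elif loads == 1:
            tally["one_load"] += 1
        else:
            tally["multi_load"] += 1
    return tally


if __name__ == "__main__":
    # Hypothetical, hand-written bundle pairs (six slots each).
    sample = [
        ["ld8", "add", "nop.i", "ld8", "cmp.eq", "br.cond"],
        ["ld4", "add", "nop.i", "st8", "shl", "nop.b"],
        ["add", "sub", "nop.i", "nop.m", "nop.i", "nop.b"],
    ]
    print(dict(classify_bundle_pairs(sample)))
```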

~BR
Scott1
Beginner

Yes, thanks very much, all.
I think I will have to do that with the help of binary code analysis.