Hi all, I'm using a dual-core Itanium2 processor. The processor can issue multiple load instructions within one cycle (i.e., in one bundle pair). I would like to find out two numbers:
1. how many executed bundle pairs contain only one load instruction, and
2. how many executed bundle pairs contain more than one load instruction.
I want these numbers to find out the degree of parallelism of the L2D cache.
How can I get them using the PMU on my processor?
5 Replies
Do you mean you're trying to verify at run-time the limitations imposed by cache bank conflicts, or whether load-pair instructions help overcome them? There are VTune events associated with those. Collecting raw PMU events sounds even more difficult to correlate with your stated goal, and of less than academic interest.
To the best of my knowledge, bundle pairs that contain load instructions can be classified into three categories:
1. Contain only one load instruction.
2. Contain two load instructions AND they do not conflict on an L2D cache bank, so the two requests are likely to be satisfied simultaneously; they are completely in parallel.
3. Contain two load instructions BUT they have a bank conflict, so the two requests are only partly in parallel or not in parallel at all.
What I would like to measure is the real degree of parallelism that the L2D cache achieves at run-time. I think an estimate would be:
degree of parallelism = number of load instructions / (number of bundle pairs that the loads are distributed over + number of bundle pairs with a load-pair bank conflict)
This assumes that when a bank conflict occurs, we get no parallelism at all.
I'm using the pfmon tool, so any related event name would help.
Maybe I will check the VTune feature first.
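The estimate above can be written as a small helper, as a sketch only. The function and parameter names are hypothetical, and the three counts would still have to come from pfmon/VTune events or from disassembly analysis; the code just applies the proposed formula.

```python
def degree_of_parallelism(num_loads: int,
                          pairs_with_loads: int,
                          conflict_pairs: int) -> float:
    """Estimate L2D load parallelism using the formula proposed above.

    num_loads        -- executed load instructions
    pairs_with_loads -- executed bundle pairs containing at least one load
    conflict_pairs   -- executed bundle pairs whose two loads hit the same bank

    Each conflicting pair is counted a second time in the denominator,
    modelling the assumption that a conflict gives no parallelism at all.
    """
    denominator = pairs_with_loads + conflict_pairs
    if denominator == 0:
        raise ValueError("no bundle pairs with loads were counted")
    return num_loads / denominator
```

As a sanity check on the formula: if every pair holds a single load the result is 1.0; if every pair holds two non-conflicting loads it is 2.0; and if every two-load pair conflicts it falls back to 1.0, which matches the "no parallelism on conflict" assumption.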
As I recall, when there is a bank conflict, the second load is retried several cycles later, repeatedly, stalling the pipeline until it succeeds. So the effect is painfully evident if one is counting instructions. When the compiler can see that the data are in the same bank, it should schedule the loads in different cycles or combine them into a load-pair. Loads can't cross a bank boundary, so there is no such thing as "partly in parallel", unless you mean loads that are correctly scheduled to issue on consecutive cycles so that they overlap.
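To make the bank-conflict rule concrete, here is a minimal model of how two same-cycle loads can be checked for a conflict. The bank width and bank count below are ASSUMED illustrative values, not figures taken from this thread; the real L2D geometry should be checked in the processor's reference manual before using anything like this.

```python
# ASSUMPTION: illustrative L2D bank geometry only, not verified for Itanium 2.
BANK_WIDTH_BYTES = 16
NUM_BANKS = 16


def bank_of(addr: int) -> int:
    """Bank index for a byte address under the assumed interleaving."""
    return (addr // BANK_WIDTH_BYTES) % NUM_BANKS


def fits_in_one_bank(addr: int, size: int) -> bool:
    """A load may not cross a bank boundary, per the reply above."""
    return (addr % BANK_WIDTH_BYTES) + size <= BANK_WIDTH_BYTES


def loads_conflict(addr_a: int, addr_b: int) -> bool:
    """Two loads issued in the same cycle conflict if they map to one bank.

    In that case the model treats the second load as retried later,
    i.e. the pair gets no parallelism.
    """
    return bank_of(addr_a) == bank_of(addr_b)
```

With these assumed parameters, addresses 0x1000 and 0x1008 fall in the same bank and conflict, while 0x1000 and 0x1010 land in adjacent banks and can proceed together.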
Quoting - Scott
Hi all, I'm using a dual-core Itanium2 processor. The processor can issue multiple load instructions within 1 cycle (i.e. in one bundle pair). I would like to find out some numbers:
1. how many executed bundle pairs contain only one load instruction, and
2. how many executed bundle pairs contain more than one load instruction
I do this because I want to find out the degree of parallelism of L2D cache.
How can I get them using the PMU on my processor?
Also, the document quotes: "The front-end, with two levels of branch prediction, two TLBs, and a 0 cycle branch predictor, feeds two bundles of three instructions each into the instruction buffer every cycle. This 8 entry queue decouples the front-end from the back-end and delivers up to two bundles, of any alignment, to the remaining 6 stages of the pipeline."
Could you check the L2D cache behavior of bundle pairs containing LOAD instructions with the "SB_BUNPAIRS_IN" event, by performing event-based sampling (EBS) with Intel VTune?
LOAD instructions make up almost 20% to 22% of the instructions in a given program. I would suggest analyzing the hotspots with Intel VTune and interpreting the instruction-level parallelism (ILP), especially the flow as LOAD instructions are fetched and decoded. Here, I mean the IF and ID stages.
You can determine both the number of executed bundle pairs containing only one load instruction and the number containing more than one by interpreting the disassembly of the hotspots, and from those numbers derive the degree of parallelism of the L2D cache.
~BR
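The disassembly-based approach suggested above can be sketched as a simple counter over bundles. This is a deliberately simplified model: it ASSUMES bundles issue strictly two at a time in program order (real dispersal also depends on stop bits, templates and port availability), treats any mnemonic starting with "ld" as a load, and takes hypothetical per-pair execution counts (e.g. from sampling) to turn static counts into executed counts. The function names are illustrative, not part of any tool.

```python
from typing import Optional, Sequence, Tuple

Bundle = Sequence[str]  # three mnemonics, e.g. ("ld8", "add", "nop.i")


def is_load(mnemonic: str) -> bool:
    # IA-64 load mnemonics begin with "ld" (ld8, ldfd, ldfp8, ...);
    # lfetch is a prefetch and is deliberately not counted.
    return mnemonic.startswith("ld")


def count_load_pairs(bundles: Sequence[Bundle],
                     pair_exec_counts: Optional[Sequence[int]] = None
                     ) -> Tuple[int, int, int]:
    """Return (pairs_with_one_load, pairs_with_multiple_loads, total_loads).

    SIMPLIFICATION: bundle i and i+1 (i even) are assumed to issue together.
    pair_exec_counts[p] is how many times pair p executed; if omitted,
    every pair is counted once (a purely static count).
    """
    n_pairs = (len(bundles) + 1) // 2
    if pair_exec_counts is None:
        pair_exec_counts = [1] * n_pairs
    if len(pair_exec_counts) != n_pairs:
        raise ValueError("need one execution count per bundle pair")

    one = multi = total = 0
    for p, weight in enumerate(pair_exec_counts):
        pair = bundles[2 * p:2 * p + 2]
        loads = sum(1 for bundle in pair for m in bundle if is_load(m))
        total += loads * weight
        if loads == 1:
            one += weight
        elif loads > 1:
            multi += weight
    return one, multi, total
```

For example, a hot loop whose first pair holds two `ld8`s and second pair holds one `ld4`, executed 10 and 5 times respectively, yields 5 single-load pairs, 10 multi-load pairs and 25 loads; those totals could then feed the degree-of-parallelism estimate discussed earlier in the thread.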
Yes, thanks very much, all.
I think I must do that with the help of binary code analysis.