I'm currently optimizing some algorithms in assembler using software prefetching.
Now I'd like to measure the effect of the changes. For various reasons I can't use VTune and have to read the performance counters manually.
I used the performance counters below on my Xeon 5130 (Intel Core architecture). But while the execution time decreases after the optimization, the L1 and L2 cache misses increase.
The input data for my algorithm is exactly 256 KiB, and I use a loop that processes 128 bytes per iteration. I prefetch the first value before the loop and unrolled the last iteration, so that all values used by the loop are prefetched without any overhead. The prefetch scheduling distance is one iteration, which seems to work fine.
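Roughly like this sketch (shown here in C with SSE intrinsics instead of the original assembler; `process_buffer` and `do_work` are hypothetical names standing in for the real 128-byte kernel):

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

void do_work(const char *block, size_t len);  /* placeholder for the real kernel */

/* Process a buffer in 128-byte steps with a prefetch distance of one
   iteration: the first block is prefetched up front, and the last
   iteration is unrolled so no prefetch is issued past the buffer end. */
void process_buffer(const char *data, size_t size)  /* size = 256 KiB here */
{
    const size_t STRIDE = 128;                /* bytes per iteration */
    size_t i;

    _mm_prefetch(data, _MM_HINT_T0);          /* prefetch the first block */

    for (i = 0; i + STRIDE < size; i += STRIDE) {
        _mm_prefetch(data + i + STRIDE, _MM_HINT_T0);  /* next iteration's block */
        do_work(data + i, STRIDE);
    }
    do_work(data + i, STRIDE);                /* unrolled last iteration, no prefetch */
}
```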
Performance counters I used:
Use | Event Num | Umask |
---|---|---|
L1 Requests | 0x40 | 0x0F |
L2 Requests | 0x2E | 0xFF |
L1 Misses | 0xCB | 0x02 |
L2 Misses | 0x24 | 0x01 |
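For reference, this is roughly how such an event/umask pair is encoded into an IA32_PERFEVTSELx MSR when programming the counters by hand (a sketch assuming the architectural perfmon layout; the actual MSR write, e.g. via /dev/cpu/N/msr on Linux, is omitted):

```c
#include <stdint.h>

/* Build an IA32_PERFEVTSELx value from an event code and unit mask
   (architectural layout: bits 0-7 event, 8-15 umask, 16 USR, 22 EN). */
static inline uint64_t perfevtsel(uint8_t event, uint8_t umask)
{
    return (uint64_t)event          /* bits 0-7:  event select     */
         | ((uint64_t)umask << 8)   /* bits 8-15: unit mask        */
         | (1ULL << 16)             /* USR: count user mode only   */
         | (1ULL << 22);            /* EN:  enable the counter     */
}

/* e.g. the L2-miss event from the table above:
   uint64_t sel = perfevtsel(0x24, 0x01); */
```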
Is there any reasonable explanation for this behavior?
Thanks in advance,
Michael
If you changed the code, are you sure the number of executed instructions in the sensitive code stayed the same? I mean that extra code will increase cycles. Use the performance counter INST_RETIRED.ANY to measure and compare them.
If you changed ONLY the data structures and improved the cache hit rates, you can compare their performance directly.
By the way, this is the VTune Performance Analyzer forum; usually we discuss problems here based on data generated by the tool.
Regards, Peter
The D-cache is what I meant.
I'm sorry, but I'm currently writing my bachelor thesis and the university doesn't provide me with VTune.
I changed nothing except adding one prefetch instruction in every iteration of the main loop, so the cache hit rates should improve.
Anyway, I measured INST_RETIRED.ANY and found that the number is lower in the algorithm with prefetching.
What does that mean?
Thank you for your help,
Michael
Edit: I also measured INST_RETIRED.LOADS and found that nearly all retired instructions are load operations. So if INST_RETIRED.LOADS decreases, does that mean fewer cache-missing loads reach retirement?
Can you please use another measurement, such as CPI = total cycles / total instructions retired, to compare?
Use the event RESOURCE_STALLS.ANY first; to dig out the root cause, try the events in the table below to find other factors.
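A sketch of such a comparison, assuming two programmable counters have already been set up as noted in the comments and RDPMC is permitted in user mode (CR4.PCE set); `kernel` is a hypothetical stand-in for the measured loop:

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdpmc */

/* Compare runs via CPI instead of raw miss counts. Assumes PMC0 is
   programmed with CPU_CLK_UNHALTED.CORE (0x3C/0x00) and PMC1 with
   INST_RETIRED.ANY (0xC0/0x00). */
double measure_cpi(void (*kernel)(void))
{
    uint64_t c0 = __rdpmc(0), i0 = __rdpmc(1);
    kernel();
    uint64_t c1 = __rdpmc(0), i1 = __rdpmc(1);
    return (double)(c1 - c0) / (double)(i1 - i0);  /* cycles per instruction */
}
```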
About Stall Events
This group contains events that monitor various stall conditions.
Symbol Name | Event Code | Description |
---|---|---|
DELAYED_BYPASS.FP | 0x19 | Delayed bypass to FP operation. |
DELAYED_BYPASS.LOAD | 0x19 | Delayed bypass to load operation. |
DELAYED_BYPASS.SIMD | 0x19 | Delayed bypass to SIMD operation. |
LOAD_BLOCK.L1D | 0x03 | Loads blocked by the L1 data cache. |
LOAD_BLOCK.OVERLAP_STORE | 0x03 | Loads that partially overlap an earlier store, or 4K aliased with a previous store. |
LOAD_BLOCK.STA | 0x03 | Loads blocked by a preceding store with unknown address. |
LOAD_BLOCK.STD | 0x03 | Loads blocked by a preceding store with unknown data. |
LOAD_BLOCK.UNTIL_RETIRE | 0x03 | Loads blocked until retirement. |
MACHINE_NUKES.MEM_ORDER | 0xC3 | Execution pipeline restart due to memory ordering conflict or memory disambiguation misprediction. |
MACHINE_NUKES.SMC | 0xC3 | Self-Modifying Code detected. |
MEM_LOAD_RETIRED.DTLB_MISS | 0xCB | Retired loads that miss the DTLB (precise event). |
MEM_LOAD_RETIRED.L1D_LINE_MISS | 0xCB | L1 data cache line missed by retired loads (precise event). |
MEM_LOAD_RETIRED.L1D_MISS | 0xCB | Retired loads that miss the L1 data cache (precise event). |
MEM_LOAD_RETIRED.L2_LINE_MISS | 0xCB | L2 cache line missed by retired loads (precise event). |
MEM_LOAD_RETIRED.L2_MISS | 0xCB | Retired loads that miss the L2 cache (precise event). |
RAT_STALLS.ANY | 0xD2 | All RAT stall cycles. |
RAT_STALLS.FLAGS | 0xD2 | Flag stall cycles. |
RAT_STALLS.FLAGS_COUNT | 0xD2 | Flag stall events. |
RAT_STALLS.FPSW | 0xD2 | FPU status word stall. |
RAT_STALLS.PARTIAL_COUNT | 0xD2 | Partial register stall events. |
RAT_STALLS.PARTIAL_CYCLES | 0xD2 | Partial register stall cycles. |
RAT_STALLS.ROB_READ_PORT | 0xD2 | ROB read port stalls cycles. |
RESOURCE_STALLS.ANY | 0xDC | Resource related stalls. |
RESOURCE_STALLS.BR_MISS_CLEAR | 0xDC | Cycles stalled due to branch misprediction. |
RESOURCE_STALLS.FPCW | 0xDC | Cycles stalled due to FPU control word write. |
RESOURCE_STALLS.LD_ST | 0xDC | Cycles during which the pipeline has exceeded load or store limit or waiting to commit all stores. |
RESOURCE_STALLS.ROB_FULL | 0xDC | Cycles during which the ROB is full. |
RESOURCE_STALLS.RS_FULL | 0xDC | Cycles during which the RS is full. |
SB_DRAIN_CYCLES | 0x04 | Cycles while stores are blocked due to store buffer drain. |
STORE_BLOCK.ORDER | 0x04 | Cycles while store is waiting for a preceding store to be globally observed. |
STORE_BLOCK.SNOOP | 0x04 | A store is blocked due to a conflict with an external or internal snoop. |
Regards, Peter
There is an interesting change ...
This is the result of a measurement with and without software prefetching (I only commented out the prefetch instruction in one case):
Event | No Prefetching | Prefetching |
---|---|---|
RESOURCE_STALLS.ANY | 239050 | 215770 |
RESOURCE_STALLS.LD_ST | 71716 | 141970 |
RESOURCE_STALLS.RS_FULL | 16638 | 76847 |
The rest are quite similar in both cases.
What could that mean? The reservation stations being full less often should mean fewer stalls from long-latency operations.
That could hint at fewer long-delay memory accesses. But why are there more load stalls then?
