I'm currently optimizing some algorithms in assembler using software prefetching.
Now I'd like to measure the effect of the changes. For various reasons I can't use VTune and have to read the performance counters manually.
I used the performance counters below on my Xeon 5130 (Intel Core architecture). But while the execution time decreases after the optimization, the L1 and L2 cache misses increase.
The input data for my algorithm is exactly 256 KiB, and I use a loop that processes 128 bytes per iteration. I prefetch the first value before the loop and unrolled the last iteration, so that all values used by the loop are prefetched without any overhead. The prefetch scheduling distance is one iteration, which seems to work fine.
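Roughly like this sketch (shown here in C with SSE intrinsics instead of the original assembler; `process_buffer` and `do_work` are hypothetical names standing in for the real 128-byte kernel):

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

void do_work(const char *block, size_t len);  /* placeholder for the real kernel */

/* Process a buffer in 128-byte steps with a prefetch distance of one
   iteration: the first block is prefetched up front, and the last
   iteration is unrolled so no prefetch is issued past the buffer end. */
void process_buffer(const char *data, size_t size)  /* size = 256 KiB here */
{
    const size_t STRIDE = 128;                /* bytes per iteration */
    size_t i;

    _mm_prefetch(data, _MM_HINT_T0);          /* prefetch the first block */

    for (i = 0; i + STRIDE < size; i += STRIDE) {
        _mm_prefetch(data + i + STRIDE, _MM_HINT_T0);  /* next iteration's block */
        do_work(data + i, STRIDE);
    }
    do_work(data + i, STRIDE);                /* unrolled last iteration, no prefetch */
}
```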
Performance counters I used:
Use | Event Num | Umask |
---|---|---|
L1 Requests | 0x40 | 0x0F |
L2 Requests | 0x2E | 0xFF |
L1 Misses | 0xCB | 0x02 |
L2 Misses | 0x24 | 0x01 |
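For reference, this is roughly how such an event/umask pair is encoded into an IA32_PERFEVTSELx MSR when programming the counters by hand (a sketch assuming the architectural perfmon layout; the actual MSR write, e.g. via /dev/cpu/N/msr on Linux, is omitted):

```c
#include <stdint.h>

/* Build an IA32_PERFEVTSELx value from an event code and unit mask
   (architectural layout: bits 0-7 event, 8-15 umask, 16 USR, 22 EN). */
static inline uint64_t perfevtsel(uint8_t event, uint8_t umask)
{
    return (uint64_t)event          /* bits 0-7:  event select     */
         | ((uint64_t)umask << 8)   /* bits 8-15: unit mask        */
         | (1ULL << 16)             /* USR: count user mode only   */
         | (1ULL << 22);            /* EN:  enable the counter     */
}

/* e.g. the L2-miss event from the table above:
   uint64_t sel = perfevtsel(0x24, 0x01); */
```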
Is there any reasonable explanation for this behavior?
Thanks in advance,
Michael
If you changed the code, are you sure the number of executed instructions in the sensitive code stayed the same? I mean that extra code will increase cycles. Use the performance counter INST_RETIRED.ANY to measure and compare them.
If you changed ONLY the data structures and improved the cache hit rates, you can compare their performance directly.
By the way, this is the VTune Performance Analyzer forum; usually we discuss problems here based on data generated by the tool.
Regards, Peter
The D-cache is what I meant.
I'm sorry, but I'm currently writing my bachelor thesis and the university doesn't provide me with VTune.
I changed nothing except adding one prefetch instruction in every iteration of the main loop, so the cache hit rates should improve.
Anyway, I measured INST_RETIRED.ANY and found that the number is lower in the algorithm with prefetching.
What does that mean?
Thank you for your help,
Michael
Edit: I also measured INST_RETIRED.LOADS and found that nearly all retired instructions are load operations. So if INST_RETIRED.LOADS decreases, does that mean fewer cache-missing loads reach retirement?
Can you please use another measurement, such as CPI = total cycles / total instructions retired, to compare?
Use the event RESOURCE_STALLS.ANY first; to dig out the root cause, try the events in the table below to find other factors.
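A sketch of such a comparison, assuming two programmable counters have already been set up as noted in the comments and RDPMC is permitted in user mode (CR4.PCE set); `kernel` is a hypothetical stand-in for the measured loop:

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdpmc */

/* Compare runs via CPI instead of raw miss counts. Assumes PMC0 is
   programmed with CPU_CLK_UNHALTED.CORE (0x3C/0x00) and PMC1 with
   INST_RETIRED.ANY (0xC0/0x00). */
double measure_cpi(void (*kernel)(void))
{
    uint64_t c0 = __rdpmc(0), i0 = __rdpmc(1);
    kernel();
    uint64_t c1 = __rdpmc(0), i1 = __rdpmc(1);
    return (double)(c1 - c0) / (double)(i1 - i0);  /* cycles per instruction */
}
```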
About Stall Events
This group contains events that monitor various stall conditions.
Symbol Name | Event Code | Description |
---|---|---|
DELAYED_BYPASS.FP | 0x19 | Delayed bypass to FP operation. |
DELAYED_BYPASS.LOAD | 0x19 | Delayed bypass to load operation. |
DELAYED_BYPASS.SIMD | 0x19 | Delayed bypass to SIMD operation. |
LOAD_BLOCK.L1D | 0x03 | Loads blocked by the L1 data cache. |
LOAD_BLOCK.OVERLAP_STORE | 0x03 | Loads that partially overlap an earlier store, or 4K aliased with a previous store. |
LOAD_BLOCK.STA | 0x03 | Loads blocked by a preceding store with unknown address. |
LOAD_BLOCK.STD | 0x03 | Loads blocked by a preceding store with unknown data. |
LOAD_BLOCK.UNTIL_RETIRE | 0x03 | Loads blocked until retirement. |
MACHINE_NUKES.MEM_ORDER | 0xC3 | Execution pipeline restart due to memory ordering conflict or memory disambiguation misprediction. |
MACHINE_NUKES.SMC | 0xC3 | Self-Modifying Code detected. |
MEM_LOAD_RETIRED.DTLB_MISS | 0xCB | Retired loads that miss the DTLB (precise event). |
MEM_LOAD_RETIRED.L1D_LINE_MISS | 0xCB | L1 data cache line missed by retired loads (precise event). |
MEM_LOAD_RETIRED.L1D_MISS | 0xCB | Retired loads that miss the L1 data cache (precise event). |
MEM_LOAD_RETIRED.L2_LINE_MISS | 0xCB | L2 cache line missed by retired loads (precise event). |
MEM_LOAD_RETIRED.L2_MISS | 0xCB | Retired loads that miss the L2 cache (precise event). |
RAT_STALLS.ANY | 0xD2 | All RAT stall cycles. |
RAT_STALLS.FLAGS | 0xD2 | Flag stall cycles. |
RAT_STALLS.FLAGS_COUNT | 0xD2 | Flag stall events. |
RAT_STALLS.FPSW | 0xD2 | FPU status word stall. |
RAT_STALLS.PARTIAL_COUNT | 0xD2 | Partial register stall events. |
RAT_STALLS.PARTIAL_CYCLES | 0xD2 | Partial register stall cycles. |
RAT_STALLS.ROB_READ_PORT | 0xD2 | ROB read port stalls cycles. |
RESOURCE_STALLS.ANY | 0xDC | Resource related stalls. |
RESOURCE_STALLS.BR_MISS_CLEAR | 0xDC | Cycles stalled due to branch misprediction. |
RESOURCE_STALLS.FPCW | 0xDC | Cycles stalled due to FPU control word write. |
RESOURCE_STALLS.LD_ST | 0xDC | Cycles during which the pipeline has exceeded load or store limit or waiting to commit all stores. |
RESOURCE_STALLS.ROB_FULL | 0xDC | Cycles during which the ROB is full. |
RESOURCE_STALLS.RS_FULL | 0xDC | Cycles during which the RS is full. |
SB_DRAIN_CYCLES | 0x04 | Cycles while stores are blocked due to store buffer drain. |
STORE_BLOCK.ORDER | 0x04 | Cycles while store is waiting for a preceding store to be globally observed. |
STORE_BLOCK.SNOOP | 0x04 | A store is blocked due to a conflict with an external or internal snoop. |
Regards, Peter
There is an interesting change ...
This is the result of a measurement with and without software prefetching (I only commented out the prefetch instruction in one case):
Event | No Prefetching | Prefetching |
---|---|---|
RESOURCE_STALLS.ANY | 239050 | 215770 |
RESOURCE_STALLS.LD_ST | 71716 | 141970 |
RESOURCE_STALLS.RS_FULL | 16638 | 76847 |
The rest are quite similar in both cases.
What could that mean? The reservation stations being full less often should mean fewer stalls from long-latency operations.
That could hint at fewer long-delay memory accesses. But why are there more load stalls then?
