Weird behavior - L1 and L2 Cache Misses

xift
Beginner
Hi,
I'm currently optimizing some algorithms in assembly using software prefetching.

Now I'd like to measure the effect of the changes. For various reasons I can't use VTune and have to read the performance counters manually.
I used the performance counters below on my Xeon 5130 (Intel Core microarchitecture). However, while execution time decreases after the optimization, the L1 and L2 cache miss counts increase.

The input data for my algorithm is exactly 256 KB, and I use a loop that processes 128 bytes per iteration. I prefetch the first block before the loop and unrolled the last iteration out of the loop, so all data used by the loop is prefetched without any overhead. The prefetch scheduling distance is one iteration, which seems to work fine.
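For readers who want to see the loop structure described above, here is a minimal C sketch. It is not the original assembly: the names `process_block`, `prefetch_block` and `sum_with_prefetch` are hypothetical, the per-block work (a byte sum) is a placeholder, and it assumes 64-byte cache lines and the GCC/Clang builtin `__builtin_prefetch`, so it issues one hint per cache line of each 128-byte block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_BYTES 128u  /* data consumed per loop iteration */
#define LINE_BYTES   64u  /* assumed cache line size */

/* Hypothetical per-block work: sum the bytes of one 128-byte block. */
static uint64_t process_block(const uint8_t *p)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < BLOCK_BYTES; i++)
        sum += p[i];
    return sum;
}

/* Hint both cache lines of one block (read access, keep in all levels). */
static void prefetch_block(const uint8_t *p)
{
    __builtin_prefetch(p, 0, 3);
    __builtin_prefetch(p + LINE_BYTES, 0, 3);
}

/* Sum `len` bytes; `len` must be a non-zero multiple of BLOCK_BYTES. */
uint64_t sum_with_prefetch(const uint8_t *data, size_t len)
{
    assert(len >= BLOCK_BYTES && len % BLOCK_BYTES == 0);
    size_t blocks = len / BLOCK_BYTES;
    uint64_t sum = 0;

    prefetch_block(data);  /* first block, issued before the loop */
    for (size_t b = 0; b + 1 < blocks; b++) {
        /* Scheduling distance of one iteration: hint the next block. */
        prefetch_block(data + (b + 1) * BLOCK_BYTES);
        sum += process_block(data + b * BLOCK_BYTES);
    }
    /* Last iteration handled outside the loop: nothing left to prefetch. */
    sum += process_block(data + (blocks - 1) * BLOCK_BYTES);
    return sum;
}
```

With a 256 KB input this runs 2048 iterations in total, and no prefetch ever points past the end of the buffer.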

Performance counters I used:

Use          Event num  Umask
L1 Requests  0x40       0x0F
L2 Requests  0x2E       0xFF
L1 Misses    0xCB       0x02
L2 Misses    0x24       0x01
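The thread does not say how the counters were programmed. As an illustration only, an event number and umask pair like the ones above is typically packed into an IA32_PERFEVTSELx value as sketched below. The bit positions follow the architectural layout in the Intel SDM; the helper name `evtsel_encode` and the choice of flags are assumptions, and actually writing the register needs ring-0 access, which is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Selected IA32_PERFEVTSELx control bits (architectural layout). */
#define EVTSEL_USR (1u << 16)  /* count while in user mode (CPL > 0) */
#define EVTSEL_OS  (1u << 17)  /* count while in kernel mode (CPL 0) */
#define EVTSEL_EN  (1u << 22)  /* enable the counter */

/* Hypothetical helper: event select in bits 0-7, unit mask in bits 8-15. */
uint32_t evtsel_encode(uint8_t event, uint8_t umask, uint32_t flags)
{
    return (uint32_t)event | ((uint32_t)umask << 8) | flags;
}
```

For example, `evtsel_encode(0xCB, 0x02, EVTSEL_USR | EVTSEL_EN)` yields `0x4102CB` for the "L1 Misses" row, counting user-mode activity only.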

Is there any reasonable explanation for this behavior?

Thanks in advance,
Michael
Peter_W_Intel
Employee
Did you mean the D-cache, not the I-cache? Is it serial code?

If you changed the code, are you sure that the number of executed instructions in the hot code is the same? Extra code will increase the cycle count. Use the performance counter INST_RETIRED.ANY to measure both versions and compare them.

If you changed ONLY the data structure and improved cache hit rates, you can compare their performance directly.

By the way, this is the VTune Performance Analyzer forum; we usually discuss problems based on data generated by the tools.


Regards, Peter
xift
Beginner
Thanks for your fast answer.
D-Cache is what I meant.
I am sorry, but I'm currently writing my bachelor's thesis and the university doesn't provide me with VTune.

I changed nothing except adding one prefetch instruction to every iteration of the main loop, so cache hit rates should improve.
Anyway, I measured INST_RETIRED.ANY and found that the count is lower for the version with prefetching.
What does that mean?

Thank you for your help,
Michael

Edit: I also measured INST_RETIRED.LOADS and found that nearly all retired instructions are load operations. INST_RETIRED.LOADS decreases with prefetching, so does that mean fewer cache misses are holding up retirement?
Peter_W_Intel
Employee
Hi Michael,

Can you please use another measure, such as CPI = total cycles / total instructions retired, to compare?

Start with the event RESOURCE_STALLS.ANY; then, to dig out the root cause, try the events in the table below to find other factors.
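As a small illustration of the CPI comparison suggested here, the sketch below divides a cycle count by a retired-instruction count. The struct and function names are hypothetical. It assumes the two raw values were read for the same code region, for example from an unhalted core cycle counter and from INST_RETIRED.ANY.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for one measurement of a code region. */
struct counter_sample {
    uint64_t cycles;        /* total unhalted core cycles */
    uint64_t instructions;  /* total instructions retired */
};

/* CPI = total cycles / total instructions retired. */
double cpi(struct counter_sample s)
{
    assert(s.instructions != 0);
    return (double)s.cycles / (double)s.instructions;
}
```

A lower CPI for the prefetching version would indicate that each instruction costs fewer cycles on average, regardless of how the raw miss counts moved.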


About Stall Events

This group contains events that monitor various stall conditions.

Symbol Name                     Event Code  Description
DELAYED_BYPASS.FP               0x19        Delayed bypass to FP operation.
DELAYED_BYPASS.LOAD             0x19        Delayed bypass to load operation.
DELAYED_BYPASS.SIMD             0x19        Delayed bypass to SIMD operation.
LOAD_BLOCK.L1D                  0x03        Loads blocked by the L1 data cache.
LOAD_BLOCK.OVERLAP_STORE        0x03        Loads that partially overlap an earlier store, or 4K aliased with a previous store.
LOAD_BLOCK.STA                  0x03        Loads blocked by a preceding store with unknown address.
LOAD_BLOCK.STD                  0x03        Loads blocked by a preceding store with unknown data.
LOAD_BLOCK.UNTIL_RETIRE         0x03        Loads blocked until retirement.
MACHINE_NUKES.MEM_ORDER         0xC3        Execution pipeline restart due to memory ordering conflict or memory disambiguation misprediction.
MACHINE_NUKES.SMC               0xC3        Self-Modifying Code detected.
MEM_LOAD_RETIRED.DTLB_MISS      0xCB        Retired loads that miss the DTLB (precise event).
MEM_LOAD_RETIRED.L1D_LINE_MISS  0xCB        L1 data cache line missed by retired loads (precise event).
MEM_LOAD_RETIRED.L1D_MISS       0xCB        Retired loads that miss the L1 data cache (precise event).
MEM_LOAD_RETIRED.L2_LINE_MISS   0xCB        L2 cache line missed by retired loads (precise event).
MEM_LOAD_RETIRED.L2_MISS        0xCB        Retired loads that miss the L2 cache (precise event).
RAT_STALLS.ANY                  0xD2        All RAT stall cycles.
RAT_STALLS.FLAGS                0xD2        Flag stall cycles.
RAT_STALLS.FLAGS_COUNT          0xD2        Flag stall events.
RAT_STALLS.FPSW                 0xD2        FPU status word stall.
RAT_STALLS.PARTIAL_COUNT        0xD2        Partial register stall events.
RAT_STALLS.PARTIAL_CYCLES       0xD2        Partial register stall cycles.
RAT_STALLS.ROB_READ_PORT        0xD2        ROB read port stalls cycles.
RESOURCE_STALLS.ANY             0xDC        Resource related stalls.
RESOURCE_STALLS.BR_MISS_CLEAR   0xDC        Cycles stalled due to branch misprediction.
RESOURCE_STALLS.FPCW            0xDC        Cycles stalled due to FPU control word write.
RESOURCE_STALLS.LD_ST           0xDC        Cycles during which the pipeline has exceeded load or store limit or waiting to commit all stores.
RESOURCE_STALLS.ROB_FULL        0xDC        Cycles during which the ROB is full.
RESOURCE_STALLS.RS_FULL         0xDC        Cycles during which the RS is full.
SB_DRAIN_CYCLES                 0x04        Cycles while stores are blocked due to store buffer drain.
STORE_BLOCK.ORDER               0x04        Cycles while store is waiting for a preceding store to be globally observed.
STORE_BLOCK.SNOOP               0x04        A store is blocked due to a conflict with an external or internal snoop.

Good luck!

Regards, Peter
xift
Beginner
Thanks,

There is an interesting change ...
These are the results of a measurement with and without software prefetching (the only difference is that I commented out the prefetch instruction in one case):
Event                    NO PREFETCHING  PREFETCHING

RESOURCE_STALLS.ANY      239050          215770
RESOURCE_STALLS.LD_ST    71716           141970
RESOURCE_STALLS.RS_FULL  16638           76847

The rest is quite similar in both.

What could that mean?
Reservation stations being full less often should mean there are fewer stalls caused by long-latency operations.
This could hint at fewer long-latency memory accesses. But why are there more load stalls then?
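To make the measured counts easier to compare, the following sketch computes relative changes and shares directly from the numbers posted above. The struct and function names are hypothetical, and the sub-event counts are not assumed to add up to RESOURCE_STALLS.ANY.

```c
#include <assert.h>
#include <stdint.h>

/* The three stall counts reported in this post. */
struct stall_counts {
    uint64_t any;      /* RESOURCE_STALLS.ANY */
    uint64_t ld_st;    /* RESOURCE_STALLS.LD_ST */
    uint64_t rs_full;  /* RESOURCE_STALLS.RS_FULL */
};

static const struct stall_counts no_prefetch   = { 239050,  71716, 16638 };
static const struct stall_counts with_prefetch = { 215770, 141970, 76847 };

/* Fraction `part / whole`, e.g. LD_ST stalls relative to all stalls. */
double ratio(uint64_t part, uint64_t whole)
{
    assert(whole != 0);
    return (double)part / (double)whole;
}

/* Signed relative change from `before` to `after` (0.10 means +10%). */
double relative_change(uint64_t before, uint64_t after)
{
    assert(before != 0);
    return ((double)after - (double)before) / (double)before;
}
```

On these numbers, RESOURCE_STALLS.ANY falls by roughly 10% with prefetching, RESOURCE_STALLS.LD_ST roughly doubles, and RESOURCE_STALLS.RS_FULL rises to about 4.6 times its previous value.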