@ Maria M

Ayam · ‎03-24-2014

Hello,

Can you please explain the parameters L2_RQSTS.IFETCH_MISS and L2_RQSTS.IFETCH_HIT?
I am using Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz. This system does not have L2_RQSTS.IFETCH_MISS and L2_RQSTS.IFETCH_HIT parameters.

Are these parameters equivalent to the following?
L2_RQSTS.IFETCH_MISS = L2_RQSTS.PF_MISS
L2_RQSTS.IFETCH_HIT = L2_RQSTS.PF_HIT

L2_RQSTS.PF_MISS + L2_RQSTS.PF_HIT = L2_RQSTS.ALL_PF

Regards,

Ayam · ‎03-24-2014

Moreover, I am also looking for the equivalents of the following parameters on the Intel(R) Xeon(R) CPU E5-2420:

1. MEM_LOAD_RETIRED.LLC_UNSHARED_HIT
2. MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM
3. OFFCORE_REQUESTS_OUTSTANDING.DEMAND.READ_DATA_NOT_EMPTY
4. ITLB_MISSES.ANY
5. UOPS_RETIRED.ACTIVE_CYCLES: can I calculate it as UOPS_RETIRED.TOTAL_CYCLES - UOPS_RETIRED.STALL_CYCLES = UOPS_RETIRED.ACTIVE_CYCLES?
6. DTLB_LOAD_MISSES.WALK_CYCLES: is it equal to DTLB_LOAD_MISSES.WALK_DURATION?

Sorry for asking about so many parameters; I wanted to be sure of these before running the applications.
I appreciate your help.

Thanks

Bernard · ‎03-24-2014

Short explanation taken from the documentation.

L2_RQSTS.IFETCH_HIT
	     (Event 24H, Umask 10H) Counts number of instruction fetches that hit the L2 cache.
	     L2 instruction fetches include both L1I demand misses as well as L1I instruction
	     prefetches.

L2_RQSTS.IFETCH_MISS
	     (Event 24H, Umask 20H) Counts number of instruction fetches that miss the L2 cache.
	     L2 instruction fetches include both L1I demand misses as well as L1I instruction
	     prefetches.

Ayam · ‎03-24-2014

@iliyapolak thanks for replying.

As I mentioned, I have an Intel Xeon E5-2400 family (SNB-EN) processor, which does not show the parameters L2_RQSTS.IFETCH_HIT and L2_RQSTS.IFETCH_MISS. I do have L2_RQSTS.PF_MISS and L2_RQSTS.ALL_PF. Can you please confirm that I can use L2_RQSTS.PF_MISS instead of L2_RQSTS.IFETCH_MISS, and that L2_RQSTS.IFETCH_HIT can be calculated as L2_RQSTS.ALL_PF - L2_RQSTS.PF_MISS?

Bernard · ‎03-24-2014

>>>Are these parameters equal to
L2_RQSTS.IFETCH_MISS = L2_RQSTS.PF_MISS
L2_RQSTS.IFETCH_HIT = L2_RQSTS.PF_HIT>>>

It seems that L2_RQSTS.IFETCH_MISS counts only instruction fetches that missed the L1 I-cache and then missed the L2 cache as well; this also includes L1 I-cache prefetches.

It seems that L2_RQSTS.IFETCH_HIT counts only instruction fetches that hit the L2 cache.

L2_RQSTS.PF_MISS counts L2 prefetch misses for both code and data. For example, prefetches for a large array that miss the L2 cache are counted here, together with code prefetches that miss.

L2_RQSTS.PF_HIT counts L2 prefetch hits for both code and data. For example, prefetches for a large array that hit the L2 cache are counted here, together with code prefetches that hit.

http://www.unix.com/man-page/freebsd/3/pmc.corei7/

Bernard · ‎03-24-2014

>>>Can you please confirm that I can use L2_RQSTS.PF_MISS instead of L2_RQSTS.IFETCH_MISS >>>

I am not sure about this. I only suppose that L2_RQSTS.PF_MISS could include the accumulated count of L2_RQSTS.IFETCH_MISS.

Peter_W_Intel · ‎03-24-2014

There is no L2 instruction cache miss event supported in VTune for Sandy Bridge processors.

Could we use the events MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS and L2_LINES_IN.ALL instead?

L2_LINES_IN.ALL records all L2 cache misses, so could you exclude the ones that are caused by data memory accesses?

Ayam · ‎03-25-2014

@Peter Wang, so I can find the L2 cache miss rate as L2_LINES_IN.ALL / INST_RETIRED.ANY?

Actually, I have seen one paper use the following formulas to calculate the L2-I miss rate (the L2 is unified):

L1-I miss rate = 1000 * L1I.MISSES / INST_RETIRED.ANY
L2-I miss rate = L1-I miss rate * L2_RQSTS.IFETCH_MISS / (L2_RQSTS.IFETCH_HIT + L2_RQSTS.IFETCH_MISS)

I guess I can use ICACHE.MISSES instead of L1I.MISSES. However, my system does not have the parameters L2_RQSTS.IFETCH_MISS and L2_RQSTS.IFETCH_HIT.

I would appreciate it if you could explain how I can change these formulas for an SNB-EN system.
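For reference, the two formulas from the paper can be written out as a small calculation. This is only a sketch: the function and argument names are made up for illustration, the counter values in the usage lines are invented, and which SNB-EN events should feed the hit and miss arguments is exactly the open question in this thread.

```python
def l1i_misses_per_kilo_inst(l1i_misses, inst_retired):
    """L1-I miss rate = 1000 * L1I.MISSES / INST_RETIRED.ANY."""
    return 1000.0 * l1i_misses / inst_retired


def l2i_misses_per_kilo_inst(l1i_rate, ifetch_hit, ifetch_miss):
    """L2-I miss rate = L1-I miss rate * MISS / (HIT + MISS).

    ifetch_hit / ifetch_miss stand for whatever pair of counters
    reports L2 instruction-fetch hits and misses on the target CPU.
    """
    total = ifetch_hit + ifetch_miss
    if total == 0:
        return 0.0  # no L2 instruction fetches were observed
    return l1i_rate * ifetch_miss / total


# Invented counter values, only to show how the calls chain together.
l1i_rate = l1i_misses_per_kilo_inst(l1i_misses=2_000_000,
                                    inst_retired=1_000_000_000)
l2i_rate = l2i_misses_per_kilo_inst(l1i_rate, ifetch_hit=750, ifetch_miss=250)
print(l1i_rate, l2i_rate)
```

Both results are misses per 1000 retired instructions, so the second value is the first one scaled by the fraction of L2 instruction fetches that missed.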

David_A_Intel1 · ‎03-25-2014

Hi Maria:

Have you checked out the tuning guides? There is a specific paper for Sandy Bridge EP/EX/EN processors in which the following formulas are documented:

% of cycles spent on memory access (LLC misses):
(MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS * 210) / CPU_CLK_UNHALTED.THREAD

% of cycles spent on last level cache access (2nd level misses that hit in LLC):
((MEM_LOAD_RETIRED.L3_HIT_PS * 40) + (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS * 88) + (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS * 99)) / CPU_CLK_UNHALTED.THREAD

Thresholds: Investigate if –
% cycles for LLC miss ≥ .2,
% cycles for LLC Hit ≥ .2
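As a rough sketch, the two formulas and the 0.2 threshold can be evaluated from collected event counts like this. The function names and the sample counts are invented for illustration; the cycle weights (210, 40, 88, 99) are copied from the formulas quoted above and are not otherwise verified here.

```python
def llc_miss_cycle_fraction(llc_miss, clk_unhalted, miss_cost=210):
    """Fraction of cycles spent on memory access (LLC misses)."""
    return llc_miss * miss_cost / clk_unhalted


def llc_hit_cycle_fraction(l3_hit, xsnp_hit, xsnp_hitm, clk_unhalted):
    """Fraction of cycles spent on LLC access (L2 misses that hit in LLC)."""
    weighted = l3_hit * 40 + xsnp_hit * 88 + xsnp_hitm * 99
    return weighted / clk_unhalted


def should_investigate(fraction, threshold=0.2):
    """Apply the 'investigate if >= .2' rule from the tuning guide."""
    return fraction >= threshold


# Invented event counts for a single run.
clk = 1_000_000_000
miss_frac = llc_miss_cycle_fraction(llc_miss=1_000_000, clk_unhalted=clk)
hit_frac = llc_hit_cycle_fraction(l3_hit=1_000_000, xsnp_hit=100_000,
                                  xsnp_hitm=10_000, clk_unhalted=clk)
print(miss_frac, should_investigate(miss_frac))
print(hit_frac, should_investigate(hit_frac))
```

The results are fractions of unhalted cycles, which is why the threshold is written as .2 rather than 20.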

The paper also covers other performance-impacting issues. I know you are focused on L1 and L2, but if you haven't tuned your LLC, those won't matter. There are also other, higher-impact issues that you should investigate before focusing on L1 and L2. Please review the paper for insights from our tuning experts.

Ayam · ‎03-25-2014

Thanks for your input @MrAnderson

Can you please elaborate on what you mean by "but if you haven't tuned your LLC, those won't matter"?

Moreover, I have gone through "using-intel-vtune-amplifier-xe-on-xeon-e5-family-1.0.pdf". The formulas you referred to are for the % of cycles spent on accesses. I am looking for the miss rates of the caches.

Actually, I want to use the same framework described in the paper "http://parsa.epfl.ch/~jevdjic/papers/TOCS12_Quantifying.pdf", using Intel VTune.

David_A_Intel1 · ‎03-25-2014

@Maria M., the farther you get from the processor, the more impact cache misses have. If you are suffering from LLC cache misses, tuning L1 isn't going to help. That's what I meant. You should start with LLC and work up.

I understand many people are used to tuning using cache-miss rates. However, our methodology looks at the "impact", that is, the "cost" of cache misses. Therefore, if your misses are costing you significant performance, the tool will identify those locations in your code. However, it does not show you a graph of your cache-miss rate over time, for example.

I'm not familiar with the paper you are referencing. All you can do is look at the list of available events from the Software Developer's Manual (see Chapter 19 of Volume 3B) and try to figure out which events measure what you are looking for. I don't know what they are and it is not part of our methodology. After a quick glance, there are a bunch of L2 events, but you would have to figure out what you want to count (that is, which misses/hits). Sorry, but it's going to take some work - no easy way around - and maybe some trial-and-error. :(

Peter_W_Intel · ‎03-25-2014

@ Maria M

> So I can find L2 cache miss rate by = L2_LINES_IN.ALL/INST_RETIRED.ANY?

My understanding is that you want to know the L2 instruction miss rate. As I said before, L2_LINES_IN.ALL includes all L2 misses, for both instructions and data.

L2_RQSTS.PF_MISS and L2_RQSTS.PF_HIT can be used for code prefetching, and L2_RQSTS.CODE_RD_MISS and L2_RQSTS.CODE_RD_HIT can be used for code fetching. However, I don't think their penalties will be high; the Intel® 64 and IA-32 Architectures Optimization Reference Manual doesn't mention them.

As Mr. Anderson said, "I understand many people are used to tuning using cache-miss rates. However, our methodology looks at the "impact", that is, the "cost" of cache misses". In other words, it is the L2 misses on memory accesses that have a high impact on performance.

Bernard · ‎03-26-2014

@Maria

What application are you trying to profile?

Ayam · ‎03-27-2014

I am looking at a few applications, but if I have to name one as an example, I am currently looking at the Hadoop sort application.

Bernard · ‎03-27-2014

So in your application, where large allocations are made, LLC tuning will probably matter most.

Ayam · ‎03-28-2014

Thanks for your feedback. I guess I have to dig into the documents you all mentioned to get a better understanding of the parameters.

Sorry to bother you again. @iliyapolak, you mentioned the document "http://www.unix.com/man-page/freebsd/3/pmc.corei7/".

If I am using a parameter such as UOPS_EXECUTED.CORE_ACTIVE_CYCLES, do I have to set the event, umask, and invert options myself? The document says to use Cmask=1 for active cycles and Cmask=0 for weighted cycles.

Regards,

Bernard · ‎03-29-2014

Hi Maria

Do it exactly as it is explained in that doc.

UOPS_EXECUTED.CORE_ACTIVE_CYCLES
	     (Event B1H, Umask 3FH) Counts cycles when the Uops are executing. Use Cmask=1 for
	     active cycles; Cmask=0 for weighted cycles; Use CMask=1, Invert=1 to count P0-4
	     stalled cycles; Use Cmask=1, Edge=1, Invert=1 to count P0-4 stalls.

For example: Cmask=1, Invert=1, Edge=1 to count *P0-P4 stalls.

*P0-P4 - execution Ports
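To make the Event/Umask/Cmask/Invert/Edge fields concrete, here is a minimal sketch of how they pack into a single raw event-select value. The bit positions follow the IA32_PERFEVTSELx layout in the Intel Software Developer's Manual; the function name `perfevtsel` and the choice to set the USR, OS and EN bits by default are my own assumptions for illustration. A profiler normally programs this register for you, so this is only meant to show what the options mean.

```python
def perfevtsel(event, umask, cmask=0, inv=False, edge=False,
               user=True, kernel=True, enable=True):
    """Pack the fields into an IA32_PERFEVTSELx-style 32-bit value."""
    value = event & 0xFF              # bits 0-7:   event select
    value |= (umask & 0xFF) << 8      # bits 8-15:  unit mask
    value |= int(user) << 16          # bit 16:     count in user mode
    value |= int(kernel) << 17        # bit 17:     count in kernel mode
    value |= int(edge) << 18          # bit 18:     edge detect
    value |= int(enable) << 22        # bit 22:     enable counter
    value |= int(inv) << 23           # bit 23:     invert cmask comparison
    value |= (cmask & 0xFF) << 24     # bits 24-31: counter mask
    return value


# UOPS_EXECUTED.CORE_ACTIVE_CYCLES is Event B1H, Umask 3FH in the man page.
active_cycles = perfevtsel(0xB1, 0x3F, cmask=1)
stalled_cycles = perfevtsel(0xB1, 0x3F, cmask=1, inv=True)
stall_count = perfevtsel(0xB1, 0x3F, cmask=1, inv=True, edge=True)

for name, raw in [("active cycles", active_cycles),
                  ("stalled cycles", stalled_cycles),
                  ("stalls", stall_count)]:
    print(f"{name}: {raw:#010x}")
```

The three variants differ only in the Invert and Edge bits: with Cmask=1 the counter increments on cycles with at least one uop executing, Invert flips that to cycles with none, and Edge counts the transitions into that state rather than the cycles themselves.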

Explain the parameters L2 Request Instruction Fetch