yeyangever
Beginner
31 Views

What set of events to use to profile the intra-processor and inter-processor NUMA cache coherence overhead

Hi all,
I have a program that is known to incur a lot of intra-processor (different cores in the same processor) and inter-processor (two processors in the machine) non-uniform memory access cache coherence overhead.
I want to quantify this overhead using VTune. Which specific set of events should I choose?
I am thinking of :

MEM_LOAD_RETIRED.LLC_HIT_OTHER_CORE_HIT_HITM

MEM_UNCORE_RETIRED.LOCAL_DRAM

MEM_UNCORE_RETIRED.REMOTE_DRAM

Seeking your suggestion, thank you!

5 Replies
Rob5
New Contributor II

We are looking into various aspects and hope to have information for you soon.
Thanks
Rob
Intel Support
yeyangever
Beginner

Hi Rob,
Thanks! That would be really helpful.
One erratum to my question:
We are looking for the set of events to profile in order to quantify the intra-processor (different cores in the same processor) and inter-processor (different chips/dies in the machine) non-uniform memory access cache coherence overhead.
Rob5
New Contributor II

Hello,

We are still researching the possibilities. To do this properly, could you provide some additional information? In particular, can you clarify what you mean by coherence in your original post?

Are you using NHM? There is a formula for NUMA distribution on NHM that might be of interest to you. To determine the NUMA distribution for your workload:

% Remote Access =

(OFFCORE_RESPONSE_0.DATA_IN.REMOTE_DRAM / (OFFCORE_RESPONSE_0.DATA_IN.LOCAL_DRAM + OFFCORE_RESPONSE_0.DATA_IN.REMOTE_DRAM)) * 100
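The formula above can be sketched as a small helper once the two event counts have been collected. This is only an illustration: the function name and the sample counts are hypothetical, and the zero-total guard is an added assumption, not part of the formula.

```python
def percent_remote_access(local_dram: int, remote_dram: int) -> float:
    """Return % Remote Access from the two OFFCORE_RESPONSE_0 counts.

    local_dram  : count of OFFCORE_RESPONSE_0.DATA_IN.LOCAL_DRAM
    remote_dram : count of OFFCORE_RESPONSE_0.DATA_IN.REMOTE_DRAM
    """
    total = local_dram + remote_dram
    if total == 0:
        # Assumption: report 0% when no DRAM accesses were sampled.
        return 0.0
    return remote_dram / total * 100.0


# Hypothetical counts, for illustration only.
print(percent_remote_access(local_dram=750_000, remote_dram=250_000))  # 25.0
```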

Thanks
Rob
Intel Support

yeyangever
Beginner

Hi Rob,
Thank you for the information.
Yes, I am using two Xeon E5620 processors, each with 4 cores and a 12 MB L3 cache.
By coherency overhead I mean the following:
Core 0 reads data that is held in the exclusive, shared, or modified state in Core 1's cache.
For the exclusive state, the latency is reported to be about 65 cycles if the data is in Core 1's L1, L2, or L3 cache.
For the modified state, the latency is about 80 cycles if the data is in Core 1's L1 or L2 cache, and 38 cycles if it is in the L3 cache (shared by Core 0 - Core 3).
For the shared state, it is 38 cycles.
The above is the inter-core, intra-processor (die/chip) cache coherency overhead.
But if Core 0 accesses data held in the exclusive, shared, or modified state in Core 4's cache (another chip/socket):
For the exclusive state, the latency is about 190 cycles whether the data is in Core 4's L1, L2, or L3.
For the modified state, the latency is about 105 cycles whether the data is in Core 4's L1, L2, or L3.
For the shared state, the latency is about 170 cycles whether the data is in Core 4's L1, L2, or L3.

The above is the inter-processor (die/chip) cache coherency overhead.
The data are cited from:
So my question is: what are the corresponding event names for the above six cache coherency cases?
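To make the goal concrete, here is a rough sketch of how per-case access counts could be weighted by the latencies quoted above to estimate total coherence cycles. The latencies are the ones listed in this post; the case labels, the function name, and the sample counts are hypothetical, and the sketch assumes a count for each case can be obtained somehow.

```python
# Approximate latencies in cycles, as quoted in the post above.
# Keys are (where the line lives, coherence state); labels are hypothetical.
LATENCY_CYCLES = {
    ("local", "E"): 65,        # same socket, exclusive, in L1/L2/L3
    ("local", "M_L1L2"): 80,   # same socket, modified, in L1 or L2
    ("local", "M_L3"): 38,     # same socket, modified, in shared L3
    ("local", "S"): 38,        # same socket, shared
    ("remote", "E"): 190,      # other socket, exclusive
    ("remote", "M"): 105,      # other socket, modified
    ("remote", "S"): 170,      # other socket, shared
}


def estimate_coherence_cycles(counts: dict) -> int:
    """Sum count * latency over every case present in `counts`."""
    return sum(LATENCY_CYCLES[case] * n for case, n in counts.items())


# Hypothetical access counts, for illustration only.
sample = {
    ("local", "E"): 1000,
    ("local", "M_L1L2"): 500,
    ("remote", "E"): 200,
    ("remote", "M"): 100,
}
print(estimate_coherence_cycles(sample))
```

This is a first-order estimate: it treats each access as paying the full quoted latency and ignores any overlap between outstanding misses.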


Rob5
New Contributor II

Hello,

The events that are available are listed below. Unfortunately, there are no events for some of the scenarios you have described. However, these events may still prove useful for the work you are attempting. Detailed information about each of these events can be found in the Intel VTune Amplifier XE 2011 on-line help. To access it, open Intel VTune Amplifier XE 2011 and click Help > Intel VTune Amplifier XE 2011 Help. The on-line help window will appear. Click the Index tab and enter the event name in the keyword field. The related documentation will be displayed.


Local read:

Found in requesting core's L1:

- In E state: L1D_CACHE_LD.E_STATE

- In S state: L1D_CACHE_LD.S_STATE

- In M state: L1D_CACHE_LD.M_STATE

Found in requesting core's L2:

- In E state: L2_DATA_RQSTS.DEMAND.E_STATE

- In S State: L2_DATA_RQSTS.DEMAND.S_STATE

- In M state: L2_DATA_RQSTS.DEMAND.M_STATE

Found in shared L3:

- In E or M state: MEM_LOAD_RETIRED.LLC_UNSHARED_HIT

- In S State: not available

Found in sibling core's L2:

- In E, S, or M state: MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM

Remote read:

Found in locally homed remote L3:

- In E or S state: REMOTE_CACHE_LOCAL_HOME_HIT

- In M state: not available

Found in remotely homed remote L3:

- In E, S, or M state: not available
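For scripting purposes, the list above can be captured as a lookup table. The sketch below simply transcribes the event names from this reply; the location labels, the `event_for` helper, and the use of `None` for "not available" are illustrative choices, not part of VTune.

```python
# (location, state) -> event name as listed above; None means "not available".
# Location labels are hypothetical shorthand for the headings in the list.
EVENTS = {
    ("own L1", "E"): "L1D_CACHE_LD.E_STATE",
    ("own L1", "S"): "L1D_CACHE_LD.S_STATE",
    ("own L1", "M"): "L1D_CACHE_LD.M_STATE",
    ("own L2", "E"): "L2_DATA_RQSTS.DEMAND.E_STATE",
    ("own L2", "S"): "L2_DATA_RQSTS.DEMAND.S_STATE",
    ("own L2", "M"): "L2_DATA_RQSTS.DEMAND.M_STATE",
    ("shared L3", "E"): "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT",
    ("shared L3", "M"): "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT",
    ("shared L3", "S"): None,
    ("sibling L2", "E"): "MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM",
    ("sibling L2", "S"): "MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM",
    ("sibling L2", "M"): "MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM",
    ("remote L3, local home", "E"): "REMOTE_CACHE_LOCAL_HOME_HIT",
    ("remote L3, local home", "S"): "REMOTE_CACHE_LOCAL_HOME_HIT",
    ("remote L3, local home", "M"): None,
    ("remote L3, remote home", "E"): None,
    ("remote L3, remote home", "S"): None,
    ("remote L3, remote home", "M"): None,
}


def event_for(location: str, state: str):
    """Return the event name for a scenario, or None if none is listed."""
    return EVENTS.get((location, state))


def uncovered_scenarios():
    """Return the (location, state) pairs with no available event."""
    return [key for key, name in EVENTS.items() if name is None]


print(event_for("sibling L2", "M"))
print(uncovered_scenarios())
```

Note that several entries share one event name, so those events cannot distinguish between the states they cover.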


Let us know if you need additional information or follow-up.


Thanks
Rob
Intel Support