pmonday
Beginner
70 Views

NUMA Analysis

I'm trying to understand how local and remote memory accesses are affecting the performance of my application (is "local / remote" the correct term for memory on the bus near one CPU vs. memory on the bus near the other CPU?). I have a highly randomized data set, and I'm trying to optimize / localize access to the memory as part of my speedup effort. The data is highly randomized, but it is repeatedly accessed by a thread after the first time.

I found this thread (What set of events to use to profile the intra-processor and inter-processor NUMA cache coherence ov...) with some suggestions for NUMA cache coherence towards the bottom, that is helpful, but I am also looking for general memory information.

Also, some of the suggestions in that thread are dated and don't appear to exist in Update 2 (for example REMOTE_CACHE_LOCAL_HOME_HIT), or perhaps they just don't exist on my processor.

I am relatively new to VTune so I hope this doesn't appear overly naive :-) Any help you can give would be greatly appreciated.

1 Reply
Peter_W_Intel
Employee
70 Views

Hi,

First of all, I recommend this article for your reference.

If NUMA is enabled in the BIOS, the following performance events can be used:
OFFCORE_RESPONSE_0.ANY_REQUEST.LOCAL_DRAM
OFFCORE_RESPONSE_0.ANY_REQUEST.REMOTE_DRAM

The events above count memory accesses for all offcore cache-line traffic. Similar events are also available:
MEM_UNCORE_RETIRED.LOCAL_DRAM
MEM_UNCORE_RETIRED.REMOTE_DRAM
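
Once you have collected a local/remote pair of these counts, one quick sanity check is the share of DRAM accesses that went to the remote node. The snippet below is only a sketch: the function name and the sample counts are made up for illustration, and it assumes the two counts come from the same run and the same event family.

```python
def remote_dram_share(local_dram: int, remote_dram: int) -> float:
    """Fraction of DRAM accesses served by the remote node (0.0 to 1.0)."""
    total = local_dram + remote_dram
    if total == 0:
        # No DRAM accesses recorded, so there is no remote share to report.
        return 0.0
    return remote_dram / total


# Hypothetical counts, e.g. MEM_UNCORE_RETIRED.LOCAL_DRAM and
# MEM_UNCORE_RETIRED.REMOTE_DRAM from a single collection.
share = remote_dram_share(local_dram=7_500_000, remote_dram=2_500_000)
print(f"Remote share of DRAM accesses: {share:.0%}")
```

A share that stays high after you try to localize the data suggests the memory is still being allocated on (or touched first from) the other socket.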

Additionally, the article provides latency (penalty) figures for offcore memory accesses.

To evaluate the Data Latency Analysis ratio for "Remote DRAM", the formula is:
"LLC Load Driven Misses - Remote DRAM" = 275 * MEM_UNCORE_RETIRED.REMOTE_DRAM / CPU_CLK_UNHALTED.THREAD
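
As a rough illustration of the formula above, here it is applied to two raw event counts. This is a sketch only: the function name and the sample counts are hypothetical, and the 275-cycle penalty is simply the constant taken from the formula, so it may not match your processor.

```python
# Penalty in cycles for a remote DRAM access, taken from the formula above.
REMOTE_DRAM_PENALTY_CYCLES = 275


def remote_dram_ratio(remote_dram: int, unhalted_cycles: int,
                      penalty: int = REMOTE_DRAM_PENALTY_CYCLES) -> float:
    """Estimated fraction of unhalted cycles spent on remote DRAM accesses.

    remote_dram     -- count of MEM_UNCORE_RETIRED.REMOTE_DRAM
    unhalted_cycles -- count of CPU_CLK_UNHALTED.THREAD
    """
    if unhalted_cycles <= 0:
        raise ValueError("CPU_CLK_UNHALTED.THREAD count must be positive")
    return penalty * remote_dram / unhalted_cycles


# Hypothetical counts from one collection run.
ratio = remote_dram_ratio(remote_dram=2_000_000, unhalted_cycles=5_500_000_000)
print(f"LLC Load Driven Misses - Remote DRAM: {ratio:.1%}")
```

The result is an estimate of how much of the run time is attributable to remote DRAM latency, so it is most useful for comparing runs before and after a change in data placement.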

For using performance events directly (from the command line) with VTune Amplifier XE 2011 Update, please refer to this article.

Regards, Peter
