
Computing bandwidth in KNL FLAT mode

jang__jaeyoung
Beginner

Hello,

I saw the method for computing "MCDRAM bandwidth in Cache Mode" in the Intel Xeon Phi Processor Performance Monitoring Reference Manual, Vol. 2 (Section 3, 34), like this:

MCDRAM Cache read bandwidth 

  = (ECLK_Events_RPQ_Inserts - UCLK_Events_EDC_Hit/Miss_MISS_CLEAN - UCLK_Events_EDC_Hit/Miss_MISS_DIRTY) * 64 / Time

MCDRAM Cache write bandwidth

  = (ECLK_Events_WPQ_Inserts - DCLK_Events_CAS_Reads) * 64 / Time

(ref: https://software.intel.com/en-us/articles/intel-xeon-phi-x200-family-processor-performance-monitoring-reference-manual)

In Flat mode, is it correct to use just RPQ_Inserts or WPQ_Inserts to calculate read or write bandwidth, dropping the eviction terms (MISS_DIRTY, MISS_CLEAN) and CAS_Reads?

(e.g., Flat read BW = ECLK_Events_RPQ_Inserts * 64 / Time)

 

Thanks,

Jaey

5 Replies
jimdempseyatthecove
Honored Contributor III

When the MCDRAM is configured as non-cache (Flat) memory, it appears as RAM in 1/2/4 NUMA node(s) above the nodes used by the logical processors for computation. You can perform your test data array allocations from these nodes, or use the High Bandwidth Memory allocation routines. Properly allocated, you would then use the standard DRAM-related counters. Note that in this configuration, normal heap allocations come from the external DRAM.

That said, there is some disagreement as to whether the normal heap resides in HBM or external DRAM; there may be a property of your heap manager that can be used to specify your preference.
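For the MCDRAM-allocation route, a minimal sketch, assuming the memkind library's hbwmalloc interface (hbw_malloc/hbw_free, linked with -lmemkind) is what is meant by the High Bandwidth Memory allocation routines:

/* Minimal sketch: place a test array in MCDRAM via memkind's hbwmalloc
 * interface, then fault the pages in so they are actually allocated there.
 * Compile with: gcc hbw_test.c -lmemkind                                  */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1ull << 26;                    /* 64 Mi doubles = 512 MiB */

    if (hbw_check_available() != 0) {         /* 0 means HBW nodes exist */
        fprintf(stderr, "no high-bandwidth memory nodes found\n");
        return 1;
    }

    double *a = hbw_malloc(n * sizeof *a);    /* backed by MCDRAM node(s) */
    if (a == NULL)
        return 1;

    for (size_t i = 0; i < n; i++)            /* first touch places pages */
        a[i] = 1.0;

    hbw_free(a);
    return 0;
}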

Jim Dempsey

 

jang__jaeyoung
Beginner

Thanks. 

For the NUMA configuration, I suppose I will use the all-to-all cluster mode, with the numactl command in Linux to bind heap allocations to MCDRAM.
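For example (the node number here is an assumption; on a single-socket KNL in Flat mode the MCDRAM usually appears as node 1, shown by numactl --hardware as a node with memory but no CPUs, and ./my_benchmark is a placeholder binary):

  numactl --hardware                      # confirm which node(s) hold the MCDRAM
  numactl --membind=1 ./my_benchmark      # bind all heap allocations to node 1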

jimdempseyatthecove
Honored Contributor III

On my system using SNC-4 + Cache, I see 4 NUMA nodes and infer LLC == 16 GiB / 4 per node.
With SNC-4 and MCDRAM configured as addressable (Flat), I see 8 NUMA nodes and infer LLC == 0.

Actually the determination of LLC is a little more complicated. Currently I know of no way to determine on-board MCDRAM size other than by indirect means. Somewhat rough pseudo code:

nNUMAnodes = someFunctionToGetNumaNodes
bool NUMAnodesComputing[nNUMAnodes] // array of flags
NUMAnodesComputing = false // set all to false
StartParallelRegionAllThreads
  NUMAnodesComputing[myPreferredNUMAnode] = true // overstrike with true
EndParallelRegion
nComputingNUMANodes = CountTrue(NUMAnodesComputing)
// above assumes KNL packs computing nodes in low range, fastmem in high range
LLCsize = apiToGetCacheSize(3)
if(LLCsize == 0)
  // may be KNL or CPU without L3 cache (KNL currently has no L3)
  if(nNUMAnodes > nComputingNUMANodes)
    // assume KNL with 16GiB MCDRAM
    NUMAallocatableRAM = SUM(NUMAnodeAvailableRAM where not(NUMAnodesComputing))
    LLCsize = max(16*1024*1024*1024 - NUMAallocatableRAM, 0) // guess
  else
    if(CoreCount >= KNLminimumCoreCount)
      // assume KNL with 16GiB MCDRAM
      LLCsize = 16*1024*1024*1024
      // or you can use the hbm_... library to obtain total size
    endif
  endif
endif 

The actual code I use relies on CPUID to make these determinations; that is left as an exercise for you. A rough approximation is sketched below.
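A runnable sketch of the node-classification step, assuming libnuma and OpenMP in place of the CPUID approach (an assumption, not the actual code): it flags the NUMA nodes that compute threads land on and totals the capacity of the remaining memory-only nodes, which in Flat mode should be the MCDRAM.

/* Classify NUMA nodes as "computing" vs memory-only and total the latter.
 * Compile with: gcc -fopenmp numa_classify.c -lnuma                       */
#define _GNU_SOURCE                     /* for sched_getcpu() */
#include <numa.h>                       /* numa_available, numa_node_of_cpu */
#include <sched.h>
#include <stdio.h>

#define MAX_NODES 64

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    int nnodes = numa_num_configured_nodes();
    if (nnodes > MAX_NODES)
        nnodes = MAX_NODES;

    char computing[MAX_NODES] = {0};    /* flag: node hosts a compute thread */

    /* "Overstrike with true": every thread flags the node of the CPU it is
     * running on. The data race is benign since all writers store 1.       */
    #pragma omp parallel
    {
        int node = numa_node_of_cpu(sched_getcpu());
        if (node >= 0 && node < nnodes)
            computing[node] = 1;
    }

    /* Nodes with memory but no compute threads are, per the assumption in
     * the pseudocode, the MCDRAM-only nodes; total their capacity.         */
    long long mcdram_bytes = 0;
    int ncomputing = 0;
    for (int node = 0; node < nnodes; node++) {
        long long freebytes;
        long long size = numa_node_size64(node, &freebytes);
        if (computing[node])
            ncomputing++;
        else if (size > 0)
            mcdram_bytes += size;
    }

    printf("%d NUMA nodes, %d computing, ~%lld MiB in memory-only nodes\n",
           nnodes, ncomputing, mcdram_bytes >> 20);
    return 0;
}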

Jim Dempsey

McCalpinJohn
Honored Contributor III

To answer the original question: 

  • Your interpretation is correct...
    • In Flat Mode, MCDRAM read bandwidth is simply 64*RPQ.INSERTS and MCDRAM write bandwidth is simply 64*WPQ.INSERTS.  
  • The performance monitoring guide should be updated to make this more clear... 
    • (While they are at it, it would be really nice if they re-wrote the descriptions of what all the MCDRAM events count in cache and hybrid mode -- the existing descriptions have dangling clauses that make multiple interpretations possible.)

It is likely that streaming/non-temporal stores will sometimes be split, and generate extra (sub-cache-line) transactions.  There are performance events to count full and partial streaming stores, but I have not checked to see if they work -- I have never seen more than about 1% over-counting due to partial streaming stores, and that is well within my margin of "don't care".  The 64-Byte AVX-512 stores on KNL should make split streaming stores even more rare than they have been in the past. 

Read counts are typically higher than manual bulk data transfer estimates by 3%-5% (sometimes up to 7%-8%). When I have looked at this in the past, I have found most of the difference to be explainable by TLB refills, but some is likely due to the OS scheduler interrupt handler.
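A minimal sketch of the Flat-mode arithmetic (how the counters are sampled is out of scope here; the inputs are deltas of the EDC RPQ/WPQ INSERTS events over the measurement interval, summed across all EDC units, and the example numbers in main are made up):

/* Flat-mode MCDRAM bandwidth from the formulas above: each RPQ/WPQ insert
 * moves one full 64-byte cache line, so BW = 64 * inserts / seconds.      */
#include <stdint.h>
#include <stdio.h>

static double mcdram_flat_read_bw(uint64_t rpq_inserts, double seconds)
{
    return rpq_inserts * 64.0 / seconds;        /* bytes per second */
}

static double mcdram_flat_write_bw(uint64_t wpq_inserts, double seconds)
{
    return wpq_inserts * 64.0 / seconds;
}

int main(void)
{
    /* Example only: 5e9 inserts in 1.0 s -> 320 GB/s of reads. */
    printf("read  BW: %.1f GB/s\n", mcdram_flat_read_bw(5000000000ull, 1.0) / 1e9);
    printf("write BW: %.1f GB/s\n", mcdram_flat_write_bw(2500000000ull, 1.0) / 1e9);
    return 0;
}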

CPati2
New Contributor III

Hi Jaeyoung,

How did you calculate bandwidth in Hybrid mode?

I agree with John that the document needs to be updated. It lists the following counters as required for the Cache-mode bandwidth calculation:

  1. ECLK_Events_RPQ_Inserts
  2. ECLK_Events_WPQ_Inserts
  3. UCLK_Events_EDC_Hit/Miss_HIT_CLEAN
  4. UCLK_Events_EDC_Hit/Miss_HIT_DIRTY
  5. UCLK_Events_EDC_Hit/Miss_MISS_CLEAN
  6. UCLK_Events_EDC_Hit/Miss_MISS_DIRTY
  7. DCLK_Events_CAS_Reads

However, the formula uses only (1), (2), (4), (5) and (6) for the total bandwidth calculation.

Thanks.
