Hello,
I saw the following method for computing "MCDRAM bandwidth in Cache Mode" in the Intel Xeon Phi Processor Performance Monitoring guide, Vol. 2 (Section 3, 34):
MCDRAM Cache read bandwidth
= (ECLK_Events_RPQ_Inserts - UCLK_Events_EDC_Hit/Miss_MISS_CLEAN - UCLK_Events_EDC_Hit/Miss_MISS_DIRTY) * 64 / Time
MCDRAM Cache write bandwidth
= (ECLK_Events_WPQ_Inserts - DCLK_Events_CAS_Reads) * 64 / Time
In Flat mode, is it right to use just RPQ_Inserts or WPQ_Inserts to calculate the read or write bandwidth, i.e., to drop the eviction terms (MISS_DIRTY, MISS_CLEAN) and CAS_Reads?
(e.g. Flat read BW = ECLK_Events_RPQ_Inserts * 64 / Time)
Thanks,
Jaey
When the MCDRAM is configured as non-cache memory, it appears as RAM in 1/2/4 NUMA node(s) above the nodes used by the logical processors for computation. You can perform your test data array allocations from these nodes, or use the High Bandwidth Memory allocation routines. Properly allocated, you would then use the standard DRAM-related counters. Note that in this configuration, normal heap allocations come from the external DRAM.
Note that there is some disagreement as to whether the normal heap resides in HBM or external DRAM; there may be a property of your heap manager that can be used to specify your preference.
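For illustration only, a minimal sketch of such an allocation using the memkind library's hbwmalloc routines (this assumes libmemkind is installed; build with -lmemkind):

/* Sketch: allocate the test array from MCDRAM (flat/hybrid mode) via the
 * hbwmalloc interface of the memkind library.
 * Falls back to the normal heap (external DRAM) if no HBW nodes exist. */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = (size_t)1 << 26;              /* 64 Mi doubles = 512 MiB */
    int have_hbw = (hbw_check_available() == 0);
    double *a = have_hbw ? hbw_malloc(n * sizeof(double))
                         : malloc(n * sizeof(double));
    if (!a) { perror("allocation failed"); return 1; }

    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;                          /* touch pages so they are committed */

    if (have_hbw) hbw_free(a); else free(a);
    return 0;
}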
Jim Dempsey
Thanks.
For the NUMA configuration, I was assuming the all-to-all cluster mode,
and using the numactl command in Linux to bind heap allocations to MCDRAM.
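(For a per-array binding instead of binding the whole process heap with numactl, a minimal sketch using libnuma might look like the following; the MCDRAM node number is an assumption and should be checked with numactl -H first. Build with -lnuma.)

/* Sketch: place one buffer on a specific NUMA node, assumed here to be the
 * MCDRAM node (e.g. node 1 in flat mode). Verify the node number on your system. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    int mcdram_node = 1;                    /* assumption: check numactl -H */
    size_t bytes = (size_t)1 << 30;         /* 1 GiB test buffer */

    double *buf = numa_alloc_onnode(bytes, mcdram_node);
    if (!buf) { perror("numa_alloc_onnode"); return 1; }

    for (size_t i = 0; i < bytes / sizeof(double); i++)
        buf[i] = 1.0;                       /* first touch commits the pages on that node */

    numa_free(buf, bytes);
    return 0;
}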
On my system using SNC-4+Cache I see 4 NUMA nodes, and infer LLC == 16 GiB / 4 (4 GiB) per node.
With SNC-4 and MCDRAM configured as addressable, I see 8 NUMA nodes and infer LLC == 0.
Actually, the determination of the LLC size is a little more complicated. Currently I know of no way to determine the on-board MCDRAM size other than by indirect means. Somewhat rough pseudo code:
nNUMAnodes = someFunctionToGetNumaNodes()
bool NUMAnodesComputing[nNUMAnodes]                 // array of flags
NUMAnodesComputing = false                          // set all to false
StartParallelRegionAllThreads
    NUMAnodesComputing[myPreferredNUMAnode] = true  // overstrike with true
EndParallelRegion
nComputingNUMANodes = CountTrue(NUMAnodesComputing)
// above assumes KNL packs computing nodes in low range, fastmem in high range
LLCsize = apiToGetCacheSize(3)
if(LLCsize == 0)                                    // may be KNL or CPU without L3 cache (KNL currently has no L3)
    if(nNUMAnodes > nComputingNUMANodes)            // assume KNL with 16GiB MCDRAM
        NUMAallocatableRAM = SUM(NUMAnodeAvailableRAM where not(NUMAnodesComputing))
        LLCsize = max(16*1024*1024*1024 - NUMAallocatableRAM, 0)   // guess
    else
        if(CoreCount >= KNLminimumCoreCount)        // assume KNL with 16GiB MCDRAM
            LLCsize = 16*1024*1024*1024             // or you can use the hbm_... library to obtain total size
        endif
    endif
endif
The actual code I use relies on CPUID to make these determinations; that is left as an exercise for you.
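For illustration, a rough runnable variant of the node-counting step above using libnuma (build with -lnuma); whether the CPU-less nodes really are MCDRAM is still an inference:

/* Sketch: count NUMA nodes with and without logical processors.
 * CPU-less nodes on KNL are assumed to be flat/hybrid-mode MCDRAM. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    int max_node = numa_max_node();
    int computing = 0, memory_only = 0;
    struct bitmask *cpus = numa_allocate_cpumask();

    for (int node = 0; node <= max_node; node++) {
        numa_node_to_cpus(node, cpus);
        if (numa_bitmask_weight(cpus) > 0)
            computing++;          /* node has logical processors attached */
        else
            memory_only++;        /* CPU-less node: presumed MCDRAM */
    }
    numa_free_cpumask(cpus);

    printf("computing NUMA nodes: %d, memory-only NUMA nodes: %d\n",
           computing, memory_only);
    /* memory_only > 0 suggests some MCDRAM is directly addressable;
       memory_only == 0 with no reported L3 suggests MCDRAM is all cache. */
    return 0;
}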
Jim Dempsey
To answer the original question:
- Your interpretation is correct...
- In Flat Mode, MCDRAM read bandwidth is simply 64*RPQ.INSERTS and MCDRAM write bandwidth is simply 64*WPQ.INSERTS.
- The performance monitoring guide should be updated to make this clearer...
- (While they are at it, it would be really nice if they rewrote the descriptions of what all the MCDRAM events count in cache and hybrid mode -- the existing descriptions have dangling clauses that make multiple interpretations possible.)
It is likely that streaming/non-temporal stores will sometimes be split and generate extra (sub-cache-line) transactions. There are performance events to count full and partial streaming stores, but I have not checked to see if they work -- I have never seen more than about 1% over-counting due to partial streaming stores, and that is well within my margin of "don't care". The 64-byte AVX-512 stores on KNL should make split streaming stores even rarer than they have been in the past.
Read counts are typically higher than manual bulk data transfer estimates by 3%-5% (sometimes up to 7%-8%). When I have looked at this in the past I have found most of the difference to be explainable by TLB refills, but some is likely due to the OS scheduler interrupt handler.
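As a minimal sketch of that flat-mode arithmetic (the counter deltas below are made-up example values; read the real RPQ/WPQ Inserts deltas with whatever uncore tool you use):

/* Sketch: convert EDC RPQ/WPQ Inserts deltas into MCDRAM bandwidth.
 * The delta values here are illustrative only. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t rpq_inserts = 1562500000ULL;   /* example read-queue inserts over the interval */
    uint64_t wpq_inserts =  781250000ULL;   /* example write-queue inserts over the interval */
    double   seconds     = 1.0;             /* elapsed wall-clock time for the interval */

    double read_bw  = (double)rpq_inserts * 64.0 / seconds;   /* bytes per second */
    double write_bw = (double)wpq_inserts * 64.0 / seconds;   /* bytes per second */

    printf("MCDRAM read  bandwidth: %.2f GB/s\n", read_bw  / 1e9);
    printf("MCDRAM write bandwidth: %.2f GB/s\n", write_bw / 1e9);
    return 0;
}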
Hi Jaeyoung,
How did you calculate bandwidth in Hybrid mode?
I agree with John that the document needs to be updated, as it lists the following counters as required for the Cache mode bandwidth calculation:
1. ECLK_Events_RPQ_Inserts
2. ECLK_Events_WPQ_Inserts
3. UCLK_Events_EDC_Hit/Miss_HIT_CLEAN
4. UCLK_Events_EDC_Hit/Miss_HIT_DIRTY
5. UCLK_Events_EDC_Hit/Miss_MISS_CLEAN
6. UCLK_Events_EDC_Hit/Miss_MISS_DIRTY
7. DCLK_Events_CAS_Reads
However, the formulas use only (1), (2), (5), (6) and (7) for the total bandwidth calculation, leaving the HIT_CLEAN and HIT_DIRTY counts unused.
Thanks.
