Hi All,
I am using a Xeon Phi system running CentOS. To profile performance counters for the workloads I run, I use perf. I am trying to find counters that give a clear picture of the requests going to the L2, then to MCDRAM, and then to DDR4 for a given workload.
As of now, I rely on the following counters:
- L2_requests.miss
- u,offcore_response.any_request.ddr
- u,offcore_response.any_request.mcdram
However, the MCDRAM and DDR data I am getting from the counters above do not look right in my analysis: I see more DDR requests even though the application uses only MCDRAM in flat mode.
Are there any specific perf/PMU counters I should look at to get a clear understanding of the L2, MCDRAM, and DDR requests? perf lists more than 100 counters when I list them all.
Thanks.
All the information you need is in the two-volume set of documents on Xeon Phi performance monitoring:
- "Intel Xeon Phi Processor Performance Monitoring Reference Manual - Volume 1: Registers", document 332972, revision 002, March 2017
- "Intel Xeon Phi Processor Performance Monitoring Reference Manual - Volume 2: Events", document 334480, revision 002, March 2017.
The second document lists the performance counter events for the MCDRAM (in the EDC unit) and for the DDR4 (in the MC unit). There are actually very few events described. If you are operating in "Cache" or "Hybrid" mode, the discussion in Section 3.1 explains how to compute the effective MCDRAM and DDR4 bandwidth from the available events.
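In flat mode the arithmetic is simple. As a rough sketch (each read request at an EDC and each read CAS at a DDR4 controller moves one 64-byte cache line):
MCDRAM read bandwidth ~= 64 bytes * (sum of UNC_E_RPQ_INSERTS over the 8 EDCs) / elapsed time
DDR4 read bandwidth ~= 64 bytes * (sum of UNC_M_CAS_COUNT.RD over the 6 DDR4 channels) / elapsed time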
Hi John,
Thank you.
I am trying to understand how I can use these events with perf. Is it possible to pass these events to perf so that it can log them for me? Can you please give an example of how to use the events in Volume 2, such as the EDC hit/miss events (Section 3)?
Is it possible to count requests sent to specific EDCs (there are 8 of them)? For example, for core 0, can I know how many requests were sent to each of EDC 0/1/2/3/4/5/6/7?
Thanks.
I would suggest using VTune for this purpose. It is probably the easiest option, since it lets you collect events by their names. A simple command line like this:
amplxe-cl -collect-with runsa -knob event-config=UNC_E_EDC_ACCESS.HIT_CLEAN,UNC_E_EDC_ACCESS.MISS_CLEAN
will get you event counts for UNC_E_EDC_ACCESS.HIT_CLEAN and UNC_E_EDC_ACCESS.MISS_CLEAN (split per EDC 0/1/2/3/4/5/6/7).
You can even try VTune's predefined memory-access analysis, which provides MCDRAM bandwidth and hit/miss rates as metrics, so you may not need to deal with raw events at all.
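For example, a minimal command line for that predefined analysis (sketch; "./your_app" is just a placeholder for your application):
amplxe-cl -collect memory-access -- ./your_app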
But using perf directly is also possible, of course. You just need to specify the event encodings properly. For example, the following perf command will collect UNC_E_EDC_ACCESS.HIT_CLEAN and UNC_E_EDC_ACCESS.MISS_CLEAN event counts for EDC 0:
perf stat -a -e uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x2,umask=0x1,thresh=0x0/,uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x2,umask=0x4,thresh=0x0/
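If you are not sure which uncore PMU instances your kernel exposes (the names can differ between kernel versions), you can check sysfs first; the format/ directory also shows how the event and umask fields map onto the config bits:
ls /sys/bus/event_source/devices/ | grep -E 'uncore_(edc|imc)'
cat /sys/bus/event_source/devices/uncore_edc_uclk_0/format/event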
Hi Dmitry,
Thank you for the detailed response.
Dmitry Ryabtsev (Intel) wrote:
But using perf directly is also possible, of course. You just need to specify the event encodings properly. For example, the following perf command will collect UNC_E_EDC_ACCESS.HIT_CLEAN and UNC_E_EDC_ACCESS.MISS_CLEAN event counts for EDC 0:
perf stat -a -e uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x2,umask=0x1,thresh=0x0/,uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x2,umask=0x4,thresh=0x0/
Could you help me understand how you arrived at this encoding? I am fairly new to encoding raw events, and ideally I would like to use perf.
I ran your suggested perf command, and I get values of "0". I read in another thread on this forum that the EDC hit/miss events are not correctly tuned or do not work for Xeon Phi.
Thanks.
Hi Dmitry,
I changed your perf command a bit and it gives me the counters below. I changed event=0x2 to event=0x1, which I believe selects the RPQ (read pending queue) inserts event; that is what matters to me, since it "Counts the number of read requests received by the MCDRAM controller."
Do you think the raw event passed to perf below is correct, or am I doing something wrong?
perf stat -a -I 1000 -e uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/ numactl -m 0 ./mmatest1.out
Matrix A[ 16384 x 16384 ]
Matrix B[ 16384 x 16384 ]
Matrix C[ 16384 x 16384 ]
Number of OpenMP threads: 1
# time counts unit events
1.000411613 718 uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
2.001636947 153,036 uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
3.002591611 32,780 uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
4.003632304 79,190 uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
5.004665028 35,246 uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
5.125346273 46,958 uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
Thanks.
Hi All,
I am able to get more details from the system using raw events. Ideally, L2 misses = MCDRAM reads + DDR4 reads. That roughly seems to be the case in the perf log below, but with ~13% more read requests counted at the EDCs and MCs than L2 misses.
Can anyone please verify the "uncore" EDC and MC events I use with perf below and tell me whether they are correct? I believe the total EDC requests are being shared roughly equally among the EDCs.
Profiling
perf stat -a -I 500 -e cpu-cycles,instructions,l2_requests.miss,uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_edc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_edc_uclk_2/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_edc_uclk_3/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_edc_uclk_4/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_edc_uclk_5/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_edc_uclk_6/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_edc_uclk_7/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_imc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/,uncore_imc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/ numactl -m 0 ./matrixmultiplication
Output
Matrix A[ 1024 x 1024 ]
Matrix B[ 1024 x 1024 ]
Matrix C[ 1024 x 1024 ]
OMP: Warning #63: KMP_AFFINITY: proclist specified, setting affinity type to "explicit".
Number of OpenMP threads: 1
# time counts unit events
0.558262823 1,062,234,265 cpu-cycles
0.558262823 212,545,396 instructions # 0.20 insn per cycle
0.558262823 283,494 l2_requests.miss
0.558262823 152 uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
0.558262823 144 uncore_edc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
0.558262823 138 uncore_edc_uclk_2/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
0.558262823 104 uncore_edc_uclk_3/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
0.558262823 126 uncore_edc_uclk_4/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
0.558262823 126 uncore_edc_uclk_5/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
0.558262823 120 uncore_edc_uclk_6/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
0.558262823 130 uncore_edc_uclk_7/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
0.558262823 78,468 uncore_imc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/
0.558262823 57,968 uncore_imc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/
1.086696051 1,082,936,073 cpu-cycles
1.086696051 320,935,651 instructions # 0.30 insn per cycle
1.086696051 1,286,804 l2_requests.miss
1.086696051 13,338 uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.086696051 12,690 uncore_edc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.086696051 13,252 uncore_edc_uclk_2/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.086696051 12,550 uncore_edc_uclk_3/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.086696051 12,696 uncore_edc_uclk_4/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.086696051 12,532 uncore_edc_uclk_5/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.086696051 12,746 uncore_edc_uclk_6/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.086696051 12,798 uncore_edc_uclk_7/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.086696051 583,952 uncore_imc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/
1.086696051 561,752 uncore_imc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/
MKL - Completed in: 0.1964464 seconds
1.205390526 433,643,461 cpu-cycles
1.205390526 239,348,009 instructions # 0.28 insn per cycle
1.205390526 1,523,587 l2_requests.miss
1.205390526 50,578 uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.205390526 50,644 uncore_edc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.205390526 50,724 uncore_edc_uclk_2/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.205390526 50,224 uncore_edc_uclk_3/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.205390526 50,394 uncore_edc_uclk_4/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.205390526 50,254 uncore_edc_uclk_5/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.205390526 50,384 uncore_edc_uclk_6/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.205390526 50,496 uncore_edc_uclk_7/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
1.205390526 462,624 uncore_imc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/
1.205390526 446,272 uncore_imc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/
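(Side note: a quick way to sum the per-EDC counts for each interval from a log like this is a small awk one-liner; this is only a sketch and assumes the default perf stat -I column layout with the perf output captured via 2> perf.log:)
awk '/uncore_edc/ { gsub(",", "", $2); edc[$1] += $2 } END { for (t in edc) print t, edc[t] }' perf.log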
Thanks.
Here are the encodings VTune uses internally when it works over perf:
uncore_edc_eclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/ - UNC_E_RPQ_INSERTS
uncore_imc_0/period=0x0,inv=0x0,edge=0x0,event=0x3,umask=0x1,thresh=0x0/ - UNC_M_CAS_COUNT.RD
(Of course, you will need to add the same for edc_1, edc_2, ... and imc_1, imc_2, ...)
These events can be used to get the total read traffic that passes through the memory controllers. Their documentation reads as follows:
UNC_E_RPQ_INSERTS: Counts the number of read requests received by the MCDRAM controller. This event is valid in all three memory modes: flat, cache and hybrid. In cache and hybrid memory mode, this event counts all read requests as well as streaming stores that hit or miss in the MCDRAM cache.
UNC_M_CAS_COUNT.RD: All DRAM Read CAS Commands issued
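If you do not want to type all the instances by hand, a small shell sketch like the following builds the full event list (this assumes the kernel exposes uncore_edc_eclk_0..7 and uncore_imc_0..5, matching KNL's 8 EDCs and 6 DDR4 channels; "./your_app" is a placeholder):
EDC=$(for i in $(seq 0 7); do printf 'uncore_edc_eclk_%d/event=0x1,umask=0x1/,' "$i"; done)   # UNC_E_RPQ_INSERTS per EDC
IMC=$(for i in $(seq 0 5); do printf 'uncore_imc_%d/event=0x3,umask=0x1/,' "$i"; done)        # UNC_M_CAS_COUNT.RD per channel
perf stat -a -I 1000 -e "${EDC}${IMC%,}" -- numactl -m 0 ./your_app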
Hi Dmitry,
I was able to get these counters; however, I am not sure I can trust the values because the read counts are almost uniform across all the EDCs and MCs. Also, the total requests received by MCDRAM and DDR4 should eventually match the L2 misses, which is not the case for me.
Any suggestions please?
Thanks,
Chetan Arvind Patil
The individual EDC and IMC units here correspond to memory channels, and the load is usually uniform across channels, so that could be fine.
You can verify the correctness by running a memory bandwidth benchmark such as STREAM or MLC and checking whether the measured traffic rate is what you expect.
It is also important to know in what mode the MCDRAM is configured on your system. Is it flat, cache or hybrid?
You also can't expect the L2 miss core event counts to match the memory controller uncore event counts. For example, I'm not sure the L2 miss event takes hardware prefetcher requests into account. Also, a miss in a particular core's L2 can be satisfied from another core's L2, in which case no memory access happens at all.
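To make the verification suggested above concrete, a sketch with STREAM ("./stream" stands for whatever STREAM binary you have built; the events are the same UNC_E_RPQ_INSERTS encoding as before):
EDC=$(for i in $(seq 0 7); do printf 'uncore_edc_eclk_%d/event=0x1,umask=0x1/,' "$i"; done)
perf stat -a -e "${EDC%,}" -- ./stream
64 bytes * (sum of the eight counts) / elapsed seconds gives the MCDRAM read traffic, which you can compare against what the benchmark is expected to generate.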
Hi Dmitry,
I am running the system in flat mode. I run two benchmarks: Intel MKL matrix multiplication (which takes ~2 GB of MCDRAM) and Intel Caffe (which takes 16+ GB, consuming all of MCDRAM and part of DDR4).
I am not sure why an L2 miss in one core would fetch data from another core's L2: on Xeon Phi, all requests go to the tag directory, which eventually points to where the data is in memory. Shouldn't the same hold for the hardware/software prefetchers, since a miss should not occur if the data is already available to the core?
A few more questions and observations:
1) If I map threads to different cores of the Xeon Phi and collect data, I observe the total EDC and MC requests to differ by 2-3%. Shouldn't mapping the threads to different cores still lead to uniform request counts?
2) Is it possible that the EDC and MC counters also count I/O requests?
Please correct me if I am wrong. Thank you again for your detailed responses.
Thanks.
Some systems perform "clean" cache-to-cache interventions to reduce memory traffic. There are many possible variations on how this can behave.
Contiguous accesses should generate almost perfectly uniform MCDRAM and DRAM references in Flat-All2All mode, but will show small variations in Flat-Quadrant mode. Every run of the executable will get a different set of physical addresses, which will interact with the address hash in different ways, so the variation you are seeing across cores should be run-to-run variability rather than core-to-core variability. (All cores see the same address hash, so accessing the same physical addresses from different cores should generate the same access patterns. This can be tested from a single-threaded program that allocates the working arrays once and then repeats the workload multiple times, using the Linux sched_setaffinity() call to bind the process to different cores.)
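A shell-level approximation of that experiment, using taskset to rebind the running process instead of calling sched_setaffinity() from inside the program (sketch; "./your_app" is a placeholder that should loop over the kernel several times, and the core numbers are arbitrary):
taskset -c 2 ./your_app &
PID=$!
sleep 10
taskset -p -c 30 "$PID"    # same process, same physical pages, different core
wait "$PID"
Run the perf stat -a -I ... collection in another shell while this executes and compare the per-interval EDC/MC counts before and after the rebind.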
The EDC and MC counters will definitely count any IO requests that access memory.
Hi John,
Thank you for the detailed answer.
Is it possible to log I/O requests separately using counters?
Thanks,
Chetan Arvind Patil
There are not as many IO counters in KNL as in the mainstream Xeon processors. Some information is available from the M2PCIe counters and some from the IRP counters, but not many events are available, and it does not look like they can be used to compute IO-based memory accesses directly.
In these sorts of cases, the best you can do is perform multiple runs on a quiet system and hope that not much IO happens.
Hi John,
Thank you.
Is it possible to log L1 cache misses? I can get L2 misses, but I want to understand more about the private L1 caches.
Thanks.
You can use the event MEM_UOPS_RETIRED.L1_MISS_LOADS for load misses (event=0x04, umask=0x01). Description: this event counts the number of load micro-ops retired that miss in the L1 data cache. Note that prefetch misses will not be counted.
I haven't found a way to count store misses on KNL.
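With perf, that encoding can be used directly on the core PMU (sketch; "./your_app" is a placeholder, and depending on your perf version the symbolic event name may also be available):
perf stat -e cpu/event=0x04,umask=0x01,name=mem_uops_retired_l1_miss_loads/ ./your_app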
