Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

MCDRAM and DDR4 Performance Counters on Xeon Phi

CPati2
New Contributor III

Hi All,

I am using a Xeon Phi system running CentOS. To profile performance counters for the workloads I run, I use perf. I am trying to find counters that give a clear picture of requests going to L2, then to MCDRAM, and then to DDR4 for a given workload.

As of now, I rely on the following counters:

  • L2_requests.miss
  • u,offcore_response.any_request.ddr
  • u,offcore_response.any_request.mcdram

However, the MCDRAM and DDR data I am getting from the above do not look right based on my analysis. I see more DDR requests even though the application uses only MCDRAM in flat mode.

Are there any specific perf/PMU counters I should look at to get a clear understanding of L2, MCDRAM, and DDR requests? perf lists more than 100 counters when I list them all.

Thanks.

McCalpinJohn
Honored Contributor III

All the information you need is in the two-volume set of documents on Xeon Phi performance monitoring:

  1. "Intel Xeon Phi Processor Performance Monitoring Reference Manual - Volume 1: Registers", document 332972, revision 002, March 2017
  2. "Intel Xeon Phi Processor Performance Monitoring Reference Manual - Volume 2: Events", document 334480, revision 002, March 2017.

The second document lists the performance counter events for the MCDRAM (in the EDC unit) and for the DDR4 (in the MC unit).   There are actually very few events described.  If you are operating in "Cache" or "Hybrid" mode, the discussion in Section 3.1 explains how to compute the effective MCDRAM and DDR4 bandwidth from the available events.

CPati2
New Contributor III

Hi John,

Thank you.

I am trying to understand how I can use these events with perf. Is it possible to pass these events to perf so that it logs them for me? Could you please give an example of how to use the events in Volume 2, such as EDC Hit/Miss (Section 3)?

Is it possible to count requests sent to specific EDCs (there are 8 of them)? For example, for core 0, can I know how many requests were sent to each of EDC 0/1/2/3/4/5/6/7?

Thanks.

Dmitry_R_Intel1
Employee

I would suggest using VTune for this purpose. It will probably be the easiest option, since it lets you collect events by their names. A simple command line like this:

amplxe-cl -collect-with runsa -knob event-config=UNC_E_EDC_ACCESS.HIT_CLEAN,UNC_E_EDC_ACCESS.MISS_CLEAN

will get you event counts for UNC_E_EDC_ACCESS.HIT_CLEAN and UNC_E_EDC_ACCESS.MISS_CLEAN (broken out per EDC 0/1/2/3/4/5/6/7).

You can also try VTune's pre-defined memory-access analysis, which provides MCDRAM bandwidth and hit/miss rates as metrics, so you may not need to deal with raw events at all.
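For example, something along these lines (a sketch; the exact analysis type name and the application path are assumptions here):

amplxe-cl -collect memory-access -- ./your_app    # ./your_app is a placeholder for your workload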

Using perf directly is of course also possible; you just need to specify the event encodings properly. For example, the following perf command will collect UNC_E_EDC_ACCESS.HIT_CLEAN and UNC_E_EDC_ACCESS.MISS_CLEAN event counts for EDC 0:

perf stat -a -e uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x2,umask=0x1,thresh=0x0/,uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x2,umask=0x4,thresh=0x0/
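In case it helps to see where this comes from: the field names (event, umask, edge, ...) are defined by each uncore PMU's sysfs format directory, and the numeric values (event=0x2 with umask=0x1 for HIT_CLEAN and umask=0x4 for MISS_CLEAN) are the UNC_E_EDC_ACCESS encoding documented in Volume 2 of the manual referenced above. A quick way to inspect this (a sketch, assuming your kernel exposes the KNL uncore PMUs):

# list the uncore PMU devices perf knows about
ls /sys/bus/event_source/devices/ | grep uncore

# show which config bits each field of the EDC uclk PMU occupies
# (typically something like "event -> config:0-7", "umask -> config:8-15")
for f in /sys/bus/event_source/devices/uncore_edc_uclk_0/format/*; do
    echo "$(basename $f) -> $(cat $f)"
done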

CPati2
New Contributor III

Hi Dmitry,

Thank you for detailed response.

Dmitry Ryabtsev (Intel) wrote:

But using perf directly is also possible of course. You just need to properly specify the event encodings. For example the following perf command will collect UNC_E_EDC_ACCESS.HIT_CLEAN and UNC_E_EDC_ACCESS.MISS_CLEAN event counts for edc 0:

perf stat -a -e uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x2,umask=0x1,thresh=0x0/,uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x2,umask=0x4,thresh=0x0/

Could you help me understand how you arrived at this encoding? I am fairly new to encoding raw events, and I would ideally like to use perf.

I ran your suggested perf command, and I get values of "0". I read in another thread on this forum that the EDC hit/miss events are not correctly tuned or do not work on Xeon Phi.

Thanks.

CPati2
New Contributor III

Hi Dmitry,

I changed your perf command a bit and it gives me the counts below. I changed event=0x2 to event=0x1, which I believe selects the "Unit Masks for RPQ" event; that is the one that matters to me, since it "Counts the number of read requests received by the MCDRAM controller."

Do you think the raw event I am passing to perf below is correct, or am I doing something wrong?

perf stat -a -I 1000 -e uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/ numactl -m 0 ./mmatest1.out

Matrix A[ 16384 x 16384 ]
Matrix B[ 16384 x 16384 ]
Matrix C[ 16384 x 16384 ]

Number of OpenMP threads:   1

#           time             counts unit events
     1.000411613                718      uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     2.001636947            153,036      uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     3.002591611             32,780      uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     4.003632304             79,190      uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     5.004665028             35,246      uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     5.125346273             46,958      uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/

Thanks.

CPati2
New Contributor III

Hi All,

I am able to get more details from the system using raw events. Ideally, L2 misses = MCDRAM reads + DDR4 reads. That seems to roughly be the case in the perf log below, but with ~13% more read requests on the EDC and MC side.

Can anyone please verify the "uncore" EDC and MC events I used with perf below and tell me whether they are correct? I believe the total EDC requests are being shared equally among the EDCs.

Profiling

perf stat -a -I 500 -e cpu-cycles,instructions,l2_requests.miss,uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_edc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_edc_uclk_2/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_edc_uclk_3/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_edc_uclk_4/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_edc_uclk_5/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_edc_uclk_6/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_edc_uclk_7/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/,uncore_imc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/,uncore_imc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/ numactl -m 0 ./matrixmultiplication
Output

Matrix A[ 1024 x 1024 ]
Matrix B[ 1024 x 1024 ]
Matrix C[ 1024 x 1024 ]
OMP: Warning #63: KMP_AFFINITY: proclist specified, setting affinity type to "explicit".
Number of OpenMP threads:   1
#           time             counts unit events
     0.558262823      1,062,234,265      cpu-cycles
     0.558262823        212,545,396      instructions              #    0.20  insn per cycle
     0.558262823            283,494      l2_requests.miss
     0.558262823                152      uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     0.558262823                144      uncore_edc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     0.558262823                138      uncore_edc_uclk_2/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     0.558262823                104      uncore_edc_uclk_3/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     0.558262823                126      uncore_edc_uclk_4/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     0.558262823                126      uncore_edc_uclk_5/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     0.558262823                120      uncore_edc_uclk_6/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     0.558262823                130      uncore_edc_uclk_7/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     0.558262823             78,468      uncore_imc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/
     0.558262823             57,968      uncore_imc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/
     1.086696051      1,082,936,073      cpu-cycles
     1.086696051        320,935,651      instructions              #    0.30  insn per cycle
     1.086696051          1,286,804      l2_requests.miss
     1.086696051             13,338      uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.086696051             12,690      uncore_edc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.086696051             13,252      uncore_edc_uclk_2/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.086696051             12,550      uncore_edc_uclk_3/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.086696051             12,696      uncore_edc_uclk_4/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.086696051             12,532      uncore_edc_uclk_5/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.086696051             12,746      uncore_edc_uclk_6/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.086696051             12,798      uncore_edc_uclk_7/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.086696051            583,952      uncore_imc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/
     1.086696051            561,752      uncore_imc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/
        MKL  - Completed in: 0.1964464 seconds
     1.205390526        433,643,461      cpu-cycles
     1.205390526        239,348,009      instructions              #    0.28  insn per cycle
     1.205390526          1,523,587      l2_requests.miss
     1.205390526             50,578      uncore_edc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.205390526             50,644      uncore_edc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.205390526             50,724      uncore_edc_uclk_2/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.205390526             50,224      uncore_edc_uclk_3/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.205390526             50,394      uncore_edc_uclk_4/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.205390526             50,254      uncore_edc_uclk_5/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.205390526             50,384      uncore_edc_uclk_6/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.205390526             50,496      uncore_edc_uclk_7/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/
     1.205390526            462,624      uncore_imc_uclk_0/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/
     1.205390526            446,272      uncore_imc_uclk_1/period=0x0,inv=0x0,edge=0x0,event=0x01,umask=0x01,thresh=0x0/

Thanks.

Dmitry_R_Intel1
Employee

Here are the encodings VTune uses internally when it works over perf:

uncore_edc_eclk_0/period=0x0,inv=0x0,edge=0x0,event=0x1,umask=0x1,thresh=0x0/  - UNC_E_RPQ_INSERTS

uncore_imc_0/period=0x0,inv=0x0,edge=0x0,event=0x3,umask=0x1,thresh=0x0/  - UNC_M_CAS_COUNT.RD

(of course you will need to add the same for edc_1, edc_2, ... and imc_1, imc_2, ...; a sketch for generating the full list follows after the event descriptions below)

These events can be used to get the total read traffic that passes through the memory controllers. Their documentation is as follows:

UNC_E_RPQ_INSERTS: Counts the number of read requests received by the MCDRAM controller. This event is valid in all three memory modes: flat, cache and hybrid. In cache and hybrid memory mode, this event counts all read requests as well as streaming stores that hit or miss in the MCDRAM cache.

UNC_M_CAS_COUNT.RD: All DRAM Read CAS Commands issued
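Here is a small sketch for building the full -e list over every EDC eclk and IMC PMU the kernel exposes (the sysfs glob patterns and ./your_app are assumptions; adjust them to match the device names on your system):

# collect UNC_E_RPQ_INSERTS on every EDC and UNC_M_CAS_COUNT.RD on every IMC channel
EVENTS=""
for pmu in /sys/bus/event_source/devices/uncore_edc_eclk_*; do
    EVENTS+="$(basename $pmu)/event=0x1,umask=0x1/,"      # UNC_E_RPQ_INSERTS
done
for pmu in /sys/bus/event_source/devices/uncore_imc_[0-9]*; do
    EVENTS+="$(basename $pmu)/event=0x3,umask=0x1/,"      # UNC_M_CAS_COUNT.RD
done
perf stat -a -I 1000 -e "${EVENTS%,}" -- ./your_app       # ./your_app is a placeholder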

CPati2
New Contributor III

Hi Dmitry,

I was able to get these counters; however, I am not able to trust the values because the read counts are uniform across all the EDCs and MCs. Also, the total requests received by MCDRAM and DDR4 should eventually match the L2 misses, which is not the case for me.

Any suggestions please?

Thanks,
Chetan Arvind Patil

Dmitry_R_Intel1
Employee

The individual EDC and IMC units here correspond to memory channels, and you usually do have a uniform load across channels. So it could be fine.

You can verify correctness by running a memory bandwidth benchmark such as STREAM or MLC and checking whether the traffic rate matches what you expect.
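For that comparison, the counts can be converted to bandwidth on the assumption that each RPQ insert or DRAM read CAS moves one 64-byte cache line (a sketch; the counts and interval below are placeholders):

EDC_READS=40000000   # sum of UNC_E_RPQ_INSERTS over all EDCs in one interval (placeholder)
IMC_READS=2000000    # sum of UNC_M_CAS_COUNT.RD over all IMC channels in one interval (placeholder)
INTERVAL=1.0         # length of the sampling interval in seconds
echo "MCDRAM read GB/s: $(echo "$EDC_READS * 64 / $INTERVAL / 10^9" | bc -l)"
echo "DDR4 read GB/s:   $(echo "$IMC_READS * 64 / $INTERVAL / 10^9" | bc -l)"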

It is also important to know in what mode the MCDRAM is configured on your system. Is it flat, cache or hybrid?

You also can't expect the L2 miss core event counts to match the memory controller uncore event counts. For example, I'm not sure the L2 miss event takes hardware prefetcher requests into account. Also, a miss in one core's L2 can be satisfied from another core's L2, so no memory access happens at all.

CPati2
New Contributor III

Hi Dmitry,

I am running the system in flat mode. I run two benchmarks: Intel MKL matrix multiplication (uses ~2 GB of MCDRAM) and Intel Caffe (uses ~16 GB+, consuming all of MCDRAM and part of DDR4).

I am not sure why an L2 miss in one core would fetch data from another core's L2: in Xeon Phi, all requests go to the tag directory, which eventually points to where the data is in memory. Shouldn't the same apply to the hardware/software prefetchers, in that a miss should not occur if the data is already available to the cores?

A few more questions and observations:

1) If I map threads to different cores of the Xeon Phi and collect data, I observe that the total EDC and MC requests differ (by 2-3%). Shouldn't mapping a thread to different cores lead to uniform requests?
2) Is it possible that the EDC and MC requests also count I/O requests?


Please correct me if I am wrong. Thank you again for your detailed response.

Thanks.

McCalpinJohn
Honored Contributor III

Some systems perform "clean" cache-to-cache interventions to reduce memory traffic.  There are many possible variations on how this can behave.

Contiguous accesses should generate almost perfectly uniform MCDRAM and DRAM references in Flat-All2All mode, but will show small variations in Flat-Quadrant mode. Every time you run the executable you will get a different set of physical addresses, which will interact with the address hash in different ways, so the variation you are seeing across cores should be run-to-run variability rather than core-to-core variability. (All cores see the same address hash, so accessing the same physical addresses from different cores should generate the same access patterns. This can be tested with a single-threaded program that allocates the working arrays once and then repeats the workload multiple times, using the Linux sched_setaffinity() call to bind the process to different cores.)
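A rough shell-level approximation of that experiment, for those who would rather not modify the program, is to rebind an already-running, single-threaded workload with taskset while sampling the uncore read counters, so the same physical pages are accessed from different cores (a sketch; the PID and core IDs are placeholders):

APP_PID=12345                        # PID of the long-running workload (placeholder)
for core in 0 17 34 51; do           # core IDs to test (placeholders)
    taskset -p -c $core $APP_PID     # move the process to this core without restarting it
    # sample the MCDRAM and DDR4 read counters system-wide for 5 seconds on this binding
    perf stat -a -I 1000 -e uncore_edc_eclk_0/event=0x1,umask=0x1/,uncore_imc_0/event=0x3,umask=0x1/ sleep 5
done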

The EDC and MC counters will definitely count any IO requests that access memory.

CPati2
New Contributor III

Hi John,

Thank you for detailed answer.

Is it possible to log I/O requests separately using counters?

Thanks,
Chetan Arvind Patil

McCalpinJohn
Honored Contributor III

There are not as many IO counters in KNL as in the mainstream Xeon processors.   Some information is available from the M2PCIe counters and some from the IRP counters, but not many events are available, and it does not look like they can be used to compute IO-based memory accesses directly.

In these sorts of cases, the best you can do is perform multiple runs on a quiet system and hope that not much IO happens.

CPati2
New Contributor III

Hi John,

Thank you.

Is it possible to log L1 cache misses? I can get L2 misses, but I want to understand more about the private L1 cache.

Thanks.

Thomas_G_4
New Contributor II

You can use the event MEM_UOPS_RETIRED.L1_MISS_LOADS for load misses (event=0x04, umask=0x01). Description: This event counts the number of load micro-ops retired that miss in the L1 data cache. Note that prefetch misses will not be counted.

I haven't found a way to count store misses on KNL.
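A minimal sketch of collecting that event with perf using the raw encoding (the umask goes in config bits 8-15 and the event select in bits 0-7; ./your_app is a placeholder):

perf stat -e cpu/event=0x04,umask=0x01,name=mem_uops_retired.l1_miss_loads/ ./your_app
# equivalent raw form
perf stat -e r0104 ./your_app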
