Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Measuring data movement from DRAM to KNL memory

Md_Fazlay_R_
Beginner

Dear All,

I am implementing and testing the LOBPCG algorithm on a KNL machine with some large sparse matrices. For the performance report, I need to measure how much data is transferred from DDR4 DRAM into the KNL's on-package MCDRAM. I am wondering if there is a simple way to do this. Any help or ideas would be appreciated.

Regards,

Fazlay

McCalpinJohn
Honored Contributor III

Section 3.1 of the document "Intel Xeon Phi Processor Performance Monitoring Reference Manual -- Volume 2: Events" (document 334480-002, March 2017) discusses how to compute bandwidths for KNL running in "Cache" mode.

Under normal (processor-based) operation, all DDR4 DRAM reads send their data to the MCDRAM cache, and all MCDRAM cache dirty-victim writebacks send their data to DDR4 DRAM, so all you need are the raw DDR4 bandwidths. These are available from the CAS.Reads and CAS.Writes events of the six DDR4 channels (each increment of a CAS event corresponds to one 64-Byte cache line transfer). These devices are discussed in Chapter 3 of Volume 1 of the document "Intel Xeon Phi Processor Performance Monitoring Reference Manual -- Volume 1: Registers" (document 332972-002, March 2017).
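
As a concrete example of the arithmetic: if the six channels together report 1.0e9 read CAS events and 0.5e9 write CAS events over the measurement interval, that corresponds to 64 GB read from DDR4 and 32 GB written back to DDR4 (count × 64 Bytes).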

With a recent version of Linux (we are running CentOS 7.3), the OS understands enough about KNL to allow access to the memory controller performance counters.   A simple script to program all the events and run a job is:

#!/bin/bash

# Three events are defined for each DRAM channel:
#   CAS.READS  (Event 0x03, Umask 0x01)
#   CAS.WRITES (Event 0x03, Umask 0x02)
#   DCLKS      (Event 0x00, Umask 0x00)   <-- DDR4 major clock cycles (1.200 GHz on most models)

CHANNEL0="-e uncore_imc_0/event=0x03,umask=0x01/ -e uncore_imc_0/event=0x03,umask=0x02/ -e uncore_imc_0/event=0x00,umask=0x00/"
CHANNEL1="-e uncore_imc_1/event=0x03,umask=0x01/ -e uncore_imc_1/event=0x03,umask=0x02/ -e uncore_imc_1/event=0x00,umask=0x00/"
CHANNEL2="-e uncore_imc_2/event=0x03,umask=0x01/ -e uncore_imc_2/event=0x03,umask=0x02/ -e uncore_imc_2/event=0x00,umask=0x00/"
CHANNEL3="-e uncore_imc_3/event=0x03,umask=0x01/ -e uncore_imc_3/event=0x03,umask=0x02/ -e uncore_imc_3/event=0x00,umask=0x00/"
CHANNEL4="-e uncore_imc_4/event=0x03,umask=0x01/ -e uncore_imc_4/event=0x03,umask=0x02/ -e uncore_imc_4/event=0x00,umask=0x00/"
CHANNEL5="-e uncore_imc_5/event=0x03,umask=0x01/ -e uncore_imc_5/event=0x03,umask=0x02/ -e uncore_imc_5/event=0x00,umask=0x00/"

# combine the CAS and DCLK events for all six DDR4 channels
IMC_ALL="$CHANNEL0 $CHANNEL1 $CHANNEL2 $CHANNEL3 $CHANNEL4 $CHANNEL5"

# run the command given on the script's command line under perf, with all IMC events enabled
perf stat -a $IMC_ALL "$@"
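
If you want the script to report data volumes directly, the final perf stat line can be extended with a little post-processing. This is only a sketch: it assumes perf's CSV output mode (-x,), assumes the count is the first comma-separated field of each output line, reuses the IMC_ALL variable defined above, and multiplies each CAS count by 64 Bytes per event:

perf stat -a -x, $IMC_ALL "$@" 2>&1 \
    | awk -F, '
        /umask=0x01/ { rd += $1 }    # sum the read CAS counts over all six channels
        /umask=0x02/ { wr += $1 }    # sum the write CAS counts over all six channels
        END { printf "DDR4 read:  %.3f GB\nDDR4 write: %.3f GB\n", rd*64/1e9, wr*64/1e9 }
    '

Because the event names themselves contain commas, only the first field (the count) is used as a number; the umask patterns are matched against the whole line, so the DCLK events (umask=0x00) are ignored.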

 

CPati2
New Contributor III

Hi John,

Thank you for sharing these details about the counters. I am trying to capture the total requests sent to MCDRAM (8 EDCs) and to DDR4 (2 MCs, i.e. the 6 channels you pointed out above). Do you think the following raw event encodings will give me the correct counts?

For the EDCs and MCs I am using the following:

EDC0: uncore_edc_eclk_0/event=0x1,umask=0x1/
EDC1: uncore_edc_eclk_1/event=0x1,umask=0x1/
EDC2: uncore_edc_eclk_2/event=0x1,umask=0x1/
EDC3: uncore_edc_eclk_3/event=0x1,umask=0x1/
EDC4: uncore_edc_eclk_4/event=0x1,umask=0x1/
EDC5: uncore_edc_eclk_5/event=0x1,umask=0x1/
EDC6: uncore_edc_eclk_6/event=0x1,umask=0x1/
EDC7: uncore_edc_eclk_7/event=0x1,umask=0x1/

MC0: uncore_imc_0/event=0x03,umask=0x01/
MC1: uncore_imc_1/event=0x03,umask=0x01/
MC2: uncore_imc_2/event=0x03,umask=0x01/
MC3: uncore_imc_3/event=0x03,umask=0x01/
MC4: uncore_imc_4/event=0x03,umask=0x01/
MC5: uncore_imc_5/event=0x03,umask=0x01/

Using perf, I run the following. I get values, but the MC + EDC totals do not add up to l2_requests.miss.

COUNTERS=cpu-cycles,instructions,l2_requests.miss,\
uncore_edc_eclk_0/event=0x1,umask=0x1/,\
uncore_edc_eclk_1/event=0x1,umask=0x1/,\
uncore_edc_eclk_2/event=0x1,umask=0x1/,\
uncore_edc_eclk_3/event=0x1,umask=0x1/,\
uncore_edc_eclk_4/event=0x1,umask=0x1/,\
uncore_edc_eclk_5/event=0x1,umask=0x1/,\
uncore_edc_eclk_6/event=0x1,umask=0x1/,\
uncore_edc_eclk_7/event=0x1,umask=0x1/,\
uncore_imc_0/event=0x03,umask=0x01/,\
uncore_imc_1/event=0x03,umask=0x01/,\
uncore_imc_2/event=0x03,umask=0x01/,\
uncore_imc_3/event=0x03,umask=0x01/,\
uncore_imc_4/event=0x03,umask=0x01/,\
uncore_imc_5/event=0x03,umask=0x01/

perf stat -a -I 500 -e $COUNTERS ./app.out

If you can share any feedback, it will be helpful.

Thanks.

McCalpinJohn
Honored Contributor III

Reconciling counts from different performance counters is not a particularly easy job....

First, it is critical to know whether the KNL is configured in "Flat" mode or "Cache" mode.   The behavior of L2 misses is entirely different in the two modes, and nothing will make sense until this is clarified.
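
A quick way to check the mode (assuming numactl is installed) is to look at the NUMA topology: in Flat mode the 16 GiB of MCDRAM typically shows up as one or more CPU-less NUMA nodes, while in Cache mode it is not visible as a NUMA node at all.

numactl --hardware
# Flat mode (Quadrant): node 0 holds all the cores plus the DDR4; node 1 has no CPUs and ~16 GB of MCDRAM.
# Cache mode:           only the CPU-bearing node(s) appear, and the MCDRAM capacity is not listed.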

Another issue that may be causing your mismatch is the L2_REQUESTS.MISS event. On KNL this is Event 0x2E, which is an "architectural" performance monitoring event. Section 18.2.1 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384) notes:

"Because cache hierarchy, cache sizes and other implementation-specific characteristics; value comparison to estimate performance differences is not recommended."

There are some serious grammatical issues with that sentence, but it should be taken as a warning that the counter may not be counting what you expect, and the specific set of events counted may be different on different platforms.  For example, on most Xeon processors that I have tested, this performance counter event does *not* count L2 HW prefetcher accesses that miss in the last-level cache.  These will, of course, generate memory accesses and therefore cause a mismatch in the counts.   Similarly, streaming stores will access memory, but will not be counted by this LLC cache miss event.

When trying to understand the hardware performance counters, it is essential to have a set of microbenchmarks with known characteristics to use for validation. For memory hierarchy studies, it is also a good idea to have a tool that disables the hardware prefetchers, so that you can compare results with and without them. (https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors)
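
For the prefetcher part, a minimal sketch using the msr-tools package (assuming root access, and assuming the MSR 0x1A4 bit assignments described in the linked article apply to your processor; the exact bits can differ between families, so treat this only as a sketch) looks like this:

modprobe msr                  # load the msr driver so /dev/cpu/*/msr exists, if it is not already loaded
rdmsr -p 0 0x1a4              # read the current prefetcher-control value on CPU 0
wrmsr -a 0x1a4 0xf            # set the low four bits on every CPU to disable the prefetchers
# ... run the measurement ...
wrmsr -a 0x1a4 0x0            # restore the default (all prefetchers enabled)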

jimdempseyatthecove
Honored Contributor III

>>For example, on most Xeon processors that I have tested, this performance counter event does *not* count L2 HW prefetcher accesses that miss in the last-level cache.  These will, of course, generate memory accesses and therefore cause a mismatch in the counts

This would also unnecessarily consume memory bandwidth while reading beyond your loop (beyond array data). Hence you may wish to take your measurement at least one HW prefetch distance prior to the end of the loop.

Jim Dempsey

McCalpinJohn
Honored Contributor III

Jim Dempsey wrote:

This would also unnecessarily consume memory bandwidth while reading beyond your loop (beyond array data). Hence you may wish to take your measurement at least one HW prefetch distance prior to the end of the loop.

The L2 HW prefetchers operate somewhat autonomously from the core, fetching lines that are anywhere between 1 line and 20 lines ahead of the most recent demand load (or software prefetch, or L1 HW prefetch) from the core.   (The limit of 20 lines ahead is documented for the Sandy Bridge processor, but since KNL HW prefetches are also limited to being within a single 4KiB page (64 lines), the limit is unlikely to be dramatically different.)  

Because of this asynchronous behavior, I don't see any practical way to know when to make the measurement to avoid "beyond the loop" accesses.  For something like STREAM, the array lengths are so long (millions of cache lines) that having the HW prefetcher load an extra 20 lines is effectively invisible.  (This overhead is certainly smaller than the typical overhead of occasionally having to go to memory to read page table entries.  It is also much smaller than the overhead caused by streaming stores that get split, forcing the target lines to be read before being overwritten.)

The bigger problem with aggressive L2 HW prefetches on KNL is that they use extra space in the shared L2 cache -- often displacing cache lines that the other core is trying to use.   I see this with multi-threaded DGEMM (using MKL) -- disabling the HW prefetchers improves performance by more than 3% in my tests.
