Solved: PAPI (TAU) and STREAM Benchmark

Sidharth_Kashyap · ‎01-03-2016

I am trying to understand the reliability of the PAPI counters and I thought that the STREAM benchmark was the best source due to its proven acceptance and accuracy.

Here is the method followed

1) Compiled the Stream benchmark (C Version) using the TAU wrappers (Compiler Instrumentation)

2) Set the environment variables for the PAPI hardware counters that I wanted to monitor PAPI_LD_INS (load instructions), PAPI_SR_INS (store instructions) and TIME respectively.

3) Ran the application

Upon completion, I set a derived metric, Memory Bandwidth=(PAPI_LD_INS+PAPI_SR_INS)/TIME

I see that the numbers obtained through this metric and the numbers reported by stream vary by a huge margin.

Is the above PAPI interpretation right?

Thank you for your help.

McCalpinJohn · ‎01-04-2016

Major issues:

You have to be sure that you account for the payload sizes of the loads and stores (in Bytes) to convert from "instructions" to "bandwidth".
STREAM reports an estimated bandwidth based on the assumption that the main arrays are being loaded from memory and written to memory on each iteration. It does not attempt to estimate instruction counts -- just main memory access counts.
STREAM reports estimated bandwidth for the kernels only -- not for the initialization code or the validation code.

More details:

1. Depending on the compiler used, the compilation options used, and the hardware platform used, each "load instruction" could be loading anywhere from 8 Bytes to 32 Bytes of data (or up to 64 Bytes on Xeon Phi). If you instrument at the loop level most of the loads and stores of the main array variables will be of the same size, but if you measure at the whole program level you will be seeing a mix of many different load and store sizes.

For Haswell systems and high levels of optimization, almost all accesses to the main arrays will be using 256-bit (32-Byte) AVX load and store instructions.
For Sandy Bridge/Ivy Bridge systems, the compiler might use 256-bit (32-Byte) load and store instructions or it might load the 256-bit registers using two 128-bit load instructions. Using 128-bit loads often provides slightly better (~5% to ~10%) performance on Sandy Bridge/Ivy Bridge processors.

2. In the particular case of counting loads, I have seen a surprising number of times that a compiler will re-load one or more pointers in each iteration of the loop -- even when the loop is clearly vectorized and unrolled. These pointers will almost always be found in the L1 Data Cache (so the loads will not impact the memory bandwidth or performance), but they will skew the load and store counts for the loops in very confusing ways. It is much less common for the compiler to insert "extraneous" stores, but I have seen that happen, too.

3. The bandwidths reported by STREAM are based only on the best time measured for each of the four kernels.

STREAM does not attempt to estimate memory traffic or performance associated with the initialization of the main arrays (including instantiation of the virtual to physical page mappings) or for the validation of the results.
With the default setting of NTIMES=10, the data initialization and results validation result in non-negligible additional memory access instructions and memory traffic.

View solution in original post

McCalpinJohn · ‎01-04-2016

Major issues:

You have to be sure that you account for the payload sizes of the loads and stores (in Bytes) to convert from "instructions" to "bandwidth".
STREAM reports an estimated bandwidth based on the assumption that the main arrays are being loaded from memory and written to memory on each iteration. It does not attempt to estimate instruction counts -- just main memory access counts.
STREAM reports estimated bandwidth for the kernels only -- not for the initialization code or the validation code.

More details:

1. Depending on the compiler used, the compilation options used, and the hardware platform used, each "load instruction" could be loading anywhere from 8 Bytes to 32 Bytes of data (or up to 64 Bytes on Xeon Phi). If you instrument at the loop level most of the loads and stores of the main array variables will be of the same size, but if you measure at the whole program level you will be seeing a mix of many different load and store sizes.

For Haswell systems and high levels of optimization, almost all accesses to the main arrays will be using 256-bit (32-Byte) AVX load and store instructions.
For Sandy Bridge/Ivy Bridge systems, the compiler might use 256-bit (32-Byte) load and store instructions or it might load the 256-bit registers using two 128-bit load instructions. Using 128-bit loads often provides slightly better (~5% to ~10%) performance on Sandy Bridge/Ivy Bridge processors.

2. In the particular case of counting loads, I have seen a surprising number of times that a compiler will re-load one or more pointers in each iteration of the loop -- even when the loop is clearly vectorized and unrolled. These pointers will almost always be found in the L1 Data Cache (so the loads will not impact the memory bandwidth or performance), but they will skew the load and store counts for the loops in very confusing ways. It is much less common for the compiler to insert "extraneous" stores, but I have seen that happen, too.

3. The bandwidths reported by STREAM are based only on the best time measured for each of the four kernels.

STREAM does not attempt to estimate memory traffic or performance associated with the initialization of the main arrays (including instantiation of the virtual to physical page mappings) or for the validation of the results.
With the default setting of NTIMES=10, the data initialization and results validation result in non-negligible additional memory access instructions and memory traffic.

TimP · ‎01-04-2016

If you don't compile with streaming stores (for example, if you usèd gcc), stores would incur a preload of the same data, which papi should report.

McCalpinJohn · ‎01-04-2016

The PAPI_SR_INS event should map to something like MEM_UOP_RETIRED:ANY_STORES, which will only count the instructions that have stores.

With "normal" stores, the hardware will load the data from DRAM -- typically by a hardware prefetch into the L3 or L2, followed by a store miss that brings that data into the L1 Data Cache. So the data motion changes, but the instruction counts don't change.

Sidharth_Kashyap · ‎01-04-2016

Hello John/Tim,

Thank you for the guidance.

> You have to be sure that you account for the payload sizes of the loads and stores (in Bytes) to convert from "instructions" to "bandwidth"

I stand corrected, the corrected memory bandwidth formula = (PAPI_LD_INS+PAPI_SR_INS)*8/TIME

> STREAM reports estimated bandwidth for the kernels only -- not for the initialization code or the validation code.

TAU lets us choose the key lines of code, for example, lines 342-345 in stream.c file, where the triad operation is performed. I ignored the aggregated numbers outside the key loops.

> Depending on the compiler used, the compilation options used, and the hardware platform used, each "load instruction" could be loading anywhere from 8 Bytes to 32 Bytes of data (or up to 64 Bytes on Xeon Phi)

I will verify the assembly generated, is it recommended to put the corresponding (32 if it is doing AVX vector addition etc.) number of bytes as the multiplier to (PAPI_LD_INS+PAPI_SR_INS)?

>MEM_UOP_RETIRED:ANY_STORES

Thank you, I will compare this and PAPI using VTune.

This exercise is mainly being done to verify the percentage accuracy of using PAPI counters and whether we can trust the numbers reported (mainly FLOPS and MB/s) on larger scientific code which do not render well to this kind of verifiability.

Thank you.

McCalpinJohn · ‎01-04-2016

The PAPI_LD_INS and PAPI_SR_INS events count load instructions, not bandwidth. Even with the STREAM benchmark there is no attempt to make sure that all load instructions miss all the caches -- only that (effectively) all of the data for the main arrays comes from main memory (and returns there after being updated). Even limiting the analysis to a single source code loop is not enough to ensure that no additional loads are generated -- you have to inspect the assembly code and understand enough to be sure which code path(s) are being executed. A compiler may generate additional loads (e.g., re-loading pointers) without any obvious reason, and may switch between generating 128-bit loads and generating 256-bit loads due to factors external to the loop being compiled (such as estimated loop lengths or concerns about alignment).

The MEM_UOP_RETIRED:ANY_LOADS and MEM_UOP_RETIRED:ANY_STORES events are generally accurate on Intel processors, though there are some processors where these events have significant bugs if HyperThreading is enabled.

If you want to count memory bandwidth, you should use counters that are intended to count memory bandwidth, such as the Memory Controller counters in the Uncore on Xeon E5 platforms. These are described in the "Uncore Performance Monitoring Guide" documents for each Xeon E5 family.

The Floating Point operation counters are known to overcount seriously (up to 10x) for STREAM on Sandy Bridge and Ivy Bridge cores, and are not available on the Haswell core. If you have a Broadwell or Skylake processor (and a new enough version of PAPI), there are new performance counter events for floating-point operations, but I have not been able to test either of those yet.

Sidharth_Kashyap · ‎01-04-2016

Thank you "Dr. Bandwidth", that clarifies the huge doubt.

I tried the suggested method and this formula is reporting numbers similar to Triad (counters measured in the region of interest using Instruments).

(MEM_LOAD_UOPS_RETIRED.L3_MISS+MEM_UOPS_RETIRED.ALL_STORES)*8/Time

Machine - Broadwell i7, 3.1 GHz, OMP Threads=4

Results:

Number reported by stream: 13802.5 MB/s

Counter: 13653.8 MB/s

Regards,

Sid

McCalpinJohn · ‎01-05-2016

Looks good. The STREAM output includes the bandwidth based on the minimum time, but also includes the minimum, average, and maximum times measured for each kernel. If you scale the number reported by STREAM by the ratio of the minimum to average times you will probably get an even closer agreement (assuming that there are enough useful printed digits in the min/avg/max times).

Of course you can also change the STREAM source code so that it computes the bandwidth using the average time instead of the minimum time, but the 1.09% difference in your results is entirely typical of the difference between average and best-case performance, so there are no indications of anything amiss.

John_W_3 · ‎06-30-2016

I'm trying to reproduce Sidharth's result on a Haswell (Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz) using Open|SpeedShop with PAPI counters, but I'm not getting the same result. Any ideas?

wohlbier@r11i4n12:~/devel/oss/stream> ls
Makefile READ.ME bandwidth.sh mysecond.c run stream.c stream.f
wohlbier@r11i4n12:~/devel/oss/stream> icc --version
icc (ICC) 16.0.3 20160415
Copyright (C) 1985-2016 Intel Corporation. All rights reserved.

wohlbier@r11i4n12:~/devel/oss/stream> make
icc -O2 -DNTIMES=100 -DTUNED=1 -fno-inline-functions -c -o mysecond.o mysecond.c
icc -O2 -DNTIMES=100 -DTUNED=1 -fno-inline-functions -c mysecond.c
ifort -O2 -c stream.f
ifort -O2 stream.o mysecond.o -o stream_f.exe
icc -O2 -DNTIMES=100 -DTUNED=1 -fno-inline-functions stream.c -o stream_c.exe
stream.c(267): warning #3180: unrecognized OpenMP #pragma
#pragma omp parallel for
^

stream.c(286): warning #3180: unrecognized OpenMP #pragma
#pragma omp parallel for
^

stream.c(556): warning #3180: unrecognized OpenMP #pragma
#pragma omp parallel for
^

stream.c(564): warning #3180: unrecognized OpenMP #pragma
#pragma omp parallel for
^

stream.c(572): warning #3180: unrecognized OpenMP #pragma
#pragma omp parallel for
^

stream.c(580): warning #3180: unrecognized OpenMP #pragma
#pragma omp parallel for
^

wohlbier@r11i4n12:~/devel/oss/stream> osshwcsamp ./stream_c.exe MEM_LOAD_UOPS_RETIRED:L3_MISS,MEM_UOPS_RETIRED:ALL_STORES
[openss]: hwcsamp using default sampling rate: "100".
[openss]: hwcsamp using user specified papi events: "MEM_LOAD_UOPS_RETIRED:L3_MISS,MEM_UOPS_RETIRED:ALL_STORES"
IN foundLibary, libname=libmpi
IN foundLibary, exename=./stream_c.exe
IN foundLibary, ldd=0x6b8670
start=
start=/lib64/libm.so.6
start=/app/gmpapp/gcc/platform/gcc-4.8.4/lib64/libgcc_s.so.1
start=/lib64/libc.so.6
start=/lib64/libdl.so.2
start=/lib64/ld-linux-x86-64.so.2
In foundLibrary, return false=/lib64/ld-linux-x86-64.so.2
Creating topology file for pbs frontend node r11i4n12
Generated topology file: ./cbtfAutoTopology
Running hwcsamp collector.
Program: ./stream_c.exe
Number of mrnet backends: 1
Topology file used: ./cbtfAutoTopology
IN foundLibary, libname=libmpi
IN foundLibary, exename=./stream_c.exe
IN foundLibary, ldd=0x7353f0
start=
start=/lib64/libm.so.6
start=/app/gmpapp/gcc/platform/gcc-4.8.4/lib64/libgcc_s.so.1
start=/lib64/libc.so.6
start=/lib64/libdl.so.2
start=/lib64/ld-linux-x86-64.so.2
In foundLibrary, return false=/lib64/ld-linux-x86-64.so.2
executing sequential program: cbtfrun -m -c hwcsamp ./stream_c.exe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 100 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 8308 microseconds.
(= 8308 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 18748.1 0.008572 0.008534 0.008665
Scale: 20002.6 0.008016 0.007999 0.008069
Add: 20438.0 0.011777 0.011743 0.011796
Triad: 20500.9 0.011736 0.011707 0.011766
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
All Threads are finished.
wohlbier@r11i4n12:~/devel/oss/stream> ./bandwidth.sh stream_c.exe-hwcsamp-0.openss
[openss]: The restored experiment identifier is: -x 1

Exclusive mem_load_uops_retired:l3_miss mem_uops_retired:all_stores BW [MB/s] Function (defining location)
CPU time
in
seconds.
1.360000 34827437 596835605 3543.534234 tuned_STREAM_Triad (stream_c.exe)
1.000000 19457055 484900193 3847.940430 tuned_STREAM_Scale (stream_c.exe)
0.980000 24495728 435517030 3581.243694 tuned_STREAM_Add (stream_c.exe)
0.670000 10121302 232137037 2758.633503 __intel_avx_rep_memcpy (stream_c.exe)
0.150000 148637 18209017 933.718567 main (stream_c.exe)
0.020000 150379 2497279 1010.001373 checkSTREAMresults (stream_c.exe)
4.180000 89200538 1770096161 Report Summary

wohlbier@r11i4n12:~/devel/oss/stream> cat bandwidth.sh
#! /bin/bash

file=$1
if [ x$file == x ] ; then
echo "error: need an openss file"
exit
fi

openss -batch -f $file <<EOF
# osshwcsamp ./stream_c.exe MEM_LOAD_UOPS_RETIRED:L3_MISS,MEM_UOPS_RETIRED:ALL_STORES
expview -m time, mem_load_uops_retired:l3_miss, mem_uops_retired:all_stores, Header(\"BW [MB/s]\", Ratio(Mult(8.0,Add(Float(mem_load_uops_retired:l3_miss),Float(mem_uops_retired:all_stores))),Mult(1048576.0,time)))
# osshwcsamp ./stream_c.exe MEM_UOPS_RETIRED:ALL_LOADS,MEM_UOPS_RETIRED:ALL_STORES
#expview -m time, mem_uops_retired:all_loads, mem_uops_retired:all_stores, Header(\"BW [MB/s]\", Ratio(Mult(8.0,Add(Float(mem_uops_retired:all_loads),Float(mem_uops_retired:all_stores))),Mult(1048576.0,time)))
EOF
wohlbier@r11i4n12:~/devel/oss/stream>

Vadim_K_Intel · ‎06-30-2016

2 John:

Your code is compiled without OpenMP multithreading support. Did you do it on purpose? If not - try to add -fopenmp or -qopenmp compiler flags to your Makefile.

John_W_3 · ‎06-30-2016

I've done it both with and without OpenMP. It shouldn't really matter. I'm trying to see if I can get the hardware counters to give the same result as the STREAM output, which is obviously different with and without OpenMP.

McCalpinJohn · ‎07-01-2016

The MEM_LOAD_UOPS_RETIRED.L3_MISS counter increments when a "demand load" misses in the L3 cache. For many access patterns the data will be prefetched into the L3 or L2 cache before the "demand load" gets there. In these cases the count of "L3_MISS" events will be much lower than you might expect (and the corresponding computed bandwidth will be much lower than you might expect). The fraction of the total loads that get prefetched into the L2 or L3 before use depends on a whole lot of factors, including the frequency of the core and uncore, the number of threads in use, the DRAM configuration, the use of C1E state on the other socket(s) in the system, the "QPI snoop mode" that the system is booted in, and many other factors.

If you disable the hardware prefetchers using the instructions at https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors, you should get better agreement with expectations.

John_W_3 · ‎07-02-2016

Thanks very much. I will try that.

John_W_3 · ‎02-09-2017

I am wondering about the following approximation. I have instrumented the triad function in stream to count MEM_LOAD_UOPS_RETIRED:L2_MISS and MEM_UOPS_RETIRED:ALL_STORES. When I run it on KNL and compute 64*(L2_MISS + ALL_STORES)/time I get 48.8 GB/s where the benchmark reports 84.3 GB/s. As Dr. Bandwidth pointed out apparently it's not unexpected that L2_MISS will be too low. I wondered if instead 64*2*ALL_STORES/time would be a decent approximation for this kernel. This gives in this case 85.0 GB/s. Is this just coincidence?

Recently I was reading "Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor" by Doerfler, et al. and for the stream triad they count two reads, one write allocate and one write back when calculating the Arithmetic Intensity. Is it reasonable, then, for this kernel to use the 2x ALL_STORES approximation? More generally I suppose one would have to manually count the number of stores as compared to the number of loads to make a comparable approximation, which doesn't sound awesome.

Thanks.

jgw

McCalpinJohn · ‎02-10-2017

If you know the ratio of reads and writes to/from the last-level cache, then you certainly only need accurate counters for one or the other. This is easy for STREAM, but quite difficult for substantial applications. Fortunately the memory controller read and write counters have been accurate for the last several generations of processors. (They are often inconvenient to access, especially for the "client" parts, but the results are easy to interpret.)

For the application areas that I pay attention to, many values are read in order to compute each value to be written, and the caches reduce the memory traffic for the reads more effectively than they reduce the writeback rates associated with the cache lines written by the stores. As an example, the WRF (local-area weather forecast code) benchmark that I have been working with recently shows a 5% reduction in total memory read traffic when run on two nodes rather than one node (i.e., doubling the size of the L3 cache available), while the memory write traffic is only reduced by about 1.7%. This result is not surprising, but I certainly would not have been able to predict it quantitatively, nor can I predict the magnitude of the changes in read bandwidth reduction and write bandwidth reduction as I distribute this (fixed-size) over more and more nodes.

Intel's approach to most of the performance counters in the cache hierarchy focuses on identifying high-latency operations that may cause stalls, rather than focusing on bulk traffic. That is why the MEM_LOAD_UOPS_RETIRED.* counters are looking at where demand loads find their data -- high-latency demand loads can cause core stalls, while hardware prefetches cannot. The writebacks of dirty cache lines higher-numbered cache levels or to memory also cannot directly cause stalls. (They can indirectly cause stalls by increasing the latency of memory read operations, but that can be monitored by the performance counters that look at demand loads.)

Linford__John · ‎02-23-2017

Hi Folks,

I've had some success getting TAU output to match stream on KNL, but I could use a sanity check. I cut stream down to just the copy kernel, disabled streaming stores and OpenMP and, checked how many bytes are loaded/stored per memory UOP:

NTIMES = 20
STREAM_ARRAY_SIZE = 134217728
copy_read_bytes = sizeof(double) * STREAM_ARRAY_SIZE = 8*134217728*20 = 21474836480
copy_write_bytes = sizeof(double) * STREAM_ARRAY_SIZE = 8*134217728*20 = 21474836480

PAPI_NATIVE_MEM_UOPS_RETIRED_ALL_STORES = 335560766
stored bytes: 8*134217728*20
bytes stored per UOP: 8*134217728*20 / 335560766 = 63.996863328

PAPI_NATIVE:MEM_UOPS_RETIRED:ALL_LOADS = 335575805
loaded bytes: 8*134217728*20
bytes loaded per UOP: 8*134217728*20/335575805 = 63.993995276

I'm pretty sure I'm on the right path because bytes loaded/stored per UOP is ~64, and #loads ~= #stores as expected in copy. For a single KNL core, TAU matches stream perfectly with bandwidth = (64*(PAPI_NATIVE:MEM_UOPS_RETIRED:ALL_LOADS + PAPI_NATIVE:MEM_UOPS_RETIRED:ALL_STORES)/TIME).

Function Best Rate MB/s Avg time Min time Max time
Copy: 12739.2 0.168674 0.168573 0.168955
Scale: 9220.5 0.233070 0.232902 0.233286
Add: 10160.1 0.317279 0.317047 0.317655
Triad: 9940.4 0.324792 0.324053 0.335377

vs. the attached serial_bandwith.png.

But then I set OMP_NUM_THREADS=64 and PAPI is results are high by about 50%. So I looked at how load/store counts change relative to OMP_NUM_THREADS (also attached). The load/store ratios are pretty much constant across thread counts (1/1 for copy and scale, 2/1 for add and triad), so I suspect there's something in the multi-threaded run causing more load/store ops to actually be executed than are specified in the code. Also, since both plots are fairly linear, I'm pretty confident that the counters and bandwidth formula are correct, at least when streaming stores are disabled. Does that make sense?

I feel that, from the hardware's perspective, a load/store is a load/store regardless of purpose. If I ask the hardware "how many loads/stores" did you do, it isn't going to differentiate *useful* load stores. But, stream asks "how long does it take to calculate a certain result". Put it another way, stream is reporting peak *useful* bandwidth, while PAPI is reporting peak bandwidth but not indicating how much of that bandwidth was actually useful. Is this right?

So now I wonder if I can count "non-useful" operations and subtract their bandwidth contribution from the total. And I'm still working on the streaming store aspect, hope to understand that better soon.

Is there such a thing as "useful" vs. "non-useful" bandwidth?

Thanks!

McCalpinJohn · ‎02-23-2017

In OpenMP codes the count of load instructions is typically inflated by the spin loops used when a thread is waiting for the other threads to catch up at a barrier. This makes it difficult to interpret the results.

If I need more precision, I use a version of the code with an outer parallel loop set up with one iteration per thread, and an inner loop with manually computed start and stop indices. Then I read the counters within each thread, before and after the inner loop. This makes the second performance counter read *before* the thread enters the implicit barrier at the end of the loop, so my results are not contaminated by the spin-loop. (I also read the counters after the spin loop, but the important thing is that I can distinguish between the "working" phase and the "spinning" phase.)

On most Intel processors, a carefully controlled STREAM workload will have very small imbalance across threads, but the spin-loop loads run really fast, so it only takes a tiny imbalance to create a visible difference in load counts. On KNL, the 2D mesh with edge-connected MCDRAM and DDR4 controllers, introduces a greater degree of *intrinsic* variability in sustained bandwidth across cores. It is still not a "large" difference, but it does mean that there will be a few percent of spin-loop time on some of the cores while they wait for the cores that suffer from larger average congestion in memory traffic.

If you want to measure bandwidth, you should be doing it with the uncore counters in the MCDRAM and DDR4 controllers. If these are not supported in your environment, the OFFCORE_RESPONSE counters can count L2 misses that are sourced from a variety of places. These have "tile" scope, so you only need to measure using one thread on each core pair. I don't know if PAPI knows how to do this or not....

TimP · ‎02-23-2017

If you can identify the time critical thread and filter the others out (e.g. in VTune), the spin wait counts won't be as much of a problem.

On the other hand, in MPI profiling, we often wished to set a large cutoff for spin waits so as to see them all.

Linford__John · ‎10-18-2017

Hi Folks,

It's been a while, but I wanted to report that we've worked out how to measure DRAM and MCDRAM flat/cache bandwidth with PAPI and TAU. The formulas are:

DRAM Bandwidth in MB:
RD_BW = (sum(j=0,5; knl_unc_imc::UNC_M_CAS_COUNT:RD:cpu=0) / 16384) / TIME
WR_BW = (sum(j=0,5; knl_unc_imc::UNC_M_CAS_COUNT:WR:cpu=0) / 16384) / TIME
DRAM_BW = RD_BW + WR_BW

Flat MCDRAM Bandwidth in MB:
RD_BW = (sum(i=0,7; knl_unc_edc_eclk::UNC_E_RPQ_INSERTS:cpu=0) / 16384) / TIME
WR_BW = (sum(i=0,7; knl_unc_edc_eclk::UNC_E_WPQ_INSERTS:cpu=0) / 16384) / TIME
FLAT_BW = RD_BW + WR_BW

Cache MCDRAM Bandwidth in MB:
CACHE_BW = (sum(i=0,7; knl_unc_edc_uclk::UNC_E_EDC_ACCESS:{HIT|MISS}_{CLEAN|DIRTY}:cpu=0) / 16384) / TIME

This formula for MCDRAM cache bandwidth is very different from the one given on Page 33 of "Intel Xeon Phi Processor
Performance Monitoring Reference Manual" vol 2.

P33_RD_BW = (ECLK_Events_RPQ_Inserts - UCLK_Events_EDC_Hit/Miss_MISS_CLEAN - UCLK_Events_EDC_Hit/Miss_MISS_DIRTY) * 64 / TIME
P33_WR_BW = (ECLK_Events_WPQ_Inserts - DCLK_Events_CAS_Reads) * 64 / TIME

The Page 33 formulas don't seem to work for me on any of the three KNL systems I've tried. However, since MCDRAM cache sits between DRAM counting all HIT/MISS events seems to give accurate numbers. Not sure this works in Hybrid mode. Would love to know more about the Page 33 formulas and how they're supposed to work.

Usage: to gather cache-mode bandwidth with TAU Commander, set your metrics like:

tau measurement edit profile --metrics PAPI_NATIVE:knl_unc_edc_uclk{0..7}::UNC_E_EDC_ACCESS:HIT_CLEAN:cpu=0,PAPI_NATIVE:knl_unc_edc_uclk{0..7}::UNC_E_EDC_ACCESS:HIT_DIRTY:cpu=0,PAPI_NATIVE:knl_unc_edc_uclk{0..7}::UNC_E_EDC_ACCESS:MISS_CLEAN:cpu=0,PAPI_NATIVE:knl_unc_edc_uclk{0..7}::UNC_E_EDC_ACCESS:MISS_DIRTY:cpu=0

Then after gathering the data use this derived metric formula in ParaProf:

(( PAPI_NATIVE:knl_unc_edc_uclk1::UNC_E_EDC_ACCESS:MISS_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk4::UNC_E_EDC_ACCESS:MISS_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk2::UNC_E_EDC_ACCESS:MISS_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk3::UNC_E_EDC_ACCESS:MISS_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk6::UNC_E_EDC_ACCESS:MISS_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk0::UNC_E_EDC_ACCESS:MISS_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk5::UNC_E_EDC_ACCESS:MISS_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk7::UNC_E_EDC_ACCESS:MISS_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk2::UNC_E_EDC_ACCESS:MISS_CLEAN:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk1::UNC_E_EDC_ACCESS:MISS_CLEAN:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk5::UNC_E_EDC_ACCESS:MISS_CLEAN:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk7::UNC_E_EDC_ACCESS:MISS_CLEAN:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk0::UNC_E_EDC_ACCESS:MISS_CLEAN:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk4::UNC_E_EDC_ACCESS:MISS_CLEAN:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk3::UNC_E_EDC_ACCESS:MISS_CLEAN:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk6::UNC_E_EDC_ACCESS:MISS_CLEAN:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk1::UNC_E_EDC_ACCESS:HIT_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk4::UNC_E_EDC_ACCESS:HIT_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk2::UNC_E_EDC_ACCESS:HIT_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk3::UNC_E_EDC_ACCESS:HIT_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk6::UNC_E_EDC_ACCESS:HIT_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk0::UNC_E_EDC_ACCESS:HIT_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk5::UNC_E_EDC_ACCESS:HIT_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk7::UNC_E_EDC_ACCESS:HIT_DIRTY:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk2::UNC_E_EDC_ACCESS:HIT_CLEAN:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk1::UNC_E_EDC_ACCESS:HIT_CLEAN:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk5::UNC_E_EDC_ACCESS:HIT_CLEAN:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk7::UNC_E_EDC_ACCESS:HIT_CLEAN:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk0::UNC_E_EDC_ACCESS:HIT_CLEAN:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk4::UNC_E_EDC_ACCESS:HIT_CLEAN:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk3::UNC_E_EDC_ACCESS:HIT_CLEAN:cpu=0 + PAPI_NATIVE:knl_unc_edc_uclk6::UNC_E_EDC_ACCESS:HIT_CLEAN:cpu=0 )/16384/ TIME )

Of course all of this requires `cat /proc/sys/kernel/perf_event_paranoid` <= 0

Cheers!

McCalpinJohn · ‎10-19-2017

The "proper" KNL formulas depend on what it is that you are trying to measure.

In cached mode, the MCDRAM is accessed for every L2 miss. I don't think that Intel has described where the tag checking is performed, but I would guess that the tag information is hidden in the ECC bits, so the full line is read for every L2 miss. If the tag information indicates that this line matches the physical address being requested, then it is a "hit" and the cache line can be forwarded to the requester. If the tag information indicates that this is a "miss", then the line in MCDRAM is invalidated (with writeback if dirty) and the original request is sent to the appropriate DDR4 controller.

The Intel formulas from the Uncore Performance Monitoring guide are intended to discount the reads of MCDRAM that correspond to misses (even though the data is actually read from the MCDRAM in these cases), giving the "effective" (or "useful") bandwidth from each level of the memory hierarchy.

The wording in the event descriptions is ambiguous, but after staring at the definitions for quite a while, I came to the conclusion that the MCDRAM counters do not provide enough information to determine how many streaming stores are accessing the MCDRAM cache.