Software Archive
Read-only legacy content

Vtune bandwidth calculation does not match STREAM benchmark output on KNC

Michael_C_13
Beginner

Dear Intel Forum Gurus,

I am practicing VTune by using the VTune 2016 GUI to measure the bandwidth of the STREAM benchmark on a KNC 5110P. I compiled stream.c with the following options:

icpc -mmic -O3 -g -qopenmp -DSTREAM_ARRAY_SIZE=64000000 -qopt-prefetch-distance=64,8 -qopt-streaming-cache-evict=0 -qopt-streaming-stores never -restrict stream.c

Streaming stores are omitted because I want to try core-event-based sampling (more on that later).

First I tried to see if I could get the GUI to report the bandwidth directly. This thread (https://software.intel.com/en-us/forums/intel-vtune-amplifier-xe/topic/518185#comment-1793935) indicates that VTune 2013 can report the bandwidth directly, but I didn't see "Bandwidth" among my available analysis types (first screenshot below). Section 6.2 of this tutorial (https://software.intel.com/en-us/articles/tutorial-on-intel-xeon-phi-processor-optimization) indicates that VTune 2017 will give a nice bandwidth histogram, but I didn't see the Memory Usage viewpoint within the Memory Access analysis type (second screenshot below).

Next I tried to measure the bandwidth using the formula given in Section 5.4 of this article (https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding), specifically:

Read bandwidth = (L2_DATA_READ_MISS_MEM_FILL + L2_DATA_WRITE_MISS_MEM_FILL + HWP_L2MISS) * 64 / CPU_CLK_UNHALTED

Write bandwidth = (L2_VICTIM_REQ_WITH_DATA + SNP_HITM_L2) * 64 / CPU_CLK_UNHALTED

Bandwidth = Read bandwidth + Write bandwidth

I compiled without streaming stores because these events do not account for streaming stores. I created a custom analysis type to record all the necessary events and applied the formula (third screenshot below) to the Triad kernel (highlighted line). I divide CPU_CLK_UNHALTED by 60 in the denominator because I'm almost positive CPU_CLK_UNHALTED measures the sum of clock ticks on all 60 cores, so to get the actual wall time of the function I need to divide by 60.

My calculation with the metrics gave 182.75 GB/s, but the STREAM executable itself reported "Triad: 101985.9 MB/s." These are in the same ballpark, but the difference is still large enough to make me suspicious of my calculation.

My questions are:

1. Is there a way that I overlooked to get the GUI to tell me the bandwidth directly (perhaps computed under the hood using memory controller events instead of core events)?
2. Am I applying the formula using the core events correctly? If so, why is there such a large discrepancy with the output of the STREAM executable?

Thanks in advance for your help,
Michael

5 Replies
McCalpinJohn
Honored Contributor III

I can't tell if you are using the uncore event counts from "1.png" (and "2.png") in your calculations....  In my testing, these performance counters were accurate at the hardware level, but I have never tried to use them from VTune.

Your 182.75 GB/s number is closer than you might expect!  

STREAM does not count "write allocates", so if you compile without streaming stores, the actual bandwidth used will be higher than the bandwidth reported by STREAM.   The ratio is 3:2 for the Copy and Scale kernels and 4:3 for the Add and Triad kernels.  The average of your Copy and Scale numbers multiplied by 1.5 is 184.5 GB/s, and the average of your Add and Triad numbers multiplied by 1.333 is 167.3 GB/s.  

It is not clear whether the difference between the corrected STREAM values and the results of your formula is significant. In any case, the values are well within 10% of what I would expect to see.

 

Michael_C_13
Beginner

Hi Dr. Bandwidth,

Thank you for your reply; it's pretty encouraging. My calculation in 3.png does not use the uncore events shown in 1.png and 2.png. Rather, I use the core events shown in the highlighted line of 3.png (the function "main$omp$parallel_for@344" corresponds to the Triad kernel) and plug them into the formula I described above. I couldn't get VTune to display the uncore events by function, possibly because whatever hardware counter records them has no knowledge of the stack?

I don't think I can filter the uncore events by function, but I suppose I can use them to estimate the average bandwidth over the entire run of the STREAM executable. What approach do you recommend in this case? Can I just sum the uncore events, multiply by 64, and divide by an estimate of the kernels' total wall time, like this:

( UNC_F_CH0_NORMAL_READ[UNIT0] + UNC_F_CH0_NORMAL_WRITE[UNIT0] + UNC_F_CH1_NORMAL_READ[UNIT1] + UNC_F_CH1_NORMAL_WRITE[UNIT1] )*64 / ( wall time for kernels )

I can estimate "wall time for kernels" by computing (sum of CPU_CLK_UNHALTED for all kernels) / 60 cores / 1.05 GHz, or estimate it visually from the activity timeline VTune reports for my threads.

Also, is a write-allocate the same thing as a read-for-ownership?

Much appreciated,
Michael

 

McCalpinJohn
Honored Contributor III

VTune uses a sampling-based methodology to assign time or performance counter counts to specific lines of code.  This is harder to do for uncore counters -- both because there is no way to filter uncore events that are due to one process vs another process and because only a few Intel processors allow the uncore performance counter to generate an interrupt to use for sampling.

For STREAM, I typically assume that the total memory traffic is:

Bytes Read = (6*NTIMES+4)*sizeof(STREAM_TYPE)*STREAM_ARRAY_SIZE

Bytes Written = (4*NTIMES+4)*sizeof(STREAM_TYPE)*STREAM_ARRAY_SIZE

If streaming stores are not used, then the "Bytes Read" must be increased by the "Bytes Written", since each output array must be read (allocated) before being updated. The constant "4" in each case is an estimate of the memory traffic associated with instantiating the arrays, running the preliminary timing test (not part of the reported results), and reading the arrays for validation at the end. It is easy to count the latter two of these three, but the page instantiation process is mostly black magic, so it is very hard to be precise about what to expect. To avoid uncertainty associated with the details of the implementation of page instantiation, I often run with either very large NTIMES (e.g., 100 or more), or I do runs with two different values of NTIMES, take the difference of the counts, and attribute those differences to the different iteration counts. (I.e., I assume that the page instantiation overhead is the same in each case, so I can subtract it away.)

DRAM counters typically include all accesses -- not just the main arrays that I include in the counts above.  When running STREAM on small pages, there is usually a small increase in DRAM traffic due to Page Table Walks.  This is typically in the range of 1/64 to 2/64, depending on the sizes of the arrays and the specific processor in question.  With large pages (including Transparent Huge Pages), the TLB walk traffic is negligible.

On KNC, the ECC data is stored in DRAM and requires additional DRAM accesses to load.  It does not look like these extra ECC accesses are counted by the KNC DRAM counters.  This is good news and bad news -- the counts are easier to understand, but there does not appear to be a way to get counts of these extra DRAM accesses.   Most Intel processors use 64 Byte cache lines with memory controllers that are configured with a minimum access of 64 Bytes -- so each increment of the memory controller counter corresponds to 64 Bytes.  KNC is a bit different, with a minimum access granularity of 32 Bytes, so two accesses are required for each cache line transfer.  Looking back at some STREAM test data, it appears that the counters are set up to increment once per cache line, not once per 32-Byte burst.   In that case (using streaming stores), both the read traffic and the write traffic counted by the memory controller counters were about 1.1% higher than my estimates using the formulas above.  This is plenty close enough for me....

Michael_C_13
Beginner

Thank you! When I compile without streaming stores and set NTIMES=100, then compute the total memory traffic via either (a) your formula above or (b) 64*(sum of uncore events), the two numbers agree to within 2%.

However, there doesn't seem to be a good way to get VTune to report the time over which the memory transfer takes place (the denominator of the bandwidth calculation). For example, in one test the STREAM output gave the total runtime for all four kernels, computed as NTIMES*(sum of the "Avg time" column), as 4.00 sec. VTune gave the following results:

Total CPU time for the entire executable = CPU_CLK_UNHALTED / 60 cores / 1.05 GHz = 4.50 sec

Total CPU time for the STREAM kernels = (sum of CPU_CLK_UNHALTED for the kernels) / 60 cores / 1.05 GHz = 3.3 sec

Eyeballing the active region of the "Uncore Events" timeline = ~4 sec

It appears that the method that matches up best with the STREAM executable's output is visual inspection of the uncore-events timeline, which seems hand-wavy and probably impractical for less cut-and-dried codes. This is a minor gripe, though. My goal here was to improve my understanding of VTune, and your input has certainly helped.

Regards, Michael

McCalpinJohn
Honored Contributor III

STREAM was originally built as a very simple code that looked something like the ocean models that I was running.  All of these ran a very large number of steps, so startup times were not an issue.   The early versions of STREAM did not bother to check the results, so there was no extra validation time at the end.   The current (5.10) version does check the results, but for some reason I did not put a "#pragma omp parallel for" with a reduction clause on the main loop of the "check_STREAM_results()" function.   On systems with very large differences between single-thread and all-thread bandwidth, this can cause the validation code to take a lot longer than it needs to.  The next version will include that pragma, which will definitely decrease the "tail" time on machines like KNC and KNL.

There is really nothing I can do about the data instantiation time.  Every operating system is free to implement page instantiation in any way that it wants, and it is often extremely difficult to understand what is happening under the covers.  At some points in history, page instantiation has been very slow on some systems.  This is no longer the case on any of the systems that I use, but I have been unable to understand the details of the recent Linux implementation(s?) well enough to make sense of the performance counter results for the page instantiation phase.

I should probably add a summary statement of the total amount of data traffic that the benchmark thinks it has moved.  This would help make it easier to bridge to whole-program performance counter measurements.
