Bandwidth Analysis for Xeon-phi coprocessor using Intel VTune Amplifier

Saumya_B_ · 09-24-2016

Hi all,

I am using the following formula to calculate memory bandwidth:

(UNC_F_CH0_NORMAL_READ + UNC_F_CH0_NORMAL_WRITE + UNC_F_CH1_NORMAL_READ + UNC_F_CH1_NORMAL_WRITE) x 64 / time

The VTune event summary is:

Event summary
-------------
Hardware Event Type    Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
---------------------  -------------------------  --------------------------------  -----------------
INSTRUCTIONS_EXECUTED               538814808221                            269407            2000003
CPU_CLK_UNHALTED                   2798306197453                           1399151            2000003

Uncore Event summary
--------------------
Uncore Event Type              Uncore Event Count:Self
----------------------------- -----------------------
UNC_F_CH0_NORMAL_READ[UNIT0]                8153259300
UNC_F_CH0_NORMAL_WRITE[UNIT0]               4859302026
UNC_F_CH1_NORMAL_READ[UNIT0]                8147172390
UNC_F_CH1_NORMAL_WRITE[UNIT0]               4859140371

But according to this formula the memory bandwidth would be 5.55 GB/s, whereas it should actually reach 80 GB/s for 8 memory controllers. Why is the bandwidth so low?
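For reference, the formula can be evaluated with a short script. This is only a sketch: the counter values are copied from the uncore summary above, the factor of 64 follows the formula (one 64-byte transfer per event), and the elapsed time is not stated in this post, so `elapsed_seconds` is a hypothetical input. The last function works backwards from the reported 5.55 GB/s to the run time that figure implies.

```python
# Uncore counts copied from the VTune uncore event summary in this post.
COUNTS = {
    "UNC_F_CH0_NORMAL_READ": 8153259300,
    "UNC_F_CH0_NORMAL_WRITE": 4859302026,
    "UNC_F_CH1_NORMAL_READ": 8147172390,
    "UNC_F_CH1_NORMAL_WRITE": 4859140371,
}

# The formula multiplies each event by 64 bytes (one cache line per event).
CACHE_LINE_BYTES = 64


def total_bytes(counts):
    """Total bytes moved according to the formula's numerator."""
    return sum(counts.values()) * CACHE_LINE_BYTES


def bandwidth_gb_per_s(counts, elapsed_seconds):
    """Bandwidth in GB/s (1 GB = 1e9 bytes) for a given elapsed time.

    elapsed_seconds is an assumption supplied by the caller; the post
    does not state the measured run time.
    """
    return total_bytes(counts) / elapsed_seconds / 1e9


def implied_elapsed_seconds(counts, reported_gb_per_s):
    """Run time that would produce the reported bandwidth."""
    return total_bytes(counts) / (reported_gb_per_s * 1e9)


if __name__ == "__main__":
    print("bytes counted:", total_bytes(COUNTS))
    print("implied run time (s):", implied_elapsed_seconds(COUNTS, 5.55))
```

With these counts, a reported 5.55 GB/s corresponds to a run of roughly 300 seconds, which is a quick sanity check on the value used for `time`.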

McCalpinJohn · 10-02-2016

Why do you think that the bandwidth should be 80 GB/s?

I don't know how VTune measures and reports the memory controller values, but in my tests using the same counters the sum of the reads and writes across the 16 channels matches expected values to within a few percent.

I recommend using something like the STREAM benchmark as a workload to generate an easy-to-compute amount of memory traffic. For this type of comparison I would use something like STREAM_ARRAY_SIZE=300000000 (300 million), so that the sum of the sizes of the three arrays will be about 6.7 GiB. I also recommend setting NTIMES=1000 so that the effects of startup and validation are negligible compared to the traffic in the main loops. Compile with "-opt-streaming-stores always" and "-mcmodel=medium". For this test to run in a reasonable amount of time, the code needs to be compiled with OpenMP and run using KMP_AFFINITY=scatter and OMP_NUM_THREADS set to use all the cores (or one less than all the cores).
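The 6.7 GiB figure can be checked with a few lines of Python. This is a minimal sketch that assumes STREAM's default 8-byte (double) elements and its three arrays; the function name is made up for illustration.

```python
def stream_footprint_gib(array_size, num_arrays=3, bytes_per_element=8):
    """Combined size of the STREAM arrays in GiB (1 GiB = 2**30 bytes).

    Assumes 8-byte elements, which matches the 8-byte reads and
    stores used in the traffic estimate.
    """
    return array_size * num_arrays * bytes_per_element / 2**30


if __name__ == "__main__":
    # STREAM_ARRAY_SIZE=300000000, as suggested above.
    print(round(stream_footprint_gib(300_000_000), 2), "GiB")
```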

When using streaming stores, the four STREAM kernels generate a total of six 8-byte reads and four 8-byte stores per array index per iteration, so the total read traffic for 300 million elements and 1000 iterations is expected to be 300e6 elements * 1000 iterations * 6 reads/element/iteration * 8 bytes/read = 1.44e13 Bytes or 225 billion cache line reads. The write traffic is 2/3 of the read traffic: 9.6e12 Bytes or 150 billion cache line writes.
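The same arithmetic can be written as a small Python sketch, which makes it easy to redo for other array sizes or iteration counts. The per-kernel read and write counts are the standard STREAM access pattern with streaming stores (no read of the destination array); the function and variable names are made up for illustration.

```python
# (reads, writes) per array index for each STREAM kernel when streaming
# stores are used, so the destination array is not read before writing.
KERNELS = {
    "Copy": (1, 1),
    "Scale": (1, 1),
    "Add": (2, 1),
    "Triad": (2, 1),
}

BYTES_PER_ELEMENT = 8   # double precision
CACHE_LINE_BYTES = 64


def stream_traffic(array_size, ntimes):
    """Return expected (read_bytes, write_bytes) for the main loops."""
    reads = sum(r for r, _ in KERNELS.values())    # 6 per index per iteration
    writes = sum(w for _, w in KERNELS.values())   # 4 per index per iteration
    read_bytes = array_size * ntimes * reads * BYTES_PER_ELEMENT
    write_bytes = array_size * ntimes * writes * BYTES_PER_ELEMENT
    return read_bytes, write_bytes


if __name__ == "__main__":
    rd, wr = stream_traffic(300_000_000, 1000)
    print("read bytes:", rd, "cache lines:", rd // CACHE_LINE_BYTES)
    print("write bytes:", wr, "cache lines:", wr // CACHE_LINE_BYTES)
```

These cache line totals are the numbers to compare against the sum of the read and write counters across all channels.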

With transparent huge pages enabled, STREAM runs at a rate of up to 175 GB/s, giving a minimum execution time of 2.4e13 bytes/(175e9 bytes/second) = 137 seconds. The actual execution time will be somewhat longer -- but probably under 3 minutes. It is probably a good idea to start with NTIMES=10 just to make sure everything works and the reported bandwidths are reasonable (>150,000 MB/s) for all four kernels.
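As a rough sketch, the lower bound on run time follows directly from the total traffic and the peak rate. The 175 GB/s figure is the one quoted above; the function name is hypothetical, and real runs will take longer than this bound.

```python
def min_runtime_seconds(total_bytes, peak_bytes_per_second):
    """Lower bound on execution time if traffic moves at the peak rate."""
    return total_bytes / peak_bytes_per_second


if __name__ == "__main__":
    read_bytes = 1.44e13    # expected read traffic from the estimate above
    write_bytes = 9.6e12    # expected write traffic from the estimate above
    peak = 175e9            # bytes/second, with transparent huge pages
    t = min_runtime_seconds(read_bytes + write_bytes, peak)
    print(f"minimum run time: {t:.0f} s")
```

The same function gives a quick estimate for the NTIMES=10 trial run: the traffic, and so the lower bound, scales linearly with NTIMES.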