Hi all,
I am using the following formula to calculate memory bandwidth:
(UNC_F_CH0_NORMAL_READ +
UNC_F_CH0_NORMAL_WRITE +
UNC_F_CH1_NORMAL_READ +
UNC_F_CH1_NORMAL_WRITE) x 64 / time
The VTune event summary is:
Event summary
-------------
Hardware Event Type    Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
---------------------  -------------------------  --------------------------------  -----------------
INSTRUCTIONS_EXECUTED  538814808221               269407                            2000003
CPU_CLK_UNHALTED       2798306197453              1399151                           2000003
Uncore Event summary
--------------------
Uncore Event Type              Uncore Event Count:Self
-----------------------------  -----------------------
UNC_F_CH0_NORMAL_READ[UNIT0]   8153259300
UNC_F_CH0_NORMAL_WRITE[UNIT0]  4859302026
UNC_F_CH1_NORMAL_READ[UNIT0]   8147172390
UNC_F_CH1_NORMAL_WRITE[UNIT0]  4859140371
But according to this, the memory bandwidth would be 5.55 GB/s, whereas it should actually reach 80 GB/s for 8 memory controllers. Why is the bandwidth so low?
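The formula above can be sketched in Python as follows. This is only an illustration: the function name `bandwidth_gb_per_s` and the `elapsed_seconds` parameter are assumptions, and the elapsed time of the run is not stated in this post, so the value used below is a placeholder. The sketch assumes each counted read or write moves one 64-byte cache line.

```python
CACHE_LINE_BYTES = 64  # assumed: each counted access moves one 64-byte line


def bandwidth_gb_per_s(event_counts, elapsed_seconds):
    """Sum read/write event counts and convert to GB/s (1 GB = 1e9 bytes)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_bytes = sum(event_counts) * CACHE_LINE_BYTES
    return total_bytes / elapsed_seconds / 1e9


# Counts copied from the uncore event summary above (UNIT0 only).
unit0_counts = [8153259300, 4859302026, 8147172390, 4859140371]

# Placeholder elapsed time; the real run time is not given in the post.
example_seconds = 300.0
print(bandwidth_gb_per_s(unit0_counts, example_seconds))
```

The summary lists counters for UNIT0 only, so a figure computed this way covers the two channels of that one unit rather than every channel on the device.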
Why do you think that the bandwidth should be 80 GB/s?
I don't know how VTune measures and reports the memory controller values, but in my tests using the same counters the sum of the reads and writes across the 16 channels matches expected values to within a few percent.
I recommend using something like the STREAM benchmark as a workload to generate an easy-to-compute amount of memory traffic. For this type of comparison I would use something like STREAM_ARRAY_SIZE=300000000 (300 million), so that the sum of the sizes of the three arrays will be about 6.7 GiB. I also recommend setting NTIMES=1000 so that the effects of startup and validation are negligible compared to the traffic in the main loops. Compile with "-opt-streaming-stores always" and "-mcmodel=medium". For this test to run in a reasonable amount of time, the code needs to be compiled with OpenMP and run using KMP_AFFINITY=scatter and OMP_NUM_THREADS set to use all the cores (or one less than all the cores).
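As a quick check of the 6.7 GiB figure, here is a minimal sketch of the footprint arithmetic. It assumes the default STREAM element type (8-byte double) and the three arrays a, b, and c; the variable names are illustrative only.

```python
STREAM_ARRAY_SIZE = 300_000_000  # elements per array, as suggested above
BYTES_PER_ELEMENT = 8            # assumed: STREAM's default type is double
NUM_ARRAYS = 3                   # the a, b, and c arrays

total_bytes = STREAM_ARRAY_SIZE * BYTES_PER_ELEMENT * NUM_ARRAYS
total_gib = total_bytes / 2**30  # 1 GiB = 2**30 bytes

print(f"{total_bytes} bytes = {total_gib:.2f} GiB")
# prints "7200000000 bytes = 6.71 GiB"
```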
When using streaming stores, the four STREAM kernels generate a total of 6 8-byte reads and 4 8-byte stores per array index per iteration, so the total read traffic for 300 million elements and 1000 iterations is expected to be 300e6 elements * 1000 iterations * 6 reads/element/iteration * 8 bytes/read = 1.44e13 Bytes or 225 billion cache line reads. The write traffic is 2/3 of the read traffic: 9.6e12 Bytes or 150 billion cache line writes.
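The expected counts can be reproduced with a short calculation. This is a sketch under the assumptions stated above (streaming stores, so no extra reads for the stored arrays; 64-byte cache lines); the function name `expected_stream_traffic` is hypothetical.

```python
def expected_stream_traffic(array_size, ntimes, bytes_per_element=8, line_bytes=64):
    """Expected STREAM memory traffic when streaming stores are used."""
    reads_per_index = 6   # Copy 1 + Scale 1 + Add 2 + Triad 2
    writes_per_index = 4  # one store per kernel
    read_bytes = array_size * ntimes * reads_per_index * bytes_per_element
    write_bytes = array_size * ntimes * writes_per_index * bytes_per_element
    return {
        "read_bytes": read_bytes,
        "read_lines": read_bytes // line_bytes,
        "write_bytes": write_bytes,
        "write_lines": write_bytes // line_bytes,
    }


traffic = expected_stream_traffic(array_size=300_000_000, ntimes=1000)
for name, value in traffic.items():
    print(f"{name}: {value:.3e}")
```

The `read_lines` and `write_lines` values are what the summed NORMAL_READ and NORMAL_WRITE counts across all channels would be compared against.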
With transparent huge pages enabled, STREAM runs at a rate of up to 175 GB/s, giving a minimum execution time of 2.4e13 bytes/(175e9 bytes/second) = 137 seconds. The actual execution time will be somewhat longer -- but probably under 3 minutes. It is probably a good idea to start with NTIMES=10 just to make sure everything works and the reported bandwidths are reasonable (>150,000 MB/s) for all four kernels.
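The run-time estimate follows directly from the traffic totals. A minimal sketch, taking the 175 GB/s figure quoted above as the best-case rate (variable names are illustrative):

```python
read_bytes = 1.44e13        # expected read traffic from the estimate above
write_bytes = 9.6e12        # expected write traffic from the estimate above
peak_bytes_per_s = 175e9    # best-case STREAM rate quoted above

total_bytes = read_bytes + write_bytes
min_seconds = total_bytes / peak_bytes_per_s

print(f"total traffic: {total_bytes:.2e} bytes")
print(f"minimum run time: {min_seconds:.0f} s")
# prints "minimum run time: 137 s"
```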