Beginner

Bandwidth tests

Hi,

I am playing with some programs to measure memory bandwidth on my system, a dual-socket Xeon Skylake Gold 6140 (2 sockets of 18 cores) with 12 DIMMs (6 per socket) of RAM at 2666 MHz, for a total of 96 GB. I wrote my own "stream" benchmark, and I am surprised by some results. On this platform, Intel Advisor (the roofline) claims 207 GB/s of memory bandwidth. The Intel Memory Latency Checker gives exactly the same result for the bandwidth. Here are the results given by my program.

Bandwidth, sum += a * b        : 182.698 Gb/s
Bandwidth, a = 0.0                : 103.311 Gb/s
Bandwidth, a = 2 * a           : 128.075 Gb/s
Bandwidth, a = b               : 136.004 Gb/s
Bandwidth, a = 2 * b           : 102.294 Gb/s
Bandwidth, a += 2 * b          : 101.337 Gb/s
Bandwidth, a = 2 * b + 3 * c: 114.601 Gb/s
Bandwidth, a = b + 3 * c    : 114.525 Gb/s

I have a few questions:

1/ Is there a way to reach the peak performance of 207 GB/s with the reduction (sum += a * b) ? Can we tune prefetching to do so?

2/ Why is the bandwidth for setting a to 0.0 so low? Can we make it faster?

Best regards

PS: The following code has been compiled with

icpc -g -std=c++11 -O3 -xCORE-AVX512 -qopenmp -DNDEBUG main.cpp -o main

and launched with thread pinning with 1 thread per core.

export OMP_PLACES=cores
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=36
./main

Here is the full listing

#include <chrono>
#include <cstddef>
#include <iostream>

int main() {
  const std::ptrdiff_t n = 1024 * 1024 * 1024;
  double *a = new double[n];
  double *b = new double[n];
  double *c = new double[n];
#pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    a[i] = 0.0;
    b[i] = 0.0;
    c[i] = 0.0;
  }

  const std::ptrdiff_t nb_times = 20;
  double sum = 0.0;
  auto point_begin = std::chrono::high_resolution_clock::now();
  for (std::ptrdiff_t k = 0; k < nb_times; ++k) {
#pragma omp parallel for reduction(+ : sum)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
      sum += a[i] * b[i];
    }
    asm volatile("" : : "g"(a) : "memory");
    asm volatile("" : : "g"(b) : "memory");
  }
  auto point_end = std::chrono::high_resolution_clock::now();
  double time = 1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(
                             point_end - point_begin)
                             .count();

  std::cout << "Bandwidth, sum += a * b        : "
            << (2 * n * sizeof(double) * nb_times) /
                   (time * 1024 * 1024 * 1024)
            << " Gb/s" << std::endl;

  point_begin = std::chrono::high_resolution_clock::now();
  for (std::ptrdiff_t k = 0; k < nb_times; ++k) {
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
      a[i] = 0.0;
    }
    asm volatile("" : : "g"(a) : "memory");
  }
  point_end = std::chrono::high_resolution_clock::now();
  time = 1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(
                             point_end - point_begin)
                             .count();

  std::cout << "Bandwidth, a = 0.0                : "
            << (1 * n * sizeof(double) * nb_times) /
                   (time * 1024 * 1024 * 1024)
            << " Gb/s" << std::endl;

  point_begin = std::chrono::high_resolution_clock::now();
  for (std::ptrdiff_t k = 0; k < nb_times; ++k) {
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
      a[i] = 2 * a[i];
    }
    asm volatile("" : : "g"(a) : "memory");
  }
  point_end = std::chrono::high_resolution_clock::now();
  time = 1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(
                             point_end - point_begin)
                             .count();

  std::cout << "Bandwidth, a = 2 * a           : "
            << (2 * n * sizeof(double) * nb_times) /
                   (time * 1024 * 1024 * 1024)
            << " Gb/s" << std::endl;

  point_begin = std::chrono::high_resolution_clock::now();
  for (std::ptrdiff_t k = 0; k < nb_times; ++k) {
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
      a[i] = b[i];
    }
    asm volatile("" : : "g"(a) : "memory");
    asm volatile("" : : "g"(b) : "memory");
  }
  point_end = std::chrono::high_resolution_clock::now();
  time = 1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(
                             point_end - point_begin)
                             .count();

  std::cout << "Bandwidth, a = b               : "
            << (2 * n * sizeof(double) * nb_times) /
                   (time * 1024 * 1024 * 1024)
            << " Gb/s" << std::endl;

  point_begin = std::chrono::high_resolution_clock::now();
  for (std::ptrdiff_t k = 0; k < nb_times; ++k) {
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
      a[i] = 2 * b[i];
    }
    asm volatile("" : : "g"(a) : "memory");
    asm volatile("" : : "g"(b) : "memory");
  }
  point_end = std::chrono::high_resolution_clock::now();
  time = 1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(
                             point_end - point_begin)
                             .count();

  std::cout << "Bandwidth, a = 2 * b           : "
            << (2 * n * sizeof(double) * nb_times) /
                   (time * 1024 * 1024 * 1024)
            << " Gb/s" << std::endl;

  point_begin = std::chrono::high_resolution_clock::now();
  for (std::ptrdiff_t k = 0; k < nb_times; ++k) {
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
      a[i] += 2 * b[i];
    }
    asm volatile("" : : "g"(a) : "memory");
    asm volatile("" : : "g"(b) : "memory");
  }
  point_end = std::chrono::high_resolution_clock::now();
  time = 1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(
                             point_end - point_begin)
                             .count();

  std::cout << "Bandwidth, a += 2 * b          : "
            << (2 * n * sizeof(double) * nb_times) /
                   (time * 1024 * 1024 * 1024)
            << " Gb/s" << std::endl;

  point_begin = std::chrono::high_resolution_clock::now();
  for (std::ptrdiff_t k = 0; k < nb_times; ++k) {
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
      a[i] = 2 * b[i] + 3 * c[i];
    }
    asm volatile("" : : "g"(a) : "memory");
    asm volatile("" : : "g"(b) : "memory");
    asm volatile("" : : "g"(c) : "memory");
  }
  point_end = std::chrono::high_resolution_clock::now();
  time = 1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(
                             point_end - point_begin)
                             .count();

  std::cout << "Bandwidth, a = 2 * b + 3 * c: "
            << (3 * n * sizeof(double) * nb_times) /
                   (time * 1024 * 1024 * 1024)
            << " Gb/s" << std::endl;

  point_begin = std::chrono::high_resolution_clock::now();
  for (std::ptrdiff_t k = 0; k < nb_times; ++k) {
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
      a[i] = b[i] + 3 * c[i];
    }
    asm volatile("" : : "g"(a) : "memory");
    asm volatile("" : : "g"(b) : "memory");
    asm volatile("" : : "g"(c) : "memory");
  }
  point_end = std::chrono::high_resolution_clock::now();
  time = 1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(
                             point_end - point_begin)
                             .count();

  std::cout << "Bandwidth, a = b + 3 * c    : "
            << (3 * n * sizeof(double) * nb_times) /
                   (time * 1024 * 1024 * 1024)
            << " Gb/s" << std::endl;
  std::cout << "Check: " << sum << std::endl;

  delete[] c;
  delete[] b;
  delete[] a;

  return 0;
}

Black Belt

You might need "omp parallel for simd" to approach rated speed. The speed of setting to 0 depends particularly on #pragma vector nontemporal if the compiler options for streaming stores don't have the desired effect. -qopt-report=4 ought to show the important information. You may also gain something from tuning unrolling or data alignment.

Black Belt

(1) The compilers generally do a very good job with reductions when they are in an OpenMP for loop with a reduction clause, but there are times when the compiler does not generate enough independent accumulators and memory concurrency is inhibited.   I have not tried SW prefetching for these cases, but it might work.  The optimum coding depends on the number of cores, the number of threads, and the number of DRAM ranks installed in each channel.

(2) Your second kernel reports 103 GB/s because you are only counting the store.  (This is consistent with how the STREAM benchmark counts bytes.)  BUT, the default processing of stores requires that the data be loaded into the cache before being updated.   If this is happening here, then your kernel is actually loading 103 GB/s and storing 103 GB/s -- consistent with the 207 GB/s best case bandwidth.    You need to convince the compiler to use streaming (non-temporal) stores to avoid these extra loads.   This can be done at a coarse level with the compiler option "-qopt-streaming-stores always" or at the loop level with "#pragma vector nontemporal" before the loop.    It is usually a good idea to check the assembly code to see how the stores are performed -- "movntpd" instructions are streaming stores, while "movupd" instructions are ordinary stores.

(3) This case seems slow -- it should do better.  An inspection of the assembly code may point to unexpected compiler choices.

(4) The reported performance is consistent with 2 reads (one for the store miss) and one writeback.  You should check to see if the loop is compiled natively or is replaced by a call to an "optimized" memcpy routine.   These replacements can be inhibited with the "-nolib-inline" compiler option.

(5) This should run at the same speed as (4).  An inspection of the assembly code may point to unexpected compiler choices.

(6) Your scaling on this one is incorrect.  The code specifies that both a[] and b[] are to be read, and a[] should be written.  With this correction, the performance increases to 151 GB/s -- still a bit slow, but not a disaster.

(7-8) Both these kernels should also include reads for the store misses (without forcing streaming stores), so the raw DRAM bandwidth is ~153 GB/s in each case. 

My experience on SKX processors is that these sorts of simple vector kernels run best with AVX512 encoding, even though that forces lower core frequencies.  With a compiler target of "-xCORE-AVX512", the compiler often chooses 256-bit vectorization in order to get higher frequencies.  Switching to "-xCOMMON-AVX512" will force 512-bit vectorization whenever it is possible.  With the latest compilers, you can add "-qopt-zmm-usage=high" to force CORE-AVX512 to use 512-bit registers whenever possible.

Beginner

Hi John,

Thanks for your reply. I have corrected the bug for the benchmark (a += 2 * b), added a new benchmark and compiled the code with the following command line:

icpc -g -std=c++11 -O3 -xCORE-AVX512 -qopt-zmm-usage=high -qopenmp -qopt-streaming-stores always -nolib-inline -DNDEBUG main.cpp -o main

The performance now reads

1/Bandwidth, sum += a * b        : 177.501 Gb/s
2/Bandwidth, sum += a               : 192.005 Gb/s
3/Bandwidth, a = 0.0                : 128.591 Gb/s
4/Bandwidth, a = 2 * a           : 153.44 Gb/s
5/Bandwidth, a = b               : 151.496 Gb/s
6/Bandwidth, a = 2 * b           : 151.725 Gb/s
7/Bandwidth, a += 2 * b          : 159.564 Gb/s
8/Bandwidth, a = 2 * b + 3 * c: 158.666 Gb/s
9/Bandwidth, a = b + 3 * c    : 158.495 Gb/s

I have found that using AVX512 (through -qopt-zmm-usage=high) makes benchmarks 1-2 slower and 4-9 faster, the difference being just below 5%. My feeling is that there is clearly a "reading" bandwidth (around 192 GB/s) and a "writing" bandwidth with streaming stores (around 130 GB/s). When there is a mix of reads and writes, we get something in between. Would you expect better performance?

Without streaming stores and the following options

icpc -g -std=c++11 -O3 -xCORE-AVX512 -qopt-zmm-usage=high -qopenmp -nolib-inline -DNDEBUG main.cpp -o main

performance reads

1/Bandwidth, sum += a * b        : 175.892 Gb/s
2/Bandwidth, sum += a               : 189.298 Gb/s
3/Bandwidth, a = 0.0                : 63.2957 Gb/s
4/Bandwidth, a = 2 * a           : 125.891 Gb/s
5/Bandwidth, a = b               : 101.861 Gb/s
6/Bandwidth, a = 2 * b           : 101.797 Gb/s
7/Bandwidth, a += 2 * b          : 150.146 Gb/s
8/Bandwidth, a = 2 * b + 3 * c: 113.561 Gb/s
9/Bandwidth, a = b + 3 * c    : 113.284 Gb/s

I still have one question. It seems that streaming stores involve two mechanisms:
- There is no read for ownership
- The data is written directly "from the register to the RAM" without polluting the cache

Is that correct?

Black Belt

These results look more consistent, but they are not as high as one might like....

I have not run on the Xeon Gold 6140, but I did have a set of Xeon Gold 6142 processors for a week or two.  (16 core, 2.6 GHz).   Results with the Intel Memory Latency Checker showed 2:1 Read:Write performance was about 195 GB/s.  This should be similar to your case 7 (read a, read b, writeback a), but shows 30% higher bandwidth than your results.   I ran the single-socket case with 2:1 Read:Write using all six supported modes ([128-bit, 256-bit, 512-bit] x [1 thread/core, 2 threads/core]), and the variation was quite small -- 96.9 GB/s to 98.4 GB/s.

Similarly, the Intel Memory Latency Checker 1:1 Read:Write performance was about 185 GB/s -- about 45% faster than your case 4 (read a, writeback a).   Again, the single-socket performance was similar across the six modes -- 91.6 GB/s to 93.8 GB/s.

So I would definitely want to check the assembly code for these kernels to make sure that the compiler is not getting confused about aliasing or alignment or something.....

Streaming stores are implemented by transferring the data to a "write-combining buffer", which is eventually written directly to memory.  Best performance is obtained when all 64 Bytes of the (64-Byte-aligned) target are written as close together as possible -- e.g., 1 512-bit store or two consecutive 256-bit stores, or 4 consecutive 128-bit stores.  At some point during the process, the cache line is invalidated in all caches in the system (similar to what happens with an IO DMA write to memory).  Streaming stores are not ordered with respect to "ordinary" stores, so when the compiler generates streaming stores, it includes a memory fence when needed to establish ordering. (For example, it is necessary to ensure that the results of all streaming stores are globally visible before the (ordinary) store instruction that is used to enter the barrier at the end of an OpenMP parallel for loop.)

Beginner

My results with MLC v3.4 are not as good as yours. I have 12 DIMMs of DDR4 memory at 2666 MHz for a total of 96 GB. Is there anything I should check in the BIOS or the kernel? I am running Ubuntu 16.04 with Linux 4.4.0.

root@grisbouille:/opt/mlc-3.4/Linux# modprobe msr
root@grisbouille:/opt/mlc-3.4/Linux# ./mlc_avx512 
Intel(R) Memory Latency Checker - v3.4
Measuring idle latencies (in ns)...
		Numa node
Numa node	     0	     1	
       0	  77.3	 131.2	
       1	 130.4	  76.3	

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :	207744.7	
3:1 Reads-Writes :	168331.7	
2:1 Reads-Writes :	164914.5	
1:1 Reads-Writes :	137235.9	
Stream-triad like:	148346.1	

Black Belt

Here is the top of the default output with MLC 3.4 on a Xeon Gold 6142 node. 

$ ./mlc            # note that this is not the AVX512 version!
Intel(R) Memory Latency Checker - v3.4
Measuring idle latencies (in ns)...
                Numa node
Numa node            0       1  
       0          78.8   130.1  
       1         130.6    79.1  

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      222316.3        
3:1 Reads-Writes :      198074.0        
2:1 Reads-Writes :      195759.6        
1:1 Reads-Writes :      185871.2        
Stream-triad like:      173192.3

My system was configured with 12 16GiB DIMMs (one per channel) for a total of 192 GiB.  The capacity will not make any difference to the performance, but my 16 GiB DIMMs were definitely dual-rank.  If your system is configured with single-rank DIMMs, the reduced performance you are seeing makes sense -- there are too many cores and not enough DRAM banks.   By default, Intel MLC will use all logical processors for the tests (unless you add the "-X" flag), so your "1:1 Reads-Writes" case will be generating 36 read streams plus 36 write streams in each socket, and trying to schedule those accesses into 16 DRAM banks (in the case of one single-rank DIMM per channel).   This results in closing and re-opening each DRAM page many times in order to read all the data, and this causes lots of DRAM stalls.  

On my system, the same case would have 32 read streams plus 32 write streams (16 cores/socket instead of 18 cores/socket) and has to schedule those into 32 DRAM banks.  This works pretty well because the memory controller will buffer the writes and give priority to reads ("read-major-mode"), so there will be 32 read streams mapped into 32 DRAM banks, then the write buffers in the memory controller will get full and the memory controller will switch to "write-major-mode", in which it will map 32 write streams onto 32 DRAM banks.

So with MLC, you might try the "-X" flag to reduce the number of threads in use.  With your code it looks like you are already running only one thread per physical core, but you might get slightly better throughput by reducing the number of threads slightly.

Beginner

Hi,

I did not know that there was such a thing as dual-rank DIMMs. I bought my workstation through Dell, and they don't provide this information. They go as far as selling Xeon Skylake with 4 DIMMs per socket!

Thanks for your explanation. I get slightly better results with 12 cores.

root@grisbouille:/opt/mlc-3.4/Linux# ./mlc_avx512 --max_bandwidth -X 12
Intel(R) Memory Latency Checker - v3.4
Command line parameters: --max_bandwidth -X 12 

Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes

Measuring Maximum Memory Bandwidths for the system
Will take several minutes to complete as multiple injection rates will be tried to get the best bandwidth
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using only one thread from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :	224083.02
3:1 Reads-Writes :	171804.81
2:1 Reads-Writes :	165975.91
1:1 Reads-Writes :	137067.60
Stream-triad like:	171012.75

Could you please tell us what we should look for when buying RAM for the best bandwidth? So far, I know that:

- Put at least the same number of DIMMs per socket as the number of channels

- Use the highest frequency

Now, we should go for multi-rank DIMMs. How many ranks would you recommend?

Anything else?

Black Belt

In most cases, the best performance is obtained with one dual-rank DIMM installed in each channel.  This provides the extra DRAM banks without decreasing the DRAM channel frequency.  (Quad-rank DIMMs are available, but they are very expensive and only make sense when you want a humongous amount of memory in the system and don't care that it will run slower.)

Current DDR4 DRAM chips are manufactured in 4Gbit and 8Gbit sizes, and are configured to provide either 4 bits of data per chip or 8 bits of data per chip.  Since the bus is 64 bits wide, a "rank" requires 16 DRAM chips if using the 4-bit-wide parts or 8 DRAM chips if using the 8-bit-wide parts.

This gives four possible configurations, each of which can be configured with either one or two ranks per DIMM:

  • 4 Gbit chips using 4-bit-wide parts = 16 chips = 8 GiB single-rank, 16 GiB dual-rank
  • 4 Gbit chips using 8-bit-wide parts = 8 chips = 4 GiB single-rank, 8 GiB dual-rank
  • 8 Gbit chips using 4-bit-wide parts = 16 chips = 16 GiB single-rank, 32 GiB dual-rank
  • 8 Gbit chips using 8-bit-wide parts = 8 chips = 8 GiB single-rank, 16 GiB dual-rank

Right now there is almost no difference in price per bit between the 4 Gbit and 8 Gbit DRAMs, and both are available at speeds up to 2666 MHz.

Note that in the 8 configurations listed above, there are 3 8 GiB configurations and 3 16 GiB configurations.  Vendors typically support only one configuration of each size, and they typically make the choice based on hypothetical arguments about error-correction capability, rather than actual data about performance.   So it is very easy to get stuck with 8 GiB single-rank DIMMs (or even 16 GiB single-rank DIMMs) if you don't demand the dual-rank versions.  

With increasing DRAM price/bit, increasing DRAM size/chip, and increasing DRAM channels per socket, it is no longer practical to double the capacity of the DIMMs to get two ranks.  (The processors support all of the configurations above -- the problem is getting the system provider to sell them to you.)
