STREAM benchmark test results for several Intel architectures

SergeyKostrov · ‎04-27-2017

*** STREAM benchmark test results for several Intel architectures *** This is a thread for test results obtained with the STREAM benchmark on several Intel architectures. Results for Ivy Bridge are posted first. Results for KNL will be posted some time later.

SergeyKostrov · ‎04-27-2017

[ MinGW 6.1.0 - Test 1.1 - double - OpenMP - No  ]
[ Windows 7 SP1 ]

 [ Compiler command line ]
 g++.exe stream.cpp -O3 -g0 -DNDEBUG -DSTREAM_TYPE=double -o stream.exe

 [ STREAM results ]
 RDTSC instruction latency:
        Clock cycles  (cc):              28.0000000000
        Nano seconds  (ns):               9.8967906122
        Micro seconds (mu):               0.0098967906
        Milli seconds (ms):               0.0000098968
 --------------------------------------------------------------
 STREAM version $Revision: 5.10 $
 --------------------------------------------------------------
 This system uses 8 bytes per array element.
 --------------------------------------------------------------
 Array size = 67108864 (elements), Offset = 0 (elements)
 Memory per array = 512.0 MiB (= 0.5 GiB).
 Total memory required = 1536.0 MiB (= 1.5 GiB).
 Each kernel will be executed 128 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
 --------------------------------------------------------------
 Your clock granularity/precision appears to be 1 microseconds.
 Each test below will take on the order of 60301 microseconds.
    (= 60301 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 --------------------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 --------------------------------------------------------------
 Function    Best Rate MB/s  Avg time     Min time     Max time
 Copy:           13250.4     0.081662     0.081035     0.082238
 Scale:          11448.2     0.093978     0.093792     0.094213
 Add:            12170.3     0.132826     0.132339     0.133739
 Triad:          12114.2     0.133324     0.132952     0.134030
 --------------------------------------------------------------
 Solution Validates: avg error less than 1.000000e-013 on all three arrays
 --------------------------------------------------------------
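For readers checking the table, the "Best Rate MB/s" column follows from the array size and the "Min time" column. The sketch below reproduces that arithmetic as STREAM 5.10 does it: bytes moved per kernel divided by the best time, with 1 MB = 10^6 bytes. The enum and function names are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Arrays touched per element by each STREAM kernel (reads plus writes):
// Copy c=a (2), Scale b=s*c (2), Add c=a+b (3), Triad a=b+s*c (3).
enum Kernel { Copy = 0, Scale = 1, Add = 2, Triad = 3 };

// Hypothetical helper: bandwidth in MB/s (1 MB = 10^6 bytes) from the best time.
double stream_rate_mb_per_s(Kernel kernel, std::size_t elements,
                            std::size_t bytes_per_element, double min_seconds) {
    static const double arrays_touched[4] = {2.0, 2.0, 3.0, 3.0};
    assert(min_seconds > 0.0);
    const double bytes = arrays_touched[kernel] *
                         static_cast<double>(bytes_per_element) *
                         static_cast<double>(elements);
    return 1.0e-6 * bytes / min_seconds;
}
```

With the Test 1.1 values, `stream_rate_mb_per_s(Copy, 67108864, 8, 0.081035)` gives roughly 13250 MB/s. The small difference from the printed 13250.4 comes from the min time being rounded to six decimals in the output.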

SergeyKostrov · ‎04-27-2017

[ MinGW 6.1.0 - Test 1.2 - double - OpenMP - Yes ]
[ Windows 7 SP1 ]

 [ Compiler command line ]
 g++.exe stream.cpp -O3 -g0 -fopenmp -DNDEBUG -DSTREAM_TYPE=double -o stream.exe

 [ STREAM results ]
 RDTSC instruction latency:
        Clock cycles  (cc):              28.0000000000
        Nano seconds  (ns):               9.8967906122
        Micro seconds (mu):               0.0098967906
        Milli seconds (ms):               0.0000098968
 --------------------------------------------------------------
 STREAM version $Revision: 5.10 $
 --------------------------------------------------------------
 This system uses 8 bytes per array element.
 --------------------------------------------------------------
 Array size = 67108864 (elements), Offset = 0 (elements)
 Memory per array = 512.0 MiB (= 0.5 GiB).
 Total memory required = 1536.0 MiB (= 1.5 GiB).
 Each kernel will be executed 128 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
 --------------------------------------------------------------
 Number of Threads requested = 4
 Number of Threads counted = 4
 --------------------------------------------------------------
 Your clock granularity/precision appears to be 1 microseconds.
 Each test below will take on the order of 62805 microseconds.
    (= 62805 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 --------------------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 --------------------------------------------------------------
 Function    Best Rate MB/s  Avg time     Min time     Max time
 Copy:           17726.8     0.061183     0.060572     0.065787
 Scale:          11962.6     0.090069     0.089758     0.091715
 Add:            13303.1     0.121562     0.121070     0.123730
 Triad:          13300.0     0.121484     0.121098     0.124494
 --------------------------------------------------------------
 Solution Validates: avg error less than 1.000000e-013 on all three arrays
 --------------------------------------------------------------

SergeyKostrov · ‎04-27-2017

[ MinGW 6.1.0 - Test 2.1 - float  - OpenMP - No  ]
[ Windows 7 SP1 ]

 [ Compiler command line ]
 g++.exe stream.cpp -O3 -g0 -DNDEBUG -DSTREAM_TYPE=float -o stream.exe

 [ STREAM results ]
 RDTSC instruction latency:
        Clock cycles  (cc):              28.0000000000
        Nano seconds  (ns):               9.8967906122
        Micro seconds (mu):               0.0098967906
        Milli seconds (ms):               0.0000098968
 --------------------------------------------------------------
 STREAM version $Revision: 5.10 $
 --------------------------------------------------------------
 This system uses 4 bytes per array element.
 --------------------------------------------------------------
 Array size = 67108864 (elements), Offset = 0 (elements)
 Memory per array = 256.0 MiB (= 0.3 GiB).
 Total memory required = 768.0 MiB (= 0.8 GiB).
 Each kernel will be executed 128 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
 --------------------------------------------------------------
 Your clock granularity/precision appears to be 1 microseconds.
 Each test below will take on the order of 30286 microseconds.
    (= 30286 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 --------------------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 --------------------------------------------------------------
 Function    Best Rate MB/s  Avg time     Min time     Max time
 Copy:           13244.9     0.040587     0.040534     0.041109
 Scale:          11472.4     0.046841     0.046797     0.047153
 Add:            12233.6     0.066042     0.065827     0.066368
 Triad:          12165.9     0.066255     0.066194     0.066631
 --------------------------------------------------------------
 Solution Validates: avg error less than 1.000000e-006 on all three arrays
 --------------------------------------------------------------

SergeyKostrov · ‎04-27-2017

[ MinGW 6.1.0 - Test 2.2 - float  - OpenMP - Yes ]
[ Windows 7 SP1 ]

 [ Compiler command line ]
 g++.exe stream.cpp -O3 -g0 -fopenmp -DNDEBUG -DSTREAM_TYPE=float -o stream.exe

 [ STREAM results ]
 RDTSC instruction latency:
        Clock cycles  (cc):              28.0000000000
        Nano seconds  (ns):               9.8967906122
        Micro seconds (mu):               0.0098967906
        Milli seconds (ms):               0.0000098968
 --------------------------------------------------------------
 STREAM version $Revision: 5.10 $
 --------------------------------------------------------------
 This system uses 4 bytes per array element.
 --------------------------------------------------------------
 Array size = 67108864 (elements), Offset = 0 (elements)
 Memory per array = 256.0 MiB (= 0.3 GiB).
 Total memory required = 768.0 MiB (= 0.8 GiB).
 Each kernel will be executed 128 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
 --------------------------------------------------------------
 Number of Threads requested = 4
 Number of Threads counted = 4
 --------------------------------------------------------------
 Your clock granularity/precision appears to be 1 microseconds.
 Each test below will take on the order of 31992 microseconds.
    (= 31992 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 --------------------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 --------------------------------------------------------------
 Function    Best Rate MB/s  Avg time     Min time     Max time
 Copy:           17685.1     0.030554     0.030357     0.032452
 Scale:          11986.9     0.045060     0.044788     0.045952
 Add:            13265.4     0.060855     0.060707     0.062708
 Triad:          13274.2     0.060855     0.060667     0.062189
 --------------------------------------------------------------
 Solution Validates: avg error less than 1.000000e-006 on all three arrays
 --------------------------------------------------------------

SergeyKostrov · ‎04-27-2017

[ MinGW 6.1.0 - Test 3.1 - char   - OpenMP - No  ]
[ Windows 7 SP1 ]

 [ Compiler command line ]
 g++.exe stream.cpp -O3 -g0 -DNDEBUG -DSTREAM_TYPE=char -o stream.exe

 [ STREAM results ]
 RDTSC instruction latency:
        Clock cycles  (cc):              28.0000000000
        Nano seconds  (ns):               9.8967906122
        Micro seconds (mu):               0.0098967906
        Milli seconds (ms):               0.0000098968
 --------------------------------------------------------------
 STREAM version $Revision: 5.10 $
 --------------------------------------------------------------
 This system uses 1 bytes per array element.
 --------------------------------------------------------------
 Array size = 67108864 (elements), Offset = 0 (elements)
 Memory per array = 64.0 MiB (= 0.1 GiB).
 Total memory required = 192.0 MiB (= 0.2 GiB).
 Each kernel will be executed 128 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
 --------------------------------------------------------------
 Your clock granularity/precision appears to be 1 microseconds.
 Each test below will take on the order of 63811 microseconds.
    (= 63811 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 --------------------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 --------------------------------------------------------------
 Function    Best Rate MB/s  Avg time     Min time     Max time
 Copy:           13033.6     0.010318     0.010298     0.010604
 Scale:          11728.3     0.011465     0.011444     0.011529
 Add:            12142.5     0.016616     0.016580     0.016676
 Triad:          12336.7     0.016423     0.016319     0.016514
 --------------------------------------------------------------
 WEIRD: sizeof(STREAM_TYPE) = 1
 Solution Validates: avg error less than 1.000000e-006 on all three arrays
 --------------------------------------------------------------

SergeyKostrov · ‎04-27-2017

[ MinGW 6.1.0 - Test 3.2 - char   - OpenMP - Yes ]
[ Windows 7 SP1 ]

 [ Compiler command line ]
 g++.exe stream.cpp -O3 -g0 -fopenmp -DNDEBUG -DSTREAM_TYPE=char -o stream.exe

 [ STREAM results ]
 RDTSC instruction latency:
        Clock cycles  (cc):              28.0000000000
        Nano seconds  (ns):               9.8967906122
        Micro seconds (mu):               0.0098967906
        Milli seconds (ms):               0.0000098968
 --------------------------------------------------------------
 STREAM version $Revision: 5.10 $
 --------------------------------------------------------------
 This system uses 1 bytes per array element.
 --------------------------------------------------------------
 Array size = 67108864 (elements), Offset = 0 (elements)
 Memory per array = 64.0 MiB (= 0.1 GiB).
 Total memory required = 192.0 MiB (= 0.2 GiB).
 Each kernel will be executed 128 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
 --------------------------------------------------------------
 Number of Threads requested = 4
 Number of Threads counted = 4
 --------------------------------------------------------------
 Your clock granularity/precision appears to be 1 microseconds.
 Each test below will take on the order of 29324 microseconds.
    (= 29324 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 --------------------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 --------------------------------------------------------------
 Function    Best Rate MB/s  Avg time     Min time     Max time
 Copy:           17344.1     0.007774     0.007739     0.008475
 Scale:          12136.4     0.011088     0.011059     0.011328
 Add:            13211.2     0.015279     0.015239     0.015913
 Triad:          13253.1     0.015220     0.015191     0.015723
 --------------------------------------------------------------
 WEIRD: sizeof(STREAM_TYPE) = 1
 Solution Validates: avg error less than 1.000000e-006 on all three arrays
 --------------------------------------------------------------

SergeyKostrov · ‎04-27-2017

Verifications with another benchmark test application...

SergeyKostrov · ‎04-27-2017

[ Test 1.1 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 16 ME / Iterations = 128 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 4.295 GB - Completed in 0.265 sec - Bandwidth: 16.207 GB/s

[ Test 1.2 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 16 ME / Iterations = 256 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 8.590 GB - Completed in 0.515 sec - Bandwidth: 16.679 GB/s

[ Test 1.3 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 16 ME / Iterations = 512 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 17.180 GB - Completed in 1.061 sec - Bandwidth: 16.192 GB/s
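The "Copied" and "Bandwidth" figures of this second application can be cross-checked with simple arithmetic. The sketch below assumes that "ME" means 2^20 elements and that the tool counts one read plus one write per element. Both assumptions are inferred from the printed numbers, not from the tool's source, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: size is given in units of 2^20 elements, and each copied
// element counts twice (one read, one write).
std::uint64_t copied_bytes(std::uint64_t mega_elements,
                           std::uint64_t bytes_per_element,
                           std::uint64_t iterations) {
    const std::uint64_t elements = mega_elements << 20;
    return 2 * elements * bytes_per_element * iterations;
}

// Bandwidth in GB/s with 1 GB = 10^9 bytes.
double bandwidth_gb_per_s(std::uint64_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / seconds * 1.0e-9;
}
```

For Test 1.1 (char, 16 ME, 128 iterations) this gives 4294967296 bytes, which matches the printed 4.295 GB, and about 16.2 GB/s over 0.265 sec.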

SergeyKostrov · ‎04-27-2017

[ Test 2.1 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 32 ME / Iterations = 128 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 8.590 GB - Completed in 0.515 sec - Bandwidth: 16.679 GB/s

[ Test 2.2 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 32 ME / Iterations = 256 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 17.180 GB - Completed in 1.045 sec - Bandwidth: 16.440 GB/s

[ Test 2.3 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 32 ME / Iterations = 512 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 34.360 GB - Completed in 2.106 sec - Bandwidth: 16.315 GB/s

SergeyKostrov · ‎04-27-2017

[ Test 3.1 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 64 ME / Iterations = 128 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 17.180 GB - Completed in 1.014 sec - Bandwidth: 16.943 GB/s

[ Test 3.2 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 64 ME / Iterations = 256 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 34.360 GB - Completed in 2.059 sec - Bandwidth: 16.688 GB/s

[ Test 3.3 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 64 ME / Iterations = 512 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 68.719 GB - Completed in 4.212 sec - Bandwidth: 16.315 GB/s

SergeyKostrov · ‎04-27-2017

[ Test 1.1 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 16 ME / Iterations = 128 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 17.180 GB - Completed in 1.014 sec - Bandwidth: 16.943 GB/s

[ Test 1.2 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 16 ME / Iterations = 256 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 34.360 GB - Completed in 2.059 sec - Bandwidth: 16.688 GB/s

[ Test 1.3 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 16 ME / Iterations = 512 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 68.719 GB - Completed in 4.196 sec - Bandwidth: 16.377 GB/s

SergeyKostrov · ‎04-27-2017

[ Test 2.1 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 32 ME / Iterations = 128 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 34.360 GB - Completed in 2.074 sec - Bandwidth: 16.567 GB/s

[ Test 2.2 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 32 ME / Iterations = 256 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 68.719 GB - Completed in 4.056 sec - Bandwidth: 16.943 GB/s

[ Test 2.3 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 32 ME / Iterations = 512 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 137.439 GB - Completed in 8.143 sec - Bandwidth: 16.878 GB/s

SergeyKostrov · ‎04-27-2017

[ Test 3.1 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 64 ME / Iterations = 128 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 68.719 GB - Completed in 4.087 sec - Bandwidth: 16.814 GB/s

[ Test 3.2 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 64 ME / Iterations = 256 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 137.439 GB - Completed in 8.409 sec - Bandwidth: 16.344 GB/s

[ Test 3.3 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 64 ME / Iterations = 512 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 274.878 GB - Completed in 16.754 sec - Bandwidth: 16.407 GB/s

SergeyKostrov · ‎04-27-2017

[ Test 1.1 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 16 ME / Iterations = 128 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 34.360 GB - Completed in 2.090 sec - Bandwidth: 16.440 GB/s

[ Test 1.2 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 16 ME / Iterations = 256 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 68.719 GB - Completed in 4.258 sec - Bandwidth: 16.139 GB/s

[ Test 1.3 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 16 ME / Iterations = 512 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 137.439 GB - Completed in 8.268 sec - Bandwidth: 16.623 GB/s

SergeyKostrov · ‎04-27-2017

[ Test 2.1 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 32 ME / Iterations = 128 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 68.719 GB - Completed in 4.196 sec - Bandwidth: 16.377 GB/s

[ Test 2.2 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 32 ME / Iterations = 256 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 137.439 GB - Completed in 8.393 sec - Bandwidth: 16.375 GB/s

[ Test 2.3 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 32 ME / Iterations = 512 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 274.878 GB - Completed in 16.645 sec - Bandwidth: 16.514 GB/s

SergeyKostrov · ‎04-27-2017

[ Test 3.1 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 64 ME / Iterations = 128 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 137.439 GB - Completed in 8.393 sec - Bandwidth: 16.375 GB/s

[ Test 3.2 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 64 ME / Iterations = 256 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 274.878 GB - Completed in 16.770 sec - Bandwidth: 16.391 GB/s

[ Test 3.3 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 64 ME / Iterations = 512 ]

 Initializing...
 Starting BW Test on 4 threads
 Copied: 549.756 GB - Completed in 34.008 sec - Bandwidth: 16.165 GB/s

SergeyKostrov · ‎04-27-2017

*** Ivy Bridge Intel architecture Test results ***

 ** Dell Precision Mobile M4700 **

  Intel Core i7-3840QM ( 2.80 GHz )
  Ivy Bridge / 4 cores / 8 logical CPUs / http://ark.intel.com/products/70846
  Size of L3 Cache =   8MB ( shared between all cores for data & instructions )
  Size of L2 Cache =   1MB ( 256KB per core / shared for data & instructions )
  Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
  32GB RAM
  320GB HDD
  Windows 7 Professional 64-bit SP1
  Display resolution: 1366 x 768
  NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
  NVIDIA Driver version: 378.66
  OpenCL version - 2.0.4.0
  Vulkan version - 1.0.39.1

Test results were obtained with a variant of the STREAM benchmark: the timing function 'double mysecond( void )' was modified to use the RDTSC instruction instead of the gettimeofday CRT function.
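The modified source is not shown in this thread, so the following is only a minimal sketch of what an RDTSC-based 'mysecond' could look like, not the code actually used for these results. It assumes GCC/MinGW or Clang on x86, where `__rdtsc()` comes from `<x86intrin.h>`. The helper names `read_ticks` and `calibrate_ticks_per_second` are hypothetical, and calibrating against `std::chrono::steady_clock` is an assumption about how ticks might be converted to seconds.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // __rdtsc() with GCC/MinGW and Clang
static inline std::uint64_t read_ticks() { return __rdtsc(); }
#else
// Fallback so the sketch still builds on non-x86 targets: nanosecond ticks.
static inline std::uint64_t read_ticks() {
    using namespace std::chrono;
    return static_cast<std::uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
}
#endif

// Hypothetical helper: estimate ticks per second by spinning for about 50 ms
// against std::chrono::steady_clock.
static double calibrate_ticks_per_second() {
    using clock = std::chrono::steady_clock;
    const clock::time_point t0 = clock::now();
    const std::uint64_t c0 = read_ticks();
    while (clock::now() - t0 < std::chrono::milliseconds(50)) {
        // busy-wait; the loop body is intentionally empty
    }
    const std::uint64_t c1 = read_ticks();
    const double seconds = std::chrono::duration<double>(clock::now() - t0).count();
    assert(seconds > 0.0 && c1 > c0);
    return static_cast<double>(c1 - c0) / seconds;
}

// Same shape as STREAM's timer: elapsed seconds returned as a double.
double mysecond() {
    static const double ticks_per_second = calibrate_ticks_per_second();
    static const std::uint64_t base = read_ticks();  // keeps values small for double precision
    return static_cast<double>(read_ticks() - base) / ticks_per_second;
}
```

The first call pays the calibration cost, so it would normally be made once before the timed kernels. The sketch also relies on the counter ticking at a constant rate, which holds on processors with an invariant TSC.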

SergeyKostrov · ‎04-27-2017

McCalpin, John wrote:
>>Wed, 04/26/2017 - 10:56
>>
>>The STREAM benchmark has always been intended to provide an indication of performance of simple, vectorizable code when
>>run through real compilers. It is intended to be relatively easy to optimize, but performance differences between STREAM and
>>"best case, hand-optimized" core are a feature, not a bug!
>>
>>The performance numbers of 25 GB/s peak, 13 GB/s vectorized OpenMP, and 4 GB/s non-vector, single thread are a little
>>unusual -- what sort of platform did you measure those on?

This is because the 4 GB/s non-vector, single-thread results were obtained with the Open Watcom C++ compiler v2.0, which supports neither vectorization nor OpenMP. Results from the modern MinGW C++ compiler v6.1.0 clearly demonstrate the benefits of vectorization and parallelization.

SergeyKostrov · ‎04-27-2017

John, you may have noticed this piece of information:

>>RDTSC instruction latency:
>>        Clock cycles  (cc):              28.0000000000
>>        Nano seconds  (ns):               9.8967906122
>>        Micro seconds (mu):               0.0098967906
>>        Milli seconds (ms):               0.0000098968

I modified the STREAM benchmark C code to use the RDTSC instruction instead of the C runtime function gettimeofday. Note that although gettimeofday is supported on most UNIX-like operating systems, some C++ compilers for Windows platforms do not support it.
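The cycle and nanosecond figures in the latency block are tied together by the counter frequency. As a small worked check, with hypothetical function names, the sketch below derives the implied frequency from the two printed values and converts cycles back to nanoseconds.

```cpp
#include <cassert>

// Implied counter frequency in GHz: cycles divided by nanoseconds.
double implied_ghz(double cycles, double nanoseconds) {
    assert(nanoseconds > 0.0);
    return cycles / nanoseconds;
}

// Convert a cycle count to nanoseconds at a given frequency in GHz.
double cycles_to_ns(double cycles, double ghz) {
    assert(ghz > 0.0);
    return cycles / ghz;
}
```

For 28 cycles and 9.8967906122 ns the implied frequency is roughly 2.83 GHz, close to the 2.80 GHz nominal clock of the i7-3840QM.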

SergeyKostrov · ‎04-27-2017

...
 Function    Best Rate MB/s  Avg time     Min time     Max time
 Copy:           17726.8     0.061183     0.060572     0.065787
 Scale:          11962.6     0.090069     0.089758     0.091715
 Add:            13303.1     0.121562     0.121070     0.123730
 Triad:          13300.0     0.121484     0.121098     0.124494
...

Question: John, why don't you calculate performance rates for Scale, Add and Triad tests in FLOPS instead?
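For reference, converting these timings to a floating-point rate is simple arithmetic. Scale and Add perform one floating-point operation per element and Triad performs two, while Copy performs none. The sketch below only illustrates that conversion under those counts. It is not part of STREAM, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: MFLOPS (10^6 floating-point operations per second)
// from the flop count per element, the array length, and the best time.
// Per-element counts: Scale 1 (multiply), Add 1 (add), Triad 2 (multiply + add).
double stream_mflops(int flops_per_element, std::size_t elements, double min_seconds) {
    assert(flops_per_element >= 0 && min_seconds > 0.0);
    return 1.0e-6 * static_cast<double>(flops_per_element) *
           static_cast<double>(elements) / min_seconds;
}
```

With the array size of 67108864 elements and the Triad min time of 0.121098 sec quoted above, this gives roughly 1108 MFLOPS. Scale at 0.089758 sec gives roughly 748 MFLOPS.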