*** STREAM benchmark test results for several Intel architectures ***
This thread collects test results from the STREAM benchmark on several Intel architectures.
Results for Ivy Bridge are posted first.
Results for KNL will be posted later.
[ MinGW 6.1.0 - Test 1.1 - double - OpenMP - No ]
[ Windows 7 SP1 ]

[ Compiler command line ]
g++.exe stream.cpp -O3 -g0 -DNDEBUG -DSTREAM_TYPE=double -o stream.exe

[ STREAM results ]
RDTSC instruction latency:
 Clock cycles (cc): 28.0000000000
 Nano seconds (ns): 9.8967906122
 Micro seconds (mu): 0.0098967906
 Milli seconds (ms): 0.0000098968
--------------------------------------------------------------
STREAM version $Revision: 5.10 $
--------------------------------------------------------------
This system uses 8 bytes per array element.
--------------------------------------------------------------
Array size = 67108864 (elements), Offset = 0 (elements)
Memory per array = 512.0 MiB (= 0.5 GiB).
Total memory required = 1536.0 MiB (= 1.5 GiB).
Each kernel will be executed 128 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
--------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 60301 microseconds.
   (= 60301 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
--------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
--------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           13250.4     0.081662     0.081035     0.082238
Scale:          11448.2     0.093978     0.093792     0.094213
Add:            12170.3     0.132826     0.132339     0.133739
Triad:          12114.2     0.133324     0.132952     0.134030
--------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-013 on all three arrays
--------------------------------------------------------------
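For readers who want to check the "Best Rate" column by hand, here is a minimal sketch of the arithmetic. It assumes the usual STREAM accounting, in which Copy and Scale touch two arrays per element (one read, one write) and Add and Triad touch three, and in which 1 MB is 10^6 bytes. The function name stream_rate_mbs is made up for this illustration and is not part of stream.cpp.

```cpp
#include <cassert>
#include <cstddef>

// Bandwidth in MB/s (1 MB = 1e6 bytes) from the best (minimum) kernel time.
// arrays_touched: 2 for Copy and Scale, 3 for Add and Triad (assumed STREAM accounting).
double stream_rate_mbs(std::size_t elements,
                       std::size_t bytes_per_element,
                       int arrays_touched,
                       double min_seconds) {
    assert(arrays_touched > 0);
    assert(min_seconds > 0.0);
    const double bytes = static_cast<double>(arrays_touched) *
                         static_cast<double>(bytes_per_element) *
                         static_cast<double>(elements);
    return 1.0e-6 * bytes / min_seconds;
}
```

With the Test 1.1 numbers, stream_rate_mbs(67108864, 8, 2, 0.081035) agrees with the reported Copy rate of 13250.4 MB/s to within the rounding of the printed minimum time.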
[ MinGW 6.1.0 - Test 1.2 - double - OpenMP - Yes ]
[ Windows 7 SP1 ]

[ Compiler command line ]
g++.exe stream.cpp -O3 -g0 -fopenmp -DNDEBUG -DSTREAM_TYPE=double -o stream.exe

[ STREAM results ]
RDTSC instruction latency:
 Clock cycles (cc): 28.0000000000
 Nano seconds (ns): 9.8967906122
 Micro seconds (mu): 0.0098967906
 Milli seconds (ms): 0.0000098968
--------------------------------------------------------------
STREAM version $Revision: 5.10 $
--------------------------------------------------------------
This system uses 8 bytes per array element.
--------------------------------------------------------------
Array size = 67108864 (elements), Offset = 0 (elements)
Memory per array = 512.0 MiB (= 0.5 GiB).
Total memory required = 1536.0 MiB (= 1.5 GiB).
Each kernel will be executed 128 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
--------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
--------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 62805 microseconds.
   (= 62805 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
--------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
--------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           17726.8     0.061183     0.060572     0.065787
Scale:          11962.6     0.090069     0.089758     0.091715
Add:            13303.1     0.121562     0.121070     0.123730
Triad:          13300.0     0.121484     0.121098     0.124494
--------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-013 on all three arrays
--------------------------------------------------------------
[ MinGW 6.1.0 - Test 2.1 - float - OpenMP - No ]
[ Windows 7 SP1 ]

[ Compiler command line ]
g++.exe stream.cpp -O3 -g0 -DNDEBUG -DSTREAM_TYPE=float -o stream.exe

[ STREAM results ]
RDTSC instruction latency:
 Clock cycles (cc): 28.0000000000
 Nano seconds (ns): 9.8967906122
 Micro seconds (mu): 0.0098967906
 Milli seconds (ms): 0.0000098968
--------------------------------------------------------------
STREAM version $Revision: 5.10 $
--------------------------------------------------------------
This system uses 4 bytes per array element.
--------------------------------------------------------------
Array size = 67108864 (elements), Offset = 0 (elements)
Memory per array = 256.0 MiB (= 0.3 GiB).
Total memory required = 768.0 MiB (= 0.8 GiB).
Each kernel will be executed 128 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
--------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 30286 microseconds.
   (= 30286 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
--------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
--------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           13244.9     0.040587     0.040534     0.041109
Scale:          11472.4     0.046841     0.046797     0.047153
Add:            12233.6     0.066042     0.065827     0.066368
Triad:          12165.9     0.066255     0.066194     0.066631
--------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-006 on all three arrays
--------------------------------------------------------------
[ MinGW 6.1.0 - Test 2.2 - float - OpenMP - Yes ]
[ Windows 7 SP1 ]

[ Compiler command line ]
g++.exe stream.cpp -O3 -g0 -fopenmp -DNDEBUG -DSTREAM_TYPE=float -o stream.exe

[ STREAM results ]
RDTSC instruction latency:
 Clock cycles (cc): 28.0000000000
 Nano seconds (ns): 9.8967906122
 Micro seconds (mu): 0.0098967906
 Milli seconds (ms): 0.0000098968
--------------------------------------------------------------
STREAM version $Revision: 5.10 $
--------------------------------------------------------------
This system uses 4 bytes per array element.
--------------------------------------------------------------
Array size = 67108864 (elements), Offset = 0 (elements)
Memory per array = 256.0 MiB (= 0.3 GiB).
Total memory required = 768.0 MiB (= 0.8 GiB).
Each kernel will be executed 128 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
--------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
--------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 31992 microseconds.
   (= 31992 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
--------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
--------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           17685.1     0.030554     0.030357     0.032452
Scale:          11986.9     0.045060     0.044788     0.045952
Add:            13265.4     0.060855     0.060707     0.062708
Triad:          13274.2     0.060855     0.060667     0.062189
--------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-006 on all three arrays
--------------------------------------------------------------
[ MinGW 6.1.0 - Test 3.1 - char - OpenMP - No ]
[ Windows 7 SP1 ]

[ Compiler command line ]
g++.exe stream.cpp -O3 -g0 -DNDEBUG -DSTREAM_TYPE=char -o stream.exe

[ STREAM results ]
RDTSC instruction latency:
 Clock cycles (cc): 28.0000000000
 Nano seconds (ns): 9.8967906122
 Micro seconds (mu): 0.0098967906
 Milli seconds (ms): 0.0000098968
--------------------------------------------------------------
STREAM version $Revision: 5.10 $
--------------------------------------------------------------
This system uses 1 bytes per array element.
--------------------------------------------------------------
Array size = 67108864 (elements), Offset = 0 (elements)
Memory per array = 64.0 MiB (= 0.1 GiB).
Total memory required = 192.0 MiB (= 0.2 GiB).
Each kernel will be executed 128 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
--------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 63811 microseconds.
   (= 63811 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
--------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
--------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           13033.6     0.010318     0.010298     0.010604
Scale:          11728.3     0.011465     0.011444     0.011529
Add:            12142.5     0.016616     0.016580     0.016676
Triad:          12336.7     0.016423     0.016319     0.016514
--------------------------------------------------------------
WEIRD: sizeof(STREAM_TYPE) = 1
Solution Validates: avg error less than 1.000000e-006 on all three arrays
--------------------------------------------------------------
[ MinGW 6.1.0 - Test 3.2 - char - OpenMP - Yes ]
[ Windows 7 SP1 ]

[ Compiler command line ]
g++.exe stream.cpp -O3 -g0 -fopenmp -DNDEBUG -DSTREAM_TYPE=char -o stream.exe

[ STREAM results ]
RDTSC instruction latency:
 Clock cycles (cc): 28.0000000000
 Nano seconds (ns): 9.8967906122
 Micro seconds (mu): 0.0098967906
 Milli seconds (ms): 0.0000098968
--------------------------------------------------------------
STREAM version $Revision: 5.10 $
--------------------------------------------------------------
This system uses 1 bytes per array element.
--------------------------------------------------------------
Array size = 67108864 (elements), Offset = 0 (elements)
Memory per array = 64.0 MiB (= 0.1 GiB).
Total memory required = 192.0 MiB (= 0.2 GiB).
Each kernel will be executed 128 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
--------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
--------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 29324 microseconds.
   (= 29324 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
--------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
--------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           17344.1     0.007774     0.007739     0.008475
Scale:          12136.4     0.011088     0.011059     0.011328
Add:            13211.2     0.015279     0.015239     0.015913
Triad:          13253.1     0.015220     0.015191     0.015723
--------------------------------------------------------------
WEIRD: sizeof(STREAM_TYPE) = 1
Solution Validates: avg error less than 1.000000e-006 on all three arrays
--------------------------------------------------------------
Verification with another benchmark test application:
[ Test 1.1 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 16 ME / Iterations = 128 ]
Initializing...
Starting BW Test on 4 threads
Copied: 4.295 GB - Completed in 0.265 sec - Bandwidth: 16.207 GB/s

[ Test 1.2 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 16 ME / Iterations = 256 ]
Initializing...
Starting BW Test on 4 threads
Copied: 8.590 GB - Completed in 0.515 sec - Bandwidth: 16.679 GB/s

[ Test 1.3 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 16 ME / Iterations = 512 ]
Initializing...
Starting BW Test on 4 threads
Copied: 17.180 GB - Completed in 1.061 sec - Bandwidth: 16.192 GB/s
[ Test 2.1 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 32 ME / Iterations = 128 ]
Initializing...
Starting BW Test on 4 threads
Copied: 8.590 GB - Completed in 0.515 sec - Bandwidth: 16.679 GB/s

[ Test 2.2 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 32 ME / Iterations = 256 ]
Initializing...
Starting BW Test on 4 threads
Copied: 17.180 GB - Completed in 1.045 sec - Bandwidth: 16.440 GB/s

[ Test 2.3 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 32 ME / Iterations = 512 ]
Initializing...
Starting BW Test on 4 threads
Copied: 34.360 GB - Completed in 2.106 sec - Bandwidth: 16.315 GB/s
[ Test 3.1 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 64 ME / Iterations = 128 ]
Initializing...
Starting BW Test on 4 threads
Copied: 17.180 GB - Completed in 1.014 sec - Bandwidth: 16.943 GB/s

[ Test 3.2 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 64 ME / Iterations = 256 ]
Initializing...
Starting BW Test on 4 threads
Copied: 34.360 GB - Completed in 2.059 sec - Bandwidth: 16.688 GB/s

[ Test 3.3 - MinGW 6.1.0 / char / OpenMP - Yes / Size = 64 ME / Iterations = 512 ]
Initializing...
Starting BW Test on 4 threads
Copied: 68.719 GB - Completed in 4.212 sec - Bandwidth: 16.315 GB/s
[ Test 1.1 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 16 ME / Iterations = 128 ]
Initializing...
Starting BW Test on 4 threads
Copied: 17.180 GB - Completed in 1.014 sec - Bandwidth: 16.943 GB/s

[ Test 1.2 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 16 ME / Iterations = 256 ]
Initializing...
Starting BW Test on 4 threads
Copied: 34.360 GB - Completed in 2.059 sec - Bandwidth: 16.688 GB/s

[ Test 1.3 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 16 ME / Iterations = 512 ]
Initializing...
Starting BW Test on 4 threads
Copied: 68.719 GB - Completed in 4.196 sec - Bandwidth: 16.377 GB/s
[ Test 2.1 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 32 ME / Iterations = 128 ]
Initializing...
Starting BW Test on 4 threads
Copied: 34.360 GB - Completed in 2.074 sec - Bandwidth: 16.567 GB/s

[ Test 2.2 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 32 ME / Iterations = 256 ]
Initializing...
Starting BW Test on 4 threads
Copied: 68.719 GB - Completed in 4.056 sec - Bandwidth: 16.943 GB/s

[ Test 2.3 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 32 ME / Iterations = 512 ]
Initializing...
Starting BW Test on 4 threads
Copied: 137.439 GB - Completed in 8.143 sec - Bandwidth: 16.878 GB/s
[ Test 3.1 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 64 ME / Iterations = 128 ]
Initializing...
Starting BW Test on 4 threads
Copied: 68.719 GB - Completed in 4.087 sec - Bandwidth: 16.814 GB/s

[ Test 3.2 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 64 ME / Iterations = 256 ]
Initializing...
Starting BW Test on 4 threads
Copied: 137.439 GB - Completed in 8.409 sec - Bandwidth: 16.344 GB/s

[ Test 3.3 - MinGW 6.1.0 / float / OpenMP - Yes / Size = 64 ME / Iterations = 512 ]
Initializing...
Starting BW Test on 4 threads
Copied: 274.878 GB - Completed in 16.754 sec - Bandwidth: 16.407 GB/s
[ Test 1.1 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 16 ME / Iterations = 128 ]
Initializing...
Starting BW Test on 4 threads
Copied: 34.360 GB - Completed in 2.090 sec - Bandwidth: 16.440 GB/s

[ Test 1.2 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 16 ME / Iterations = 256 ]
Initializing...
Starting BW Test on 4 threads
Copied: 68.719 GB - Completed in 4.258 sec - Bandwidth: 16.139 GB/s

[ Test 1.3 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 16 ME / Iterations = 512 ]
Initializing...
Starting BW Test on 4 threads
Copied: 137.439 GB - Completed in 8.268 sec - Bandwidth: 16.623 GB/s
[ Test 2.1 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 32 ME / Iterations = 128 ]
Initializing...
Starting BW Test on 4 threads
Copied: 68.719 GB - Completed in 4.196 sec - Bandwidth: 16.377 GB/s

[ Test 2.2 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 32 ME / Iterations = 256 ]
Initializing...
Starting BW Test on 4 threads
Copied: 137.439 GB - Completed in 8.393 sec - Bandwidth: 16.375 GB/s

[ Test 2.3 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 32 ME / Iterations = 512 ]
Initializing...
Starting BW Test on 4 threads
Copied: 274.878 GB - Completed in 16.645 sec - Bandwidth: 16.514 GB/s
[ Test 3.1 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 64 ME / Iterations = 128 ]
Initializing...
Starting BW Test on 4 threads
Copied: 137.439 GB - Completed in 8.393 sec - Bandwidth: 16.375 GB/s

[ Test 3.2 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 64 ME / Iterations = 256 ]
Initializing...
Starting BW Test on 4 threads
Copied: 274.878 GB - Completed in 16.770 sec - Bandwidth: 16.391 GB/s

[ Test 3.3 - MinGW 6.1.0 / double / OpenMP - Yes / Size = 64 ME / Iterations = 512 ]
Initializing...
Starting BW Test on 4 threads
Copied: 549.756 GB - Completed in 34.008 sec - Bandwidth: 16.165 GB/s
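The "Copied" and "Bandwidth" figures in these verification runs can be reproduced with simple arithmetic. The source of the second benchmark application is not shown in this thread, so the sketch below is only an inference from the posted numbers: it assumes "ME" means 2^20 elements, that each element is counted once for the read and once for the write, and that 1 GB is 10^9 bytes. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes moved by a copy test, counting one read and one write per element
// per iteration. ASSUMPTION: "ME" in the test headers means 2^20 elements.
std::uint64_t copied_bytes(std::uint64_t mega_elements,
                           std::uint64_t bytes_per_element,
                           std::uint64_t iterations) {
    const std::uint64_t elements = mega_elements * 1024u * 1024u;
    return 2u * elements * bytes_per_element * iterations;
}

// Bandwidth in GB/s, with 1 GB = 1e9 bytes (assumed from the posted output).
double bandwidth_gbs(std::uint64_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / seconds / 1.0e9;
}
```

For the first char run (16 ME, 128 iterations), copied_bytes(16, 1, 128) gives 4294967296 bytes, which matches the posted 4.295 GB, and dividing by the posted 0.265 sec lands close to the posted 16.207 GB/s.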
*** Ivy Bridge Intel architecture Test results ***
** Dell Precision Mobile M4700 **
Intel Core i7-3840QM ( 2.80 GHz )
Ivy Bridge / 4 cores / 8 logical CPUs / http://ark.intel.com/products/70846
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
32GB RAM
320GB HDD
Windows 7 Professional 64-bit SP1
Display resolution: 1366 x 768
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
NVIDIA Driver version: 378.66
OpenCL version - 2.0.4.0
Vulkan version - 1.0.39.1
Test results were obtained with a variant of the STREAM benchmark: the timing function 'double mysecond( void )' was modified to use the RDTSC instruction instead of the gettimeofday CRT function.
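The modified timer itself is not posted in this thread, so the following is only a sketch of what an RDTSC-based mysecond() could look like with GCC or MinGW. The constant kTicksPerSecond is an assumption (the nominal 2.80 GHz of the i7-3840QM); a real build would calibrate the TSC frequency rather than hard-code it. The helper read_ticks and the non-x86 fallback are additions for this illustration.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // provides __rdtsc() with GCC, Clang and MinGW

// ASSUMPTION: nominal TSC frequency of the test machine (2.80 GHz).
// A real build should calibrate this value instead of hard-coding it.
static const double kTicksPerSecond = 2.8e9;

static std::uint64_t read_ticks() { return __rdtsc(); }
#else
#include <chrono>

// Fallback for non-x86 builds: steady_clock time expressed in nanoseconds.
static const double kTicksPerSecond = 1.0e9;

static std::uint64_t read_ticks() {
    using namespace std::chrono;
    return static_cast<std::uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
}
#endif

// Replacement for STREAM's timer: seconds since an arbitrary origin.
// Only differences between two calls are meaningful.
double mysecond() {
    assert(kTicksPerSecond > 0.0);
    return static_cast<double>(read_ticks()) / kTicksPerSecond;
}
```

As in the original benchmark, a kernel would be timed by calling mysecond() before and after the loop and subtracting the two values.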
McCalpin, John wrote:
>>Wed, 04/26/2017 - 10:56
>>
>>The STREAM benchmark has always been intended to provide an indication of performance of simple, vectorizable code when
>>run through real compilers. It is intended to be relatively easy to optimize, but performance differences between STREAM and
>>"best case, hand-optimized" core are a feature, not a bug!
>>
>>The performance numbers of 25 GB/s peak, 13 GB/s vectorized OpenMP, and 4 GB/s non-vector, single thread are a little
>>unusual -- what sort of platform did you measure those on?
That is because the 4 GB/s non-vector, single-thread results were obtained with the Open Watcom C++ compiler v2.0, which supports neither vectorization nor OpenMP.
Results from the modern MinGW C++ compiler v6.1.0 clearly demonstrate the benefits of vectorization and parallelization.
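To make the vectorization and parallelization point concrete, here is a small sketch of a Triad-style kernel written so that a compiler such as g++ can parallelize it with -fopenmp and may auto-vectorize it at -O3. It is an illustration using std::vector, not the loop from stream.cpp, and the function name triad is chosen for this example. Without -fopenmp the pragma is ignored and the loop runs on one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Triad: a[j] = b[j] + scalar * c[j] for every element.
void triad(std::vector<double>& a,
           const std::vector<double>& b,
           const std::vector<double>& c,
           double scalar) {
    assert(a.size() == b.size() && b.size() == c.size());
    // Signed loop index keeps the loop valid for older OpenMP versions.
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(a.size());
    #pragma omp parallel for
    for (std::ptrdiff_t j = 0; j < n; ++j) {
        a[j] = b[j] + scalar * c[j];
    }
}
```

A compiler without OpenMP or vectorization support, as described above for Open Watcom, would compile the same loop as plain scalar, single-threaded code.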
John, you may have noticed this piece of information:
>>RDTSC instruction latency:
>> Clock cycles (cc): 28.0000000000
>> Nano seconds (ns): 9.8967906122
>> Micro seconds (mu): 0.0098967906
>> Milli seconds (ms): 0.0000098968
I modified the STREAM benchmark C code to use the RDTSC instruction instead of the C runtime function gettimeofday.
Keep in mind that although gettimeofday is supported on most UNIX-like operating systems, some C++ compilers for Windows platforms do not provide it.
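How the "RDTSC instruction latency" figure is measured is not shown in the thread. One plausible approach, sketched here under that assumption, is to read the counter twice back to back, keep the smallest difference over many trials, and convert ticks to nanoseconds with the TSC frequency. The function names are hypothetical. Note that the posted pair of values (28 cycles, 9.8967906122 ns) implies a TSC frequency of roughly 2.829 GHz.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // provides __rdtsc() with GCC, Clang and MinGW
static std::uint64_t read_ticks() { return __rdtsc(); }
#else
#include <chrono>
// Fallback for non-x86 builds: steady_clock time in nanoseconds.
static std::uint64_t read_ticks() {
    using namespace std::chrono;
    return static_cast<std::uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
}
#endif

// Smallest observed gap between two back-to-back counter reads, in ticks.
std::uint64_t rdtsc_latency_ticks(int trials) {
    assert(trials > 0);
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
    for (int i = 0; i < trials; ++i) {
        const std::uint64_t t0 = read_ticks();
        const std::uint64_t t1 = read_ticks();
        const std::uint64_t gap = t1 - t0;
        if (gap < best) best = gap;
    }
    return best;
}

// Convert a tick count to nanoseconds for a given counter frequency in Hz.
double ticks_to_ns(std::uint64_t ticks, double tsc_hz) {
    assert(tsc_hz > 0.0);
    return 1.0e9 * static_cast<double>(ticks) / tsc_hz;
}
```

Taking the minimum rather than the average filters out trials disturbed by interrupts or context switches.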
...
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           17726.8     0.061183     0.060572     0.065787
Scale:          11962.6     0.090069     0.089758     0.091715
Add:            13303.1     0.121562     0.121070     0.123730
Triad:          13300.0     0.121484     0.121098     0.124494
...
Question: John, why don't you calculate performance rates for the Scale, Add and Triad tests in FLOPS instead?
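For anyone who wants FLOP rates from the existing output, the conversion is a one-liner. This sketch assumes the usual operation counts per element (Scale: one multiply, Add: one add, Triad: one multiply plus one add; Copy performs no arithmetic) and uses a made-up function name. It only illustrates the arithmetic and is not a statement about how STREAM should report results.

```cpp
#include <cassert>
#include <cstddef>

// MFLOP/s (1e6 floating-point operations per second) from the best kernel time.
// flops_per_element: 1 for Scale and Add, 2 for Triad (assumed operation counts).
double kernel_mflops(std::size_t elements,
                     int flops_per_element,
                     double min_seconds) {
    assert(flops_per_element >= 0);
    assert(min_seconds > 0.0);
    return 1.0e-6 * static_cast<double>(flops_per_element) *
           static_cast<double>(elements) / min_seconds;
}
```

With the Test 1.2 minimum times quoted above, Scale works out to roughly 748 MFLOP/s and Triad to roughly 1108 MFLOP/s. Because these kernels do so little arithmetic per byte moved, such figures mostly restate the memory bandwidth in different units.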