Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

STREAM OMP benchmark compiled with ICC

bouache
Beginner

Hi Everyone,

I am looking for the STREAM OpenMP benchmark for memory bandwidth testing, compiled with ICC (15/16). Any help or links?

Thanks.

McCalpinJohn
Honored Contributor III

For what operating system?  What hardware?
 

bouache
Beginner

Thanks for your answer. RHEL 6, on Ivy Bridge or Haswell (Xeon E5-26xx v2/v3) server CPUs.

Kittur_G_Intel
Employee

Hi Mourad,
From what I gather, we don't have any externally published results maintained by the product team. That said, I think the best approach is for you to download the benchmark and run it yourself on your system with the options of interest.

Also, I noticed that John McCalpin responded to you earlier; he maintains the STREAM benchmark (see, for example, https://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/stream/) and should be able to give you more input as well!

BTW, a quick search also turned up STREAM benchmark results on Intel Xeon and Xeon Phi at http://www.karlrupp.net/2015/02/stream-benchmark-results-on-intel-xeon-and-xeon-phi/, which should help as well.

_Kittur

McCalpinJohn
Honored Contributor III

The official home page of the STREAM benchmark has not moved since 1996 -- it is at http://www.cs.virginia.edu/stream/

I apologize that I have not published the results for Xeon E5-26xx v2 and Xeon E5-26xx v3 yet -- they are in my inbox....    

I don't typically distribute binaries because part of what makes STREAM interesting is the ability to re-compile and re-run with many variations on array sizes, array alignments, instruction sets, etc., so that you can see how sensitive the performance is to various configuration parameters.

Systems will get different results depending on many factors, including (but not limited to):

  • processor model
  • BIOS settings
  • DRAM configuration (number and type of DIMMS installed in each channel)
  • run-time environment (numactl, number of threads, pinning of threads)
  • compiler used (Intel vs GNU vs ???)
  • compiler options chosen (lots of options can impact performance)
  • STREAM_ARRAY_SIZE

With that said, there is not usually a lot of difference across "well-configured" systems with a variety of processor models (as long as you stay away from the low-frequency, low-core-count models).

Ivy Bridge EP:

For the Xeon E5-2680 v2 (Ivy Bridge EP, 10 core, 2.8 GHz nominal, 256 GiB of DDR3/1866), I compiled with icc 14.0.1: "icc -O3 -openmp -ffreestanding -DSTREAM_ARRAY_SIZE=80000000 -xSSE4.1 -opt-streaming-stores always"

The code was run with 20 threads (HyperThreading disabled) and threads pinned using "KMP_AFFINITY=scatter".   A fairly typical result:

-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:       88086.7152       0.0146       0.0145       0.0146
Scale:      88416.0195       0.0146       0.0145       0.0146
Add:        90535.7416       0.0213       0.0212       0.0213
Triad:      90783.7540       0.0212       0.0211       0.0214
-------------------------------------------------------------

The highest of these values is about 76% of the peak DRAM bandwidth of ~119.4 GB/s for a 2-socket system.
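
For reference, the full build-and-run sequence for this case would look roughly like the following (a sketch only; the source and binary names are illustrative, while the flags, thread count, and affinity setting are the ones described above):

icc -O3 -openmp -ffreestanding -DSTREAM_ARRAY_SIZE=80000000 -xSSE4.1 -opt-streaming-stores always stream.c -o stream
export OMP_NUM_THREADS=20       # one thread per physical core, HyperThreading disabled
export KMP_AFFINITY=scatter     # pin threads, spread across both sockets
./stream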

Haswell EP:

Dell R630 with 2 Xeon E5-2660 v3 (10 core, 2.6 GHz, 105W) & 64 GiB of DDR4/2133 (one dual-rank 16 GiB DIMM per channel).

Compiled with icc 2015: -O3 -xCORE-AVX2 -ffreestanding -openmp -DSTREAM_ARRAY_SIZE=400000000

The code was run with 20 threads (HyperThreading disabled) and threads pinned using "KMP_AFFINITY=scatter".   A fairly typical result:

-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:      109123.6528       0.0588       0.0586       0.0589
Scale:     109633.9575       0.0585       0.0584       0.0586
Add:       111862.0894       0.0859       0.0858       0.0860
Triad:     111760.2507       0.0860       0.0859       0.0861
-------------------------------------------------------------

The highest of these values is almost 82% of the peak DRAM bandwidth of ~136.5 GB/s for a 2-socket system.

 

 

Kittur_G_Intel
Employee

Hi John,
Thanks for the response and for the results on Ivy Bridge/Haswell with ICC (interesting numbers). BTW, the latest release of ICC is 16.0 (initial release) and can be downloaded from https://registrationcenter.intel.com/en/home/
Kittur 

McCalpinJohn
Honored Contributor III

STREAM performance should have only an extremely small dependence on the compiler version (for Intel compilers).  There are certainly no noticeable changes across 13, 14, and 15, and the performance differences that exist are mostly very subtle 2nd-order (or smaller) effects.   There is a big difference in performance between the Intel compilers and gcc, but that is because gcc does not support the generation of streaming stores.
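
For comparison, a typical gcc build of the same source might look something like this (a sketch; the flags and thread count are illustrative, and as noted above gcc will not generate the streaming stores for the write streams):

gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream_gcc
export OMP_NUM_THREADS=20
export OMP_PROC_BIND=spread     # libgomp does not read KMP_AFFINITY; use the standard OpenMP binding variables instead
./stream_gcc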

We just upgraded to Intel 15 as the default compiler on our primary production system (Stampede) yesterday, but we will probably make version 16 available as an option within a few months.  I am especially interested in the improved functionality of VTune for MPI codes that is available in version 16.   With any luck it will also improve some Haswell optimizations that I have been struggling with, but that is a different topic....

Kittur_G_Intel
Employee

Thanks for the info, John. Nevertheless, it does make sense to have benchmark results for the latest release, since many of our customers have migrated to it for the optimization enhancements and new features.

_Cheers,
Kittur 

bouache
Beginner

Thanks for your answers, John.

For the Haswell test, you said you used one dual-rank 16 GiB DIMM per channel. Did you try with single-rank DIMMs?

I expect that single-rank (one DIMM per channel) will give lower performance, and I want to understand why. Why does dual-rank give better performance, and how does the cost compare? Is it latency, and do you have any results comparing dual-rank vs. single-rank latency?

Another point: compiling STREAM with the flags -O3 -xCORE-AVX2 -ffreestanding -openmp -DSTREAM_ARRAY_SIZE=400000000 should give the best performance, especially with the latest ICC version.

Can you please share the STREAM binaries? I am having a hard time compiling STREAM with ICC using the same flags as you.

My tests so far are all with GCC, using this STREAM package: https://github.com/gregs1104/stream-scaling

Thanks.

 

McCalpinJohn
Honored Contributor III

Hmmmm.... the authors of that "stream-scaling" package are a bit confused about some issues -- I will try to see if I can get that cleaned up...

I don't have any single-rank DIMMs in any of my systems, so I can't test that configuration.   Based on experience with DDR3 single-rank DIMMs, it is pretty clear that one single-rank DIMM is not a desirable configuration for performance.    DDR3 has only 8 banks per rank and the STREAM benchmark generates 1 or 2 memory read streams and 1 memory write stream per thread. So running STREAM using 8 cores will generate 16 read streams and 8 write streams.  These will want to access 24 different DRAM banks, but since there are only 8 DRAM banks in a single-rank system, the system will experience very high rates of DRAM page thrashing.

With DDR4 there are 16 banks (arranged in 4 "bank groups"), so a single rank should be able to handle up to 8 cores per socket before performance starts dropping rapidly.

There should be no difference in price between single-rank and dual-rank DIMMs if they use the same DRAM technology.   An 8 GiB single-rank DIMM uses 18 4 Gbit DRAM chips in the 1Gx4 configuration, while an 8 GiB dual-rank DIMM uses 2 ranks of 9 4Gbit DRAM chips in the 512Mx8 configuration.  Both use 18 DRAM chips and there is effectively zero difference in cost between the 1Gx4 and 512Mx8 DRAMs.   A dual-rank registered DIMM requires an extra register chip, but those only cost a few dollars each.

Some vendors don't want to support the "x8" DRAM configurations, but they work fine -- that is pretty much all we have in the ~10,000 servers here at TACC.

To compile STREAM with an array size bigger than 80 million elements you need to add "-mcmodel=medium" to the compiler flags.
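
For example, the Haswell build above would then look something like this (a sketch that just adds the flag to the command line already quoted in this thread; the source and binary names are illustrative):

icc -O3 -xCORE-AVX2 -ffreestanding -openmp -DSTREAM_ARRAY_SIZE=400000000 -mcmodel=medium stream.c -o stream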

bouache
Beginner

Hi John,

I did re-test using ICC version here what I have:

icc -O3 -xCORE-AVX2 -ffreestanding -openmp -DSTREAM_ARRAY_SIZE=400000000 -mcmodel=medium stream.c -o stream-icc16
./stream-icc16

time ./stream-icc16
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           78372.8     0.082209     0.081661     0.083570
Scale:          79166.8     0.083214     0.080842     0.092195
Add:            87008.6     0.112603     0.110334     0.120000
Triad:          87165.9     0.112757     0.110135     0.118607
------------------------------------------------------------

For GCC

time ./stream
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           60767.8     0.101035     0.092050     0.105895
Scale:          58615.0     0.100004     0.095431     0.101726
Add:            66017.2     0.127397     0.127096     0.127739
Triad:          64615.6     0.129917     0.129853     0.129986
-------------------------------------------------------------

Even though the ICC version is better, I still don't see the performance you reported.
 

My HW config is: 2 x Xeon E5-2680 v3 @ 2.50 GHz, 128 GB (dual-rank) DDR4-2133. I can do the same test with single-rank DIMMs.

Any thoughts?

Thanks.

McCalpinJohn
Honored Contributor III

On multi-socket systems STREAM must be run with pinned threads.  Using KMP_AFFINITY="verbose,scatter" is the easiest way to do this.
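
Something along these lines (a sketch; the thread count assumes one thread per physical core on your 2 x E5-2680 v3, and the binary name is the one from your post):

export OMP_NUM_THREADS=24
export KMP_AFFINITY="verbose,scatter"
./stream-icc16

The "verbose" part makes the OpenMP runtime print the thread-to-core bindings at startup, so you can confirm that the threads really are spread across both sockets.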

I just ran the test on one of my Xeon E5-2680 v3 systems and got well over 100 GB/s on all four kernels using anywhere between 8 and 24 threads.

If you are running with very large array sizes it is possible to run out of memory on one node and allocate some of your pages on the wrong node. This can lead to inconsistent and confusing results. This seems unlikely in your case, since you are only requesting about 4.5 GB/node (out of 64 GB installed), but I thought I would mention it.

You can monitor for this by running "numastat" before and after the STREAM run. If everything is running well, the change in the "numa_miss" value on each node will be orders of magnitude smaller than the change in the "numa_hit" value on each node. If you do see (after-before) differences in "numa_miss" values that are bigger than a few percent of the corresponding (after-before) differences in "numa_hit" values, you should try dropping the caches before the STREAM run. On Linux systems you can drop the caches with "echo 3 > /proc/sys/vm/drop_caches" (run by the root user).
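
As a sketch of that check (the binary name is the one from your post; the drop_caches step has to be run as root):

numastat > numastat_before.txt
./stream-icc16
numastat > numastat_after.txt
diff numastat_before.txt numastat_after.txt    # numa_miss should change far less than numa_hit
echo 3 > /proc/sys/vm/drop_caches              # only if numa_miss grew by more than a few percent of numa_hit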

bouache
Beginner

Thanks John,

Here we go: with all those optimizations, I got similar performance with ICC 16.

./stream-icc16
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 400000000 (elements), Offset = 0 (elements)
Memory per array = 3051.8 MiB (= 3.0 GiB).
Total memory required = 9155.3 MiB (= 8.9 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
OMP: Warning #222: OMP_NUM_THREADS: Invalid symbols found. Check the value "’48’".
Number of Threads requested = 48
Number of Threads counted = 48
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 62484 microseconds.
   (= 62484 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          104686.3     0.061232     0.061135     0.061305
Scale:         104955.6     0.061054     0.060978     0.061245
Add:           115370.8     0.083270     0.083210     0.083336
Triad:         115394.2     0.083249     0.083193     0.083467
-------------------------------------------------------------

GCC:

./stream-gcc

-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           65844.3     0.098223     0.084953     0.106872
Scale:          62153.3     0.097536     0.089998     0.101590
Add:            72611.7     0.125889     0.115553     0.137524
Triad:          75404.1     0.123395     0.111274     0.130068

bouache
Beginner

Hi John,

Did you try to measure the memory latency for IVB and HSW?

I am just trying to understand: if we go with dual-rank, are we going to get worse latency?

Thanks

McCalpinJohn
Honored Contributor III

Memory latency should not depend on the memory configuration unless you install so many DIMMs (or ranks) that the DRAM frequency has to be decreased.  Recent Intel server systems typically run at full speed with two dual-rank registered DIMMs per channel, so there should be no latency difference between:

  • One single-rank DIMM per channel
  • Two single-rank DIMMs per channel
  • One dual-rank DIMM per channel
  • Two dual-rank DIMMs per channel

There are differences in behavior under load, but they vary in both sign and magnitude depending on the detailed characteristics of the workload.

 

 

bouache
Beginner

Thanks for all your answers, John.

Kittur_G_Intel
Employee

I second that, Mourad -- thanks to John for the valuable input on the benchmark.

Kittur
