Solved: Contradicting STREAM benchmark result

hyeongtak_ · ‎01-07-2021

Thank you for your help in advance.

Hello. Although STREAM has long been the de facto memory bandwidth benchmark, I only recently started using and studying it. I'm not sure this is the right place to ask, but since Dr. Bandwidth and other STREAM pros answer here from time to time, please let me ask my question.

This is my desktop information:

i7-8700 (6c, 12t), 16GB memory, DDR4-2666
Windows 10
Visual Studio 2019 (optimization level 2)
OMP enabled

Now to my question. I got a binary package of STREAM for Windows and ran it while varying OMP_NUM_THREADS from 1 to 12. The graph below is the result:

This is not what I learned in college. There can be a bottleneck in job distribution or data sharing, but I believe the result shouldn't look like this. Fortunately, the package includes the source code as well as the binary, so I compiled and ran it myself, and the result is quite different.

Sorry for the mismatched x-axes. Even though the lines fluctuate, I believe this is closer to what I expected (the graph rises until 6 threads, which is the number of physical cores).

I know the version I'm using (5.8) is obsolete. However, as far as I could tell, 5.8 and 5.10 do not differ much in the key parts, and I followed the STREAM_ARRAY_SIZE rules. What am I missing, and what have I done wrong? I hope someone can explain why the first graph shows that downward trend.

McCalpinJohn · ‎01-08-2021

I recommend trying the Intel Memory Latency Checker. https://software.intel.com/content/www/us/en/develop/articles/intelr-memory-latency-checker.html

The performance numbers here are consistent with having only one populated DRAM channel...

hyeongtak_ · ‎01-10-2021

Thank you for your reply!

However, what I was wondering about was the difference in results between the .exe in the package and the .exe built from the source code in the package.

Is there any other recommendation, such as using a specific compiler?

McCalpinJohn · ‎01-19-2021

The Windows executable was donated -- I have never had access to a "real" compiler on a Windows system -- so I don't have any of the details on how it was compiled. Note that the file you downloaded was from the "Obsolete" sub-directory -- I will probably nuke most of that in the next web site update....

Except for the issue of streaming stores (*), the default version of STREAM is mostly insensitive to compiler technology on modern processors. Versions of STREAM in C that use dynamic memory allocation often require either a smart compiler or additional annotation (like the "restrict" keyword) so the compiler can assume that there is no aliasing. (That is the main reason that the default version of stream.c uses static global array declarations.)

Depending on the hardware, STREAM performance can be dependent on the relative alignment of the arrays. Different compilers will often generate different alignments, causing small (~1%-2%) differences in performance. These differences usually disappear if you look at the statistics of performance across an ensemble of array sizes.

(*) Streaming stores are an implementation-dependent feature that allows full-cacheline stores that miss in the cache to be written (more-or-less) directly to memory. This bypasses the initial read of the target cache line that is required with the normal store instructions that miss in the cache, leaving more DRAM bandwidth available for the required read and write operations. The Intel compilers will generate streaming store instructions automagically when appropriate, while the GNU compilers do not generate streaming store instructions. I don't think that the CLANG/LLVM combination generates streaming stores, but I have not done much work with these.

hyeongtak_ · ‎01-19-2021

Thank you again, that helped me a lot!