Modifying stream benchmark to report read bandwidth

pfay1

Hello,

I'm using Dr. McCalpin's stream to measure bandwidth on various servers. It works great.

I know we can compile it with non-temporal stores and get a higher reported bandwidth (since the read-for-ownership isn't done). But explaining this to other people who use stream and aren't familiar with RFOs, temporal vs non-temporal stores, and actual vs reported bw is hard.

And I know I can multiply the reported bw by a factor (4/3 for triad, since with the RFO it moves 4 arrays' worth of data but only counts 3) and get the 'actual' bw (as in "what linux perf would report as seen by dram")... but this starts making some folks' eyes glaze over.

Other bw benchmarks (such as Intel's mlc) report pure read bw.

Stream can easily be modified to do a pure read test (I changed STREAM_TYPE from 'double' to 'int'). And yes, I'm checking bandwidth with perf to verify that what stream_read reports actually appears as memory traffic, and I use a much larger array to make sure I'm not hitting cache (I get a 99.5+% L3 miss rate).

        for (j=0; j<STREAM_ARRAY_SIZE; j++)
            iaccum += a[j];

Is adding a stream_read subtest something Dr. McCalpin might consider for stream?

In general (for the servers I'm profiling anyway) about 10% of the theoretical bw is lost off the top due to memory refresh, memory scrubbing and other stuff that Dr. Bandwidth knows much better than I do.
Next my stream_read (and Intel mlc read bw) bandwidth can hit about 87-92% of theoretical bw.
Non-temporal store stream_triad can hit about the same 87-90% levels.
Then temporal store stream_triad reports a value of 55%-62% of theoretical bw. The actual bw is 4/3 * reported bw, so actual (as measured by perf dram bw) is about 75%-82% of theoretical.

It would be just so much easier (for me) if stream reported "Read:" bw in addition to the Triad, etc.
We report mem bw (via perf) on our cloud servers and we want to make sure users have the correct understanding of where their mem bw usage fits in the %theoretical bw curve.

I know this might not be the right place for a "mainly stream" question but I wasn't sure how to contact Dr Bandwidth.

Long time since I've posted here or talked with Dr. McCalpin.
Patrick Fay

pfay1

Here is a sample stream_read benchmark. It just adds a Read subtest to the other 4 subtests.
The read subtest reads just 1 double value from each cacheline (so every cacheline gets loaded from memory... but this is a departure from the other subtests, which process all the doubles in a cacheline).
I compiled it with the gcc command below (16 GB per array, with OpenMP):

gcc stream_read.c -O3 -march=native -fno-builtin -DSTREAM_ARRAY_SIZE=2147483648 -mcmodel=medium -DNTIMES=20 -DOFFSET=0 -DSTREAM_TYPE=double -fopenmp -o stream_read.x

Run it with:
export OMP_NUM_THREADS=64
export GOMP_CPU_AFFINITY=0-63:1
./stream_read.x

It has output like this (the "Read:" line shows the read bandwidth):

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          108351.0     0.317234     0.317115     0.317402
Scale:         108182.5     0.317923     0.317609     0.318356
Add:           121554.2     0.424257     0.424005     0.424597
Triad:         121882.0     0.423358     0.422865     0.423755
Read:          180597.7     0.095204     0.095128     0.095340

McCalpinJohn

I certainly have a number of versions of STREAM that include read-only kernels (typically using either 1 array for DSUM or 2 arrays for DDOT).   They have never migrated into the standard version of STREAM because lots of compilers have trouble with optimization of the sum reductions.  This leads to lower performance than expected, and raises as many questions as it answers.

It would be easier to add a DAXPY kernel -- like Triad, but updating one of the two input arrays.  This would get rid of the write-allocate traffic and perhaps make it more clear to folks what is going on....

pfay1

Thanks John,
I'd love to see a dsum, daxpy or ddot that does pure read bandwidth.
That would be more in keeping with your other kernels (which do real-world-type work on the whole arrays).
I've attached a new version of stream_read.
The file stream_read_v03.c aliases the double array as an integer array and sums all the integer elements.
(I had a stream_read_v02.c attached but I've replaced it with stream_read_v03.c, which adds a reduction() clause to the OpenMP parallel for.)
So the integer operations are not in keeping with the spirit of your floating-point kernels, but it does process the whole array.
And it scales well across 32/48/80/96 cpu systems.
This version also prints the "based on a variant of the STREAM benchmark code" message as your license requires.
Pat

pfay1

I've posted the stream_read.c and a run script to github: https://github.com/patinnc/stream_read
