We have an Intel Sandy Bridge E5-4640 machine. It has 4 sockets, and the specs say QPI should provide 16 GB/s from one NUMA node to another. I ran a STREAM benchmark with all the memory allocated on node 1 and all 16 threads running on node 0, so I would expect the bandwidth to correspond to the QPI number. However, I see a bandwidth of only ~4 GB/s. Also, the local memory bandwidth is limited to ~30 GB/s where it should have been ~50 GB/s. Does anyone have ideas why this might be the case?
(1) You need to be careful to increase the problem size appropriately for that system -- remember that it has 80 MB of L3 cache if you use all the cores!
According to the STREAM run rules, each array should be at least 4 times the aggregate size of the last level cache, so ~320 MB per array, or N=40,000,000. (I don't worry about the <5% difference between 320 x10^6 Bytes and 320 x 2^20 Bytes.)
The reported results on 32 cores with all local memory allocation are ~75 GB/s for the Copy and Scale kernels and ~83 GB/s for the Add and Triad kernels.
If I run with the default N=2,000,000, the arrays take up only 16 MB each, so the three arrays are fully L3-contained, and I get reported performance of up to 160 GB/s. This is not a valid STREAM result, but it is a useful measure of L3 cache bandwidth.
I just re-ran with a slightly newer version of the compiler and got numbers in the 84-85 GB/s range for the Add and Triad kernels for array sizes between 30M and 10,000M elements per array.
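The sizing arithmetic in point (1) can be written out in a few lines. This is only a sketch, assuming 20 MB of L3 per socket (80 MB across 4 sockets, in decimal megabytes as above) and 8-byte double-precision elements; the function name is made up for illustration and is not part of STREAM.

```python
def min_stream_array_elements(llc_bytes_total, bytes_per_element=8, factor=4):
    """Smallest array length that satisfies the 'factor x last-level cache' run rule."""
    min_array_bytes = factor * llc_bytes_total
    # Ceiling division, so the array is never smaller than the rule requires.
    return -(-min_array_bytes // bytes_per_element)


# Assumption: 4 sockets x 20 MB of L3 each.
llc_total = 4 * 20 * 10**6

n = min_stream_array_elements(llc_total)
print(n)  # 40000000

# With the default N=2,000,000 each array is 16 MB, so the three arrays
# together (48 MB) fit inside the 80 MB of aggregate L3.
default_footprint = 3 * 2_000_000 * 8
print(default_footprint < llc_total)  # True
```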
(2) Running on all 32 cores with "numactl --membind=1" gave results in the range of 12-13 GB/s for the four kernels with problem sizes between 100M and 1000M elements per array.
On the other hand, when I ran with 8 threads bound to chip 0 and the memory bound to chip 1 or chip 3, I got the same ~4 GB/s that you are seeing. Performance was a bit lower when the data was bound to chip 2 (a bit under 4 GB/s). Changing the thread count did not help -- this seems to be a "feature".
Hello Jim and Cabrigal,
I've run stream on 4 socket snb-ex box with 4 Intel(R) Xeon(R) CPU E5-4650 0 @ 2.70GHz processors. I think the box had 32 DIMMs with 1333 MHz. There are 64 logical threads, 32 cores. The first 32 cpus (as seen in /proc/cpuinfo) are the first logical thread on each core. So cpus 0-7 are the first logical thread on socket 0, cpus 8-15 are the 1st logical thread on socket 1, etc.
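For anyone pinning threads by hand, the numbering described above can be turned into a small lookup. This is a sketch only: the layout of cpus 0-31 comes from the description above, but the assumption that cpus 32-63 repeat the same socket order for the second logical thread is mine, and the helper name is hypothetical. Check /proc/cpuinfo on your own box before relying on it.

```python
SOCKETS = 4
CORES_PER_SOCKET = 8
PHYSICAL_CORES = SOCKETS * CORES_PER_SOCKET  # 32


def locate_cpu(cpu_id):
    """Map a logical cpu id (0-63) to (socket, core_in_socket, hw_thread)."""
    if not 0 <= cpu_id < 2 * PHYSICAL_CORES:
        raise ValueError(f"cpu id {cpu_id} is out of range for this box")
    # cpus 0-31 are the first logical thread of each core.
    # Assumption: cpus 32-63 are the second thread, in the same socket order.
    hw_thread, core = divmod(cpu_id, PHYSICAL_CORES)
    socket, core_in_socket = divmod(core, CORES_PER_SOCKET)
    return socket, core_in_socket, hw_thread


# First logical thread of every core on socket 0:
socket0_first_threads = [c for c in range(2 * PHYSICAL_CORES)
                         if locate_cpu(c)[0] == 0 and locate_cpu(c)[2] == 0]
print(socket0_first_threads)  # [0, 1, 2, 3, 4, 5, 6, 7]
```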
I compiled stream.c with OpenMP support by setting (in the Makefile) CFLAGS = -O3 -fopenmp -DSTREAM_ARRAY_SIZE=40000000. I didn't have the g77 compiler installed, so I just used the gcc compiler.
Then set the OMP env variables:
and run the job:
Function    Best Rate MB/s   Avg time   Min time   Max time
Copy:          87191.2       0.007420   0.007340   0.007561
Scale:         86020.5       0.007542   0.007440   0.007646
Add:           94172.5       0.010289   0.010194   0.010404
Triad:         94833.4       0.010224   0.010123   0.010350
If I run with pcm.x to monitor the bandwidth (./pcm.x 1 -ns -nc) then pcm shows an average of 85 GB/s reads and 35 GB/s writes. So (as Jim knows) we see that there is actually more memory bw used than stream reports.
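One way to see where the extra traffic could come from is to compare what STREAM counts with what the memory controller moves if every store first reads its target cache line. The sketch below assumes a write allocate on every store (plausible for a plain gcc build without streaming stores, but not confirmed here); the function name and the per-kernel table are illustrative, not taken from stream.c or pcm.

```python
# (arrays read, arrays written) per loop iteration, as STREAM counts them.
KERNELS = {"Copy": (1, 1), "Scale": (1, 1), "Add": (2, 1), "Triad": (2, 1)}


def traffic_with_write_allocate(kernel, reported_mb_s):
    """Estimate (read MB/s, write MB/s) at the memory controller.

    Assumption: every written cache line is read from memory first,
    so each written array also shows up once as read traffic.
    """
    reads, writes = KERNELS[kernel]
    counted = reads + writes          # array accesses STREAM gives credit for
    actual_reads = reads + writes     # explicit loads plus the write allocates
    read_rate = reported_mb_s * actual_reads / counted
    write_rate = reported_mb_s * writes / counted
    return read_rate, write_rate


reported = {"Copy": 87191.2, "Scale": 86020.5, "Add": 94172.5, "Triad": 94833.4}
for name, rate in reported.items():
    r, w = traffic_with_write_allocate(name, rate)
    print(f"{name:6s} reads ~{r / 1000:.0f} GB/s, writes ~{w / 1000:.0f} GB/s")
```

Under that assumption the estimated read traffic is higher than the STREAM number suggests, and the write traffic lands in the 30-45 GB/s range, which is at least the same ballpark as the pcm averages.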
A bit of a follow-up....
I ran STREAM on a Xeon E5-2680 system (a compute node in the "Stampede" system at TACC) with 4-8 threads bound to chip 0 and the memory bound to chip 1. Performance was ~14.5 GB/s for the Add and Triad kernels for various array sizes when the binaries were compiled to use streaming stores.
On my Xeon E5-4650 system (a "large memory" node in the "Stampede" system), I got ~4.4 GB/s when the threads were on chip 0 and the data was on chip 1 or chip 3 (both directly connected to chip 0), and about 3.8 GB/s when the data was on chip 2 (two hops away).
I expected to see a factor of 2 decrease in chip-to-chip bandwidth because the Xeon E5-2600 connects both QPI links between the two chips, while the Xeon E5-4600 has only one QPI link for each chip-to-chip connection. This reduces the observed ~14.5 GB/s to an expected value of ~7.2 GB/s.
If (like the Nehalem EX) the Xeon E5-4600 performs "write allocates" (i.e. reads the target cache lines before writing to them) *even* when using streaming stores, then I expect another 25% decrease for the Add and Triad kernels (4 data items moved per iteration instead of 3). This reduces the expected "cross-chip" STREAM performance to about 5.5 GB/s.
So the observed value of ~4.4 GB/s is ~20% lower than what I would expect based on simple scaling. I don't know if this difference is due to extra coherence traffic that may be required on a 4-socket system, or due to increased memory latency, or due to some other less obvious factor.
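The scaling argument above is just a chain of multiplications, spelled out here as a sketch. The input numbers are the ones quoted in this thread; the two scaling factors are the stated assumptions (one QPI link instead of two, and 4 data items moved per iteration instead of 3), not measured quantities, and the function name is made up.

```python
def expected_cross_chip_gb_s(two_socket_gb_s, link_ratio=0.5, items_counted=3, items_moved=4):
    """Scale a 2-socket cross-chip STREAM result to the 4-socket topology.

    link_ratio: one QPI link per chip pair instead of two (assumption).
    items_counted / items_moved: Add/Triad traffic if streaming stores still
    incur write allocates (assumption).
    """
    one_link = two_socket_gb_s * link_ratio
    with_write_allocate = one_link * items_counted / items_moved
    return one_link, with_write_allocate


one_link, expected = expected_cross_chip_gb_s(14.5)
observed = 4.4
shortfall = 1 - observed / expected

print(f"one QPI link:        {one_link:.2f} GB/s")   # 7.25
print(f"with write allocate: {expected:.2f} GB/s")   # 5.44
print(f"observed shortfall:  {shortfall:.0%}")       # 19%
```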
Can someone explain what formula Intel VTune uses internally to calculate the Memory Bandwidth Bound metric? Also, if an application is bound by memory bandwidth, would it not be saturating the memory bandwidth?