Solved: A curious bandwidth result

HUIZHAN_Y_ · ‎06-11-2015

I am testing the STREAM benchmark on a NUMA system with two Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz processors. The compiler is GCC, and I use the OpenMP version with 12 threads. I know that a NUMA system allocates pages on first touch, so I align the three arrays (a/b/c) to 2MB and choose a suitable N so that the threads run on both nodes. As a result, all threads access local data (I did check the location of all pages with migrate_pages), and I measure a bandwidth of 20GB/s for Triad. I wanted to check the case where all threads access remote data, expecting the bandwidth to be worse. So I had the data allocated on the remote memory node when it is initialized, so that all threads compute entirely on remote data. But the result is not as bad as expected: the bandwidth is 19GB/s for Triad. So I thought that maybe remote access does not hurt bandwidth. But when I run with numactl (numactl -m 0 -N 1), so that all memory is allocated on memory node 0 and all threads run on node 1, I only get a bandwidth of 5GB/s. I expected about 10GB/s, since I can use half of the memory devices. Why is the result so poor?

I used VTune to check the remote-access case with threads running on both nodes (node 0 and 1), and I found that, except for the first iteration, the remaining iterations do not use much QPI bandwidth (generally 2~3GB/s). But in the remote-access case with threads running on a single node (node 1), QPI seems to be used aggressively (generally 3~5GB/s). I don't know what tricky thing is going on in this test.

Thomas_W_Intel · ‎06-12-2015

Might it be that you have automatic NUMA balancing enabled on your system? If so, the kernel would migrate threads or memory after a few iterations.

Kind regards

Thomas
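
Editor's note: a minimal sketch of how one might check the setting Thomas refers to. It assumes a Linux kernel that exposes the switch at /proc/sys/kernel/numa_balancing; the helper name numa_balancing_enabled is made up for this illustration and does not come from the thread.

```python
from pathlib import Path

# Assumed location of the automatic NUMA balancing switch (Linux procfs).
NUMA_BALANCING = "/proc/sys/kernel/numa_balancing"


def numa_balancing_enabled(path=NUMA_BALANCING):
    """Return True or False for the kernel setting, or None if it cannot be read."""
    try:
        value = int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing (kernel built without NUMA balancing) or unreadable.
        return None
    return value != 0  # 0 means disabled; any non-zero mode means enabled


if __name__ == "__main__":
    state = numa_balancing_enabled()
    if state is None:
        print("numa_balancing is not available on this system")
    else:
        print("automatic NUMA balancing is", "enabled" if state else "disabled")
```

If the setting is enabled, it can usually be turned off for a test run by writing 0 to the same file as root, for example with sysctl kernel.numa_balancing=0.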

Patrick_F_Intel1 · ‎06-12-2015

Hey Huizhan,

I'm guessing that Dr. McCalpin will comment on this since he is the author of stream. By ark.intel.com (google 'Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz' and then click on the ark.intel.com link) a single one of these processors has a max memory bandwidth of 59 GB/s. The best number you show is 20 GB/s.

On my simple pure read memory bw test I get 117 GB/s on a dual socket 'Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz' (max bw of this cpu is 68 GB/s). I pinned 1 thread per logical cpu (HT enabled, 18 cores/socket, 2 sockets).

So I don't know what to make of your 20 GB/s number. It seems very low. Until we know what is going on with your peak bw number trying to understand anything else will just get more confusing.

Pat

HUIZHAN_Y_ · ‎06-12-2015

Thomas, you got the point! After I disabled automatic NUMA balancing, the remote-access case running on both nodes gives 10GB/s for Triad, which is double the remote-access case running on a single node.

It seems that the kernel automatically migrates the pages, but when I check the page locations (using migrate_pages), all pages show their original locations.
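
Editor's note: one possible way to cross-check page placement, not something the poster describes using, is to read /proc/<pid>/numa_maps while the benchmark runs. Each mapping line there carries N<node>=<pages> fields. The sketch below sums those fields; the function name pages_per_node is hypothetical, and the counts are in units of each mapping's kernel page size.

```python
import re
from collections import Counter

# Matches the per-node page counts in a numa_maps line, e.g. "N0=512".
NODE_FIELD = re.compile(r"\bN(\d+)=(\d+)\b")


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields over all mappings in the given text."""
    totals = Counter()
    for node, pages in NODE_FIELD.findall(numa_maps_text):
        totals[int(node)] += int(pages)
    return dict(totals)


if __name__ == "__main__":
    # Inspect the current process; replace "self" with the benchmark's PID.
    with open("/proc/self/numa_maps") as f:
        print(pages_per_node(f.read()))
```

Sampling this a few times during a run would show whether pages move between nodes after the first iterations.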

HUIZHAN_Y_ · ‎06-12-2015

Patrick, I think 20GB/s is not a problem, since for each memory node the server has only a single memory slot populated, with a 16GB DIMM. Although the processor can deliver high bandwidth, most of the slots are empty. See the following data:

Speed: 2133 MHz
Configured Clock Speed: 1867 MHz
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: 2133 MHz
Configured Clock Speed: 1867 MHz
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
Speed: Unknown
Configured Clock Speed: Unknown
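
Editor's note: a small sketch that summarizes a listing in the format shown above. It assumes the text consists of alternating "Speed:" and "Configured Clock Speed:" lines, one pair per slot, with "Unknown" marking an empty slot; the tool that produced the listing is not named in the post, and populated_dimms is a hypothetical helper.

```python
import re

SPEED = re.compile(r"^[ \t]*Speed:[ \t]*(\S+)", re.MULTILINE)
CONFIGURED = re.compile(r"^[ \t]*Configured Clock Speed:[ \t]*(\S+)", re.MULTILINE)


def populated_dimms(listing):
    """Return ([(rated_mhz, configured_mhz), ...] for populated slots, total slots)."""
    rated = SPEED.findall(listing)
    configured = CONFIGURED.findall(listing)
    # Slots reporting "Unknown" are treated as empty and skipped.
    populated = [
        (int(r), int(c))
        for r, c in zip(rated, configured)
        if r.isdigit() and c.isdigit()
    ]
    return populated, len(rated)


if __name__ == "__main__":
    # Rebuild the listing from the post: one populated slot, then 11 empty, twice.
    filled = "Speed: 2133 MHz\nConfigured Clock Speed: 1867 MHz\n"
    empty = "Speed: Unknown\nConfigured Clock Speed: Unknown\n"
    dimms, slots = populated_dimms((filled + empty * 11) * 2)
    print(f"{len(dimms)} of {slots} slots populated: {dimms}")
```

For the listing in the post this reports 2 of 24 slots populated, each DIMM rated at 2133 MHz and configured at 1867 MHz.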

McCalpinJohn · ‎06-12-2015

I am not sure what to make of several parts of this, but I have noticed poor performance on a Xeon E5 v3 with a single "Home Agent" -- and I see that the Xeon E5-2620 v3 used here also has a single Home Agent.

The single Home Agent system that I tested was populated with the Xeon E5-2603 v3 processor, which is a very slow model, but the performance was much lower than I expected even accounting for the lower DRAM, Core, and Uncore frequencies. The processor was particularly slow with the streaming/non-temporal stores used by the STREAM benchmark and for remote accesses.

It would probably be helpful to run the Intel Memory Latency Checker (https://software.intel.com/en-us/articles/intelr-memory-latency-checker) on this system. Using the bandwidth tests with the various -W options, my system gave results ranging from 23 GB/s to 35 GB/s (out of 51.2 GB/s peak) for local accesses, and results ranging from 3.7 GB/s to 5.6 GB/s for remote accesses. The tests that look like STREAM Triad (-W7 and -W10) were the worst in both cases.

McCalpinJohn · ‎06-12-2015

If the system has only one DIMM channel populated, then the local bandwidth values are less surprising. The Xeon E5-2620 v3 has a maximum DRAM transfer rate of 1866 MT/s, so a single channel has a peak bandwidth of 15.7 GB/s.

With the icc compilers, I typically see 83% to 85% DRAM utilization on STREAM, but these depend on streaming/non-temporal stores that the gcc compiler does not support.

For the STREAM Triad kernel compiled with gcc, the actual memory traffic is 4/3 as large as what I count because the output array has to be read before being overwritten, so the hardware is doing 3 reads and 1 write instead of 2 reads and 1 write. If we take the 10 GB/s and multiply it by 4/3 we get a raw DRAM bandwidth of 13.33 GB/s. Interestingly, this is 85% of the peak bandwidth of 1 DDR4/1866 channel -- exactly matching the DRAM utilization that I would have expected.

HUIZHAN_Y_ · ‎06-12-2015

John D. McCalpin wrote:

I am not sure what to make of several parts of this, but I have noticed poor performance on a Xeon E5 v3 with a single "Home Agent" -- and I see that the Xeon E5-2620 v3 used here also has a single Home Agent.

The single Home Agent system that I tested was populated with the Xeon E5-2603 v3 processor, which is a very slow model, but the performance was much lower than I expected even accounting for the lower DRAM, Core, and Uncore frequencies. The processor was particularly slow with the streaming/non-temporal stores used by the STREAM benchmark and for remote accesses.

It would probably be helpful to run the Intel Memory Latency Checker (https://software.intel.com/en-us/articles/intelr-memory-latency-checker) on this system. Using the bandwidth tests with the various -W options, my system gave results ranging from 23 GB/s to 35 GB/s (out of 51.2 GB/s peak) for local accesses, and results ranging from 3.7 GB/s to 5.6 GB/s for remote accesses. The tests that look like STREAM Triad (-W7 and -W10) were the worst in both cases.

John, what does a single "Home Agent" mean? Is it a single DIMM channel?

HUIZHAN_Y_ · ‎06-12-2015

John D. McCalpin wrote:

If the system has only one DIMM channel populated, then the local bandwidth values are less surprising. The Xeon E5-2620 v3 has a maximum DRAM transfer rate of 1866 MT/s, so a single channel has a peak bandwidth of 15.7 GB/s.

With the icc compilers, I typically see 83% to 85% DRAM utilization on STREAM, but these depend on streaming/non-temporal stores that the gcc compiler does not support.

For the STREAM Triad kernel compiled with gcc, the actual memory traffic is 4/3 as large as what I count because the output array has to be read before being overwritten, so the hardware is doing 3 reads and 1 write instead of 2 reads and 1 write. If we take the 10 GB/s and multiply it by 4/3 we get a raw DRAM bandwidth of 13.33 GB/s. Interestingly, this is 85% of the peak bandwidth of 1 DDR4/1866 channel -- exactly matching the DRAM utilization that I would have expected.

You are right! I see a bandwidth of 13GB/s for a single socket in the VTune bandwidth analysis.

By the way, how do you get a peak bandwidth of 15.7 GB/s from 1866 MT/s?

McCalpinJohn · ‎06-13-2015

I get 15.7 GB/s from 1866 MT/s by typing 1966 * 8 into my calculator.

Ooops.

It was not too terribly far off....
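
Editor's note: the arithmetic in the last few posts can be redone with the corrected input. This is a sketch under stated assumptions: a DDR4 channel moves 8 bytes per transfer, GB means 10^9 bytes, and the function names are made up for this illustration.

```python
BYTES_PER_TRANSFER = 8  # assumption: one 64-bit DDR4 channel, 8 bytes per transfer


def channel_peak_gbs(mega_transfers_per_s):
    """Peak bandwidth of one DRAM channel in GB/s (decimal)."""
    return mega_transfers_per_s * BYTES_PER_TRANSFER / 1000.0


def triad_dram_traffic_gbs(reported_gbs, write_allocate=True):
    """Convert a STREAM Triad result into actual DRAM traffic.

    STREAM counts 2 reads + 1 write per element. Without streaming stores the
    output line is read before being overwritten, giving 3 reads + 1 write.
    """
    return reported_gbs * 4.0 / 3.0 if write_allocate else reported_gbs


if __name__ == "__main__":
    peak = channel_peak_gbs(1866)        # corrected input
    mistyped = channel_peak_gbs(1966)    # the value typed into the calculator
    raw = triad_dram_traffic_gbs(10.0)   # 10 GB/s reported for Triad
    print(f"peak with 1866 MT/s: {peak:.1f} GB/s")
    print(f"peak with 1966 MT/s: {mistyped:.1f} GB/s")
    print(f"raw DRAM traffic: {raw:.2f} GB/s, utilization {raw / peak:.0%}")
```

With 1866 MT/s the single-channel peak comes out at about 14.9 GB/s rather than 15.7 GB/s, so the 13.33 GB/s of raw traffic corresponds to roughly 89% utilization instead of 85%.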