Software Archive
Read-only legacy content

Test memory bandwidth on KNL

Zekun_Y_
Beginner

I have tested the memory bandwidth of both the DDR4 and the MCDRAM using STREAM, but I found something that I really can't understand.

When I run the STREAM test with one thread, the bandwidth is 16 GB/s, which seems correct. But when I increase the thread count to two, the result is still about 16 GB/s. The DDR4 on KNL has six channels, and I have to use 12 or more threads to get the best bandwidth. Why? Even when I use KMP_AFFINITY=scatter to place the two threads on different cores, the result is still the same.

Does this have anything to do with the mesh architecture? There are two cores in one tile. Is it possible that the two cores in a tile share one channel?

Does anyone know anything about this? I hope I can get some answers.

Thank you!

1 Solution
jimdempseyatthecove
Honored Contributor III

On KNL, two cores share an L2 cache. Try KMP_PLACE_THREADS or OMP_PLACES, or:

with scatter, start 3 threads and run the test only on OMP threads 0 and 2 (i.e., have OMP thread 1 skip the test).

Jim Dempsey

Zekun_Y_
Beginner

Hi Jim,

Thank you for the reply. I agree that the shared L2 cache within a tile may cause this. I used KMP_AFFINITY to bind the two threads to different tiles, and the memory bandwidth nearly doubled.
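One way to express that binding (a sketch; the core numbers and binary name are assumptions -- on KNL, consecutive core IDs typically share a tile, so check with `cpuinfo` or `lstopo` which cores share an L2 on your system):

```shell
# Pin two OpenMP threads to cores on different tiles (different L2s).
# Verify first that cores 0 and 2 really are on different tiles.
export OMP_NUM_THREADS=2
export KMP_AFFINITY="granularity=core,proclist=[0,2],explicit"
./stream    # your STREAM binary
```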

jimdempseyatthecove
Honored Contributor III

Hi Zekun,

This is part of the learning experience with KNL. I've been using KNL for only a month now, so I am still learning the nuances of running on it, especially when to use the MCDRAM as cache versus as 1, 2, or 4 NUMA nodes.

I suspect that what works best with STREAM is not necessarily what will work best with your application(s).

Jim Dempsey

Zekun_Y_
Beginner

Hi Jim,

Yeah, there are several combinations of cluster modes and memory modes on KNL. We have run tests on different combinations, but we are still not sure which combination to use for which types of applications.

The application I'm currently working on is memory bound, so I want to get the full bandwidth of the MCDRAM or DDR4 with fewer threads. I think the STREAM test should be helpful for this particular application.

Thank you so much for the discussion.

Zekun

McCalpinJohn
Honored Contributor III

As you have already seen, you will get the best bandwidth for low thread counts if you spread them out to one thread per tile.

I ran STREAM using 16 threads (one per tile) and tried 20 different random selections of the tiles to see if there was any performance impact from the selection of tiles used, and found no evidence of systematic variation for different tile sets.  The overall variation across the 60 results (3 trials each of 20 permutations) was only about 1%.  There was also no significant difference between just using the first 16 tiles (cores 0,2,4,...,30) and using a random selection of 16 tiles -- the Triad values from using the first 16 tiles were close to the median of the distribution of the results from 16 randomly selected tiles.

For 16 threads and one thread per tile, my test case gave STREAM Triad values in the 211 GB/s range.  This is 13 GB/s per tile -- only a very small drop compared to the 13.8 GB/s that I get with a single thread with this same code, same array size, and same alignments.

The performance definitely drops when running the 16 threads on 8 tiles -- down to about 139 GB/s.   This is 17 GB/s per tile, or 8.5 GB/s per core.   The per-tile number is probably higher here because only 8 tiles are competing for memory bandwidth instead of 16.

If you can use significantly more than 16 threads, the next obvious step is one thread on every tile, so there is no longer any need to worry about whether some subsets of the tiles perform better than others, since you will be using all of them.  Running one thread per tile (34 threads on a Xeon Phi 7250), I get STREAM Triad values in the 393 GB/s range.  This is 11.6 GB/s per tile -- only about an 11% slowdown per tile compared to the 16-tile case, and only about 16% slower per tile than the single-thread case.

For comparison, this particular selection of array sizes and alignments gives 427 GB/s for STREAM Triad using 68 threads on 34 tiles (this is on a Xeon Phi 7250 in Flat-Quadrant mode with transparent huge pages disabled).  Performance should be about 10% higher in the default configuration with transparent huge pages enabled -- I got 471 GB/s for this binary (same array sizes and alignments) using 68 threads on 34 tiles with transparent huge pages enabled.
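For anyone trying to reproduce this kind of run, a sketch of one possible launch in Flat-Quadrant mode (the binary name and NUMA node number are assumptions; in Flat mode the MCDRAM typically appears as NUMA node 1 -- confirm with `numactl -H` on your system):

```shell
# One thread per core on a 68-core Xeon Phi 7250, arrays in MCDRAM.
export OMP_NUM_THREADS=68
export KMP_AFFINITY="granularity=core,scatter"
numactl --membind=1 ./stream   # node 1 = MCDRAM in Flat mode (verify with numactl -H)
```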

Zekun_Y_
Beginner

Hi John,

Thanks for the reply. The test results and analysis are very helpful for understanding the KNL memory architecture and getting better performance, especially for memory-bound applications. I think the combinations of memory modes and cluster modes should be discussed next; I have not seen any performance white paper about this.

Best regards,

Zekun

Loc_N_Intel
Employee

Hi Zekun,

For your information, in the Intel(R) Xeon Phi(TM) processor landing page (https://software.intel.com/en-us/xeon-phi/x200-processor), under the Recipes and Benchmarks section there is a growing number of recipe whitepapers.

In particular the "Optimizing Memory Bandwidth on Intel(R) Xeon Phi(TM) processors on Stream Triad" whitepaper (https://software.intel.com/en-us/articles/optimizing-memory-bandwidth-in-knights-landing-on-stream-triad) shows how to obtain peak memory bandwidth performance using STREAM.

Thank you.

Zekun_Y_
Beginner

Hi Loc,

It's so nice of you to provide this information. I do need these recipes and benchmarks to understand optimization techniques on KNL.

Thank you very much

Zekun
