I am developing on a 24-core machine (E5-2697-v2). When I launch a single DGEMM on large matrices (m=n=k=15,000), performance improves as I increase the number of threads, which is expected. For reference, I get about 467 GFLOP/sec using 24 cores.
Next, in an OpenMP parallel region, I have each thread launch an independent call to DGEMM where the matrices are large (m=n=k=15,000). Each thread has its own matrices which are used in its DGEMM. In this case, the overall performance improves as I increase the number of threads, up to a point. With higher numbers of threads, the overall performance decreases. What hardware limitation could be causing this? For reference, here are the performance results I got:
#threads   Overall compute speed (GFLOP/sec)
1          26.3
2          52.6741
3          76.6518
4          102.413
5          124.401
6          148.394
7          168.022
8          190.557
9          210.165
10         232.156
11         249.77
12         271.149
13         291.211
14         313.747
15         327.467
16         349.917
17         361.444
18         377.498
19         346.558
20         368.453
21         356.597
22         319.446
23         301.81
24         277.273
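To make the drop-off easier to see, the figures above can be turned into a parallel efficiency relative to the 1-thread rate. The following is only a sketch of that arithmetic; the names `kAggregateGflops`, `parallel_efficiency` and `best_thread_count` are made up for illustration and are not from the attached test program.

```cpp
#include <cstddef>
#include <vector>

// Aggregate GFLOP/sec as reported above; index i holds the result for i+1 threads.
static const std::vector<double> kAggregateGflops = {
    26.3,    52.6741, 76.6518, 102.413, 124.401, 148.394, 168.022, 190.557,
    210.165, 232.156, 249.77,  271.149, 291.211, 313.747, 327.467, 349.917,
    361.444, 377.498, 346.558, 368.453, 356.597, 319.446, 301.81,  277.273};

// Efficiency relative to perfect scaling of the 1-thread rate:
// aggregate(threads) / (threads * aggregate(1)).
double parallel_efficiency(std::size_t threads) {
    return kAggregateGflops[threads - 1] /
           (static_cast<double>(threads) * kAggregateGflops[0]);
}

// Thread count that gave the highest aggregate rate.
std::size_t best_thread_count() {
    std::size_t best = 1;
    for (std::size_t t = 2; t <= kAggregateGflops.size(); ++t) {
        if (kAggregateGflops[t - 1] > kAggregateGflops[best - 1]) best = t;
    }
    return best;
}
```

With these numbers the peak is at 18 threads, where efficiency is roughly 0.80; at 24 threads it falls to roughly 0.44.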
When you ask for more resources than are available, whether it is threads, memory or CPU time being requested, the OS has to do more work to allocate resources, suspend threads to disk, etc., so other processes can get a chance to progress. The overhead of doing all this management can slow a program if there is too much contention for limited resources.
MKL libraries come in different versions -- sequential and threaded. If your own program, which calls MKL routines, is also multi-threaded, problems can arise if you use the threaded version of MKL.
As mecj asked, could you please provide the link command line, so we can tell whether sequential or threaded MKL is being used, along with the compiler version and similar details?
From your description, in the OpenMP test case each thread runs its own DGEMM on its own large matrices (m=n=k=15,000), whereas the first case is a single DGEMM. The parallel case therefore demands far more hardware and OS resources than the single DGEMM, as mecij explained, and memory bandwidth in particular is likely to be a limit.
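As a rough illustration of the resource demand, the memory footprint and operation count of one such DGEMM can be estimated as below. This is a back-of-the-envelope sketch that assumes square double-precision matrices A, B and C and no extra workspace; the helper names are hypothetical, and the installed memory and NUMA layout of the machine are not stated in this thread.

```cpp
#include <cstdint>

// Bytes needed for the three n-by-n double matrices (A, B, C) of one DGEMM.
std::uint64_t dgemm_bytes(std::uint64_t n) {
    return 3 * n * n * sizeof(double);
}

// Floating-point operations in one n-by-n-by-n DGEMM
// (one multiply and one add per inner-product term).
double dgemm_flops(double n) {
    return 2.0 * n * n * n;
}
```

For n = 15,000 this gives 5.4e9 bytes (5.4 GB) per DGEMM and 6.75e12 floating-point operations, so 24 independent DGEMMs would hold about 129.6 GB of matrices at once. Whether that fits in memory, and on which socket each thread's pages end up, would both be worth checking.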
I simplified my testing routine so that you can see exactly what I am doing. The source code (parallel_issue.cpp) is attached to this post. My compiler is:
icc version 12.1.4 (gcc version 4.4.7 compatibility)
I'm using whichever version of MKL comes with the above. My machine has 2 x E5-2697 v2 @ 2.70GHz processors; each processor has 12 cores, so in total I have 24 CPU cores. I compile the code using:
icc -mkl -openmp parallel_issue.cpp -o parallel_issue
I put the results of my run in a file called "results.txt", also attached to this post. I understand that I am pushing the limits of the hardware and I expect to see performance "noise" as I increase the number of threads. The issue is that in the software that I produce, I have several segments of code similar to the "Multiple MKL DGEMMs called in parallel" section in the testing routine. For my 24-core machine, I seem to get the best performance when I use fewer than 24 threads.
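Since the attachment is not reproduced here, the pattern of independent DGEMMs running in parallel might look roughly like the sketch below. It is not the attached code: `USE_MKL` is a hypothetical opt-in macro (define it only when compiling and linking against MKL), the function names are invented, and a naive triple loop stands in for DGEMM when MKL is not used. It also uses a `parallel for` over jobs rather than a bare parallel region, which is one possible variant.

```cpp
#include <cstddef>
#include <vector>
#ifdef USE_MKL  // hypothetical macro, not an MKL or compiler setting
#include <mkl.h>
#endif

// C = A * B for row-major n-by-n matrices.
void multiply(const double* A, const double* B, double* C, int n) {
#ifdef USE_MKL
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, A, n, B, n, 0.0, C, n);
#else
    // Naive stand-in so the sketch builds without MKL.
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k) sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
#endif
}

// Runs `jobs` independent multiplications, each on its own private matrices.
// Returns C[0] of every job so the caller can check the results.
std::vector<double> run_independent_multiplies(int jobs, int n) {
    std::vector<double> first_entries(jobs, 0.0);
#pragma omp parallel for schedule(static)
    for (int job = 0; job < jobs; ++job) {
        // Allocated and filled inside the loop body, so the thread that
        // computes on the data is also the first to touch it.
        const std::size_t elems = static_cast<std::size_t>(n) * n;
        std::vector<double> A(elems, 1.0);
        std::vector<double> B(elems, 2.0);
        std::vector<double> C(elems, 0.0);
        multiply(A.data(), B.data(), C.data(), n);
        first_entries[job] = C[0];
    }
    return first_entries;
}
```

Calling `run_independent_multiplies(24, 15000)` would correspond to the 24-thread case described above; with the fill values used here every entry of C equals 2n, which gives a cheap correctness check.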
I have tried several different arrangements of threads to see if I can ever get peak performance with 24 threads. One arrangement is similar to the "Multiple MKL DGEMMs called in nested parallel" section in the testing routine. It seems like in every arrangement of threads, the maximum number of threads (24) gives the WORST performance.
Of course, I would welcome any recommendations on how to get better performance with 24 threads. Other than that, I would like some kind of description of what hardware limitation I am encountering. Then, I can explain succinctly to users why I might recommend that if they have a 24-core machine, they perhaps might want to use fewer than 24 cores to get the best performance of the production software.
Doesn't OpenMP 4 support setting the number of threads at each level under OMP_NESTED, e.g. OMP_NUM_THREADS=6,4 (using 24 threads total)? It should then help to set OMP_PROC_BIND=spread,close. You would expect the best results when each instance of DGEMM gets its own group of cores (restricted to one CPU). Does adding the KMP_AFFINITY=verbose environment setting give you the expected listing of where the threads are assigned?
Failure to scale past some number of cores is typical of not setting appropriate affinity. It's likely to be important to use contiguous blocks of cores rather than random scattering across the 2 CPUs, and to encourage each thread to stay on one core and its associated cache.
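The same two-level layout can also be requested in code with `num_threads` and `proc_bind` clauses instead of environment variables. The following is a minimal sketch under that assumption; `nested_team_sizes` is a made-up helper that only records the inner team size each outer thread actually got, and the `_OPENMP` guards let it build (serially) when OpenMP is not enabled.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Starts `outer` threads spread across the machine, each of which opens an
// inner team of `inner` threads placed close to it. Returns the inner team
// size observed by each outer thread (0 if that outer thread never ran).
std::vector<int> nested_team_sizes(int outer, int inner) {
    std::vector<int> sizes(outer, 0);
#ifdef _OPENMP
    omp_set_max_active_levels(2);  // allow the inner regions to be parallel
#endif
#pragma omp parallel num_threads(outer) proc_bind(spread)
    {
        int outer_id = 0;
#ifdef _OPENMP
        outer_id = omp_get_thread_num();
#endif
#pragma omp parallel num_threads(inner) proc_bind(close)
        {
            int team = 1;
#ifdef _OPENMP
            team = omp_get_num_threads();
#endif
            // One thread of each inner team records the team size.
#pragma omp single
            sizes[outer_id] = team;
        }
    }
    return sizes;
}
```

`nested_team_sizes(6, 4)` mirrors OMP_NUM_THREADS=6,4. The runtime may deliver fewer threads than requested, so it is worth inspecting the returned sizes rather than assuming 6 x 4.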
Thanks, Tim Prince. Your comment about OpenMP 4 was very useful. I had been compiling with a compiler version that does not support OpenMP 4.0. I compiled "parallel_issue.cpp" with a newer version:
icc version 14.0.3 (gcc version 4.4.7 compatibility)
and my new results are quite favorable (see attached).
Unfortunately, for my production code, I cannot rely on a user to set OMP_PROC_BIND appropriately. So, I used the clause "proc_bind" in:
#pragma omp parallel proc_bind(spread)
and my production code's performance is dramatically improved. I see a runtime routine called "omp_get_proc_bind", but I don't see an "omp_set_proc_bind". Will there eventually be an "omp_set_proc_bind" capability?
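For what it's worth, the query routine can at least report which binding policy is in effect, for example to log it at startup. A minimal sketch follows; `current_proc_bind` is a hypothetical helper name, and the fallback branch exists only so the code builds when OpenMP is not enabled.

```cpp
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns a readable name for the value reported by omp_get_proc_bind().
std::string current_proc_bind() {
#ifdef _OPENMP
    switch (omp_get_proc_bind()) {
        case omp_proc_bind_false:  return "false";
        case omp_proc_bind_true:   return "true";
        case omp_proc_bind_close:  return "close";
        case omp_proc_bind_spread: return "spread";
        default:                   return "master/primary";
    }
#else
    return "OpenMP not enabled";
#endif
}
```

Note that this reports the policy that would apply to a parallel region without its own `proc_bind` clause; a clause such as `proc_bind(spread)` on the region itself still takes precedence for that region.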