I am trying to test the scalability of ZGEMM on two machines with different Intel processors: a Xeon Gold 6240 (36 threads) and a Xeon Platinum 8160 (48 threads).
The matrix sizes are M=1281, N=1281 and K=38400.
On the Xeon Gold 6240 I get a speedup of up to 21 with 36 threads (58% efficiency), while on the Xeon Platinum 8160 I get a speedup of 39 with 48 threads (81%).
Why do I have this poor scalability on the Xeon Gold 6240? Is this normal behaviour?
I've been looking at the hardware differences between the two processors and I can't see what the key factor behind this difference could be.
Thanks for reaching out to us.
Could you please provide us with the complete sample reproducer code and commands used to compile the code so that we can do a quick check from our end as well?
Also please do let us know the MKL version that you are using and your OS environment details.
>>On the Xeon Gold 6240 I get a speedup of up to 21 with 36 threads (58% efficiency), while on the Xeon Platinum 8160 I get a speedup of 39 with 48 threads (81%).
Could you please tell us how you are calculating the speedup and share the timings that you are getting on both processors?
Please get back to us with all the necessary details so that we can proceed further with this case.
Thanks for answering
Please find attached the experiment I ran (file zgemm.F90).
This is what I did on the Xeon Gold 6240:
$> source /mnt/beegfs2018/app/intel-oneapi-2021/compiler/2021.4.0/env/vars.sh
$> source /mnt/beegfs2018/app/intel-oneapi-2021/mkl/2021.4.0/env/vars.sh
ifort -O3 -qopenmp -I/mnt/beegfs2018/app/intel-oneapi-2021/mkl/2021.4.0/include/ -I/mnt/beegfs2018/app/intel-oneapi-2021/mkl/2021.4.0/include/intel64/lp64 -o zgemm_ex zgemm.F90 -L/mnt/beegfs2018/app/intel-oneapi-2021/mkl/2021.4.0/lib/intel64/ -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
M=1281 N=1281 K=38400 ./zgemm_ex &> out.6240.txt
- Results file: times.6240.txt
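For reference, the sweep over thread counts can be scripted as below. The loop and the thread counts are my assumption of how the runs were driven; only the binary name, the environment variables and the output file come from the commands above:

```shell
# Hypothetical driver for the scalability sweep (assumed, not the
# original script): run zgemm_ex with an increasing thread count.
for t in 1 2 4 8 16 24 32 36; do
    OMP_NUM_THREADS=$t M=1281 N=1281 K=38400 ./zgemm_ex >> out.6240.txt
done
```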
There are small differences for the Xeon Platinum 8160:
$> module load intel/2020.1 mkl/2021.4
ifort -O3 -qopenmp -I/apps/INTEL/oneapi/2021.4/mkl/2021.4.0//include/ -I/apps/INTEL/oneapi/2021.4/mkl/2021.4.0//include/intel64/lp64 -o zgemm_ex zgemm.F90 -L/apps/INTEL/oneapi/2021.4/mkl/2021.4.0//lib/intel64/ -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
M=1281 N=1281 K=38400 ./zgemm_ex &> out.8160.txt
- Results file: times.8160.txt
The speedup is computed independently on each machine:
SpeedUp( X threads ) = Time( 1 thread )/Time( X threads )
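As a minimal sketch of that formula (the timings below are made-up placeholders, not measurements from either machine):

```python
# Speedup and parallel efficiency from per-thread-count timings.
def speedup(t1, tx):
    """SpeedUp(X threads) = Time(1 thread) / Time(X threads)."""
    return t1 / tx

def efficiency(t1, tx, x):
    """Efficiency = speedup divided by the number of threads."""
    return speedup(t1, tx) / x

# Placeholder timings in seconds, keyed by thread count.
times = {1: 36.0, 18: 2.4, 36: 1.6}
t1 = times[1]
for x, tx in sorted(times.items()):
    print(f"{x:2d} threads: speedup {speedup(t1, tx):5.2f}, "
          f"efficiency {efficiency(t1, tx, x):6.1%}")
```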
So better scalability does not mean a lower execution time. In fact, with the same number of threads we always get better times on the Xeon Gold 6240:
If we recall the scalability graph, the behaviour of the Xeon Platinum 8160 looks better.
Looking at these two graphs, I have a couple of questions:
- Why is zgemm so much faster in the sequential case? The frequency difference does not explain such a large gap.
- Why do we get a better speedup on the 8160? I assume that the experiment stresses the cache hierarchy and the main memory more on the 6240.
Please let me know if you need more information.
Thanks for providing the details.
>>Why is zgemm so much faster in the sequential case
Could you please provide us with the timings that you are getting in the sequential case here?
Also, I'm attaching the results that I got when I tried running the code at my end on the Gold and Platinum processors.
You may find all the timings in the files attached to my previous post (including the time for one thread). In your experiment, the time with one thread equals the sequential time. What did you do to get the sequential time?
These are the times of my experiments with one thread:
- Gold 6240: 4.74129975 s
- Platinum 8160: 8.15284851 s
This plot shows your timings together with mine (logarithmic scale on the right):
And the speedup plot:
What I can see is that the speedup behaviour of your experiments is similar to what I found on the Platinum 8160. On the other hand, the Gold 6240 is the fastest processor when the number of threads is under 10, but it has problems when the number of threads is greater.
Thanks for the details.
I tried running the code on a Xeon Gold 6346 (32 threads) and a Xeon Platinum 8360 (72 threads), and as per the calculation formula (please let me know if the calculation is incorrect), it gives a speedup of 27 with 32 threads (84%) on the Gold and a speedup of 53 with 72 threads (74%) on the Platinum.
Could you please let us know if this is what you are expecting from a Xeon Gold processor?
>>In your experiment, the time with one thread equals the sequential time. What did you do to get the sequential time?
There is a quick linking option for MKL with the Intel compilers:
- -qmkl=sequential tells the compiler to link against the sequential oneMKL libraries.
- -qmkl=parallel tells the compiler to link against the threaded oneMKL libraries.
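For example, assuming the same zgemm.F90 and an oneAPI environment already sourced, the two link modes differ only in that flag (the explicit include and library paths from the original command line are omitted here):

```shell
# Threaded oneMKL (equivalent to the explicit -lmkl_intel_thread link line):
ifort -O3 -qopenmp -qmkl=parallel -o zgemm_par zgemm.F90
# Sequential oneMKL: a true single-threaded baseline for the speedup formula.
ifort -O3 -qmkl=sequential -o zgemm_seq zgemm.F90
```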
These are my calculations. Efficiency:
- Gold 6240: 62.26% with 36 threads
- Platinum 8160: 80.83% with 48 threads
- Gold 6346: 84.46% with 32 threads
- Platinum 8360: 72.97% with 72 threads
The efficiency of the Gold 6240 is poor compared with all the other processors. On the other hand, it is faster than the Gold 6346 when the number of threads is under 10.
I am wondering whether this is normal behaviour for the Gold 6240, or whether I am having trouble pinning the threads to the available cores.
>>maybe I am having trouble pinning the threads to the available cores
If that is the case, you can refer to the link below, which has details about binding threads to CPU cores.
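For instance, with the Intel OpenMP runtime the pinning can be made explicit before launching the benchmark; these settings are a common starting point rather than a verified fix for this case:

```shell
# Standard OpenMP affinity controls: one place per physical core,
# threads packed close to each other.
export OMP_PLACES=cores
export OMP_PROC_BIND=close
# Intel-runtime alternative (takes precedence if set); 'verbose'
# prints the resulting thread-to-core binding at startup.
export KMP_AFFINITY=granularity=core,compact,verbose
M=1281 N=1281 K=38400 ./zgemm_ex
```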
As we haven't heard back from you, we are closing this thread for now. Please post a new question if you need any additional assistance from Intel as this thread will no longer be monitored.
Recently I had the chance to run the experiment on an Intel Xeon 6148 and I get results similar to the Intel Gold 6240:
- Very good performance and scalability when the number of threads is under 10.
- Bad scalability beyond that point.
In any case, I don't know the reason for this different behaviour.
Could you please try running the code with a larger work size and let us know the results?
I'm attaching the results that I got on an Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz.
And here are the comments from the development team regarding your issue:
"There are many factors behind this behaviour. To keep the same efficiency with twice the number of cores:
> Core frequency should remain the same.
> The required cache/memory bandwidth will double, because performance will be twice as fast.
> Each core must still have a large enough workload.
In this case, some of these conditions may not be met, which leads to the unexpected efficiency.
Activating a larger number of cores needs more power, but power is limited by the TDP (Thermal Design Power). Core frequency will be reduced as the number of active cores increases.
The bandwidth of main memory and of the links between processors is limited and may become saturated. Performance will then be limited by bandwidth rather than by core performance.
The customer's problem size is too small to scale to a large number of cores. In addition, it requires more bandwidth because cached data will not be reused.
The behaviour differs based on the power budget (TDP), the available memory bandwidth and the number of active cores."
So a larger number of active cores, limited bandwidth, and a problem size that is too small for the number of cores can all cause poor scalability.
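To make the bandwidth point concrete, here is a back-of-the-envelope estimate for the sizes used in this thread (my own sketch; double-complex elements are 16 bytes, and a complex multiply-add counts as roughly 8 real flops):

```python
# Working-set and flop estimate for ZGEMM with M=N=1281, K=38400.
M, N, K = 1281, 1281, 38400
BYTES = 16  # size of one double-complex element

a_mb = M * K * BYTES / 1e6   # A is M x K
b_mb = K * N * BYTES / 1e6   # B is K x N
c_mb = M * N * BYTES / 1e6   # C is M x N
gflop = 8 * M * N * K / 1e9  # ~8 real flops per complex multiply-add

print(f"A: {a_mb:.0f} MB, B: {b_mb:.0f} MB, C: {c_mb:.1f} MB")
print(f"inputs total ~{a_mb + b_mb:.0f} MB, far beyond any L3 cache")
print(f"work: ~{gflop:.0f} Gflop")
```

With roughly 1.5 GB of input streamed from memory for only ~500 Gflop of work, the run is bandwidth-hungry, which matches the comment above that cached data will not be reused.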
I hope this answers your question.
> Could you please try running the code with a larger work size and let us know the results?
> I'm attaching the results that I got on Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
I have repeated your experiment (M=10000, N=10000 and K=10000) and I get very similar results:
> Activating a larger number of cores needs more power, but power is limited by the TDP (Thermal Design Power). Core frequency will be reduced as the number of active cores increases.
It makes sense, but this is the first time I have read about this kind of behaviour. Is it documented anywhere? Do we have the same kind of limitation in the Platinum family? Is there any way to check the current frequency of the CPU?
Please refer to the link below for more details regarding Intel processors.
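As a sketch for the frequency question: on Linux, the current per-core clock can be read from /proc/cpuinfo (the parsing below assumes the usual "cpu MHz" field format; `turbostat` gives more detailed data if available):

```python
# Read current per-core frequencies from /proc/cpuinfo on Linux.
def parse_mhz(cpuinfo_text):
    """Extract every 'cpu MHz' value from /proc/cpuinfo text."""
    freqs = []
    for line in cpuinfo_text.splitlines():
        if line.startswith("cpu MHz"):
            freqs.append(float(line.split(":")[1]))
    return freqs

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            freqs = parse_mhz(f.read())
    except OSError:
        freqs = []
    if freqs:
        print(f"{len(freqs)} logical CPUs, "
              f"min {min(freqs):.0f} MHz, max {max(freqs):.0f} MHz")
    else:
        print("no 'cpu MHz' entries found (non-Linux system?)")
```

Running this (or `watch -n1 'grep "cpu MHz" /proc/cpuinfo'`) while the benchmark is active shows whether the clock drops as more cores become busy.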
Hope the information provided above helps.
Please do let us know if we can close this thread from our end if there are no issues.
As we haven't heard back from you, we are closing this thread. Please post a new question if you need any additional assistance from Intel as this thread will no longer be monitored.