I am trying to test the scalability of ZGEMM on two machines with different Intel processors: Xeon Gold 6240 (with 36 threads) and Xeon Platinum 8160 (with 48 threads).
The matrix dimensions are M=1281, N=1281 and K=38400.
On the Xeon Gold 6240, I get a speedup of 21 with 36 threads (58%), while on the Xeon Platinum 8160, I get a speedup of 39 with 48 threads (81%).
Why is the scalability so poor on the Xeon Gold 6240? Is this normal behaviour?
I've been looking at the hardware differences between the two processors and I can't see what the key factor behind this difference could be.
Hi Rogeli,
Thanks for reaching out to us.
Could you please provide us with the complete sample reproducer code and commands used to compile the code so that we can do a quick check from our end as well?
Also, please let us know the MKL version that you are using and your OS environment details.
>>On the Xeon Gold 6240, I get a speedup of 21 with 36 threads (58%), while on the Xeon Platinum 8160, I get a speedup of 39 with 48 threads (81%).
Could you please help us with the details on how you are calculating the speedup and share with us the timings that you are getting on both processors?
Please get back to us with all the necessary details so that it would help us to proceed further in this case.
Regards,
Vidya.
Dear Vidya
Thanks for answering
Find attached the executed experiment (file zgemm.F90).
This is what I have done on the Xeon Gold 6240:
- Compilation:
 $> source /mnt/beegfs2018/app/intel-oneapi-2021/compiler/2021.4.0/env/vars.sh
 $> source /mnt/beegfs2018/app/intel-oneapi-2021/mkl/2021.4.0/env/vars.sh
 $> make
 ifort -O3 -qopenmp -I/mnt/beegfs2018/app/intel-oneapi-2021/mkl/2021.4.0/include/ -I/mnt/beegfs2018/app/intel-oneapi-2021/mkl/2021.4.0/include/intel64/lp64 -o zgemm_ex zgemm.F90 -L/mnt/beegfs2018/app/intel-oneapi-2021/mkl/2021.4.0/lib/intel64/ -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
- Execution:
 #!/bin/bash
 #SBATCH --job-name=zgemm
 #SBATCH --time=0-12:00:00
 #SBATCH --partition=cpu36memory384
 #SBATCH --ntasks=1
 #SBATCH --cpus-per-task=36
 #SBATCH --exclusive
 cd ${SLURM_SUBMIT_DIR}
 source /mnt/beegfs2018/app/intel-oneapi-2021/compiler/2021.4.0/env/vars.sh
 source /mnt/beegfs2018/app/intel-oneapi-2021/mkl/2021.4.0/env/vars.sh
 M=1281 N=1281 K=38400 ./zgemm_ex &> out.6240.txt
- Results file: times.6240.txt
There are small differences for the Xeon Platinum 8160:
- Compilation:
 $> module load intel/2020.1 mkl/2021.4 
 $> make
 ifort -O3 -qopenmp -I/apps/INTEL/oneapi/2021.4/mkl/2021.4.0//include/ -I/apps/INTEL/oneapi/2021.4/mkl/2021.4.0//include/intel64/lp64 -o zgemm_ex zgemm.F90 -L/apps/INTEL/oneapi/2021.4/mkl/2021.4.0//lib/intel64/ -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
- Execution:
 #!/bin/bash 
 #SBATCH --job-name=zgemm
 #SBATCH --output=par.out
 #SBATCH --error=par.err
 #SBATCH --ntasks=1
 #SBATCH --cpus-per-task=48
 #SBATCH --time=00:30:00
 #SBATCH --qos=debug
 #SBATCH --exclusive
 module load intel/2020.1 mkl/2021.4
 M=1281 N=1281 K=38400 ./zgemm_ex &> out.8160.txt
- Results file: times.8160.txt
The speedup is computed independently on each machine:
SpeedUp( X threads ) = Time( 1 thread )/Time( X threads )
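For reference, the formula above and the efficiency percentages quoted earlier in the thread can be reproduced with a small Python sketch. The function names are illustrative only and are not taken from the posted zgemm.F90; the speedup values are the ones stated in the first post.

```python
def speedup(t_one: float, t_n: float) -> float:
    """Speedup of an n-thread run relative to the 1-thread run on the same machine."""
    return t_one / t_n


def efficiency(t_one: float, t_n: float, n_threads: int) -> float:
    """Parallel efficiency: speedup divided by the number of threads used."""
    return speedup(t_one, t_n) / n_threads


# Speedups and thread counts as quoted in the first post of the thread.
quoted = {"Xeon Gold 6240": (21, 36), "Xeon Platinum 8160": (39, 48)}
for cpu, (s, n) in quoted.items():
    print(f"{cpu}: speedup {s} on {n} threads -> efficiency {s / n:.1%}")
```

Because each machine is normalised by its own one-thread time, a slow one-thread run makes the speedup curve look better without making the parallel runs any faster.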
So better scalability does not mean a lower execution time. In fact, for the same number of threads, we always get better times on the Xeon Gold 6240:
If we go back to the scalability graph, the behaviour of the Xeon Platinum 8160 looks better.
Looking at these two graphs, I have a couple of questions:
- Why is zgemm so much faster in the sequential case on the Xeon Gold 6240? The frequency difference does not explain such a large gap.
- Why do we get a better speedup on the 8160? I assume that the experiment on the 6240 puts more stress on the cache hierarchy and the main memory.
Please let me know if you need more information.
Cheers
Hi Rogeli,
Thanks for providing the details.
>>Why is zgemm so much faster in the sequential case
Could you please provide us with the timings that you are getting in the sequential case here?
Also, I'm attaching the results that I got when I ran the code on my end on Gold and Platinum processors.
Regards,
Vidya.
Hi Vidya
You can find all the timings in the files attached to my previous post (including the time for one thread). In your experiment the time with one thread is equal to the sequential time. What did you do to get the sequential time?
These are the times of my experiments with one thread:
- Gold 6240: 4.7412997500000005
- Platinum 8160: 8.15284851
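To put these one-thread times in perspective, they can be converted into an achieved floating-point rate. The sketch below assumes the times are in seconds and that each time covers a single ZGEMM call with the M, N, K given earlier in the thread (neither is stated explicitly). It uses the usual convention of 8 real operations per complex multiply-add; the helper names are made up for illustration.

```python
def zgemm_flops(m: int, n: int, k: int) -> int:
    """Real floating-point operations for a complex C = A*B of shape (m x k)(k x n).

    Each complex multiply-add is counted as 8 real operations
    (4 multiplications and 4 additions).
    """
    return 8 * m * n * k


def gflops_per_second(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved rate in GFLOP/s for one ZGEMM call that took `seconds`."""
    return zgemm_flops(m, n, k) / seconds / 1e9


M, N, K = 1281, 1281, 38400
# One-thread times reported in the post (assumed to be seconds per call).
one_thread_times = {"Gold 6240": 4.7412997500000005, "Platinum 8160": 8.15284851}
for cpu, t in one_thread_times.items():
    print(f"{cpu}: {gflops_per_second(M, N, K, t):.1f} GFLOP/s on one thread")
```

Comparing these rates against each core's expected peak is one way to judge whether the one-thread baseline itself, rather than the multi-threaded runs, is what differs between the two machines.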
This plot shows your timings together with mine (logarithmic scale on the right):
And the speedup plot:
What I can see is that the speedup behaviour in your experiments is similar to what I found on the Platinum 8160. On the other hand, the Gold 6240 is the fastest processor when the number of threads is under 10, but it struggles when the number of threads is greater.
Regards
Rogeli
Hi Rogeli,
Thanks for the details.
I tried running the code on a Xeon Gold 6346 (32 threads) and a Xeon Platinum 8360 (72 threads). Using your formula (please let me know if the calculation is incorrect), I get a speedup of 27 with 32 threads (84%) on the Gold and a speedup of 53 with 72 threads (74%) on the Platinum.
Could you please let us know if this is what you are expecting with a Xeon Gold processor?
>>In your experiment the time with one thread is equal to the sequential time. What did you do to get the sequential time?
There is a quick linking option for MKL with the Intel compilers:
-qmkl=sequential: tells the compiler to link using the sequential libraries in oneMKL.
-qmkl=parallel: tells the compiler to link using the threaded libraries in oneMKL.
Regards,
Vidya.
Dear Vidya
These are my calculations. Efficiency:
- Gold 6240: 62.26% with 36 threads
- Platinum 8160: 80.83% with 48 threads
- Gold 6346: 84.46% with 32 threads
- Platinum 8360: 72.97% with 72 threads
The efficiency of the Gold 6240 is poor compared with all the other processors. On the other hand, it is faster than the Gold 6346 when the number of threads is under 10.
I am wondering whether this is normal behaviour for the Gold 6240, or maybe I am having trouble pinning the threads to the available processors.
Cheers
Rogeli
Hi Rogeli,
>>maybe I am having trouble pinning the threads to the available processors.
If that is the case, maybe you can refer to the below link which has details about binding threads to the CPU cores.
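As a rough illustration of what such binding can look like, the sketch below builds an environment with the standard OpenMP variables OMP_NUM_THREADS, OMP_PLACES and OMP_PROC_BIND. It is not the content of the link mentioned above. The helper name is hypothetical, and whether the posted zgemm_ex sets its own thread count internally is not known from the thread, so the launch line is left as a comment.

```python
import os


def pinned_openmp_env(n_threads: int, spread: bool = False) -> dict:
    """Return a copy of the current environment with OpenMP binding settings.

    OMP_PLACES=cores defines one place per physical core; OMP_PROC_BIND
    chooses whether threads are packed ("close") or distributed ("spread").
    """
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(n_threads)
    env["OMP_PLACES"] = "cores"
    env["OMP_PROC_BIND"] = "spread" if spread else "close"
    return env


env = pinned_openmp_env(36)
print({k: env[k] for k in ("OMP_NUM_THREADS", "OMP_PLACES", "OMP_PROC_BIND")})

# Hypothetical launch of the benchmark from the thread with these settings:
# import subprocess
# subprocess.run(["./zgemm_ex"], env={**env, "M": "1281", "N": "1281", "K": "38400"})
```

Running the same thread count once with "close" and once with "spread" is a cheap way to see whether placement across sockets changes the timings.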
Regards,
Vidya.
Hi Rogeli,
As we haven't heard back from you, could you please provide us with an update regarding the issue?
Regards,
Vidya.
Hi Rogeli,
As we haven't heard back from you, we are closing this thread for now. Please post a new question if you need any additional assistance from Intel as this thread will no longer be monitored.
Regards,
Vidya.
Dear Vidya
Recently I had the chance to run the experiment on an Intel Xeon 6148 and I get results similar to those of the Intel Gold 6240:
- Very good performance and scalability when the number of threads is under 10.
- Bad scalability beyond that point.
Anyway, I still don't know the reason for this difference in behaviour.
Cheers
Rogeli
Hi Rogeli,
Could you please try running the code with a larger work size and let us know the results?
I'm attaching the results that I got on an Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz.
And here are the comments from the development team regarding your issue:
"There are many factors behind this behavior. To keep the same efficiency with twice as many cores:
> Core frequency should remain the same.
> Required cache/memory bandwidth will double, because performance will be twice as fast.
> Each core should still have a large enough workload.
In this case, some of these conditions may not be met, which results in the unexpected efficiency.
Activating a larger number of cores needs more power, but power is limited by TDP (Thermal Design Power). Core frequency is reduced as the number of active cores increases.
The bandwidth of main memory and between processors is limited and may be saturated. Performance is then limited by bandwidth rather than by core performance.
The customer's problem size is too small to work on a large number of cores. In addition, it requires more bandwidth because cached data will not be reused.
Behaviors differ based on power budget (TDP), available memory bandwidth and number of active cores."
So activating a larger number of cores, limited bandwidth and a problem size too small for a large number of cores would all cause poor scalability.
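The frequency argument can be turned into a simple bound. If the only thing that changes between the one-thread run and the all-core run is the clock, the efficiency cannot exceed the ratio of the two frequencies. The sketch below is a simplified model under that assumption, and the frequencies in it are hypothetical placeholders, not measured values or datasheet figures for any of the processors in this thread.

```python
def frequency_limited_efficiency(freq_one_core_ghz: float, freq_all_cores_ghz: float) -> float:
    """Upper bound on parallel efficiency when only the clock frequency changes."""
    return freq_all_cores_ghz / freq_one_core_ghz


def frequency_limited_speedup(n_cores: int, freq_one_core_ghz: float, freq_all_cores_ghz: float) -> float:
    """Best possible speedup on n_cores under the same assumption."""
    return n_cores * frequency_limited_efficiency(freq_one_core_ghz, freq_all_cores_ghz)


# Hypothetical clocks: 3.0 GHz with one active core, 2.0 GHz with all cores active.
f_single, f_all = 3.0, 2.0
print(f"efficiency bound: {frequency_limited_efficiency(f_single, f_all):.1%}")
print(f"speedup bound on 18 cores: {frequency_limited_speedup(18, f_single, f_all):.1f}")
```

A processor whose one-core clock is much higher than its all-core clock will show a fast baseline and a lower efficiency at full thread count, even with no bandwidth limit at all.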
I hope this answers your question.
Regards,
Vidya.
Dear Vidya
> Could you please try running the code with a larger work size and let us know the results?
> I'm attaching the results that I got on an Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz.
I have repeated your experiment (M=10000, N=10000 and K=10000) and I get very similar results:
> Activating a larger number of cores needs more power, but power is limited by TDP (Thermal Design Power). Core frequency is
> reduced as the number of active cores increases.
It makes sense, but this is the first time that I have read about this kind of behaviour. Is this behaviour documented anywhere? Do we have the same kind of limitation in the "Platinum family"? Is there any way to check the current frequency of the CPU?
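On Linux, one simple way to look at the current clock is the "cpu MHz" field that x86 kernels report in /proc/cpuinfo. The sketch below parses that field; the function names are illustrative, and it assumes an x86 Linux node like the ones used in this thread. The values are snapshots, so they would need to be sampled while the benchmark is running to show any reduction under load.

```python
from pathlib import Path


def parse_cpu_mhz(cpuinfo_text: str) -> list:
    """Extract the per-CPU 'cpu MHz' values from /proc/cpuinfo-style text."""
    values = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu MHz":
            values.append(float(value))
    return values


def current_cpu_mhz(path: str = "/proc/cpuinfo") -> list:
    """Read the current per-CPU frequencies (in MHz) from the given file."""
    return parse_cpu_mhz(Path(path).read_text())


if __name__ == "__main__":
    if Path("/proc/cpuinfo").exists():
        mhz = current_cpu_mhz()
        if mhz:
            print(f"{len(mhz)} logical CPUs, min {min(mhz):.0f} MHz, max {max(mhz):.0f} MHz")
```

Sampling this once with a single thread busy and once with all threads busy would show directly whether the clock drops as more cores become active.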
Regards
Rogeli
Hi Rogeli,
As we haven't heard back from you, could you please provide us with an update regarding the information provided in the previous post?
Regards,
Vidya.
Hi Rogeli,
Please refer to the link below for more details regarding the Intel core processors.
Hope the information provided above helps.
Please do let us know if we can close this thread from our end if there are no issues.
Regards,
Vidya.
Hi Rogeli,
As we haven't heard back from you, we are closing this thread. Please post a new question if you need any additional assistance from Intel as this thread will no longer be monitored.
Regards,
Vidya.
 
					
				
				
			
		