Intel® oneAPI Math Kernel Library

ZGEMM scalability in Xeon Gold 6240 and Xeon Platinum 8160

rgrima
Beginner

I am trying to test the scalability of ZGEMM on two machines with different Intel processors: Xeon Gold 6240 (with 36 threads) and Xeon Platinum 8160 (with 48 threads).

The matrix dimensions are M=1281, N=1281 and K=38400.

On the Xeon Gold 6240, the speedup reaches 21 with 36 threads (58% efficiency), while on the Xeon Platinum 8160 I get a speedup of 39 with 48 threads (81% efficiency).

Why is the scalability so poor on the Xeon Gold 6240? Is this normal behaviour?

I have looked at the hardware differences between the two processors and I cannot see what the key factor behind this difference could be.

15 Replies
VidyalathaB_Intel
Moderator

Hi Rogeli,


Thanks for reaching out to us.


Could you please provide us with the complete sample reproducer code and commands used to compile the code so that we can do a quick check from our end as well?

Also please do let us know the MKL version that you are using and your OS environment details.

>>On the Xeon Gold 6240, the speedup reaches 21 with 36 threads (58%), while on the Xeon Platinum 8160 I get a speedup of 39 with 48 threads (81%).

Could you please help us with the details on how you are calculating the speedup and share with us the timings that you are getting on both processors?


Please get back to us with all the necessary details so that it would help us to proceed further in this case.


Regards,

Vidya.


rgrima
Beginner

Dear Vidya

 

Thanks for answering

 

Please find attached the code for the experiment (file zgemm.F90).

 

This is what I did on the Xeon Gold 6240:

  • Compilation:

    $> source /mnt/beegfs2018/app/intel-oneapi-2021/compiler/2021.4.0/env/vars.sh
    $> source /mnt/beegfs2018/app/intel-oneapi-2021/mkl/2021.4.0/env/vars.sh

    $> make
    ifort -O3 -qopenmp -I/mnt/beegfs2018/app/intel-oneapi-2021/mkl/2021.4.0/include/ -I/mnt/beegfs2018/app/intel-oneapi-2021/mkl/2021.4.0/include/intel64/lp64 -o zgemm_ex zgemm.F90 -L/mnt/beegfs2018/app/intel-oneapi-2021/mkl/2021.4.0/lib/intel64/ -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread

  • Execution:
    #!/bin/bash
    #SBATCH --job-name=zgemm
    #SBATCH --time=0-12:00:00
    #SBATCH --partition=cpu36memory384
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=36
    #SBATCH --exclusive
    cd ${SLURM_SUBMIT_DIR}
    source /mnt/beegfs2018/app/intel-oneapi-2021/compiler/2021.4.0/env/vars.sh
    source /mnt/beegfs2018/app/intel-oneapi-2021/mkl/2021.4.0/env/vars.sh
    M=1281 N=1281 K=38400 ./zgemm_ex &> out.6240.txt
  • Results file: times.6240.txt

There are small differences for the Xeon Platinum 8160:

  • Compilation:

    $> module load intel/2020.1 mkl/2021.4
    $> make
    ifort -O3 -qopenmp -I/apps/INTEL/oneapi/2021.4/mkl/2021.4.0//include/ -I/apps/INTEL/oneapi/2021.4/mkl/2021.4.0//include/intel64/lp64 -o zgemm_ex zgemm.F90 -L/apps/INTEL/oneapi/2021.4/mkl/2021.4.0//lib/intel64/ -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread

  • Execution:

    #!/bin/bash
    #SBATCH --job-name=zgemm
    #SBATCH --output=par.out
    #SBATCH --error=par.err
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=48
    #SBATCH --time=00:30:00
    #SBATCH --qos=debug
    #SBATCH --exclusive
    module load intel/2020.1 mkl/2021.4
    M=1281 N=1281 K=38400 ./zgemm_ex &> out.8160.txt

  • Results file: times.8160.txt

The speedup is computed independently on each machine:

SpeedUp( X threads ) = Time( 1 thread )/Time( X threads )
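
As a worked illustration of the formula above, the short Python sketch below computes the speedup and the parallel efficiency (speedup divided by the number of threads). The function names are illustrative and are not taken from the attached zgemm.F90; the sample inputs are the rounded speedups quoted in the first post.

```python
def speedup(time_one_thread, time_n_threads):
    """Speedup as defined above: Time(1 thread) / Time(X threads)."""
    return time_one_thread / time_n_threads


def efficiency(speedup_value, num_threads):
    """Parallel efficiency: fraction of the ideal linear speedup achieved."""
    return speedup_value / num_threads


# Rounded speedups reported in the first post of this thread.
gold_6240 = efficiency(21, 36)
platinum_8160 = efficiency(39, 48)

print(f"Gold 6240:     {gold_6240:.3f}")
print(f"Platinum 8160: {platinum_8160:.3f}")
```

An efficiency of 1.0 would mean perfect linear scaling; both machines fall short of that, the Gold 6240 by a wider margin.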

So better scalability does not mean a lower execution time. In fact, for the same number of threads, the Xeon Gold 6240 always gives better times:

time.png

Looking at the scalability graph, however, the Xeon Platinum 8160 behaves better.

speedup.png

Looking at these two graphs, I have two questions:

  • Why is zgemm so much faster in the sequential case on the Gold 6240? The frequency difference alone does not explain such a large gap.
  • Why do we get a better speedup on the 8160? I assume that the experiment on the 6240 puts more stress on the cache hierarchy and the main memory.

 

Please let me know if you need more information.

 

Cheers

VidyalathaB_Intel
Moderator

Hi Rogeli,

 

Thanks for providing the details.

>>Why is zgemm going much more faster in the sequential case

Could you please provide us with the timings that you are getting in the sequential case here?

Also, I'm attaching the results that I got when running the code on my end on Gold and Platinum processors.

 

Regards,

Vidya.

 

rgrima
Beginner

Hi Vidya

You can find all the timings in the files attached to my previous post (including the time for one thread). In your experiment, the time with one thread is equal to the sequential time. What did you do to get the sequential time?

These are the times of my experiments with one thread:

  • Gold 6240:       4.7412997500000005
  • Platinum 8160: 8.15284851
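
To put these two timings in perspective, they can be converted into an approximate arithmetic throughput. This is only a rough sketch: it assumes the timings are in seconds and uses the conventional estimate of 8·M·N·K real floating-point operations for a complex matrix product (each complex multiply-add counted as 8 real operations). The variable names are illustrative.

```python
# Problem size used in the job scripts of this thread.
M, N, K = 1281, 1281, 38400

# Conventional operation count for ZGEMM: 8 real flops per complex multiply-add.
flops = 8 * M * N * K

# One-thread timings quoted above (assumed to be in seconds).
times = {"Gold 6240": 4.7412997500000005, "Platinum 8160": 8.15284851}

# Approximate sustained rate on one thread, in GFLOP/s.
gflops = {cpu: flops / seconds / 1e9 for cpu, seconds in times.items()}
for cpu, rate in gflops.items():
    print(f"{cpu}: {rate:.1f} GFLOP/s on one thread")

# Ratio of the two one-thread times.
time_ratio = times["Platinum 8160"] / times["Gold 6240"]
print(f"Platinum 8160 / Gold 6240 time ratio: {time_ratio:.2f}")
```

With these inputs the ratio comes out at about 1.7, which is the gap that the difference in clock frequency alone would have to explain.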

These plots show your timings together with mine (logarithmic scale on the right):

time.png time_log.png

 

And the speedup plot:

speedup.png

What I can see is that the speedup behaviour in your experiments is similar to what I found on the Platinum 8160. On the other hand, the Gold 6240 is the fastest processor when the number of threads is under 10, but it has problems when the number of threads is greater.

Regards

Rogeli

VidyalathaB_Intel
Moderator

Hi Rogeli,

 

Thanks for the details.

 

I tried running the code on a Xeon Gold 6346 (32 threads) and a Xeon Platinum 8360 (72 threads). Using your formula (please let me know if the calculation is incorrect), I get a speedup of 27 with 32 threads (84%) on the Gold and a speedup of 53 with 72 threads (74%) on the Platinum.

Could you please let us know if this is what you are expecting from a Xeon Gold processor?

 

>>In your experiment the time with one thread is equal to the sequential time. What did you do to get the sequential time?

There is a quick linking option for MKL with the Intel compilers:

-qmkl=sequential: tells the compiler to link using the sequential libraries in oneMKL.

-qmkl=parallel: tells the compiler to link using the threaded libraries in oneMKL.

https://www.intel.com/content/www/us/en/develop/documentation/fortran-compiler-oneapi-dev-guide-and-reference/top/compiler-reference/compiler-options/advanced-optimization-options/qmkl-qmkl.html

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/linking-your-application-with-onemkl/linking-quick-start/using-the-qmkl-compiler-option.html

 

Regards,

Vidya.

 

rgrima
Beginner

Dear Vidya

 

These are my calculations. Efficiency:

  • Gold 6240: 62.26% with 36 threads
  • Platinum 8160: 80.83% with 48 threads
  • Gold 6346: 84.46% with 32 threads
  • Platinum 8360: 72.97% with 72 threads

The efficiency of the Gold 6240 is poor compared with all the other processors. On the other hand, it is faster than the Gold 6346 when the number of threads is under 10.

 

I am wondering whether this is normal behaviour for the Gold 6240, or whether I am having trouble pinning the threads to the available processors.

 

Cheers

 

Rogeli

VidyalathaB_Intel
Moderator

Hi Rogeli,


>>maybe I am having trouble pinning the threads to the available processors.

If that is the case, you can refer to the link below, which has details about binding threads to the CPU cores.

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/managing-multi-core-performance.html
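
As a minimal sketch of what thread binding can look like in practice, the Python snippet below builds an environment with the standard OpenMP variables OMP_NUM_THREADS, OMP_PLACES and OMP_PROC_BIND and shows, in a comment, how it could be passed to the benchmark. The helper name pinned_env is hypothetical, the choice of "cores" and "close" is only one possible policy, and whether zgemm_ex takes its thread count from OMP_NUM_THREADS is an assumption, since the attached source is not reproduced in this thread.

```python
import os


def pinned_env(num_threads, base_env=None):
    """Return a copy of the environment with OpenMP binding variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_threads)  # number of OpenMP threads
    env["OMP_PLACES"] = "cores"                # one place per physical core
    env["OMP_PROC_BIND"] = "close"             # bind threads to neighbouring places
    env["OMP_DISPLAY_ENV"] = "true"            # runtime reports its OpenMP settings at startup
    return env


env = pinned_env(36)
# Problem size, passed through the environment as in the job scripts above.
env.update({"M": "1281", "N": "1281", "K": "38400"})

print({key: env[key] for key in ("OMP_NUM_THREADS", "OMP_PLACES", "OMP_PROC_BIND")})

# Hypothetical launch of the benchmark with this environment:
#   import subprocess
#   subprocess.run(["./zgemm_ex"], env=env, check=True)
```

Comparing "close" against "spread" for OMP_PROC_BIND is one way to see whether the placement of threads across the two sockets changes the scaling curve.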


Regards,

Vidya.


VidyalathaB_Intel
Moderator

Hi Rogeli,


As we haven't heard back from you, could you please provide us with an update regarding the issue?


Regards,

Vidya.


VidyalathaB_Intel
Moderator

Hi Rogeli,


As we haven't heard back from you, we are closing this thread for now. Please post a new question if you need any additional assistance from Intel as this thread will no longer be monitored.


Regards,

Vidya.


rgrima
Beginner

Dear Vidya

 

Recently I had the chance to run the experiment on an Intel Xeon 6148, and I got results similar to those on the Intel Gold 6240:

  • Very good performance and scalability when the number of threads is under 10.
  • Bad scalability beyond that point.

Anyway, I still don't know the reason for this difference in behaviour.

 

Cheers

 

Rogeli

VidyalathaB_Intel
Moderator

Hi Rogeli,

 

Could you please try running the code with a larger work size and let us know the results?

I'm attaching the results that I got on the Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz CPU model.

 

And here are the comments from the development team regarding your issue:

 

"Several factors are needed to understand this behavior. To keep the same efficiency with twice as many cores:

> Core frequency should remain the same.

> The required cache/memory bandwidth doubles, because performance doubles.

> Each core should still have a large enough workload.

In this case, some of these conditions may not be met, which would explain the unexpected efficiency.

Activating a larger number of cores needs more power, but power is limited by the TDP (Thermal Design Power). Core frequency is therefore reduced as the number of active cores increases.

The bandwidth of main memory and between processors is limited and may become saturated. Performance is then limited by bandwidth rather than by core performance.

The customer's problem size is too small to keep a large number of cores busy. In addition, it requires more bandwidth because cached data will not be reused.

Behavior differs depending on the power budget (TDP), the available memory bandwidth and the number of active cores."

 

So a larger number of active cores, limited bandwidth, and a problem size that is too small for a large number of cores can all cause poor scalability.
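
The frequency argument can be made concrete with a small back-of-the-envelope model. Assuming the computation is purely compute bound and its speed scales with core frequency, the best achievable efficiency is the ratio of the all-core frequency to the single-core frequency. The function names are illustrative, and the frequencies below are hypothetical placeholders, not measured or published values for any of the processors discussed in this thread.

```python
def efficiency_upper_bound(freq_one_core_ghz, freq_all_cores_ghz):
    """Best-case parallel efficiency if speed scales only with core frequency."""
    return freq_all_cores_ghz / freq_one_core_ghz


def speedup_upper_bound(num_cores, freq_one_core_ghz, freq_all_cores_ghz):
    """Best-case speedup on num_cores under the same assumption."""
    return num_cores * efficiency_upper_bound(freq_one_core_ghz, freq_all_cores_ghz)


# Hypothetical frequencies: 3.0 GHz with one active core, 2.0 GHz with all cores active.
bound = efficiency_upper_bound(3.0, 2.0)
best_speedup = speedup_upper_bound(36, 3.0, 2.0)

print(f"efficiency bound: {bound:.2f}")
print(f"speedup bound on 36 cores: {best_speedup:.1f}")
```

Under this simple model, a frequency drop of one third caps the efficiency at about two thirds before any bandwidth or workload effect is taken into account.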

 

I hope this answers your question.

 

Regards,

Vidya.

 

rgrima
Beginner

Dear Vidya

 

> Could you please try running the code with a larger work size and let us know the results?

> I'm attaching the results that I got on the Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz CPU model.

I have repeated your experiment (M=10000, N=10000 and K=10000) and I get very similar results:

rgrima_1-1668698796677.png rgrima_0-1668698752776.png

 

> Activating larger number of cores needs more power, but power is limited by TDP (Thermal Design Power). Core frequency will be

> reduced by increasing number of active cores.

 

It makes sense, but this is the first time that I have read about this kind of behaviour. Is it documented anywhere? Do we have the same kind of limitation in the "Platinum family"? Is there any way to check the current frequency of the CPU?
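
On Linux, one common way to look at the current core frequencies is to read the "cpu MHz" lines of /proc/cpuinfo while the benchmark is running. The sketch below does this in Python; the function names are illustrative, and the reported values depend on the kernel and the frequency driver, so they should be treated as indicative only.

```python
import os


def parse_cpu_mhz(cpuinfo_text):
    """Extract the 'cpu MHz' values (one per logical CPU) from /proc/cpuinfo text."""
    freqs = []
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "cpu mhz" and value.strip():
            freqs.append(float(value))
    return freqs


def current_cpu_mhz(path="/proc/cpuinfo"):
    """Read the current frequencies; returns an empty list if the file is missing."""
    if not os.path.exists(path):
        return []
    with open(path) as handle:
        return parse_cpu_mhz(handle.read())


freqs = current_cpu_mhz()
if freqs:
    print(f"min {min(freqs):.0f} MHz, max {max(freqs):.0f} MHz "
          f"over {len(freqs)} logical CPUs")
else:
    print("no 'cpu MHz' entries found")
```

Sampling this a few times during a 1-thread run and again during a run with all threads active would show whether the frequency drops as more cores become busy.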

 

Regards

 

Rogeli

 

 

 

VidyalathaB_Intel
Moderator

Hi Rogeli,


As we haven't heard back from you, could you please provide us with an update regarding the information provided in the previous post?


Regards,

Vidya.


VidyalathaB_Intel
Moderator

Hi Rogeli,


Please refer to the link below for more details regarding Intel Core processors:

https://www.intel.com/content/www/us/en/support/articles/000007359/processors/intel-core-processors.html


Hope the information provided above helps.

Please do let us know if we can close this thread from our end if there are no issues.


Regards,

Vidya.


VidyalathaB_Intel
Moderator

Hi Rogeli,


As we haven't heard back from you, we are closing this thread. Please post a new question if you need any additional assistance from Intel as this thread will no longer be monitored.


Regards,

Vidya.

