Intel® MPI Library

3rd Gen Xeon shows slower performance with Intel MPI Library

Kuni
New Contributor I
7,382 Views

We are studying the network traffic of HPC workloads. For this we are using Intel MPI Library (the latest Intel HPC Toolkit as of 12/10/2022) and the NAS Parallel Benchmarks (NPB 3.4.2). Before measuring network traffic, I measured performance on a single machine, with no network communication involved. We used the following platforms:

 

machine 1. Xeon Silver 4310 server, 8ch 64GB RAM, Hyper-Threading on, CentOS 7.9, Turbo on

machine 2. Xeon Silver 4214 server, 6ch 96GB RAM, Hyper-Threading on, CentOS 7.9, no Turbo

machine 3. 4-core 8GB RAM virtual machine on machine 1, CentOS 7.9

machine 4. 4-core 8GB RAM virtual machine on machine 2, CentOS 7.9

 

Results: 

Test 1.  mpirun -n 4 ./bin/bt.B.x (4 processes, smaller array: 102 x 102 x 102)

machine 1.  49.87 sec

machine 2. 62.02 sec

machine 3. 43.92 sec

machine 4. 63.11 sec

 

Test 2. mpirun -n 4 ./bin/bt.C.x  (4 processes, larger array: 162 x 162 x 162)

machine 1. 388.57 sec

machine 2. 253.40 sec

machine 3. 201.79 sec

machine 4. 256.78 sec

 

For test 1, the results were as expected, and the performance differences are understandable.

For test 2, however, I saw very strange results. There are two unexpected things:

1. The newer (3rd) generation Xeon was much slower than the older (2nd) generation Xeon on the real machine.

2. The newer (3rd) generation Xeon was much faster when the benchmark was executed on the virtual machine.

 

As for memory, machine 2 has 1.5 times the memory of machine 1 (96GB vs. 64GB). However, test 2 (bt.C.x) only consumes about 4GB (according to the free command), so the difference in memory size should not have such a big effect on the results.

 

I also ran the tests with Open MPI 4.1. The results are as follows:

Test 1.  mpirun -np 4 ./bin/bt.B.x (4 processes, smaller array)

machine 1.  52.31 sec

machine 2.  61.73 sec

 

Test 2. mpirun -np 4 ./bin/bt.C.x  (4 processes, larger array)

machine 1. 198.70 sec

machine 2. 252.31 sec

 

So it seems that the combination of Intel MPI, 3rd Gen Xeon, and a large array can cause a performance drop, and as things stand I cannot use Intel MPI with 3rd Gen Xeon. However, Intel MPI makes it much easier to specify the fabric, so I would like to use it for our network traffic evaluation if possible. I would therefore like to know the following:

1. Why does 3rd Gen Xeon show slow performance? Why does this not happen in my virtual machine, even though it also runs on 3rd Gen Xeon?

2. Why does the performance drop appear with Intel MPI Library?

3. Is there any way to improve performance with Intel MPI on 3rd Gen Xeon?

Please help!

 

K. Kunita

36 Replies
Kuni
New Contributor I
2,623 Views

I am sorry about the MPICC problem. It was my mistake, since I had modified the settings you recommended in your last reply. The "undefined reference" error happened because I forgot to run "make clean". Now I can build with mpiifort.

However, mpirun -n 4 ./bin/bt.C.x gave the following results:

On the real 3rd Gen Xeon machine: 261.22 sec. On a virtual machine (KVM) on the same machine: 201.xx sec.

It is strange that the benchmark runs faster on the virtual machine than on the real machine. Normally a virtual machine should be slightly slower than the non-virtualized case.

Also, in your results bt.A.x is faster on 3rd Gen Xeon, but bt.C.x is slower on 3rd Gen Xeon. That means you are seeing what I saw, and that there is some issue with the Intel MPI / 3rd Gen Xeon combination. I think you can reproduce my issue.

What do you think?

 

 

Kuni
New Contributor I
2,621 Views

Sorry, I clicked the Post Reply button before I had a value measured under exactly the same conditions. The 201.xx sec figure was measured under slightly different conditions. The execution time on the virtual machine in the same environment was 220.18 sec.

 

Regards, K. Kunita

Kuni
New Contributor I
2,591 Views

I also tried to use DevCloud, but I could not find a way to use 3rd/4th Gen Xeon SP. How can I set up DevCloud to use 3rd/4th Gen Xeon SP?

Anyway, I think you can reproduce my issue in your environment, even with the new compiler. Your results show that 3rd Gen Xeon SP is slower than 2nd Gen Xeon in the Class C result, while in the same environment the Class B result is faster on 3rd Gen Xeon SP than on 2nd Gen Xeon SP. One would normally expect a newer-generation CPU to be faster in most cases, so this result is not good for Intel either. That is what I would like you to investigate. (I also think you will see the same thing I did if you try a virtual machine: the virtual machine result is faster than the real machine result, and it may be the result I expected to see on the real machine.)

Regards, Kuni 

Kuni
New Contributor I
2,569 Views

Do you have any response to my previous message?

Kuni
New Contributor I
2,527 Views

Could you please comment on my previous messages? I have been waiting for more than a month.

Kuni
New Contributor I
2,410 Views

First, I should apologize: I overlooked your response about DevCloud. I have now run NPB on 3rd Gen Xeon SP there, and the results I saw were the same as yours. I think this reproduces my issue: 3rd Gen Xeon SP is slower than 2nd Gen Xeon SP (what you used was 1st Gen Xeon SP). I also built with mpiicc/mpiifort in my environment. There, the real 3rd Gen Xeon machine took 261.22 sec, while a virtual machine (KVM) on the same machine took about 201 sec. That means something is going wrong on non-virtualized 3rd Gen Xeon with Intel MPI. I also tried Open MPI on the non-virtualized 3rd Gen Xeon. In that case the result is similar to the virtualized 3rd Gen Xeon SP with Intel MPI, and 3rd Gen Xeon SP is faster than 2nd Gen Xeon SP.

I also ran all of the basic NPB tests. The results are as follows:

Test (time in sec)  Gold 6128  Gold 6348  Silver 4310  Silver 4310-VM  Silver 4214R  Silver 4214R-VM
bt.B.x n=4          49.76      45.59      49.87        50.56           61.20         62.37
bt.C.x n=4          214.96     236.99     261.87       205.56          249.57        254.04
ft.B.x n=4          9.40       11.74      11.37        8.65            10.11         10.44
ft.C.x n=4          40.59      48.76      35.71        36.32           42.75         43.51
lu.B.x n=4          24.95      24.07      28.75        29.59           35.64         36.59
lu.C.x n=4          104.15     100.05     140.86       123.15          145.38        148.16
is.B.x n=4          0.68       0.80       0.46         0.44            0.44          0.52
is.C.x n=4          2.77       2.64       1.51         1.64            1.76          1.88
is.D.x n=4          49.59      45.65      30.03        29.08           32.97         33.55
cg.B.x n=4          8.52       8.46       8.78         10.55           9.84          11.78
cg.C.x n=4          25.29      33.32      23.33        25.63           27.63         33.46
cg.D.x n=4          2224.97    1475.95    1872.25      1916.41         3448.39       3437.87
ep.B.x n=4          9.37       10.98      12.97        13.15           14.43         14.58
ep.C.x n=4          37.44      41.70      51.61        52.22           57.67         58.24
sp.B.x n=4          36.22      45.42      33.51        34.31           37.94         38.76
sp.C.x n=4          176.85     282.39     143.23       146.02          168.69        171.56
mg.B.x n=4          1.30       2.04       1.36         1.41            1.46          1.58
mg.C.x n=4          11.36      15.38      11.38        11.87           12.79         13.02
dt.B.x n=43 BH      1.54       1.17       1.00         4.64            1.31          5.38
dt.B.x n=192 SH     7.16       6.76       4.46         22.73           5.65          27.72
dt.C.x n=85 BH      26.15      18.78      18.98        error (memory?) 19.44         error (memory?)

 

For these results, I ran each test more than twice and took the minimum execution time. If the time was less than 30 sec, I ran the test more than three times. The exception is cg.D.x, which was run only once because it takes a very long time.

In the above results, bt.C.x, ft.B.x and lu.C.x are slower on the real 3rd Gen Xeon SP machine, while the virtual machine on 3rd Gen Xeon gives better results.

I also ran Geekbench 5 on the two DevCloud hosts for reference. The single-thread results are below; every test except Text Compression is faster on the Gold 6348. The NPB results above nevertheless show some performance degradation on 3rd Gen Xeon SP. This suggests either an issue in Intel MPI Library on 3rd Gen Xeon SP, or that some special option is needed to run Intel MPI on 3rd Gen.

Test                    Gold 6128  Gold 6348
Single-Core Score       1016       1152
Crypto Score            1355       2250
Integer Score           961        1043
Floating Point Score    1086       1205
AES-XTS                 1355       2250
Text Compression        1035       840
Image Compression       1007       1082
Navigation              806        853
HTML5                   907        1187
SQLite                  1013       1076
PDF Rendering           947        1130
Text Rendering          881        1075
Clang                   1075       1125
Camera                  1008       1077
N-Body Physics          954        1083
Rigid Body Physics      1067       1110
Gaussian Blur           718        791
Face Detection          983        1037
Horizon Detection       826        968
Image Inpainting        1841       2131
HDR                     1917       2140
Ray Tracing             1366       1557
Structure from Motion   938        1062
Speech Recognition      1004       1099
Machine Learning        918        984

 

What do you think?

 

Regards, K. Kunita

ShivaniK_Intel
Moderator
2,297 Views

Hi,


We have escalated this issue to the development team and will get back to you soon.


Thanks & Regards

Shivani



Kuni
New Contributor I
2,217 Views

I tried RoCE communication across 4 workstations. bt.D.x is the only NPB test which showed performance degradation with 3rd Gen Xeon. With Open MPI, the same test was about 2x faster.

 

The execution command lines are following.

Case 1:

Intel MPI 2nd Gen. Xeon SP with Nvidia ConnectX-5: mpirun -n 36 -ppn 9 -host svr0-100g,svr1-100g,svr2-100g,svr3-100g ./bin/bt.D.x

Result: 626.03 sec

Case 2:

Intel MPI 3rd Gen. Xeon SP with Nvidia ConnectX-6: mpirun -n 36 -ppn 9 -host svr4-100g,svr5-100g,svr6-100g,svr7-100g ./bin/bt.D.x

Result: 1011.72 sec

Case 3:

OpenMPI 2nd Gen Xeon SP with Nvidia ConnectX-5: mpirun -np 36 -host svr0-100g:9,svr1-100g:9,svr2-100g:9,svr3-100g:9 ./bin/bt.D.x

Result: 602.64 sec

Case 4:

OpenMPI 3rd Gen. Xeon SP with Nvidia ConnectX-6: mpirun -np 36 -host svr4-100g:9,svr5-100g:9,svr6-100g:9,svr7-100g:9 ./bin/bt.D.x

Result: 497.51 sec

 

No NPB test other than BT showed such performance degradation with Intel MPI on 3rd Gen Xeon.

 

Regards, K. Kunita

Kuni
New Contributor I
2,121 Views

One correction to my previous post: the sentence about bt.D.x should read "which showed performance degradation".

 

By the way, is there any update?

Homer-b83fb642db9bb85
790 Views

I have the same issue. The 4309Y's single-core integer benchmark score is lower than that of the 4314, 6142, and E5-2643 v3... too bad.

The other Xeon 43xx parts are fine; it is just the 4309Y.

TobiasK
Moderator
775 Views

@Homer-b83fb642db9bb85 

I don't think this is the right place for you. 
You might want to check which performance profile your 4309Y is running:

https://www.intel.com/content/www/us/en/products/sku/215275/intel-xeon-silver-4309y-processor-12m-cache-2-80-ghz/specifications.html

Intel® Speed Select Technology - Performance Profile (Intel® SST-PP)

Config     Active Cores   Base Frequency   TDP    Description
4309Y(0)   8              2.8 GHz          105W
4309(1)    8              2.6 GHz          95W
4309(2)    8              2.3 GHz          85W

 

 

If you think it's related to Intel MPI, please provide a reproducer and detailed information on your environment apart from the CPU SKU.

 

Best 

Homer-b83fb642db9bb85
448 Views

@TobiasK  Thanks. 

This is an unusual performance issue: a lower-clocked CPU (e.g. the Xeon 4314) performs better. This is a single-core test with the same BIOS/OS settings.

This does not happen with many other generations of Xeon CPUs. How strange.

TobiasK
Moderator
311 Views

@Homer-b83fb642db9bb85 
Did you take a look at the actual frequencies using tools like turbostat?

 

Homer-b83fb642db9bb85
151 Views

@TobiasK Thanks. 

For other reasons I was unable to switch operating system versions, so these tests were run on different operating system versions. When I previously tested on the same operating system, the results were consistent with these.

The turbostat output is in the attachment.

 

Xeon 4314, el7, gcc version 9.1.1 20190605, 3.10.0-1160.49.1.el7.x86_64

Xeon 4309Y, el9, gcc version 11.3.1 20221121, 5.14.0-284.18.1.el9_2.x86_64

 

Xeon 4309Y, gcc version 11.3.1 20221121 5.14.0-284.18.1.el9_2.x86_64

Output: 91.6857  <------------ higher CPU MHz, but takes more time
sum = 9048129480000

 

Test command: 

g++ -O2 test.cpp -o add

numactl -C 3 ./add

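#include <cstdlib>   // std::rand
#include <ctime>     // clock, CLOCKS_PER_SEC
#include <iostream>
int main()
{
    // 1 Mi ints (4 MiB) on the stack, filled with pseudo-random values in [0, 255].
    const unsigned arraySize = 1048576;
    int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;
    // Note: clock() measures CPU time used by the process, not wall-clock time.
    clock_t start = clock();
    long long sum = 0;
    // Sum every element >= 128, repeated 90000 times over the same array.
    for (unsigned i = 0; i < 90000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}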

 

lscpu

lscpu 
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               GenuineIntel
  BIOS Vendor ID:        Intel(R) Corporation
  Model name:            Intel(R) Xeon(R) Silver 4309Y CPU @ 2.80GHz
    BIOS Model name:     Intel(R) Xeon(R) Silver 4309Y CPU @ 2.80GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           2
    Stepping:            6
    BogoMIPS:            5600.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp_epp avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   768 KiB (16 instances)
  L1i:                   512 KiB (16 instances)
  L2:                    20 MiB (16 instances)
  L3:                    24 MiB (2 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-7,16-23
  NUMA node1 CPU(s):     8-15,24-31
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected




 

 

Homer-b83fb642db9bb85
188 Views

@TobiasK Thanks

 

I'm not sure what's going on, but my previous replies haven't been showing up, so I'll try again.

For other reasons I'm unable to switch operating systems, so for now I'm comparing performance across different operating systems. However, according to my previous tests, this test program actually performs better with newer versions of GCC, and the behavior persists even when both machines run the same operating system.

 

 

 

1x Xeon 4314   gcc version 9.1.1 20190605              3.10.0-1160.49.1.el7.x86_64

numactl -C 3 ./add  

65.21
sum = 9048129480000

 

2x Xeon 4309Y   gcc version 11.3.1 20221121     5.14.0-284.18.1.el9_2.x86_64

numactl -C 3 ./add

91.6857  <----------------------- takes more time despite the higher CPU frequency
sum = 9048129480000

 


 

 
