Re: 3rd gen Xeon showed slower performance with intel MPI library - Page 2

Kuni · ‎12-18-2022

We are studying the network traffic of HPC workloads. For this, we are using the Intel MPI Library (the latest version from the Intel HPC Toolkit as of 12/10/2022) and the NAS Parallel Benchmarks (NPB 3.4.2). Before measuring network traffic, I measured performance on a single node, without any network traffic. We used the following platforms:

machine 1. Xeon Silver 4310 server, 8ch 64GB RAM, Hyper-Threading on, CentOS 7.9, Turbo on

machine 2. Xeon Silver 4214 server, 6ch 96GB RAM, Hyper-Threading on, CentOS 7.9, no Turbo

machine 3. 4-core, 8GB RAM virtual machine on machine 1, CentOS 7.9

machine 4. 4-core, 8GB RAM virtual machine on machine 2, CentOS 7.9

Results:

Test 1. mpirun -n 4 ./bin/bt.B.x (4 processes, smaller array: 102 x 102 x 102)

machine 1. 49.87 sec

machine 2. 62.02 sec

machine 3. 43.92 sec

machine 4. 63.11 sec

Test 2. mpirun -n 4 ./bin/bt.C.x (4 processes, larger array: 162 x 162 x 162)

machine 1. 388.57 sec

machine 2. 253.40 sec

machine 3. 201.79 sec

machine 4. 256.78 sec

In test 1, the results were understandable: the performance difference was as expected.

However, in test 2 I saw very strange results. There are two unexpected things.

1. The newer (3rd) generation Xeon was much slower than the older (2nd) generation Xeon on the real machine.

2. The newer (3rd) generation Xeon showed a big improvement when the benchmark was executed on the virtual machine.

As for memory, machine 2 has 1.5 times as much RAM as machine 1 (96GB vs. 64GB). However, test 2 (bt.C.x) consumes only about 4GB (according to the free command), so the difference in memory size should not have such a big effect on the results.

I also ran the tests with OpenMPI 4.1. The results are as follows:

Test 1. mpirun -np 4 ./bin/bt.B.x (4 processes, smaller array)

machine 1. 52.31 sec

machine 2. 61.73 sec

Test 2. mpirun -np 4 ./bin/bt.C.x (4 processes, larger array)

machine 1. 198.70 sec

machine 2. 252.31 sec
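The gap between the two MPI libraries on machine 1 can be put into numbers directly from the timings above. The following is only a sketch of that arithmetic: the helper name `time_ratio` and the constant names are made up for illustration, and the values are the timings posted in this thread.

```cpp
#include <cassert>

// Hypothetical helper: ratio of two run times measured in seconds.
// A value above 1.0 means `t` took longer than `reference`.
double time_ratio(double t, double reference) {
    assert(reference > 0.0);
    return t / reference;
}

// Machine 1 (Xeon Silver 4310, bare metal), 4 processes, times in seconds,
// copied from the results above.
constexpr double kIntelMpiBtB = 49.87;
constexpr double kOpenMpiBtB  = 52.31;
constexpr double kIntelMpiBtC = 388.57;
constexpr double kOpenMpiBtC  = 198.70;
```

With these values, `time_ratio(kIntelMpiBtB, kOpenMpiBtB)` is about 0.95 (Intel MPI slightly faster on class B), while `time_ratio(kIntelMpiBtC, kOpenMpiBtC)` is about 1.96 (Intel MPI takes nearly twice as long on class C).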

So it seems that the combination of Intel MPI, 3rd Gen Xeon, and large arrays may cause the performance drop, and that I cannot use Intel MPI with 3rd Gen Xeon. However, Intel MPI makes it much easier to specify the fabric, so I would like to use it for our network traffic evaluation if possible. To use the Intel MPI Library, I would like to know the following:

1. Why did the 3rd Gen Xeon show slow performance? Why did this not happen in my virtual machine, even though it runs on the same 3rd Gen Xeon?

2. Why does the performance drop appear with the Intel MPI Library?

3. Is there any way to improve performance with Intel MPI on 3rd Gen Xeon?

Please help!

K. Kunita

Kuni · ‎02-15-2023

I am sorry. In the case of MPICC, I made a mistake because I had modified the settings you recommended in your last reply. The "undefined reference" error occurred because I forgot to run "make clean". Now I can build with mpiifort.

However, mpirun -n 4 ./bin/bt.C.x gave the following results:

On the real 3rd Gen Xeon machine: 261.22 sec. On a virtual machine (KVM) on that same machine: 201.xx sec.

It is strange that the benchmark runs faster on the virtual machine than on the real machine. Normally, a virtual machine should be slightly slower than the non-virtual case.

Also, in your results, bt.A.x is faster on 3rd Gen Xeon, but bt.C.x is slower on 3rd Gen Xeon. That means you are seeing what I saw, and there is some issue with the Intel MPI/3rd Gen Xeon combination. I think you can reproduce my issue.

What do you think?

Kuni · ‎02-15-2023

Sorry, I clicked the Post Reply button before I had a value measured under exactly the same conditions. The 201.xx sec result was measured under slightly different conditions. The execution time on the virtual machine in the same environment was 220.18 sec.

Regards, K. Kunita

Kuni · ‎02-26-2023

I also tried to use DevCloud, but I could not find a way to use 3rd/4th Gen Xeon SP. How can I set up DevCloud to use 3rd/4th Gen Xeon SP?

Anyway, I think you can reproduce my issue in your environment, even with the new compiler. Your results show that 3rd Gen Xeon SP is slower than 2nd Gen Xeon SP in Class C, while in the same environment Class B is faster on 3rd Gen Xeon SP than on 2nd Gen Xeon SP. The normal expectation is that a newer-generation CPU is faster in most cases, so this result is not good for Intel either. That is what I would like you to investigate. (I also think you will see the same result as I did if you try a virtual machine: the virtual machine result is faster than the real machine result, and it may be the result I would expect to see on the real machine.)

Regards, Kuni

Kuni · ‎03-15-2023

Do you have any reply to my previous message?

Kuni · ‎03-22-2023

Any update?

Kuni · ‎04-05-2023

Could you please comment on my previous messages? I have been waiting for more than a month.

Kuni · ‎05-31-2023

I should apologize for one thing: I overlooked your response about DevCloud. I was now able to run NPB on 3rd Gen Xeon SP there. The results I saw were the same as yours, and I think they reproduce my issue: 3rd Gen Xeon SP is slower than 2nd Gen Xeon SP (what you used is 1st Gen Xeon SP). I also ran with mpiicc/mpiifort in my environment. There, the real 3rd Gen Xeon machine took 261.22 sec, while a virtual machine (KVM) on the same machine took about 201 sec. That means something is going wrong on non-virtual 3rd Gen Xeon with Intel MPI. I also tried OpenMPI on the non-virtual 3rd Gen Xeon. In that case, the result is similar to the virtual 3rd Gen Xeon SP with Intel MPI, and 3rd Gen Xeon SP is faster than 2nd Gen Xeon SP.

I also ran all the basic NPB tests. The results are as follows:

Test	Gold 6128	Gold 6348	Silver 4310	Silver 4310-VM	Silver 4214R	Silver 4214R-VM
bt.B.x n=4	49.76 sec	45.59 sec	49.87 sec	50.56 sec	61.20 sec	62.37 sec
bt.C.x n=4	214.96 sec	236.99 sec	261.87 sec	205.56 sec	249.57 sec	254.04 sec
ft.B.x n=4	9.40 sec	11.74 sec	11.37 sec	8.65 sec	10.11 sec	10.44 sec
ft.C.x n=4	40.59 sec	48.76 sec	35.71 sec	36.32 sec	42.75 sec	43.51 sec
lu.B.x n=4	24.95 sec	24.07 sec	28.75 sec	29.59 sec	35.64 sec	36.59 sec
lu.C.x n=4	104.15 sec	100.05 sec	140.86 sec	123.15 sec	145.38 sec	148.16 sec
is.B.x n=4	0.68 sec	0.80 sec	0.46 sec	0.44 sec	0.44 sec	0.52 sec
is.C.x n=4	2.77 sec	2.64 sec	1.51 sec	1.64 sec	1.76 sec	1.88 sec
is.D.x n=4	49.59 sec	45.65 sec	30.03 sec	29.08 sec	32.97 sec	33.55 sec
cg.B.x n=4	8.52 sec	8.46 sec	8.78 sec	10.55 sec	9.84 sec	11.78 sec
cg.C.x n=4	25.29 sec	33.32 sec	23.33 sec	25.63 sec	27.63 sec	33.46 sec
cg.D.x n=4	2224.97 sec	1475.95 sec	1872.25 sec	1916.41 sec	3448.39 sec	3437.87 sec
ep.B.x n=4	9.37 sec	10.98 sec	12.97 sec	13.15 sec	14.43 sec	14.58 sec
ep.C.x n=4	37.44 sec	41.70 sec	51.61 sec	52.22 sec	57.67 sec	58.24 sec
sp.B.x n=4	36.22 sec	45.42 sec	33.51 sec	34.31 sec	37.94 sec	38.76 sec
sp.C.x n=4	176.85 sec	282.39 sec	143.23 sec	146.02 sec	168.69 sec	171.56 sec
mg.B.x n=4	1.30 sec	2.04 sec	1.36 sec	1.41 sec	1.46 sec	1.58 sec
mg.C.x n=4	11.36 sec	15.38 sec	11.38 sec	11.87 sec	12.79 sec	13.02 sec
dt.B.x n=43 BH	1.54 sec	1.17 sec	1.00 sec	4.64 sec	1.31 sec	5.38 sec
dt.B.x n=192 SH	7.16 sec	6.76 sec	4.46 sec	22.73 sec	5.65 sec	27.72 sec
dt.C.x n=85 BH	26.15 sec	18.78 sec	18.98 sec	error (memory?)	19.44 sec	error (memory?)

For each test, I ran it more than 2 times and took the minimum execution time. If the time was less than 30 sec, I ran it more than 3 times. The exception is cg.D.x, which was run only once because it takes a very long time.
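The "run several times and keep the minimum" rule can be sketched in code. This is not how the numbers above were collected (those come from the times reported by the NPB runs themselves); it is just an illustration of the same rule applied to any callable, and the name `min_elapsed_seconds` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: call `work` `runs` times and return the shortest
// wall-clock time in seconds, i.e. the "take the minimum" rule.
template <typename Work>
double min_elapsed_seconds(Work&& work, int runs) {
    assert(runs > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const std::chrono::duration<double> elapsed = stop - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

Taking the minimum rather than the mean tends to filter out runs disturbed by other activity on the machine, which is presumably why it was used here.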

In the above results, bt.C.x, ft.B.x and lu.C.x are slower on the real 3rd Gen Xeon SP machine (Silver 4310), while the virtual machine on the same CPU gives better results.
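The three tests named above can be picked out of the table mechanically. The sketch below uses only the bt, ft and lu rows for the Silver 4310 and Silver 4310-VM columns; the 10% threshold, the `Row` struct and the function name are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One table row: bare-metal and VM times on the Silver 4310, in seconds.
struct Row {
    std::string test;
    double bare_metal;
    double vm;
};

// Return the tests whose bare-metal time exceeds the VM time by more than
// `threshold` (1.10 means "more than 10% slower on bare metal").
std::vector<std::string> slower_on_bare_metal(const std::vector<Row>& rows,
                                              double threshold) {
    std::vector<std::string> flagged;
    for (const Row& r : rows) {
        assert(r.vm > 0.0);
        if (r.bare_metal / r.vm > threshold) flagged.push_back(r.test);
    }
    return flagged;
}

// bt, ft and lu rows copied from the table (Silver 4310 vs. Silver 4310-VM).
const std::vector<Row> kSilver4310 = {
    {"bt.B.x", 49.87, 50.56},  {"bt.C.x", 261.87, 205.56},
    {"ft.B.x", 11.37, 8.65},   {"ft.C.x", 35.71, 36.32},
    {"lu.B.x", 28.75, 29.59},  {"lu.C.x", 140.86, 123.15},
};
```

Calling `slower_on_bare_metal(kSilver4310, 1.10)` returns bt.C.x, ft.B.x and lu.C.x, with bare-metal/VM ratios of roughly 1.27, 1.31 and 1.14.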

I also ran Geekbench 5 on the 2 DevCloud hosts for reference. The single-thread results are below; all tests except Text Compression are faster on the Gold 6348. However, the NPB results above show some performance degradation on 3rd Gen Xeon SP. This suggests that there is some issue in the Intel MPI Library on 3rd Gen Xeon SP, or that some special option is needed to run Intel MPI on 3rd Gen.

Test	Gold 6128	Gold 6348
Single-Core Score	1016	1152
Crypto Score	1355	2250
Integer Score	961	1043
Floating Point Score	1086	1205
AES-XTS	1355	2250
Text Compression	1035	840
Image Compression	1007	1082
Navigation	806	853
HTML5	907	1187
SQLite	1013	1076
PDF Rendering	947	1130
Text Rendering	881	1075
Clang	1075	1125
Camera	1008	1077
N-Body Physics	954	1083
Rigid Body Physics	1067	1110
Gaussian Blur	718	791
Face Detection	983	1037
Horizon Detection	826	968
Image Inpainting	1841	2131
HDR	1917	2140
Ray Tracing	1366	1557
Structure from Motion	938	1062
Speech Recognition	1004	1099
Machine Learning	918	984

What do you think?

Regards, K. Kunita

ShivaniK_Intel · ‎07-10-2023

Hi,

We have escalated this issue to the development team and will get back to you soon.

Thanks & Regards

Shivani

Kuni · ‎08-23-2023

I tried RoCE communication with 4 workstations. The bt.D.x test is the only test in NPB show but performance degradation with 3rd Gen Xeon. With OpenMPI, it ran at about 2x the speed.

The command lines were as follows.

Case 1:

Intel MPI 2nd Gen. Xeon SP with Nvidia ConnectX-5: mpirun -n 36 -ppn 9 -host svr0-100g,svr1-100g,svr2-100g,svr3-100g ./bin/bt.D.x

Result: 626.03 sec

Case 2:

Intel MPI 3rd Gen. Xeon SP with Nvidia ConnectX-6: mpirun -n 36 -ppn 9 -host svr4-100g,svr5-100g,svr6-100g,svr7-100g ./bin/bt.D.x

Result: 1011.72 sec

Case 3:

OpenMPI 2nd Gen Xeon SP with Nvidia ConnectX-5: mpirun -np 36 -host svr0-100g:9,svr1-100g:9,svr2-100g:9,svr3-100g:9 ./bin/bt.D.x

Result: 602.64 sec

Case 4:

OpenMPI 3rd Gen. Xeon SP with Nvidia ConnectX-6: mpirun -np 36 -host svr4-100g:9,svr5-100g:9,svr6-100g:9,svr7-100g:9 ./bin/bt.D.x

Result: 497.51 sec

NPB tests other than BT did not show this kind of performance degradation with Intel MPI and 3rd Gen Xeon.
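To see how each MPI library responds to the move from 2nd Gen to 3rd Gen hardware, the four results can be expressed as percentage changes. This is a sketch of the arithmetic only; `percent_change` and the constant names are hypothetical, and the values are the four results above.

```cpp
#include <cassert>

// Hypothetical helper: percentage change in run time going from `before` to
// `after`. Positive means the run got slower, negative means it got faster.
double percent_change(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

// bt.D.x over RoCE, 36 processes on 4 hosts, times in seconds (Cases 1 to 4).
constexpr double kIntelMpiGen2 = 626.03;
constexpr double kIntelMpiGen3 = 1011.72;
constexpr double kOpenMpiGen2  = 602.64;
constexpr double kOpenMpiGen3  = 497.51;
```

`percent_change(kIntelMpiGen2, kIntelMpiGen3)` is about +61.6 (slower on the newer hosts), while `percent_change(kOpenMpiGen2, kOpenMpiGen3)` is about -17.4 (faster on the newer hosts).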

Regards, K. Kunita

Kuni · ‎09-24-2023

There is one typo: "show but performance degradation" should be "which showed performance degradation".

By the way, is there any update?

Homer-b83fb642db9bb85 · ‎04-28-2024

I have the same issue. The 4309Y single-core integer benchmark is lower than the 4314, 6142, and E5-2643 v3... too bad.

The other Xeon 43xx parts are fine; it is just the 4309Y.

TobiasK · ‎04-29-2024

@Homer-b83fb642db9bb85

I don't think this is the right place for you.
You might want to check which performance profile your 4309Y is running:

https://www.intel.com/content/www/us/en/products/sku/215275/intel-xeon-silver-4309y-processor-12m-cache-2-80-ghz/specifications.html

Intel® Speed Select Technology - Performance Profile (Intel® SST-PP)

Config	Active Cores	Base Frequency	TDP
4309Y(0)	8	2.8 GHz	105W
4309(1)	8	2.6 GHz	95W
4309(2)	8	2.3 GHz	85 W

If you think it's related to Intel MPI, please provide a reproducer and detailed information on your environment apart from the CPU SKU.

Best

Homer-b83fb642db9bb85 · ‎05-05-2024

@TobiasK Thanks.

It is an unusual performance issue: a lower-clocked CPU such as the Xeon 4314 may perform better. This is a single-core test only, with the same BIOS/OS settings.

This pattern does not appear on many other generations of Xeon CPUs. How strange.

TobiasK · ‎05-08-2024

@Homer-b83fb642db9bb85
Did you take a look at the actual frequencies using a tool like turbostat?
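If turbostat is not at hand, a much rougher look at the current frequencies is available on x86 Linux through the "cpu MHz" lines in /proc/cpuinfo. The sketch below only parses text in that format; the function names are made up for illustration, and a single snapshot like this is far less reliable than the counter-based averages turbostat reports.

```cpp
#include <algorithm>
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Collect the "cpu MHz" values from text in /proc/cpuinfo format
// (one entry per logical CPU on x86 Linux).
std::vector<double> parse_cpu_mhz(std::istream& in) {
    std::vector<double> mhz;
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind("cpu MHz", 0) != 0) continue;  // keep only "cpu MHz" lines
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;
        // std::stod throws if no number follows the colon.
        mhz.push_back(std::stod(line.substr(colon + 1)));
    }
    return mhz;
}

// Convenience overload for text that is already in memory.
std::vector<double> parse_cpu_mhz(const std::string& text) {
    std::istringstream in(text);
    return parse_cpu_mhz(in);
}

// Highest frequency seen across all logical CPUs.
double max_mhz(const std::vector<double>& mhz) {
    assert(!mhz.empty());
    return *std::max_element(mhz.begin(), mhz.end());
}
```

In practice one would pass a `std::ifstream` opened on /proc/cpuinfo while the benchmark is running on the pinned core, and compare the value for that core between the two machines.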

Homer-b83fb642db9bb85 · ‎05-11-2024

@TobiasK Thanks.

For other reasons, I was unable to switch operating system versions, so the tests were run on different operating system versions. In earlier tests on the same operating system, the results were consistent with these.

The turbostat info is in the attachment.

Xeon 4314, el7 gcc version 9.1.1 20190605, 3.10.0-1160.49.1.el7.x86_64

Xeon 4309Y, el9, gcc version 11.3.1 20221121, 5.14.0-284.18.1.el9_2.x86_64

Xeon 4309Y, gcc version 11.3.1 20221121 5.14.0-284.18.1.el9_2.x86_64

Output: 91.6857 <------------ higher CPU MHz but takes more time
sum = 9048129480000

Test command:

g++ -O2 test.cpp -o add

numactl -C 3 ./add

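#include <cstdlib>   // std::rand
#include <ctime>     // clock, CLOCKS_PER_SEC
#include <iostream>
int main()
{
    const unsigned arraySize = 1048576;
    // 4 MiB array on the stack (fits within the usual 8 MiB Linux default)
    int data[arraySize];
    // Fill with values in [0, 255]; std::rand is unseeded, so every run
    // sees the same data
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;
    // Measure the CPU time of the summation loops only
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 90000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            // Conditional accumulate: only values >= 128 are added
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}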

lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               GenuineIntel
  BIOS Vendor ID:        Intel(R) Corporation
  Model name:            Intel(R) Xeon(R) Silver 4309Y CPU @ 2.80GHz
    BIOS Model name:     Intel(R) Xeon(R) Silver 4309Y CPU @ 2.80GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           2
    Stepping:            6
    BogoMIPS:            5600.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology n
                         onstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dn
                         owprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512
                         f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm i
                         da arat pln pts hwp_epp avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   768 KiB (16 instances)
  L1i:                   512 KiB (16 instances)
  L2:                    20 MiB (16 instances)
  L3:                    24 MiB (2 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-7,16-23
  NUMA node1 CPU(s):     8-15,24-31
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected




Homer-b83fb642db9bb85 · ‎05-11-2024

@TobiasK Thanks

I'm not sure what's going on, but my previous replies haven't been showing up. I'll try again.

For other reasons, I'm unable to switch operating systems. Currently, I'm comparing performance across different operating systems. However, according to previous testing results, this test script actually performs better on higher versions of GCC. Even when switching to the same operating system, this behavior persists.

1x Xeon 4314 gcc version 9.1.1 20190605 3.10.0-1160.49.1.el7.x86_64

numactl -C 3 ./add

65.21
sum = 9048129480000

2x Xeon 4309Y gcc version 11.3.1 20221121 5.14.0-284.18.1.el9_2.x86_64

numactl -C 3 ./add

91.6857 <----------------------- takes more time at the higher CPU frequency
sum = 9048129480000

Here is the test.cpp code:

g++ -O2 test.cpp -o add

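#include <cstdlib>   // std::rand
#include <ctime>     // clock, CLOCKS_PER_SEC
#include <iostream>
int main()
{
    const unsigned arraySize = 1048576;
    // 4 MiB array on the stack (fits within the usual 8 MiB Linux default)
    int data[arraySize];
    // Fill with values in [0, 255]; std::rand is unseeded, so every run
    // sees the same data
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;
    // Measure the CPU time of the summation loops only
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 90000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            // Conditional accumulate: only values >= 128 are added
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}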
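One possible follow-up experiment, which nobody in this thread has run, would be to time the same accumulation written with and without a branch. Whether the two GCC versions actually treat the conditional differently is an assumption, not something established here; the function names below are made up, and the compiler is free to emit the same machine code for both.

```cpp
#include <cassert>
#include <vector>

// Same accumulation as the inner loop of test.cpp, written with a branch.
long long sum_with_branch(const std::vector<int>& data) {
    long long sum = 0;
    for (int v : data) {
        if (v >= 128) sum += v;
    }
    return sum;
}

// Branch-free formulation: the comparison yields 0 or 1, which scales the
// value before it is added.
long long sum_branchless(const std::vector<int>& data) {
    long long sum = 0;
    for (int v : data) {
        sum += static_cast<long long>(v) * (v >= 128);
    }
    return sum;
}

// Sanity check: both formulations must agree on the same input.
void check_same_result(const std::vector<int>& data) {
    assert(sum_with_branch(data) == sum_branchless(data));
}
```

If the two versions took similar time on the 4314 but very different time on the 4309Y, that would point toward code generation or branch handling rather than clock frequency; comparing the generated assembly from both compilers would be a more direct check.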