Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

oneAPI and WRF: running with ten thousand cores

Jervie
Novice

Hi,

With the WRF model compiled using Intel 2021.2.0, I tried to run a large case on 11200 cores. Below are some of the environment variables I set:

export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=mlx
export I_MPI_OFI_LIBRARY_INTERNAL=1
export I_MPI_PIN_RESPECT_HCA=enable
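
For reference, these variables are exported in the job script before launching WRF, roughly like this (a simplified sketch; the rank count, ranks-per-node value, hostfile path, and thread count are placeholders):

# settings above are exported first, then WRF is launched across the allocated nodes
export OMP_NUM_THREADS=<threads per rank>    # total cores = MPI ranks x threads per rank
mpirun -np <num MPI ranks> -ppn <ranks per node> -f <hostfile> ./wrf.exe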

I also compiled WRF using Intel 2018.3.22 and ran a large case with both the Intel 2018.3.22 mpirun and the Intel 2021.2.0 mpirun, using 5600 and 11200 cores. Both runs completed successfully, but performance degraded with 11200 cores compared to 5600 cores, which should not happen since WRF has relatively good scalability.

I suspect I may be missing some environment settings when running WRF compiled with Intel 2021.2.0. It really confuses me!

Below is some information about the platform I used:

IB driver: UCX 1.10.

$ucx_info -d | grep Transport
# Transport: posix
# Transport: sysv
# Transport: self
# Transport: tcp
# Transport: tcp
# Transport: rc_verbs
# Transport: rc_mlx5
# Transport: dc_mlx5
# Transport: ud_verbs
# Transport: ud_mlx5
# Transport: cma
# Transport: knem
# Transport: xpmem

$ibstat
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.29.2002
Hardware version: 0
Node GUID: 0x0c42a10300387d0c
System image GUID: 0x0c42a10300387d0c
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 749
LMC: 0
SM lid: 1988
Capability mask: 0x2651e848
Port GUID: 0x0c42a10300387d0c
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4123
Number of ports: 1
Firmware version: 20.29.2002
Hardware version: 0
Node GUID: 0x0c42a10300387d38
System image GUID: 0x0c42a10300387d38
Port 1:
State: Down
Physical state: Polling
Rate: 10
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x2651e848
Port GUID: 0x0c42a10300387d38
Link layer: InfiniBand

 

Thank you!

 

7 Replies
SantoshY_Intel
Moderator

Hi,


Thanks for reaching out to us.


>> Both runs completed successfully, but performance degraded with 11200 cores compared to 5600 cores, which should not happen since WRF has relatively good scalability.

-- Are you able to run with 11200 cores by using both 2018.3.22 and 2021.2.0 successfully?


If you are getting an error while using mlx, then try the following environment variable:

export UCX_TLS=rc,ud,sm,self


You can also try the FI provider "verbs" by setting the following variable:

export FI_PROVIDER=verbs
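
For example, a quick way to try both suggestions is sketched below (the launch line is only illustrative; the rank counts and binary path are placeholders):

# Option A: keep the mlx provider but restrict the UCX transports
export FI_PROVIDER=mlx
export UCX_TLS=rc,ud,sm,self

# Option B: switch to the verbs provider instead of mlx
# export FI_PROVIDER=verbs

# Basic Intel MPI debug output will confirm which provider was actually selected
export I_MPI_DEBUG=4
mpirun -np <num ranks> -ppn <ranks per node> ./wrf.exe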


Warm Regards,

Santosh


Jervie
Novice

Hi Santosh,

Thank you for your reply. I just found the reason for the failure of the WRF run with 11200 cores: I had used 11200 MPI processes, which exceeds the maximum number of processes WRF can use for this case. I did run the same case with 11200 cores successfully using both 2018.3.22 and 2021.2.0; however, both of those runs used 2800 MPI processes with 4 OpenMP threads each. (I deleted this issue from my original post since it may have made my question harder to read.)

The second issue still confuses me. I have no idea why performance degrades when scaling WRF from 5600 cores to 11200 cores. I also compared the results for 4984 cores and 9968 cores, and the 9968-core run shows worse performance than the 4984-core run. WRF was compiled with oneAPI in the runs mentioned above. However, I have previously run the same large case on another cluster with WRF compiled with Intel 2018, and there WRF showed very good scalability from 4,000 to 16,000 cores.

Compared to the cluster we used before, the current cluster uses the new oneAPI compiler as well as Mellanox ConnectX-6 adapters and the UCX 1.10 IB driver, which makes the situation more complicated to untangle.

I tried the FI provider "verbs", but the WRF run failed, so I kept using "mlx", which works.

Best,

Jervie

SantoshY_Intel
Moderator

Hi Jervie,

We would like some more information from you so that we can investigate this properly.

Could you please provide the information for the following questions?

  1. We need CPU information & skew metrics
  2. What is the Version and workload of WRF you are using?
  3. What option did you choose during the configuration of WRF? Example: 57.(smpar) 58. (dmpar) 59. (dm+sm)
  4. Since you used an OpenMP & MPI distribution, did you enable multithreading? Please specify the MPI & OpenMP distribution.
  5. Since you are asking about performance degradation, can you provide us the numbers or the percentage difference in performance?
  6. Why can you scale up to 16k cores with Intel 2018, but only up to 11k with Intel 2021? Why can't you scale up to 16k cores with 2021? Since you said 11.2k is the maximum for WRF, how could you use 16k with 2018?

Awaiting your reply.

Thanks & Regards,

Santosh

 

Jervie
Novice

Hi Santosh,

Please see below:

Reply to Q1 (We need CPU information & skew metrics)

$lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 1
Core(s) per socket: 28
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz
Stepping: 7
CPU MHz: 999.975
CPU max MHz: 4000.0000
CPU min MHz: 1000.0000
BogoMIPS: 5400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 39424K
NUMA node0 CPU(s): 0-27
NUMA node1 CPU(s): 28-55
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni spec_ctrl intel_stibp flush_l1d arch_capabilities

 

$ibstat

(output identical to the ibstat listing in my original post above)

 

$ucx_info -v

# UCT version=1.10.0 revision a212a09
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --with-cuda --without-gdrcopy --with-verbs --without-cm --with-knem --with-rdmacm --without-rocm --with-xpmem --without-ugni --with-cuda=/usr/local/cuda-10.2

 

Reply to Q2 (What is the Version and workload of WRF you are using?)

I'm using WRF 3.9.1. The domain has 1100 grid points from west to east and 1800 grid points from south to north at a resolution of 1 km, with 34 vertical layers.

 

Reply to Q3 (What option did you choose during the configuration of WRF? Example: 57.(smpar) 58. (dmpar) 59. (dm+sm))

I used dm+sm.

 

Reply to Q4 (Since you used an OpenMP & MPI distribution, did you enable multithreading? Please specify the MPI & OpenMP distribution)

Yes, I enabled MPI+OpenMP. I used Intel oneAPI MPI. I'm not sure about the OpenMP version, but I'm using gcc 4.8.5. The WRF model was compiled using Intel icc, ifort, and the oneAPI MPI compiler wrappers.
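
As a rough illustration of how the dm+sm binary is launched (not my exact job script; the stack size and affinity settings below are typical values rather than verified ones):

# Thread settings for the 2800-rank x 4-thread (11200-core) case; values are illustrative
export OMP_NUM_THREADS=4                       # 2 threads for the 4984/9968-core runs
export OMP_STACKSIZE=512M                      # WRF threads usually need a larger-than-default stack
export I_MPI_PIN_DOMAIN=omp                    # one pinning domain of OMP_NUM_THREADS cores per rank
export KMP_AFFINITY=granularity=fine,compact   # keep each rank's threads on adjacent cores
mpirun -np 2800 -ppn 14 -f ./hostfile ./wrf.exe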

 

Reply to Q5 (Since you are asking about performance degradation, can you provide us the numbers or the percentage difference in performance)

Compared to 2492 MPI ranks x 2 OpenMP threads (4984 cores), the performance with 4984 MPI ranks x 2 OpenMP threads (9968 cores) degraded by about 10%.

Reply to Q6 (Why can you scale up to 16k cores with Intel 2018, but only up to 11k with Intel 2021? Why can't you scale up to 16k cores with 2021? Since you said 11.2k is the maximum for WRF, how could you use 16k with 2018)

I may not have made myself clear before. My point is that WRF has relatively good scalability. I previously scaled WRF up to 16k cores using Intel 2018 on 6248R CPUs, and WRF performance kept improving as the core count increased. Therefore, it does not seem reasonable that the performance of the same WRF workload degrades when scaling from 4984 cores to 9968 cores using oneAPI on the 6258R.

clevels
Moderator

Hello Jervie, 

 

Can you please try to find at how many nodes it starts to fail? It is hard to investigate this issue at 11200 cores (200 nodes). I see 1 and 4 nodes have been tried. Could you try 8, 18, 32, 64, and 128 nodes?
Then, once you find the smallest scale at which it fails:
     2. Run a simple application (for example, hostname) at that scale to see whether all processes are launched.
     3. Run IMB-MPI1 at that scale: mpiexec -n <num hosts> -ppn <ppn> -hosts <hosts> IMB-MPI1 -npmin 1000000 barrier
     4. Run WRF with the I_MPI_DEBUG=4 variable. (A sketch of these steps is shown below.)
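
A minimal sketch of those checks, assuming an Intel MPI environment; the rank counts, host list, and WRF binary path are placeholders to fill in for the failing scale:

# 1) Confirm that all processes actually launch at this scale
mpiexec -n <num ranks> -ppn <ppn> -hosts <hosts> hostname | sort | uniq -c

# 2) Exercise the interconnect with the Intel MPI Benchmarks barrier test
mpiexec -n <num ranks> -ppn <ppn> -hosts <hosts> IMB-MPI1 -npmin 1000000 barrier

# 3) Re-run WRF with Intel MPI debug output (prints the selected provider and pinning map)
export I_MPI_DEBUG=4
mpiexec -n <num ranks> -ppn <ppn> -hosts <hosts> ./wrf.exe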

SantoshY_Intel
Moderator

Hi,


Thanks for the response.


We are investigating your issue and will be back soon.


Thanks & Regards,

Santosh



clevels
Moderator

Hey Jervie - please let me know if there is anything else I can do for you pertaining to this issue; otherwise I will have to close this ticket.

