Intel® oneAPI Math Kernel Library

PZGESVD sporadically stalls when using multiple compute nodes

John_Young
New Contributor I

Hi,

We have been seeing calls to PZGESVD sporadically stall. When the stall occurs, it is only for distributed matrices whose assigned MPI processes span multiple compute nodes; it never occurs when all of the assigned MPI processes reside on the same compute node. Even when the matrix's processes do span multiple nodes, PZGESVD may not stall (in fact, it usually doesn't). However, when the stall does occur, it is perfectly repeatable. We have observed the stall on three separate clusters. We have not tried PCGESVD.

 

Attached is a simple test case that reproduces the stall on the three clusters mentioned above. The test case uses 60 MPI processes, and the PZGESVD call is made on MPI processes 39 to 44. We ran the case on 1, 2, 3, 4, 5, 6, 10, and 12 compute nodes. The stall occurs for 3, 6, 10, and 12 compute nodes but not for 1, 2, 4, or 5. In the 1, 2, 4, and 5 node cases, MPI processes 39 to 44 all reside on the same compute node; in the 3, 6, 10, and 12 node cases, they span two compute nodes. Screen output is included for the 3-node case. Tests were performed with Intel oneAPI 2022.2. If we use a reference MPI/BLACS/ScaLAPACK stack instead of the Intel libraries, the stall in PZGESVD is never observed, so we suspect the issue is in the Intel MPI/BLACS/ScaLAPACK stack.
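For reference, one way to confirm which hosts MPI processes 39 to 44 land on (the layout that determines whether the stall can occur) is Intel MPI's debug output. The launch line below is only a sketch of the 3-node layout; the actual dispatch is done by the attached scripts.

    # Placement check (not part of the attached test case): with I_MPI_DEBUG >= 4,
    # Intel MPI prints a rank/host/pinning table at startup, which shows whether
    # ranks 39-44 share a compute node or span two of them.
    export I_MPI_DEBUG=5
    mpirun -n 60 -ppn 20 ./a.out    # 3-node case: 20 ranks per node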

 

John

 

 

ShivaniK_Intel
Moderator

Hi,


Thanks for posting in the Intel forums.


Thanks for providing the details.


Could you please provide the steps to reproduce the error at our end?


Could you also please provide the OS details and let us know the reference MPI/BLACS/ScaLAPACK library you are referring to?


Thanks & Regards

Shivani


John_Young
New Contributor I

Hi Shivani,

 

Thank you for looking at this for us. I stated in my original post that the test case failed on three clusters; actually, this turned out not to be true. The test case failed on one cluster as stated. On the other two clusters, the test case only failed with 10 compute nodes and 60 MPI processes for a matrix of size 51x45 (in groups.txt, change 10000 to 51 and 179 to 45).

 

Since there was some discrepancy in the failure among the three clusters we tested, I am attaching a new test case (test_case2.zip) that stalls on all three clusters using the same input parameter set. This test case stalls using 3 physical compute nodes with 60 MPI processes (20 processes per compute node, assigned sequentially). To reproduce the problem, compile test.F90 with the compile.sh script to generate the executable; you may have to modify the modules loaded to match your system. I dispatch the test on our cluster using the 'batch_mpi.sh' script, which in turn launches the 'intel_mpi.sh' script. Again, you may have to modify these two scripts for your test cluster. In batch_mpi.sh, the number of compute nodes is set by the variable 'pnodes' and the number of MPI processes by the variable 'vnodes'. The 'screen.txt' file contains the screen output exhibiting the stall on one of our clusters when using the Intel libraries (2022.2).
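For reference, the attached compile.sh and batch_mpi.sh are the authoritative scripts; in rough outline the build and launch look like the sketch below (the exact link line, module setup, and scheduler directives are assumptions and may differ from the attachment).

    # Sketch only -- see the attached compile.sh / batch_mpi.sh for the real scripts.
    # Build the Fortran test against MKL ScaLAPACK/BLACS with the Intel MPI wrapper:
    mpiifort test.F90 -o a.out -L${MKLROOT}/lib/intel64 \
        -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 \
        -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl

    # Launch 60 ranks over 3 nodes, 20 per node, assigned sequentially:
    mpirun -n 60 -ppn 20 ./a.out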

 

I will have to check on the OS version of the other two clusters. The primary cluster I tested on runs CentOS Linux 7 (Core) with kernel Linux 3.10.0-957.21.3.el7.x86_64.

 

For the test where I replaced the Intel libraries with the reference MPI/BLACS/ScaLAPACK (which did not exhibit any stall), I used OpenMPI 4.1.4 (https://www.open-mpi.org/software/ompi/v4.1/), ScaLAPACK 2.2.0 (https://netlib.org/scalapack/#_scalapack_version_2_2_0), and LAPACK 3.10.1 (https://netlib.org/lapack/). I used the BLACS included with the ScaLAPACK library. All libraries were compiled with icc and ifort from Intel oneAPI 2022.2.
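In outline, the reference stack was built roughly as follows (the configure/CMake options shown here are a sketch, not the exact commands used).

    # OpenMPI 4.1.4 built with the Intel compilers:
    ./configure CC=icc CXX=icpc FC=ifort --prefix=$HOME/openmpi-4.1.4
    make -j && make install

    # Netlib ScaLAPACK 2.2.0 (which bundles BLACS) built against that OpenMPI via CMake;
    # point CMake at the reference LAPACK 3.10.1 build via BLAS_LIBRARIES/LAPACK_LIBRARIES
    # if it is not picked up automatically:
    cmake -DCMAKE_Fortran_COMPILER=mpifort -DCMAKE_C_COMPILER=mpicc ..
    make -j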

John_Young
New Contributor I

To follow up: the other two clusters we have tested on run:

CentOS Linux release 7.9.2009 (Core)

kernel: Linux 3.10.0-1160.76.1.el7.x86_64 x86_64.

ShivaniK_Intel
Moderator

Hi,


Could you please export I_MPI_HYDRA_TOPOLIB=hwloc and FI_PROVIDER=tcp/verbs/mlx and try running your application?


If you still face any issues, could you please provide the complete debug log exporting FI_LOG_LEVEL=debug?


Could you also please provide the output of the lscpu command?
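Putting those requests together, the runs might look like the sketch below (the launch line assumes the 3-node, 60-rank layout of the attached test case).

    export I_MPI_HYDRA_TOPOLIB=hwloc
    export FI_PROVIDER=mlx        # repeat the run with tcp and verbs as well
    export FI_LOG_LEVEL=debug     # capture the libfabric debug log if the stall persists
    mpirun -n 60 -ppn 20 ./a.out 2>&1 | tee fi_debug_mlx.log
    lscpu > lscpu.txt             # CPU/topology details for one compute node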


Thanks & Regards

Shivani


John_Young
New Contributor I

Shivani,

 

I apologize for not replying sooner, but our cluster was down for maintenance last week. Using I_MPI_HYDRA_TOPOLIB=hwloc did not resolve the stall we are seeing. The stall occurs with the verbs and mlx providers but not with the tcp provider. The choice of 'hwloc' or 'ipl' does not affect whether the stall occurs. I have attached the output with FI_LOG_LEVEL=debug for all three providers in the screen_output.zip file.

 

The output of lscpu for one of our compute nodes is below.

 

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 1
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
Stepping: 7
CPU MHz: 2100.000
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp_epp pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities

 

 

ShivaniK_Intel
Moderator

 

Hi,

 

Thanks for providing the details.

 

Could you please let us know whether you are facing a similar issue with FI_PROVIDER=mlx while running the MPI benchmark?

 

Command: mpirun -n <no. of processes> -ppn 1 -f hostfile IMB-MPI1 allreduce

 

Thanks & Regards

Shivani

 

John_Young
New Contributor I

Shivani,

 

I ran

 

    mpirun -n 60 -ppn 1  IMB-MPI1 allreduce

 

across 3 compute nodes with FI_PROVIDER=mlx, and the test completed successfully without stalling.

 

Thanks,

John

ShivaniK_Intel
Moderator

Hi,


Thank you for your patience and for providing us with the details.


Could you please provide us with the logs using the Intel cluster checker?


For more details regarding the Intel cluster checker please refer to the below link


https://www.intel.com/content/www/us/en/develop/documentation/cluster-checker-user-guide/top/getting-started.html


Could you also please use the -check_mpi flag with the mpirun command and provide us with the output logs?
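For example, the correctness-checking run might look like the sketch below (-check_mpi requires the Intel Trace Analyzer and Collector environment to be sourced; the launch line assumes the stalling 3-node, 60-rank layout).

    # Same layout as the stalling case, with MPI correctness checking enabled:
    mpirun -check_mpi -n 60 -ppn 20 ./a.out 2>&1 | tee check_mpi.log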


Thanks & Regards

Shivani


John_Young
New Contributor I

Shivani,

I will send the Intel cluster checker logs later.  

Attached is the log with "-check_mpi" for the test case (3 compute nodes, 60 MPI processes) that has been hanging. When using "-check_mpi", the simulation hangs for a bit and then actually completes. On the nodes that are hanging, memory overlap warnings are emitted (the log file has the complete warning messages):

 

[42] WARNING: LOCAL:MEMORY:OVERLAP: warning
[42] WARNING: New send buffer overlaps with currently active receive buffer at address 0x1904798.
[42] WARNING: Control over active buffer was transferred to MPI at:
[42] WARNING: MPI_Irecv(*buf=0x1904798, count=1, datatype=0xcc000000, source=0, tag=1, comm=0xffffffffc400004c CREATE DUP CREATE COMM_WORLD [39:44], *request=0x7fff05c65f70)
[42] WARNING: Cigamx2d (/opt/ohpc/pub/intel/oneapi/mkl/2022.2.0/lib/intel64/libmkl_blacs_intelmpi_lp64.so.2)
[42] WARNING: Csgamn2d (/opt/ohpc/pub/intel/oneapi/mkl/2022.2.0/lib/intel64/libmkl_blacs_intelmpi_lp64.so.2)
[42] WARNING: Cdgamn2d (/opt/ohpc/pub/intel/oneapi/mkl/2022.2.0/lib/intel64/libmkl_blacs_intelmpi_lp64.so.2)
[42] WARNING: Csgerv2d (/opt/ohpc/pub/intel/oneapi/mkl/2022.2.0/lib/intel64/libmkl_blacs_intelmpi_lp64.so.2)
[42] WARNING: pcagemv_ (/opt/ohpc/pub/intel/oneapi/mkl/2022.2.0/lib/intel64/libmkl_scalapack_lp64.so.2)
[42] WARNING: pzpoequ_ (/opt/ohpc/pub/intel/oneapi/mkl/2022.2.0/lib/intel64/libmkl_scalapack_lp64.so.2)
[42] WARNING: MAIN__ (/mnt/gpfs2_4m/scratch/xxxxxxx/data/code/pxgesvd/to_intel/test.F90:437)
[42] WARNING: main (/mnt/gpfs2_4m/scratch/xxxxxxx/data/code/pxgesvd/to_intel/a.out)
[42] WARNING: __libc_start_main (/usr/lib64/libc-2.17.so)
[42] WARNING: Control over new buffer is about to be transferred to MPI at:
[42] WARNING: MPI_Send(*buf=0x1904798, count=1, datatype=0xcc000000, dest=0, tag=0, comm=0xffffffffc400004c CREATE DUP CREATE COMM_WORLD [39:44])
[42] WARNING: Csgamx2d (/opt/ohpc/pub/intel/oneapi/mkl/2022.2.0/lib/intel64/libmkl_blacs_intelmpi_lp64.so.2)
[42] WARNING: Cdgamn2d (/opt/ohpc/pub/intel/oneapi/mkl/2022.2.0/lib/intel64/libmkl_blacs_intelmpi_lp64.so.2)
[42] WARNING: Csgerv2d (/opt/ohpc/pub/intel/oneapi/mkl/2022.2.0/lib/intel64/libmkl_blacs_intelmpi_lp64.so.2)
[42] WARNING: pcagemv_ (/opt/ohpc/pub/intel/oneapi/mkl/2022.2.0/lib/intel64/libmkl_scalapack_lp64.so.2)
[42] WARNING: pzpoequ_ (/opt/ohpc/pub/intel/oneapi/mkl/2022.2.0/lib/intel64/libmkl_scalapack_lp64.so.2)
[42] WARNING: MAIN__ (/mnt/gpfs2_4m/scratch/xxxxxxx/data/code/pxgesvd/to_intel/test.F90:437)
[42] WARNING: main (/mnt/gpfs2_4m/scratch/xxxxxxx/data/code/pxgesvd/to_intel/a.out)
[42] WARNING: __libc_start_main (/usr/lib64/libc-2.17.so)

and then the simulation finishes.

John_Young
New Contributor I

Shivani,

 

Attached are the Intel cluster checker logs run on 3 compute nodes. I do not know why the cluster checker complains about 'lscpu', as it is definitely installed and working on our cluster.

 

Thanks,

John

John_Young
New Contributor I

I also attached to one of the stalled processes (for the 3-node, 60 MPI process test case) using gdb. The backtrace from gdb is in the attached screenshot.
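For reference, the attach was along these lines (a sketch; replace the <PID> placeholder with the process ID of the stalled rank).

    # On the compute node hosting a stalled rank:
    pidof a.out                   # list candidate PIDs of the test executable
    gdb -p <PID> -batch -ex bt    # attach and print the backtrace of the stalled process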

 

 

 

ShivaniK_Intel
Moderator

Hi,


Thanks for providing the details.


We are working on it and will get back to you.


Thanks & Regards

Shivani


ShivaniK_Intel
Moderator

Hi,


Could you please let us know whether you face a similar issue with FI_PROVIDER=mlx while running the MPI benchmark without restricting it to -ppn 1? Basically, just replace "./a.out" in the batch_mpi.sh script with "IMB-MPI1".


Could you also please try exporting I_MPI_FABRICS=ofi and let us know the output?


Command: export I_MPI_FABRICS=ofi


Could you please update the UCX/Mellanox driver stack and the kernel, and then try running the test case again?
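Concretely, the requested runs might look like the sketch below (same 3-node, 60-rank layout as before).

    # 1. MPI benchmark with the same layout as the failing job
    #    (i.e. replace ./a.out with IMB-MPI1 in batch_mpi.sh):
    mpirun -n 60 -ppn 20 IMB-MPI1 allreduce

    # 2. Original test case over the OFI fabric only (no intra-node shared-memory path):
    export I_MPI_FABRICS=ofi
    mpirun -n 60 -ppn 20 ./a.out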


Thanks & Regards

Shivani


John_Young
New Contributor I

Shivani,

 

Thank you for continuing to look into this. 

 

1. Changing batch_mpi.sh to run IMB-MPI1 (instead of ./a.out) works fine without any issue (no hangs).

 

2. I will have to ask our cluster admin to update the UCX/Mellanox stack (although I believe this may have been done in the past few months anyway).

 

3. If I run the test case that was hanging using only the "ofi" fabric instead of the "shm:ofi" fabric, the test case completes normally without hanging. This seems like progress on the issue. I will run our full code next week and see if using only "ofi" prevents the hang. The questions then are (1) what does it mean that "shm:ofi" hangs and "ofi" does not, and (2) is there a performance penalty to using only "ofi" instead of "shm:ofi"?

 

Thanks,

John

ShivaniK_Intel
Moderator

Hi,


Could you please export I_MPI_SHM_EAGER_NUM=1, clear the I_MPI_FABRICS=ofi setting, and run the test again? Let us know whether this resolves your issue.
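That is, roughly (a sketch; batch_mpi.sh remains the actual launcher):

    unset I_MPI_FABRICS             # revert to the default shm:ofi fabric
    export I_MPI_SHM_EAGER_NUM=1    # the suggested shared-memory eager-message setting
    mpirun -n 60 -ppn 20 ./a.out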


Thanks & Regards

Shivani


John_Young
New Contributor I

Shivani,

 

For the test case we have been looking at (attached to this thread), here are the results

 

FABRIC     I_MPI_SHM_EAGER_NUM=1     STALLS
shm:ofi    NO                        YES
shm:ofi    YES                       NO
ofi        NO                        NO
ofi        YES                       NO

 

We have also tested our production code on one of the problems that was stalling with "shm:ofi". Using only "ofi", or using "shm:ofi" with the suggested eager setting, prevents the stall and the program completes normally.

 

It seems like we have a couple of solutions now. So that we can better understand the problem and the solution, could you provide an explanation of what is happening?

 

Thank you for your help with this problem.

John

ShivaniK_Intel
Moderator

Hi,


We are working on it and will get back to you.


Thanks & Regards

Shivani


ShivaniK_Intel
Moderator

Hi,


We are escalating this issue internally to the relevant team.


Thanks & Regards

Shivani


VeenaJ_Intel
Moderator

Hi,

 

A fix is available in 2021.11. Please download it and let us know if this resolves your issue. We will be closing this thread from our side; if the issue still persists with the new release, please create a new thread for us to investigate.

 

Regards,

Veena

 
