While testing the scalability of the Quantum Espresso HPC software package, I stumbled on a very strange and annoying problem: when jobs are run on too many cores, they undergo "sudden death" at some point. "Sudden death" means the job stops with no error message at all and no core dump. "Too many" and "some point" mean: if the job is run with parallelization parameters above a given limit, it stops during a given cycle; the higher the parameters, the sooner it stops. The -np option is the most influential parameter.
I'm compiling and running the PWscf software from Quantum Espresso 6.2.1. The server is one NUMA node with 4 sockets equipped with "Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz" and, initially, 1 DIMM of 16 GB on channel 0 of each socket.
Initially I used Parallel Studio 2019 Initial Release to compile and run, with only Intel MPI parallelization (no threading). Years ago we added these variables to the starting script: OMPI_MCA_mpi_yield_when_idle=1 OMPI_MCA_mpi_paffinity_alone=1 OMP_NUM_THREADS=1 (I guess they're irrelevant, but they're here).
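For reference, a minimal sketch of the relevant part of our starting script (just the three variables listed above; everything else about the script is omitted):

```shell
#!/bin/sh
# Legacy Open MPI-style variables kept from an earlier setup; with Intel MPI
# they are most likely ignored, but they are present in every run.
export OMPI_MCA_mpi_yield_when_idle=1
export OMPI_MCA_mpi_paffinity_alone=1
# Disable OpenMP threading so only MPI parallelism is in play.
export OMP_NUM_THREADS=1
```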
To ensure isolation of the PWscf tasks from other software running on the node, I use the "cset" package tools (Linux Mint 19, clone of Ubuntu 18.04). Sockets 1-3 are devoted to PWscf, socket 0 is devoted to OS and other software. All tests are done with a multiple-of-3 number N of tasks, globally tied to N cores evenly distributed on sockets 1-3 (I don't do anything special to bind each task to a specific core).
I have a reproducible test case: repetition of a few examples showed that the running time is precise to approx. ±10s (it ranges from approx. 1h30 to 3h depending on N) for jobs that complete. For failing jobs, the failure always occurs in the same cycle, ±1, for given values of the -np mpirun option and of the -ndiag PWscf option.
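For concreteness, here is a sketch of how a single run in these series is launched (the input and output file names, and the path to pw.x, are placeholders; -np and -ndiag are the two values varied between runs):

```shell
# Launch PWscf with NP MPI tasks; -ndiag sets the size of the group of
# processes used for parallel diagonalization inside pw.x.
NP=54
NDIAG=36
mpirun -np "$NP" ./bin/pw.x -ndiag "$NDIAG" -inp scf.in \
    > "scf-np${NP}-ndiag${NDIAG}.out"
```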
First series: all jobs with -np values ≥51 fail, irrespective of -ndiag; they fail in the 16th or 17th cycle, irrespective of -ndiag.
We then upgraded the node to 3 DIMMs of 16, 8, and 8 GB on channels 0, 1, and 2 of each socket.
Second series: timings are better (confirming our hypothesis of bandwidth limiting performance, and of our provider misconfiguring the node by populating only 1 DIMM per socket). Jobs complete with -np values up to 54 or 57, depending on the -ndiag value. Failing jobs fail sooner when -np is higher. E.g. -np 57 leads to failure in the 59th cycle, -np 72 to failure in the 44th cycle (for a given -ndiag).
Thus it seems like the problem has to do with the amount of memory.
I then thought I'd try with an updated Parallel Studio, but stumbled on this bug I reported. With the workaround suggested there (I_MPI_HYDRA_TOPOLIB=ipl), I ran a third series, PWscf compiled with PS 2019 update 3: things are worse, the program outputs nothing at all, although the tasks use 100% CPU!
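The workaround from that bug report is a single environment variable, set before launching the job (the launch command itself is unchanged):

```shell
# Workaround for the Hydra topology-detection bug in PS 2019 Update 3:
# fall back to the "ipl" topology library instead of the default hwloc path.
export I_MPI_HYDRA_TOPOLIB=ipl
```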
Fourth series: PWscf compiled with PS 2019 update 1: with I_MPI_HYDRA_TOPOLIB=ipl the program runs as usual, but the sudden deaths are still here and now they look quite unpredictable (yet still reproducible): -np 33 fails but -np 57 (or 60) completes (for a given -ndiag). I haven't done extensive tests for all -np values, but I'm not very inclined to do so since this series seems to have worse outcomes.
Please advise on what to do next. Thanks.
Can you run with the debug configuration of the Intel MPI Library? We will probably get an assert there.
$ . mpivars.sh debug
BTW, how many sockets are on your system?
Best regards, Yury
Only the Composer Edition of Intel(R) Parallel Studio XE 2019 Update 4 was released last week. The Professional and Cluster Editions will be released soon.
Sorry I can't share the pre-release SW.
Best regards, Yury.
Thanks to Anatoliy in the other thread, I can "reliably" run MPI tasks compiled with the Parallel Studio 2019 updates by using the legacy mpiexec.hydra.
So I recompiled the PWscf program with PS 2019 Update 4, and the same thing happens as with Update 3: the program seems to run, occupying 100% of the allocated CPUs, but outputs nothing (it should be verbose on stdout and produce many output files)!
What do you advise?
Some more findings. As in the other thread, the problem is linked to using CPU sets (which I use to isolate tasks CPU-wise and RAM-wise, tell me if there's another way), but the mere fact of being in a CPU set is not enough: see the following detailed test.
The system has four sockets of 24 cores each. To isolate tasks I create two CPU sets: "system" for socket 0 (CPUs 0, 4, 8, etc. and memory node 0) and "test" for sockets 1 to 3 (CPUs 1-3, 5-7, etc. and memory nodes 1-3).
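Roughly, the two sets are created like this (a sketch using the cset command-line tools; requires root, and it assumes the interleaved CPU numbering of this node, where socket 0 owns CPUs 0, 4, 8, ..., 92):

```shell
# Build the CPU lists: socket 0 is every fourth CPU starting at 0;
# sockets 1-3 are all the others (96 CPUs total on this node).
SYS_CPUS=$(seq -s, 0 4 92)
TEST_CPUS=$(seq 0 95 | awk '$1 % 4 != 0' | paste -sd,)

# "system" set: socket 0 plus memory node 0, for the OS and other software.
cset set --cpu="$SYS_CPUS" --mem=0 --set=system
# "test" set: sockets 1-3 plus memory nodes 1-3, reserved for PWscf.
cset set --cpu="$TEST_CPUS" --mem=1-3 --set=test

# PWscf is then launched inside the "test" set, e.g.:
cset proc --set=test --exec -- mpirun -np 54 ./bin/pw.x -inp scf.in
```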
Running PWscf works in any of these configurations:
- outside any CPU set;
- in set "system";
- in set "system" without CPU 0 (that is, CPUs 4, 8, etc.);
- in set "system" with additional CPUs 1-3 (that is, CPUs 0-4, 8, etc.), even with unrestricted memory (that is, CPUs 0-4, 8, etc. and memory nodes 0-3).
In these working cases, the CPU set contains the following tasks: mpiexec.hydra, pmi_proxy, and three times as many "pw.x" entries as specified by the -np argument to mpiexec.hydra (e.g. -np 54 → 162); "ps H" says:
10354 pts/5    Rl     2:00 /home/lucas/q-e-qe-6.2.1/bin/pw.x
10354 pts/5    Sl     0:00 /home/lucas/q-e-qe-6.2.1/bin/pw.x
10354 pts/5    Sl     0:00 /home/lucas/q-e-qe-6.2.1/bin/pw.x
(etc.: this pattern repeats "-np" times, for a total of 3 × "-np" lines)
Running in set "test" does not work, and neither does it when memory is unrestricted (a set with CPUs 1-3, 5-7, etc. and all memory nodes). In these cases the program stalls before outputting anything. The CPU set then contains the following tasks: mpiexec.hydra, pmi_proxy, and only as many "pw.x" entries as specified by the -np argument to mpiexec.hydra; "ps H" says:
10709 pts/5    R      0:10 /home/lucas/q-e-qe-6.2.1/bin/pw.x
(etc.: "-np" lines in total)
Running in set "test" plus CPU 0 (that is, CPUs 0-3, 5-7, etc.) "nearly works": the program starts and outputs some information, then stalls before doing any computation. The CPU set then contains as many tasks as in the working cases.
Any help is still welcome.
Sorry for the delay. Could you set the I_MPI_DEBUG=10 variable? After that you should see some additional debug output. Please send it to me.
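For the record, that is again a single environment variable set before the launch command:

```shell
# Ask Intel MPI for verbose diagnostics: level 10 prints library version,
# process pinning, and fabric/transport selection details at startup.
export I_MPI_DEBUG=10
```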
Best regards, Anatoliy.