Hello,
Starting from Parallel Studio 2019 Update 1, mpiexec fails to run any executable. Example: "mpiexec -np 1 /bin/ls". Any call to mpiexec (except calls like "mpiexec -help") results in Segmentation fault.
Please help. I can provide additional information if necessary. However, testing is a bit complicated: I had to revert to the initial release, and updates cannot be installed concurrently AFAIK, so please request testing only if absolutely necessary.
Note 1: It is on Linux Mint 19. As you may know, this distribution is heavily based on Ubuntu 18.04. By "heavily" I mean that only cosmetic packages differ, like the desktop environment packages. System packages (libc and the like) are taken directly from the Ubuntu repositories.
Note 2: This problem was originally reported in the C++ compiler forum, here. It was spotted on Opensuse (which shares most code with SLES, a distribution completely independent of Ubuntu).
Please try
I_MPI_HYDRA_TOPOLIB=ipl mpiexec.hydra --verbose -n 1 /bin/ls
and if it fails too, could you provide a coredump or run
( export I_MPI_HYDRA_TOPOLIB=ipl; gdb --args mpiexec.hydra --verbose -n 1 /bin/ls )
and post a backtrace?
Hi,
Thank you for answering.
Maksim B. (Intel) wrote: Please try
I_MPI_HYDRA_TOPOLIB=ipl mpiexec.hydra --verbose -n 1 /bin/ls
It works and outputs:
[mpiexec@maxwell] Launch arguments: /opt/intel/compilers_and_libraries_2019.3.199/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-fd 7 --pgid 0 --proxy-id 0 --node-id 0 --launcher ssh --base-path /opt/intel/compilers_and_libraries_2019.3.199/linux/mpi/intel64/bin/ --subtree-size 1 --tree-width 16 --tree-level 1 --time-left -1 --debug /opt/intel/compilers_and_libraries_2019.3.199/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1
<directory contents>
So what now?
This means that the default machine topology detection library, hwloc, crashes on your system. You can switch to Update 3, which contains related fixes, if convenient, or run
export I_MPI_HYDRA_TOPOLIB=ipl;
at some point before starting mpirun.
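For convenience, the workaround can be kept in a tiny helper so every launch picks it up automatically. A minimal sketch (the `run_with_ipl` name is made up for illustration; it is not part of Intel MPI):

```shell
# run_with_ipl: force the ipl topology library (instead of the crashing
# hwloc) for a single launch, then invoke the given launcher with the
# remaining arguments. Hypothetical helper for illustration only.
run_with_ipl() {
    launcher=$1
    shift
    # The env assignment applies only to this one launch, so the rest of
    # the shell session is unaffected.
    I_MPI_HYDRA_TOPOLIB=ipl "$launcher" "$@"
}

# Usage on the affected machine would look like:
# run_with_ipl mpirun -n 4 ./a.out
```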
I'm already running Update 3!
Thank you. Is there anything to do at compile time?
No, compile-time switches do not affect the process manager.
Hi,
It looks like this impacts performance. Is this expected?
I compared the running times of the Quantum Espresso "pw" binary: (i) compiled with 2019 initial release and launched with the default topo lib, and (ii) compiled with 2019 update 1 and launched with the ipl topo lib. Results on three test cases, which differ only by parallelization parameters:
- case 1: (i) 2h24 (ii) 2h08
- case 2: (i) 2h29 (ii) 3h00
- case 3: (i) 1h20 (ii) 1h26
(note: the variability of these times is less than 1 minute so the differences are significant)
I should add that I'm setting CPU and memory affinity using the "cset" package tools, to ensure the software is running alone on reserved sockets of the server.
Hello,
(Not sure if I should have started a new thread, forum moderators feel free to split this.)
Update on the situation. I have just installed PS XE 2019 Update 4 Cluster Edition and it is worse.
- "raw" mpiexec still does not work
- using I_MPI_HYDRA_TOPOLIB=ipl now fails for some -n values, but not all! Some tests:
(cd to some dir containing only a dir called "rpm")
> I_MPI_HYDRA_TOPOLIB=ipl mpiexec.hydra --verbose -n 1 /bin/ls
[mpiexec@maxwell] Launch arguments: /opt/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host maxwell --upstream-port 45565 --pgid 0 --launcher ssh --launcher-number 0 --base-path /opt/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /opt/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
rpm
> I_MPI_HYDRA_TOPOLIB=ipl mpiexec.hydra -n 2 /bin/ls
malloc(): memory corruption
[mpiexec@maxwell] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:538): downstream from host maxwell was killed by signal 6 (Aborted)
[mpiexec@maxwell] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:2041): assert (exitcodes != NULL) failed
(same for -n ranging from 3 to 5; from 10 to 12; from 17 to 24)
> I_MPI_HYDRA_TOPOLIB=ipl mpiexec.hydra -n 6 /bin/ls
rpm
rpm
rpm
rpm
rpm
rpm
(OK as well for -n from 7 to 9, and from 25 upwards AFAICT)
> I_MPI_HYDRA_TOPOLIB=ipl mpiexec.hydra -n 13 /bin/ls
rpm
rpm
rpm
rpm
rpm
rpm
rpm
rpm
rpm
double free or corruption (out)
[mpiexec@maxwell] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:538): downstream from host maxwell was killed by signal 6 (Aborted)
[mpiexec@maxwell] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:2041): assert (exitcodes != NULL) failed
(same kind of failure with partial execution for -n from 14 to 16)
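Runs like the ones above can be automated to map out exactly which process counts crash. A rough sketch (the `scan_n` helper is illustrative only; the launcher is passed in explicitly so it can be swapped):

```shell
# scan_n START END LAUNCHER -- try each -n value in [START, END] with the
# given launcher and report which ones fail. Sketch, not an official tool.
scan_n() {
    start=$1; end=$2; launcher=$3
    n=$start
    while [ "$n" -le "$end" ]; do
        # Discard output; only the exit status matters here.
        if I_MPI_HYDRA_TOPOLIB=ipl "$launcher" -n "$n" /bin/ls >/dev/null 2>&1; then
            echo "n=$n ok"
        else
            echo "n=$n FAILED"
        fi
        n=$((n + 1))
    done
}

# On the affected machine one would run, e.g.:
# scan_n 1 32 mpiexec.hydra
```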
See also the bug report I filed, which is why I installed Update 4 (it supposedly contains a fix for 4-socket systems).
Please help.
Hi, Lucas.
Could you set the HYDRA_BSTRAP_VALGRIND=1 variable, run the failing case, and send me the output? It may help pinpoint where the memory corruption and double free occur.
Also, from your message above I see that you use cset to set CPU and memory affinity. Could you run without cset to check whether it has an effect?
--
Best regards, Anatoliy.
Hi Anatoliy,
Thank you for helping.
Anatoliy R. (Intel) wrote: Could you set HYDRA_BSTRAP_VALGRIND=1 variable, run failed case and provide me the output? It can help to understand where we have memory corruption and double free or corruption.
> HYDRA_BSTRAP_VALGRIND=1 I_MPI_HYDRA_TOPOLIB=ipl mpiexec.hydra -n 2 /bin/ls >& /tmp/log.log
Also from the letter above, I see that you use cset for setting CPU and memory affinity. Could you run without "cset" to check that it does not affect?
You hit the spot: it does have an effect. There is no error when running "outside" the cpusets (i.e. in the root cpuset). But I need to be able to isolate tasks; is there a way other than cpusets, if the mpiexec bug cannot be fixed quickly?
I did not find anything wrong in the valgrind output.
We can also try the HYDRA_BSTRAP_XTERM=1 variable, which will open xterm windows with gdb launched inside.
Please set this variable and run mpirun. You will then see the xterm windows with gdb running. Type `run` in each window; the failure will show up in one of them. Then type `bt` to get the backtrace, and please send it to me.
But I need to be able to isolate tasks; is there a way other than cpusets, if the mpiexec bug cannot be fixed quickly?
You can specify which CPUs MPI processes run on via the I_MPI_PIN* variables. For example, I_MPI_PIN_PROCESSOR_LIST=0,1 will run rank 0 on CPU 0 and rank 1 on CPU 1:
$ I_MPI_PIN_PROCESSOR_LIST=0,1 mpiexec -n 2 -genv I_MPI_DEBUG=4 ./test.exe
..
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 64298 host 0
[0] MPI startup(): 1 64299 host 1
Anatoliy R. (Intel) wrote: I did not find something wrong in valgrind output.
We can also try HYDRA_BSTRAP_XTERM=1 variable, that will run xterm windows with launched gdb.
Please set this variable and run mpirun. After that you will see xterm windows with launched gdb. Then type `run` in each windows and you will see fail in one of the windows. Then type `bt`, it will show backtrace. Please send me this backtrace.
The "-n 2" run failed in the second xterm. Backtrace:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff704b801 in __GI_abort () at abort.c:79
#2 0x00007ffff7094897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff71c1b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff709b90a in malloc_printerr (str=str@entry=0x7ffff71bfe0e "malloc(): memory corruption") at malloc.c:5350
#4 0x00007ffff709f994 in _int_malloc (av=av@entry=0x7ffff73f6c40 <main_arena>, bytes=bytes@entry=8) at malloc.c:3738
#5 0x00007ffff70a20fc in __GI___libc_malloc (bytes=8) at malloc.c:3057
#6 0x0000000000435b47 in ipl_domain_ordering (info=0x2, ord=0x7fffffffa930, lord=0) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_service.c:1505
#7 0x000000000043a02d in ipl_create_domains (pi=0x2, scale=-22224) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_service.c:2233
#8 0x00000000004345e7 in ipl_one_to_many_pinning (info=0x2) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_service.c:2281
#9 0x0000000000444fa4 in i_mpi_bind_init (binding=0x2 <error: Cannot access memory at address 0x2>, bindlib=0x7fffffffa930 "", map=0x0, nrank=2) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/i_mpi_bind.c:372
#10 0x000000000040a563 in launch_processes () at ../../../../../src/pm/i_hydra/proxy/proxy.c:387
#11 0x0000000000408785 in main (argc=2, argv=0x7fffffffa930) at ../../../../../src/pm/i_hydra/proxy/proxy.c:895
But I need to be able to isolate tasks; is there a way other than cpusets, if the mpiexec bug cannot be fixed quickly?
You can specify on which cpus to run mpi processes via I_MPI_PIN* variables. For example I_MPI_PIN_PROCESSOR_LIST=0,1 will run rank 0 on cpu 0 and rank 1 on cpu 1:
$ I_MPI_PIN_PROCESSOR_LIST=0,1 mpiexec -n 2 -genv I_MPI_DEBUG=4 ./test.exe
..
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 64298 host 0
[0] MPI startup(): 1 64299 host 1
I'd like the process to have exclusive access to some sockets (CPU+memory); with cset I can both pin the computation task to sockets 1-3 and pin all other tasks (user and system tasks) to socket 0. If I pin the computation task to sockets 1-3 with your method, I suppose the CPU allocator of the OS will "naturally" move other tasks to socket 0 (if I disable hyperthreading); but what will the RAM allocator do? We are investigating RAM bandwidth issues, so I have to prevent other tasks from using the RAM of sockets 1-3.
Thanks again for your help.
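As a side note on the isolation question: numactl is the usual command-line way to bind both the CPUs and the memory allocations of a launched job to chosen NUMA nodes. Unlike cset it does not evict other tasks from those nodes, so it covers only the job side of the isolation described above. A sketch, where the `isolate_cmd` helper and the rank count are purely illustrative (on a 4-socket machine with one NUMA node per socket, nodes 1-3 correspond to sockets 1-3):

```shell
# isolate_cmd NODES CMD... : print the numactl invocation that binds the
# command's CPUs and memory allocations to the given NUMA node list.
# Hypothetical helper for illustration; numactl itself is a standard tool.
isolate_cmd() {
    nodes=$1
    shift
    echo numactl --cpunodebind="$nodes" --membind="$nodes" "$@"
}

# Example: build the command that would pin an MPI job to nodes 1-3:
# isolate_cmd 1-3 mpirun -n 72 ./pw.x
# -> numactl --cpunodebind=1-3 --membind=1-3 mpirun -n 72 ./pw.x
```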
Thank you for the backtrace. I will look into what might be wrong.
As a workaround you can use legacy hydra process manager.
Please try to run `PATH=${I_MPI_ROOT}/intel64/bin/legacy:${PATH} mpiexec.hydra ...`
--
Best regards, Anatoliy.
Anatoliy R. (Intel) wrote: As a workaround you can use legacy hydra process manager.
Please try to run `PATH=${I_MPI_ROOT}/intel64/bin/legacy:${PATH} mpiexec.hydra ...`
Great, it works! And I don't have to set the "topolib" anymore.
Yes, ipl is the default topolib in the legacy hydra.
I will create a ticket for the new hydra process manager.
--
Best regards, Anatoliy.
Hi, Lucas
Could you also run lscpu?
--
Best regards, Anatoliy.
Hi Anatoliy,
Here you go:
> lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              192
On-line CPU(s) list: 0-191
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           4
NUMA node(s):        4
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
Stepping:            4
CPU MHz:             3000.000
BogoMIPS:            5400.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            33792K
NUMA node0 CPU(s):   0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188
NUMA node1 CPU(s):   1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,101,105,109,113,117,121,125,129,133,137,141,145,149,153,157,161,165,169,173,177,181,185,189
NUMA node2 CPU(s):   2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,98,102,106,110,114,118,122,126,130,134,138,142,146,150,154,158,162,166,170,174,178,182,186,190
NUMA node3 CPU(s):   3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79,83,87,91,95,99,103,107,111,115,119,123,127,131,135,139,143,147,151,155,159,163,167,171,175,179,183,187,191
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d
(Edit: last lines looked truncated, I wrapped them.)
Also, please have a look at my recent comment in the other thread, because my other problem turns out to be linked to CPU sets as well, and I have done some detailed testing: some CPU sets work but not all!
Hi Lucas,
Could you check 2019 Update 5? The issue below should be fixed there.
The "-n 2" run failed in the second xterm. Backtrace:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff704b801 in __GI_abort () at abort.c:79
#2 0x00007ffff7094897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff71c1b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff709b90a in malloc_printerr ( str=str@entry=0x7ffff71bfe0e "malloc(): memory corruption") at malloc.c:5350
#4 0x00007ffff709f994 in _int_malloc (av=av@entry=0x7ffff73f6c40 <main_arena>, bytes=bytes@entry=8) at malloc.c:3738
#5 0x00007ffff70a20fc in __GI___libc_malloc (bytes=8) at malloc.c:3057
#6 0x0000000000435b47 in ipl_domain_ordering (info=0x2, ord=0x7fffffffa930, lord=0) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_service.c:1505
#7 0x000000000043a02d in ipl_create_domains (pi=0x2, scale=-22224) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_service.c:2233
#8 0x00000000004345e7 in ipl_one_to_many_pinning (info=0x2) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_service.c:2281
#9 0x0000000000444fa4 in i_mpi_bind_init ( binding=0x2 <error: Cannot access memory at address 0x2>, bindlib=0x7fffffffa930 "", map=0x0, nrank=2) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/i_mpi_bind.c:372
#10 0x000000000040a563 in launch_processes () at ../../../../../src/pm/i_hydra/proxy/proxy.c:387
#11 0x0000000000408785 in main (argc=2, argv=0x7fffffffa930) at ../../../../../src/pm/i_hydra/proxy/proxy.c:895
Hello Anatoliy,
Thank you very much for your followup. I keep this topic bookmarked and will test as soon as possible, but it might have to wait for one or two months (the server is currently loaded for production, not the right time to run tests :-) ).
We're seeing segmentation faults occurring consistently when running even the simplest MPI test program from a multi-core Slurm job (i.e. under cgroups) with Intel MPI 2019 Update 6.
Environment: a 4-core cgroup on a dual-socket system with 32 cores in total:
$ taskset -p $$
pid 12400's current affinity mask: 55
$ mpirun -np 1 hostname
/software/impi/2019.6.166-iccifort-2020.0.166/intel64/bin/mpirun: line 103: 13127 Segmentation fault mpiexec.hydra "$@" 0<&0
$ mpiexec -np 1 hostname
Segmentation fault
Here's a backtrace from GDB using a simple MPI test program (test/test.c from the Intel MPI installation):
(gdb) run -np 1 ./mpitest
Starting program: /software/impi/2019.6.166-iccifort-2020.0.166/intel64/bin/mpiexec -np 1 ./mpitest
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
ipl_detect_machine_topology () at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:1625
1625 ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64
(gdb) bt
#0 ipl_detect_machine_topology () at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:1625
#1 0x0000000000448c02 in ipl_processor_info (info=0x6dd6a0, pid=0x6, detect_platform_only=4) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:1864
#2 0x000000000044afb2 in ipl_entrance (detect_platform_only=7198368) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_main.c:24
#3 0x000000000041e722 in i_read_default_env () at ../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec_params.h:276
#4 0x000000000041bfc9 in mpiexec_get_parameters (t_argv=0x6dd6a0) at ../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1380
#5 0x00000000004049f5 in main (argc=7198368, argv=0x6) at ../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1743
The problem disappears when setting $I_MPI_HYDRA_TOPOLIB to ipl:
$ I_MPI_HYDRA_TOPOLIB=ipl mpiexec -np 1 ./mpitest
Hello world: rank 0 of 1 running on node1234.example