Hello,
I am trying to use rr (record and replay) in a larger MPI project, where I need to track down some challenging bugs in parallel execution.
The project uses Intel MPI, and I have used rr successfully before - when running only one MPI process. This time I want to record each node individually and replay only some of them, roughly as sketched below.
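Just to illustrate the idea (a sketch only - I am assuming here that Intel MPI exports PMI_RANK to every launched process, and that my rr version supports --output-trace-dir):
#!/bin/bash
# record_rank.sh - hypothetical per-rank wrapper: each rank records into
# its own trace directory, named after the PMI rank of the process.
exec rr record --output-trace-dir "trace-rank-${PMI_RANK:-unknown}" "$@"
launched as
$ mpirun -np 2 ./record_rank.sh ./mpi_hello
so that afterwards I could replay a single rank with rr replay trace-rank-0.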
I already opened a bug report at the rr project and at MPICH, and I can now use rr successfully with OpenMPI and MPICH, but not with Intel MPI.
I have this simple MPI test program:
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::cout << "Hello from rank " << rank << " of " << size << std::endl;
    MPI_Finalize();
    return 0;
}
which I compiled with:
$ mpicxx -O0 -g -std=c++17 hello.cpp -o mpi_hello
To disable SHM I tried:
$ export I_MPI_FABRICS=ofi
$ export FI_PROVIDER=sockets
$ export I_MPI_OFI_PROVIDER=sockets
$ export FI_SOCKET_IFACE=eth0
$ export FI_SOCKETS_IFACE=eth0
$ export I_MPI_SHM=off
These settings are documented here. Since the documentation also mentions FI_SOCKET_IFACE (not FI_SOCKETS_IFACE), I simply set both variants. And according to the document here, setting I_MPI_SHM to off should be enough to disable SHM completely.
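As a sanity check for which transport actually gets selected, Intel MPI's startup diagnostics should help (my understanding is that I_MPI_DEBUG=5 prints the chosen libfabric provider in its "MPI startup()" lines):
$ I_MPI_DEBUG=5 mpirun -np 2 ./mpi_hello 2>&1 | grep "MPI startup"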
Replaying node 0 with rr like:
$ mpirun \
-np 2 \
rr ./mpi_hello
gets stuck somewhere in MPI_Init(). If I attach gdb right before the hang, a backtrace reveals:
#0 0x00007fd0b609318e in ?? () from /usr/lib/libc.so.6
#1 0x00007fd0b60931b4 in ?? () from /usr/lib/libc.so.6
#2 0x00007fd0b610da2e in read () from /usr/lib/libc.so.6
#3 0x00007fd0b6f755c0 in PMIU_readline (fd=9, buf=buf@entry=0x7ffeeaa663d0 "4", maxlen=maxlen@entry=4096) at ../../src/pmi/simple/simple_pmiutil.c:142
#4 0x00007fd0b6f73cce in GetResponse (request=0x7fd0b747a58e "cmd=barrier_in\n", expectedCmd=0x7fd0b747a59e "barrier_out", checkRc=0) at ../../src/pmi/simple/simple_pmi.c:1095
#5 0x00007fd0b6e78e94 in MPIR_pmi_barrier () at ../../src/util/mpir_pmi.c:517
#6 0x00007fd0b6e7b343 in optional_bcast_barrier (domain=<optimized out>) at ../../src/util/mpir_pmi.c:1307
#7 MPIR_pmi_bcast (buf=0x7f200022c300, bufsize=bufsize@entry=25, domain=domain@entry=MPIR_PMI_DOMAIN_LOCAL) at ../../src/util/mpir_pmi.c:1373
#8 0x00007fd0b6e13ec9 in MPIDU_Init_shm_init () at ../../src/mpid/common/shm/mpidu_init_shm.c:179
#9 0x00007fd0b6b6ff0d in MPID_Init (requested=<optimized out>, provided=provided@entry=0x7fd0c00fe198 <MPIR_ThreadInfo>) at ../../src/mpid/ch4/src/ch4_init.c:1712
#10 0x00007fd0b6e75641 in MPII_Init_thread (argc=argc@entry=0x7ffeeaa67b4c, argv=argv@entry=0x7ffeeaa67b40, user_required=user_required@entry=0, provided=provided@entry=0x7ffeeaa67af4,
p_session_ptr=p_session_ptr@entry=0x0) at ../../src/mpi/init/intel/initthread.h:117
#11 0x00007fd0b6e754ae in MPIR_Init_impl (argc=0x7ffeeaa67b4c, argv=0x7ffeeaa67b40) at ../../src/mpi/init/mpir_init.c:143
#12 0x00007fd0b69d29b0 in internal_Init (argc=0x9, argv=0x7fd0c08498f0) at ../../src/binding/c/c_binding.c:43336
#13 PMPI_Init (argc=0x9, argv=0x7fd0c08498f0) at ../../src/binding/c/c_binding.c:43362
#14 0x000055aff54e920c in main (argc=1, argv=0x7ffeeaa67c88) at hello.cpp:13
I faced similar issues with MPICH (see again my bug report there) and could fix the problem by using OFI with the sockets provider and the MPICH environment variable
$ export MPIR_CVAR_NOLOCAL=1
I also skimmed through Intel's available environment variables:
$ impi_info -a -e
and tried a lot of different settings - but nothing works: MPI_Init() always seems to use SHM, which breaks my rr replay session.
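For what it's worth, this is how I would try to confirm that the ranks still touch shared memory during MPI_Init() (again just a sketch; tracing openat should catch files created under /dev/shm):
$ mpirun -np 2 strace -f -e trace=openat ./mpi_hello 2>&1 | grep /dev/shm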
Can anybody here help me with disabling SHM properly?
Thanks a lot and best regards,
Mathias