Hi
I have come across a bug in Intel MPI when testing in a Docker container with no NUMA support. It appears that the case of no NUMA support is not handled correctly. More details are below.
Thanks
Jamil
$ icc --version
icc (ICC) 17.0.6 20171215
$ gcc --version
gcc (GCC) 5.3.1 20160406 (Red Hat 5.3.1-6)
$ uname -a
Linux centos7dev 4.9.60-linuxkit-aufs #1 SMP Mon Nov 6 16:00:12 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
bug.c:
#include "mpi.h"

int main (int argc, char *argv[])
{
    /* Minimal reproducer: the crash happens inside MPI_Init. */
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}
$ I_MPI_CC=gcc mpicc -g bug.c -o bug
$ gdb ./bug
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7b64f45 in __I_MPI___intel_sse2_strtok () from /opt/intel/compilers_and_libraries_2017.6.256/linux/mpi/intel64/lib/libmpifort.so.12
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5-16.el7_4.2.x86_64 numactl-devel-2.0.9-6.el7_2.x86_64
(gdb) bt
#0 0x00007ffff7b64f45 in __I_MPI___intel_sse2_strtok () from /opt/intel/compilers_and_libraries_2017.6.256/linux/mpi/intel64/lib/libmpifort.so.12
#1 0x00007ffff70acab1 in MPID_nem_impi_create_numa_nodes_map () at ../../src/mpid/ch3/src/mpid_init.c:1355
#2 0x00007ffff70ad994 in MPID_Init (argc=0x1, argv=0x7ffff72a2268, requested=-148233624, provided=0x1, has_args=0x0, has_env=0x0)
at ../../src/mpid/ch3/src/mpid_init.c:1733
#3 0x00007ffff7043ebb in MPIR_Init_thread (argc=0x1, argv=0x7ffff72a2268, required=-148233624, provided=0x1) at ../../src/mpi/init/initthread.c:717
#4 0x00007ffff70315bb in PMPI_Init (argc=0x1, argv=0x7ffff72a2268) at ../../src/mpi/init/init.c:253
#5 0x00000000004007e8 in main (argc=1, argv=0x7fffffffcd58) at bug.c:6
Hi Jamil,
Could you post the output of the following command?
$ numactl -H
Dmitry
Hi Dmitry,
In a CentOS 6 Docker container:
> numactl -H
available: 0 nodes ()
libnuma: Warning: Cannot parse distance information in sysfs: No such file or directory
No distance information available.
In CentOS 7 the output is: numa is not supported on this system
Jamil
Hi Dmitry, Jamil,
Just ran into the same issue today. The system I compile on has no NUMA support:
$ numactl --show
physcpubind: 0 1 2 3 4 5 6 7
No NUMA support available on this system.
When running a program only containing MPI_Init, I similarly get the segfault:
[0,1] (mpigdb) run
[0,1] Continuing.
[0,1]
[0,1] Program received signal SIGSEGV, Segmentation fault.
[0] 0x00007f4876dc0805 in __I_MPI___intel_sse2_strtok ()
[1] 0x00007fe059bc0805 in __I_MPI___intel_sse2_strtok ()
[0,1] from /opt/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/lib/libmpi.so.12
[0,1] (mpigdb) bt
[0] #0 0x00007f4876dc0805 in __I_MPI___intel_sse2_strtok ()
[1] #0 0x00007fe059bc0805 in __I_MPI___intel_sse2_strtok ()
[0,1] from /opt/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/lib/libmpi.so.12
[0] #1 0x00007f4876c7ce91 in MPID_nem_impi_create_numa_nodes_map ()
[1] #1 0x00007fe059a7ce91 in MPID_nem_impi_create_numa_nodes_map ()
[0,1] at ../../src/mpid/ch3/src/mpid_init.c:1355
[0] #2 0x00007f4876c7dd74 in MPID_Init (argc=0x1, argv=0x7f4876e2bdb4,
[1] #2 0x00007fe059a7dd74 in MPID_Init (argc=0x1, argv=0x7fe059c2bdb4,
[0] requested=1994571188, provided=0x1, has_args=0x0, has_env=0x2)
[1] requested=1505934772, provided=0x1, has_args=0x0, has_env=0x2)
[0,1] at ../../src/mpid/ch3/src/mpid_init.c:1760
[0] #3 0x00007f4876c1eaeb in MPIR_Init_thread (argc=0x1, argv=0x7f4876e2bdb4,
[1] #3 0x00007fe059a1eaeb in MPIR_Init_thread (argc=0x1, argv=0x7fe059c2bdb4,
[0] required=1994571188, provided=0x1) at ../../src/mpi/init/initthread.c:717
[1] required=1505934772, provided=0x1) at ../../src/mpi/init/initthread.c:717
[0] #4 0x00007f4876c0c07b in PMPI_Init (argc=0x1, argv=0x7f4876e2bdb4)
[1] #4 0x00007fe059a0c07b in PMPI_Init (argc=0x1, argv=0x7fe059c2bdb4)
[0,1] at ../../src/mpi/init/init.c:253
Is there any way to manually disable NUMA during compilation?
Kind regards,
Mick
Any updates on this? We are seeing the same issue under WSL (used for local testing).
I'm also encountering this on an HPC system using a single node:
[0] Program received signal SIGSEGV, Segmentation fault.
[0] 0x00007ffff60c3636 in strtok_r () from /lib64/libc.so.6
[0] (mpigdb) bt
[0] #0 0x00007ffff60c3636 in strtok_r () from /lib64/libc.so.6
[0] #1 0x00007ffff7b58561 in __I_MPI___intel_sse2_strtok ()
[0] from /impi/2018.5.288-iccifortcuda-2019b/lib/libmpifort.so.12
[0] #2 0x00007ffff70e81f5 in MPID_nem_impi_create_numa_nodes_map ()
[0] at ../../src/mpid/ch3/src/mpid_init.c:459
[0] #3 0x00007ffff70eaf86 in MPID_Init (argc=0x0, argv=0x7ffff7299254,
[0] requested=-163687992, provided=0x0, has_args=0x41b210, has_env=0x407012)
[0] at ../../src/mpid/ch3/src/mpid_init.c:1771
[0] #4 0x00007ffff708bbc3 in MPIR_Init_thread (argc=0x0, argv=0x7ffff7299254,
[0] required=-163687992, provided=0x0) at ../../src/mpi/init/initthread.c:717
[0] #5 0x00007ffff707913b in PMPI_Init (argc=0x0, argv=0x7ffff7299254)
[0] at ../../src/mpi/init/init.c:253
[0] #6 0x00000000004012be in main ()
$ numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 104 105 106 107 108 109 110 111 112 113 114 115 116
node 0 size: 64078 MB
node 0 free: 59732 MB
node 1 cpus: 13 14 15 16 17 18 19 20 21 22 23 24 25 117 118 119 120 121 122 123 124 125 126 127 128 129
node 1 size: 64505 MB
node 1 free: 63486 MB
node 2 cpus: 26 27 28 29 30 31 32 33 34 35 36 37 38 130 131 132 133 134 135 136 137 138 139 140 141 142
node 2 size: 64505 MB
node 2 free: 61897 MB
node 3 cpus: 39 40 41 42 43 44 45 46 47 48 49 50 51 143 144 145 146 147 148 149 150 151 152 153 154 155
node 3 size: 64505 MB
node 3 free: 63400 MB
node 4 cpus: 52 53 54 55 56 57 58 59 60 61 62 63 64 156 157 158 159 160 161 162 163 164 165 166 167 168
node 4 size: 64505 MB
node 4 free: 62791 MB
node 5 cpus: 65 66 67 68 69 70 71 72 73 74 75 76 77 169 170 171 172 173 174 175 176 177 178 179 180 181
node 5 size: 64505 MB
node 5 free: 63732 MB
node 6 cpus: 78 79 80 81 82 83 84 85 86 87 88 89 90 182 183 184 185 186 187 188 189 190 191 192 193 194
node 6 size: 64505 MB
node 6 free: 63578 MB
node 7 cpus: 91 92 93 94 95 96 97 98 99 100 101 102 103 195 196 197 198 199 200 201 202 203 204 205 206 207
node 7 size: 64436 MB
node 7 free: 63722 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 12 12 12 21 21 21 21
1: 12 10 12 12 21 21 21 21
2: 12 12 10 12 21 21 21 21
3: 12 12 12 10 21 21 21 21
4: 21 21 21 21 10 12 12 12
5: 21 21 21 21 12 10 12 12
6: 21 21 21 21 12 12 10 12
7: 21 21 21 21 12 12 12 10
Any solution here?
I looked further into the calls to `strtok_r` and found that a NULL pointer is passed in what appears to be the 5th call. That explains the segmentation fault, but not why the NULL pointer is passed in the first place.
@Alexander_G
Please do not post to such old threads; create a new one instead.
Please also include the information required to reproduce the problem, including software and hardware details.
