
mpirun error running in a cpuset

I've been getting errors using mpirun within a cpuset (regardless of whether the cset shield is activated or not).

cset set -lr
cset:
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root      0-431 y    0-11 y  4956    2 /
         user     24-431 n    1-11 n     0    0 /user
       system       0-23 n       0 n     0    0 /system

which mpirun
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/mpirun

cset proc --move -p $$ /
mpirun -np 10 ./wrf.exe      # WORKS PROPERLY

cset proc --move -p $$ /system
mpirun -np 10 ./wrf.exe      # WORKS PROPERLY

cset proc --move -p $$ /user
mpirun -np 10 ./wrf.exe      # ERROR!
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/mpirun: line 103: 343504 Segmentation fault      (core dumped) mpiexec.hydra "$@" 0<&0

The error also happens this way:

cset proc --exec -s /user mpirun -- -np 10 ./wrf.exe

The fact that the error happens only in the /user cpuset is quite strange, isn't it? After all, cpuset /user doesn't differ much from cpuset /system, where mpirun works properly! The error happens whatever the -np value is, and also without the -np flag.

Can anybody help me?

Thanks from Italy,
Emanuele Lombardi

ifort (IFORT) 19.1.0.166 20191121
Intel(R) MPI Library for Linux* OS, Version 2019 Update 6 Build 20191024 (id: 082ae5608)
SLES15SP1

HP Superdome Flex (ex SGI UV) topology:
System type: Superdome Flex
System name: tiziano
Serial number: CZ20040JWV
12 Blades
432 CPUs (online: 0-431)
12 Nodes
2230 GB Memory Total
1 Co-processor
2 Fibre Channel Controllers
4 Network Controllers
1 SATA Storage Controller
1 USB Controller
1 VGA GPU
2 RAID Controllers

BTW, I had the same error in 2013, as you can see from https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/392814#
PrasanthD_intel
Moderator

Hi Emanuele,

We have tried and were able to run mpirun in the cpuset successfully.

Could you provide the exact commands you used to create the cpusets? This will help us replicate the scenario at our end.

Please also provide the log output after setting I_MPI_DEBUG=5.

If there is any other information you have please share.

 

Thanks

Prasanth

PrasanthD_intel
Moderator

Hi Emanuele,

Could you provide us the log report after setting I_MPI_DEBUG=5?

This will help us in understanding the error better.

Regards

Prasanth


Dear Prasanth

I didn't mention that the system was booted with hyperthreading enabled, but to verify that this wasn't the problem, I rebooted without HT.
Unfortunately nothing changes.

To simplify the problem, I compiled and ran a very small program taken from
https://people.sc.fsu.edu/~jburkardt/f_src/hello_mpi/hello_mpi.f90
mpiifort hello_mpi.f90

The following are the commands I give to create my cpusets.
I also tried them without the --mem_exclusive and --cpu_exclusive switches, but nothing changes.

cset set -lr
cset:
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root      0-431 y    0-11 y  4965    0 /

cset set -c 0-35 -m 0 --mem_exclusive --cpu_exclusive -s system
cset set -c 36-431 -m 1-11 --mem_exclusive --cpu_exclusive -s user

cset set -lr
cset:
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root      0-431 y    0-11 y  4921    2 /
         user     36-431 y    1-11 y     0    0 /user
       system       0-35 y       0 y     0    0 /system



export FI_PROVIDER=sockets
export I_MPI_DEBUG=5

cset proc --move -p $$ /system
mpirun -np 2 ./a.out
[0] MPI startup(): libfabric version: 1.9.0a1-impi
[0] MPI startup(): libfabric provider: sockets
P1  "Hello, world!"
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       145005   tiziano   {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17}
[0] MPI startup(): 1       145006   tiziano   {18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35}
[0] MPI startup(): I_MPI_ROOT=/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=5
26 May 2020   9:42:01.341 AM

P0  HELLO_MPI - Master process:
P0    FORTRAN90/MPI version
P0    An MPI test program.
P0    The number of MPI processes is        2
P0  "Hello, world!"

P0  HELLO_MPI - Master process:
P0    Normal end of execution: "Goodbye, world!".

P0    Elapsed wall clock time =   0.253995E-03 seconds.

P0  HELLO_MPI - Master process:
P0    Normal end of execution.

26 May 2020   9:42:01.342 AM



cset proc --move -p $$ /user
mpirun -np 2 ./a.out
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/mpirun: line 103: 145410 Segmentation fault      (core dumped) mpiexec.hydra "$@" 0<&0

which mpirun
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/mpirun

bash -x /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/mpirun -np 2 ./a.out
+ tempdir=/tmp
+ '[' -n '' ']'
+ '[' -n '' ']'
+ np_boot=
++ whoami
+ username=root
+ rc=0
++ uname -m
++ grep 1om
+ '[' -z /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi -a -z '' ']'
+ export I_MPI_MPIRUN=mpirun
+ I_MPI_MPIRUN=mpirun
+ '[' -n '' -a -z '' ']'
+ '[' -n '' -a -z '' ']'
+ '[' -z '' -a -z '' ']'
+ '[' -n '' ']'
+ '[' -n '' ']'
+ '[' -n '' ']'
+ '[' -n '' -a -n '' -a -n '' ']'
+ '[' -n '' -o -n '' ']'
+ '[' x = xyes -o x = xenable -o x = xon -o x = x1 ']'
+ '[' -n '' ']'
+ mpiexec.hydra -np 2 ./a.out
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/mpirun: line 103: 145555 Segmentation fault      (core dumped) mpiexec.hydra "$@" 0<&0
+ rc=139
+ cleanup=0
+ echo -np 2 ./a.out
+ grep '\-cleanup'
+ '[' 1 -eq 0 ']'
+ '[' -n '' ']'
+ '[' 0 -eq 1 ']'
+ exit 139


which mpiexec.hydra
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/mpiexec.hydra

 

PrasanthD_intel
Moderator

Hi Emanuele,

We tried reproducing this at our end but did not encounter any such errors.

We are transferring this issue to the concerned team.

Thanks

Prasanth

jimdempseyatthecove
Black Belt

In the meantime, consider writing a script that pipes "cset set -lr" through grep to obtain the set of interest (e.g. user), extracts the logical CPU number range(s)/list(s), and then sets the environment variable I_MPI_PIN_PROCESSOR_LIST to conform to the selected cset.

See: https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top...

Jim Dempsey

Michael_S
Employee

Hi Emanuele,

You may try our internal system topology recognition via I_MPI_HYDRA_TOPOLIB=ipl.

Otherwise, I'd recommend waiting for IMPI 2019 Update 8, which is scheduled for mid-July, since we addressed several topology-recognition issues in that build.

Please let me know whether Update 8 addresses your issue once you've had a chance to try it.

Best regards,

Michael


jimdempseyatthecove (Blackbelt) wrote:

In the meantime, consider writing a script that pipes "cset set -lr" through grep to obtain the set of interest (e.g. user), extracts the logical CPU number range(s)/list(s), and then sets the environment variable I_MPI_PIN_PROCESSOR_LIST to conform to the selected cset.

See: https://software.intel.com/content/www/us/en/develop/documentation/mpi-d...

Jim Dempsey

Thank you Jim, it works!
cset set -lr
cset: 
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root      0-215 y    0-11 y  2688    2 /
         user     36-215 y    1-11 y     2    0 /user
       system       0-35 y       0 y     0    0 /system

cset proc --move -p $$ /user

export I_MPI_PIN_PROCESSOR_LIST=36-179
mpirun -np 2 ./a.out 
P         1  "Hello, world!"
 6 June 2020  11:10:31.607 AM

P         0  HELLO_MPI - Master process:
P         0    FORTRAN90/MPI version
P         0    An MPI test program.
P         0    The number of MPI processes is        2
P         0  "Hello, world!"

P         0  HELLO_MPI - Master process:
P         0    Normal end of execution: "Goodbye, world!".

P         0    Elapsed wall clock time =   0.202304E-03 seconds.

P         0  HELLO_MPI - Master process:
P         0    Normal end of execution.

 6 June 2020  11:10:31.608 AM

It seems that when I_MPI_PIN_PROCESSOR_LIST contains more than 144 CPUs, a warning is issued (144 is a strange number for my topology):

IPL WARN> ipl_pin_list_direct syntax error, 36-180 list member should be -1, single CPU number, or CPU number range

export I_MPI_PIN_PROCESSOR_LIST=36-180
mpirun -np 2 ./a.out 
P         1  "Hello, world!"
 6 June 2020  11:12:26.181 AM

P         0  HELLO_MPI - Master process:
P         0    FORTRAN90/MPI version
P         0    An MPI test program.
P         0    The number of MPI processes is        2
P         0  "Hello, world!"

P         0  HELLO_MPI - Master process:
P         0    Normal end of execution: "Goodbye, world!".

P         0    Elapsed wall clock time =   0.244736E-03 seconds.

P         0  HELLO_MPI - Master process:
P         0    Normal end of execution.

 6 June 2020  11:12:26.181 AM
IPL WARN> ipl_pin_list_direct syntax error, 36-180 list member should be -1, single CPU number, or CPU number range

 


Michael (Intel) wrote:

Hi Emanuele,

You may try our internal system topology recognition via I_MPI_HYDRA_TOPOLIB=ipl.

Otherwise, I'd recommend waiting for IMPI 2019 Update 8, which is scheduled for mid-July, since we addressed several topology-recognition issues in that build.

Please let me know whether Update 8 addresses your issue once you've had a chance to try it.

Best regards,

Michael

Thank you Michael,
export I_MPI_HYDRA_TOPOLIB=ipl
works with up to 9 ranks; from 10 on, it fails with the errors listed below.
I'll wait for Update 8 and I'll let you know.

mpirun -np 10 ./a.out 
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPL_backtrace_show+0x34) [0x7f0605da11d4]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f0605529031]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x44c505) [0x7f060586a505]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x8ed86a) [0x7f0605d0b86a]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x64cd70) [0x7f0605a6ad70]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x1fe5fa) [0x7f060561c5fa]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x4664b4) [0x7f06058844b4]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPI_Init+0x11b) [0x7f060587fc7b]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi
Abort(1) on node 9: Internal error
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPL_backtrace_show+0x34) [0x7f39a5e391d4]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f39a55c1031]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x44c505) [0x7f39a5902505]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x8ed86a) [0x7f39a5da386a]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x64cd70) [0x7f39a5b02d70]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x1fe5fa) [0x7f39a56b45fa]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x4664b4) [0x7f39a591c4b4]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPI_Init+0x11b) [0x7f39a5917c7b]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi
Abort(1) on node 8: Internal error

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 212868 RUNNING AT tiziano
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 212869 RUNNING AT tiziano
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 212870 RUNNING AT tiziano
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 3 PID 212871 RUNNING AT tiziano
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 4 PID 212872 RUNNING AT tiziano
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 5 PID 212873 RUNNING AT tiziano
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 6 PID 212874 RUNNING AT tiziano
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 7 PID 212875 RUNNING AT tiziano
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================
 
