- Tags:
- Cluster Computing
- General Support
- Intel® Cluster Ready
- Message Passing Interface (MPI)
- Parallel Computing
Hi Emanuele,
We tried running mpirun inside a cpuset and it completed successfully.
Could you provide the exact commands you used to create the cpusets? This will help us replicate the scenario on our end.
Please also provide the log output after setting I_MPI_DEBUG=5.
If you have any other information, please share it.
Thanks
Prasanth
Hi Emanuele,
Could you provide the log output after setting I_MPI_DEBUG=5?
This will help us understand the error better.
Regards
Prasanth
Dear Prasanth,
I didn't mention that the system was booted with hyperthreading enabled, but to rule that out as the cause I rebooted without HT.
Unfortunately nothing changed.
To simplify the problem I compiled and ran a very small program taken from
https://people.sc.fsu.edu/~jburkardt/f_src/hello_mpi/hello_mpi.f90
mpiifort hello_mpi.f90
The following are the commands I use to create my cpusets.
I also tried them without the --mem_exclusive and --cpu_exclusive switches, but nothing changed.
cset set -lr
cset:
Name CPUs-X MEMs-X Tasks Subs Path
------------ ---------- - ------- - ----- ---- ----------
root 0-431 y 0-11 y 4965 0 /
cset set -c 0-35 -m 0 --mem_exclusive --cpu_exclusive -s system
cset set -c 36-431 -m 1-11 --mem_exclusive --cpu_exclusive -s user
cset set -lr
cset:
Name CPUs-X MEMs-X Tasks Subs Path
------------ ---------- - ------- - ----- ---- ----------
root 0-431 y 0-11 y 4921 2 /
user 36-431 y 1-11 y 0 0 /user
system 0-35 y 0 y 0 0 /system
export FI_PROVIDER=sockets
export I_MPI_DEBUG=5
cset proc --move -p $$ /system
mpirun -np 2 ./a.out
[0] MPI startup(): libfabric version: 1.9.0a1-impi
[0] MPI startup(): libfabric provider: sockets
P1 "Hello, world!"
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 145005 tiziano {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17}
[0] MPI startup(): 1 145006 tiziano {18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35}
[0] MPI startup(): I_MPI_ROOT=/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=5
26 May 2020 9:42:01.341 AM
P0 HELLO_MPI - Master process:
P0 FORTRAN90/MPI version
P0 An MPI test program.
P0 The number of MPI processes is 2
P0 "Hello, world!"
P0 HELLO_MPI - Master process:
P0 Normal end of execution: "Goodbye, world!".
P0 Elapsed wall clock time = 0.253995E-03 seconds.
P0 HELLO_MPI - Master process:
P0 Normal end of execution.
26 May 2020 9:42:01.342 AM
cset proc --move -p $$ /user
mpirun -np 2 ./a.out
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/mpirun: line 103: 145410 Segmentation fault (core dumped) mpiexec.hydra "$@" 0<&0
which mpirun
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/mpirun
bash -x /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/mpirun -np 2 ./a.out
+ tempdir=/tmp
+ '[' -n '' ']'
+ '[' -n '' ']'
+ np_boot=
++ whoami
+ username=root
+ rc=0
++ uname -m
++ grep 1om
+ '[' -z /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi -a -z '' ']'
+ export I_MPI_MPIRUN=mpirun
+ I_MPI_MPIRUN=mpirun
+ '[' -n '' -a -z '' ']'
+ '[' -n '' -a -z '' ']'
+ '[' -z '' -a -z '' ']'
+ '[' -n '' ']'
+ '[' -n '' ']'
+ '[' -n '' ']'
+ '[' -n '' -a -n '' -a -n '' ']'
+ '[' -n '' -o -n '' ']'
+ '[' x = xyes -o x = xenable -o x = xon -o x = x1 ']'
+ '[' -n '' ']'
+ mpiexec.hydra -np 2 ./a.out
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/mpirun: line 103: 145555 Segmentation fault (core dumped) mpiexec.hydra "$@" 0<&0
+ rc=139
+ cleanup=0
+ echo -np 2 ./a.out
+ grep '\-cleanup'
+ '[' 1 -eq 0 ']'
+ '[' -n '' ']'
+ '[' 0 -eq 1 ']'
+ exit 139
which mpiexec.hydra
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/mpiexec.hydra
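For completeness, one way to capture a backtrace from the crashing launcher (a sketch: it assumes gdb is installed and that the core file lands in the working directory as "core"; the actual name depends on /proc/sys/kernel/core_pattern):
ulimit -c unlimited          # allow core dumps in this shell
mpiexec.hydra -np 2 ./a.out  # reproduce the segfault
gdb -batch -ex bt $(which mpiexec.hydra) core   # print the backtrace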
Hi Emanuele,
We tried reproducing the issue on our end but did not encounter any such errors.
We are transferring this issue to the relevant team.
Thanks
Prasanth
In the meantime, consider writing a script that pipes "cset set -lr" through grep (or awk) to find the set of interest (e.g. user), extracts the logical CPU number range(s)/list(s), and then sets the environment variable I_MPI_PIN_PROCESSOR_LIST to match the selected cset.
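For instance, something along these lines (an untested sketch; it assumes the cpuset is named /user and that awk is available):
# Extract the CPUs column of the /user cpuset from the cset listing
CPUS=$(cset set -lr | awk '$NF == "/user" {print $2}')
# Pin Intel MPI ranks to exactly those logical CPUs
[ -n "$CPUS" ] && export I_MPI_PIN_PROCESSOR_LIST="$CPUS"
mpirun -np 2 ./a.out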
Jim Dempsey
Hi Emanuele,
You may try our internal system topology recognition via I_MPI_HYDRA_TOPOLIB=ipl.
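For example (a quick sketch, reusing the test binary from above):
export I_MPI_HYDRA_TOPOLIB=ipl   # use the IPL topology backend instead of hwloc
mpirun -np 2 ./a.out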
Otherwise I'd recommend waiting for IMPI 2019 Update 8, scheduled for mid-July, since we addressed several topology-recognition issues in that build.
Please let me know whether Update 8 addresses your issue once you've had a chance to try it.
Best regards,
Michael
jimdempseyatthecove (Blackbelt) wrote: In the meantime, consider writing a script that pipes "cset set -lr" through grep (or awk) to find the set of interest (e.g. user), extracts the logical CPU number range(s)/list(s), and then sets the environment variable I_MPI_PIN_PROCESSOR_LIST to match the selected cset.
See: https://software.intel.com/content/www/us/en/develop/documentation/mpi-d...
Jim Dempsey
Thank you Jim, it works!
cset set -lr
cset:
Name CPUs-X MEMs-X Tasks Subs Path
------------ ---------- - ------- - ----- ---- ----------
root 0-215 y 0-11 y 2688 2 /
user 36-215 y 1-11 y 2 0 /user
system 0-35 y 0 y 0 0 /system
cset proc --move -p $$ /user
export I_MPI_PIN_PROCESSOR_LIST=36-179
mpirun -np 2 ./a.out
P 1 "Hello, world!"
6 June 2020 11:10:31.607 AM
P 0 HELLO_MPI - Master process:
P 0 FORTRAN90/MPI version
P 0 An MPI test program.
P 0 The number of MPI processes is 2
P 0 "Hello, world!"
P 0 HELLO_MPI - Master process:
P 0 Normal end of execution: "Goodbye, world!".
P 0 Elapsed wall clock time = 0.202304E-03 seconds.
P 0 HELLO_MPI - Master process:
P 0 Normal end of execution.
6 June 2020 11:10:31.608 AM
It seems that a warning is issued when I_MPI_PIN_PROCESSOR_LIST contains more than 144 CPUs; the working range 36-179 above is exactly 144 CPUs (144 is a strange number for my topology):
IPL WARN> ipl_pin_list_direct syntax error, 36-180 list member should be -1, single CPU number, or CPU number range
export I_MPI_PIN_PROCESSOR_LIST=36-180
mpirun -np 2 ./a.out
P 1 "Hello, world!"
6 June 2020 11:12:26.181 AM
P 0 HELLO_MPI - Master process:
P 0 FORTRAN90/MPI version
P 0 An MPI test program.
P 0 The number of MPI processes is 2
P 0 "Hello, world!"
P 0 HELLO_MPI - Master process:
P 0 Normal end of execution: "Goodbye, world!".
P 0 Elapsed wall clock time = 0.244736E-03 seconds.
P 0 HELLO_MPI - Master process:
P 0 Normal end of execution.
6 June 2020 11:12:26.181 AM
IPL WARN> ipl_pin_list_direct syntax error, 36-180 list member should be -1, single CPU number, or CPU number range
Michael (Intel) wrote: Hi Emanuele,
You may try our internal system topology recognition via I_MPI_HYDRA_TOPOLIB=ipl.
Otherwise I'd recommend waiting for IMPI 2019 Update 8, scheduled for mid-July, since we addressed several topology-recognition issues in that build.
Please let me know whether Update 8 addresses your issue once you've had a chance to try it.
Best regards,
Michael
Thank you Michael,
export I_MPI_HYDRA_TOPOLIB=ipl
works for up to 9 processes; from 10 onward it fails with the errors listed below.
I'll wait for Update 8 and let you know.
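For reference, the threshold can be bracketed with a quick loop (a rough sketch, rerunning the same ./a.out):
for np in $(seq 2 12); do
    # report only the process counts that fail
    mpirun -np "$np" ./a.out > /dev/null 2>&1 || echo "np=$np failed"
done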
mpirun -np 10 ./a.out
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
Assertion failed in file ../../src/util/intel/shm_heap/impi_shm_heap.c at line 917: group_id < group_num
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPL_backtrace_show+0x34) [0x7f0605da11d4]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f0605529031]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x44c505) [0x7f060586a505]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x8ed86a) [0x7f0605d0b86a]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x64cd70) [0x7f0605a6ad70]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x1fe5fa) [0x7f060561c5fa]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x4664b4) [0x7f06058844b4]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPI_Init+0x11b) [0x7f060587fc7b]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi
Abort(1) on node 9: Internal error
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPL_backtrace_show+0x34) [0x7f39a5e391d4]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f39a55c1031]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x44c505) [0x7f39a5902505]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x8ed86a) [0x7f39a5da386a]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x64cd70) [0x7f39a5b02d70]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x1fe5fa) [0x7f39a56b45fa]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x4664b4) [0x7f39a591c4b4]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPI_Init+0x11b) [0x7f39a5917c7b]
/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi
Abort(1) on node 8: Internal error
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 212868 RUNNING AT tiziano
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 212869 RUNNING AT tiziano
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 212870 RUNNING AT tiziano
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 212871 RUNNING AT tiziano
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 4 PID 212872 RUNNING AT tiziano
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 5 PID 212873 RUNNING AT tiziano
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 6 PID 212874 RUNNING AT tiziano
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 212875 RUNNING AT tiziano
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================