Canvas314
Beginner
272 Views

Intel MPI 2018 - cannot create file under /tmp in MPI_Init_thread() for shared-memory (shm) communication

Hi forum users,

I have an internal MPI application, "a.out". It fails to launch under MPI when two MPI processes are created on the same machine. The machine runs CentOS 7.4.

intel/mpirun -np 2 -machinefile machineFile ./a.out

The error message points to the creation of some file under the "/tmp" directory, for communications between the processes. Intel MPI ran well if the processes were launched on two separate machines.

I suspected the culprit could be some variables in the application, which corrupted the global namespace. I wonder if Intel MPI provides any debug mechanism to troubleshoot this problem. Any suggestions on the potential cause of this error are also welcome.

Thank you for the attention.

Canvas314

[mpiexec@xxx] [pgid: 0] got PMI command: cmd=abort exitcode=70826255
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(805).................: fail failed
MPID_Init(1859).......................: channel initialization failed
MPIDI_CH3_Init(126)...................: fail failed
MPID_nem_init_ckpt(857)...............: fail failed
MPIDI_CH3I_Seg_commit(355)............: fail failed
MPIU_SHMW_Seg_create_and_attach(953)..: fail failed
MPIU_SHMW_Seg_create_attach_templ(620): lseek failed - Illegal seek


8 Replies
PrasanthD_intel
Moderator
217 Views

Hi Kevin,

 

If you are launching all ranks on the same node, there is no need to use a machinefile.

You can check the correctness of the code using Intel Trace Analyzer and Collector (ITAC).

source <itac_installdir>/bin/itacvars.sh

mpirun -np <N> -check_mpi ./<executable>

 

Also, you can enable debug output with the I_MPI_DEBUG environment variable.

Please send us the logs produced with both of these options enabled; they will help us find the cause.

Command: I_MPI_DEBUG=10 mpirun -np <N> -check_mpi ./<executable>

 

Please update your IMPI to the latest version as 2018 is no longer supported. You can find supported versions here (Intel® Parallel Studio XE & Intel® oneAPI Toolkits...)

 

Regards

Prasanth

Canvas314
Beginner
203 Views

Hi @PrasanthD_intel 

Only the 2018 version is currently available to me. Setting I_MPI_DEBUG and I_MPI_DEBUG_OUTPUT produced the following:

[0] MPI startup(): Intel(R) MPI Library, Version 2018 Update 3 Build 20180411 (id: 18329)
[0] MPI startup(): Copyright (C) 2003-2018 Intel Corporation. All rights reserved.
[0] MPI startup(): Multi-threaded optimized library

 

I had access to a 2019 Trace Analyzer installation and sourced its itacvars.sh. The redacted output is shown below.

  Hydra internal environment:
  ---------------------------
    MPIR_CVAR_NEMESIS_ENABLE_CKPOINT=1
    GFORTRAN_UNBUFFERED_PRECONNECTED=y
    I_MPI_HYDRA_UUID=......
    DAPL_NETWORK_PROCESS_NUM=2

  User set environment:
  ---------------------
    I_MPI_FABRICS=shm:tcp
    I_MPI_FALLBACK_DEVICE=disable
    I_MPI_PIN=disable
    I_MPI_ADJUST_REDUCE=2
    I_MPI_ADJUST_ALLREDUCE=2
    I_MPI_ADJUST_BCAST=1
    I_MPI_PLATFORM=auto
    I_MPI_DAPL_SCALABLE_PROGRESS=1
    LD_LIBRARY_PATH= ......

  Intel(R) MPI Library specific variables:
  ----------------------------------------
    I_MPI_MPIRUN=mpirun
    I_MPI_HYDRA_DEBUG=on
    I_MPI_DEBUG_OUTPUT=intel_debug.log
    I_MPI_DEBUG=10
    I_MPI_ROOT= ......
    I_MPI_HYDRA_UUID=......
    I_MPI_FABRICS=shm:tcp
    I_MPI_FALLBACK_DEVICE=disable
    I_MPI_PIN=disable
    I_MPI_ADJUST_REDUCE=2
    I_MPI_ADJUST_ALLREDUCE=2
    I_MPI_ADJUST_BCAST=1
    I_MPI_PLATFORM=auto
    I_MPI_DAPL_SCALABLE_PROGRESS=1


    Proxy information:
    *********************
      [1] proxy: xxxxxx.com (48 cores)
      Exec list: a.out (2 processes);


==================================================================================================

[mpiexec@xxxxxx.com] Timeout set to -1 (-1 means infinite)
[mpiexec@xxxxxx.com] Got a control port string of xxxxxx.com:41373

Proxy launch args: /intel/bin/pmi_proxy --control-port xxxxxx.com:41373 --debug --pmi-connect alltoall --pmi-aggregate --preload 'libVTmc.so' -s 0 --rmk user --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 444945054 --usize -2 --proxy-id

Arguments being passed to proxy 0:

...

[mpiexec@xxxxxx.com] Launch arguments: /intel/bin/pmi_proxy --control-port xxxxxx.com:41373 --debug --pmi-connect alltoall --pmi-aggregate --preload 'libVTmc.so' -s 0 --rmk user --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 444945054 --usize -2 --proxy-id 0
[proxy:0:0@xxxxxx.com] Start PMI_proxy 0
[proxy:0:0@xxxxxx.com] STDIN will be redirected to 1 fd(s): 17
ERROR: ld.so: object 'libVTmc.so' from LD_PRELOAD cannot be preloaded: ignored.
... ...
ERROR: ld.so: object 'libVTmc.so' from LD_PRELOAD cannot be preloaded: ignored.
OS information:
Trying lsb_release:
OS information:
Trying lsb_release:
lsb_release os: CentOS
lsb_release version: 7.7.1908
lsb_release os: CentOS
lsb_release version: 7.7.1908
ERROR: ld.so: object 'libVTmc.so' from LD_PRELOAD cannot be preloaded: ignored.
... ...
ERROR: ld.so: object 'libVTmc.so' from LD_PRELOAD cannot be preloaded: ignored.
[proxy:0:0@xxxxxx.com] got pmi command (from 12): init
pmi_version=1 pmi_subversion=1
[proxy:0:0@xxxxxx.com] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=5ec95d0
[proxy:0:0@xxxxxx.com] got pmi command (from 12): get_maxes

[proxy:0:0@xxxxxx.com] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:0@xxxxxx.com] got pmi command (from 12): barrier_in

[proxy:0:0@xxxxxx.com] got pmi command (from 14): init
pmi_version=1 pmi_subversion=1
[proxy:0:0@xxxxxx.com] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=60275d0
[proxy:0:0@xxxxxx.com] got pmi command (from 14): get_maxes

[proxy:0:0@xxxxxx.com] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:0@xxxxxx.com] got pmi command (from 14): barrier_in

[proxy:0:0@xxxxxx.com] forwarding command (cmd=barrier_in) upstream
[mpiexec@xxxxxx.com] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec@xxxxxx.com] PMI response to fd 8 pid 14: cmd=barrier_out
[proxy:0:0@xxxxxx.com] PMI response: cmd=barrier_out
[proxy:0:0@xxxxxx.com] PMI response: cmd=barrier_out
[proxy:0:0@xxxxxx.com] got pmi command (from 14): get_ranks2hosts

[proxy:0:0@xxxxxx.com] PMI response: put_ranks2hosts 33 1
22 xxxxxx.com 0,1,
[proxy:0:0@xxxxxx.com] got pmi command (from 12): get_ranks2hosts

[proxy:0:0@xxxxxx.com] PMI response: put_ranks2hosts 33 1
22 xxxxxx.com 0,1,
[proxy:0:0@xxxxxx.com] got pmi command (from 14): get_appnum

[proxy:0:0@xxxxxx.com] PMI response: cmd=appnum appnum=0
[proxy:0:0@xxxxxx.com] got pmi command (from 12): get_appnum

[proxy:0:0@xxxxxx.com] PMI response: cmd=appnum appnum=0
[proxy:0:0@xxxxxx.com] got pmi command (from 14): get_my_kvsname

[proxy:0:0@xxxxxx.com] PMI response: cmd=my_kvsname kvsname=kvs_64680_0
[proxy:0:0@xxxxxx.com] got pmi command (from 12): get_my_kvsname

[proxy:0:0@xxxxxx.com] PMI response: cmd=my_kvsname kvsname=kvs_64680_0
[proxy:0:0@xxxxxx.com] got pmi command (from 14): get_my_kvsname

[proxy:0:0@xxxxxx.com] PMI response: cmd=my_kvsname kvsname=kvs_64680_0
[proxy:0:0@xxxxxx.com] got pmi command (from 12): get_my_kvsname

[proxy:0:0@xxxxxx.com] PMI response: cmd=my_kvsname kvsname=kvs_64680_0
[proxy:0:0@xxxxxx.com] got pmi command (from 14): barrier_in

[proxy:0:0@xxxxxx.com] got pmi command (from 12): abort
exitcode=70826255
[proxy:0:0@xxxxxx.com] we don't understand this command abort; forwarding upstream
[mpiexec@xxxxxx.com] [pgid: 0] got PMI command: cmd=abort exitcode=70826255
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(805).................: fail failed
MPID_Init(1859).......................: channel initialization failed
MPIDI_CH3_Init(126)...................: fail failed
MPID_nem_init_ckpt(857)...............: fail failed
MPIDI_CH3I_Seg_commit(355)............: fail failed
MPIU_SHMW_Seg_create_and_attach(953)..: fail failed
MPIU_SHMW_Seg_create_attach_templ(620): lseek failed - Illegal seek

 

Thank you.

Canvas314
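A note on the last line of the error stack: "Illegal seek" is the message text for errno ESPIPE, which lseek() reports when the file descriptor refers to a pipe, socket, or FIFO rather than a seekable file. The minimal Python sketch below only illustrates that errno on Linux; the helper name seek_errno_on_pipe is made up for this example and has nothing to do with Intel MPI internals.

```python
import errno
import os


def seek_errno_on_pipe():
    """Return the errno raised by lseek() on a pipe (a non-seekable descriptor)."""
    read_fd, write_fd = os.pipe()
    try:
        os.lseek(read_fd, 0, os.SEEK_SET)
    except OSError as exc:
        return exc.errno
    finally:
        # Always release both ends of the pipe.
        os.close(read_fd)
        os.close(write_fd)
    return 0


if __name__ == "__main__":
    code = seek_errno_on_pipe()
    # Compare against the symbolic constant rather than a hard-coded number.
    print(code == errno.ESPIPE, os.strerror(code))
```

A seek on a regular file under /tmp would not normally fail this way, so the message suggests looking at what is actually handling the lseek call.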

Canvas314
Beginner
194 Views

What baffles me is that only global variables (declared via "extern") are initialized before main() in our code, and they are stored in each process's own memory space. Inside main(), MPI_Init_thread() is literally the first function call. I cannot think of any reason to relate these global variables to the creation of the shared-memory segment for IPC, nor of any user code that could affect its creation. Thank you.

PrasanthD_intel
Moderator
174 Views

Hi Kevin,


Is it possible for you to share the code with us? That would help us debug the cause of the error.

Also, please share the command line you were using.


Regards

Prasanth


Canvas314
Beginner
150 Views

It turned out that a function defined in our library has the same name as a Linux system call. This led to multiple definitions of that function in the executable. After renaming the function, we were able to run two MPI processes on the same machine.

Thank you for the discussion.
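For anyone hitting a similar clash, one way to look for it is to list the global functions a binary or library defines and compare them against names that libc already provides. The sketch below is only an illustration under stated assumptions: it relies on GNU binutils `nm` being installed, the watch list SYSCALL_NAMES is a short hypothetical sample rather than a complete list, and the sample symbol table is invented (the thread does not say which function was renamed).

```python
import subprocess

# Hypothetical, deliberately short watch list; extend it for real use.
SYSCALL_NAMES = {"open", "close", "read", "write", "lseek",
                 "mmap", "stat", "select", "poll"}


def clashing_symbols(nm_output, reserved=SYSCALL_NAMES):
    """Return reserved names that appear as global functions in `nm` output."""
    clashes = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols look like "<address> <type> <name>";
        # type "T" marks a global function in the text section.
        if len(parts) == 3 and parts[1] == "T" and parts[2] in reserved:
            clashes.add(parts[2])
    return sorted(clashes)


def scan_binary(path):
    """Run `nm --defined-only` on an executable or library and report clashes."""
    result = subprocess.run(["nm", "--defined-only", path],
                            capture_output=True, text=True, check=True)
    return clashing_symbols(result.stdout)


if __name__ == "__main__":
    # Invented sample output, used so the example runs without a real binary.
    sample = "\n".join([
        "0000000000401136 T main",
        "0000000000401150 T lseek",
        "0000000000401170 t helper",
        "                 U write@GLIBC_2.2.5",
    ])
    print(clashing_symbols(sample))
```

On a real build, scan_binary("./a.out") would run the same check against the executable. The sketch only looks at strong global text symbols ("T"), so weak or local definitions are ignored.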

PrasanthD_intel
Moderator
109 Views

Hi Kevin,


Glad you have found the problem.

As your problem has been resolved, let us know if we can close this thread.


Regards

Prasanth



Canvas314
Beginner
95 Views

Yes, please.

PrasanthD_intel
Moderator
84 Views

Hi Kevin,


Thanks for the confirmation.

As your issue has been resolved, we are closing this thread. If you require additional assistance from Intel, please start a new thread.


Regards

Prasanth