Intel® HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

[solved] random problems with MPI + DAPL initialization in RedHat 5.4

Rafał_Błaszczyk
Hi, I sometimes have problems executing a program with Intel MPI.
It fails with an error on stderr (or stdout):
problem with execution of   on  wn20:  [Errno 13] Permission denied
What could be the problem?
Here is my ulimit -a:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 135167
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 135167
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
I've checked the logs on this node (wn20):
Mar 8 16:11:52 wn20 mpd: mpd starting; no mpdid yet
Mar 8 16:11:52 wn20 mpd: mpd has mpdid=wn20_45723 (port=45723)
Mar 8 16:11:53 wn20 mpd: wn20_45723 (run 1485): Warning: the directory pointed by TMPDIR (/tmp/pbs.2045.mgmt1) does not exist! /tmp will be used.
Mar 8 16:11:53 wn20 mpd: wn20_45723 (__init__ 1045): Warning: the directory pointed by TMPDIR (/tmp/pbs.2045.mgmt1) does not exist! /tmp will be used.
Mar 8 16:11:53 wn20 sshd[11867]: pam_unix(sshd:session): session closed for user routnwp
Mar 8 16:11:57 wn20 mpdman: mpdman starting new log; wn20_mpdman_120
Mar 8 16:11:57 wn20 mpdman: mpdman starting new log; wn20_mpdman_121
Mar 8 16:11:57 wn20 mpdman: mpdman starting new log; wn20_mpdman_122
Mar 8 16:11:57 wn20 mpdman: mpdman starting new log; wn20_mpdman_123
Mar 8 16:11:57 wn20 mpdman: mpdman starting new log; wn20_mpdman_124
Mar 8 16:11:57 wn20 mpdman: mpdman starting new log; wn20_mpdman_125
Mar 8 16:11:57 wn20 mpdman: mpdman starting new log; wn20_mpdman_126
Mar 8 16:11:57 wn20 mpdman: mpdman starting new log; wn20_mpdman_127
Mar 8 16:12:07 wn20 mpd: mpd ending mpdid=wn20_45723 (inside cleanup)
Dmitry_K_Intel2
Employee

Hi,

Could you provide the command line and the output in verbose mode, if possible?

Regards!

Dmitry

Dmitry_K_Intel2
Employee

>Presumably meaning with environment variable I_MPI_DEBUG=9

It seems to me that the issue is related to mpdboot (or mpirun), so I mean the '--verbose' option of that command.
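For example, something like this (the node count, host file name, and binary are placeholders, not your actual setup):

# bring up the mpd ring by hand with verbose output
mpdboot --verbose -n 8 -f ~/mpd.hosts
mpiexec -np 8 ./your_app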

Regards!

Dmitry

Rafał_Błaszczyk
Hi Dmitry,
Thanks for the tip. Unfortunately, I cannot reproduce the problem at will.
I've raised I_MPI_DEBUG to 5 as suggested in the documentation; will 9 give more verbosity?
Here is what we've got now (with I_MPI_DEBUG=5):
[56] MPI startup(): DAPL provider OpenIB-cma specified in DAPL configuration file /etc/dat.conf
[cli_56]: got unexpected response to get :cmd=get kvsname=kvs_wn3_49596_0_0 key=DAPL_MISMATCH
:
[cli_56]: got unexpected response to put :cmd=put kvsname=kvs_wn3_49596_0_0 key=P56-businesscard value=rdma_port#21114$rdma_host#2:0:0:192:168:20:10:0:0:0:0:0:0:0:0$
:
[cli_56]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283)...: Initialization failed
MPIDD_Init(98)..........: channel initialization failed
MPIDI_CH3_Init(261).....:
MPIDI_CH3U_Init_rdma(64): PMI_KVS_Put returned -1
What could be the possible problem?
Here is also the output of env from one of the MPI processes (I'm running a bash script via mpirun to debug things more closely at the MPI-process level):
I_MPI_INFO_LCPU=16
I_MPI_INFO_SIGN=67237
VT_MPI=impi3
I_MPI_INFO_PACK=1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
I_MPI_PIN_MAP=56 1,57 5,58 3,59 7,60 0,61 4,62 2,63 6
I_MPI_PIN_INFO=6
I_MPI_INFO_CACHE_SHARE=2,2,16
I_MPI_PIN_UNIT=6
I_MPI_INFO_THREAD=0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1
I_MPI_INFO_CACHES=3
I_MPI_INFO_CORE=0,0,2,2,1,1,3,3,0,0,2,2,1,1,3,3
I_MPI_DEVICE=rdma
I_MPI_RDMA_EAGER_THRESHOLD=25972
I_MPI_INFO_CACHE_SIZE=32768,262144,8388608
I_MPI_DEBUG=5
I_MPI_INFO_CACHE1=8,0,10,2,9,1,11,3,8,0,10,2,9,1,11,3
I_MPI_PIN_MAP_SIZE=8
I_MPI_INFO_CACHE2=8,0,10,2,9,1,11,3,8,0,10,2,9,1,11,3
I_MPI_INFO_CACHE3=1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
I_MPI_PERHOST=allcores
MPICH_INTERFACE_HOSTNAME=192.168.0.10
I_MPI_ROOT=/opt/intel/impi/3.2.1.009
One more thing: we run it through a batch scheduler. After the task ran, I saw that an mpd process still existed - could that be connected with the problem?
python /opt/intel/impi/3.2.1.009/bin64/mpd.py -h wn9 -p 34585 --ifhn=192.168.0.13 --ncpus=1 --myhost=wn13 --myip=192.168.0.13 -e -d -s 5
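A sketch of how such leftovers can be cleaned up by hand, assuming the standard mpd ring tools from $I_MPI_ROOT/bin64 are on PATH (the host file name is hypothetical; check the tool names against your installation):

mpdallexit                # ask the whole mpd ring to shut down cleanly
mpdcleanup -f mpd.hosts   # kill leftover mpds and remove their sockets on the listed hosts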
Dmitry_K_Intel2
Employee
This problem is most likely related to configuration of OFED or IP addresses for IPoIB.

Again, I don't see your command line - it might be useful in some cases.
What is your DAPL version (run 'ofed_info' command)?
Could you provide /etc/dat.conf?
What interconnect cards do you use?

The higher you set I_MPI_DEBUG, the more information you get.

Please try to run your application with I_MPI_DEVICE set to 'sock'.
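For example (process count and binary are placeholders):

mpirun -env I_MPI_DEBUG 5 -env I_MPI_DEVICE sock -np 4 ./your_app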

Regards!
Dmitry
Rafał_Błaszczyk
>This problem is most likely related to configuration of OFED or IP addresses for IPoIB.
I'll check that, thanks. The problem is that it happens randomly and only in particular jobs, while the configuration is static...

My command line is
mpirun -r ssh -env I_MPI_DEBUG 5 -env I_MPI_DEVICE rdssm -np 196 /full/path/bin/cm_w_00.0.0.2.sh
where cm_w_00.0.0.2.sh contains
/full/path/bin/cm > $logbin 2>&1
and a few other commands that redirect output to log files (like $logbin) with names unique to each MPI process, gathering debugging data such as the output of ps, ulimit, etc. - but there is nothing interesting in those logs.
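A minimal sketch of what the wrapper does (a reconstruction, not the exact script; the variable names and the PMI_RANK rank variable are my assumptions - use whatever rank id your environment actually provides):

#!/bin/bash
# hypothetical reconstruction of cm_w_00.0.0.2.sh
rank=${PMI_RANK:-unknown}       # rank id exported by the process manager (assumption)
logdir=/full/path/logs
ulimit -a > "$logdir/ulimit.$rank.log" 2>&1
ps -ef    > "$logdir/ps.$rank.log" 2>&1
logbin="$logdir/cm.$rank.log"
exec /full/path/bin/cm > "$logbin" 2>&1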
I'm using stock RHEL5.4 OFED
dapl is dapl-2.0.19-2.el5 from the repos; I do not have the ofed_info command.
/etc/dat.conf is /etc/ofed/dat.conf in RHEL:
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
I'm thinking about trying the sock device, but that would rather sidestep a problem which happens randomly than solve it - do you think it's a good idea?
I'm using Mellanox ConnectX:
ibv_devinfo
hca_id: mlx4_0
fw_ver: 2.6.100
node_guid: 0023:7dff:ff94:4518
sys_image_guid: 0023:7dff:ff94:451b
vendor_id: 0x02c9
vendor_part_id: 26428
hw_ver: 0xA0
board_id: HP_0120000009
phys_port_cnt: 2
port: 1
state: active (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 6
port_lid: 4
port_lmc: 0x00
port: 2
state: active (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 6
port_lid: 5
port_lmc: 0x00
Dmitry_K_Intel2
Employee
Your first output refers to the OpenIB-cma provider, but there is no such provider in the dat.conf you sent me.
So you probably need to set the DAT_OVERRIDE=/etc/ofed/dat.conf variable to point to the correct dat.conf file.
Could you also change the I_MPI_DEVICE env variable to:
-env I_MPI_DEVICE rdssm:ofa-v2-mlx4_0-1
In this case mlx4_0 will be used explicitly.

>The problem is that it happens randomly and only in particular jobs
This is very strange. It might be something wrong with the cluster configuration, or unstable behavior of some nodes.
Could you also add:
-env I_MPI_FALLBACK_DEVICE off
to your command line.
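Putting all of this together with your command line, the run would look roughly like this (DAT_OVERRIDE is passed via -env so that it reaches the remote ranks):

mpirun -r ssh -env DAT_OVERRIDE /etc/ofed/dat.conf -env I_MPI_DEBUG 5 -env I_MPI_DEVICE rdssm:ofa-v2-mlx4_0-1 -env I_MPI_FALLBACK_DEVICE off -np 196 /full/path/bin/cm_w_00.0.0.2.sh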

Let me know the result.

Regards!
Dmitry

Rafał_Błaszczyk
I've tried your suggestion; it gave me:
[0] DAPL provider is not found and fallback device is not enabled
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283): Initialization failed
MPIDD_Init(98).......: channel initialization failed
MPIDI_CH3_Init(163)..: generic failure with errno = -1
(unknown)():
[0] MPI startup(): Intel MPI Library, Version 3.2.1 Build 20090312
[0] MPI startup(): Copyright (C) 2003-2009 Intel Corporation. All rights reserved.
rank 0 in job 1 wn1_33304 caused collective abort of all ranks
exit status of rank 0: return code 13
I've checked what the dapl library's default dat.conf is:
wn3 ~]$ dapltest
Dapltest: Service Point Ready - ofa-v2-ib0
I've tried to use the same provider name with Intel mpirun, with the same result:
[0] DAPL provider is not found and fallback device is not enabled
The weird thing is that when running with just I_MPI_DEVICE=rdssm I get:
[0] MPI startup(): DAPL provider OpenIB-cma specified in DAPL configuration file /etc/dat.conf
[0] MPI startup(): RDMA, shared memory, and socket data transfer modes
[0] MPI startup(): Intel MPI Library, Version 3.2.1 Build 20090312
[0] MPI startup(): Copyright (C) 2003-2009 Intel Corporation. All rights reserved.
So it's trying to use OpenIB-cma, even though that provider isn't defined anywhere; the weird thing is that it works - just not always...
So Intel MPI is not using the dat.conf that dat itself is using?
I'll also try to link dat.conf from /etc/ofed to /etc
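i.e. something like this (it would have to be done on every node):

ln -s /etc/ofed/dat.conf /etc/dat.conf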
Rafał_Błaszczyk
> I'll also try to link dat.conf from /etc/ofed to /etc

nope, linking won't work :/
Rafał_Błaszczyk
I realized that Intel MPI on RHEL 5.4 is not using dapl - it's using compat-dapl, which has a different dat.conf (don't ask me why):
# cat /etc/ofed/compat-dapl/dat.conf
OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" ""
OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 1" ""
OpenIB-mthca0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 2" ""
OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
OpenIB-ipath0-1 u1.2 nonthreadsafe default libdaplscm.so.2 dapl.1.2 "ipath0 1" ""
OpenIB-ipath0-2 u1.2 nonthreadsafe default libdaplscm.so.2 dapl.1.2 "ipath0 2" ""
OpenIB-ehca0-2 u1.2 nonthreadsafe default libdaplscm.so.2 dapl.1.2 "ehca0 1" ""
OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
I've used OpenIB-mlx4_0-1 in I_MPI_DEVICE and it runs OK for now - I'm waiting to see if this error appears again.
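For reference, the command line now looks like this (same as before, only the device string changed):

mpirun -r ssh -env I_MPI_DEBUG 5 -env I_MPI_DEVICE rdssm:OpenIB-mlx4_0-1 -np 196 /full/path/bin/cm_w_00.0.0.2.sh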

Do you think this is what you wanted me to do?
Andres_M_Intel4
Employee
Perhaps the following link may be helpful for understanding DAPL providers.

I would recommend not having invalid DAPL entries in your dat.conf file; you may want to offer your cluster users only those which are fully functional.

http://software.intel.com/en-us/articles/intel-mpi-library-for-linux-experience-with-various-interconnects-and-dapl-providers/

Rafał_Błaszczyk
Hi, thanks. I've already read that.
The problem was that the entries were not completely bad - I was just using the wrong names (from the other dat.conf). What you wanted was for me to use a fixed name for the DAPL provider, right?
Could that help in solving this issue?
[94] MPI startup(): DAPL provider OpenIB-cma specified in DAPL configuration file
[cli_94]: got unexpected response to get :cmd=get kvsname=kvs_wn3_49596_0_0 key=DAPL_MISMATCH
:
[cli_94]: got unexpected response to put :cmd=put kvsname=kvs_wn3_49596_0_0 key=P94-businesscard
value=rdma_port#18839$rdma_host#2:0:0:192:168:20:14:0:0:0:0:0:0:0:0$


It's an Intel MPI message - could you explain to me what it means? I cannot find any docs about it.
It looks like the DAPL provider has been chosen (the message was the same when the job ran fine).
Rafał_Błaszczyk
After changing I_MPI_DEVICE to OpenIB-mlx4_0-1, I got the following
from the RANK 0 process on wn3:
[0] MPI startup(): DAPL provider OpenIB-mlx4_0-1
[cli_0]: got unexpected response to get :cmd=get kvsname=kvs_wn3_37604_0_0 key=DAPL_MISMATCH
:
[cli_0]: got unexpected response to put :cmd=put kvsname=kvs_wn3_37604_0_0 key=shm_name value=2D1921C52957AD9B5645EBCD4BA371D0
:
[0] MPI startup(): Intel MPI Library, Version 3.2.1 Build 20090312
[0] MPI startup(): Copyright (C) 2003-2009 Intel Corporation. All rights reserved.
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283)....: Initialization failed
MPIDD_Init(98)...........: channel initialization failed
MPIDI_CH3_Init(319)......:
MPIDI_CH3U_Init_sshm(239): PMI_KVS_Put returned -1
(unknown)():
and from the other processes:
[14] MPI startup(): DAPL provider OpenIB-mlx4_0-1
[cli_14]: got unexpected response to get :cmd=get kvsname=kvs_wn3_37604_0_0 key=DAPL_MISMATCH
:
[cli_14]: PMIU_parse_keyvals: unexpected key delimiter at character 1 in !
[cli_14]: expecting cmd=barrier_out, got !
[cli_14]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283)....: Initialization failed
MPIDD_Init(98)...........: channel initialization failed
MPIDI_CH3_Init(319)......:
MPIDI_CH3U_Init_sshm(257): PMI_Barrier returned -1
(unknown)():
Rafał_Błaszczyk
I've found a solution to my problem.

I believe it was the same problem as described here: http://software.intel.com/en-us/articles/random-fabric-errors-on-rhel5U4/
(the workaround I_MPI_RDMA_CREATE_CONN_QUAL=0 seemed to work too)
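For anyone hitting the same thing, the workaround is just one extra environment variable on the mpirun line, e.g.:

mpirun -r ssh -env I_MPI_RDMA_CREATE_CONN_QUAL 0 -np 196 /full/path/bin/cm_w_00.0.0.2.sh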


After upgrading to OFED 1.5 with its newer DAPL, the problem was finally solved.


The DAPL version shipped with RedHat 5.4 seems buggy.
BTW: if anyone knows why RedHat decided to ship two separate dat.conf files, one per dapl version (1 and 2), please let me know.
I have successfully used the new UCM interface (v2) with ConnectX (ofa-v2-mlx4_0-1u in dat.conf), which seems to be much faster with many-core jobs than the old CMA provider.
I believe that when sticking to the RH-provided OFED it's good to have one common dat.conf (via DAT_OVERRIDE) with providers from both DAPL 1 and DAPL 2.
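A sketch of what I mean (the file location is arbitrary, the combined file would have to exist on every node, and I'm assuming both dat registries honor DAT_OVERRIDE):

# merge the DAPL 1.x and 2.0 provider lists into one file
cat /etc/ofed/compat-dapl/dat.conf /etc/ofed/dat.conf > /etc/dat-combined.conf
export DAT_OVERRIDE=/etc/dat-combined.conf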
Dmitry_K_Intel2
Employee
Rafal, thanks for sharing this information.