Intel® MPI Library

shm and ssm don't work, but rdma does

jon
Beginner
I've installed the Intel Cluster Toolkit on a cluster of Red Hat EL 4.5 quad-core Opteron machines connected with InfiniBand and gigabit Ethernet. The sock and rdma devices work fine, but shm and ssm don't. Should they?

$ mpirun -n 2 -env I_MPI_DEBUG 0 -env I_MPI_DEVICE shm ./test_c
WARNING: Can't read mpd.hosts for list of hosts, start only on current
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283)...: Initialization failed
MPIDD_Init(98)..........: channel initialization failed
MPIDI_CH3_Init(203).....:
MPIDI_CH3_SHM_Init(1704): unable to open shared memory object /Intel_MPI_26805_1189641421 (errno 13)
rank 0 in job 1 l01_43514 caused collective abort of all ranks
exit status of rank 0: killed by signal 9

$ mpirun -n 2 -env I_MPI_DEBUG 0 -env I_MPI_DEVICE ssm ./test_c
WARNING: Can't read mpd.hosts for list of hosts, start only on current
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283)..................: Initialization failed
MPIDD_Init(98).........................: channel initialization failed
MPIDI_CH3_Init(295)....................:
MPIDI_CH3U_Init_sshm(233)..............: unable to create a bootstrap message queue
MPIDI_CH3I_BootstrapQ_create_named(329): failed to create a shared memory message queue
MPIDI_CH3I_mqshm_create(96)............: Out of memory
MPIDI_CH3I_SHM_Get_mem_named(619)......: unable to open shared memory object /mpich2q4F06DE4F7724E8FD36DE0FD2497E2D4B (errno 13)
rank 0 in job 1 l01_43550 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
2 Replies
Gergana_S_Intel
Employee
Hi jon,

Yes, they should. The most likely cause is that the /dev/shm device is not mounted on your cluster. Can you verify that? The shared memory device (/dev/shm) is how the Intel MPI Library communicates within a node when using the shm, ssm, or rdssm devices, and errno 13 in your error stack is EACCES ("Permission denied"), which typically means /dev/shm is missing or not writable by your user.

For example:
[user@cluster ~]$ ls /dev | grep shm
shm
[user@cluster ~]$ df -k | grep shm
tmpfs 8208492 0 8208492 0% /dev/shm
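
Since errno 13 means "Permission denied", it's also worth confirming that the mount point is world-writable with the sticky bit set (mode 1777, the tmpfs default). A quick check, with illustrative output:
[user@cluster ~]$ ls -ld /dev/shm
drwxrwxrwt 2 root root 40 Sep 12 14:02 /dev/shm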

And, of course, since it's mounted as a tmpfs file system, you need to make sure that it's listed correctly in your fstab file:
[user@cluster ~]$ cat /etc/fstab | grep shm
tmpfs /dev/shm tmpfs defaults 0 0
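
If /dev/shm turns out not to be mounted at all, mounting it by hand (as root, on every node) should get you going immediately; the fstab entry above then makes it persistent across reboots. A minimal sketch:
[root@cluster ~]# mount -t tmpfs tmpfs /dev/shm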

When you're using rdma, you're running only over the DAPL interface, so you never touch the shared memory path. The same goes for sock, which runs purely over the TCP/IP stack.
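
Once /dev/shm is mounted and writable on every node, the same command that failed before should run cleanly, for example:
[user@cluster ~]$ mpirun -n 2 -env I_MPI_DEVICE ssm ./test_c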

I hope this helps.

Regards,
~Gergana
jon
Beginner
Gergana, that indeed was the problem. Thanks for the thorough response.