Intel® MPI Library

shm and ssm don't work, but rdma does

jon
Beginner
I've installed the Intel Cluster Toolkit on a cluster of Red Hat EL 4.5 quad-core Opteron machines connected with InfiniBand and gigabit Ethernet. The sock and rdma devices work fine, but shm and ssm don't. Should they?

$ mpirun -n 2 -env I_MPI_DEBUG 0 -env I_MPI_DEVICE shm ./test_c
WARNING: Can't read mpd.hosts for list of hosts, start only on current
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283)...: Initialization failed
MPIDD_Init(98)..........: channel initialization failed
MPIDI_CH3_Init(203).....:
MPIDI_CH3_SHM_Init(1704): unable to open shared memory object /Intel_MPI_26805_1189641421 (errno 13)
rank 0 in job 1 l01_43514 caused collective abort of all ranks
exit status of rank 0: killed by signal 9

$ mpirun -n 2 -env I_MPI_DEBUG 0 -env I_MPI_DEVICE ssm ./test_c
WARNING: Can't read mpd.hosts for list of hosts, start only on current
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283)..................: Initialization failed
MPIDD_Init(98).........................: channel initialization failed
MPIDI_CH3_Init(295)....................:
MPIDI_CH3U_Init_sshm(233)..............: unable to create a bootstrap message queue
MPIDI_CH3I_BootstrapQ_create_named(329): failed to create a shared memory message queue
MPIDI_CH3I_mqshm_create(96)............: Out of memory
MPIDI_CH3I_SHM_Get_mem_named(619)......: unable to open shared memory object /mpich2q4F06DE4F7724E8FD36DE0FD2497E2D4B (errno 13)
rank 0 in job 1 l01_43550 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
2 Replies
Gergana_S_Intel
Employee
Hi jon,

Yes, they should. The most likely cause is that the /dev/shm device is not mounted on your cluster. Can you verify that? The shared memory device (/dev/shm) is how the Intel MPI Library communicates within a node when using the shm, ssm, or rdssm devices, and errno 13 in your error stack is EACCES ("Permission denied"), which typically means /dev/shm is missing or not writable by your user.

For example:
[user@cluster ~]$ ls /dev | grep shm
shm
[user@cluster ~]$ df -k | grep shm
tmpfs 8208492 0 8208492 0% /dev/shm
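
Since errno 13 means "Permission denied", it's also worth confirming that the mount point is world-writable with the sticky bit set (mode 1777, the tmpfs default). A quick check, with illustrative output:
[user@cluster ~]$ ls -ld /dev/shm
drwxrwxrwt 2 root root 40 Sep 12 14:02 /dev/shm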

And, of course, since it's mounted as a tmpfs file system, you need to make sure that it's listed correctly in your fstab file:
[user@cluster ~]$ cat /etc/fstab | grep shm
tmpfs /dev/shm tmpfs defaults 0 0
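
If /dev/shm turns out not to be mounted at all, mounting it by hand (as root, on every node) should get you going immediately; the fstab entry above then makes it persistent across reboots. A minimal sketch:
[root@cluster ~]# mount -t tmpfs tmpfs /dev/shm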

When you're using rdma, you're running only over the DAPL interface, so you never touch the shared memory path. The same goes for sock, which runs purely over the TCP/IP stack.
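
Once /dev/shm is mounted and writable on every node, the same command that failed before should run cleanly, for example:
[user@cluster ~]$ mpirun -n 2 -env I_MPI_DEVICE ssm ./test_c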

I hope this helps.

Regards,
~Gergana
jon
Beginner
Gergana, that indeed was the problem. Thanks for the thorough response.