I've installed the Intel Cluster Toolkit on a cluster of Red Hat EL 4.5 quad-core Opteron machines connected with InfiniBand and GigE. The sock and rdma devices work fine, but shm and ssm don't. Should they?
$ mpirun -n 2 -env I_MPI_DEBUG 0 -env I_MPI_DEVICE shm ./test_c
WARNING: Can't read mpd.hosts for list of hosts, start only on current
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283)...: Initialization failed
MPIDD_Init(98)..........: channel initialization failed
MPIDI_CH3_Init(203).....:
MPIDI_CH3_SHM_Init(1704): unable to open shared memory object /Intel_MPI_26805_1189641421 (errno 13)
rank 0 in job 1 l01_43514 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
$ mpirun -n 2 -env I_MPI_DEBUG 0 -env I_MPI_DEVICE ssm ./test_c
WARNING: Can't read mpd.hosts for list of hosts, start only on current
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283)..................: Initialization failed
MPIDD_Init(98).........................: channel initialization failed
MPIDI_CH3_Init(295)....................:
MPIDI_CH3U_Init_sshm(233)..............: unable to create a bootstrap message queue
MPIDI_CH3I_BootstrapQ_create_named(329): failed to create a shared memory message queue
MPIDI_CH3I_mqshm_create(96)............: Out of memory
MPIDI_CH3I_SHM_Get_mem_named(619)......: unable to open shared memory object /mpich2q4F06DE4F7724E8FD36DE0FD2497E2D4B (errno 13)
rank 0 in job 1 l01_43550 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
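For what it's worth, errno 13 in those stack traces maps to EACCES (permission denied); assuming Python is available on the nodes, the mapping can be confirmed with a quick one-liner:
$ python -c 'import os; print(os.strerror(13))'
Permission denied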
1 Solution
Hi jon,
Yes, they should. The most likely cause of this is that the /dev/shm device is not mounted on your cluster. Can you verify that? The shared memory device (/dev/shm) is how the Intel MPI Library communicates when using the shm, ssm, or rdssm devices.
For example:
[user@cluster ~]$ ls /dev | grep shm
shm
[user@cluster ~]$ df -k | grep shm
tmpfs 8208492 0 8208492 0% /dev/shm
And, of course, since it's mounted as a tmpfs file system, you need to make sure that it's listed correctly in your fstab file:
[user@cluster ~]$ cat /etc/fstab | grep shm
tmpfs /dev/shm tmpfs defaults 0 0
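Since errno 13 is a permission error, it's also worth confirming that the mount point itself is world-writable with the sticky bit set (the usual tmpfs default, mode 1777):
[user@cluster ~]$ ls -ld /dev/shm
The permissions column should start with drwxrwxrwt.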
When you're using rdma, you're only running over the DAPL interface, so you never hit the shared memory issue; the same goes for sock, which runs over the TCP/IP stack.
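If /dev/shm does turn out to be missing, something along these lines (run as root on each node) should bring it up for a quick test; adjust to your site's conventions:
[root@cluster ~]# mkdir -p /dev/shm
[root@cluster ~]# mount -t tmpfs tmpfs /dev/shm
With the fstab entry above in place, the mount will also come back after a reboot.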
I hope this helps.
Regards,
~Gergana
Gergana, that indeed was the problem. Thanks for the thorough response.