Hi There,
I have a system with 6 compute nodes. The /opt folder is shared over NFS, and Intel Parallel Studio Cluster Edition is installed on the NFS server.
I am using Slurm as the workload manager. When I run a VASP job on 1 node there is no problem, but when I run the job on 2 or more nodes I get the following errors:
rank = 28, revents = 29, state = 1
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2988: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 0
I tested SSH between the compute nodes with sshconnectivity.exp /nodefile
The user information is shared over an LDAP server, which is the head node.
I couldn't find a working solution on the net. Has anyone ever had this error?
Thanks.
OK, I think this is not directly related to Intel Parallel Studio.
Hi Firat
Have you solved this problem? Unfortunately, I am now facing it as well.
Thanks
zhongqi
Hi Zhongqi,
I fixed the problem but didn't record what I did. All I can say is that it was related to the InfiniBand configuration.
Have a good day.
Just in case someone checks this topic in the future:
I have solved my problem. It was caused by the contents of the file '/etc/hosts'.
In my case, there were two IP addresses with the same machine name, e.g.:
192.168.14.1 debian
192.168.15.1 debian
Deleting the IP address that is not in use for MPI fixes the problem.
zhongqi
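For anyone who wants to check their own nodes for the duplicate-hostname situation described above, here is a minimal sketch in Python. It is not from the original posters; the function name `find_ambiguous_hostnames` is made up for illustration, and the sample text simply reuses the two example lines from this thread. It only inspects hosts-file text, so it will not catch names that resolve differently through DNS or LDAP.

```python
from collections import defaultdict


def find_ambiguous_hostnames(hosts_text):
    """Return {hostname: [ip, ...]} for every name listed with more than one IP.

    hosts_text is the content of a hosts file (hypothetical helper, not part
    of any MPI or system tool).
    """
    ips_by_name = defaultdict(list)
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            key = name.lower()  # host names are case-insensitive
            if ip not in ips_by_name[key]:
                ips_by_name[key].append(ip)
    return {name: ips for name, ips in ips_by_name.items() if len(ips) > 1}


# Sample taken from the example in this thread. To check a real node, pass
# the text read from /etc/hosts instead.
sample = """\
192.168.14.1 debian
192.168.15.1 debian
"""

for name, ips in find_ambiguous_hostnames(sample).items():
    print(f"{name} is mapped to several addresses: {', '.join(ips)}")
```

On a real system, `localhost` is often listed for both 127.0.0.1 and ::1, so it may be reported as well; that entry is normally harmless. The entries to look at are the node names that MPI uses to reach other machines.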