Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Intel MPI multinode run problem

Firat_Y_
Beginner

Hi There,

I have a system with 6 compute nodes; the /opt folder is NFS-shared, and Intel Parallel Studio Cluster Edition is installed on the NFS server.

I am using Slurm as the workload manager. When I run a VASP job on one node there is no problem, but when I start the job on two or more nodes I get the following errors:

rank = 28, revents = 29, state = 1
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2988: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 0
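
For what it's worth, a minimal MPI hostname test (a sketch using only the standard MPI C API; test_hostname.c is just an example file name, compiled with mpiicc or mpicc) is a quick way to check whether the failure comes from the fabric/host configuration rather than from VASP itself:

/* test_hostname.c: each rank reports its host; the barrier forces
 * real inter-node communication, so a broken fabric shows up here too. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = 0, size = 0, len = 0;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("rank %d of %d running on %s\n", rank, size, host);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

If this test also aborts with the same assertion on two or more nodes, the problem is in the MPI/network setup rather than in the application.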

 

I tested the SSH connectivity between the compute nodes with sshconnectivity.exp /nodefile.

The user information is shared via an LDAP server, which is the head node.

I couldn't find a working solution on the net. Has anyone ever had this error?

Thanks.
4 Replies
Firat_Y_
Beginner

OK, I think it is not directly related to Intel Parallel Studio.

Zhongqi_Zhang
Novice

Hi Firat,

Have you solved this problem? Unfortunately, I am now facing this problem as well.

Thanks,

zhongqi

Firat_Y_
Beginner

Hi Zhongqi,

I had fixed the problem, but I didn't record what I did. All I can say is that it was related to the InfiniBand configuration.

Have a good day.

Zhongqi_Zhang
Novice

Just in case someone checks this topic in the future:

I have solved my problem. It was caused by the settings in the /etc/hosts file.

In my case, there were two IP addresses mapped to the same machine name, e.g.:

192.168.14.1 debian

192.168.15.1 debian

Deleting the IP address that is not used for MPI fixes the problem.
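
For anyone who wants to double-check this on their own cluster, a small resolver test (a sketch; resolve_check.c is just an example name, and the exact behavior depends on the glibc/NSS configuration) prints every IPv4 address a hostname resolves to. Seeing two different addresses for the same node name is exactly the ambiguity described above:

/* resolve_check.c: list every IPv4 address the resolver returns for a name. */
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    const char *name = (argc > 1) ? argv[1] : "debian";
    struct addrinfo hints, *res, *p;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;        /* IPv4 only, matching the /etc/hosts entries */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(name, NULL, &hints, &res) != 0) {
        fprintf(stderr, "could not resolve %s\n", name);
        return 1;
    }
    for (p = res; p != NULL; p = p->ai_next) {
        char buf[INET_ADDRSTRLEN];
        struct sockaddr_in *sin = (struct sockaddr_in *)p->ai_addr;
        inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf));
        printf("%s -> %s\n", name, buf);
    }
    freeaddrinfo(res);
    return 0;
}

Run it with the node name as the argument on each node; if more than one address comes back, clean up /etc/hosts as described.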

zhongqi
