Intel MPI & LSF compatibility

 

Hello, Everyone.

 

Our forecast model, which runs over MPI, produces different calculation results when a specific node is included in the run.
I suspect the problem is related to mpirun's hostfile option.
If I use the hostfile allocated by the LSF job scheduler as is, no problem occurs. However, if I sort the hostfile and pass it with the -f option, the calculation results differ randomly.

I think this is because the head node allocated by LSF differs from the first host in the sorted hostfile.
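
One way to check this hypothesis before launching is to compare the first entry of the two host lists. The following is a minimal sketch, not part of the original job setup: the helper names `first_host` and `head_changed` are hypothetical, the host `duru0301` and the allocation order are made up for illustration, and any optional `:<count>` suffix on a host entry is assumed and simply dropped.

```python
def first_host(hostfile_text):
    """Return the first host name in a hostfile, skipping blanks and comments."""
    for line in hostfile_text.splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            # Assumption: an entry may look like "host:count"; keep only the name.
            return entry.split(":")[0]
    return None


def head_changed(original_text, sorted_text):
    """True if sorting moved a different host to the top of the list."""
    return first_host(original_text) != first_host(sorted_text)


# Hypothetical allocation order as handed out by the scheduler.
lsf_hosts = "duru0333\nduru0252\nduru0301\n"
# The same hosts after a plain lexical sort.
sorted_hosts = "\n".join(sorted(lsf_hosts.split())) + "\n"

print("scheduler head:", first_host(lsf_hosts))
print("sorted head:   ", first_host(sorted_hosts))
print("head changed:  ", head_changed(lsf_hosts, sorted_hosts))
```

If the two heads differ, the sorted hostfile no longer starts with the host on which the scheduler started the job, which is the situation described above.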
Is it possible to specify the following two values explicitly when running mpirun?


1) hydra_bstrap_proxy --upstream-host duru0333
2) pmi_proxy --control-port duru0252:44774

 

I would like to test whether setting them makes the problem go away.

 

Thank you in advance.
Kihang


Accepted Solutions

Hi Kihang,


Does this issue occur if the specific node is not included in the queue?

How are you sorting the node list, and why?

Can you set I_MPI_HYDRA_DEBUG=on and check whether --upstream-host differs between the two cases?

(1. Using the default node list in $PBS_NODEFILE

2. Using the sorted hostfile you have generated)
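
To make that comparison less error-prone, the `--upstream-host` values can be pulled out of the two captured debug logs and compared. This is only a rough sketch: the function name `upstream_hosts` is hypothetical, the sample strings are stand-ins modeled on the `hydra_bstrap_proxy --upstream-host duru0333` fragment quoted in the question rather than real Hydra output, and the value for the sorted run is made up.

```python
import re

# Matches "--upstream-host <name>" and captures the host name.
UPSTREAM_RE = re.compile(r"--upstream-host\s+(\S+)")


def upstream_hosts(debug_output):
    """Return the distinct --upstream-host values found in captured debug output."""
    return set(UPSTREAM_RE.findall(debug_output))


# Stand-in log text for the two runs (not real Hydra output).
default_run_log = "hydra_bstrap_proxy --upstream-host duru0333\n"
sorted_run_log = "hydra_bstrap_proxy --upstream-host duru0252\n"  # hypothetical value

default_hosts = upstream_hosts(default_run_log)
sorted_hosts_seen = upstream_hosts(sorted_run_log)

print("default run:", sorted(default_hosts))
print("sorted run: ", sorted(sorted_hosts_seen))
print("same upstream host:", default_hosts == sorted_hosts_seen)
```

In practice the two strings would be replaced with the saved output of each run.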


In our case, the upstream host stays the same as with PBS_NODEFILE even when the order of hosts in the hostfile changes.

Can you provide us with any reproducer code?


Regarding setting the upstream host manually, we will discuss it with our internal team and get back to you.


Regards

Prasanth



Hello Prasanth,

I misidentified the cause of the error described in my previous post.

The error does not depend on the order of the hosts.

It occurs even when running on a single node, and the cause is different, so I will close this thread and open a new one.


Thank you.

Kihang


Hi Kihang,


As you suggested, and since you have opened a new thread for your problem, we are closing this one.

Any further interaction in this thread will be considered community only.


Regards

Prasanth