Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Intel MPI & LSF compatibility

youn__kihang
Novice

 

Hello, everyone.

 

When I run a forecast model that uses MPI communication, the calculation results differ whenever a specific node is included in the job.
I think the problem is probably related to mpirun's hostfile option.
If I run with the hostfile allocated by the LSF job scheduler as-is, no problem occurs. However, if I sort the hostfile and pass it with the -f option, the results differ randomly.

I think this is because the head node allocated by LSF and the head node in the sorted hostfile are different.
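If the goal is a sorted hostfile whose first entry is still the LSF head node, the sort can be done without moving that node. Below is a minimal Python sketch, assuming the LSF-provided hostfile lists one hostname per line (possibly repeated once per slot) and that its first line is the head node; the script and the function name sort_hostfile_keep_head are hypothetical.

```python
import sys
from typing import List


def sort_hostfile_keep_head(entries: List[str]) -> List[str]:
    """Sort hostfile entries by name, but keep every occurrence of the
    first host (assumed to be the LSF head node) at the top."""
    hosts = [e.strip() for e in entries if e.strip()]  # drop blank lines
    if not hosts:
        return []
    head = hosts[0]
    # False sorts before True, so head entries come first; the rest sort by name.
    return sorted(hosts, key=lambda h: (h != head, h))


def main() -> None:
    # Usage: python sort_hostfile.py <lsf_hostfile> <sorted_hostfile>
    if len(sys.argv) != 3:
        print("usage: sort_hostfile.py <lsf_hostfile> <sorted_hostfile>", file=sys.stderr)
        return
    src, dst = sys.argv[1], sys.argv[2]
    with open(src) as f:
        hosts = sort_hostfile_keep_head(f.readlines())
    with open(dst, "w") as f:
        f.write("\n".join(hosts) + "\n")


if __name__ == "__main__":
    main()
```

The resulting file can then be passed to mpirun with -f in place of the plainly sorted one, so that any remaining difference between runs is not caused by a change of head node.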
I am wondering whether the following two values can be specified when running mpirun:


1) hydra_bstrap_proxy --upstream-host duru0333
2) pmi_proxy --control-port duru0252:44774

 

I'd like to test whether setting these explicitly makes the problem go away.

 

Thank you in advance.
Kihang

PrasanthD_intel
Moderator

Hi Kihang,


Does this issue occur if the specific node is not included in the queue?

How are you sorting the node list, and why?

Can you set I_MPI_HYDRA_DEBUG=on and check whether --upstream-host differs between these two cases?

1. Using the default node list in $PBS_NODEFILE
2. Using the sorted hostfile you have generated
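Comparing the two runs comes down to pulling one option value out of each proxy launch line. Below is a minimal Python sketch, assuming you have copied the hydra_bstrap_proxy launch line from each run (for example from the I_MPI_HYDRA_DEBUG output) into a string, and that each option and its value are separated by a space, as in the lines in the original question; option_value and same_upstream_host are hypothetical helper names.

```python
import shlex
from typing import Optional


def option_value(command_line: str, option: str) -> Optional[str]:
    """Return the token that follows `option` in a captured command line,
    or None if the option is absent. Only the space-separated form
    ("--upstream-host node") is handled, not "--upstream-host=node"."""
    tokens = shlex.split(command_line)
    for i, tok in enumerate(tokens[:-1]):
        if tok == option:
            return tokens[i + 1]
    return None


def same_upstream_host(default_line: str, sorted_line: str) -> bool:
    """True if both captured launch lines name the same --upstream-host."""
    return option_value(default_line, "--upstream-host") == option_value(
        sorted_line, "--upstream-host"
    )


if __name__ == "__main__":
    print(option_value("hydra_bstrap_proxy --upstream-host duru0333", "--upstream-host"))
    print(option_value("pmi_proxy --control-port duru0252:44774", "--control-port"))
```

Run on the two lines from the original question, this prints duru0333 and duru0252:44774.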


In our tests, the upstream host stays the same as with $PBS_NODEFILE even when the order of the hostfile is changed.

Can you provide us with any reproducer code?


Regarding setting the upstream host manually, we will discuss this with our internal team and get back to you.


Regards

Prasanth



youn__kihang
Novice

 

Hello Prasanth,

I misidentified the cause of the error in my previous post.

The error does not depend on the order of the hosts.

It also occurs when running on a single node, and the cause is different, so I will close this thread and open a new one.


Thank you.

Kihang

PrasanthD_intel
Moderator

Hi Kihang,


Since you have opened a new thread for your problem, we are closing this one as you suggested.

Any further interaction in this thread will be considered community only.


Regards

Prasanth

