Hello, Everyone.
I have an issue where the calculation results differ when a specific node is included while running an MPI-based forecast model.
I think it is probably related to mpirun's hostfile option.
If the hostfile allocated by the LSF job scheduler is used as is, no problem occurs. However, if the hostfile is sorted and passed with the -f option, the calculation results differ at random.
I think this is because the head node allocated by LSF and the head node set in the sorted hostfile are different.
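One way to check this hypothesis is to compare the first entry of the two host lists. Below is a minimal Python sketch, not a definitive tool. The function names are hypothetical, the host order in the example is invented, and it assumes one host per line with an optional `:<slots>` suffix or trailing fields.

```python
def head_host(hostfile_lines):
    """Return the first host name in a host list, or None if it has no hosts."""
    for line in hostfile_lines:
        entry = line.strip()
        # Skip blank lines and comment lines (assumed hostfile conventions).
        if not entry or entry.startswith("#"):
            continue
        # Keep only the host name: drop trailing fields and a ":<slots>" suffix.
        return entry.split()[0].split(":")[0]
    return None


def same_head(original_lines, sorted_lines):
    """True if both host lists start with the same host."""
    return head_host(original_lines) == head_host(sorted_lines)


# Hypothetical example: host names come from the post, their order is invented.
lsf_hosts = ["duru0333", "duru0252"]
sorted_hosts = sorted(lsf_hosts)
print(head_host(lsf_hosts), head_host(sorted_hosts), same_head(lsf_hosts, sorted_hosts))
```

In practice the two lists would be read from the scheduler-provided hostfile and from the sorted copy passed to `-f`.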
I am wondering whether the following two settings can be specified when running mpirun:
1) hydra_bstrap_proxy --upstream-host duru0333
2) pmi_proxy --control-port duru0252:44774
I'd like to test whether setting them explicitly prevents the problem.
Thank you in advance.
Kihang
Hi Kihang,
Does this issue occur if the specific node is not included in the queue?
How are you sorting the node list, and why?
Can you set I_MPI_HYDRA_DEBUG=on and check whether --upstream-host differs between the two cases?
(1. Using default nodelist in $PBS_NODEFILE
2. The sorted hostfile you have generated)
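To compare the two runs, the debug output of each can be saved and searched for the `--upstream-host` argument. Below is a minimal Python sketch, assuming the output contains launch lines of the form quoted in the original post (`hydra_bstrap_proxy --upstream-host <name>`). The function name and the sample strings are hypothetical, and the real output format may differ between Intel MPI versions.

```python
import re

# Matches "--upstream-host <name>" or "--upstream-host=<name>" (assumed format).
UPSTREAM_RE = re.compile(r"--upstream-host[ =](\S+)")


def upstream_hosts(debug_output):
    """Return the distinct --upstream-host values, in order of first appearance."""
    seen = []
    for host in UPSTREAM_RE.findall(debug_output):
        if host not in seen:
            seen.append(host)
    return seen


# Invented sample text for illustration only, not real debug output.
default_run = "hydra_bstrap_proxy --upstream-host duru0333 --other-arg 1"
sorted_run = "hydra_bstrap_proxy --upstream-host nodeB --other-arg 1"

if upstream_hosts(default_run) != upstream_hosts(sorted_run):
    print("upstream host differs between the two runs")
else:
    print("upstream host is the same in both runs")
```

With real data, each string would be the captured output of one mpirun invocation with I_MPI_HYDRA_DEBUG=on.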
In our case, the node is the same as in PBS_NODEFILE even when the order in the hostfile is changed.
Can you provide us with any reproducer code?
Regarding setting the upstream host manually, we will discuss it with our internal team and get back to you.
Regards
Prasanth
Hello Prasanth,
I misidentified the cause of the error in my previous post.
The error does not depend on the order of the hosts.
It occurs even when running on a single node, and the cause is different, so I will close this thread and open another post.
Thank you.
Kihang
Hi Kihang,
Since you have opened a new thread for your problem, we are closing this thread as you suggested.
Any further interaction in this thread will be considered community only.
Regards
Prasanth