Intel® MPI Library

Compiler bug for Coarray Fortran over InfiniBand

as14
Beginner

Good afternoon,

There is a poorly documented bug in the Intel compiler when Coarray Fortran is used over InfiniBand, which requires the following fix: export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0.
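As a minimal sketch of how such a setting is usually applied: the variable is exported in the shell or job script before the coarray executable starts, so that the MPI runtime underneath inherits it. The executable name ./caf_program and the CAF_EXE variable are hypothetical placeholders, not taken from this thread.

```shell
# Fix quoted in the post above: export it before the coarray program starts.
export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0

# Hypothetical executable name; replace with your own coarray program.
CAF_EXE=${CAF_EXE:-./caf_program}

if [ -x "$CAF_EXE" ]; then
    "$CAF_EXE"
else
    # Nothing to launch: just report the setting that would be inherited.
    echo "MPIR_CVAR_CH4_OFI_ENABLE_RMA=$MPIR_CVAR_CH4_OFI_ENABLE_RMA (no executable at $CAF_EXE, nothing launched)"
fi
```

Running the same job once with the export and once without it is a simple way to measure the overhead the setting introduces.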

The bug and the fix are only mentioned in 2 or 3 places online (e.g. here https://blog.hpc.qmul.ac.uk/intel-release-2020_4.html). When can we expect this bug to be fixed?

More importantly, what does this compiler option actually do? What are the speed implications? Are there alternative bug fixes which are faster?

Intel officially supports MPI over InfiniBand, so this should be fixed for Coarrays too. See here: https://www.intel.com/content/www/us/en/developer/articles/technical/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html

and here: https://www.intel.com/content/www/us/en/developer/articles/technical/mpi-compatibility-nvidia-mellanox-ofed-infiniband.html

Thanks!

ShivaniK_Intel
Moderator

Hi,


Thanks for posting in the Intel forums.


We are working on this issue internally and will get back to you soon.


Thanks & Regards

Shivani


as14
Beginner

Dear @ShivaniK_Intel ,

 

Thanks for looking into this issue. Have you made any progress?

as14
Beginner

Hi Shivani,

 

Thanks for the reply. I look forward to your response and will keep an eye on this thread - this is an important issue.

 

Thanks again!

ShivaniK_Intel
Moderator

Hi,

 

Thank you for your patience.

 

We are working on it internally and will let you know once the fix is released.

 

Thanks & Regards

Shivani


ShivaniK_Intel
Moderator

Hi,

 

Thanks for your patience.

 

As we are in the middle of diagnosing the issue, could you please let us know which version of the Intel compiler you are using?

 

Thanks & Regards

Shivani

ShivaniK_Intel
Moderator

Hi,


As we did not hear back from you, could you please respond to my previous post?


Thanks & Regards

Shivani


as14
Beginner

Dear @ShivaniK_Intel,

 

Thank you for getting back to me about this - I appreciate your help! 

 

I am working on a cluster and using:


intel-oneapi-compilers/2022.0.2-gcc-11.2.0-yzi4tsu

intel-oneapi-mpi/2021.4.0-gcc-11.2.0-2e7zm7z
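
For anyone reporting the same problem, the versions of the compiler, MPI launcher and libfabric in the current environment can be collected along the following lines. This is a sketch only: the helper name report_version is made up for illustration, and it assumes the usual ifort, mpirun and fi_info commands are on PATH once the modules are loaded.

```shell
# Print the first lines of a tool's version banner, or a note if it is missing.
# report_version is a hypothetical helper, not part of any Intel tool.
report_version() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $1 =="
        "$@" 2>&1 | head -n 3
    else
        echo "== $1 == not found on PATH"
    fi
}

report_version ifort --version     # Intel Fortran compiler
report_version mpirun --version    # MPI launcher
report_version fi_info --version   # libfabric utility
```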
 
 
Thanks again!

ShivaniK_Intel
Moderator

Hi,


Could you please try with the latest Intel oneAPI version 2023.1 and MPI version 2021.9 and let us know if you face any issues?


Thanks & Regards

Shivani


as14
Beginner

Hi @ShivaniK_Intel,

Sorry for the delay - I needed to get my cluster admins to make this version of the compiler available. I have tested with these versions, and the result is worse: I cannot get coarrays to run at all. Even with the bug fix command (export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0), the run still fails with the error "[n3501-040:3970562:0:3970562] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x31)". This is extremely worrying, and such a shame, because the Intel implementation of coarrays is excellent apart from the buggy InfiniBand support. Can you offer any suggestions?

I have attached the log files from the runs, both with and without the bugfix command. You will see that with the command, the images initialise and display their information in sequence (so the "sync all" command is working). However, when they come to transfer information, the error appears.

I look forward to hearing how to fix this - as I say Intel CAF is excellent when it runs.

Thanks.

ShivaniK_Intel
Moderator

Hi,

 

Thanks for providing the details.

 

As you are still able to reproduce the issue with the latest Intel oneAPI version, could you please provide us with the sample reproducer and the steps you have followed to reproduce the issue?

 

Thanks & Regards

Shivani

 

ShivaniK_Intel
Moderator

Hi,


Could you also please provide us with your cluster details so that we can investigate further?


Thanks & Regards

Shivani


as14
Beginner

Dear @ShivaniK_Intel,

 

Thanks for your help with this. Please find attached the relevant files. You will find information on the cluster in the "cluster.info" file. 

 

The rest of the files show a simplified example which demonstrates the error.

 

You simply need to type on SLURM:
$ ./cluster.compile

$ ./cluster.submit

Just make sure you update the qos and partition in the cluster.slrm file.

You will see my SLURM outputs both with and without the bugfix command (which adds a significant time overhead).

 

If you need any more information please let me know! I am very much looking forward to getting this issue resolved.

 

Many thanks!

as14
Beginner

@ShivaniK_Intel I seem to be making some headway. After loading my intel modules and checking the libfabric version ("fi_info --version"), I see that it is using 1.13. I have found that the code works as expected (without the strange compiler setting) when I manually override this with a newer version of libfabric using:

export FI_PROVIDER_PATH=...

export LD_LIBRARY_PATH=...:$LD_LIBRARY_PATH

It therefore seems that this coarray bug relates to older libfabric versions. Intel MPI 2021.9 includes the libfabric version which was current at the time of its release. This needs to be updated.
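
The override described above can be sketched as follows. The actual paths were elided in the post, so LIBFABRIC_ROOT here is a hypothetical install prefix, and the lib/ and lib/libfabric/ layout is an assumption about how the newer libfabric was built; adjust both to match the real installation.

```shell
# LIBFABRIC_ROOT is a hypothetical prefix, not the path used in the post above.
LIBFABRIC_ROOT=${LIBFABRIC_ROOT:-/path/to/newer/libfabric}

# Assumed layout: libfabric.so in lib/, loadable providers in lib/libfabric/.
export FI_PROVIDER_PATH="$LIBFABRIC_ROOT/lib/libfabric"
# Prepend the newer library; only add a ':' if LD_LIBRARY_PATH was already set.
export LD_LIBRARY_PATH="$LIBFABRIC_ROOT/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Check which libfabric version is now reported.
if command -v fi_info >/dev/null 2>&1; then
    fi_info --version || echo "fi_info failed; check LIBFABRIC_ROOT"
else
    echo "fi_info not found on PATH"
fi
```

If the reported version is still 1.13, the override has not taken effect and the paths should be checked.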

With this override, I can get the programme to run without disabling RMA. I would still appreciate you explaining what exactly the compiler setting MPIR_CVAR_CH4_OFI_ENABLE_RMA=0 does.

Thanks again for a really good implementation of Coarray Fortran - the speed is comparable to MPI and the syntax is obviously much simpler.

as14
Beginner

@ShivaniK_Intel sorry - I've been doing more testing and, while this method removes the bug, it doesn't show any improved performance. Does this mean that newer libfabric versions apply this setting implicitly? This doesn't seem right to me... I think we're still looking at a CAF bug with InfiniBand... I look forward to hearing how you're getting on in figuring this out... Thanks!

ShivaniK_Intel
Moderator

Hi,

 

Thanks for your patience and for providing the information.

 

Could you please let us know whether you are facing a similar issue on an Intel processor?


We can only offer direct support for Intel hardware platforms that the Intel® oneAPI product supports. Intel provides instructions on how to compile oneAPI code for both CPU and a wide range of GPU accelerators.

 

https://intel.github.io/llvm-docs/GetStartedGuide.html

 

Thanks & Regards

Shivani


ShivaniK_Intel
Moderator

Hi,

 

>>"I would still appreciate you explaining what exactly the compiler setting MPIR_CVAR_CH4_OFI_ENABLE_RMA=0 does. "

 

It does not switch off RMA; it switches the internal default transport-specific implementation.

 

The next release of Intel MPI (2021.10) is expected to work well with CAF.

 

Please let us know if you need any further information.

 

Thanks & Regards

Shivani

 

ShivaniK_Intel
Moderator

Hi,


As we did not hear back from you, could you please respond to my previous post?


Thanks & Regards

Shivani


ShivaniK_Intel
Moderator

Hi,


We have not heard back from you. This thread will no longer be monitored by Intel. If you need any further assistance, please post a new question.


Thanks & Regards

Shivani

