Good afternoon,
There is a poorly documented bug in the Intel compiler when Coarray Fortran is used over Infiniband, which requires the following fix: export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0.
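For anyone applying this in a job script, a minimal sketch of how the setting might be toggled and logged is below. `USE_RMA_WORKAROUND` is a hypothetical variable introduced here for illustration only; the one name taken from this thread is `MPIR_CVAR_CH4_OFI_ENABLE_RMA`.

```shell
# Hypothetical toggle (not an Intel variable): 1 applies the workaround, anything else removes it.
: "${USE_RMA_WORKAROUND:=1}"

if [ "$USE_RMA_WORKAROUND" = "1" ]; then
    # The setting quoted in this thread.
    export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0
else
    unset MPIR_CVAR_CH4_OFI_ENABLE_RMA
fi

# Record the effective value in the job log so each run shows which mode it used.
echo "MPIR_CVAR_CH4_OFI_ENABLE_RMA=${MPIR_CVAR_CH4_OFI_ENABLE_RMA:-<unset>}"
```

The export has to happen before the coarray executable is launched, in the same shell or job script, so that the program inherits it.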
The bug and the fix are only mentioned in 2 or 3 places online (e.g. here https://blog.hpc.qmul.ac.uk/intel-release-2020_4.html). When can we expect this bug to be fixed?
More importantly, what does this compiler option actually do? What are the speed implications? Are there alternative fixes that are faster?
Intel officially supports MPI over Infiniband, so this should be fixed for Coarrays too. See here: https://www.intel.com/content/www/us/en/developer/articles/technical/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html
Thanks!
Hi,
Thanks for posting in the Intel forums.
We are working on this issue internally and will get back to you soon.
Thanks & Regards
Shivani
Hi Shivani,
Thanks for the reply. I look forward to your response and will keep an eye on this thread - this is an important issue.
Thanks again!
Hi,
Thank you for your patience.
We are working on it internally and will let you know once the fix is released.
Thanks & Regards
Shivani
Hi,
Thanks for your patience.
As we are in the middle of diagnosing the issue, could you please let us know which version of the Intel compiler you are using?
Thanks & Regards
Shivani
Hi,
As we did not hear back from you, could you please respond to my previous post?
Thanks & Regards
Shivani
Dear @ShivaniK_Intel,
Thank you for getting back to me about this - I appreciate your help!
I am working on a cluster and using:
intel-oneapi-compilers/2022.0.2-gcc-11.2.0-yzi4tsu
Hi,
Could you please try with the latest Intel oneAPI version 2023.1 and MPI version 2021.9 and let us know if you face any issues?
Thanks & Regards
Shivani
Hi @ShivaniK_Intel,
Sorry for the delay - I needed to get my cluster admins to make this version of the compiler available. I have tested with these versions, and the result is worse: I cannot get coarrays to run at all. The bug fix command (export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0) does not seem to avoid the "[n3501-040:3970562:0:3970562] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x31)" error. This is extremely worrying, and a shame, because the Intel implementation of coarrays is excellent while the Infiniband implementation is so buggy. Can you offer any suggestions?
I have attached the log files from the runs, both with and without the bugfix command. You will see that with the command, the images initialise and display their information in sequence (so the "sync all" command is working). However, when they come to transfer information, the error appears.
I look forward to hearing how to fix this - as I say Intel CAF is excellent when it runs.
Thanks.
Hi,
Thanks for providing the details.
As you are still able to reproduce the issue with the latest Intel oneAPI version, could you please provide us with the sample reproducer and the steps you have followed to reproduce the issue?
Thanks & Regards
Shivani
Hi,
Could you also please provide us with your cluster details so that we can investigate further?
Thanks & Regards
Shivani
Dear @ShivaniK_Intel,
Thanks for your help with this. Please find attached the relevant files. You will find information on the cluster in the "cluster.info" file.
The rest of the files show a simplified example which demonstrates the error.
On a SLURM cluster, you simply need to type:
$ ./cluster.compile
$ ./cluster.submit
Just make sure you update the qos and partition in the cluster.slrm file.
You will see my SLURM outputs both when I use the bugfix command (which adds a significant time overhead) and when I don't.
If you need any more information please let me know! I am very much looking forward to getting this issue resolved.
Many thanks!
@ShivaniK_Intel I seem to be making some headway. After loading my intel modules and checking the libfabric version ("fi_info --version"), I see that it is using 1.13. I have found that the code works as expected (without the strange compiler setting) when I manually override this with a newer version of libfabric using:
export FI_PROVIDER_PATH=...
export LD_LIBRARY_PATH=...:$LD_LIBRARY_PATH
It therefore seems that this coarray bug relates to older libfabric versions. Intel MPI 2021.9 includes the libfabric version which was out at the time of its release. This needs to be updated.
With the newer libfabric, I get the programme to run without the need for disabling RMA. I would still appreciate you explaining what exactly the compiler setting MPIR_CVAR_CH4_OFI_ENABLE_RMA=0 does.
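A rough sketch of the override described above, written as a small shell function, might look like the following. The two directories are passed in by the caller, and `NEW_FABRIC_LIB_DIR` and `NEW_FABRIC_PROV_DIR` are hypothetical names for wherever the newer libfabric libraries and providers live on a given system; no particular install layout is assumed.

```shell
# Point the runtime at a different libfabric build.
# $1 = directory holding the libfabric shared library (caller-supplied)
# $2 = directory holding the libfabric providers (caller-supplied)
use_libfabric() {
    lib_dir=$1
    prov_dir=$2
    if [ ! -d "$lib_dir" ] || [ ! -d "$prov_dir" ]; then
        echo "use_libfabric: directory not found: $lib_dir or $prov_dir" >&2
        return 1
    fi
    export FI_PROVIDER_PATH="$prov_dir"
    # Prepend, and avoid a trailing colon when LD_LIBRARY_PATH was empty.
    export LD_LIBRARY_PATH="$lib_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Only act when both hypothetical variables have been set by the user.
if [ -n "${NEW_FABRIC_LIB_DIR:-}" ] && [ -n "${NEW_FABRIC_PROV_DIR:-}" ]; then
    use_libfabric "$NEW_FABRIC_LIB_DIR" "$NEW_FABRIC_PROV_DIR"
    # Re-check which version is now reported, if fi_info is on PATH.
    if command -v fi_info >/dev/null 2>&1; then
        fi_info --version
    fi
fi
```

Running `fi_info --version` before and after the override is a quick way to confirm that the change actually took effect in the job environment.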
Thanks again for a really good implementation of Coarray Fortran - the speed is comparable to MPI and the syntax is obviously much simpler.
@ShivaniK_Intel sorry - I've been doing more testing and actually, while this method removes the bug, it doesn't show any improved performance. Does this mean that in newer libfabric versions this setting is implicitly set? This doesn't seem right to me. I think we're still looking at a CAF bug with Infiniband. I look forward to hearing how you're getting on in figuring this out. Thanks!
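One way to make this kind of comparison repeatable is a small timing helper such as the sketch below. `time_with_rma` is a hypothetical function name, the program's own output is discarded, and whole-second wall-clock timing via `date +%s` is coarse, so it only suits runs lasting well over a second.

```shell
# Run a command with MPIR_CVAR_CH4_OFI_ENABLE_RMA set to $1
# (or removed from the environment when $1 is "unset"),
# then print the exit status and elapsed wall-clock seconds.
time_with_rma() {
    setting=$1
    shift
    status=0
    start=$(date +%s)
    if [ "$setting" = "unset" ]; then
        ( unset MPIR_CVAR_CH4_OFI_ENABLE_RMA; "$@" ) >/dev/null 2>&1 || status=$?
    else
        ( export MPIR_CVAR_CH4_OFI_ENABLE_RMA="$setting"; "$@" ) >/dev/null 2>&1 || status=$?
    fi
    end=$(date +%s)
    echo "RMA=$setting exit=$status seconds=$((end - start))"
}

# Example usage (the launch command is whatever you normally run):
#   time_with_rma unset <launch command>
#   time_with_rma 0     <launch command>
```

Each run happens in a subshell, so the calling shell's environment is left unchanged between measurements.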
Hi,
Thanks for your patience and for providing the information.
Could you please let us know whether you are facing a similar issue on an Intel processor?
We can only offer direct support for Intel hardware platforms that the Intel® oneAPI product supports. Intel provides instructions on how to compile oneAPI code for both CPU and a wide range of GPU accelerators.
https://intel.github.io/llvm-docs/GetStartedGuide.html
Thanks & Regards
Shivani
Hi,
>>"I would still appreciate you explaining what exactly the compiler setting MPIR_CVAR_CH4_OFI_ENABLE_RMA=0 does. "
It does not switch off RMA; it switches the internal default transport-specific implementation.
The next release of Intel MPI (2021.10) should work well for CAF.
Please let us know if you need any further information.
Thanks & Regards
Shivani
Hi,
As we did not hear back from you, could you please respond to my previous post?
Thanks & Regards
Shivani
Hi,
We have not heard back from you. This thread will no longer be monitored by Intel. If you need any further assistance, please post a new question.
Thanks & Regards
Shivani