Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Compiler bug for Coarray Fortran over InfiniBand

as14
Beginner

Good afternoon,

There is a poorly documented bug in the Intel compiler when Coarray Fortran is used over InfiniBand, which requires the following fix: export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0.
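For reference, this is roughly how the fix ends up being applied before launching the coarray binary (a sketch only; the executable name and image count below are placeholders, not taken from a real job script):

# set in the environment of every image before the coarray program starts
export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0
export FOR_COARRAY_NUM_IMAGES=4   # placeholder image count
./my_caf_app                      # placeholder executable name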

The bug and the fix are only mentioned in 2 or 3 places online (e.g. here https://blog.hpc.qmul.ac.uk/intel-release-2020_4.html). When can we expect this bug to be fixed?

More importantly, what does this setting actually do? What are the speed implications? Are there alternative fixes which are faster?

Intel officially supports MPI over InfiniBand, so this should be fixed for coarrays too. See here: https://www.intel.com/content/www/us/en/developer/articles/technical/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html

and here: https://www.intel.com/content/www/us/en/developer/articles/technical/mpi-compatibility-nvidia-mellanox-ofed-infiniband.html

Thanks!

18 Replies
ShivaniK_Intel
Moderator

Hi,


Thanks for posting in the Intel forums.


We are working on this issue internally and will get back to you soon.


Thanks & Regards

Shivani


as14
Beginner

Dear @ShivaniK_Intel ,

 

Thanks for looking into this issue. Have you made any progress?

as14
Beginner

Hi Shivani,

 

Thanks for the reply. I look forward to your response and will keep an eye on this thread - this is an important issue.

 

Thanks again!

ShivaniK_Intel
Moderator

Hi,

 

Thank you for your patience.

 

We are working on it internally and will let you know once the fix is released.

 

Thanks & Regards

Shivani


ShivaniK_Intel
Moderator

Hi,

 

Thanks for your patience.

 

As we are in the middle of diagnosing the issue, could you please let us know which version of the Intel compiler you are using?

 

Thanks & Regards

Shivani

ShivaniK_Intel
Moderator

Hi,


As we did not hear back from you, could you please respond to my previous post?


Thanks & Regards

Shivani


as14
Beginner

Dear @ShivaniK_Intel,

 

Thank you for getting back to me about this - I appreciate your help! 

 

I am working on a cluster and using:


intel-oneapi-compilers/2022.0.2-gcc-11.2.0-yzi4tsu

intel-oneapi-mpi/2021.4.0-gcc-11.2.0-2e7zm7z
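In case it is useful, this is roughly how I load them and confirm the versions (a sketch assuming an environment-modules setup; the module names are the ones listed above):

module load intel-oneapi-compilers/2022.0.2-gcc-11.2.0-yzi4tsu
module load intel-oneapi-mpi/2021.4.0-gcc-11.2.0-2e7zm7z
ifort --version     # reports the Fortran compiler version
mpirun --version    # reports the Intel MPI library version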
 
 
Thanks again!
ShivaniK_Intel
Moderator

Hi,


Could you please try with the latest Intel oneAPI version 2023.1 and MPI version 2021.9 and let us know if you face any issues?


Thanks & Regards

Shivani


as14
Beginner

Hi @ShivaniK_Intel,

Sorry for the delay - I needed to get my cluster admins to make this version of the compiler available. I have now tested with these versions, and the result is worse: I cannot get coarrays to run at all. Even the bug fix command (export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0) does not avoid the "[n3501-040:3970562:0:3970562] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x31)" error. This is extremely worrying (and a real shame, because the Intel implementation of coarrays is excellent; it is just the InfiniBand side that is so buggy). Can you offer any suggestions?

I have attached the log files from the runs, both with and without the bugfix command. You will see that with the command, the images initialise and display their information in sequence (so the "sync all" command is working). However, when they come to transfer data between images, the error appears.

I look forward to hearing how to fix this - as I say, Intel CAF is excellent when it runs.

Thanks.

ShivaniK_Intel
Moderator

Hi,

 

Thanks for providing the details.

 

As you are still able to reproduce the issue with the latest Intel oneAPI version, could you please provide us with the sample reproducer and the steps you have followed to reproduce the issue?

 

Thanks & Regards

Shivani

 

ShivaniK_Intel
Moderator

Hi,


Could you also please provide us with your cluster details so that we can investigate further?


Thanks & Regards

Shivani


as14
Beginner

Dear @ShivaniK_Intel,

 

Thanks for your help with this. Please find attached the relevant files. You will find information on the cluster in the "cluster.info" file. 

 

The rest of the files show a simplified example which demonstrates the error.

 

You simply need to run the following on the SLURM cluster:

$ ./cluster.compile
$ ./cluster.submit

Just make sure you update the qos and partition in the cluster.slrm file.
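In outline, the job script is structured like this (a simplified sketch with placeholder values, not the exact contents of the attached cluster.slrm):

#!/bin/bash
#SBATCH --partition=<your_partition>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# the bugfix command discussed above; comment out to reproduce the crash
export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0

./caf_test   # placeholder for the coarray executable (with Intel distributed coarrays the binary launches its own images)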

You will see my SLURM outputs both when I use and when I do not use the bugfix command (which adds a significant time overhead).

 

If you need any more information please let me know! I am very much looking forward to getting this issue resolved.

 

Many thanks!

as14
Beginner

@ShivaniK_Intel I seem to be making some headway. After loading my Intel modules and checking the libfabric version ("fi_info --version"), I see that it is using 1.13. I have found that the code works as expected (without the strange MPIR_CVAR setting) when I manually override this with a newer version of libfabric using:

export FI_PROVIDER_PATH=...

export LD_LIBRARY_PATH=...:$LD_LIBRARY_PATH

It therefore seems that this coarray bug relates to older libfabric versions. Intel MPI 2021.9 bundles the libfabric version that was current at the time of its release; this needs to be updated.
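In outline the override looks like this (the path /path/to/libfabric-1.18 is just a placeholder for wherever the newer libfabric build lives on your system):

fi_info --version     # shows the libfabric currently being picked up (1.13 in my case)
export FI_PROVIDER_PATH=/path/to/libfabric-1.18/lib/libfabric
export LD_LIBRARY_PATH=/path/to/libfabric-1.18/lib:$LD_LIBRARY_PATH
# with Intel MPI it may also be necessary to set I_MPI_OFI_LIBRARY_INTERNAL=0 so the external libfabric is preferred
fi_info --version     # should now report the newer version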

I can now get the program to run without needing to disable RMA. I would still appreciate an explanation of what exactly the setting MPIR_CVAR_CH4_OFI_ENABLE_RMA=0 does.

Thanks again for a really good implementation of Coarray Fortran - the speed is comparable to MPI and the syntax is obviously much simpler.

as14
Beginner

@ShivaniK_Intel sorry - I've been doing more testing and actually, while this method removes the bug, it doesn't show any improved performance. Does this mean that in newer libfabric versions this setting is implicitly applied? That doesn't seem right to me... I think we're still looking at a CAF bug over InfiniBand. I look forward to hearing how you're getting on in figuring this out. Thanks!

ShivaniK_Intel
Moderator

Hi,

 

Thanks for your patience and for providing the information.

 

Could you please let us know whether you are facing a similar issue on an Intel processor?


We can only offer direct support for Intel hardware platforms that the Intel® oneAPI product supports. Intel provides instructions on how to compile oneAPI code for both CPUs and a wide range of GPU accelerators.

 

https://intel.github.io/llvm-docs/GetStartedGuide.html

 

Thanks & Regards

Shivani


ShivaniK_Intel
Moderator

Hi,

 

>>"I would still appreciate you explaining what exactly the compiler setting MPIR_CVAR_CH4_OFI_ENABLE_RMA=0 does. "

 

It does not switch off RMA; it switches the library away from its internal default transport-specific RMA implementation to a fallback path.
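To see the speed implication on your own code, the two paths can be timed back to back, for example (a sketch; the executable name is a placeholder):

# default behaviour (transport-specific RMA implementation)
unset MPIR_CVAR_CH4_OFI_ENABLE_RMA
time ./my_caf_app

# fallback implementation selected by the workaround
export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0
time ./my_caf_app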

 

The next release of Intel MPI (2021.10) should work well for CAF.

 

Please let us know if you need any further information.

 

Thanks & Regards

Shivani

 

ShivaniK_Intel
Moderator

Hi,


As we did not hear back from you, could you please respond to my previous post?


Thanks & Regards

Shivani


ShivaniK_Intel
Moderator

Hi,


We have not heard back from you, so this thread will no longer be monitored by Intel. If you need any further assistance, please post a new question.


Thanks & Regards

Shivani

