I had an MPI application that was crashing when running with the OFI Intel fabric, but which was working fine with TCP or DAPL fabric. After some investigation, I realized that the problem was that my application had its stack marked with the executable bit, which is not supported by the Intel MPI launcher with OFI.
I can reproduce this behavior with the following code:
module caller1_m
  implicit none
contains
  subroutine caller(callee)
    external :: callee
    call callee()
  end subroutine caller
end module caller1_m
!===============================================================================
program test1
  use caller1_m
  implicit none
  call caller(callee)
contains
  subroutine callee
    print *, 'Greetings from "callee"!'
  end subroutine callee
end program test1
One possible workaround is to get rid of the contained subroutine by putting everything in one big module (see test_v2.f90 in this repository). Still, is this the expected behavior of the Fortran compiler? If so, should one generally avoid passing contained subroutines as actual arguments to other subroutines?
How are you detecting this? I know that, on Windows at least, the compiler/runtime had to change the way it created "thunks" for passing internal procedures so that the memory for the mini-routine was allocated on the heap and not the stack. I don't know how it worked on Linux.
For more information on "thunks", see my post Doctor Fortran in "Think, Thank, Thunk".
Thanks for your quick reply and for the reference on "thunks". I will read it carefully later this week.
Regarding your question, I checked whether the ELF binary has an executable stack with the "execstack -q" command. For instance, for the example above I get:
execstack -q test_v1.x
X test_v1.x
In addition, one can also check if the object file contains an executable stack with the "readelf -SW" command. For instance:
readelf -SW test_v1.o | grep GNU-stack
[ 7] .note.GNU-stack NOTE 0000000000000000 000314 000000 00 X 0 0 1
The "X" executable attribute vanishes in both cases if I put the contained subroutine in a separate module.
I am going to recommend that you file a ticket on this with Intel support. The people who understand how this works don't follow the forum. I am uncomfortable with the notion that the execute protection is being removed from the stack.
This issue will occur only when the routine you pass is an "internal procedure", contained in another procedure (or the main program). Passing an "external procedure" (including a module procedure) doesn't need a thunk (because no local variable context is needed.)
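To make the distinction concrete, here is a minimal sketch of the module-procedure variant of the reproducer. It assumes the workaround looks roughly like the test_v2.f90 mentioned earlier, which is not shown in this thread; the names caller2_m, test2 and proc are hypothetical. Because "callee" is now a module procedure rather than an internal one, it has no host variables to capture, so no thunk should be needed when it is passed as an actual argument.

```fortran
module caller2_m
  implicit none
contains
  ! Takes a dummy procedure with an implicit interface and calls it.
  subroutine caller(proc)
    external :: proc
    call proc()
  end subroutine caller

  ! Module procedure (not internal to a program or procedure):
  ! it has no enclosing local-variable context to carry along.
  subroutine callee()
    print *, 'Greetings from "callee"!'
  end subroutine callee
end module caller2_m

program test2
  use caller2_m
  implicit none
  ! The actual argument is a module procedure, not an internal one.
  call caller(callee)
end program test2
```

After building, the same "execstack -q" and "readelf -SW" checks from earlier in the thread can be run on the resulting object file and executable. Based on what was reported above, the "X" flag would be expected to be absent, though that should be confirmed with your own compiler version.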