OpenMP crash assigning array from function call

Thomas_O_ · ‎12-17-2015

Hi,

after version 16.0.1 fixed a bug with assignment to zero-sized arrays (intentionally or not ;-), my compiler-killing Fortran 95 TR4 code has found yet another one. The attached program crashes like this:

   thead 0 iteration 1
   thead 1 iteration 51
   thread 0 allocating
   thread 1 allocating
   thread 0 assigning
   thread 1 assigning
   forrtl: severe (174): SIGSEGV, segmentation fault occurred
   Image PC Routine Line Source
   a.out 000000000047BB15 Unknown Unknown Unknown

Backtraces are not useful. Running the resulting binary in valgrind shows many errors, and it is unclear which ones follow from an initial corruption (several invalid writes of size 8). A simple -openmp as compiler flag is enough; it also happens with -openmp -O0. You need at least two threads to trigger it.

I hope the example code makes it sufficiently trivial to get to the underlying race condition. There are some hints in the comments. It is interesting that the crash only occurs when assigning via a function call that returns the array. Replacing that with a subroutine call fixes the issue.

Judging from the astonishment of a colleague that I would actually think of doing something as nefarious as returning arrays from functions, I guess that serves me right for thinking that one can read a book about standard post-77 Fortran and then write code that works in the real world ;-) Well, that book did not mention OpenMP at all.

I suppose this is rather agnostic to the runtime besides the Intel Compiler, but this is my platform:

- Haswell Xeons
- CentOS 7.1 x86-64 (linux 3.10, glibc 2.17)
- Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 16.0.1.150 Build 20151021 (crash also present in version 15)

Other compilers (GNU, PGI) have no issue with the code.

PS: I tried to submit this to Premier Support, as our site does have a proper license and contract for that, but the web site prevented me from doing so. I have already sent a report about that to our Intel representatives.
PPS: I need to enable multiple levels of off-site JavaScript to be able to upload files?! What happened to simple HTML forms? Why not even a simple fallback?

John_D_6 · ‎12-17-2015

Hello Thomas

it looks like the array that is returned is larger than the thread-local stack size. Note that this is compiler-dependent, so that could explain the differences with other compilers. Just test it with a larger stack size, e.g.:

export OMP_STACKSIZE=1G

Cheers,
John

Thomas_O_ · ‎12-17-2015

Hi John, I see your point. Indeed, with raised stack size, this does not crash. My surprise here is actually that the runtime decides to do an explicit allocation on the stack! I never imagined that happening. The GNU compiler for sure does what I expected:

   allocate(perplexion(problem_size))
...
  401023:       bf 20 17 50 00          mov    $0x501720,%edi
  401028:       e8 63 fc ff ff          callq  400c90 <malloc@plt>
  40102d:       48 89 45 c0             mov    %rax,-0x40(%rbp)
  401031:       48 8b 45 c0             mov    -0x40(%rbp),%rax

A plain call to malloc(), to get heap memory. The Intel compiler does something more elaborate:

  allocate(perplexion(problem_size))
...
  403644:       48 89 d6                mov    %rdx,%rsi
  403647:       89 ca                   mov    %ecx,%edx
  403649:       e8 32 69 00 00          callq  409f80 <for_alloc_allocatable>
  40364e:       48 89 85 d0 fe ff ff    mov    %rax,-0x130(%rbp)

So, for_alloc_allocatable() is a supposedly smart function that decides to put things on the stack? I know that you need a rather unlimited stack for Fortran programs utilizing non-allocatable arrays. But this optimization to put allocatable arrays there is rather dangerous since apparently the default value for OMP_STACKSIZE is … well, what? My ulimit for stack size is unlimited (max locked memory limited to 6 GB, though), but the per-thread stack apparently has some limit besides that, defined by the Intel Fortran runtime.

The whole reason for these per-thread allocations and associations is to get memory local to the thread. If I do the allocation in the main thread, where the unlimited stack applies and so it shouldn't matter if it's heap or stack memory, will the first thread assigning to the memory still get it placed locally on its NUMA socket?

But even without that (talking about a 10-20 % performance hit because of cross-socket memory access), I am disturbed by the OpenMP stack limit not being unlimited if the process ulimit for the stack is unlimited. Is this supposed to be the case? Then I need to treat OpenMP programming with even more paranoid care than before (and it stopped being fun very quickly a long time ago anyway).

Well, John, thanks for your to-the-point suggestion. But somehow that raises more questions for me ...

PS: Note that the original code did not deallocate in the same routine; the memory was long-lived, of course. In my example, it looks as if an automatic variable on the stack, local to the routine, was meant anyway.
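On the NUMA question above, a minimal sketch of the usual "first touch" idiom may help frame it. It assumes the default Linux page placement policy (a page lands on the NUMA node of the thread that first writes to it) and threads that stay pinned to their cores; neither assumption is confirmed anywhere in this thread. The program name and loop bodies are made up for illustration; only perplexion and problem_size come from the example.

```fortran
program first_touch_sketch
  implicit none
  integer, parameter :: problem_size = 1000000
  real, allocatable :: perplexion(:)
  integer :: i

  ! Allocated once by the initial thread. For a large block the pages
  ! are typically not physically placed until they are first written.
  allocate(perplexion(problem_size))

  ! First touch: each thread initializes the chunk it will work on later.
  !$omp parallel do schedule(static)
  do i = 1, problem_size
    perplexion(i) = 0.0
  end do
  !$omp end parallel do

  ! Later work uses the same static schedule, so every thread revisits
  ! the chunk it touched first.
  !$omp parallel do schedule(static)
  do i = 1, problem_size
    perplexion(i) = perplexion(i) + real(i)
  end do
  !$omp end parallel do

  print *, perplexion(1), perplexion(problem_size)
  deallocate(perplexion)
end program first_touch_sketch
```

Without an OpenMP flag the directives are plain comments and the program still runs serially, which makes it easy to compare both builds.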

Steven_L_Intel1 · ‎12-17-2015

for_alloc_allocatable always allocates to the heap. Stack allocation is done with inline code. ALLOCATEs are always done on the heap. But there may be code that causes a temporary array to be copied to the stack.

jimdempseyatthecove · ‎12-17-2015

The problem (IMHO) is that the compiler is generating a temporary array for a function result. I think this is a byproduct of having reallocate left hand side enabled and the compiler not knowing in advance whether the recipient of the function call will (or will not) require a reallocation. If you need to ensure (when possible) that reallocation (in this case, allocation of the temp) is not performed, then use the subroutine call interface.
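A minimal sketch of the two call styles, for comparison. The attached program is not reproduced in this thread, so the argument list of fake_1d_fun, the module name and the subroutine fake_1d_sub are assumptions made up for illustration.

```fortran
module fake_mod
  implicit none
contains
  ! Function form: the result is an explicit-shape array of size n.
  ! The caller has to provide space for the whole result first.
  function fake_1d_fun(n) result(res)
    integer, intent(in) :: n
    real :: res(n)
    integer :: i
    do i = 1, n
      res(i) = real(i)
    end do
  end function fake_1d_fun

  ! Subroutine form (hypothetical name): writes directly into the
  ! array passed by the caller, so no result temporary is needed.
  subroutine fake_1d_sub(n, res)
    integer, intent(in) :: n
    real, intent(out) :: res(n)
    integer :: i
    do i = 1, n
      res(i) = real(i)
    end do
  end subroutine fake_1d_sub
end module fake_mod

program call_styles
  use fake_mod
  implicit none
  integer, parameter :: problem_size = 1000
  real, allocatable :: perplexion(:)

  allocate(perplexion(problem_size))
  perplexion = fake_1d_fun(problem_size)      ! may go through a temporary
  call fake_1d_sub(problem_size, perplexion)  ! fills perplexion in place
  print *, perplexion(1), perplexion(problem_size)
  deallocate(perplexion)
end program call_styles
```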

Jim Dempsey

Steven_L_Intel1 · ‎12-17-2015

LHS reallocation is not involved here. In Fortran, there's no "peeking" at the LHS in order to change behavior. Function fake_1d_fun returns an ordinary array result - this will go on the stack by default. If you compile with -heap-arrays it will go on the heap. This is what gfortran does, which is why it works there by default. You could also make the return value ALLOCATABLE and allocate it before assigning, then that will get deallocated after the return.
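A rough sketch of the ALLOCATABLE variant, assuming a function shaped like the one in the example. The name fake_1d_alloc, the module name and the argument list are hypothetical, since the attached source is not shown here.

```fortran
module alloc_result_mod
  implicit none
contains
  ! The result variable is ALLOCATABLE, so its storage is obtained with
  ! an explicit ALLOCATE rather than as an ordinary array result.
  function fake_1d_alloc(n) result(res)
    integer, intent(in) :: n
    real, allocatable :: res(:)
    integer :: i
    allocate(res(n))
    do i = 1, n
      res(i) = real(i)
    end do
  end function fake_1d_alloc
end module alloc_result_mod

program alloc_result_demo
  use alloc_result_mod
  implicit none
  integer, parameter :: problem_size = 1000
  real, allocatable :: perplexion(:)

  allocate(perplexion(problem_size))
  ! The function result is deallocated automatically once this
  ! statement has finished.
  perplexion = fake_1d_alloc(problem_size)
  print *, size(perplexion), perplexion(problem_size)
  deallocate(perplexion)
end program alloc_result_demo
```

This changes only where the result lives; the copy into perplexion still takes place.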

jimdempseyatthecove · ‎12-17-2015

>>there's no "peeking" at the LHS in order to change behavior

The function needs no stinkin' "peeking" at runtime: the compiler will know what the caller requires at build time and can call the appropriate entry point (and build the function with multiple variants/entry points). The compiler already does this for vector-enabled functions/subroutines, so I see no reason why it cannot do it in this case. I would think it would be beneficial to eliminate unnecessary temporaries (and copies).

Jim Dempsey

Thomas_O_ · ‎12-18-2015

So, could you clear up my confusion about how function return values work? I just assumed that, since Fortran works with references instead of copies in subroutine arguments (?!), Fortran functions also naturally get a reference for the return value from the caller, who is responsible for allocating appropriate memory. And when I assign to a variable, the memory to use is right there. In my world of Fortran modules, the caller always has a function interface and knows what shape of output is coming.

This also fits with my past experience of abysmal performance of gfortran on Solaris, where it used malloc() for various unnecessary temporary variables in mathematical expressions involving Fortran functions.

So, is it the calling code preparing space on the stack before calling the function, just to copy that right into a heap-allocated array shortly afterwards?

Mind that I do not have much experience with FORTRAN 77 or earlier. I learned Fortran 90 from the book, after programming mostly in C-like and various scripting languages. So I really do tend to jump to strange conclusions, regarding Fortran 90 as a designed language, not as a thin layer of paint over decades of FORTRAN.

Steve:

You could also make the return value ALLOCATABLE and allocate it before assigning, then that will get deallocated after the return.

You mean explicitly allocating the temporary copy on the heap that way? And this is equivalent to using ifort -heap-arrays? This still strikes me as surprisingly inefficient. Is this a C compatibility thing, to keep a similar ABI to C functions that don't know so much about referencing Fortran arrays?

The point I take away from this is that I should stop using functions for anything but scalar return values. You don't want those temporary copies, neither on the stack nor on the heap. Another hint is that my colleague with F77 experience didn't even know that you can return arrays, because the old syntax for functions didn't even make that possible(?).

Too often I see Fortran 90+ as the mathematical/somewhat-functional language it could be, but in reality isn't. Compilers are trained on the code that exists, and that code is mostly not written by me.

jimdempseyatthecove · ‎12-18-2015

Maybe Steve can answer this.

Assume the called function uses either heap arrays or allocatable for the result (temporary). In both cases the function code is going to clean up (delete) the temporaries in the process of returning. Therefore, the data referenced in the copy operation in the caller's code will reside in freed memory (unless there is a "delete after copy" flag in the array descriptor). While this may not be an issue in a single-threaded program, it will be an issue in a multi-threaded program.

Jim Dempsey

Steven_L_Intel1 · ‎12-18-2015

I corrected my post #6 - we don't have current plans to make -heap-arrays the default, though it is under consideration. I was thinking of -assume realloc_lhs, which we do plan to make the default in a future major version.

To Jim's question in #9 - the "space" for function return values is always created by the caller. This may be the actual value or a descriptor. If the return value is ALLOCATABLE or was created with heap-arrays, it gets automatically deallocated at the end of the statement containing the function reference. In a multi-threaded program, this space is in the thread-local context of the caller like any other local variable. The function itself doesn't do this cleanup. For functions that return non-allocatable variable-sized results, an explicit interface is always required and this gives the caller the information it needs to create the appropriate sized space for the return value.

Thomas_O_ · ‎12-18-2015

So, in my example, there really is stack allocation for the function result in the calling function, then assignment to the explicitly allocated variable. As I had the original crash with high optimization settings, too, I have to presume that the optimizer does not see this (for me ;-) obviously unnecessary work. Although the rationale might be that working on the stack memory makes the function faster.

Is there actually a way to write the code using the array returned by a function without implying a temporary copy (apart from rephrasing it with subroutines)?

Edit: Reading up on realloc_lhs, I realize there is yet another potential performance hit from the compiler having to check whether the LHS needs reallocation. So, I guess I should go around and add (:,:,:,:) to various LHS variables to prevent that.

Also: Shouldn't the default per-thread stack limit somehow take into account the fact that I set unlimited stack via ulimit?
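To illustrate the (:) idea from the edit above, here is a small sketch using a rank-1 array in place of the four-dimensional ones. The program and variable names are made up. The point is only that automatic reallocation under Fortran 2003 semantics applies to a whole allocatable variable on the left-hand side, not to an array section.

```fortran
program lhs_section_sketch
  implicit none
  real, allocatable :: a(:)

  allocate(a(4))

  ! Whole allocatable array on the LHS: under Fortran 2003 semantics
  ! the compiler must be prepared to reallocate a on a shape mismatch.
  a = [1.0, 2.0, 3.0, 4.0]

  ! Array section on the LHS: never reallocated, so the shapes on both
  ! sides must already agree.
  a(:) = [5.0, 6.0, 7.0, 8.0]

  print *, a
  deallocate(a)
end program lhs_section_sketch
```

Whether the saved check is measurable depends on the compiler and on how hot the assignment is; treat it as something to benchmark rather than a guaranteed gain.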

Steven_L_Intel1 · ‎12-18-2015

If you want to simply avoid use of the stack here, compile with -heap-arrays. You could also make the return value ALLOCATABLE.

There's no such thing as unlimited stack - ulimit simply sets the stacksize to the maximum defined in the kernel build. But each thread gets a chunk of that, defined by the KMP_STACKSIZE environment variable.
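As a small aid when debugging this kind of crash, the sketch below prints whichever of the two stack size variables mentioned in this thread is set in the environment of the process. It only reads the environment with the standard get_environment_variable intrinsic; it does not query the OpenMP runtime, and the program and subroutine names are made up.

```fortran
program show_stacksize
  implicit none

  call show('OMP_STACKSIZE')
  call show('KMP_STACKSIZE')

contains

  ! Print the value of one environment variable, or note that it is unset.
  subroutine show(name)
    character(len=*), intent(in) :: name
    character(len=64) :: val
    integer :: stat

    call get_environment_variable(name, val, status=stat)
    if (stat == 1) then
      ! Status 1 means the variable does not exist.
      print '(2a)', name, ' is not set (runtime default applies)'
    else
      print '(3a)', name, ' = ', trim(val)
    end if
  end subroutine show
end program show_stacksize
```

Running it once with and once without export OMP_STACKSIZE=1G in the shell shows whether the setting actually reaches the process.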

I tend to doubt that even -Qipo is aggressive enough to eliminate the copy in this instance.

jimdempseyatthecove · ‎12-18-2015

>>To Jim's question in #9 - the "space" for function return values is always created by the caller.

Then why does (did) it create (allocate if heap arrays) a temporary, and then copy (and delete if heap arrays) when the target of the = is known at the time of the call?

Jim Dempsey

Thomas_O_ · ‎12-18-2015

Thanks for the clarification. Good to know that a copy of the function result seems unavoidable. About the stack size: It looks to me as if ulimit -s unlimited indeed practically removes the stack size limit on my platform. Of course, the next one is the limit on used memory, but the stack itself seems limited by

16:58|node001:stress$ grep STK_LIM_MAX /usr/include/asm-generic/resource.h 
#ifndef _STK_LIM_MAX
# define _STK_LIM_MAX           RLIM_INFINITY
17:00|node001:stress$ grep RLIM_INFINITY  /usr/include/asm-generic/resource.h
#ifndef RLIM_INFINITY
# define RLIM_INFINITY          (~0UL)
# define _STK_LIM_MAX           RLIM_INFINITY

~0UL is a bit larger than the physical memory in this box ;-) The default limit is 8 MiB. If the Intel runtime just divided _STK_LIM_MAX by the number of threads, that would still leave plenty. Am I correct in assuming that the current Intel compiler uses a fixed default value for KMP_STACKSIZE/OMP_STACKSIZE? I found a value of 4 MiB in some random documentation on the net.

Steven_L_Intel1 · ‎12-18-2015

That's the language semantics - the RHS is evaluated entirely before the LHS is modified.

jimdempseyatthecove · ‎12-18-2015

>>That's the language semantics - the RHS is evaluated entirely before the LHS is modified

Forgot about that (shame on me).

The LHS potentially could be one of the arguments passed into the function. In this situation (with in situ modification) INTENT(IN) would not protect against modifying the input argument indirectly via the result (assuming it were in situ as requested earlier). Note, this still would not preclude the compiler from optimizing out the allocate/copy/deallocate in the case where it can be determined that there is no alias (LHS not in the argument list, and LHS is a local variable and thus known not to be aliased via a dummy argument in the caller).
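A small sketch of the aliasing case, with a made-up function reversed and made-up module and program names. Here the LHS is also the actual argument, which is exactly where writing the result in situ would go wrong.

```fortran
module rev_mod
  implicit none
contains
  ! Returns the elements of x in reverse order.
  function reversed(x) result(res)
    real, intent(in) :: x(:)
    real :: res(size(x))
    integer :: i
    do i = 1, size(x)
      res(i) = x(size(x) - i + 1)
    end do
  end function reversed
end module rev_mod

program alias_demo
  use rev_mod
  implicit none
  real :: a(4)

  a = [1.0, 2.0, 3.0, 4.0]
  ! The RHS is evaluated completely before a is modified,
  ! so a ends up as 4, 3, 2, 1.
  a = reversed(a)
  print *, a
end program alias_demo
```

If res were simply the storage of a, the loop would read elements it had already overwritten and produce 4, 3, 3, 4 instead. The compiler could only skip the temporary after proving that no such overlap exists.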

Jim Dempsey