- Tags:
- Intel® Fortran Compiler
Hello Thomas
it looks like the array that is returned is larger than the thread-local stack size. Note that this is compiler-dependent, so that could explain the differences with other compilers. Just test it with a larger stack size, e.g.:
export OMP_STACKSIZE=1G
Cheers,
John
allocate(perplexion(problem_size))
...
401023: bf 20 17 50 00          mov $0x501720,%edi
401028: e8 63 fc ff ff          callq 400c90 <malloc@plt>
40102d: 48 89 45 c0             mov %rax,-0x40(%rbp)
401031: 48 8b 45 c0             mov -0x40(%rbp),%rax

A plain call to malloc(), to get heap memory. The Intel compiler does something more elaborate:

allocate(perplexion(problem_size))
...
403644: 48 89 d6                mov %rdx,%rsi
403647: 89 ca                   mov %ecx,%edx
403649: e8 32 69 00 00          callq 409f80 <for_alloc_allocatable>
40364e: 48 89 85 d0 fe ff ff    mov %rax,-0x130(%rbp)

So, for_alloc_allocatable() is a supposedly smart function that decides to put things on the stack? I know that you need a rather unlimited stack for Fortran programs utilizing non-allocatable arrays. But this optimization to put allocatable arrays there is rather dangerous since apparently the default value for OMP_STACKSIZE is … well, what? My ulimit for stack size is unlimited (max locked memory limited to 6 GB, though), but the per-thread stack apparently has some limit besides that, defined by the Intel Fortran runtime.

The whole reason for these per-thread allocations and associations is to get memory local to the thread. If I do the allocation in the main thread, where the unlimited stack applies and so it shouldn't matter if it's heap or stack memory, will the first thread assigning to the memory still get it placed locally on its NUMA socket?

But even without that (talking about a 10-20 % performance hit because of cross-socket memory access), I am disturbed by the OpenMP stack limit not being unlimited if the process ulimit for the stack is unlimited. Is this supposed to be the case? Then I need to treat OpenMP programming with even more paranoid care than before (and it stopped being fun very quickly a long time ago anyway).

Well, John, thanks for your to-the-point suggestion. But somehow that raises more questions for me ...

PS: Note that the original code did not deallocate in the same routine, the memory was long-lived, of course. In my example, it looks like an automatic variable on the stack, local to the routine, was meant anyway.
for_alloc_allocatable always allocates to the heap. Stack allocation is done with inline code. ALLOCATEs are always done on the heap. But there may be code that causes a temporary array to be copied to the stack.
The problem (IMHO) is that the compiler is generating a temporary array for the function result. I think this is a byproduct of having reallocate-left-hand-side enabled and the compiler not knowing in advance whether the recipient of the function result will (or will not) require a reallocation. If you need to ensure (when possible) that no reallocation (in this case, allocation of the temporary) is performed, then use a subroutine call interface instead.
Jim Dempsey
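A minimal sketch of the subroutine interface suggested above. The original routine is not shown in this thread, so the name fake_1d_sub, its argument list, and the values it writes are assumptions made purely for illustration; only perplexion and problem_size are taken from the earlier posts. The point is that the subroutine writes directly into the caller's array, so no array-valued function result has to be materialized and copied.

```fortran
! Hypothetical subroutine counterpart to an array-valued function.
! Names and contents are illustrative assumptions, not the original code.
module fill_variants
  implicit none
contains
  subroutine fake_1d_sub(res)
    ! Assumed-shape dummy: the routine works on the caller's storage.
    real, intent(out) :: res(:)
    integer :: i
    do i = 1, size(res)
      res(i) = real(i)
    end do
  end subroutine fake_1d_sub
end module fill_variants

program demo_sub
  use fill_variants
  implicit none
  integer, parameter :: problem_size = 1000
  real, allocatable :: perplexion(:)

  allocate(perplexion(problem_size))   ! heap allocation in the caller
  call fake_1d_sub(perplexion)         ! filled in place, no function result
  print *, sum(perplexion)
  deallocate(perplexion)
end program demo_sub
```

The trade-off is purely one of style: the call can no longer appear inside an expression.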
LHS reallocation is not involved here. In Fortran, there's no "peeking" at the LHS in order to change behavior. Function fake_1d_fun returns an ordinary array result - this will go on the stack by default. If you compile with -heap-arrays it will go on the heap. This is what gfortran does, which is why it works there by default. You could also make the return value ALLOCATABLE and allocate it before assigning, then that will get deallocated after the return.
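The two variants described here can be sketched as follows. The real fake_1d_fun is not shown in the thread, so the signatures and bodies below are assumptions for illustration only; fake_1d_fun_alloc is a hypothetical name. The first function returns an ordinary array result, which per the post above goes on the stack with ifort unless -heap-arrays is used; the second returns an ALLOCATABLE result that is explicitly allocated inside the function.

```fortran
! Illustrative sketch; function bodies and signatures are assumptions.
module result_variants
  implicit none
contains
  ! Ordinary (explicit-shape) array result.
  function fake_1d_fun(n) result(res)
    integer, intent(in) :: n
    real :: res(n)
    integer :: i
    do i = 1, n
      res(i) = real(i)
    end do
  end function fake_1d_fun

  ! ALLOCATABLE array result, allocated before it is assigned.
  function fake_1d_fun_alloc(n) result(res)
    integer, intent(in) :: n
    real, allocatable :: res(:)
    integer :: i
    allocate(res(n))
    do i = 1, n
      res(i) = real(i)
    end do
  end function fake_1d_fun_alloc
end module result_variants

program demo_results
  use result_variants
  implicit none
  integer, parameter :: problem_size = 1000
  real, allocatable :: perplexion(:)

  allocate(perplexion(problem_size))
  perplexion = fake_1d_fun(problem_size)        ! ordinary array result
  print *, sum(perplexion)
  perplexion = fake_1d_fun_alloc(problem_size)  ! allocatable result
  print *, sum(perplexion)
end program demo_results
```

With a small problem_size both variants behave the same; the difference only shows up once the result no longer fits into the (per-thread) stack.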
>>there's no "peeking" at the LHS in order to change behavior
The function needs no stinkin' "peeking" at runtime; the compiler knows at build time what the caller requires and can call the appropriate entry point (building the function with multiple variants/entry points). The compiler already does this for vector-enabled functions/subroutines, and I see no reason why it cannot do so in this case. I would think it beneficial to eliminate unnecessary temporaries (and copies).
Jim Dempsey
>>You could also make the return value ALLOCATABLE and allocate it before assigning, then that will get deallocated after the return.
You mean explicitly allocating the temporary copy on the heap that way? And this is equivalent to using ifort -heap-arrays? This still strikes me as surprisingly inefficient. Is this a C compatibility thing, to keep an ABI similar to C functions that don't know so much about referencing Fortran arrays?
The point I take away from this is that I should stop using functions for anything but scalar return values. You don't want those temporary copies, neither on the stack nor on the heap. Another hint is that my colleague with F77 experience didn't even know that you can return arrays, because the old syntax for functions didn't even make that possible(?).
Too often I see Fortran90+ as the mathematical/somewhat-functional language it could be, but in reality isn't. Compilers are trained on the code that exists, and that code is mostly not written by me.
Maybe Steve can answer this.
Assume the called function uses either heap arrays or an allocatable for the result (the temporary). In both cases the function code is going to clean up (delete) the temporaries in the process of returning. Therefore, the data referenced by the copy operation in the caller's code will reside in freed memory (unless there is a "delete after copy" flag in the array descriptor). While this may not be an issue in a single-threaded program, it will be an issue in a multi-threaded program.
Jim Dempsey
I corrected my post #6 - we don't have current plans to make -heap-arrays the default, though it is under consideration. I was thinking of -assume realloc_lhs, which we do plan to make the default in a future major version.
To Jim's question in #9 - the "space" for function return values is always created by the caller. This may be the actual value or a descriptor. If the return value is ALLOCATABLE or was created with heap-arrays, it gets automatically deallocated at the end of the statement containing the function reference. In a multi-threaded program, this space is in the thread-local context of the caller like any other local variable. The function itself doesn't do this cleanup. For functions that return non-allocatable variable-sized results, an explicit interface is always required and this gives the caller the information it needs to create the appropriate sized space for the return value.
If you want to simply avoid use of the stack here, compile with -heap-arrays. You could also make the return value ALLOCATABLE.
There's no such thing as unlimited stack - ulimit simply sets the stacksize to the maximum defined in the kernel build. But each thread gets a chunk of that, defined by the KMP_STACKSIZE environment variable.
I tend to doubt that even -Qipo is aggressive enough to eliminate the copy in this instance.
>>To Jim's question in #9 - the "space" for function return values is always created by the caller.
Then why does (did) it create (allocate if heap arrays) a temporary, and then copy (and delete if heap arrays) when the target of the = is known at the time of the call?
Jim Dempsey
16:58|node001:stress$ grep STK_LIM_MAX /usr/include/asm-generic/resource.h
#ifndef _STK_LIM_MAX
# define _STK_LIM_MAX RLIM_INFINITY
17:00|node001:stress$ grep RLIM_INFINITY /usr/include/asm-generic/resource.h
#ifndef RLIM_INFINITY
# define RLIM_INFINITY (~0UL)
# define _STK_LIM_MAX RLIM_INFINITY

~0UL is a bit larger than the physical memory in this box ;-) The default limit is 8 MiB. If the Intel runtime would just divide _STK_LIM_MAX by the number of threads, that would still leave plenty. Am I correct in assuming the current Intel compiler uses a fixed default value for KMP_STACKSIZE/OMP_STACKSIZE? I found a value of 4 MiB in some random documentation on the net.
That's the language semantics - the RHS is evaluated entirely before the LHS is modified.
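A small sketch of why this rule matters when the LHS also appears on the RHS. The function shifted is a hypothetical example, not code from this thread. Because the function result is fully evaluated before a is modified, every element is computed from the old contents of a; the in-place loop below it, which overwrites a while still reading it, produces a different answer.

```fortran
! Hypothetical example: RHS is evaluated completely before the LHS changes.
module shift_mod
  implicit none
contains
  ! Returns x shifted right by one position, keeping the first element.
  function shifted(x) result(r)
    integer, intent(in) :: x(:)
    integer :: r(size(x))
    integer :: i
    r(1) = x(1)
    do i = 2, size(x)
      r(i) = x(i-1)
    end do
  end function shifted
end module shift_mod

program rhs_first
  use shift_mod
  implicit none
  integer :: a(5), i

  a = [1, 2, 3, 4, 5]
  a = shifted(a)          ! result built from the old a, then assigned
  print '(5i2)', a        ! prints  1 1 2 3 4

  a = [1, 2, 3, 4, 5]
  do i = 2, 5             ! in-place update reads already-overwritten values
    a(i) = a(i-1)
  end do
  print '(5i2)', a        ! prints  1 1 1 1 1
end program rhs_first
```

A compiler may skip the temporary only when it can prove that the two orders give the same result.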
>>That's the language semantics - the RHS is evaluated entirely before the LHS is modified
Forgot about that (shame on me).
The LHS could potentially be one of the arguments passed into the function. In this situation (with in situ modification), INTENT(IN) would not protect against modifying the input argument indirectly via the result (assuming it were written in situ, as I requested earlier). Note that this still would not preclude the compiler from optimizing out the allocate/copy/deallocate in cases where it can be determined that there is no alias (LHS not in the argument list, and LHS is a local variable and thus known not to be aliased via a dummy argument in the caller).
Jim Dempsey