Possible bug in ifort + OMP

Luckner__Paul · ‎08-19-2020

The allocate statement seems to be broken within an OMP loop.

Setup:

Create an OMP loop.
Use allocate and pass something large as the optional source= argument.
Result: segmentation fault.

Compiler version:
$ ifort -v
ifort version 19.1.1.217

JohnNichols · ‎08-19-2020

program main_tmp

  implicit none

  integer              :: i,j,n
  integer, allocatable :: x(:)

  !$omp parallel do default(none) private(i,j,n,x)
  do i = 1, 6
    n = 10**i
    allocate (x(n), source=[(j, j=1,n)])
  end do
  !$omp end parallel do

end program

As far as I am aware, after the first allocation you need to deallocate x before allocating it again.
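A minimal sketch of the loop with that fix applied, assuming the intent is only to (re)allocate and fill x on every iteration. The module name alloc_demo, the subroutine run_loop and the total counter are hypothetical additions used to make the result checkable. This only addresses the reallocation error. The temporary built by the array constructor can still be large, which is the stack issue discussed further down in the thread.

```fortran
module alloc_demo
  implicit none
contains
  ! Hypothetical wrapper around the loop from the post: it counts the
  ! elements allocated over all iterations so the result can be checked.
  subroutine run_loop(total)
    integer, intent(out) :: total
    integer              :: i, j, n
    integer, allocatable :: x(:)

    total = 0
    !$omp parallel do default(none) private(i,j,n,x) reduction(+:total)
    do i = 1, 6
      n = 10**i
      ! One thread may execute several iterations, so its private copy
      ! of x can still be allocated from an earlier one: release it first.
      if (allocated(x)) deallocate(x)
      allocate (x(n), source=[(j, j=1,n)])
      total = total + size(x)
    end do
    !$omp end parallel do
  end subroutine run_loop
end module alloc_demo
```

Calling run_loop(total) from a main program should give total = 10 + 100 + ... + 10**6 = 1111110, with or without OpenMP enabled.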

Luckner__Paul · ‎08-19-2020

You are correct. I tried to cut down the code, and I shouldn't have deleted the external procedure.

The bug still arises.

program main_tmp

  implicit none

  external :: foo
  integer :: i

  !$omp parallel do default(none) private(i)
  do i = 1, 6
    call foo(i)
  end do
  !$omp end parallel do
end program

subroutine foo(i)
  integer, intent(in) :: i

  integer              :: j, n
  integer, allocatable :: x(:)

  n = 10**i
  allocate (x(n), source=[(j, j=1,n)])
end subroutine

Luckner__Paul · ‎08-19-2020

Even worse!!

Just passing large arguments to any procedure gives a segmentation fault!? (See procedure bar.)

module foo_m
  implicit none
contains
  subroutine bar(iarr)
    integer, intent(in) :: iarr(:)

    integer :: k
    k = iarr(1)                    ! just any operation; otherwise the compiler might optimize this subroutine away?!
  end subroutine

  subroutine foo(i)
    integer, intent(in) :: i

    integer              :: j, n
    integer, allocatable :: x(:)

    n = 10**i
    allocate (x(n), source=[(j, j=1,n)])
  end subroutine
end module

program main_tmp
  use foo_m
  implicit none

  integer :: i,j

  !$omp parallel do default(none) private(i,j)
  do i = 1, 6
    ! call foo(i)                     ! uncomment to see that   allocate(..,source=<something large>)   will break
    ! call bar([(j, j=1,10**i)])      ! uncomment to see that   procedure(<something large>)            will break
  end do
  !$omp end parallel do
end program

Johannes_Rieke · ‎08-20-2020

Hi Paul,

I cannot reproduce the error in Windows OS with PSXE 2020 u2 (19.1.2.254). Your example code in the post directly above works as expected in debug mode.

My command line:

/nologo /debug:full /MP /Od /Qopenmp /warn:all /module:"x64\Debug\\" /object:"x64\Debug\\" /Fd"x64\Debug\vc150.pdb" /traceback /check:all /libs:static /threads /dbglibs /c

Do you work on GNU/Linux? What does your build command line look like?

BR, Johannes

Luckner__Paul · ‎08-20-2020

I have already replied several times, but the website doesn't show it!?

Anyway, I uploaded my build instructions to Pastebin:

https://pastebin.com/j8AYtQun

PS: I do use GNU/Linux.

JohnNichols · ‎08-20-2020

It works in Intel Fortran with heap arrays set to 0.

Luckner__Paul · ‎08-20-2020

I guess you are talking about this option?

$ ifort -heap-arrays 0

https://software.intel.com/......

JohnNichols · ‎08-20-2020

I ran it on the latest VS 2019 Preview with the latest Intel Fortran, using a standard Fortran program. It ran on 5 cores and then crashed with a stack error, which you fix by setting heap arrays to 0 in VS.

Johannes_Rieke · ‎08-20-2020

If the heap-arrays option helps, then the line

allocate (x(n), source=[(j, j=1,n)])

might create a temporary array on the stack because of the implied do loop. To check this, one could split the allocation and the initialization into an allocate followed by a do loop. There was a discussion some time ago regarding implied do loops, temporaries and the stack... but I can't find it.
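A minimal sketch of that split, assuming foo and bar only need the array to exist and be filled: x is allocated without source= and filled by a plain do loop, and the argument for bar is built in an allocatable (heap) array before the call instead of being passed as an array constructor. The module name split_demo, the driver run_both and the extra last/k/ok outputs are hypothetical, added only so the result can be checked. Whether ifort really puts the constructor temporary on the stack is the hypothesis above; building this version without -heap-arrays and comparing it with the original is one way to test it.

```fortran
module split_demo
  implicit none
contains
  ! Variant of foo: same allocation, but filled element by element
  ! instead of source=[(j, j=1,n)], so no constructor temporary is needed.
  subroutine foo(i, last)
    integer, intent(in)  :: i
    integer, intent(out) :: last   ! hypothetical: only for checking
    integer              :: j, n
    integer, allocatable :: x(:)

    n = 10**i
    allocate (x(n))
    do j = 1, n
      x(j) = j
    end do
    last = x(n)
  end subroutine foo

  ! bar as in the post, but returning k so the call can be checked.
  subroutine bar(iarr, k)
    integer, intent(in)  :: iarr(:)
    integer, intent(out) :: k
    k = iarr(1)
  end subroutine bar

  ! Hypothetical driver: bar's argument is built in an allocatable array
  ! instead of passing the constructor [(j, j=1,10**i)] directly.
  subroutine run_both(ok)
    logical, intent(out) :: ok
    integer              :: i, j, k, last
    integer, allocatable :: arg(:)

    ok = .true.
    !$omp parallel do default(none) private(i,j,k,last,arg) reduction(.and.:ok)
    do i = 1, 6
      call foo(i, last)
      if (allocated(arg)) deallocate(arg)
      allocate (arg(10**i))
      do j = 1, size(arg)
        arg(j) = j
      end do
      call bar(arg, k)
      ok = ok .and. (last == 10**i) .and. (k == 1)
    end do
    !$omp end parallel do
  end subroutine run_both
end module split_demo
```

Calling run_both(ok) should return ok = .true.; the cost is an explicit loop in place of the one-line source= form.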

Furthermore, OMP creates overhead, which could have an impact on the stack. But I'm unsure about this.

If speed doesn't matter, you could stick with the heap option. The stack is generally faster than the heap, so if you can identify the code section that requires the heap, you might be able to improve the speed. But remember: premature optimization is the root of all evil.

PS: You override the default real and integer kinds in your build command (-i8, ...), and they are specified twice. I would use this only for legacy code and otherwise set the kinds in the code.

Luckner__Paul · ‎08-21-2020

The heap-arrays option did fix the segmentation fault, so it is clear that temporary arrays placed on the stack don't work nicely together with OMP.

I guess each thread gets a specific stack size, and if a temporary array is too large it overwrites the stack of the next thread... Shouldn't the compiler know the stack sizes and avoid that? It could automatically put such temporaries on the heap...

So now the programmer has to take care not to use implied do loops in OMP parallel regions (unless -heap-arrays 0 is used)?

John_Campbell · ‎08-21-2020

It is important to know where private copies of arrays are located, either on the shared heap or on each thread's stack, and then to know how big each thread stack needs to be.

My experience in Win-64 is that:

Private copies of allocatable arrays are placed on the heap (i.e. arrays that are allocatable as seen at the !$OMP region).
Private copies of automatic, local and argument arrays are placed on the thread stack.
This placement can be modified with compiler options.
The master thread and the other threads each have a stack size, which should be known and managed.
For the master thread, private copies of arrays are duplicated, which means the master thread stack is likely to require a larger size.
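A minimal sketch of the first two points in that list, assuming the Win-64 behaviour described carries over to the reader's compiler: work is a private allocatable, so each thread's copy lives on the heap, while tmp inside use_automatic is an automatic array, which compilers such as ifort place on the calling thread's stack by default (options such as -heap-arrays can change that). All names are hypothetical and the sizes are kept small so the sketch runs with default stack sizes.

```fortran
module placement_demo
  implicit none
contains
  ! Automatic array: its size comes from the argument n, so by default
  ! it is created on the stack of whichever thread calls this function.
  integer function use_automatic(n) result(s)
    integer, intent(in) :: n
    integer             :: tmp(n)
    integer             :: j
    do j = 1, n
      tmp(j) = 1
    end do
    s = sum(tmp)
  end function use_automatic

  subroutine run_placement(total)
    integer, intent(out) :: total
    integer              :: i
    integer, allocatable :: work(:)   ! private copies: heap storage

    total = 0
    !$omp parallel default(none) private(i,work) reduction(+:total)
    allocate (work(1000))             ! each thread allocates its own copy
    work = 1
    !$omp do
    do i = 1, 8
      total = total + use_automatic(1000) + work(i)
    end do
    !$omp end do
    deallocate (work)
    !$omp end parallel
  end subroutine run_placement
end module placement_demo
```

run_placement(total) should return total = 8 * (1000 + 1) = 8008. Raising the automatic size far beyond the thread stack size is the kind of change that would be expected to reproduce the crash from this thread, while the same size in work would not.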

I am learning how to use the stack in a 64-bit environment. My latest OMP usage is to declare all stacks large (500 MB). This only reserves virtual address space; only the used portion of each stack is backed by physical memory. I.e. making the stacks much larger than "necessary" does not affect physical memory demand, but you need to have an idea of what "necessary" can be. (You may also need to consider the virtual memory limit.) ifort takes a similar approach with memory addresses when locating heap extensions.

This problem comes about because each thread stack is defined when the thread is created. Unlike the heap, which can be extended up to the physical/virtual memory limit, a thread stack cannot be extended. If you are using !$OMP to speed up your program, relying on virtual memory is not a likely option.

Johannes_Rieke · ‎08-21-2020

I agree, Paul, that the memory handling for OMP, which is left to the user's responsibility, is not solved optimally in the current Intel compiler. There seem to have been some recent changes with 19.0.x and newer. Maybe not always for the better?

A thread which is loosely related to this issue can be found here.

Maybe Intel can improve the user experience with OMP. Maybe the upcoming oneAPI compilers do a better job on this? Maybe we should use coarrays and let the compiler handle the MPI/OpenMP/whatever layer.