Intel® Fortran Compiler

Memory leak with nested OpenMP regions

Nikita_Tropin
Novice

Hello,

I have a memory leak with nested OpenMP parallel regions when both regions run with multiple threads. When only one of the regions has multiple threads and the other is single-threaded, there is no leak.

A C# program calls a Fortran DLL multiple times. In the Fortran DLL I have nested parallel regions like this:

subroutine Sub1()

  ! ... some work ...

!$    call OMP_SET_NESTED(.TRUE.)
!$OMP parallel do num_threads(n1) shared(...) private(...)
  do i = 1, 2
     ! allocation of private dynamic arrays
     ! ... some work ...
     call Sub2(args)
     ! ... some work ...
     ! deallocation of private dynamic arrays
  end do
!$OMP end parallel do

end subroutine

subroutine Sub2(args)

  ! allocation of dynamic arrays
  ! ... some work ...

!$OMP parallel do num_threads(n2) shared(...) private(...)
  do i = 1, 2
     ! allocation of private dynamic arrays
     ! ... some work ...
     ! deallocation of private dynamic arrays
  end do
!$OMP end parallel do

  ! deallocation of dynamic arrays

end subroutine

So I have two loops, each with 2 iterations. When I run the code with n1=2, n2=1, or with n1=1, n2=2 (the numbers of threads in Sub1 and Sub2), everything is fine, but when I run it with n1=2, n2=2 I get a memory leak and the program crashes after some time, when memory usage reaches 2 GB (I build it as a 32-bit app).

The VMMap tool shows that most memory is taken by "Private Data", where I can see a lot of memory blocks of size 1024 KB with a total WS of 1000 KB, and the number of such blocks increases over time. Because of that round size (exactly 1 MB) I suspect these are some system blocks, maybe the stacks of OpenMP threads?
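One way to check how many nested thread teams are actually being created (and therefore how many thread stacks could be piling up) is to print the nesting level and team size from inside both parallel regions. A minimal sketch using the standard omp_lib routines (the subroutine name here is just an illustration):

subroutine report_team()
  use omp_lib
  implicit none
  ! Print the current nesting level, the thread id within the current team,
  ! and the size of that team; call this inside both parallel regions.
  print '(a,i0,a,i0,a,i0)', 'level=', omp_get_level(), &
        ' thread=', omp_get_thread_num(), ' team=', omp_get_num_threads()
end subroutine report_team

This will not prove that the 1 MB blocks are thread stacks, but it shows how many inner teams exist, which can be compared against the growth of the block count in VMMap.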

I tried it with both the 15.0 and 16.0 (beta) compilers; the behavior is the same.

vmmap.png

jimdempseyatthecove
Honored Contributor III

I suggest you add code to the beginning of Sub1 to verify that the C# program is not calling Sub1 from one C# thread while it is concurrently running Sub1 from a different C# thread.

Note, it is not necessarily wrong to do so; however, if the rate of the C# calls exceeds the throughput rate, you will build up a backlog of work (each pending call consuming stack and thread resources).
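A minimal sketch of what such a guard could look like, assuming a module-level counter updated with OpenMP atomics (the module and variable names are made up for illustration):

module reentry_guard
  implicit none
  integer :: active_calls = 0   ! number of callers currently inside Sub1
end module reentry_guard

subroutine Sub1()
  use reentry_guard
  implicit none
  integer :: calls_now

  ! Count this entry and capture the new value atomically.
!$omp atomic capture
  active_calls = active_calls + 1
  calls_now = active_calls
!$omp end atomic
  if (calls_now > 1) print *, 'WARNING: Sub1 entered concurrently, count =', calls_now

  ! ... existing body of Sub1 ...

!$omp atomic
  active_calls = active_calls - 1
end subroutine Sub1

If the warning fires, that would point to the backlog scenario described above.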

Jim Dempsey

Nikita_Tropin
Novice

Jim,

It appears that this issue is unrelated to C# and the DLL. I've managed to create a simple example of a Fortran console application that has nested OpenMP loops and the memory leak; the code is below. Again, if I disable OMP_NESTED or set the number of threads in at least one of the OpenMP loops to 1, everything works fine. But as soon as I have 2 threads in both the inner and the outer loop, I get the memory leak and see a lot of 1 MB memory blocks under Private Data in VMMap.

program OmpNestedMemLeak

!$  use omp_lib
    implicit none

  double precision h
  double precision estimate
  integer i
  integer n1, n2
  double precision sum2
  double precision x

  n1 = 100000
  n2 = 1000000
  h = 1.0D+00 / dble ( 2 * n1 )
  sum2 = 0.0D+00
  write (*,*) "Entering main loop.."
!$    call OMP_SET_NESTED(.TRUE.)
!$omp parallel shared ( h, n1, n2 ) private ( i, x ) num_threads(2)
!$omp do reduction ( + : sum2 )
  do i = 1, n1
    call nested_sub(n2)    ! inner (nested) parallel region
    x = h * dble ( 2 * i - 1 )
    sum2 = sum2 + 1.0D+00 / ( 1.0D+00 + x**2 )
  end do
!$omp end do
!$omp end parallel

  estimate = 4.0D+00 * sum2 / dble ( n1 )
  write (*,*) "Main sub result", estimate

end program OmpNestedMemLeak

subroutine nested_sub(n)
  implicit none
  integer, intent(in) :: n
  double precision h
  integer i
  double precision sum2
  double precision x

  h = 1.0D+00 / dble ( 2 * n )
  sum2 = 0.0D+00
!$omp parallel shared ( h, n ) private ( i, x ) num_threads(2)
!$omp do reduction ( + : sum2 )
  do i = 1, n
    x = h * dble ( 2 * i - 1 )
    sum2 = sum2 + 1.0D+00 / ( 1.0D+00 + x**2 )
  end do
!$omp end do
!$omp end parallel
end subroutine nested_sub

 

Kevin_D_Intel
Employee

Using that small test case built with the 16.0 compiler and running under Inspector, for 32-bit only, there does appear to be continually increasing memory usage when OMP_NESTED is enabled. The same does not occur for Intel 64.

I directed this to the attention of our OpenMP Development team for some deeper analysis and will let you know what I hear back.

(Internal tracking id: DPD200375133)

(Resolution Update on 12/09/2015): This defect is fixed in the Intel® Parallel Studio XE 2016 Update 1 Release (PSXE 2016.1.051 / CnL 2016.1.146 - Windows)

John_Campbell
New Contributor II

Thanks for the example posted above. I was interested in understanding nested !$OMP, so I have taken this example and tested it with and without OMP_NESTED.

I have also investigated providing some reporting statistics during multi-threaded runs. I have learnt something new about initialising and accumulating statistics in a parallel region.

Finally, comparing nested to non-nested performance: having a large outer loop "n1" makes the nested approach a poor alternative. Based on these tests, I would expect there are few situations where OMP_NESTED is a good approach, perhaps where n1 is small or there is poor load balance between threads. Once n1 is greater than the number of threads available, I would expect nesting to be unfavourable. Or am I missing something here?

I have derived two examples: omp_nest_v5.f90 is nested, while omp_nest_v4.f90 is not.
For my testing, they differ dramatically in run times, although I am not sure what the optimiser has done to the loop in nested_sub for v4.
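For reference, a sketch of how a non-nested inner routine might look (my own illustration, not the actual omp_nest_v4.f90): only the outer loop stays parallel, and the inner sum runs serially, returning its result so the optimiser cannot simply discard the loop.

subroutine nested_sub_serial(n, partial)
  implicit none
  integer, intent(in) :: n
  double precision, intent(out) :: partial
  integer i
  double precision h, x

  h = 1.0D+00 / dble ( 2 * n )
  partial = 0.0D+00
  ! Plain serial loop: no inner !$omp region, so only the outer
  ! !$omp parallel in the main program creates threads.
  do i = 1, n
    x = h * dble ( 2 * i - 1 )
    partial = partial + 1.0D+00 / ( 1.0D+00 + x**2 )
  end do
end subroutine nested_sub_serial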

John
