Memory Leak in OpenMP Task parallel application with Ifort 17

Martin_K_7 · ‎09-16-2016

I am trying to parallelize an algorithm using DAG-Scheduling via OpenMP tasks and there many programs are killed by the Linux kernel due to Out-Of-Memory after a calls to the parallelized code although the allocated memory is only 1% of the servers main memory. But this happens only if I use the Intel Compilers from 2015, 2016 or even the new 2017 edition.

Here is a small example building the same task dependency graph as the algorithm crashing:

    PROGRAM OMP_TASK_PROBLEM
        IMPLICIT NONE
    
        INTEGER M, N
        PARAMETER(M = 256, N=256)
        DOUBLE PRECISION X(M,N)
    
        X(1:M,1:N) = 0.0D0
    
        CALL COMPUTE_X(M, N, X, M)
    
        ! WRITE(*,*) X (1:M, 1:N)
    
    END PROGRAM
    
    
    SUBROUTINE COMPUTE_X(M,N, X, LDX)
        IMPLICIT NONE
        INTEGER M, N, LDX
        DOUBLE PRECISION X(LDX, N)
        INTEGER K, L, KOLD, LOLD
    
    
        !$omp parallel default(shared)
        !$omp master
        L = 1
        DO WHILE ( L .LE. N )
            K = M
            DO WHILE (K .GT. 0)
                IF ( K .EQ. M .AND. L .EQ. 1) THEN
                    !$omp  task depend(out:X(K,L)) firstprivate(K,L) default(shared)  
                    X(K,L) = 0
                    !$omp end task
                ELSE IF ( K .EQ. M .AND. L .GT. 1) THEN
                    !$omp  task depend(out:X(K,L)) depend(in:X(K,LOLD)) firstprivate(K,L,LOLD) default(shared)  
                    X(K,L) = 1 + X(K,LOLD)
                    !$omp end task
                ELSE IF ( K .LT. M .AND. L .EQ. 1) THEN
                    !$omp  task depend(out:X(K,L)) depend(in:X(KOLD,L)) firstprivate(K,L,KOLD) default(shared)   
                    X(K,L) = 2 + X(KOLD,L)
                    !$omp end task
                ELSE
                    !$omp  task depend(out:X(K,L)) depend(in:X(KOLD,L),X(K,LOLD)) firstprivate(K,L,KOLD, LOLD) default(shared)
                    X(K,L) = X(KOLD, L) + X(K,LOLD)
                    !$omp end task
                END IF
    
                KOLD = K
                K = K - 1
            END DO
    
            LOLD = L
            L = L + 1
        END DO
        !$omp end master
        !$omp taskwait
        !$omp end parallel
    END SUBROUTINE

After compiling it using `ifort -qopenmp -g omp_test.f90` and running it via `valgrind` it reports:

==23255== 1,048,576 bytes in 1 blocks are possibly lost in loss record 22 of 27
==23255==    at 0x4C29BFD: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==23255==    by 0x516AF47: bget(kmp_info*, long) (kmp_alloc.c:741)
==23255==    by 0x516AC6D: ___kmp_fast_allocate (kmp_alloc.c:2012)
==23255==    by 0x51CFAC7: __kmp_task_alloc (kmp_tasking.c:997)
==23255==    by 0x51CFA36: __kmpc_omp_task_alloc (kmp_tasking.c:1134)
==23255==    by 0x40367E: compute_x_ (omp_task_problem.f90:31)
==23255==    by 0x51DC412: __kmp_invoke_microtask (in /scratch/software/intel-2017/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64_lin/libiomp5.so)
==23255==    by 0x51AC186: __kmp_invoke_task_func (kmp_runtime.c:7055)
==23255==    by 0x51AD229: __kmp_fork_call (kmp_runtime.c:2361)
==23255==    by 0x5184EE7: __kmpc_fork_call (kmp_csupport.c:339)
==23255==    by 0x4034D3: compute_x_ (omp_task_problem.f90:24)
==23255==    by 0x4030CC: MAIN__ (omp_task_problem.f90:10)
==23255==
==23255== 1,048,576 bytes in 1 blocks are possibly lost in loss record 23 of 27
==23255==    at 0x4C29BFD: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==23255==    by 0x516AF47: bget(kmp_info*, long) (kmp_alloc.c:741)
==23255==    by 0x516AC6D: ___kmp_fast_allocate (kmp_alloc.c:2012)
==23255==    by 0x51CE30A: __kmp_add_node (kmp_taskdeps.cpp:204)
==23255==    by 0x51CE30A: __kmp_process_deps (kmp_taskdeps.cpp:320)
==23255==    by 0x51CE30A: __kmp_check_deps (kmp_taskdeps.cpp:365)
==23255==    by 0x51CE30A: __kmpc_omp_task_with_deps (kmp_taskdeps.cpp:523)
==23255==    by 0x403A11: compute_x_ (omp_task_problem.f90:35)
==23255==    by 0x51DC412: __kmp_invoke_microtask (in /scratch/software/intel-2017/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64_lin/libiomp5.so)
==23255==    by 0x51AC186: __kmp_invoke_task_func (kmp_runtime.c:7055)
==23255==    by 0x51AD229: __kmp_fork_call (kmp_runtime.c:2361)
==23255==    by 0x5184EE7: __kmpc_fork_call (kmp_csupport.c:339)
==23255==    by 0x4034D3: compute_x_ (omp_task_problem.f90:24)
==23255==    by 0x4030CC: MAIN__ (omp_task_problem.f90:10)

The line numbers of the `compute_x_` function in the backtrace correspond to the `!$omp task` statements. These memory leaks accumulated rapidly to an amount of memory such that the program crashes.

Using gcc-6.2 for this `valgrind` ends up with:

   ==21246== LEAK SUMMARY:
    ==21246==    definitely lost: 0 bytes in 0 blocks
    ==21246==    indirectly lost: 0 bytes in 0 blocks
    ==21246==      possibly lost: 8,640 bytes in 15 blocks
    ==21246==    still reachable: 4,624 bytes in 4 blocks
    ==21246==         suppressed: 0 bytes in 0 blocks
    ==21246==

where the leaks are only from the first initialization of the OpenMP runtime system.

So my question is: Why does the Intel Compiler/Intel OpenMP runtime system produce theses leaks or alternatively is there an error in the way I have
designed the task parallelism.

jimdempseyatthecove · ‎09-16-2016

Try checking the TBB forum for valgrind and memory leaks. OpenMP tasking is now using the TBB scalable allocator. I think what you are seeing false positives (due to slabs being held and not returned to the heap).

Jim Dempsey

Martin_K_7 · ‎09-16-2016

I think that is not a false positive. I also had this in mind because the Intel Inspector does not report a leak. But by replacing the function call

CALL COMPUTE_X(M,N,X,M)

for benchmarking purpose by

DO I = 1, 100000
   CALL COMPUTE_X(M,N,X,M) 
END DO

I can watch the process requesting more and more memory using top on the command line. The code shown above easily take more than 1 GB of memory and keep on growing until the kernel kills the process due to out of memory. From this point of view this can not be a false positive.

jimdempseyatthecove · ‎09-16-2016

While this may be, or may likely be, indicative of a memory leak, it can also be indicative of memory fragmentation:

allocate large
allocate small
free large
allocate small (taken from free'd large)
allocate same size large (this causes memory creep)
...

Some heap managers will try to work around these allocation patterns. As to what is happening, this will require closer investigation.

Jim Dempsey

Martin_K_7 · ‎09-17-2016

Due to the fact, that the Linux kernel kills the process, i.e. the memory keeps acquired by application, this is a leak and not a fragmentation problem in my eyes.

I could reproduce the crash on:

CentOS 7.2 / amd64 (2x Intel Xeon E5-2640 v3, 64 GB RAM Parallel Studio XE 2016 update 3, Parallel Studio XE 2017, Linux Kernel 3.10)
Ubuntu 14.04.5 /amd64 (2x Intel Xeon E5-2690, 32 GB RAM, Parallel Studio XE 2017, Linux Kernel 3.13)

On both systems looping over the COMPUTE_X function let the kernel kill the application. On the CentOS system the kernel states:

[2234836.890145] Out of memory: Kill process 7831 (a.out) score 972 or sacrifice child
[2234836.890177] Killed process 7831 (a.out) total-vm:65058044kB, anon-rss:63919612kB, file-rss:0kB

and on the Ubuntu System:

[2234068.581246] Out of memory: Kill process 11797 (a.out) score 971 or sacrifice child
[2234068.581314] Killed process 11797 (a.out) total-vm:33065444kB, anon-rss:31885948kB, file-rss:0kB

Kevin_D_Intel · ‎09-19-2016

I'll try re-creating the described behavior and route this to our OpenMP developers for assistance with deeper analysis.

jimdempseyatthecove · ‎09-19-2016

I have a separate issue posted on premier.intel.com that may be related to this, though the scenario is different. In my situation (KNL) I have nested task levels and many threads enqueuing/dequeuing tasks. I do not have dependencies. I get SIGSEGV's which seem to be related to (IMHO) non-thread safe coding. I sent in a reproducer, and am awaiting a resolution.

Jim Dempsey

Martin_K_7 · ‎09-26-2016

Is there anything from new about this problem?

jimdempseyatthecove · ‎09-26-2016

When running your program, with this slight modification:

PROGRAM OMP_TASK_PROBLEM
    IMPLICIT NONE

    INTEGER M, N, I
    PARAMETER(M = 256, N=256)
    DOUBLE PRECISION X(M,N)

    DO I=1,100000
        X(1:M,1:N) = 0.0D0

        CALL COMPUTE_X(M, N, X, M)
    END DO

    ! WRITE(*,*) X (1:M, 1:N)

END PROGRAM


SUBROUTINE COMPUTE_X(M,N, X, LDX)
    IMPLICIT NONE
    INTEGER M, N, LDX
    DOUBLE PRECISION X(LDX, N)
    INTEGER K, L, KOLD, LOLD


    !$omp parallel default(shared)
    !$omp master
    L = 1
    DO WHILE ( L .LE. N )
        K = M
        DO WHILE (K .GT. 0)
            IF ( K .EQ. M .AND. L .EQ. 1) THEN
                !$omp  task depend(out:X(K,L)) firstprivate(K,L) default(shared)  
                X(K,L) = 0
                !$omp end task
            ELSE IF ( K .EQ. M .AND. L .GT. 1) THEN
                !$omp  task depend(out:X(K,L)) depend(in:X(K,LOLD)) firstprivate(K,L,LOLD) default(shared)  
                X(K,L) = 1 + X(K,LOLD)
                !$omp end task
            ELSE IF ( K .LT. M .AND. L .EQ. 1) THEN
                !$omp  task depend(out:X(K,L)) depend(in:X(KOLD,L)) firstprivate(K,L,KOLD) default(shared)   
                X(K,L) = 2 + X(KOLD,L)
                !$omp end task
            ELSE
                !$omp  task depend(out:X(K,L)) depend(in:X(KOLD,L),X(K,LOLD)) firstprivate(K,L,KOLD, LOLD) default(shared)
                X(K,L) = X(KOLD, L) + X(K,LOLD)
                !$omp end task
            END IF

            KOLD = K
            K = K - 1
        END DO

        LOLD = L
        L = L + 1
    END DO
    !$omp end master
    !$omp taskwait
    !$omp end parallel
END SUBROUTINE

Build on Windows as Win32, there is no leak as a result of executing the loop. I do not have valgrind installed here, so my suspicion here is that the leak is an object allocated at initialization time, and held through program termination. This is a benign situation because when the process ends, the RAM (Virtual Memory of the entire process) is freed.

In your first post, you report the oom killer is killing your jobs. The sample code, is not producing memory creep. So I suspect something else is happening. Does the above code (with loop) get killed, or is observed as continually eating memory?

Jim Dempsey

jimdempseyatthecove · ‎09-26-2016

FWIW

Here is a scenario that is known to cause increasing memory consumption. This may or may not apply to your code, but is one of those "gotcha's".

If your application is mixed language programming (C/C++/C#/...) with Fortran. And the main program is spawing threads (pthread, std::thread, boost::, other), the gotcha is each thread (different) that calls your Fortran subroutine (containing parallel regions) will create a separate OpenMP thread pool. This pool is not returned upon exit of the function, nor returned upon termination of the thread that spawned the pool (within the Fortran subroutine).

The correct way to handle this is to not use ...join(); nor terminate your threads. Instead, once created, re-use the calling thread (e.g. use a pthread condition variable to run the next iteration, or some other means to pass in the args). Yes, this is a little more work, but it is better than the alternative. (oom killing your process).

Jim Dempsey

Kevin_D_Intel · ‎09-26-2016

I reproduced the reported loss when using valgrind, and also (as you noted) the no report of a loss with Intel® Inspector. I also tried the variant looping over the call and it ran for a very long time. I wasn’t patient enough on the system that I ran on to see any failure, but it was consuming memory as described.

I routed this to our OpenMP Developers and will share what I hear from their deeper analysis.

(Internal tracking id: DPD200414610)

Martin_K_7 · ‎09-27-2016

@jimdempseyatthecove The code with the modification (include the loop around the COMPUTE_X call) is the one which produces the crashes on my systems as reported above. And I use Linux and not a Win32 as you used for your tests. And I think due to the differences in the implementation of the threading between Windows and Linux it is useless trying to reproduce the bug on a Win32 system. If you have the possibility try it again on a Ubuntu 14.04 or CentOS 7 x86-64 system.

In order to avoid problems with the mixed language programming I isolated the bug as a Fortran only program and even this program failed. So it could not be a problem of mixed language codes.

@Kevin D. (Intel) The kill only happens if the Linux kernel do not use swap space on a hard disc. The HPC servers I am using have hard disk swap turned of. So disabling all swap before launching the code should also reproduce the kill of the application.

jimdempseyatthecove · ‎09-27-2016

Martin,

I ported the program over to a CentOS 7.2 in KNL, and built as 64-bit application. Release build shows memory leak

Jim Dempsey

Martin_K_7 · ‎09-27-2016

So from your experiments we can state that it is at least a bug in all 64 bit enabled Linux implementations of the task scheduler. Can you explain the plots a bit more?

jimdempseyatthecove · ‎09-28-2016

The plot was sub-section from the GNOME System Monitor. The top line was CPU usage. When all lines were pegged at the top was when the application was running. When then fell down, was when I Ctrl-C'd the program. The bottom line is the memory (and swap) usage. Memory consumption reached 25% of 94GB by the time I terminated the program. The chart is a clear indication of a memory leak.

If I were to guess, the dependency nodes are allocated from the TBB scalable allocator slab and either a) not returned, b) allocated from thread N's slab and returned to thread's M slab, or c) for some reason nodes never get reused.

Jim Dempsey

Martin_K_7 · ‎01-11-2017

I updated the compiler suite on my system to 2017 update 1 and the following changes. The above example now works without increasing the memory requirements but if I change

....
!$omp master 
....

...
!$omp end master
...

into

....
!$omp single
....

...
!$omp end single
...

The same error occurs again.

jimdempseyatthecove · ‎01-11-2017

Single will permit any one thread of the team through that section of the parallel region. Should you place the parallel region in a loop, then eventually, all threads will have had an opportunity to run through the single section. This will result in each thread obtaining scalable allocator slabs (think of these as thread private heaps). Thus, expect nThreads x the amount of allocated RAM.

Jim Dempsey

Martin_K_7 · ‎01-12-2017

But, in the given example there should not be a difference between the usage of `master` and `single` and the memory keeps growing as before. For me the problem is solved because it works now with master construct but of the sake of completeness it should work as well with the single here. Because in many tutorial I saw the task directive is often combined with the `single` statement.