OpenMP and -heap-arrays not compatible since ifort 13?

Mikhail_O_ · 02-11-2016

I noticed that an OpenMP application compiled with ifort 15.0.3 20150407 and the flag "-heap-arrays XXX" suffers a large performance degradation that grows with the number of threads (on 1 thread the run time is the same, on 8 threads it is 4 times slower, on 32 threads it is 16 times slower). This is not the case with ifort 12.1.3 20120212, where the same option does not reduce performance. Any idea what changed?

jimdempseyatthecove · 02-11-2016

Please show your !$OMP directives. It sounds like either:

a) You have a large amount of firstprivate or copyin data.
b) Your parallel region runtime is much less than the time spent in the (internal) heap critical section for allocation/deallocation.
c) You are timing the first (only) pass through the region after allocation, and thus timing the "first touch" latency of mapping virtual memory to RAM, then allocating page file pages, then potentially wiping the pages.

Jim Dempsey

Mikhail_O_ · 02-12-2016

Thank you for the reply.

It is proprietary software (MCNP), so I am not going to modify it. Yes, it has a lot of !$OMP THREADPRIVATE data filled at the beginning. However, my question was different: I do understand that all this data may slow down the code, but I am wondering why this did not happen with ifort versions before 13. In other words, which heap- or stack-related parameters changed with OpenMP 4.0 to cause such different behaviour?

jimdempseyatthecove · 02-12-2016

Did you run your timing section while discarding the first iteration, to rule out the first-touch issue mentioned in c) above?
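A minimal sketch of that check, assuming a stand-alone test rather than the real application: the same parallel loop is timed several times, so the first pass (which also pays for creating the thread team and for first-touch page mapping) can be compared against the later ones. The names first_touch, timed_fill, n and npass are made up for this illustration.

```fortran
module first_touch
  use omp_lib, only: omp_get_wtime
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Fills the array inside a parallel loop and returns the elapsed wall time.
  function timed_fill(a, pass) result(seconds)
    real(dp), intent(out) :: a(:)
    integer, intent(in) :: pass
    real(dp) :: seconds, t0
    integer :: i
    t0 = omp_get_wtime()
    !$omp parallel do
    do i = 1, size(a)
      a(i) = 0.5_dp * real(i, dp) + real(pass, dp)
    end do
    !$omp end parallel do
    seconds = omp_get_wtime() - t0
  end function timed_fill
end module first_touch

program first_touch_timing
  use first_touch
  implicit none
  integer, parameter :: n = 10000000   ! array size, chosen arbitrarily
  integer, parameter :: npass = 4      ! number of timed passes
  real(dp), allocatable :: a(:)
  integer :: pass

  allocate(a(n))                       ! pages are typically not mapped until touched
  do pass = 1, npass
    print '(a,i0,a,f10.6,a)', 'pass ', pass, ': ', timed_fill(a, pass), ' s'
  end do
  print *, 'checksum: ', a(1) + a(n)   ! uses the data so the loop is not optimized away
  deallocate(a)
end program first_touch_timing
```

Build it with OpenMP enabled (for example gfortran -fopenmp, or the corresponding ifort option). If only pass 1 is slow, the cost is a one-time start-up effect rather than a per-region allocation problem.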

If you want to figure this out, then it may be time to roll up your sleeves.

You may need to link in the debug C Runtime Library heap for your system, then find a list of diagnostic functions supported by the heap. You may need to write the Fortran interfaces to these C routines. The functions of interest are:

How much of the heap is allocated (as opposed to free).
How much memory the process is using (heap, code and stack).

Prior to first parallel region, print out (save) allocated portion of heap, and process memory used.
Add a new first parallel region, that doesn't do much other than establishing the OpenMP thread team.
Then print out (save) allocated portion of heap, and process memory used.

The difference of the two will inform you of the initial consumption of instantiating your thread team (additional stack, context, heap).
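The before/after comparison could be sketched as follows. This is only an approximation of the procedure above: it assumes Linux and reads process-wide figures from /proc/self/status instead of calling the C runtime's heap diagnostics, so it reports process memory but not the allocated portion of the heap. The names memdiag, status_kb and team_cost are invented for the example.

```fortran
module memdiag
  implicit none
contains
  ! Returns the value (in kB) of a field such as 'VmRSS:' from /proc/self/status,
  ! or -1 if the file or the field is not available (Linux-only assumption).
  function status_kb(field) result(kb)
    character(len=*), intent(in) :: field
    integer :: kb
    character(len=256) :: line
    integer :: u, ios, k
    kb = -1
    open(newunit=u, file='/proc/self/status', status='old', action='read', iostat=ios)
    if (ios /= 0) return
    do
      read(u, '(a)', iostat=ios) line
      if (ios /= 0) exit
      if (index(line, field) == 1) then
        ! The line looks like "VmRSS:<tab>  1234 kB"; turn tabs into blanks,
        ! then let a list-directed read pick up the number.
        do k = 1, len(line)
          if (line(k:k) == achar(9)) line(k:k) = ' '
        end do
        read(line(len(field)+1:), *, iostat=ios) kb
        if (ios /= 0) kb = -1
        exit
      end if
    end do
    close(u)
  end function status_kb
end module memdiag

program team_cost
  use memdiag
  use omp_lib, only: omp_get_num_threads
  implicit none
  integer :: rss0, rss1, vm0, vm1, nthreads

  rss0 = status_kb('VmRSS:')
  vm0  = status_kb('VmSize:')

  nthreads = 1
  ! A first parallel region that does little more than establish the thread team.
  !$omp parallel
  !$omp master
  nthreads = omp_get_num_threads()
  !$omp end master
  !$omp end parallel

  rss1 = status_kb('VmRSS:')
  vm1  = status_kb('VmSize:')

  if (min(rss0, rss1, vm0, vm1) < 0) then
    print '(a)', 'memory figures not available on this system'
  else
    print '(a,i0)', 'threads in team:       ', nthreads
    print '(a,i0,a)', 'resident set growth:   ', rss1 - rss0, ' kB'
    print '(a,i0,a)', 'virtual size growth:   ', vm1 - vm0, ' kB'
  end if
end program team_cost
```

Virtual size usually grows by roughly the per-thread stack reservation times the number of new threads, while resident size grows only by what was actually touched, so the two figures are worth reading side by side.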

In front of the parallel region that you suspect is causing the overconsumption, collect and display the allocated portion of heap, and process memory used.

If your parallel region is instantiated with a !$OMP PARALLEL DO,
then separate it into a !$OMP PARALLEL followed by a !$OMP DO (and fix up the matching !$OMP END DO and !$OMP END PARALLEL).
Then insert between the two:

!$OMP BARRIER
!$OMP MASTER
call YourDiagnosticPrintOfHeapUsedAndProcessMemoryUsed
!$OMP END MASTER
!$OMP BARRIER

You might want to make the above a subroutine.

Then, after the !$OMP END DO but still inside the parallel region, repeat the above sequence.

Finally, after the !$OMP END PARALLEL, collect and display the allocated portion of heap and process memory used once more.

The printouts should show at which point the allocations occur, and then hopefully you can take corrective action.
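Put together, the instrumented region might look like the sketch below. It is an illustration under assumptions, not code from the application: report is a placeholder that only prints a label and should be replaced by a real heap and process-memory query, and work is a made-up routine whose automatic array is the kind of object that -heap-arrays can move from the stack to the heap.

```fortran
module diag
  implicit none
contains
  ! Hypothetical placeholder for the diagnostic printout; replace the body
  ! with real heap and process-memory queries for your platform.
  subroutine report(label)
    character(len=*), intent(in) :: label
    print '(2a)', 'checkpoint: ', label
  end subroutine report

  ! Must be called by every thread of the team:
  ! barrier, report from the master thread only, barrier.
  subroutine team_checkpoint(label)
    character(len=*), intent(in) :: label
    !$omp barrier
    !$omp master
    call report(label)
    !$omp end master
    !$omp barrier
  end subroutine team_checkpoint

  ! Made-up workload with an automatic array; depending on the -heap-arrays
  ! setting and the array size, tmp may be allocated on the heap on each call.
  subroutine work(n, i, res)
    integer, intent(in) :: n, i
    real, intent(out) :: res
    real :: tmp(n)
    tmp = real(i)
    res = sum(tmp) / real(n)
  end subroutine work
end module diag

program split_region
  use diag
  implicit none
  integer, parameter :: niter = 1000, n = 4096
  real :: r(niter)
  integer :: i

  call report('before parallel region')

  !$omp parallel
  call team_checkpoint('inside region, before DO')
  !$omp do
  do i = 1, niter
    call work(n, i, r(i))
  end do
  !$omp end do
  call team_checkpoint('inside region, after DO')
  !$omp end parallel

  call report('after parallel region')
  print *, 'checksum: ', sum(r)
end program split_region
```

The barriers on either side of the master section keep the other threads from running ahead, so each printout reflects a point where the whole team is quiescent.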

Jim Dempsey

Steven_L_Intel1 · 02-12-2016

I am aware of an issue regarding dynamic allocation in OpenMP parallel regions that causes a performance degradation. This is not an "incompatibility". I'm not sure if 15.0.3 has this problem but it will be fixed in 16.0.2, due out very soon.