BUG: deallocate always aquires a lock in openmp region in windows

Martin1 · ‎01-28-2021

Originally I thought that some compiler options or other setup was wrong, but after lots of tests of options and variants (see performance difference with allocate/deallocate linux and windows ) I came to the conclusion that there is probably a bug in deallocate called from within an openmp region in windows. No matter what, it is pretty obvious that deallocate always aquires at least one lock (possibly more, because calling deallocate in a critical section improves runtime by a factor of two actually). For variants of the attached code, which actually show that allocate scales, whereas deallocate is the culprit, see the linked discussion above. The code shown below just shows that there is something wrong with the parallel version (e.g. parallel loop is slower by a factor of more than 100 on an 18 core machine, on linux it is actually faster by a factor of 10-15 or so).

Compile command (from the oneapi console): ifort /Qopenmp /O2 perf_alloc.f90

program perf_alloc

use OMP_LIB
implicit none

real(8) :: tstart
real(8), dimension(1:2) :: telaps

integer(4), parameter :: cnt = 100000000
integer(4) :: i

integer, dimension(:), pointer :: a

print *,'num threads = ', omp_get_max_threads()

! non-parallel
tstart = getTime()
do i = 1,cnt
   allocate(a(1:40))
   deallocate(a)
end do
telaps(1) = elapsedTime(tstart)

! parallel
tstart = getTime()
!$omp parallel do schedule(dynamic, 100) default(shared) private(i, a)
do i = 1,cnt
   allocate(a(1:40))
   deallocate(a)
end do
!$omp end parallel do
telaps(2) = elapsedTime(tstart)

print '(a,2f15.5,"s")', 'non/parallel: ', telaps(1:2)
print '(a,f15.5)', 'ratio = ', telaps(1)/telaps(2)


contains

function getTime() result(tstamp)
   real(8) :: tstamp

   integer(8) :: cnt, cntRate
   real(8) :: tdouble

   call system_clock(cnt, cntRate)
   tdouble = real(cnt,8) / real(cntRate,8)
   tstamp = real(tdouble,8)
end function getTime


function elapsedTime(tstart) result(telapsed)
   real(8) :: telapsed
   real(8), intent(in) :: tstart

   integer(8) :: cnt, cntRate
   real(8) :: tdouble

   call system_clock(cnt, cntRate)
   tdouble = real(cnt,8) / real(cntRate,8)
   telapsed = real(tdouble,8) - tstart
end function elapsedTime

end program perf_alloc

Martin1 · ‎02-02-2021

Is there nobody at Intel who even cares about such an elemental problem, which should effect quite a number of people (maybe not even aware that there is a problem, if running time is not compared with linux)?

To be clear: it means that any allocation within openmp region should be kept to a minimum if ifort is used on windows, making any modern (i.e. fortran95!) coding difficult.

And it looks like that I am not the only one who has stumbled upon it:

"Extremely-poor-OpenMP-performance-in-2019-version-of-Fortran"

"Intel-Fortran-Compiler/OpenMP-threading-performance"

Note that the first link "extremely-poor-..." contains vtune profiler output, which clearly shows that deallocation is the problem. My own vtune profiles look similar, dominated by for_deallocate and similar routines within the libifcoremd. The scalable_free routine from libiomp5md does not make any appearance as it should.