Originally I thought that some compiler option or other setup was wrong, but after many tests of options and variants (see "performance difference with allocate/deallocate linux and windows") I came to the conclusion that there is probably a bug in deallocate when it is called from within an OpenMP region on Windows. Whatever the cause, it is pretty obvious that deallocate always acquires at least one lock, and possibly more: wrapping the deallocate in a critical section actually improves the runtime by a factor of two. For variants of the attached code showing that allocate scales and deallocate is the culprit, see the linked discussion above. The code below simply demonstrates that something is wrong with the parallel version: on Windows the parallel loop is slower by a factor of more than 100 on an 18-core machine, whereas on Linux it is faster by a factor of roughly 10-15.
Compile command (from the oneAPI command prompt): ifort /Qopenmp /O2 perf_alloc.f90
program perf_alloc
   use OMP_LIB
   implicit none
   real(8) :: tstart
   real(8), dimension(1:2) :: telaps
   integer(4), parameter :: cnt = 100000000
   integer(4) :: i
   integer, dimension(:), pointer :: a

   print *,'num threads = ', omp_get_max_threads()

   ! non-parallel
   tstart = getTime()
   do i = 1,cnt
      allocate(a(1:40))
      deallocate(a)
   end do
   telaps(1) = elapsedTime(tstart)

   ! parallel
   tstart = getTime()
   !$omp parallel do schedule(dynamic, 100) default(shared) private(i, a)
   do i = 1,cnt
      allocate(a(1:40))
      deallocate(a)
   end do
   !$omp end parallel do
   telaps(2) = elapsedTime(tstart)

   print '(a,2f15.5,"s")', 'non/parallel: ', telaps(1:2)
   print '(a,f15.5)', 'ratio = ', telaps(1)/telaps(2)

contains

   function getTime() result(tstamp)
      real(8) :: tstamp
      integer(8) :: cnt, cntRate
      real(8) :: tdouble
      call system_clock(cnt, cntRate)
      tdouble = real(cnt,8) / real(cntRate,8)
      tstamp = real(tdouble,8)
   end function getTime

   function elapsedTime(tstart) result(telapsed)
      real(8) :: telapsed
      real(8), intent(in) :: tstart
      integer(8) :: cnt, cntRate
      real(8) :: tdouble
      call system_clock(cnt, cntRate)
      tdouble = real(cnt,8) / real(cntRate,8)
      telapsed = real(tdouble,8) - tstart
   end function elapsedTime

end program perf_alloc
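For reference, the critical-section experiment mentioned above is a one-directive change to the parallel loop. This is only a sketch of the variant I tried (the name `dealloc_lock` is arbitrary); serializing the deallocate this way roughly halved the runtime on my machine, which is what suggests deallocate already takes more than one lock internally:

```fortran
   ! parallel, with deallocate serialized in a named critical section
   !$omp parallel do schedule(dynamic, 100) default(shared) private(i, a)
   do i = 1,cnt
      allocate(a(1:40))
      !$omp critical (dealloc_lock)
      deallocate(a)
      !$omp end critical (dealloc_lock)
   end do
   !$omp end parallel do
```

If deallocate were lock-free (or used a scalable per-thread allocator), adding a critical section could only make the loop slower, never faster.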
Is there nobody at Intel who cares about such an elementary problem, which should affect quite a number of people (many perhaps not even aware of it, if they never compare running times with Linux)?
To be clear: this means that any allocation within an OpenMP region must be kept to a minimum if ifort is used on Windows, which makes any modern (i.e. Fortran 95 and later) coding style difficult.
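The only workaround I see is to hoist allocations out of hot parallel loops entirely, so each thread allocates its scratch buffer once and reuses it. A minimal sketch of that pattern, applied to the benchmark loop above (the loop body here just touches the buffer instead of reallocating it, so it is not a like-for-like timing comparison):

```fortran
   ! one allocate/deallocate per thread instead of per iteration
   !$omp parallel default(shared) private(i, a)
   allocate(a(1:40))            ! per-thread scratch buffer, allocated once
   !$omp do schedule(dynamic, 100)
   do i = 1,cnt
      a(1) = i                  ! reuse the buffer inside the loop
   end do
   !$omp end do
   deallocate(a)                ! freed once, outside the hot loop
   !$omp end parallel
```

This avoids the contended deallocate path, but it is exactly the kind of manual buffer management that automatic allocation was supposed to make unnecessary.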
And it looks like I am not the only one who has stumbled upon this:
Note that the first link ("extremely-poor-...") contains VTune profiler output which clearly shows that deallocation is the problem. My own VTune profiles look similar: they are dominated by for_deallocate and related routines in libifcoremd. The scalable_free routine from libiomp5md does not appear at all, as it should if the scalable OpenMP allocator were being used.