Solved: Re: Huge performance difference with allocate/deallocate in openmp region between Linux and Windows - Page 2

Martin1 · ‎01-26-2021

The attached small piece of code shows a huge difference in performance between Linux and Windows. The non-parallel section actually runs in a comparable time. But whereas Linux scales nicely (factor 14 on an 18 core intel processor, not perfect probably due to frequency scaling?), windows actually degrades badly (by a factor of about 2 for 2 threads and 40-60 for 18 threads), so its more than 100 times slower than on Linux.

I have compiled with oneapi ifort, and options "-qopenmp -O2" and "/Qopenmp /O2" (both compiled from the command line, in windows using the oneapi command shell).

Any idea or suggestions? For example: does windows use the scalable allocator library from tbb as done in linux or do I need additionally options?

PS: Playing around with more complex code (where I have an array of array pointers, so that I can allocate and deallocate and separate loops, in which case I expect some internal global lock in the malloc implementation of the runtime system) I see that the problem is actually the deallocate, whereas the allocate behaves about the same as in Linux.

program perf_alloc

use OMP_LIB
implicit none

real(8) :: tstart
real(8), dimension(1:2) :: telaps

integer(4), parameter :: cnt = 100000000
integer(4) :: i

integer, dimension(:), pointer :: a

print *,'num threads = ', omp_get_max_threads()

! non-parallel
tstart = getTime()
do i = 1,cnt
   allocate(a(1:40))
   deallocate(a)
end do
telaps(1) = elapsedTime(tstart)

! parallel
tstart = getTime()
!$omp parallel do schedule(dynamic, 100) default(shared) private(i, a)
do i = 1,cnt
   allocate(a(1:40))
   deallocate(a)
end do
!$omp end parallel do
telaps(2) = elapsedTime(tstart)

print '(a,2f15.5,"s")', 'non/parallel: ', telaps(1:2)
print '(a,f15.5)', 'ratio = ', telaps(1)/telaps(2)


contains

function getTime() result(tstamp)
   real(8) :: tstamp

   integer(8) :: cnt, cntRate
   real(8) :: tdouble

   call system_clock(cnt, cntRate)
   tdouble = real(cnt,8) / real(cntRate,8)
   tstamp = real(tdouble,8)
end function getTime


function elapsedTime(tstart) result(telapsed)
   real(8) :: telapsed
   real(8), intent(in) :: tstart

   integer(8) :: cnt, cntRate
   real(8) :: tdouble

   call system_clock(cnt, cntRate)
   tdouble = real(cnt,8) / real(cntRate,8)
   telapsed = real(tdouble,8) - tstart
end function elapsedTime

end program perf_alloc

John_Young · ‎01-27-2021

Martin,

I played around with your program and you are I also observe that the linking method does not seem to matter. So, the issues I described may not be the exact same issue. I do agree that the slowdown has something to do with blocking in the threaded section.

So, I tried using mkl_alloc/mkl_free instead of using the default allocate/deallocate. This method seems to have no issues within the threaded region and I see a speedup for the parallel loops. In all of our Fortran codes, we no longer call allocate/deallocate directly but wrap these in separate function calls. The separate calls can call allocate/deallocate if desired, but we can also replace these with other memory management routines (in a single place for the whole program).

Attached is my modifications that call mkl_alloc and mkl_free. My compile line was

ifort /O2 /Qmkl /Qopenmp allocate.f90

View solution in original post

Martin1 · ‎02-15-2021

Hi John,

thanks again for the help and in particular for the mkl allocate example code. That was easy to test and adapt. The mkl malloc works much better, but still falls short of what can be achieved on Linux. For that reason I also tried mimalloc, which is almost a drop in replacement to mkl malloc and really easy to compile and use. It shows much better performance on windows.

A benchmark case on my real code was like 270s (original) -> 240s (mkl) -> 200s (mimalloc + replace a few allocs by static arrays/reuse). The replacement did not yield any performance gains on Linux, though. On Linux the benchmarks run in about 190s. This was on a 9980x, 18 core desktop. Considering the fact the most of the time is spent in a few compute/memory bound routines for matrix and vector operations which ran equally fast on Linux and Windows, these numbers show how bad the problem is.

Anyway, thanks again. For anybody interested in mimalloc, here are the interfaces and an example (de)allocate for a 2d integer(4) array, derived from the mkl variant, posted above:

interface
   function mi_malloc(size) bind(c)
      use iso_c_binding
      type(c_ptr) :: mi_malloc
      integer(kind=c_size_t), value :: size
   end function mi_malloc

   subroutine mi_free(ptr) bind(c)
      use iso_c_binding
      type(c_ptr), value :: ptr
   end subroutine mi_free
end interface

subroutine mi_alloc_arr2d_int4(ptr, l1, u1, l2, u2)
   integer(4), dimension(:,:), pointer, intent(out) :: ptr
   integer,                             intent(in)  :: l1, u1, l2, u2

   integer(4), dimension(:,:), pointer :: qtr
   integer(kind=c_size_t) :: n1, n2, bs
   type(c_ptr) :: cptr
   integer(kind=c_size_t), parameter :: bytes = sizeof(0_4)

   n1 = u1 - l1 + 1
   n2 = u2 - l2 + 1
   if ((n1 < 1) .or. (n2 < 1)) then
      ! abort with an error
      stop 1
   end if
   bs = n1 * n2 * bytes
   cptr = mi_malloc(bs)
   if (.not. c_associated(cptr)) then
      ! abort with an error
      stop 2
   end if

   call c_f_pointer(cptr, qtr, shape=[n1,n2])
   ! c_f_pointer always generates a pointer with lower bounds = 1 for arrays
   ptr(l1:u1,l2:u2) => qtr
end subroutine mi_alloc_arr2d_int4


subroutine mi_dealloc_arr2d_int4(ptr)
   integer(4), dimension(:,:), pointer, intent(inout) :: ptr

   type(c_ptr) :: cptr

   if (associated(ptr)) then
      cptr = c_loc(ptr)
      call mi_free(cptr)
      ptr => null()
   end if
end subroutine mi_dealloc_arr2d_int4