The attached small piece of code shows a huge difference in performance between Linux and Windows. The non-parallel section runs in a comparable time on both systems. But whereas Linux scales nicely (a factor of 14 on an 18-core Intel processor; not perfect, probably due to frequency scaling?), Windows actually degrades badly (by a factor of about 2 for 2 threads and 40-60 for 18 threads), so it is more than 100 times slower than on Linux.
I compiled with oneAPI ifort and the options "-qopenmp -O2" (Linux) and "/Qopenmp /O2" (Windows), in both cases from the command line; on Windows I used the oneAPI command shell.
Any ideas or suggestions? For example: does Windows use the scalable allocator library from TBB, as is done on Linux, or do I need additional options?
PS: Playing around with more complex code (where I have an array of array pointers, so that I can allocate and deallocate in separate loops, in which case I expect some internal global lock in the malloc implementation of the runtime system), I see that the problem is actually the deallocate, whereas the allocate behaves about the same as on Linux.
program perf_alloc
  use OMP_LIB
  implicit none
  real(8) :: tstart
  real(8), dimension(1:2) :: telaps
  integer(4), parameter :: cnt = 100000000
  integer(4) :: i
  integer, dimension(:), pointer :: a

  print *,'num threads = ', omp_get_max_threads()

  ! non-parallel
  tstart = getTime()
  do i = 1,cnt
    allocate(a(1:40))
    deallocate(a)
  end do
  telaps(1) = elapsedTime(tstart)

  ! parallel
  tstart = getTime()
  !$omp parallel do schedule(dynamic, 100) default(shared) private(i, a)
  do i = 1,cnt
    allocate(a(1:40))
    deallocate(a)
  end do
  !$omp end parallel do
  telaps(2) = elapsedTime(tstart)

  print '(a,2f15.5,"s")', 'non/parallel: ', telaps(1:2)
  print '(a,f15.5)', 'ratio = ', telaps(1)/telaps(2)

contains

  ! wall-clock time stamp in seconds, based on system_clock
  function getTime() result(tstamp)
    real(8) :: tstamp
    integer(8) :: cnt, cntRate
    real(8) :: tdouble
    call system_clock(cnt, cntRate)
    tdouble = real(cnt,8) / real(cntRate,8)
    tstamp = real(tdouble,8)
  end function getTime

  ! seconds elapsed since the time stamp tstart
  function elapsedTime(tstart) result(telapsed)
    real(8) :: telapsed
    real(8), intent(in) :: tstart
    integer(8) :: cnt, cntRate
    real(8) :: tdouble
    call system_clock(cnt, cntRate)
    tdouble = real(cnt,8) / real(cntRate,8)
    telapsed = real(tdouble,8) - tstart
  end function elapsedTime

end program perf_alloc
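The PS above mentions a variant with an array of array pointers, so that allocate and deallocate can be timed in separate loops. That code was not posted; the following is only a minimal sketch of what such a test might look like. The type name `int_block`, the program name, and the block count `nblk` are assumptions chosen for illustration. It compiles with or without OpenMP, since it only uses directives.

```fortran
program perf_alloc_split
  implicit none

  ! Wrapper type (hypothetical name) so that an array of array pointers can be held.
  type :: int_block
    integer, dimension(:), pointer :: p => null()
  end type int_block

  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nblk = 200000   ! assumed block count, adjust as needed
  type(int_block), dimension(:), allocatable :: blocks
  integer :: i
  real(dp) :: t0, t1, t2

  allocate(blocks(nblk))

  ! Phase 1: only allocations inside the parallel loop
  t0 = wtime()
  !$omp parallel do schedule(dynamic, 100) default(shared) private(i)
  do i = 1, nblk
    allocate(blocks(i)%p(1:40))
  end do
  !$omp end parallel do
  t1 = wtime()

  ! Phase 2: only deallocations inside the parallel loop
  !$omp parallel do schedule(dynamic, 100) default(shared) private(i)
  do i = 1, nblk
    deallocate(blocks(i)%p)
  end do
  !$omp end parallel do
  t2 = wtime()

  print '(a,f12.5," s")', 'allocate:   ', t1 - t0
  print '(a,f12.5," s")', 'deallocate: ', t2 - t1

contains

  ! Wall-clock time in seconds, based on system_clock
  function wtime() result(t)
    real(dp) :: t
    integer(i8) :: cnt, rate
    call system_clock(cnt, rate)
    t = real(cnt, dp) / real(rate, dp)
  end function wtime

end program perf_alloc_split
```

With a dynamic schedule a block may be freed by a different thread than the one that allocated it, which is probably close to what happens in real code.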
Martin,
I played around with your program, and I also observe that the linking method does not seem to matter. So the issues I described may not be exactly the same issue. I do agree that the slowdown has something to do with blocking in the threaded section.
So I tried using mkl_malloc/mkl_free instead of the default allocate/deallocate. This method seems to have no issues within the threaded region, and I see a speedup for the parallel loops. In all of our Fortran codes, we no longer call allocate/deallocate directly but wrap these in separate function calls. The separate calls can call allocate/deallocate if desired, but we can also replace these with other memory management routines (in a single place for the whole program).
Attached are my modifications that call mkl_malloc and mkl_free. My compile line was
ifort /O2 /Qmkl /Qopenmp allocate.f90
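As a rough illustration of the wrapper idea described above (this is not the attached file), a minimal sketch could look like the following. The module name `mem_wrap` and the routine names `mem_alloc_int1d` and `mem_free_int1d` are hypothetical. The bodies use plain allocate/deallocate here; the point is only that another allocator could be swapped in at this one place.

```fortran
module mem_wrap   ! hypothetical module name
  implicit none
  private
  public :: mem_alloc_int1d, mem_free_int1d

contains

  ! Allocate ptr(lb:ub). To switch allocators, only this body has to change.
  subroutine mem_alloc_int1d(ptr, lb, ub)
    integer, dimension(:), pointer, intent(out) :: ptr
    integer, intent(in) :: lb, ub
    integer :: istat
    allocate(ptr(lb:ub), stat=istat)
    if (istat /= 0) error stop 'mem_alloc_int1d: allocation failed'
  end subroutine mem_alloc_int1d

  ! Release memory obtained from mem_alloc_int1d; does nothing if ptr is not associated.
  subroutine mem_free_int1d(ptr)
    integer, dimension(:), pointer, intent(inout) :: ptr
    if (associated(ptr)) then
      deallocate(ptr)
      ptr => null()
    end if
  end subroutine mem_free_int1d

end module mem_wrap

program mem_wrap_demo
  use mem_wrap
  implicit none
  integer, dimension(:), pointer :: a
  integer :: i

  nullify(a)
  call mem_alloc_int1d(a, 0, 39)
  do i = lbound(a, 1), ubound(a, 1)
    a(i) = i
  end do
  print *, 'bounds: ', lbound(a, 1), ubound(a, 1), ' sum: ', sum(a)
  call mem_free_int1d(a)
  print *, 'associated after free: ', associated(a)
end program mem_wrap_demo
```

Callers never see which allocator is in use, so a change of allocator does not touch the rest of the program.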
Hi John,
thanks again for the help, and in particular for the MKL allocate example code. That was easy to test and adapt. The MKL malloc works much better, but still falls short of what can be achieved on Linux. For that reason I also tried mimalloc, which is almost a drop-in replacement for the MKL malloc and really easy to compile and use. It shows much better performance on Windows.
A benchmark case on my real code went from about 270 s (original) to 240 s (MKL) to 200 s (mimalloc, plus replacing a few allocations with static arrays or reuse). The replacement did not yield any performance gain on Linux, though, where the benchmark runs in about 190 s. This was on a 9980x, an 18-core desktop. Considering that most of the time is spent in a few compute/memory-bound routines for matrix and vector operations, which run equally fast on Linux and Windows, these numbers show how bad the problem is.
Anyway, thanks again. For anybody interested in mimalloc, here are the interfaces and an example (de)allocate for a 2D integer(4) array, derived from the MKL variant posted above:
interface
  function mi_malloc(size) bind(c)
    use iso_c_binding
    type(c_ptr) :: mi_malloc
    integer(kind=c_size_t), value :: size
  end function mi_malloc
  subroutine mi_free(ptr) bind(c)
    use iso_c_binding
    type(c_ptr), value :: ptr
  end subroutine mi_free
end interface

subroutine mi_alloc_arr2d_int4(ptr, l1, u1, l2, u2)
  use iso_c_binding
  integer(4), dimension(:,:), pointer, intent(out) :: ptr
  integer, intent(in) :: l1, u1, l2, u2
  integer(4), dimension(:,:), pointer :: qtr
  integer(kind=c_size_t) :: n1, n2, bs
  type(c_ptr) :: cptr
  ! sizeof is a compiler extension (supported by ifort)
  integer(kind=c_size_t), parameter :: bytes = sizeof(0_4)
  n1 = u1 - l1 + 1
  n2 = u2 - l2 + 1
  if ((n1 < 1) .or. (n2 < 1)) then
    ! abort with an error
    stop 1
  end if
  bs = n1 * n2 * bytes
  cptr = mi_malloc(bs)
  if (.not. c_associated(cptr)) then
    ! abort with an error
    stop 2
  end if
  call c_f_pointer(cptr, qtr, shape=[n1,n2])
  ! c_f_pointer always generates a pointer with lower bounds = 1 for arrays.
  ! Set only the lower bounds here; the upper bounds u1, u2 follow from the shape.
  ! (The form ptr(l1:u1,l2:u2) => qtr is a bounds remapping, which the standard
  ! only allows for a rank-1 or simply contiguous target.)
  ptr(l1:, l2:) => qtr
end subroutine mi_alloc_arr2d_int4

subroutine mi_dealloc_arr2d_int4(ptr)
  use iso_c_binding
  integer(4), dimension(:,:), pointer, intent(inout) :: ptr
  type(c_ptr) :: cptr
  if (associated(ptr)) then
    ! c_loc yields the address of the first element, i.e. the address returned by mi_malloc
    cptr = c_loc(ptr)
    call mi_free(cptr)
    ptr => null()
  end if
end subroutine mi_dealloc_arr2d_int4
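To check the bounds handling without having mimalloc installed, here is a small self-contained sketch. It assumes the C library `malloc`/`free` as stand-ins for `mi_malloc`/`mi_free`; the names `c_malloc`, `c_free` and `bounds_demo` are made up for this example. It shows that `c_f_pointer` yields lower bounds of 1 and that a pointer assignment with lower bounds shifts them while still addressing the same memory.

```fortran
program bounds_demo
  use, intrinsic :: iso_c_binding
  implicit none

  interface
    ! C library malloc/free, used here only as stand-ins for mi_malloc/mi_free
    function c_malloc(size) bind(c, name='malloc') result(p)
      import :: c_ptr, c_size_t
      type(c_ptr) :: p
      integer(c_size_t), value :: size
    end function c_malloc
    subroutine c_free(p) bind(c, name='free')
      import :: c_ptr
      type(c_ptr), value :: p
    end subroutine c_free
  end interface

  integer(c_int), dimension(:,:), pointer :: qtr, ptr
  type(c_ptr) :: cptr
  integer(c_size_t) :: nbytes

  ! Room for a 3 x 4 array of C ints
  nbytes = 3_c_size_t * 4_c_size_t * int(storage_size(0_c_int) / 8, c_size_t)
  cptr = c_malloc(nbytes)
  if (.not. c_associated(cptr)) error stop 'malloc failed'

  ! c_f_pointer always produces lower bounds of 1
  call c_f_pointer(cptr, qtr, shape=[3, 4])
  print *, 'after c_f_pointer, lower/upper: ', lbound(qtr), ubound(qtr)

  ! Shift the lower bounds to -1 and 0; the extents (3 and 4) stay the same
  ptr(-1:, 0:) => qtr
  print *, 'after re-bounding, lower/upper: ', lbound(ptr), ubound(ptr)

  ! Both pointers refer to the same memory
  ptr = 0
  ptr(-1, 0) = 42
  print *, 'qtr(1,1) = ', qtr(1, 1)
  print *, 'same address as malloc result: ', c_associated(c_loc(ptr(-1, 0)), cptr)

  call c_free(cptr)
  nullify(ptr, qtr)
end program bounds_demo
```

The first element of the re-bounded pointer sits at the address returned by the allocator, which is why `c_loc(ptr)` can be handed to the free routine in the deallocate wrapper.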