Originally I thought that some compiler option or other part of my setup was wrong, but after many tests of options and variants (see "performance difference with allocate/deallocate linux and windows") I came to the conclusion that there is probably a bug in deallocate when it is called from within an OpenMP region on Windows. In any case, it is pretty obvious that deallocate always acquires at least one lock, and possibly more than one, because wrapping the deallocate in a critical section actually improves the runtime by a factor of two (a sketch of that variant follows the program listing below). For variants of the attached code that show that allocate scales while deallocate is the culprit, see the linked discussion above. The code below simply demonstrates that something is wrong with the parallel version: on an 18-core machine under Windows the parallel loop is more than 100 times slower than the serial one, whereas on Linux it is roughly 10-15 times faster.
Compile command (from the oneAPI command prompt): ifort /Qopenmp /O2 perf_alloc.f90
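The Linux side of the comparison should need nothing more than the equivalent Linux-style spelling of the same options (assuming the same compiler version):

ifort -qopenmp -O2 perf_alloc.f90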
program perf_alloc
   use OMP_LIB
   implicit none

   real(8) :: tstart
   real(8), dimension(1:2) :: telaps
   integer(4), parameter :: cnt = 100000000
   integer(4) :: i
   integer, dimension(:), pointer :: a

   print *, 'num threads = ', omp_get_max_threads()

   ! non-parallel
   tstart = getTime()
   do i = 1, cnt
      allocate(a(1:40))
      deallocate(a)
   end do
   telaps(1) = elapsedTime(tstart)

   ! parallel
   tstart = getTime()
   !$omp parallel do schedule(dynamic, 100) default(shared) private(i, a)
   do i = 1, cnt
      allocate(a(1:40))
      deallocate(a)
   end do
   !$omp end parallel do
   telaps(2) = elapsedTime(tstart)

   print '(a,2f15.5,"s")', 'non/parallel: ', telaps(1:2)
   print '(a,f15.5)', 'ratio = ', telaps(1)/telaps(2)

contains

   function getTime() result(tstamp)
      real(8) :: tstamp
      integer(8) :: cnt, cntRate
      real(8) :: tdouble
      call system_clock(cnt, cntRate)
      tdouble = real(cnt,8) / real(cntRate,8)
      tstamp = real(tdouble,8)
   end function getTime

   function elapsedTime(tstart) result(telapsed)
      real(8) :: telapsed
      real(8), intent(in) :: tstart
      integer(8) :: cnt, cntRate
      real(8) :: tdouble
      call system_clock(cnt, cntRate)
      tdouble = real(cnt,8) / real(cntRate,8)
      telapsed = real(tdouble,8) - tstart
   end function elapsedTime

end program perf_alloc
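For reference, this is what the critical-section variant mentioned above looks like (a sketch only, reconstructed from my test variants rather than copied verbatim): only the deallocate is serialized, and on Windows this already runs roughly twice as fast as the plain parallel loop, which would make no sense if deallocate took just a single lock.

!$omp parallel do schedule(dynamic, 100) default(shared) private(i, a)
do i = 1, cnt
   allocate(a(1:40))
   ! serializing only the deallocate roughly halves the Windows runtime,
   ! which points to contention inside the runtime's deallocate path itself
   !$omp critical (dealloc_guard)
   deallocate(a)
   !$omp end critical (dealloc_guard)
end do
!$omp end parallel do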
Is there nobody at Intel who cares about such an elementary problem? It should affect quite a number of people, many of whom may not even be aware of it if they never compare their run times with Linux.
To be clear: it means that any allocation within an OpenMP region has to be kept to a minimum when ifort is used on Windows, which makes any modern (i.e. Fortran 95 and later!) coding style difficult.
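The kind of restructuring I mean looks like the following minimal sketch (based on the benchmark above and assuming the workspace size is fixed): hoist the allocation out of the hot loop so that each thread allocates and deallocates only once.

!$omp parallel default(shared) private(i, a)
allocate(a(1:40))               ! one allocation per thread
!$omp do schedule(dynamic, 100)
do i = 1, cnt
   a = 0                        ! reuse the buffer instead of allocate/deallocate
end do
!$omp end do
deallocate(a)                   ! one deallocation per thread
!$omp end parallel

This sidesteps the contended deallocate almost completely, but it only works when sizes are known up front, which is exactly why having to code around the problem makes a modern allocatable/pointer style so awkward on Windows.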
And it looks like I am not the only one who has stumbled upon it:
"Extremely-poor-OpenMP-performance-in-2019-version-of-Fortran"
"Intel-Fortran-Compiler/OpenMP-threading-performance"
Note that the first thread ("Extremely-poor-...") contains VTune profiler output which clearly shows that deallocation is the problem. My own VTune profiles look similar: they are dominated by for_deallocate and related routines from libifcoremd, while the scalable_free routine from libiomp5md never shows up, even though it should.