Solved: Huge performance difference with allocate/deallocate in openmp region between Linux and Windows

Martin1 · ‎01-26-2021

The attached small piece of code shows a huge difference in performance between Linux and Windows. The non-parallel section runs in a comparable time on both. But whereas Linux scales nicely (factor 14 on an 18-core Intel processor, not perfect probably due to frequency scaling?), Windows actually degrades badly (by a factor of about 2 for 2 threads and 40-60 for 18 threads), so it's more than 100 times slower than on Linux.

I have compiled with oneapi ifort, and options "-qopenmp -O2" and "/Qopenmp /O2" (both compiled from the command line, in windows using the oneapi command shell).

Any ideas or suggestions? For example: does Windows use the scalable allocator library from TBB as is done on Linux, or do I need additional options?

PS: Playing around with more complex code (where I have an array of array pointers, so that I can allocate and deallocate in separate loops, in which case I expect some internal global lock in the malloc implementation of the runtime system), I see that the problem is actually the deallocate, whereas the allocate behaves about the same as on Linux.

program perf_alloc

use OMP_LIB
implicit none

real(8) :: tstart
real(8), dimension(1:2) :: telaps

integer(4), parameter :: cnt = 100000000
integer(4) :: i

integer, dimension(:), pointer :: a

print *,'num threads = ', omp_get_max_threads()

! non-parallel
tstart = getTime()
do i = 1,cnt
   allocate(a(1:40))
   deallocate(a)
end do
telaps(1) = elapsedTime(tstart)

! parallel
tstart = getTime()
!$omp parallel do schedule(dynamic, 100) default(shared) private(i, a)
do i = 1,cnt
   allocate(a(1:40))
   deallocate(a)
end do
!$omp end parallel do
telaps(2) = elapsedTime(tstart)

print '(a,2f15.5,"s")', 'non/parallel: ', telaps(1:2)
print '(a,f15.5)', 'ratio = ', telaps(1)/telaps(2)


contains

function getTime() result(tstamp)
   real(8) :: tstamp

   integer(8) :: cnt, cntRate
   real(8) :: tdouble

   call system_clock(cnt, cntRate)
   tdouble = real(cnt,8) / real(cntRate,8)
   tstamp = real(tdouble,8)
end function getTime


function elapsedTime(tstart) result(telapsed)
   real(8) :: telapsed
   real(8), intent(in) :: tstart

   integer(8) :: cnt, cntRate
   real(8) :: tdouble

   call system_clock(cnt, cntRate)
   tdouble = real(cnt,8) / real(cntRate,8)
   telapsed = real(tdouble,8) - tstart
end function elapsedTime

end program perf_alloc

John_Young · ‎01-26-2021

Hi,

We have run into similar problems before on Windows. First, you might want to check your thread affinity which can affect the performance significantly in some cases. I typically use

KMP_AFFINITY=granularity=fine,compact

If that does not help, then I do not know if you are encountering the same problem we had, but it seems like it. Here is our post (which did not get much feedback):

https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-threading-performance/m-p/1137925

We are not sure of the proper terminology, but what we call a 'dynamic build' means we linked with the 'multi-threaded dll' runtime libraries and what we called a 'static build' means we linked with the plain 'multi-threaded' runtime libraries. Also, our code for which we observed the behavior was a mixed C++/Fortran code and not a pure Fortran code.

The upshot is to try to link your program with the 'multi-threaded' runtime library instead of the 'multi-threaded dll' runtime library and see if the efficiency improves.

From more investigation, we concluded the following at the time (but never posted it). None of the below has been confirmed by Intel, so it is only an educated guess that could be incorrect.

Simulation Performance
1. In terms of simulation performance, the dynamic build seems to have better memory characteristics than the static build. However, the dynamic build seems to run slower than the static build.
2. The exact reason for this is not 100% understood. However, we believe that the primary reason for the differences is that the two builds use different memory managers.
3. By default, the dynamic build seems to use a memory manager that manages a single memory pool shared between all threads. So, in multi-threaded code, the dynamic-build memory manager only allows one thread at a time to request or release memory. Usually this is fast, but as memory becomes fragmented and the memory manager has to work harder to find a free block of memory, these memory management requests become slower and slower, and multi-threaded code can seemingly become almost single-threaded. For us, we could see this in large simulations, as the poor threading efficiency rarely happens at the start of the simulation but only later on as the memory becomes fragmented.
4. However, since the dynamic build seems to use a single pool of memory, it is able to easily release the memory back to the Operating System. So, the peak memory performance of our simulations was much better for the dynamic build than the static build. On the other hand, this seemingly better memory performance may not be 100% real due to memory fragmentation.
5. By default, the static build seems to use a memory manager where each thread has its own pool of memory. When multiple threads request memory at the same time, the memory manager does not have to block other threads and manage each memory request serially. Hence, the parallel efficiency of the static build seems much better throughout our whole simulation.
6. In terms of peak memory, since each thread manages its own memory pool, when a thread releases memory, the memory manager does not seem to return it to the operating system but keeps it available for when the thread makes another memory request. From the OS point-of-view, the static build has a much higher peak memory profile than that of the dynamic build.

Validation:
1. To support the idea that the parallel efficiency and memory behavior differences are due to the memory managers, we tried forcing alternative memory managers to be linked into both builds. When we forced a memory manager that uses a separate pool of memory for each thread into the dynamic build, we obtained parallel efficiency and memory behavior much more similar to the static build. We were not able to find an alternative single-pool memory manager that was easy to use, so we did not directly check the converse: forcing a memory manager that uses a single pool of memory shared by all threads into the static build.
2. We do not know exactly which memory managers are used by default for the two builds, so we could not force the exact memory manager for one into the other. So, the final behavior of the builds was not exactly like that of the default builds, only approximately similar.

Conclusions:
1. There is a fundamental trade-off in terms of parallel efficiency and peak memory performance. Per-thread memory managers do not block other threads, but keep memory resident in memory even when released by the program. Single-pool memory managers do block other threads but are much more aggressive at returning released memory to the OS to give better peak memory performance.
2. This doesn't necessarily have anything to do with whether you statically or dynamically link the underlying libraries.
3. The different performance characteristics originally seemed to be build dependent, but, on further investigation, the differences seem to be due to the fact that the two builds use different memory management schemes. What exactly these two schemes are, and why the two builds use different ones, is not known.
4. All tests were performed on Windows machines. *NIX memory managers seem to do a better job at managing memory than those of Windows, but we suspect similar issues for large enough problems.

jimdempseyatthecove · ‎01-26-2021

Martin,

The following is an experiment you could perform in short order. As to if this will work now, as it did some time in the past, I cannot say. Therefore experimentation is in order.

Rename your PROGRAM into a subroutine taking no arguments (e.g. subroutine wasPROGRAM() )
Create a project that makes a static library from your program
Add a C++ project that uses TBB with its scalable allocator.
The code doesn't have to do much except

1) interact once with the TBB scalable allocator (to assure it is hooked in properly)
2) call for_rtl_init()
3) call wasPROGRAM()
4) call for_rtl_finish()
See: https://scc.ustc.edu.cn/zlsc/tc4600/intel/2015.1.133/compiler_f/GUID-FF451765-7BD5-4A03-BE38-729CE9CB9C69.htm

The older versions of TBB (~2007) used to hook malloc and free (which are/were called from the Fortran runtime system). As to if this holds true now, I cannot say.

Good luck.

Jim Dempsey

John_Young · ‎01-26-2021

Jim,

In my post where I mentioned we switched in an alternative memory manager in the 'multi-threaded dll build runtime' and got performance that mirrored the 'multi-threaded runtime build', one of the ones we tried was the TBB. However, to accomplish this, all we did was add the following to the Windows link line:

tbbmalloc_proxy.lib /INCLUDE:"__TBB_malloc_proxy"

In my notes (from 2019), I have written that just switching to the TBB memory allocator did not change much when using the multi-threaded dll runtime library if using the default Fortran allocate/deallocate. However, using the TBB memory manager and managing memory using MKL_ALLOC and MKL_FREE produced parallel efficiency similar to using the multi-threaded run-time library.

Again, this is all from my notes from two years ago, so the details are a bit fuzzy.

Martin1 · ‎01-27-2021

Thanks for the advice. I have tried most of it (except for the advice from Jim, creating a program, using tbb and calling fortran subroutine, see below) and nothing works. Well, I see that with static linking parallel as well as non-parallel (de)allocate are a bit faster (30-40% or so). But I always end up observing that the deallocate must involve some global lock and bad contention (surprisingly enclosing the deallocate in an omp critical section also speeds up the loop by a factor of about two?!?!).

Taking one step back: There must be something fundamentally wrong. I have a really small and simple test case. I compile it with the oneapi installed console, which provides all the basic setup. This should be sufficient to compile such a simple test case (with "ifort /MT /Qopenmp /O2" or with /MD instead of /MT). It should just work as expected and not give this abysmal performance for deallocate.

Regarding TBB: I have checked on Linux and Windows. On Windows, TBB malloc is part of libiomp5md.dll, which is used by the executable (according to dumpbin). Similarly for Linux. In both cases there are lots of TBB references within the binary data of dll/so/executable. That's why I am unwilling to create a C program, using TBB etc. The TBB allocator is already on board. (KMP_SETTINGS=TRUE also shows that the setup is equivalent, so I do not know of any option which could influence the allocator except for OMP_ALLOCATOR).

John_Young · ‎01-27-2021

Martin,

I played around with your program and you are right: I also observe that the linking method does not seem to matter. So, the issues I described may not be the exact same issue. I do agree that the slowdown has something to do with blocking in the threaded section.

So, I tried using mkl_alloc/mkl_free instead of using the default allocate/deallocate. This method seems to have no issues within the threaded region and I see a speedup for the parallel loops. In all of our Fortran codes, we no longer call allocate/deallocate directly but wrap these in separate function calls. The separate calls can call allocate/deallocate if desired, but we can also replace these with other memory management routines (in a single place for the whole program).

Attached are my modifications that call mkl_alloc and mkl_free. My compile line was

ifort /O2 /Qmkl /Qopenmp allocate.f90
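
The attachment is not reproduced in this thread, so the following is only a minimal sketch of the wrapper idea described above, not John's actual code. To keep it buildable without MKL, it binds to the C library's malloc and free as a stand-in allocator; the module name mem_wrap and the routines wrap_alloc and wrap_free are hypothetical. Swapping the two bind(C) interfaces for MKL's allocation routines would be the assumed next step.

```fortran
module mem_wrap
   use, intrinsic :: iso_c_binding, only: c_ptr, c_size_t, c_int, c_loc, &
                                          c_f_pointer, c_associated
   implicit none
   private
   public :: wrap_alloc, wrap_free

   ! Stand-in allocator (assumption): the C library's malloc/free.
   ! The thread used MKL's routines here instead.
   interface
      function c_malloc(nbytes) bind(C, name='malloc') result(p)
         import :: c_ptr, c_size_t
         integer(c_size_t), value :: nbytes
         type(c_ptr) :: p
      end function c_malloc
      subroutine c_free(p) bind(C, name='free')
         import :: c_ptr
         type(c_ptr), value :: p
      end subroutine c_free
   end interface

contains

   ! Allocate n integers (n >= 1) and associate the Fortran pointer with them.
   subroutine wrap_alloc(a, n)
      integer(c_int), pointer, intent(out) :: a(:)
      integer, intent(in) :: n
      type(c_ptr) :: raw
      integer(c_size_t) :: elem_bytes

      nullify(a)
      elem_bytes = int(storage_size(0_c_int) / 8, c_size_t)   ! bytes per element
      raw = c_malloc(int(n, c_size_t) * elem_bytes)
      if (c_associated(raw)) call c_f_pointer(raw, a, [n])
   end subroutine wrap_alloc

   ! Release memory obtained from wrap_alloc; never mix with deallocate.
   subroutine wrap_free(a)
      integer(c_int), pointer, intent(inout) :: a(:)

      if (associated(a)) call c_free(c_loc(a(1)))
      nullify(a)
   end subroutine wrap_free

end module mem_wrap

program wrap_demo
   use, intrinsic :: iso_c_binding, only: c_int
   use mem_wrap
   implicit none
   integer(c_int), pointer :: a(:)
   integer :: i

   ! Same loop shape as the original test, with the wrappers swapped in.
   !$omp parallel do schedule(dynamic, 100) default(shared) private(i, a)
   do i = 1, 100000
      call wrap_alloc(a, 40)
      a = i
      call wrap_free(a)
   end do
   !$omp end parallel do

   print *, 'done'
end program wrap_demo
```

Because the memory does not come from the Fortran runtime, the variable has to be a pointer, and it must be released through wrap_free rather than deallocate.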

John_Young · ‎01-27-2021

By the way, it would be nice if Intel would address this issue so that the Fortran compiler could generate much more efficient code directly. Since Fortran is supposed to be a high-performance language for scientific and numerical codes, it seems strange to me that something as simple as managing memory in a threaded region has such poor performance without having to jump through hoops.

I do acknowledge there may be a trade-off here between peak memory and run-time, so maybe that is the reason for the current choice.

jimdempseyatthecove · ‎01-27-2021

John,

Interesting approach. A difficulty I see is that one must change ALLOCATABLEs to POINTERs.

The better approach would be for a compile time (or a runtime environment variable) option to select (or not) the TBB scalable allocator.

A reason the user might not want to use it is when desiring to conserve resources over performance.

Jim Dempsey

John_Young · ‎01-27-2021

Jim,

Yes, requiring pointers instead of allocatables is a limitation. Maybe there is some fancy-coding way around that, but I don't know off the top of my head.

I agree that sometimes the memory performance is more important than the simulation time. We have users that sometimes request the 'multi-threaded dll' version of our codes just for that reason. However, the majority of our users prefer the simulation time improvement. When there are many cores available, the loss of threading we observe in the 'multi-threaded dll' versions can result in an order of magnitude longer run times.

However, for such a simple test case like Martin's, it's a shock to see how poor the threaded loop performs. For numerical codes, this should be something that Fortran shines at right out-of-the-box.

The frustrating thing is how much of this is undocumented so that we have to speculate about what is actually happening. Since our codes show such a different behavior based on whether we use the multi-threaded or multi-threaded dll run-time libraries, I really thought Martin's problem sounded similar. However, when I tried his code (as he noted), I observed similar (poor) times for the parallel loop in both cases. So, why does it make such a difference in our codes? I don't know. Is it that we have mixed-language codes and something subtle is happening with the libraries being pulled in? I do know that our codes lose threading due to blocking at the allocate/deallocate calls and the educated guesses I wrote about previously in this thread made sense at the time. We probably spent a month's worth of work at the time trying to figure out what was going on and to find a workaround to improve the threading efficiency.

Martin1 · ‎01-27-2021

Hi John and Jim,

thanks a lot for the advice, I will try further, in particular the mkl approach. I guess that mkl uses a copy of the tbbmalloc as well, but sets it up properly. If that is the case, then maybe the C(TBB)->fortran approach from Jim might actually help, by ensuring that a proper tbbmalloc version is called. If there are several tbbmalloc copies linked, then there are probably some proxy/jump-tables or whatever, which need to be set up at some point. So using tbb or mkl first might ensure that a scalable bug-free malloc is used.

To further understand the issue I already did some timings with a variation, where I can measure allocate and deallocate separately. See code below. Using critical for both allocate and deallocate, I get the same timing, again confirming that some locking (similar to the critical locking) is involved in deallocate. If I use omp master (instead of omp critical) and without the "do schedule(...)" part, thus being in a parallel region but only with a single thread doing the allocation, then I get comparable performance for allocate and deallocate (and strangely about 30% faster than in a non-parallelised loop!). So it looks like a different deallocate code path is actually used here, as I would expect.

In my opinion this all points to a plain bug in the deallocate code of the libiomp5md library on Windows. It just does not look like a wrong version of deallocate is called, as I first assumed; otherwise allocate would not scale that well.

program perf_alloc

use OMP_LIB
implicit none

real(8) :: tstart
real(8), dimension(1:2) :: telaps

integer(4), parameter :: cnt = 10000000
integer(4) :: i

type :: t
   real, allocatable :: u
end type t
type(t), dimension(:), allocatable :: a

print *,'num threads = ', omp_get_max_threads()

allocate(a(1:cnt))

! allocate
tstart = getTime()
!$omp parallel do schedule(dynamic, 1000) default(shared) private(i)
do i = 1,cnt
!$omp critical (aaa)
   allocate(a(i)%u)
!$omp end critical (aaa)
end do
!$omp end parallel do
telaps(1) = elapsedTime(tstart)

! deallocate
tstart = getTime()
!$omp parallel do schedule(dynamic, 1000) default(shared) private(i)
do i = 1,cnt
!$omp critical (ddd)
   deallocate(a(i)%u)
!$omp end critical (ddd)
end do
!$omp end parallel do
telaps(2) = elapsedTime(tstart)


print '(a,2f15.5,"s")', 'alloc, dealloc: ', telaps(1:2)
print '(a,2f15.5)', 'ratio = ', telaps(1)/telaps(2), telaps(2)/telaps(1)

deallocate(a)

contains

function getTime() result(tstamp)
   real(8) :: tstamp

   integer(8) :: cnt, cntRate
   real(8) :: tdouble

   call system_clock(cnt, cntRate)
   tdouble = real(cnt,8) / real(cntRate,8)
   tstamp = real(tdouble,8)
end function getTime


function elapsedTime(tstart) result(telapsed)
   real(8) :: telapsed
   real(8), intent(in) :: tstart

   integer(8) :: cnt, cntRate
   real(8) :: tdouble

   call system_clock(cnt, cntRate)
   tdouble = real(cnt,8) / real(cntRate,8)
   telapsed = real(tdouble,8) - tstart
end function elapsedTime

end program perf_alloc

Martin1 · ‎01-28-2021

I tried the cpp(tbb)->fortran approach, but it does not seem to work. For completeness sake, below is the source code, which hopefully executes Jim's suggested steps as intended. I compiled on the oneapi console with

ifort /MT /Qopenmp /O2 fff.f90 /c

icl /O2 mmm.cpp fff.obj /o m.exe tbbmalloc_proxy.lib /INCLUDE:"__TBB_malloc_proxy"

output is: (alloc/dealloc ratio should be roughly in the order of 1)

Success: free (ucrtbase.dll), byte pattern: <C7442410000000008B4424>
Success: _msize (ucrtbase.dll), byte pattern: <E90B000000CCCCCCCCCCCC>
Success: _aligned_free (ucrtbase.dll), byte pattern: <4883EC284885C9740D4883>
Success: _aligned_msize (ucrtbase.dll), byte pattern: <48895C2408574883EC2049>
Success: _o_free (ucrtbase.dll), byte pattern: <488BD1488D0DA667FFFFE9>
Success: _free_base (ucrtbase.dll), byte pattern: <4883EC284885C9741A4C8B>
=====
 hallo from fff          18
 num threads =           18
alloc, dealloc:         0.01700        6.84700s
ratio =         0.00248      402.76565

Here are fff.f90 and mmm.cpp:

module fff

use OMP_LIB

implicit none
private

public fff_entry

type :: t
   real, allocatable :: u
end type t


contains


subroutine fff_entry() bind(C, name='fff_entry')
   print *,'hallo from fff', omp_get_max_threads()
   call perf_alloc()
end subroutine fff_entry


subroutine perf_alloc()
   real(8) :: tstart
   real(8), dimension(1:3) :: telaps

   integer, parameter :: cnt = 10000000
   integer :: i

   type(t), dimension(:), allocatable :: a

   print *,'num threads = ', omp_get_max_threads()

   allocate(a(1:cnt))

   ! allocate
   tstart = getTime()
   !$omp parallel do schedule(dynamic, 1000) default(shared) private(i)
   do i = 1,cnt
      allocate(a(i)%u)
   end do
   !$omp end parallel do
   telaps(1) = elapsedTime(tstart)

   ! deallocate
   tstart = getTime()
   !$omp parallel do schedule(dynamic, 1000) default(shared) private(i)
   do i = 1,cnt
      deallocate(a(i)%u)
   end do
   !$omp end parallel do
   telaps(2) = elapsedTime(tstart)


   print '(a,2(f15.5,"s"))', 'alloc, dealloc: ', telaps(1:2)
   print '(a,2f15.5)', 'ratio = ', telaps(1)/telaps(2), telaps(2)/telaps(1)

   deallocate(a)
end subroutine perf_alloc


function getTime() result(tstamp)
   real(8) :: tstamp

   integer(8) :: cnt, cntRate
   real(8) :: tdouble

   call system_clock(cnt, cntRate)
   tdouble = real(cnt,8) / real(cntRate,8)
   tstamp = real(tdouble,8)
end function getTime


function elapsedTime(tstart) result(telapsed)
   real(8) :: telapsed
   real(8), intent(in) :: tstart

   integer(8) :: cnt, cntRate
   real(8) :: tdouble

   call system_clock(cnt, cntRate)
   tdouble = real(cnt,8) / real(cntRate,8)
   telapsed = real(tdouble,8) - tstart
end function elapsedTime


end module fff

#include <stdio.h>
#include <stdlib.h>
#include "tbb/tbbmalloc_proxy.h"

extern "C" {
  void fff_entry();
}

extern "C" {
  void for_rtl_init_(int *, char **);
  int for_rtl_finish_();
};

int main(int argc, char **argv) {
  for_rtl_init_(&argc, argv);

  // from: https://www.threadingbuildingblocks.org/docs/help/reference/memory_allocation/TBB_malloc_replacement_log_func.html
  char **func_replacement_log;
  int func_replacement_status = TBB_malloc_replacement_log(&func_replacement_log);
  for (char** log_string = func_replacement_log; *log_string != 0; log_string++) {
    printf("%s\n",*log_string);
  }
  if (func_replacement_status != 0) {
    printf("tbbmalloc_proxy cannot replace memory allocation routines\n");
  }

  double *a = (double*) malloc(100 * sizeof(double));
  free(a);
  int i;

  printf("=====\n");
  fff_entry();

  int fstat = for_rtl_finish_();

  return 0;
}

Martin1 · ‎01-28-2021

The mkl wrappers work, so I might apply this hack.

However, besides the pointer<-> allocatable matter, it has some more drawbacks (but it might do to remove the worst of the performance loss I have seen). First, for each type, I need my own routine, at least for allocate. For deallocate a class(*) variant might do, maybe. Furthermore, source as well as mold arguments, default initial values for derived-types etc are possibly difficult, and using it with larger class-trees might be a nightmare. But at least it is a work-around. So thanks a lot!!

John_Young · ‎01-28-2021

Martin,

I also could not get Jim's approach of using TBB to mitigate the problem.

The only approach that seems to work for me was to call mkl_alloc and mkl_free instead of allocate/deallocate. This does not require pulling in TBB separately. However, as Jim noted, you need to be using pointers instead of allocatables.

I also played around with your suggestion to use critical sections around the deallocate. I was able to reproduce the modest speedup in your test program doing this. However, when I tried to do it in our full codes, it did not produce any change in timings, so maybe it has to do with the large loops in your test program where very little work is being done otherwise.

John_Young · ‎01-28-2021

Yes, we created a separate module for our 'memory-management' routines. We have specific allocate/deallocate routines for 1D and 2D integer arrays and 1D and 2D real and complex arrays (both single and double precision). You can wrap all these functions in a single allocate generic interface and deallocate generic interface, so your calling routines call the same function name.
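
As a rough illustration of the generic-interface layout described above (not John's actual module), the sketch below wraps a few type- and rank-specific routines behind two generic names. The names mem_mgmt, mem_alloc and mem_free are hypothetical, and the specific routines simply forward to allocate/deallocate; an MKL-backed variant would need pointer dummies instead of allocatables.

```fortran
module mem_mgmt
   implicit none
   private
   public :: mem_alloc, mem_free

   ! One generic name per operation; the compiler picks the specific
   ! routine from the type and rank of the actual argument.
   interface mem_alloc
      module procedure alloc_i1, alloc_r1, alloc_r2
   end interface mem_alloc

   interface mem_free
      module procedure free_i1, free_r1, free_r2
   end interface mem_free

contains

   subroutine alloc_i1(a, n)
      integer, allocatable, intent(out) :: a(:)
      integer, intent(in) :: n
      allocate(a(n))
   end subroutine alloc_i1

   subroutine alloc_r1(a, n)
      real(8), allocatable, intent(out) :: a(:)
      integer, intent(in) :: n
      allocate(a(n))
   end subroutine alloc_r1

   subroutine alloc_r2(a, n1, n2)
      real(8), allocatable, intent(out) :: a(:,:)
      integer, intent(in) :: n1, n2
      allocate(a(n1, n2))
   end subroutine alloc_r2

   subroutine free_i1(a)
      integer, allocatable, intent(inout) :: a(:)
      if (allocated(a)) deallocate(a)
   end subroutine free_i1

   subroutine free_r1(a)
      real(8), allocatable, intent(inout) :: a(:)
      if (allocated(a)) deallocate(a)
   end subroutine free_r1

   subroutine free_r2(a)
      real(8), allocatable, intent(inout) :: a(:,:)
      if (allocated(a)) deallocate(a)
   end subroutine free_r2

end module mem_mgmt

program mem_demo
   use mem_mgmt
   implicit none
   integer, allocatable :: idx(:)
   real(8), allocatable :: x(:), m(:,:)

   ! Callers use the same two names regardless of type or rank.
   call mem_alloc(idx, 10)
   call mem_alloc(x, 20)
   call mem_alloc(m, 3, 4)
   print *, size(idx), size(x), shape(m)

   call mem_free(idx)
   call mem_free(x)
   call mem_free(m)
end program mem_demo
```

The point of the layout is that the allocation strategy can later be changed inside the module without touching any caller.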

Martin1 · ‎01-28-2021

Hi John,

I was thinking along these lines. We already have macros to automatically add error checks etc. so with generic interfaces for the most common cases it should do. But the more complex OOP code might be difficult, and mixing fortran/mkl allocs might be error prone. I will see.

jimdempseyatthecove · ‎01-28-2021

Martin,

The TBB scalable allocator uses threadprivate structures to hold previously allocated, then freed, data (allocations are performed in coarse granularities). This algorithm works best when the free is performed by the thread that performed the allocation (otherwise the free has to store, with interlocking, into a different thread's private collection).

As an experiment, change your schedule clauses (both loops) from dynamic to static to see what happens.

If that corrects the performance issue on the free (deallocate) .AND. you really require dynamic scheduling, then you may need to modify your structures (containing the allocatables) to track the thread number that performed the allocation. Then you would restructure your deallocation loop for all threads to scan the entire collection and filter the deallocates.

Jim Dempsey

Martin1 · ‎01-28-2021

Hi Jim,

due to a valgrind warning "conditional jump depends on uninitialised value" within the tbbmalloc routines, I studied the C code of tbbmalloc closely and I understand the problems. That's why I also checked static versus dynamic, but surprisingly, I do not see any timing differences, neither on Linux nor on Windows.

I have been using dynamic rather than static because 10 years ago or so I saw bad micro-stuttering in a linear solver probably caused by a combination of static scheduling, core pinning and the kernel doing some work on a core, which as a consequence got blocked and could not finish its work chunk. Dynamic scheduling easily solved it.

jimdempseyatthecove · ‎01-28-2021

I am not suggesting you move away from dynamic scheduling. The static schedule test was simply a way to see whether the deallocate slowdown was caused by the thread performing the deallocate not being the thread that performed the allocation. If so, then you could attack this problem in an alternate manner (tag the allocates with the OpenMP thread number).

I am making an assumption that by using two loops:

{allocate, possibly do some work}
intervening code... more work
{deallocate, no work}

That the deallocate loop does no work other than deallocation.

Because of this, you can use an !$omp parallel without the DO and have all threads iterate the full range of the objects with the allocated members and selectively perform the deallocate should the thread identify the allocation having been performed by itself (i.e the allocation loop saves the thread number that performed the allocation). Note, while this is redundant work, the redundancy is likely all L1 and L2 hits (and no critical sections).

Jim Dempsey
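
A minimal sketch of the owner-tagging scheme Jim describes might look as follows. The owner field and the program name are hypothetical, and the sketch assumes that both parallel regions run with the same number of threads (otherwise some entries would stay allocated until the final deallocate). It needs to be compiled with OpenMP enabled.

```fortran
program owner_tagged_free
   use omp_lib
   implicit none

   type :: item
      real, allocatable :: u
      integer :: owner = -1   ! hypothetical tag: thread that allocated u
   end type item

   integer, parameter :: n = 100000
   type(item), allocatable :: a(:)
   integer :: i, me, nfreed

   allocate(a(n))

   ! Allocation keeps dynamic scheduling; each entry records its allocator.
   !$omp parallel do schedule(dynamic, 1000) default(shared) private(i)
   do i = 1, n
      allocate(a(i)%u)
      a(i)%owner = omp_get_thread_num()
   end do
   !$omp end parallel do

   ! Deallocation: no work-sharing DO. Every thread scans the full range
   ! and frees only the entries it allocated itself.
   nfreed = 0
   !$omp parallel default(shared) private(i, me) reduction(+:nfreed)
   me = omp_get_thread_num()
   do i = 1, n
      if (a(i)%owner == me) then
         deallocate(a(i)%u)
         nfreed = nfreed + 1
      end if
   end do
   !$omp end parallel

   print *, 'freed ', nfreed, ' of ', n
   deallocate(a)
end program owner_tagged_free
```

The scan is redundant across threads, but it only reads the tags, so the cost is mostly cache traffic rather than allocator calls.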

Martin1 · ‎01-28-2021

Hi Jim,

I am aware of this issue, the allocating thread should also deallocate to avoid requiring a lock. That was one reason why the first test code I posted used a thread private variable, which was allocated and immediately deallocated by the same thread. In that case (in particular if only a small amount of memory is used), no locking should be involved. It should all happen in a thread private memory block.

Just to be sure I tested your suggestion (I have done something similar a couple of years ago), with a really surprising result. On Linux, if I take care that the block is released by a different thread, it actually performs better. Variations in timing are significant, but averaged over 10-20 runs, it is pretty consistent (like 0.55s for allocate, and 0.9s[different] and 1.1s[same] for deallocate). I used an array of 800 bytes but the size does not matter much; even with a scalar and a much greater cnt I see the same behaviour. Below is the code, where "mod(i,nbt) == it" in the deallocate loop ensures that the same thread deallocates, whereas the "mod(i+1,nbt) == it" variant ensures that a different thread deallocates. This second version performs faster... For whatever reason.

program perf_alloc

use OMP_LIB
implicit none

real(8) :: tstart
real(8), dimension(1:2) :: telaps

integer(4), parameter :: cnt = 10000000
integer(4) :: i, nbt, it

type :: t
   real(8), dimension(:), allocatable :: u
end type t
type(t), dimension(:), allocatable :: a

nbt = omp_get_max_threads()
print *,'num threads = ', nbt

allocate(a(1:cnt))

! allocate
tstart = getTime()
!$omp parallel default(shared) private(i, it)
it = omp_get_thread_num()
do i = 1,cnt
   if (mod(i,nbt) == it) then
      allocate(a(i)%u(1:100))
   end if
end do
!$omp end parallel
telaps(1) = elapsedTime(tstart)

! deallocate
tstart = getTime()
!$omp parallel default(shared) private(i, it)
it = omp_get_thread_num()
do i = 1,cnt
   !if (mod(i,nbt) == it) then
   if (mod(i+1,nbt) == it) then   ! slightly faster
      deallocate(a(i)%u)
   end if
end do
!$omp end parallel
telaps(2) = elapsedTime(tstart)


print '(a,2(f15.5,"s"))', 'alloc, dealloc: ', telaps(1:2)
print '(a,2f15.5)', 'ratio = ', telaps(1)/telaps(2), telaps(2)/telaps(1)

deallocate(a)

contains

function getTime() result(tstamp)
   real(8) :: tstamp

   integer(8) :: cnt, cntRate
   real(8) :: tdouble

   call system_clock(cnt, cntRate)
   tdouble = real(cnt,8) / real(cntRate,8)
   tstamp = real(tdouble,8)
end function getTime


function elapsedTime(tstart) result(telapsed)
   real(8) :: telapsed
   real(8), intent(in) :: tstart

   integer(8) :: cnt, cntRate
   real(8) :: tdouble

   call system_clock(cnt, cntRate)
   tdouble = real(cnt,8) / real(cntRate,8)
   telapsed = real(tdouble,8) - tstart
end function elapsedTime

end program perf_alloc

jimdempseyatthecove · ‎01-29-2021

It may be the case that deallocate has twice as much work to do as allocate.

10 million deallocates taking ~1 second is ~100ns per deallocate. A few hundred instructions.

If you need faster timings (who doesn't), can you keep the allocations and only realloc for expansion (in other words, reuse the prior allocation if it is >= the new requirement)?

If need be, you can carry the allocation with TARGET, and return a pointer containing the desired subscripts.

Jim Dempsey
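
The reuse idea above could be sketched roughly like this. The type scratch_t and the routine scratch_view are hypothetical names, and the sketch assumes that old contents may be discarded when the buffer grows. In a threaded loop each thread would hold its own scratch_t.

```fortran
module scratch_mod
   implicit none
   private
   public :: scratch_t, scratch_view

   type :: scratch_t
      real(8), allocatable :: buf(:)   ! grows on demand, never shrinks
   end type scratch_t

contains

   ! Point p at the first n elements of the buffer, reallocating only
   ! when the current capacity is too small.
   subroutine scratch_view(s, n, p)
      type(scratch_t), target, intent(inout) :: s
      integer, intent(in) :: n
      real(8), pointer, intent(out) :: p(:)

      if (.not. allocated(s%buf)) then
         allocate(s%buf(n))
      else if (size(s%buf) < n) then
         deallocate(s%buf)             ! old contents are discarded
         allocate(s%buf(n))
      end if
      p => s%buf(1:n)
   end subroutine scratch_view

end module scratch_mod

program reuse_demo
   use scratch_mod
   implicit none
   type(scratch_t), target :: s
   real(8), pointer :: p(:)
   integer :: k

   ! Requests of varying size; the allocator is only hit on growth.
   do k = 1, 5
      call scratch_view(s, 10 * mod(k, 3) + 1, p)
      p = real(k, 8)
      print *, 'request', size(p), 'capacity', size(s%buf)
   end do
end program reuse_demo
```

The caller's variable carries the TARGET attribute so that the returned pointer stays valid after the call, which is the role TARGET plays in Jim's suggestion.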

Martin1 · ‎01-30-2021

Hi Jim,
thanks for the maths, it shows that deallocate is actually faster than I thought based on my assembly debugging. It felt like there were way more instructions to execute.
Anyway, except for tight loops, where alloc overhead becomes measurable and we are using the reuse/resize technique you describe (e.g. providing reset methods for primitive fast hashmaps), performance of alloc is absolutely fine. Except if some lock on Windows brings the whole computation to almost a stop. For complex data structures (like a facet and cell based linked 3d Delaunay mesh) I would rather avoid having my own memory management layer. We have done that too (somewhat resembling a scalable allocator but special purpose and thus much simpler). I hope that someone at Intel cares about such a performance bug...