The use of kmp_malloc and kmp_free functions in a Fortran code

volosch · ‎02-08-2017

In Intel Guide for Developing Multithreaded Application, p.57, the use of two functions: kmp_malloc and kmp_free is recommended. These functions maintain a per-thread heap attached to each thread utilized by OpenMP and avoid the use of the lock that protects access to the standard system heap.

Unfortunately, no example of the use of kmp_malloc and kmp_free functions was given.

Please, present an example of a dummy Fortran code, where these functions are used.

jimdempseyatthecove · ‎02-08-2017

See: https://www.atmos.washington.edu/~ovens/junk/ifc7docs/f_ug/par_ext.htm

The return of kmp_malloc is a kmp_pointer_kind, which in turn is c_intptr_t. Once you have the C pointer you then call C_F_POINTER to convert it into the appropriate Fortran pointer or pointer to type or pointer to array descriptor.

I agree with your sentiments that the IVF documentation needs to be fleshed out with good examples.

Jim Dempsey

volosch · ‎02-10-2017

Dear Mr. Jim Dempsey,

Thank you for your help!

Your comment helps me to construct a dummy Fortran code, where kmp_malloc function is invoked:

PROGRAM TEST_OPENMP

USE,INTRINSIC::ISO_C_BINDING

IMPLICIT NONE

#ifdef _OPENMP

INCLUDE 'omp_lib.h'

#endif

integer(8), pointer :: kmp_pointer_kind_f

INTEGER(8), TARGET :: KMP

LOGICAL STATUS

INTEGER :: NTHR, STACKSIZE

! Pointer assignment

kmp_pointer_kind_f => KMP

STATUS = ASSOCIATED (kmp_pointer_kind_f)

write(*,1005) status

1005 format(1x,'status=',L3)

CALL C_F_POINTER(C_LOC(kmp_pointer_kind), kmp_pointer_kind_f)

STATUS = ASSOCIATED (kmp_pointer_kind_f)

write(*,1005) status

write(*,1006) kmp_pointer_kind_f

1006 format(1x,'kmp_pointer_kind_f=',i20)

#ifdef _OPENMP

kmp=kmp_malloc(4194304)

write(*,1007) kmp

1007 format(1x,'kmp=',i20)

#endif

#ifdef _OPENMP

!$omp parallel

NTHR = OMP_GET_NUM_THREADS()

STACKSIZE=KMP_GET_STACKSIZE()

!$omp end parallel

#endif

WRITE(*,1002)NTHR

1002 FORMAT(1X,'Number of threads=',I2)

WRITE(*,1003)STACKSIZE

1003 FORMAT(1X,'The stack size by thread, bytes, =',I10)

STOP

END PROGRAM TEST_OPENMP

It was compiled under Windows 10, installed on Intel i7-4960X based (6 cores, 12 threads) PC, by Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.0.109.

The following listing is generated after entering:

test_openmp > test_openmp.out 2>&1:

status= T

kmp_pointer_kind_f= 282643207815176

kmp= 1920034931072

Number of threads=12

The stack size by thread, bytes, = 4194304

Explain, please, is this call of kmp_malloc function sufficient to maintain a per-thread heap attached to each thread utilized by OpenMP and avoid the use of the lock that protects access to the standard system heap?

Is there any tool to display amount of heap memory attached to each thread utilized by OpenMP?

Andrei Voloshchenko

jimdempseyatthecove · ‎02-10-2017

I will have to defer the implementation details of kmp_malloc heap design to those who designed the code. I will make this educated guess:

In one of the Fortran applications I am involved with, it used OpenMP nested Task levels. This application suffered a crash situation. In looking at the traceback information, I found that the (Fortran) OpenMP task implementation made use of the TBB concurrent memory management/heap functions. With this information at hand, I strongly suspect that kmp_malloc/kmp_free makes use of the TBB concurrent memory management/heap functions. If so, then, look at the TBB documentation relating to heap management functions, define an appropriate interface block, and make a call. If the linker resolves the address, and if calls notices changes as you kmp_malloc, then the suspicion is correct.

Jim Dempsey

Alexander_R_3 · ‎02-11-2017

Dear Jim, would you be so kind as to clarify the usage of kmp_malloc? On the dedicated page we could read, that it is used to enable threads to allocate memory from a heap local to each thread. The principal question is, where it should be called, in usual region or by each thread in OpenMP-parallel region. Further questions follows from the principal one, e.g. how we should treat pointer, returned by it, in usual region or by each thread in OpenMP-parallel region. The name of subroutine is derived from C-intrinsic function "malloc", which is used for memory allocation and returnes pointer to allocated chunk of memory.

Thank you.

Sincerely, Alexander.

jimdempseyatthecove · ‎02-11-2017

As previously stated: look at the TBB documentation relating to heap management functions

IIF the TBB scalable allocator is used, which overloads malloc, ... free, new/delets, new[]/delete[]
Then the scalable allocator heap works approximately like this:

The compiled program instantiates one or more empty scalable allocator heap context(s) to be duplicated in thread local storage (or other means to be accessible differently by different threads).

The thread specific heap is initially empty.

The compiler generated code overloads (iow replaces) all allocation and deallocation routines such that they redirect to a thread private heap.

The initial thread private heaps (one for each thread) are empty.

Upon allocation, the thread first attempts to allocate from its thread private heap(s). If (when) that fails, the overloaded function then obtains a reasonably sized heap (presumably larger than the allocation request). In TBB nomenclature this is called a slab, in some versions of TBB the size of the slab was 16KB, but this may vary. The new slab is added to the list of slabs maintained by the specific thread as an extension of the thread's heap.

The scalable allocator is much more complicated than the generalization described above. The scalable allocator maintains lists of similar sized allocations, the granularity of which (you look this up with google) may be in multiple units of 16, 32 or 64 byte sizes. The program allocation for n bytes, is converted into an allocation of the number of granularity units sufficient to hold n bytes. If (when) the specific list has an available node, the node is quickly returned as the allocation. When the list is empty, then one or more new nodes will be obtained from the slab (or optionally from larger nodes), that failing, a new slab is allocated from the shared heap (or OS memory manager), that failing, then something like a garbage collection occurs to return unused slabs or portions thereof from other threads, that failing a NULL is returned.

Again, the above is a generalization. Implementation details will likely vary from version to version and from platform to platform and from bit-ness to bit-ness of application.

Jim Dempsey

volosch · ‎02-12-2017

Dear Mr. Jim Dempsey,

Thank you for your help!

A modified dummy Fortran code, where kmp_malloc function is invoked by every thread, is constructed:

PROGRAM TEST_OPENMP

USE,INTRINSIC::ISO_C_BINDING

IMPLICIT NONE

#ifdef _OPENMP

INCLUDE 'omp_lib.h'

#endif

! integer(C_INT), pointer :: kmp_pointer_kind_f (:)

integer(8), pointer :: kmp_pointer_kind_f (:)

! INTEGER(C_INT), ALLOCATABLE, TARGET :: KMP (:)

INTEGER(8), ALLOCATABLE, TARGET :: KMP (:)

INTEGER(4), ALLOCATABLE :: shape_kmp (:),kmp_final (:)

LOGICAL STATUS

INTEGER :: NTHR, STACKSIZE, TID, I

#ifdef _OPENMP

!$omp parallel

NTHR = OMP_GET_NUM_THREADS()

STACKSIZE=KMP_GET_STACKSIZE()

!$omp end parallel

#endif

WRITE(*,1002)NTHR

1002 FORMAT(1X,'Number of threads=',I2)

WRITE(*,1003)STACKSIZE

1003 FORMAT(1X,'The stack size by thread, bytes, =',I10)

ALLOCATE (kmp_pointer_kind_f(NTHR),KMP(NTHR),shape_kmp(NTHR),

+kmp_final(NTHR))

! Pointer assignment

kmp_pointer_kind_f => KMP

STATUS = ASSOCIATED (kmp_pointer_kind_f)

write(*,1005) status

1005 format(1x,'status=',L3)

shape_kmp=SHAPE(kmp_pointer_kind_f)

CALL C_F_POINTER(C_LOC(kmp_pointer_kind), kmp_pointer_kind_f,

+shape_kmp)

STATUS = ASSOCIATED (kmp_pointer_kind_f)

write(*,1005) status

write(*,1006) kmp_pointer_kind_f

1006 format(1x,'kmp_pointer_kind_f='/10(i20))

kmp_final=0

1 continue

!$OMP PARALLEL PRIVATE(TID)

!$OMP DO SCHEDULE(DYNAMIC,1)

DO I=1,NTHR

TID=OMP_GET_THREAD_NUM()

! write(*,1008)I,TID

!1008 format(1x,'i=',i3,' tid=',i3)

if(kmp_final(i).eq.0) then

if(TID.eq.I-1) then

kmp(i)=kmp_malloc(1048576)

kmp_final(i)=1

write(*,1007) TID,kmp(1+TID)

1007 format(1x,'Thread number=tid=',i3,' kmp(1+tid)=',i20)

end if

END DO

!$OMP END DO

!$OMP END PARALLEL

do i=1,NTHR

if(kmp_final(i).eq.0) go to 1

end do

END PROGRAM TEST_OPENMP

It was compiled under Windows 10, installed on Intel Core 2 Duo E6750 based (1 core, 2 threads) PC, by Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.0.109.

The following listing is generated after entering:

test_openmp > test_openmp.out 2>&1:

Number of threads= 2

The stack size by thread, bytes, = 4194304

status= T

kmp_pointer_kind_f=

282643207815176 287019779489792

Thread number=tid= 1 kmp(1+tid)= 2328270643456

Thread number=tid= 0 kmp(1+tid)= 2328272699776

A similar algorithm was tested for our production code KATRIN under Windows 10, installed on Intel i7-4960X based (6 cores, 12 threads) and Intel i7-6950X based (10 cores, 20 threads) PC with a negative result. No improvement in the run time was observed. On the contrary, the run time slightly increase.

Andrei Voloshchenko

jimdempseyatthecove · ‎02-13-2017

Andrei,

The TBB scalable allocator, any/all scalable allocators, have initialization overhead, but are designed to optimize:

loop: allocate objects, deallocate objects, end loop

Where you have repeatable object sizes. You may have 100's, 1000's of such sizes.

The benefit (less time) is observed on the 2nd and later allocations of the same size object *** provided that an available (prior returned) node is in the threads private heap.

Also, as stated previously, the scalable allocator works best on reasonable sized objects (on the order of 10 to 4000 bytes).

Your code is allocating an object of size 1048576. This is likely outside the "cacheable" node size and is therefore not cached (on future return).

I suspect that you are assuming kmp(i)=kmp_malloc(1048576) is allocating a heap of that size. Not true, it allocates a node of that size.

When your code allocates once, then runs with the initial allocation for the duration of the application, then do not use a scalable allocator (not what it is designed for).

When your code is multi-threaded, .AND. each thread has a high frequency of allocation/deallocation of repeating sizes of objects (say under 4KB) then consider using a scalable allocator. Remember, the benefits only come on reallocation of a node that was previously returned to the thread's scalable allocator heap (context).

Jim Dempsey

volosch · ‎02-14-2017

Dear Mr. Jim Dempsey,

Thank you for your help!

A few variants of the size of allocated by kmp_malloc function memory block: 1048576, 65536 and 8192 where tested with a similar result: no benefit is received. Intel TBB scalable allocator seems a good memory manager for a C/C++ code, but our code KATRIN is a Fortran multithreaded code. The sizes of objects allocated by threads are different in different parts of the code. So, I shall look other remedies to improve the code performance.

Andrei Voloshchenko

jimdempseyatthecove · ‎02-14-2017

>>The sizes of objects allocated by threads are different in different parts of the code.

The important requirements of scalable allocator are:

Object sizes must be reasonable (to the allocator). Probably under 16KB, but this may vary.
Second, and more importantly, these objects must be repeatedly allocated, deallocated, allocated, deallocated.

For larger objects, the allocator may work well, but each thread in your application could do as well using linked lists of returned objects (for use in reallocation of same sized node). The TBB scalable allocator may have an advantage because it pools similar sized objects.

Also, in any test (benchmark) you run to evaluate kmp_malloc, you must remember to points:

First time allocation that requires a new slab to be allocated will take longer than second allocation that comes from a slab.
First touch of memory, regardless of allocation method, has excessive overhead due to requiring O/S to map physical RAM, page file, and optionally wipe on first touch. Additional access to same page has no such overhead.

Therefor, your test should have a loop of at least 3 iterations of actually using the allocated memory. Discard the first, use the timings of the others to determine performance.

Jim Dempsey