Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

How to control NUMA locality of ALLOCATE'd memory in OpenMP programs?

mriedman
Novice

For an OpenMP program I want to enforce the use of NUMA-local memory with threads that are permanently bound to the same CPU. These threads do frequent ALLOCATEs and DEALLOCATEs.

It seems like kmp_malloc() is the only way of enforcing NUMA locality (same as thread locality in this context) for memory that is frequently allocated and freed. The first confusing issue is that kmp_malloc() is not mentioned in the ifort documentation, although it's in the libraries.

Fortran ALLOCATE does call malloc(). Even if malloc() along with proper array initialization gets me local memory at the first invocation, that does not help for long. After some time of frequent malloc() and free() calls, the program ends up with a fragmented bag of local and remote memory pages. malloc() has no knowledge of locality. Please correct me if I'm wrong.

Now I don't see a supported way of forcing ALLOCATE to use kmp_malloc() instead of malloc(). Apparently the only way might be intercepting malloc() using LD_PRELOAD. That's not a clean way of doing things.

While MALLOC is available as a Fortran intrinsic, that is not the case for KMP_MALLOC. A more desirable solution could be an environment variable that switches ALLOCATE from malloc() to kmp_malloc(). Any other ideas? Highly appreciated.

1 Solution
jimdempseyatthecove
Honored Contributor III

Michael,

Look under C/C++ interoperability (not portable) functions.

Principally C_F_POINTER

Something like this untested code

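USE, INTRINSIC :: ISO_C_BINDING
REAL, POINTER :: ARRAY(:,:,:)
TYPE(C_PTR) :: CallocatedArray
...
! YourCMalloc is your own C allocator, declared through an INTERFACE with BIND(C);
! it takes a byte count and returns a TYPE(C_PTR). SIZEOF is an Intel extension.
CallocatedArray = YourCMalloc(nX * nY * nZ * SIZEOF(ARRAY(1,1,1)))
IF (.NOT. C_ASSOCIATED(CallocatedArray)) CALL Oops()
CALL C_F_POINTER(CallocatedArray, ARRAY, [nX, nY, nZ])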


Don't forget to return the allocated memory when you are done with it.
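
For illustration, the C side of something like YourCMalloc could look like the sketch below. This is not from the thread: the names local_alloc and local_free are hypothetical, and it assumes Linux with the default first-touch memory policy (no interleave policy set, and free memory available on the local node). It maps fresh anonymous pages and writes to each one from the calling thread, so the kernel places them on the node that thread is running on. Freeing with munmap hands the pages back to the OS rather than recycling them through a shared heap.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: allocate page-backed memory and touch every page
   from the calling thread. Assumes Linux default (first-touch) placement. */
void *local_alloc(size_t bytes)
{
    if (bytes == 0)
        return NULL;

    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    /* First touch: write one byte per page so each page is faulted in
       by this thread (anonymous pages start out zero-filled anyway). */
    volatile char *c = p;
    for (size_t off = 0; off < bytes; off += (size_t)page)
        c[off] = 0;

    return p;
}

/* Hypothetical helper: the caller must pass the same size it allocated. */
void local_free(void *p, size_t bytes)
{
    if (p != NULL)
        munmap(p, bytes);
}
```

Calling mmap and munmap for every ALLOCATE/DEALLOCATE is slow, so this only makes sense for large or long-lived arrays, or as the bottom layer underneath a pool.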

Jim Dempsey

22 Replies
mriedman
Novice
Jim, yes indeed, the next logical step would be putting a fast multi-arena allocator (maybe one instance per thread) on top of numa_alloc_onnode(), which would even allow intercepting malloc() and give a transparent solution without introducing a special API. You may argue whether or not intercepting malloc() is a legal solution ...

For C++ programs that is mandatory. For most of my classic Fortran-style programs, which do all their allocs in a central place during startup (just recently migrated from COMMON blocks to ALLOCATABLEs), it is o.k. to make just a few changes as outlined above.

Anyway - your suggestions have helped develop this quite a bit. Thanks.
Michael
jimdempseyatthecove
Honored Contributor III
From my experience in writing one of these, the most efficient (fastest) method is a 3-tiered one:

per-thread
per-node
per-system

Here the deallocate supplies not only the pointer but also the size of the memory node to be returned.
These are nested pools. Per-thread pool access is without locks. Transfers between the per-thread and per-node pools take a lock, but only one lock per store/fetch of a pool (not per node). A larger pool size means larger memory requirements but less overhead: a tradeoff between size and speed.
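
The nesting described above can be sketched in C roughly as follows. This is an illustrative sketch, not Jim's implementation: all names (pool_alloc, pool_free, BATCH, the size classes) are made up, the per-node tier is reduced to a single shared pool (a NUMA-aware version would keep one per node), and plain malloc() stands in for the per-system tier.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define NCLASSES  8     /* size classes: 64, 128, ..., 8192 bytes */
#define MIN_BLOCK 64
#define BATCH     16    /* blocks moved per lock acquisition */

typedef struct block { struct block *next; } block;
typedef struct { block *head; size_t count; } freelist;

static _Thread_local freelist thread_pool[NCLASSES]; /* tier 1: no locks */
static freelist node_pool[NCLASSES];                 /* tier 2: stand-in for one pool per node */
static pthread_mutex_t node_lock = PTHREAD_MUTEX_INITIALIZER;

/* Map a byte count to a size class, or -1 if it is too large to pool. */
static int size_class(size_t bytes)
{
    size_t cap = MIN_BLOCK;
    for (int c = 0; c < NCLASSES; c++, cap <<= 1)
        if (bytes <= cap)
            return c;
    return -1;
}

static void push(freelist *l, block *b)
{
    b->next = l->head;
    l->head = b;
    l->count++;
}

static block *pop(freelist *l)
{
    block *b = l->head;
    if (b != NULL) {
        l->head = b->next;
        l->count--;
    }
    return b;
}

/* Move up to n blocks from src to dst. */
static void move_blocks(freelist *dst, freelist *src, size_t n)
{
    block *b;
    while (n-- > 0 && (b = pop(src)) != NULL)
        push(dst, b);
}

void *pool_alloc(size_t bytes)
{
    assert(sizeof(block) <= MIN_BLOCK);
    int c = size_class(bytes);
    if (c < 0)
        return malloc(bytes);               /* too big: straight to tier 3 */

    block *b = pop(&thread_pool[c]);        /* tier 1, lock-free */
    if (b == NULL) {
        pthread_mutex_lock(&node_lock);     /* one lock per batch, not per block */
        move_blocks(&thread_pool[c], &node_pool[c], BATCH);
        pthread_mutex_unlock(&node_lock);
        b = pop(&thread_pool[c]);
    }
    if (b == NULL)
        b = malloc((size_t)MIN_BLOCK << c); /* tier 3: the system allocator */
    return b;
}

/* The caller passes the size, so no per-block header is needed. */
void pool_free(void *p, size_t bytes)
{
    if (p == NULL)
        return;
    int c = size_class(bytes);
    if (c < 0) {
        free(p);
        return;
    }
    push(&thread_pool[c], p);
    if (thread_pool[c].count > 2 * BATCH) { /* spill a batch back to tier 2 */
        pthread_mutex_lock(&node_lock);
        move_blocks(&node_pool[c], &thread_pool[c], BATCH);
        pthread_mutex_unlock(&node_lock);
    }
}
```

A block freed by a thread other than the one that allocated it ends up in the freeing thread's pool here. For NUMA locality a fuller version would have to route it back to the pool of the node that owns the page.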

On an 8-thread system, the pool-based system is about 2.8x faster than malloc, and appears to have a linear scaling factor of 0.532225 (measured on a Core i7 920).

Jim Dempsey