Solved: How to control NUMA locality of ALLOCATE'd memory in OpenMP programs?

mriedman · ‎03-02-2010

For an OpenMP program, I want to enforce the use of NUMA-local memory by threads that are permanently bound to the same CPU. These threads do frequent ALLOCATEs and DEALLOCATEs.

It seems like kmp_malloc() is the only way of enforcing NUMA locality (same as thread locality in this context) for memory that is frequently allocated and freed. The first confusing issue is that kmp_malloc() is not mentioned in the ifort documentation, although it is in the libraries.

Fortran ALLOCATE does call malloc(). Even if malloc() along with proper array initialization gets me local memory at the first invocation, that does not help for long. After some time of frequent malloc() and free() calls, it ends up with a fragmented bag of local and remote memory pages. malloc() has no knowledge of locality. Please correct me if I'm wrong.

Now I don't see a supported way of forcing ALLOCATE to use kmp_malloc() instead of malloc(). Apparently the only way might be intercepting malloc() using LD_PRELOAD. That's not a clean way of doing things.

While MALLOC is available as a Fortran intrinsic, that is not the case for KMP_MALLOC. A more desirable solution could be an environment variable that switches ALLOCATE from malloc() to kmp_malloc(). Any other ideas? Highly appreciated.

jimdempseyatthecove · ‎03-03-2010

Michael,

Look under C/C++ interoperability (not portable) functions.

Principally C_F_POINTER

Something like this untested code

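USE, INTRINSIC :: ISO_C_BINDING
REAL, POINTER :: ARRAY(:,:,:)
TYPE(C_PTR) :: CallocatedArray
...
! YourCMalloc is a placeholder for a C allocator that returns TYPE(C_PTR)
CallocatedArray = YourCMalloc(nX * nY * nZ * SIZEOF(ARRAY(1,1,1)))
IF (.NOT. C_ASSOCIATED(CallocatedArray)) CALL Oops()
CALL C_F_POINTER(CallocatedArray, ARRAY, [nX, nY, nZ])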

Don't forget to return the allocated memory when you are done with it.
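
For illustration, the C side of a routine like YourCMalloc might look as follows. This is only a sketch under assumptions: the names your_c_malloc and your_c_free are hypothetical, it targets Linux, and it relies on the kernel's default first-touch page placement rather than on kmp_malloc() or libnuma. It maps fresh anonymous pages with mmap() and writes to them from the calling thread, so a thread pinned to one CPU should end up with pages on its own node. The free routine takes the size because munmap() needs it.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round nbytes up to a whole number of pages.
   Returns 0 for nbytes == 0 or on arithmetic overflow. */
static size_t round_to_pages(size_t nbytes)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);
    size_t page = (size_t)page_size;
    size_t rounded = (nbytes + page - 1) / page * page;
    return rounded >= nbytes ? rounded : 0;
}

/* Hypothetical stand-in for YourCMalloc: returns freshly mapped,
   page-aligned memory that the calling thread has already touched.
   Returns NULL on failure or when nbytes is 0. */
void *your_c_malloc(size_t nbytes)
{
    size_t len = round_to_pages(nbytes);
    if (len == 0)
        return NULL;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    memset(p, 0, len);           /* first touch happens on this thread */
    return p;
}

/* Hypothetical counterpart: nbytes must be the size that was requested.
   Returns 0 on success, -1 on failure (as munmap does). */
int your_c_free(void *p, size_t nbytes)
{
    if (p == NULL)
        return 0;
    return munmap(p, round_to_pages(nbytes));
}
```

The Fortran side would need an INTERFACE with BIND(C) for both routines. Every call here is a system call, so this is only reasonable for large or infrequent allocations.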

Jim Dempsey

mriedman · ‎06-01-2010

Jim, yes indeed, the next logical step would be putting a fast multi-arena allocator (maybe one instance per thread) on top of numa_alloc_onnode(), which would even allow intercepting malloc() and having a transparent solution without introducing a special API. You may argue whether or not intercepting malloc() is a legal solution ...

For C++ programs that is mandatory. For most of my classic Fortran-style programs, which do all their allocs in a central place during startup (just recently migrated from COMMON blocks to ALLOCATABLEs), it is o.k. to make just a few changes as outlined above.

Anyway - your suggestions have helped develop this quite a bit. Thanks.
Michael

jimdempseyatthecove · ‎06-01-2010

From my experience in writing one of these, the most efficient (fastest) method is a 3-tiered one:

per-thread
per-node
per-system

Here the deallocate supplies not only the pointer but also the size of the memory node to be returned.
These are nested pools. Per-thread pool access is without locks. Per-thread to/from per-node has locks, but only one lock per store/fetch of pool (not node). Larger pool size means larger memory requirements but less overhead. A tradeoff between size and speed.

On an 8-thread system, the pool-based system is about 2.8x faster than malloc, and appears to have a linear scaling factor of 0.532225 (measured on a Core i7 920).
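
The nested pools described above could be sketched roughly as follows. This is not the implementation that was measured, only a minimal illustration under assumptions: all names (pool_init, pool_alloc, pool_free, NCLASS, BATCH, MAXNODE) are hypothetical, blocks are grouped into power-of-two size classes, the caller passes the NUMA node its thread is pinned to, and plain malloc() stands in for the per-system tier, where a NUMA-aware call such as numa_alloc_onnode() would go in practice.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define NCLASS  8    /* size classes: 64 B, 128 B, ... 8 KiB (assumed) */
#define BATCH   16   /* blocks moved per locked transfer (assumed) */
#define MAXNODE 8    /* assumed upper bound on NUMA nodes */

typedef struct block { struct block *next; } block;
typedef struct { block *head; size_t count; } list;

/* Tier 2: one pool per NUMA node, protected by a lock. */
static struct { pthread_mutex_t lock; list free[NCLASS]; } node_pool[MAXNODE];
/* Tier 1: one cache per thread, accessed without locks. */
static _Thread_local list thread_cache[NCLASS];

/* Call once before any thread uses the pools. */
void pool_init(void)
{
    for (int i = 0; i < MAXNODE; i++)
        pthread_mutex_init(&node_pool[i].lock, NULL);
}

/* Smallest class whose block size holds n bytes, or -1 if too large. */
static int class_of(size_t n)
{
    size_t sz = 64;
    int c = 0;
    while (sz < n && c < NCLASS) { sz <<= 1; c++; }
    return c < NCLASS ? c : -1;
}

/* Move up to BATCH blocks between lists under a single lock. */
static void move_batch(list *from, list *to, pthread_mutex_t *lock)
{
    pthread_mutex_lock(lock);
    for (int i = 0; i < BATCH && from->head; i++) {
        block *b = from->head;
        from->head = b->next; from->count--;
        b->next = to->head; to->head = b; to->count++;
    }
    pthread_mutex_unlock(lock);
}

void *pool_alloc(size_t n, int node)
{
    assert(node >= 0 && node < MAXNODE);
    int c = class_of(n);
    if (c < 0)
        return malloc(n);                     /* too large: tier 3 directly */
    list *tc = &thread_cache[c];
    if (!tc->head)                            /* refill from the node pool */
        move_batch(&node_pool[node].free[c], tc, &node_pool[node].lock);
    if (!tc->head)
        return malloc((size_t)64 << c);       /* tier 3: system allocator */
    block *b = tc->head;
    tc->head = b->next; tc->count--;
    return b;
}

/* n must be the size passed to pool_alloc; node the same node. */
void pool_free(void *p, size_t n, int node)
{
    assert(node >= 0 && node < MAXNODE);
    if (!p)
        return;
    int c = class_of(n);
    if (c < 0) { free(p); return; }
    list *tc = &thread_cache[c];
    block *b = p;
    b->next = tc->head; tc->head = b; tc->count++;
    if (tc->count > 2 * BATCH)                /* spill a batch to the node */
        move_batch(tc, &node_pool[node].free[c], &node_pool[node].lock);
}
```

As in the description, pool_free() needs the size so the block can go back to the right size class without a per-block header. A block must be freed with the node it was allocated for, and this sketch never returns pooled blocks to the system.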

Jim Dempsey