Example with kmp_malloc in Fortran

per · ‎06-01-2017

I am running my Fortran code across two NUMA nodes equipped with Xeon(R) CPU E5-2690v4 processors on Linux under SLURM. The part that uses MKL functions scales very well, but my OpenMP code does not. I think this is because I use a large block of memory that is allocated by thread 0 and then partitioned over the threads. In such a scenario, the memory is allocated on the node where thread 0 runs. I believe that allocating the memory inside the OpenMP threads will help to solve the problem. Could you please provide some examples of how this can be done correctly in Fortran? My attempts so far return zero pointers from kmp_malloc regardless of the argument.
P.S.
I saw a similar thread in the forum, but it does not contain a solution. I use malloc, not ALLOCATE.
I was referred to the forum by the technical support, which we pay for.

TimP · ‎06-02-2017

It should not matter which allocator you use. Memory locality should be set by "first touch," i.e. which CPU is running the thread which first initializes the data. You will need to set affinity, e.g. by "export OMP_PLACES=cores" recognizing that the MKL default is equivalent to that, if you are setting NUM_THREADS accordingly.
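
A minimal sketch of the first-touch idea described above, assuming a simple 1-D array and a compiler with OpenMP enabled (e.g. `-qopenmp` or `-fopenmp`). The program name, the array `a`, and the size `n` are hypothetical and chosen only for illustration. The point is that the initialization loop and the compute loop use the same `schedule(static)` with the same trip count, so each thread works on the index range it initialized.

```fortran
program first_touch
  use iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 50000000        ! hypothetical problem size
  real(real64), allocatable :: a(:)
  real(real64) :: s
  integer :: i

  ! ALLOCATE reserves address space; on Linux the physical pages are
  ! normally not placed on a NUMA node until they are first written.
  allocate(a(n))

  ! First touch: each thread initializes the chunk it will use later.
  !$omp parallel do schedule(static)
  do i = 1, n
     a(i) = 0.0_real64
  end do
  !$omp end parallel do

  ! Compute phase: same static schedule and trip count, so each thread
  ! revisits the index range it initialized above.
  s = 0.0_real64
  !$omp parallel do schedule(static) reduction(+:s)
  do i = 1, n
     a(i) = a(i) + 1.0_real64
     s = s + a(i)
  end do
  !$omp end parallel do

  print *, 'sum =', s
  deallocate(a)
end program first_touch
```

This only helps if threads stay where they were when they touched the data, which is why the affinity setting (e.g. `export OMP_PLACES=cores`) matters.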

per · ‎06-02-2017

Thank you, Tim. So you are saying there is no difference between malloc and kmp_malloc as long as OMP_PLACES=cores? In my case, the scheduler itself ensures CPU and memory binding through cgroups, so I do not specify the affinity explicitly. Does that make any difference? Thank you again.

TimP · ‎06-02-2017

It's not evident from the sparse documentation of cgroups whether it is intended to provide a means for setting affinity equivalent to current OpenMP or KMP_AFFINITY facilities used by MKL. Your observation seems to indicate the contrary. In order to get the benefit of local memory, the allocation must always be initialized and accessed by threads on the same CPU.

In practice, good performance sometimes happens until a task is swapped out and returns on another CPU, after which performance remains degraded. It can also happen that if the RAM attached to one CPU is over-subscribed, first touch will not prevent remote memory affinity.

I neglected to point out that the allocator should be set for 32-byte or better alignment.

jimdempseyatthecove · ‎06-05-2017

At program start, a process is given a very large virtual address space, almost all of which is undefined/not mapped. This includes the heap and stack address space. When a block is first allocated from the heap, the heap node headers (for the node being allocated and, typically, the new free node following it) are updated in the context of the allocating thread. For a large allocation, this may "first touch" two pages of virtual memory (the mapping granularity is the page size, 4KB by default) and map those pages to the NUMA node of the allocating thread; the two node headers belong to the heap. The remaining pages are not mapped until some thread in the process touches virtual memory in those address ranges.

It behooves you to construct a parallel region whose team has the same number of threads, and which partitions the memory in the same manner, as the region in which you will subsequently use the memory. The first-touch parallel region may zero the array or initialize it. When the initial content is not important, your loop could write zeros at a stride of 4KB (or the page size, if you determine it).
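
The strided first touch could be sketched as follows. This is an illustration under assumptions, not a tested recipe: the 4096-byte page size is hard-coded rather than queried, and the names `page_touch`, `a`, `n`, `lo`, `hi`, and `chunk` are hypothetical. Each thread computes its own contiguous block and writes one element per page inside it; the compute phase then rebuilds the same partition.

```fortran
program page_touch
  use omp_lib
  use iso_fortran_env, only: real64, int64
  implicit none
  integer(int64), parameter :: n = 50000000_int64       ! hypothetical array size
  integer(int64), parameter :: page_bytes = 4096_int64  ! assumed page size
  integer(int64), parameter :: stride = page_bytes / 8_int64  ! real64 elements per page
  real(real64), allocatable :: a(:)
  integer(int64) :: i, lo, hi, chunk
  integer :: tid, nt

  allocate(a(n))

  ! First-touch region: one write per page within each thread's block.
  !$omp parallel private(tid, nt, chunk, lo, hi, i)
  tid   = omp_get_thread_num()
  nt    = omp_get_num_threads()
  chunk = (n + nt - 1) / nt            ! block size, rounded up
  lo    = tid * chunk + 1
  hi    = min(n, lo + chunk - 1)
  do i = lo, hi, stride
     a(i) = 0.0_real64
  end do
  ! The block need not start on a page boundary, so also touch its last element.
  if (hi >= lo) a(hi) = 0.0_real64
  !$omp end parallel

  ! Compute region: same team size and same partition as above.
  ! Elements not written during the touch pass are still undefined here,
  ! so this phase assigns every element before reading it.
  !$omp parallel private(tid, nt, chunk, lo, hi, i)
  tid   = omp_get_thread_num()
  nt    = omp_get_num_threads()
  chunk = (n + nt - 1) / nt
  lo    = tid * chunk + 1
  hi    = min(n, lo + chunk - 1)
  do i = lo, hi
     a(i) = real(i, real64)
  end do
  !$omp end parallel

  print *, 'a(1), a(n) =', a(1), a(n)
  deallocate(a)
end program page_touch
```

Pages that straddle the boundary between two threads' blocks end up on whichever node touches them first; with blocks much larger than a page this affects only a small fraction of the data.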

Note: if you programmatically have the O/S use huge (2MB) or gigantic (1GB) page sizes, then a single first touch places an entire large page. IOW, when programming for improved NUMA access, a smaller page size may be more effective (YMMV).

And, when you return the memory to the heap, those addresses maintain the NUMA association of the thread of first touch. Consider keeping the allocation around for reuse.

Jim Dempsey

per · ‎06-14-2017

Thank you for the responses. Before reprogramming according to the suggestions I received, I ran a case in which each thread declared static arrays large enough for this test case. My intention was to estimate the expected outcome of the programming effort. To my surprise, I could not see any measurable improvement. Aren't static arrays located in the thread-mapped regions regardless of their size?

TimP · ‎06-14-2017

If your data set is small enough for L3 cache, the difference between local and remote placement may occur only once. So lack of measurable difference may be normal.

jimdempseyatthecove · ‎06-14-2017

Static arrays are globally mapped to process address space (shared).
Static Thread Private arrays (Thread Local Storage) are also globally mapped to the process address space; however, accessing the array (descriptor) is thread-specific, at the expense of additional overhead. The actual technique for TLS access is implementation dependent.
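
To make the distinction concrete, here is a small sketch contrasting a shared static array with a `threadprivate` one. The module and array names (`work_arrays`, `shared_buf`, `private_buf`) and the size `m` are hypothetical. It only demonstrates that every thread sees its own copy of the `threadprivate` array; it makes no claim about where the runtime physically places that storage.

```fortran
module work_arrays
  use iso_fortran_env, only: real64
  implicit none
  integer, parameter :: m = 100000          ! hypothetical per-thread size
  real(real64), save :: shared_buf(m)       ! one copy, visible to all threads
  real(real64), save :: private_buf(m)      ! one copy per thread
  !$omp threadprivate(private_buf)
end module work_arrays

program tls_demo
  use omp_lib
  use work_arrays
  implicit none
  integer :: tid

  !$omp parallel private(tid)
  tid = omp_get_thread_num()

  ! Each thread writes to its own instance of private_buf.
  private_buf = real(tid, real64)

  ! shared_buf exists once; only one thread initializes it here.
  !$omp single
  shared_buf = -1.0_real64
  !$omp end single

  ! Serialize the output so lines from different threads do not interleave.
  !$omp critical
  print *, 'thread', tid, ' private_buf(1) =', private_buf(1), &
           ' shared_buf(1) =', shared_buf(1)
  !$omp end critical
  !$omp end parallel
end program tls_demo
```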

Jim Dempsey