I'm working on a memory allocator that should be optimized for use on NUMA architectures. It tries to take memory from the node where the current thread resides, and if it can't find there, it "steals" from other nodes.
Here comes the problem: I want to take memory from the nearest node (lowest hop value) to the farthest (so the access is as low as possible), but I couldn't find an API (at least in Windows) to get this type of information (only which CPUs are in a node, not relationships betweennodes).
Is there any way to get the distance between nodes ? It must not benecessarily an API, CPUID/APIC would be very good if there is no other way.
Thanks in advance!
Such a NUMA aware memory allocator is available in QuickThread (http://www.quickthreadprogramming.com). This allocator can be used independent of the QuickThread threading environment (OpenMP, Windows thread, TBB, etc...) as well as within the QuickThread threading environment. QuickThread programming environment permits taskscheduling to within NUMA nodes.
// distribute one instance of task to each NUMA node
parallel_distribute( OneEach_M0$, fooNode, A, B, C);
void fooNode(intptr_t teamMember, intptr_t teamSize, A_t A, B_t B, C_t C)
// within this NUMA node run slice of array
parallel_for( M0$, foo, 0, countFoo, teamMember, teamSize, A, B, C);
void foo(intptr_t teamMember, intptr_t teamSize, intptr_t iBegin, intptr_t iEnd A_t A, B_t B, C_t C)
... // run slice of arrays (iBegin, iEnd]
Such a NUMA aware allocator is more effective when the problem has a majority of random memory access as opposed to a majority of serial memory access. On serial access problems most processors can be configured for pre-fetch and many serial-like access programs will not out pace the pre-fetch (regardless of inter-node access). On this subset of problems the aware NUMA allocator is not as beneficial for reducing time for memory access. This noted, when the application has a high flux in allocation/de-allocation the NUMA aware allocator will "dilute" the frequency of CAS/Interlocked/Locked operations by 1/number of NUMA nodes (assuming allocations/de-allocations evenly distributed). Most such scalable allocators allocate blocks of similarly sized items and the frequency of CAS/Interlocked/Locked operations are already reduced to when the allocations consume/fill a block.
On Linux you can use numa_distance() function from libnuma:
On Windows I am not aware of any such API. Sad. You may try to investigate sources of libnuma, maybe it will give you the solution.
mentions the various facilities provided by Windows for NUMA aware allocation, although it's unusually cryptic.
gives some recommendations, in case you want assurance of multi-vendor platform support. Also, there are references on the differences between application level usage on Windows and other OS.
Thanks for the answers.
I have found a solution for Windows (though not tested yet). The distances can be found in the SLIT - System Locality Information Table - an extension provided by OEMs and available through ACPI. It contains a N_CPU x N_CPU matrix that describes the distances ( is the distance between nodes 1 and 3;
Windows Vista+ has the method GetSystemFirmwareTable that can be used to retrieve this table (it seems that under XP it can be retrieved from the Registry, but it's more difficult).
The structure of this table can be found in the ACPICA package (http://www.acpica.org).