I run my multi-threaded application on a Xeon E5-2620 system that has a NUMA on/off option in the BIOS.
I profile my application using pcm-numa from Intel PCM.
When I disable hyperthreading as well as the cores of one socket and run the app with NUMA on, almost no remote accesses are reported (only a very small number).
In contrast, when the NUMA option is off, there are almost as many remote accesses as local ones. There is no performance difference whatsoever.
Does this mean that:
NUMA on or off does not disable any memory? It just makes the OS aware of the physical location of the memory, so that it can place allocations and threads accordingly?
Thanks in advance.
NUMA enabled/disabled performance is kind of a complicated thing.
Let's start with NUMA disabled in the BIOS. If I recall correctly, in this case the hardware will round-robin cache lines across the sockets. That is, when the OS allocates a page of memory, it will get the first 64 bytes from, say, socket 0, the next 64 bytes from socket 1, the next from socket 2, and so on. This results in all memory allocations being spread over all sockets. This can be useful in some cases, say where the software isn't NUMA aware and it somehow falls into the "runs slower with NUMA enabled" case. I think the level of interleaving may be different from 'per cache line' on some systems or OSes (for instance, you might be able to interleave at a 'per page' level).
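To make the round-robin idea concrete, here is a toy model in Python. It is only a sketch under the assumptions above: real memory controllers may use a more elaborate mapping than a plain modulo, and `home_socket` and `remote_fraction` are names I made up for illustration. The mapping is applied to physical addresses, so the granularity can be smaller than a page. For two sockets and per-cache-line interleaving it gives half the lines as remote, which is in line with the "almost as many remote as local" result from pcm-numa.

```python
CACHELINE = 64   # bytes per cache line
PAGE = 4096      # bytes per (small) page

def home_socket(phys_addr, num_sockets, granularity):
    """Socket owning phys_addr under plain round-robin interleaving (assumed model)."""
    return (phys_addr // granularity) % num_sockets

def remote_fraction(base, length, num_sockets, granularity, local_socket=0):
    """Fraction of cache lines in [base, base + length) that live on another socket."""
    lines = range(base, base + length, CACHELINE)
    remote = sum(home_socket(a, num_sockets, granularity) != local_socket
                 for a in lines)
    return remote / len(lines)
```

For example, `remote_fraction(0, PAGE, 2, CACHELINE)` is 0.5 (every other line of the page is remote), while `remote_fraction(0, PAGE, 2, PAGE)` is 0.0 because with per-page interleaving the whole first page sits on socket 0.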
The NUMA enabled case is more complicated. On Linux there is a 'numactl' command to display NUMA settings and memory allocation strategies. Here is the simplest, best case. Say NUMA is enabled in the BIOS, the OS supports NUMA, and more than one socket is occupied. The software can get to 'local' memory quickest. So if a software thread allocates memory on its local NUMA node and then accesses that memory from the same node, it will be able to access the memory as quickly as possible. If the thread migrates to another NUMA node for some reason after allocating the memory, then it will not be able to access the memory as quickly. Whether this slowdown actually impacts application performance is very application specific.
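One way to rule out the migration case while testing is to pin the process to a CPU before it allocates anything. A minimal sketch, assuming Linux, where Python exposes `os.sched_getaffinity` and `os.sched_setaffinity`; `pin_to_cpu` is a hypothetical helper, and which CPU belongs to which NUMA node has to be checked separately (for instance with numactl).

```python
import os

def pin_to_cpu(cpu=None):
    """Pin the caller to a single CPU.

    Returns the CPU number used, or None where the affinity calls
    are not available (they are Linux-specific).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = sorted(os.sched_getaffinity(0))   # CPUs we may currently run on
    target = allowed[0] if cpu is None else cpu
    os.sched_setaffinity(0, {target})           # pid 0 means "the caller"
    return target
```

Calling `pin_to_cpu()` at program start, before any large allocation, keeps the scheduler from moving the code away from the node where its memory was first placed.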
On Linux (and maybe on Windows nowadays... I just haven't checked) you can set per-process or per-thread NUMA policies. That is, you can boot with NUMA enabled and then tell the OS to do a particular memory allocation 'as if NUMA were disabled'. In that case the OS interleaves the allocation across NUMA nodes (I think at a per-page interleave level). This can be useful if, say, the allocation exceeds the amount of memory on any single node.
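As a sketch of what such a policy looks like in practice, the helper below builds a numactl command line for the two cases discussed: interleaving across all nodes, or binding both CPUs and memory to one node. It uses the `--interleave`, `--cpunodebind` and `--membind` options of numactl; the function name `numactl_command` and the `./app` program are placeholders of mine, not anything from the original setup.

```python
def numactl_command(app_argv, interleave=False, node=None):
    """Build an argv list that launches app_argv under a numactl policy."""
    cmd = ["numactl"]
    if interleave:
        # Spread pages round-robin over all nodes ("as if NUMA were disabled").
        cmd.append("--interleave=all")
    elif node is not None:
        # Run on the CPUs of one node and allocate only from that node.
        cmd += [f"--cpunodebind={node}", f"--membind={node}"]
    return cmd + list(app_argv)
```

The resulting list can be handed to `subprocess.run`, e.g. `subprocess.run(numactl_command(["./app"], interleave=True))`, assuming numactl is installed on the machine.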
Hope this helps,
Could you please elaborate on how the round-robin of cache lines works? I thought the minimum allocation would be one page. Your answer implies that the OS allocates one page in 64-byte cache lines spread across all sockets. How does address translation work for the TLB in that case?
My naive thought was that pages are allocated round-robin, not cache lines.