
Much lower memory bandwidth with NUMA disabled

111alan
Novice

Since some applications cannot effectively utilize more than one NUMA node, I'm testing what happens when NUMA is disabled on a 2S Cascade Lake-SP system (48 cores, 12 memory channels total).

The memory latency impact was expected, but the bandwidth also dropped severely, to about 73 GB/s. That is roughly 1/3 of the same system with NUMA enabled, and almost half that of a 1S 6-channel system.

Why does disabling NUMA have such a huge effect on memory bandwidth, even more than on latency? Is there anything I can do to improve this?

Thanks.

8259CLx2_mlc.JPG

Alberto_Sykes
Employee

111alan, Thank you for posting in the Intel® Communities Support.

 

In order for us to be able to provide the most accurate assistance on this matter, we just wanted to confirm:

What is the model of the Intel® Processor that you are using?

Where is the Intel® Processor located, in a node or in a server system?

What is the model of that system?

 

If you have any questions, please let me know.

 

Regards,

Albert R.

 

Intel Customer Support Technician

A Contingent Worker at Intel


111alan
Novice

1. Originally I used two Xeon Platinum 8259CL processors with 12 Hynix 2Rx8 memory modules running at DDR4-2666, but I also tried Xeon Platinum 8260 processors; the results are almost the same.
2. It's a workstation; the motherboard is an EP2C621D16-4LP.

3. I just noticed memory mirror mode. Is there a setting that lets me keep the same copy of memory data on both processors, so that I can at least get 6-channel performance?

With some VTune analysis I can now basically conclude why, in some applications, a 48-core EPYC still beats a 2S Xeon platform even though its per-core efficiency is far inferior to that of a 24-core Xeon: those applications are not NUMA-aware, and the platform becomes memory bound (when NUMA is disabled).

Thanks.

Alberto_Sykes
Employee

111alan, Thank you very much for providing that information.

 

In this case, I will transfer your thread to the proper department for them to be able to further assist you with this matter.

 

Regards,

Albert R.

 

Intel Customer Support Technician

A Contingent Worker at Intel

 

Emeth_O_Intel
Moderator

Hello 111alan, 


Thank you for contacting Intel Xeon community, 



Memory access time and effective memory bandwidth vary depending on how far the cell containing the CPU or I/O bus making the memory access is from the cell containing the target memory. For example, accesses to memory by CPUs attached to the same cell will experience faster access times and higher bandwidths than accesses to memory on other, remote cells. NUMA platforms can have cells at multiple remote distances from any given cell.


Platform vendors don't build NUMA systems just to make software developers' lives interesting. Rather, this architecture is a means to provide scalable memory bandwidth. However, to achieve scalable memory bandwidth, system and application software must arrange for a large majority of the memory references (cache misses) to go to "local" memory (memory on the same cell, if any) or to the closest cell with memory.
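
As a rough illustration of that point (a generic sketch, not code from this thread), the usual "first-touch" approach on Linux keeps most references local: pages are normally placed on the NUMA node of the thread that first writes them, so initializing an array with the same thread layout that later processes it tends to keep each thread's traffic on its own socket. The array size, build command, and thread-pinning environment variables below are illustrative assumptions.

```c
/* first_touch.c - illustrative sketch of NUMA-friendly "first touch" allocation.
 * Build (assumption): gcc -O2 -fopenmp first_touch.c -o first_touch
 * Run pinned (assumption): OMP_PROC_BIND=close OMP_PLACES=cores ./first_touch
 */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1UL << 28)   /* ~268M doubles, about 2 GiB; adjust to fit your system */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    if (!a) { perror("malloc"); return 1; }

    /* First touch: each thread initializes the chunk it will later use,
     * so on Linux those pages are placed on that thread's local NUMA node. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < N; i++)
        a[i] = 1.0;

    /* The same static schedule means the same thread reads the chunk it
     * initialized, so most traffic stays local and bandwidth can scale
     * across both sockets. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (size_t i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    free(a);
    return 0;
}
```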


For bandwidth-limited, multi-threaded code, the behavior in a NUMA system will primarily depend on how "local" each thread's data accesses are, and secondarily on details of the remote accesses.


In a typical 2-socket server system, the local memory bandwidth available to two NUMA nodes is twice that available to a single node. (But remember that it may take many threads running on many cores to reach asymptotic bandwidth for each socket.)


"Local" bandwidth (to DRAM) in most systems is approximately symmetric (for reads and writes) and relatively easy to understand. "Remote" bandwidth is much more asymmetric for reads and writes, and there is usually significant contention between the read/write commands going between the chips and the data moving between the chips.


Example from the Intel Xeon Platinum 8160 (2 UPI links between chips):


Local Bandwidth for Reads (each socket) ~112 GB/s

Remote Bandwidth for Reads (one-direction at a time) ~34 GB/s

Local bandwidth scales perfectly in two-socket systems, and remote bandwidth also scales very well when using both sockets (each socket reading data from the other socket).


Of course different ratios of local to remote accesses will change the scaling. Timing can also be an issue -- if the threads are doing local accesses at the same time, then remote accesses at the same time, there will be more contention in the remote access portion (compared to a case in which the threads are doing their remote accesses at different times).
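
If you want to see the local-versus-remote asymmetry from software, one rough (and unofficial) way is to pin a thread to one node and time sweeps over buffers bound to each node with libnuma, as in the sketch below. It requires NUMA to be enabled (at least two nodes exposed), and a single thread will not reach the asymptotic per-socket bandwidth quoted above, so treat the output as a relative comparison only; the buffer size is an illustrative assumption.

```c
/* local_vs_remote.c - rough single-threaded sketch of local vs. remote DRAM reads.
 * Build (assumption): gcc -O2 local_vs_remote.c -o local_vs_remote -lnuma
 */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <numa.h>

#define BYTES (1UL << 30)   /* 1 GiB per buffer; adjust as needed */

static double sweep(const char *buf, size_t bytes)
{
    /* Read one byte per 64-byte cache line, so each access pulls a full
     * line from DRAM; return elapsed seconds. */
    struct timespec t0, t1;
    volatile long sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < bytes; i += 64)
        sink += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need libnuma and at least two NUMA nodes\n");
        return 1;
    }
    numa_run_on_node(0);                          /* keep this thread on node 0 */

    char *local  = numa_alloc_onnode(BYTES, 0);   /* memory on the same node    */
    char *remote = numa_alloc_onnode(BYTES, 1);   /* memory across the UPI link */
    if (!local || !remote) { fprintf(stderr, "allocation failed\n"); return 1; }
    memset(local, 1, BYTES);                      /* fault the pages in         */
    memset(remote, 1, BYTES);

    double tl = sweep(local, BYTES);
    double tr = sweep(remote, BYTES);
    printf("local : %.2f GB/s\n", BYTES / tl / 1e9);
    printf("remote: %.2f GB/s\n", BYTES / tr / 1e9);

    numa_free(local, BYTES);
    numa_free(remote, BYTES);
    return 0;
}
```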


Please let me know if you have more questions and I will be more than happy to assist you. 


Regards, 


Emeth O. 

Intel Server Specialist. 


111alan
Novice

Thank you for the detailed reply.

But since all the cores of both CPUs are requesting memory accesses, and all 12 channels are interleaved, shouldn't we at least get something close to a 1S system's bandwidth?

And is the UPI speed the bottleneck for remote memory bandwidth? 34 GB/s is still much lower than the total bandwidth of three UPI links, though.

I still want to see if there is any possibility of a memory mirror mode when NUMA is disabled, like SLI for graphics cards. Although it wastes half of the total memory capacity, that is still better than halving the memory bandwidth and adding 50% more latency. Since UPI would only be maintaining memory coherence instead of transferring large amounts of data, its latency should be better too. And there are also some Optane DCPMMs lying around, so capacity is the least of our concerns.

Thank you.

Emeth_O_Intel
Moderator

Hello 111alan,


Thank you for the information provided.


Let us take a deeper look at the details provided; I will get back to you as soon as possible to proceed with the next step.


Regards,


Emeth O.

Intel Server Specialist.


Emeth_O_Intel
Moderator

Hello 111alan,


I would like to ask you some details in order to have a better understanding of this thread.


  • What applications are you using?
  • Are you a software developer?

 

If you are a software developer, we recommend that you also post your questions in the Intel software developer community forum: https://community.intel.com/t5/Intel-Moderncode-for-Parallel/bd-p/moderncode-parallel-architectures


Please provide this information so that we can proceed with the next step.


Regards,


Emeth O.

Intel Server Specialist.


111alan
Novice

Mostly ANSYS CFD, but I've also done some other tests. For example, here are some results from the PTS toolset, covering 1S, 2S, and 2S with NUMA off:

https://openbenchmarking.org/result/2010047-VE-2009185VE58

And no, I have done some database-related coding, but it's not my profession. I do, however, often try performance optimization and testing on the hardware platforms I have. That's how I found this scalability problem, where a 2S Intel Xeon platform would sometimes lose to a 1S AMD EPYC2, even though its inter-socket latency is lower than the latter's local latency.

Thanks.

Emeth_O_Intel
Moderator

Hello,


I was reviewing your thread and would like to know whether you still have questions about this matter. If so, please do not hesitate to let me know, and I will be more than happy to assist you.


Regards,


Emeth O.

Intel Server Specialist.


Emeth_O_Intel
Moderator

Hello 111alan,


Understood, thank you for the information provided.


I will double-check the details and contact you again as soon as possible.


Regards,


Emeth O.

Intel Server Specialist.


IntelSupport
Community Manager

Hello 111alan, 


I would like to provide an update on this thread.


In general, NUMA should be enabled for best performance. Some software, such as the Ansys software, depends on good NUMA locality.


Disabling NUMA means all memory allocations are interleaved across NUMA domains, so every process suffers higher memory latency and lower memory bandwidth. The Ansys software is sensitive to both latency and bandwidth. Even if the system is undersubscribed, so that bandwidth per process is higher, performance will still suffer from the higher memory latency.
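
As a side note, a similar interleaved placement can usually be reproduced per application while leaving NUMA enabled in the firmware, for example by launching the program under numactl --interleave=all, or from code with libnuma as in the hedged sketch below (the buffer size and usage are illustrative assumptions). That way, NUMA-aware software keeps its locality and only the non-NUMA-aware application gets interleaved memory.

```c
/* interleave_demo.c - sketch: round-robin page placement across all NUMA nodes,
 * applied per allocation in software instead of disabling NUMA system-wide.
 * Build (assumption): gcc -O2 interleave_demo.c -o interleave_demo -lnuma
 */
#include <stdio.h>
#include <string.h>
#include <numa.h>

#define BYTES (1UL << 30)   /* 1 GiB; illustrative size */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available on this system\n");
        return 1;
    }

    /* Pages of this buffer are spread round-robin across all allowed nodes,
     * mimicking what firmware-level "NUMA disabled" interleaving does, but
     * only for this allocation rather than for the whole machine. */
    char *buf = numa_alloc_interleaved(BYTES);
    if (!buf) {
        fprintf(stderr, "interleaved allocation failed\n");
        return 1;
    }

    memset(buf, 0, BYTES);   /* touch the pages so the placement takes effect */
    printf("allocated %lu MiB interleaved across %d nodes\n",
           BYTES >> 20, numa_num_configured_nodes());

    numa_free(buf, BYTES);
    return 0;
}
```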


Now, we understand that you disabled NUMA because "some applications could not effectively utilize more than one NUMA node", but could you please let us know other reasons why you require NUMA to be disabled?


Wanner G.

Intel Server Specialist.


111alan
Novice

Because in the case of CFD, there is a 30% performance uplift with NUMA disabled. My friend also reported the same for Matlab. This is somewhat supported by some of the benchmarks in PTS: sometimes the difference between 1S and 2S is minimal. In these situations simply disabling NUMA helps, but it makes the memory run at about 1/4 of the bandwidth, so it is still bottlenecked.

If there's any way to optimize the memory performance it'll be extremely appreciated.

Thank you.

IntelSupport
Community Manager

Hello 111alan, 


Thank you for providing more details about your request. 


Please allow us to look into these details.


We will update this thread soon.


Wanner G.

Intel Server Specialist.


IntelSupport
Community Manager

Hello 111alan,


To continue working on your request, we would like to have additional information about your environment:


Is one of the reasons that you need NUMA disabled (despite the severely lowered memory bandwidth you see) to get the 30% performance uplift with the Ansys CFD software?


Or have you heard or read somewhere that there should be a 30% performance uplift with CFD if NUMA is disabled, but you are not seeing this uplift?


Is this related to the results from the PTS toolset?

https://openbenchmarking.org/result/2010047-VE-2009185VE58


What issues are you seeing with the CFD software when NUMA is enabled?


Wanner G.

Intel Server Specialist.


IntelSupport
Community Manager

Hello 111alan,


We are currently working on addressing your questions. However, we would like to have more details to provide accurate information.


Were you able to review the details we requested on our previous post?


Wanner G.

Intel Server Specialist.


IntelSupport
Community Manager

Hello 111alan,


Since we have not seen an update for several days, we will proceed to close this thread.


If you need further assistance, please do not hesitate to update this thread, and we will be happy to help you.


Wanner G.

Intel Server Specialist


111alan
Novice

I'm conducting more tests; I'll update this thread tomorrow. Thanks.

IntelSupport
Community Manager

Hello 111alan,


Thank you for the update. 


We look forward to hearing from you.


Wanner G.

Intel Server Specialist


111alan
Novice

Update:

I've optimized the system and made a cleaner result:

https://openbenchmarking.org/result/2010205-VE-2009185VE44

Feel free to select the two results you want to analyse and hide the others if needed. There are many situations where disabling NUMA produces a performance uplift, sometimes of more than 50%, just like what I saw in CFD. Also, notice that in quite a few workloads having 1 CPU performs almost the same as having 2, while a large, "glued" EPYC2 produces better results. That's why I think memory performance with NUMA disabled is important; there are so many cases where programs can't properly use more than one node. Many system builders I know have recently started telling their customers to try disabling NUMA and re-evaluate performance.

Thank you again for the attention.

 

IntelSupport
Community Manager

Hello 111alan,


Thank you for providing detailed information. We will review it, and update this thread soon.


Wanner G.

Intel Server Specialist

