My apologies in advance. This is the first time I'm posting on the Intel developer forums, and in all fairness, I'm not 100% sure that all parts of my question belong in this forum. Please bear with me.
I work in a university CS and Eng. department. A faculty member recently purchased an Intel server with the W2600CR motherboard, and dual Xeon E5-2620 processors along with 4 x Nvidia GTX670 cards to be used with GPU computing experiments (with CUDA) under Linux (Red Hat Enterprise). The system has 64 GB of DDR3 memory. Because of the system architecture, 32 GB is "attached" to each CPU, and 2 PCI-E slots, hence two GPUs are attached to each CPU (1 x16 and 1 x8 on the first, 2 x8's on the second).
After getting the system set up, the faculty member was shocked that the performance of pageable memory tests between the system and the graphics cards was extremely poor (simply running CUDA's bandwidthTest program with the --memory=pageable option). The results were in the range of 1500 MB/s instead of the 4000 MB/s that he was getting on a much older Core i7 system from 2009. The GTX670 graphics card was taken out of this newer system and moved to the 2009 system, and it worked at the proper performance level there. The GTX580 graphics card was taken out of the 2009 system and tried in the new system, and it performed just as poorly as the GTX670.
I spoke to the vendor from whom we had purchased the Intel solution, who opened a ticket with Intel, and we've been going back and forth for quite a few weeks now with no resolution to the problem. I tried many different setups, but finally, I removed one of the Xeon processors from the new system, the second 32 GB of memory, and 3 of the graphics cards, and now the performance tests worked perfectly, giving the expected results. This led me to wonder if this was some kind of NUMA issue -- maybe I was somehow running the performance test on CPU 0, using a graphics card connected to CPU 1, and memory connected to CPU 1. Could this be the reason why the performance numbers were so poor?
I put the second Xeon processor back in the system, but left out the second 32 GB memory, and left only one GPU in the system connected to CPU 0. I used "numactl --hardware", and expected to see two nodes with 12 cores each (E5-2620 is dual 6 core, and hyperthreading is enabled). Instead, I saw 1 node with 24 cores! I ensured that NUMA was enabled in the BIOS. I did try running the bandwidthTest on all 24 cores, and while there was some minor variation in numbers, nothing even close to the 4000 MB/s result that I wanted to see.
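For readers who want to check the node layout from a script rather than by reading "numactl --hardware" output, here is a minimal Python sketch. It assumes the usual Linux sysfs layout (/sys/devices/system/node/nodeN/cpulist); the function names parse_cpulist and numa_nodes are hypothetical and not part of any tool mentioned in this thread.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-5,12-17" into a sorted list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpulist (memory-only node) yields no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)


def numa_nodes(root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} read from sysfs; empty dict if the directory is absent."""
    nodes = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        if not match:
            continue
        with open(os.path.join(path, "cpulist")) as f:
            nodes[int(match.group(1))] = parse_cpulist(f.read())
    return nodes


if __name__ == "__main__":
    for node, cpus in sorted(numa_nodes().items()):
        print(f"node {node}: {len(cpus)} cpus -> {cpus}")
```

On a machine that reports a single node, this prints one line listing every logical CPU, which matches the "1 node with 24 cores" observation above.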
I decided that since I didn't have much luck getting Intel's help to solve the GPU problem, maybe if we could solve the "NUMA" issue, the GPU performance issue would be gone. After more back and forth with the vendor and Intel, Intel apparently set up a W2600CR system with Linux (RH6.1 since apparently Intel doesn't "support" later versions) in a test lab. I excitedly waited for the response. Would it be an O/S bug? A BIOS issue that needed to be fixed??? My vendor called me back and explained that the support person said that after testing all the boards in the W2600 series, this NUMA behaviour was normal for this board and processors. My problem is, I just don't understand *WHY* it's normal. Another vendor, which has an excellent tutorial on using numactl under Linux on their web site, has a system with the same chipset and same processors, yet they show 2 nodes! Nobody seems to be able to answer *why* it's different for this Intel board.
In the end, I've been unable to rectify the situation, and have lost a LOT of time trying. Is my only choice at this point to buy a second server board, move the second Xeon processor, 32 GB of memory, and two GPUs there? I really don't believe in "solving" problems this way...I suspect the answer *IS* out there, and maybe, just maybe you've got it ... please? ! :)
Thanks Sergey and Iliyapolak,
Sergey - First, I'm not sure if the problem is or is not NUMA related. However, I know that Intel has specialists in this area, and it's not clear why my support ticket doesn't get escalated to someone who can directly help me in this respect, and explain the problem. At least someone at Intel should be able to explain to me why it's "normal" for this board to see all cores attached to one node. Unfortunately, I can't return the board. I wish I could.
Thanks Iliya for the posts -- they are very very helpful, and include some links that I haven't yet read, especially the one that you marked "very insightful article".
However, in the end, I suspect that "maybe" my problem wouldn't be a problem if I could use numa properly with this board.
As it happens, I have now identified that the problem is not actually NUMA after all. After adding back 32 GB on the second processor, numactl reports 2 nodes with 12 cores on each node, and 32 GB on each. Apparently by removing the memory from the second memory bank (in order to force memory access to CPU0), it is normal NUMA behaviour to show all cores on one node. It was actually a person who wrote one of the papers that Iliyapolak (above) had brought to my attention that explained this...
For testing, if I have only one GPU that is connected to a PCI-E port connected to CPU 0, and I use numactl to run bandwidthTest on the first CPU, I get one result; if I then use numactl to run bandwidthTest on the second CPU, the result is a little bit slower (since it has to talk to the GPU connected to CPU 0). This makes sense. However, it doesn't explain the 1000 MB/s reduction in speed in bandwidthTest when a second CPU is inserted. I still believe this is a chipset flaw. I need to speak to the vendor about escalating this request. It would sure be nice to have the problem resolved since I've been working on it for weeks!!! It would be nice if an Intel engineer who reads this message might be able to help.
>>>Thanks Iliya for the posts -- they are very very helpful, and include some links that I haven't yet read, especially the one that you marked "very insightful article".>>>
You are welcome.
>>>This doesn't explain the 1000 MB/s reduction in speed in bandwidthTest when a second CPU is inserted.>>>
Such behaviour (when a second CPU is present) can be traced back to NUMA memory distances. Bear in mind that NUMA functionality at the physical level resembles a small network with its own protocol, error checking and correction, hardware arbitration, etc.
I still don't have any response from Intel. Someone (non-Intel) suggested that with one processor installed, there's no QPI. When I install the second processor, QPI is enabled, and with a "slower" processor (2.0 GHz), it is possible that this isn't enough to run QPI at full speed, hence affecting the result, even though MY program isn't using QPI. This is, by the way, running bandwidthTest on CPU 0, using memory bank 0, using a GPU in an x16 slot that is controlled by CPU 0. The problem is not NUMA related because running the test using memory bank 1 or CPU 1 shows the true effects of NUMA. I suspect I will never really 100% know the answer!
>>>Someone (non-Intel) suggested that with one processor installed, there's no QPI>>>
Not exactly. On a single-CPU system, QPI is used for interconnecting the processor with the I/O hub (X58 chipset). On a multi-processor system, QPI is used to interconnect nodes and I/O hubs.
>>> Even though MY program isn't using QPI>>>
How can you know that your program does not use QPI?
Good point. My program isn't using the processor/memory interconnection of QPI whether running on a single-processor or dual-processor configuration. I'm told that Intel will still set up a trial and get back to me. It just takes a long time.
>>>My program isn't using the processor/memory interconnection of QPI whether running on single processor/dual processor >>>
Sorry, I was looking at the wrong chipset. I assume that your motherboard is built around the C600 chipset?
Jason K. wrote:
Yes, C600-A to be specific..
So I was wrong. I thought that graphics data (text strings or fonts) is moved over QPI to the I/O hub, which sends it to the GPU for text rendering.
I waited for Intel to respond after trying the single and dual Xeon configurations in the W2600CR along with an NVIDIA GTX580 card and the CUDA bandwidthTest program. In particular, I want to understand why the pageable memory test runs about 1 GB/s slower with dual CPUs. I gave explicit instructions on where to download CUDA, how to install it, and how to run bandwidthTest if a demonstration was needed.
The first response that I got back from Intel:
Thank you for contacting Intel Technical Support.
The Intel® QPI technology is not an issue in the Intel® S2600 series of Server boards. The performance and benchmarks you received with one or two CPU are correct.
We ran the tests with several Intel® Server board with the C600 chipset and the Intel® QPI technology and we received the same behavior with 2 CPUs in the configuration.
So I write back, and say that I *know* that they can replicate the result. I want to know the *reason* for the result. My ticket gets "escalated" to someone else. Now I get:
The way I read this is the customer’s app or bench is poorly (multi) threaded.
Tell them to try a different multi-threaded benchmark and/or different app.
We have some performance bench available here: http://www.intel.com/content/www/us/en/benchmarks/server/xeon-e5-2600-summary.html
Perhaps try one of them to see if there’s really performance degradation with the second processor installed.
Take note that I have seen some HPC benchmark performance affected by HyperThreading but this is again due to an app related issue, not board or processor.
I now give up. I told my vendor to just close the ticket. I've been working on this for weeks, getting back responses like the above that don't address my question whatsoever. All I wanted was for a hardware expert to be able to explain what was causing the discrepancy. Unfortunately, the Intel support that *I* have access to is unable to explain this to me.
I understand you, but I do not think that a simple benchmark can diagnose the problem. In order to investigate the problem, someone must dig deep into the internal implementation of the uncore, QPI, NUMA and the GPU's front end, probably at the machine code level. As long as there is no large-scale performance penalty with various apps, nobody will really investigate such issues, and software will be blamed for the performance degradation.
I have an experiment for you to try.
I've noticed that you have experimented with removing one CPU's attached memory and running with 2 CPUs, as well as with removing 1 CPU and getting good or reasonably good performance. Good detective work, by the way. The additional experiment is to configure with 2 CPUs with NUMA enabled (the configuration you want).
Run the CUDA bandwidth test with the bandwidth test app constricted to one NUMA node (and one of the GPUs). Essentially you have done this already. However, the twist is, on the other NUMA node, make a dummy app that engages all threads on that node performing _mm_pause();
Get results, repeat test using same nodes for apps, different GPU
Get results, repeat test using swapped nodes for apps, different GPU
Get results, repeat test using same nodes for apps, different GPU
Essentially, what the test is doing is ensuring that virtually all hardware threads on the other CPU are minimally interfering with the QPI bus.
Should the GPU bandwidth performance improve, then this may provide some insight for the Intel support people to follow up on. However, this will not fix your underlying problem.
An additional test to run is
Run 2 CUDA bandwidth test apps concurrently, each constricted to a NUMA node and using the GPU attached to the CPU on that node.
This test would more likely be representative of your application (as opposed to testing each individually).
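A minimal Python sketch of the concurrent test described above. It only assembles and launches the commands; the binary path "./bandwidthTest" and the (node, device) pairing used in the usage note are assumptions that must be checked against the real system, and the function names are hypothetical.

```python
import subprocess


def bandwidth_cmd(node, device, memory="pageable", binary="./bandwidthTest"):
    """Build a numactl-wrapped bandwidthTest command bound to one NUMA node.

    --cpunodebind and --membind keep both the threads and the host buffer
    on the given node; --device selects the GPU to test.
    """
    return [
        "numactl",
        f"--cpunodebind={node}",
        f"--membind={node}",
        binary,
        f"--memory={memory}",
        f"--device={device}",
    ]


def run_concurrently(pairs):
    """Start one test per (node, device) pair at the same time and return their outputs."""
    procs = [
        subprocess.Popen(bandwidth_cmd(node, dev), stdout=subprocess.PIPE, text=True)
        for node, dev in pairs
    ]
    # communicate() is called only after every process has been started,
    # so the tests overlap in time.
    return [p.communicate()[0] for p in procs]
```

For example, run_concurrently([(0, 0), (1, 2)]) would test GPU 0 from node 0 and GPU 2 from node 1 at the same time, assuming that is how the GPUs are attached. Passing a single pair gives the individual per-node runs for comparison.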
Even though this is an old topic, I thought I should append a proper explanation in case anyone looks here for guidance on this topic.
There are several issues here, and I want to try to be clear about what is dependent on the hardware versus what may have significant software dependencies.
The initial complaint was a large reduction in performance for transfers between system memory and the graphics cards on the new system compared to the old system. While there is a possible NUMA issue here, the problem is made much more difficult to diagnose by the use of "pageable" memory. The DMA drivers used for the bulk transfers between the graphics card and processor only understand physical memory addresses and not virtual memory addresses, so if you access pageable memory there is a lot of extra overhead required to confirm that the user data has not been moved to different physical pages between the initial setup and the subsequent use. This overhead can easily be comparable to the transfer time for a 4KiB page (4 KiB / 4 GB/s = 1 microsecond), so can easily result in the observed performance degradation. The difference between the "old" and "new" systems may be due to different treatment of the two cards by the same driver software. In one case the driver may force the pages to be pinned -- which would provide a one-time overhead followed by consistently high performance -- while in the other case the driver might not require that the pages be pinned, but pay a significant run-time overhead on DMA transfers. This is just speculation, of course, but it is a reminder that if you can't see all the software, then you can't be certain that it is actually behaving the same way.
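To make the arithmetic above concrete, here is a small Python sketch of a simple cost model: each 4 KiB page costs its raw transfer time plus a fixed software overhead. The model and the overhead values are illustrative assumptions, not measurements of any driver.

```python
PAGE_BYTES = 4096  # one 4 KiB page


def effective_bandwidth(link_bytes_per_s, overhead_s, chunk_bytes=PAGE_BYTES):
    """Bytes/s achieved when every chunk pays its transfer time plus a fixed overhead."""
    transfer_s = chunk_bytes / link_bytes_per_s
    return chunk_bytes / (transfer_s + overhead_s)


if __name__ == "__main__":
    link = 4e9  # 4 GB/s, roughly the rate seen on the older system
    for overhead_us in (0.0, 0.5, 1.0, 2.0):  # hypothetical per-page overheads
        bw = effective_bandwidth(link, overhead_us * 1e-6)
        print(f"{overhead_us:.1f} us per page -> {bw / 1e9:.2f} GB/s")
```

In this model, a per-page overhead equal to the page transfer time (about 1 microsecond) halves the achieved bandwidth, so an overhead somewhat larger than that is enough to land near the observed 1500 MB/s.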
The second complaint was that the transfer bandwidth is decreased by 1 GB/s with the second processor installed. There are possible NUMA issues here as well, but there is also an important set of hardware mechanisms that have changed. The first mechanism is that QPI is always being used in the two-socket system to ensure cache coherence. If the GPU executes a DMA read targeting memory on the attached processor chip, a "snoop" signal is sent across the QPI interface to the other socket to check to see if there is a modified copy of the requested cache line in any of the caches on the other chip. This happens in parallel with reading the data from local DRAM, but takes longer, and the PCIe controller cannot return the data to the GPU until a QPI "snoop response" is received from the other socket telling the PCIe controller that there are no modified copies of the data on that other chip. This increases the latency of each DMA read transaction, and if the number of concurrent DMA reads is limited, then the increased latency can cause a linear reduction in throughput.
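The relation between latency and throughput in the last sentence can be sketched with Little's Law: with a fixed number of reads in flight, sustained throughput is the bytes in flight divided by the latency. Framing it this way is my addition, and the concurrency, request size, and latency values below are purely hypothetical.

```python
def dma_read_throughput(outstanding, bytes_per_read, latency_s):
    """Little's Law: sustained bytes/s = (reads in flight * bytes per read) / latency."""
    return outstanding * bytes_per_read / latency_s


if __name__ == "__main__":
    # Hypothetical values: 8 reads in flight, 64 bytes each.
    base = dma_read_throughput(8, 64, 100e-9)
    slower = dma_read_throughput(8, 64, 125e-9)  # latency 25% higher
    print(f"throughput ratio after latency increase: {slower / base:.2f}")
```

With concurrency held constant, any percentage increase in latency (for example from waiting on the snoop response) shows up as a proportional loss of throughput.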
There is a second mechanism that exacerbates the increased latency due to snooping. The Xeon E5-26xx processors support a hardware feature called the "package C1E state". If all of the cores in a socket are in the "C1 halt" state (because the OS has no task to run on them), the hardware will automatically transition all of the cores to the minimum supported frequency. (This should be 1.2 GHz on the Xeon E5-26xx processors.) On the Xeon E5-26xx processors, the "uncore" (containing the L3 cache, ring interconnect, memory controllers, etc), runs at a speed that is no faster than the fastest core, so the "package C1E state" causes the uncore to also drop to 1.2 GHz. When in this state, the chip takes longer to respond to QPI snoop requests, which increases the effective *local* memory latency seen by the processors and DMA engines on the other chip! The amount of the increase depends on the base frequency of the systems, but it looks like the snoop response latency includes 24 cycles in the clock domain of the remote chip. On my Xeon E5-2680 chips, the "package C1E" state increases *local* latency on the other chip by almost 20%. This is in addition to the latency increase that occurs from the addition of the second chip to the system. The "package C1E state" also reduces sustained bandwidth to memory located on the "idle" chip by up to about 25%, so any NUMA placement errors generate even larger performance losses.
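As a rough illustration of the 24-cycle estimate above, this Python sketch converts cycles in the remote chip's clock domain to nanoseconds at the 1.2 GHz package C1E frequency and at a chip's base frequency. It assumes the 24-cycle figure applies unchanged to other models and ignores Turbo frequencies; the function names are hypothetical.

```python
def cycles_to_ns(cycles, freq_ghz):
    """Time in nanoseconds for a number of cycles at a clock given in GHz."""
    return cycles / freq_ghz


def c1e_snoop_penalty_ns(base_ghz, idle_ghz=1.2, cycles=24):
    """Extra snoop-response time when the remote chip runs at idle_ghz instead of base_ghz."""
    return cycles_to_ns(cycles, idle_ghz) - cycles_to_ns(cycles, base_ghz)


if __name__ == "__main__":
    # Base frequencies: E5-2620 at 2.0 GHz (from this thread), E5-2680 at 2.7 GHz.
    for name, base in (("E5-2620", 2.0), ("E5-2680", 2.7)):
        print(f"{name}: +{c1e_snoop_penalty_ns(base):.1f} ns per snoop response")
```

The faster the chip's normal clock, the larger the gap to the 1.2 GHz idle state, which is why the increase depends on the base frequency.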
While Intel recommends against disabling the "package C1E state" on the Xeon E5-26xx processors, it is easy to prevent it from occurring. I wrote a simple "spinner" program that simply increments a register in an infinite loop. Using "taskset" under Linux, I bind this spinner program to a core on the "remote" socket and leave it running while I am doing my work on the "local" socket. The "spinner" program keeps at least one core in the C0 (operational) state, which prevents the hardware from entering the package C1E state, and therefore keeps the "uncore" running at full speed. You can either remember to kill the "spinner" program when you are finished with your local testing or you can set it to the minimum possible priority so that it can stay in the background forever with minimal (but not quite zero) impact to the scheduling of processes at "normal" priority.
None of these items are secret, but the information is scattered in bits and pieces across many documents, and it can take quite an effort to pull enough of it together to form a coherent picture. The basic principles apply to all recent multi-socket Xeon systems, but many of the details differ. For example, I don't know if the "package C1E state" increases the memory latency on the Xeon E5-26xx v3 (Haswell) processors because on these newer processors there is a BIOS option to disable the "package C1E state", so I don't have to worry about it. (Presumably there is a slight increase in power consumption in some cases, but that is not a problem in my environment.)
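A minimal Python sketch of such a spinner, offered as an illustration rather than the program described above. It pins itself with os.sched_setaffinity (Linux-only) instead of relying on "taskset", and an interpreted loop is not a literal register increment, but it should still keep the chosen core busy in C0. The optional seconds argument is my addition so the loop can be bounded for testing.

```python
import os
import sys
import time


def spin(cpu, seconds=None, lowest_priority=True):
    """Busy-loop on one CPU so that its socket never has all cores halted.

    Runs forever when seconds is None. Returns the iteration count otherwise.
    """
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})  # pin this process to the chosen CPU
    if lowest_priority:
        os.nice(19)  # raise niceness; the kernel clamps at the maximum value
    deadline = None if seconds is None else time.monotonic() + seconds
    count = 0
    while deadline is None or time.monotonic() < deadline:
        count += 1
    return count


if __name__ == "__main__":
    if len(sys.argv) == 2:
        spin(int(sys.argv[1]))  # runs until killed
    else:
        print("usage: spinner.py <cpu id on the remote socket>")
```

Saved under a hypothetical name such as spinner.py, it could be started in the background with a CPU id that belongs to the "remote" socket and killed when testing is done.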