Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

NUMA on Xeon E5-2620

Jason_K_1
Beginner
7,289 Views

My apologies in advance.  This is the first time I'm posting on the Intel developer forums, and in all fairness, I'm not 100% sure that all parts of my question belong in this forum.  Please bear with me.

I work in a university CS and Engineering department.  A faculty member recently purchased an Intel server with the W2600CR motherboard and dual Xeon E5-2620 processors, along with 4 x Nvidia GTX670 cards to be used for GPU computing experiments (with CUDA) under Linux (Red Hat Enterprise).  The system has 64 GB of DDR3 memory.  Because of the system architecture, 32 GB of memory and two PCIe slots (hence two GPUs) are attached to each CPU (one x16 and one x8 on the first, two x8s on the second).

After getting the system set up, the faculty member was shocked that the performance of paged-memory transfers between the system and the graphics cards was extremely poor (simply running CUDA's bandwidthTest program with the --memory=pageable option).  The results were in the range of 1500 MB/s instead of the 4000 MB/s he was getting on a much older Core i7 system from 2009.  The GTX670 was taken out of the newer system and moved to the 2009 system, and it worked at the proper performance level there.  The GTX580 was taken out of the 2009 system and tried in the new system, and it performed just as poorly as the GTX670 had.

I spoke to the vendor from whom we had purchased the Intel solution, who opened a ticket with Intel, and we've been going back and forth for quite a few weeks now with no resolution to the problem.  I tried many different setups, but finally I removed one of the Xeon processors, the second 32 GB of memory, and 3 of the graphics cards, and then the performance tests worked perfectly, giving the expected results.  This led me to wonder whether this was some kind of NUMA issue -- maybe I was somehow running the performance test on CPU 0 while using a graphics card and memory connected to CPU 1.  Could this be the reason the performance numbers were so poor?

I put the second Xeon processor back in the system, but left out the second 32 GB of memory and left only one GPU in the system, connected to CPU 0.  I ran "numactl --hardware" and expected to see two nodes with 12 cores each (each E5-2620 has 6 cores, and Hyper-Threading is enabled).  Instead, I saw 1 node with 24 cores!  I ensured that NUMA was enabled in the BIOS.  I did try running the bandwidthTest on all 24 cores, and while there was some minor variation in the numbers, nothing came close to the 4000 MB/s result I wanted to see.

I decided that since I hadn't had much luck getting Intel's help to solve the GPU problem, maybe if we could solve the "NUMA" issue, the GPU performance issue would go away.  After more back and forth with the vendor and Intel, Intel apparently set up a W2600CR system with Linux (RHEL 6.1, since apparently Intel doesn't "support" later versions) in a test lab.  I excitedly waited for the response.  Would it be an O/S bug? A BIOS issue that needed to be fixed???  My vendor called me back and explained that the support person said that, after testing all the boards in the W2600 series, this NUMA behaviour was normal for this board and these processors.  My problem is, I just don't understand *WHY* it's normal.  Another vendor, which has an excellent tutorial on using numactl under Linux on its web site, has a system with the same chipset and the same processors, yet they show 2 nodes!  Nobody seems to be able to answer *why* it's different for this Intel board.

In the end, I've been unable to rectify the situation, and I have lost a LOT of time trying.  Is my only choice at this point to buy a second server board and move the second Xeon processor, 32 GB of memory, and two GPUs there?  I really don't believe in "solving" problems this way... I suspect the answer *IS* out there, and maybe, just maybe, you've got it... please?! :)

Jason.

24 Replies
jimdempseyatthecove
Honored Contributor III
1,302 Views

Thanks, John. In light of this, it would appear appropriate for the user application to discover whether it is running on a NUMA system and then create and schedule these background (low-priority) spinner threads. That way, when the application exits, the spinner threads quit with it.

Would you happen to know whether having the spinner thread loop with an _mm_pause() would alter the effectiveness of the C1E-state blocking?

Jim Dempsey

McCalpinJohn
Honored Contributor III
1,302 Views

The PAUSE instruction should be useful for reducing the power overhead and sibling-thread-performance overhead of having a spinner thread, though I have not tested it.

From Intel's documentation it is a bit hard to tell what the PAUSE instruction actually does.  Agner Fog's "instruction_tables" document says that on Sandy Bridge the PAUSE instruction generates 7 uops and has a throughput of one instruction per 11 cycles.

The C1 HALT state is used when the core is not expected to be needed for (at least) several microseconds, which is thousands of times longer than the delays introduced by the PAUSE instruction.

TimP
Honored Contributor III
1,302 Views

As John said, the PAUSE instruction doesn't appear to wait long enough to enable a power-saving state, although perhaps it avoids unnecessary Turbo activation.  I believe it was introduced to allow a sibling thread (in the case of Hyper-Threading) to get more than 50% of the core's processing resources, and to solve some use cases where Hyper-Threading might otherwise be useless.

It was a long struggle to get this incorporated into the major operating systems, but that is ancient history now, aside from those people who persist in using XP.

McCalpinJohn
Honored Contributor III
1,302 Views

Hmmm... Interesting point about Turbo activation that I had not considered....

On the Sandy Bridge processors, local memory latency is minimized when the processor in the other socket is running its uncore at maximum frequency, which means that at least one core on that socket is running at maximum frequency.   My "spinner" program keeps a core busy incrementing a register, so if maximum Turbo boost is requested it is delivered.   

On Xeon E5 systems I activate the fixed-function counter in the Uncore UBOX (MSR 0x703 to enable and MSR 0x704 for the counts) to measure Uncore clock cycles.  I check the counts before and after a 10 second sleep to get the average uncore frequency.   Note that you need to have at least one process running on the socket during this interval to prevent the processor from dropping into a deep package C state (which appears to prevent this counter from counting).
