Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Efficient dual-socket utilization with parallel threads

RKraw
Beginner

Hello,

I am developing an application on a dual-socket Intel Xeon CPU E5-2630 v4 @ 2.20GHz system with an Intel S2600CWTS motherboard. I managed to accelerate my application for a single socket. Single-threaded, the computation takes 2.7 seconds. Running on a single socket (set via OMP_PLACES='sockets(1)' and monitored with OMP_DISPLAY_ENV=verbose) with 10 threads (equal to the number of physical cores per socket), I reduced the time to 0.39 seconds (approximately 7x speedup); with hyper-threading and close affinity I further reduced it to 0.27 seconds (10x speedup over sequential).

However, when I try dual-socket execution (OMP_PLACES='sockets(2)'), the best execution time I can reach is 0.41 seconds. So although there is still a speedup over the sequential version, it is a slowdown compared to the single-socket run. I compiled the code with the Intel compiler and I use OpenMP.

Now to the questions. Without elaborating too much on my algorithm, I can say that the solution scales easily when the work is run in parallel: for an analysis of 100,000 events, each event can effectively be processed separately (roughly the pattern sketched below). I also tried to mitigate being memory-bound and, despite the relatively low computation-to-data-access ratio, I made the solution scale (see the speedups above).
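
To give an idea, the loop has roughly the shape sketched below; Event and process_event are only stand-ins for my real data type and kernel.

// Sketch of the event loop; Event and process_event stand in for the real code.
#include <omp.h>
#include <cstddef>

struct Event { unsigned char payload[512]; };        // raw samples from the FPGA

void process_event(const Event &e) { (void)e; /* real per-event computation */ }

void process_all(const Event *events, std::size_t n)
{
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)n; ++i)
        process_event(events[i]);                    // events are fully independent
}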

The issue I am facing is that I obtain the data from a custom PCIe FPGA card and hold it in a single buffer. As long as I run on a single socket, I do not incur QPI data-transfer overhead. If I use dual sockets, the efficiency drops. I could run with dual buffers or split the buffer into two chunks of data, but I do not know how to explicitly manage such data transfers; the sketch below shows roughly what I have in mind.
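
Something like the following, splitting the acquisition buffer into two node-local halves with libnuma's numa_alloc_onnode; the names are mine and I have not verified that this is the right approach:

// Sketch: copy one DMA buffer into two node-local halves (link with -lnuma).
// Free the halves later with numa_free(ptr, size).
#include <numa.h>
#include <cstring>
#include <cstdlib>
#include <cstddef>

void split_buffer(const unsigned char *dma_buf, std::size_t bytes,
                  unsigned char **half0, unsigned char **half1)
{
    if (numa_available() < 0) std::abort();                        // no NUMA support exposed
    std::size_t half = bytes / 2;
    *half0 = (unsigned char *)numa_alloc_onnode(half, 0);          // memory on socket 0
    *half1 = (unsigned char *)numa_alloc_onnode(bytes - half, 1);  // memory on socket 1
    std::memcpy(*half0, dma_buf,        half);                     // local copy
    std::memcpy(*half1, dma_buf + half, bytes - half);             // crosses QPI once
}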

So far I have used OpenMP for the acceleration. I attempted to use numactl, but on this dual-socket system I see only a single node. I am now trying to get an overview of how to effectively split the computations and the data.

So my question is the following: is there an efficient method to explicitly transfer data between CPU sockets? Can I explicitly move data to the second socket via QPI so that the data resides there and no further data-transfer latencies are incurred? Is there an API for this? If not, what are the software/API options for effectively splitting such computations? I am collecting data from a single PCIe card, and upon data transmission I would like to transfer the data into the second socket's memory space, or better, into its L3 cache (having Data Direct I/O in mind).
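
For the memory-space part, the only candidate I can think of is the Linux move_pages(2) page-migration interface; the sketch below is my untested guess at how an already-filled region could be pushed to the second socket (migrate_to_node is just my placeholder name), but I do not know whether it is appropriate here:

// Untested sketch: migrate the pages of an existing buffer region to a target
// NUMA node with move_pages(2) (link with -lnuma).
#include <numaif.h>
#include <unistd.h>
#include <vector>
#include <cstdio>
#include <cstddef>

int migrate_to_node(void *start, std::size_t bytes, int target_node)
{
    long page = sysconf(_SC_PAGESIZE);
    std::size_t npages = (bytes + page - 1) / page;
    std::vector<void *> pages(npages);
    std::vector<int> nodes(npages, target_node), status(npages);
    for (std::size_t i = 0; i < npages; ++i)
        pages[i] = (char *)start + i * page;
    // MPOL_MF_MOVE: move only pages exclusively owned by this process
    long rc = move_pages(0 /* this process */, npages, pages.data(),
                         nodes.data(), status.data(), MPOL_MF_MOVE);
    if (rc < 0) std::perror("move_pages");
    return (int)rc;
}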

Best Regards

3 Replies
TimP
Honored Contributor III

It should be sufficient to allocate and first fill (first-touch) the buffer in a thread pinned to the CPU.
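
For example, the usual first-touch idiom, where each pinned thread touches the portion of the buffer it will later work on (a sketch only; it assumes the threads have already been pinned via OMP_PROC_BIND/OMP_PLACES):

// Sketch: first-touch placement. Each pinned thread zeroes the part of the
// buffer it will later work on, so those pages land on its own NUMA node.
#include <omp.h>
#include <cstdlib>
#include <cstddef>

unsigned char *alloc_first_touch(std::size_t bytes)
{
    unsigned char *buf = (unsigned char *)std::malloc(bytes);   // virtual pages only
    #pragma omp parallel
    {
        int t  = omp_get_thread_num();
        int nt = omp_get_num_threads();
        std::size_t chunk = bytes / nt;
        std::size_t lo = (std::size_t)t * chunk;
        std::size_t hi = (t == nt - 1) ? bytes : lo + chunk;
        for (std::size_t i = lo; i < hi; ++i)
            buf[i] = 0;                              // first touch places the page
    }
    return buf;
}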

jimdempseyatthecove
Honored Contributor III

>>OMP_PLACES='sockets(2)'

This will restrict the process to 2 sockets, but it will not necessarily pin the threads. Consider using OMP_PROC_BIND=close or OMP_PLACES=threads.
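
Something along these lines can be used to verify at run time where the threads actually end up (a sketch using the OpenMP 4.5 affinity query routines; e.g. compile with icc -qopenmp):

// Sketch: print each thread's place and the active proc-bind policy.
#include <omp.h>
#include <cstdio>

int main()
{
    #pragma omp parallel
    {
        #pragma omp critical
        std::printf("thread %d -> place %d of %d, proc_bind=%d\n",
                    omp_get_thread_num(), omp_get_place_num(),
                    omp_get_num_places(), (int)omp_get_proc_bind());
    }
    return 0;
}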

>>I attempted to use numactl but with dual-socket system I see only a single node.

Your motherboard BIOS will have a setting to enable or disable memory interleaving across the sockets; with interleaving enabled the machine is presented as a single NUMA node, which would explain what numactl shows you. Note, some user guides and BIOS screens I've seen list the setting backwards... a lost-in-translation thing. Find the setting, flip it, and see what happens.

What appears to be your situation is that the FPGA device is serviced by one socket. Consider constructing your affinity placement such that the master thread is located on the CPU that services the FPGA. The master thread performs the duty of a server thread and instantiates consumer threads using OpenMP tasks. The consumer threads wait for something to do or for an indication of "done" (see the sketch below). Note that while you can issue a new task each time there is something to do, depending on the amount of work per item you may find it more efficient to have the consumers wait on a flag/indicator.
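
A rough skeleton of the task variant; fpga_read_into and process_buffer are placeholders for your real FPGA I/O and event processing, not a tested implementation:

// Skeleton of the server/consumer idea with OpenMP tasks.
#include <omp.h>
#include <cstddef>

bool fpga_read_into(unsigned char *buf, std::size_t bytes);   // false when acquisition is done
void process_buffer(unsigned char *buf, std::size_t bytes);   // per-buffer event processing

void run(unsigned char **bufs, int nbufs, std::size_t bytes)
{
    #pragma omp parallel
    {
        #pragma omp master                       // master = server thread, placed near the FPGA
        {
            int next = 0;
            while (fpga_read_into(bufs[next], bytes))   // copy FPGA data into a node-local buffer
            {
                unsigned char *b = bufs[next];
                #pragma omp task firstprivate(b)        // hand the filled buffer to a consumer
                process_buffer(b, bytes);
                next = (next + 1) % nbufs;
                if (next == 0)
                {
                    #pragma omp taskwait         // crude: do not reuse a buffer still in flight
                }
            }
            #pragma omp taskwait
        }
        // the other threads reach the implicit barrier here and execute the queued tasks
    }
}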

Now then, the server thread can copy from the FPGA into one of several buffers local to its NUMA node, then pass the buffer address to an available thread via a table of thread states (or enqueue a task). Or you can use a ring buffer of buffer addresses, where the consumer threads compete using an XCHG (or compare-and-exchange) to atomically take an entry and replace it with NULL. Note that you have a single-server, multiple-consumer setup, so the server thread only needs to write the buffer pointer; you can determine whether you want a flush following the write.
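
A sketch of the ring-of-pointers variant, here with C++11 atomics standing in for the raw XCHG (ring size and names are arbitrary):

// Sketch: single server publishes buffer addresses into a ring; consumers
// claim an entry by atomically exchanging the slot with NULL.
#include <atomic>

constexpr int RING = 16;
std::atomic<unsigned char *> ring[RING];

void init_ring()                                    // call once before use
{
    for (int i = 0; i < RING; ++i)
        ring[i].store(nullptr, std::memory_order_relaxed);
}

void publish(unsigned char *buf)                    // server side (single writer)
{
    static int head = 0;                            // only the server touches this
    while (ring[head].load(std::memory_order_acquire) != nullptr)
        ;                                           // ring full: wait for consumers
    ring[head].store(buf, std::memory_order_release);   // "write the buffer pointer"
    head = (head + 1) % RING;
}

unsigned char *take()                               // consumer side: NULL if nothing to do
{
    for (int i = 0; i < RING; ++i)
    {
        unsigned char *p = ring[i].exchange(nullptr, std::memory_order_acq_rel);
        if (p)
            return p;
    }
    return nullptr;
}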

Now then, the consumer threads can pre-determine the socket they reside on; knowing this, the threads on the same NUMA node can process the data directly. Experiment to see whether it is more efficient for the non-local threads to process the data directly or to first copy the data into a node-local buffer. Nothing was said about how you process the data; the extra copy may improve throughput at the expense of latency. Perform your "expense" test cases using the full complement of threads, not a single thread.
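
For example, a consumer can determine its own node like this (sketch using libnuma; buffer_node stands for however you tag which node a buffer lives on):

// Sketch: determine the NUMA node (socket) the calling thread runs on
// (Linux + libnuma; link with -lnuma).
#include <numa.h>
#include <sched.h>

int my_numa_node()
{
    return numa_node_of_cpu(sched_getcpu());
}

// In the consumer, e.g.:
//   if (my_numa_node() == buffer_node)  process the buffer in place;
//   else                                copy it to a node-local buffer first, then process.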

Jim Dempsey

RKraw
Beginner

Many thanks for the information. I will give feedback soon.
