Software Archive
Read-only legacy content

Performance of write to remote memory window

Nils_M_1
Beginner

Hello,

I am evaluating the performance of communication between Host/MIC and MIC/MIC using uSCIF. SCIF offers two means of communication: one is the message system (scif_send and scif_recv), the other is DMA (scif_writeto and scif_readfrom). I am especially interested in the fast exchange of small messages, say a few kB, between the devices. The SCIF user guide recommends the message system for small message sizes and DMA for large messages. However, the performance of messages is rather poor in terms of both latency and throughput. On the other hand, DMA achieves high throughput only for large messages > 100 kB and has much lower performance otherwise.

I was quite surprised that the most efficient way to copy small amounts of data from one device to another is to parallelize store operations to remote memory windows using multiple threads and double buffering. My implementation of the copy operation is really simple: I set up a window into the remote memory via scif_mmap and then use memcpy to move the data from local memory to the remote memory.
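
For illustration, this is a minimal sketch of the mapping and copy path (epd is an already-connected endpoint; BUF_SIZE and remote_offset are placeholders for the size and registered offset of the peer's window; error handling is mostly omitted):

#include <scif.h>
#include <string.h>
#include <sys/types.h>

#define BUF_SIZE (64 * 1024)   /* placeholder window size, a multiple of the page size */

/* Map the peer's registered window into the local address space and copy into it.
   The copy is pure PIO: the calling core issues the stores across PCIe. */
void copy_to_remote(scif_epd_t epd, off_t remote_offset, const void *local_buffer)
{
    void *remote = scif_mmap(NULL, BUF_SIZE, SCIF_PROT_READ | SCIF_PROT_WRITE,
                             0, epd, remote_offset);
    if (remote == SCIF_MMAP_FAILED)
        return;                        /* error handling omitted in this sketch */

    memcpy(remote, local_buffer, BUF_SIZE);

    scif_munmap(remote, BUF_SIZE);
}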

I have attached a plot of the throughput of copy operations from the MIC to the host processor. I vary the number Nt of threads that perform the copy operation in parallel as well as the amount of data. The total message size is simply the sum of the amounts of data copied by the individual threads, which allows the performance to be compared with a single DMA write operation.

The performance is really good for passing small messages, but there are a couple of questions I would like to address:

  1. The throughput scales somewhat with the number of threads as long as Nt <= 4. However, there is some upper limit that I don't understand. The limit is 4.5 to 5 GB/s (about 60% of PCIe peak). I have seen this 'feature' on several MIC systems, and it appears for any combination of devices (i.e. MIC -> host, host -> MIC and MIC -> MIC communication exhibit the same bottleneck as long as the devices are attached to the same PCIe root complex). What is the limiting factor here? Can this bottleneck be removed?
  2. Is there any suggestion for a copy kernel that performs better than memcpy for accessing remote memory? I would expect memcpy to be highly optimized, but I'm not quite sure whether this is really the case (I include string.h, but I'm not aware of the actual implementation).
  3. memcpy works fine for copying 64, 128, and N*256 bytes (N integer). Other sizes, e.g. 196 bytes, are problematic: the data never arrives at the remote node. It is not clear to me why this happens or how to avoid it (I use data arrays aligned to 4k boundaries).
  4. What about the ordering of the data? Is it guaranteed that the data arrives in order on the remote node? If so, it would be very easy to check for completion of the operation by polling the last data element on the remote node.


Thanks!

jimdempseyatthecove
Honored Contributor III

>>The limit is 4.5 to 5 GB/s (about 60% of PCIe peak).

Peak performance is generally described as (hypothetical zero-wait input buffer) -> memory, and the other direction. The missing 40% may be due to memory bandwidth limitations on the non-(hypothetical zero-wait input buffer) side.

I don't have a MIC so this is just a hypothesis.

>>What about ordering of the data? Is it guaranteed that the data arrives in-order on the remote node?

If you are using the Nt-thread copy, you cannot be sure what the order of the writes is. In this situation, if you have a barrier after the block write, one of the threads could then write a done flag (see the sketch below).
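
A rough sketch of that pattern, assuming a scif_mmap'ed window remote with room for a flag word behind the payload (all names are placeholders; whether the flag can overtake the payload on the bus is exactly the ordering question discussed here):

#include <string.h>
#include <stdint.h>
#include <stddef.h>
#include <omp.h>

/* Nt threads each copy one chunk into the mapped window; after a barrier,
   a single thread writes a done flag placed behind the payload. */
void parallel_copy(char *remote, const char *local, size_t size, int nt)
{
    size_t chunk = size / nt;          /* assume size is a multiple of nt */

    #pragma omp parallel num_threads(nt)
    {
        size_t t = (size_t)omp_get_thread_num();
        memcpy(remote + t * chunk, local + t * chunk, chunk);

        #pragma omp barrier            /* every thread has issued its stores */
        #pragma omp single
        *(volatile uint64_t *)(remote + size) = 1;   /* done flag behind the data */
    }
}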

In the case of DMA, you have to check whether the DMA is single-channel or multi-channel. Additionally, for single-channel DMA, the driver may be smart enough to start sending the paged-in portions of the block while a background thread touches the paged-out portions (to force the O/S to page them in). So even with single-channel DMA the order may not be assured (i.e. it is not within your control). What works today may not work after the next driver upgrade.

Jim Dempsey

Nils_M_1
Beginner

Jim, thanks for your comment. I have already been speculating about limitations originating from memory. But then one should see these limitations with DMA as well, right? I could also imagine some limitations imposed by PCIe (like running out of credits). I am not sure how to check that.

My formulation about ordering was somewhat sloppy. Of course, there is no ordering between the data sent by different threads. But what about a single thread (Nt = 1) copying the data via memcpy? If the data is not guaranteed to be ordered in that case, then a barrier would not help much in the multi-threaded case either.

I don't know many details of the DMA implementation. I haven't read all of the SCIF source code; the code is quite complicated and there is no documentation available. However, there is a comment in one of the documents that ordering is not guaranteed in the case of DMA. That's all I know. I hope that Intel will improve the performance in the near future. With our dual-MIC setup we see that the performance of data transfer between the MICs is much better if the data is copied by an external InfiniBand controller (provided the message size is sufficiently small; but this data path is not suitable for us).

Florian_R_
Beginner

It would be really helpful if somebody from Intel would respond to this thread. We have experimented a lot with SCIF, and our conclusion is that at the moment SCIF is not the most efficient way of transferring data; however, fast data transfer is more or less mandatory in order to achieve high performance.

jimdempseyatthecove
Honored Contributor III

Program-controlled memory transfers have no setup time; however, they interact with the core's/CPU's cache and memory controller and may therefore be detrimental to other threads.

DMA transfers generally involve an O/S call to a device driver. So there is some overhead to recover. But you do not affect the cache or memory controller.

The overall system impact has to be taken into consideration, not just the benefit to the thread performing the transfer.

Note, some enterprising controller manufacturer could produce a device that maps a page into an application's VM address space. This would be similar to a memory window, except that what is mapped is not RAM on the controller but rather a register set of the DMA engine (to be defined). An application would thus make a single call to the O/S, which would collect the VM mapping tables for the application that reside on the host and on all attached MICs participating in the application. This may require that the open device lock specific address ranges against paging.

To rephrase: for a typical device, the device driver locks the VM pages referenced in the I/O, maps the virtual addresses to physical address(es), and performs scatter/gather/direct I/O. The new device would move those procedures into the device itself, thus permitting the application to supply only the source virtual address, destination virtual address, and byte count. Note, it could also supply an operation such as copy, swap, add, subtract, or, ....

Jim Dempsey

Frances_R_Intel
Employee

I'll find someone more knowledgeable to get you more information, but here is what I can tell you right now:

- scif_send/scif_recv have the highest latency of the SCIF data transfer methods and should be used for messages that are not latency- or bandwidth-sensitive.

- All the examples I have found using scif_mmap use memcpy, so I do not believe there is another approved method.

- Transfers from the host to the coprocessor using a single thread have lower latency than transfers from the coprocessor to the host. So I would have expected them to be faster, although that doesn't seem to be what you are seeing.

- SCIF messages are ordered; the DMA is not - although there is an option to guarantee that the last data sent is the end of the transfer (so that when you see the end of the transfer show up, you know it is complete). I haven't found anything on the memory mapping.

- DMA transfers are cache-line oriented, so the amount of data transferred will always be a multiple of the cache-line size. I didn't find any documentation restricting the length of data transfers using memory mapping, so I would consider the failure of memcpy for transfers that are not 64, 128 or N*256 bytes to be a bug.

Nils_M_1
Beginner

Frances, thanks for your help. Your statements confirm my current knowledge: scif_send/scif_recv provide ordering but do not offer low latency/high throughput; DMA is not ordered, but one can supply the flag SCIF_RMA_ORDERED to ensure that the last cache line arrives after all the others (I have just started some experiments with that); all examples for memory mapping rely on memcpy, but there is no comment on ordering.

I have already seen that the performance of SCIF messages/DMA/PIO is asymmetric, i.e., the performance signatures for Host/MIC and MIC/MIC communication can differ. I have attached a plot that shows the throughput of PIO (memcpy) for Nt = 1 and 16 for Host->MIC, MIC->Host and MIC->MIC communication (in the latter case the MICs are attached to the same PCIe bus, so there is no performance issue due to QPI). Interestingly, for the Host->MIC data path a single thread is enough to hit the bottleneck around 5 GB/s. The MIC->Host and MIC->MIC data paths exhibit comparable signatures, but the performance is not as good as in the Host->MIC case.

I still haven't checked the implementation of memcpy. I have started reading about streaming stores; maybe that's an alternative to memcpy...?

Nils_M_1
Beginner

I have tried to beat the memcpy implementation on the MIC using streaming stores. However, the execution time of my implementation is longer than that of the optimized intel_fast_memcpy. I have only tried copying data within the MIC's main memory, but I expect copying to I/O would also be slower than using memcpy.
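
For reference, the kind of copy kernel I mean is sketched below; this is only a sketch and assumes compilation with icc -mmic and the KNC intrinsics _mm512_load_ps / _mm512_storenrngo_ps (non-temporal, no-read-hint stores), with both buffers 64-byte aligned and the length a multiple of 64 bytes:

#include <immintrin.h>
#include <stddef.h>

/* Streaming-store copy: load one 64-byte cacheline at a time and store it
   with a no-read-hint, non-globally-ordered store that avoids reading the
   destination line into the cache first. */
static void copy_streaming(float *dst, const float *src, size_t n_bytes)
{
    size_t n_floats = n_bytes / sizeof(float);
    for (size_t i = 0; i < n_floats; i += 16) {        /* 16 floats = 64 bytes */
        __m512 v = _mm512_load_ps(src + i);
        _mm512_storenrngo_ps(dst + i, v);
    }
}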

I have also played with the SCIF_RMA_ORDERED flag using scif_writeto. My implementation essentially looks like this:

scif_writeto(epd, local_buffer, size, remote_buffer, flags);
scif_fence_signal(epd, local_signal, local_tag, remote_signal, remote_tag,
                  SCIF_FENCE_INIT_SELF | SCIF_SIGNAL_LOCAL | SCIF_SIGNAL_REMOTE);

If I set flags=0 then repeatedly executing the lines above works fine; however, if I set flags=SCIF_RMA_ORDERED I get errno 12 (ENOMEM). This error is not listed in the SCIF manuals. All buffers are 4k-aligned and size is chosen as 64 * 2^N (N=0,1,2,...). What does work for me is to set flags=SCIF_RMA_USECPU | SCIF_RMA_ORDERED, but in that case the performance is significantly degraded. I haven't found any example code on how to use the ordering flag for DMA, so this looks to me like a bug (I use the latest MPSS).
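
For completeness, a minimal sketch of how completion can then be detected: signal_va is a placeholder for the local virtual address backing the registered offset local_signal, to which scif_fence_signal() writes local_tag once the preceding RMAs have completed.

#include <stdint.h>

/* Spin until SCIF has written 'tag' to the local signal location. */
static void wait_for_signal(volatile uint64_t *signal_va, uint64_t tag)
{
    while (*signal_va != tag)
        ;   /* the DMA and the fence are done once the tag value shows up */
}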

Frances_R_Intel
Employee

The memcpy transfers of sizes like 192 bytes that failed to complete were from the coprocessor to the host?

Frances_R_Intel
Employee

One of our developers wrote up the following summary. It doesn't answer all your questions but it does have some really useful information.

General guidelines for using SCIF

  • scif_send()/scif_recv() are two-sided and are good for sending small pieces of data. Both operations involve a ring transition (user -> kernel) to send/receive data. Data is copied from the user buffer into a ring buffer that the other endpoint can see. An interrupt is then sent to the other side, which is used to copy the data from the ring buffer into the user buffer before waking up the blocked recv call. Overall, there are a few copy (memcpy) operations and an interrupt, plus the ring transitions. The size of the ring buffer is 4K, so if you’re copying data that is larger than that, the ring will fill up and the sender will block (or return, depending on the flags passed in) until the receiver has had a chance to empty the ring. It must be noted that to distinguish between the full/empty conditions, only 4K – 1 bytes of the ring are actually used. For example, if you were sending a 4KB buffer using this mechanism, you would first send 4K – 1 bytes followed by 1B.
  • scif_readfrom()/scif_writeto() and their “v” variants use the DMA engine if the buffer is large. Obviously the user application pays the cost of registering the memory somewhere before calling these APIs, or within the API for the “v” variants. Even though posted writes travel in order on the PCIe bus to their destination, the DMA engines can pull cachelines from local memory on MIC out of order (memory pages coming from different memory controllers, for example) and consequently things can go out on the bus out of order and appear that way. We found an issue when running NetPIPE over MPI over SCIF (within the chassis) that resulted in bandwidth higher than what is possible in HW. What we found is that the NetPIPE implementation over MPI expects the last byte of a buffer to arrive last and polls the last byte to figure out whether the transfer is complete. In this case when SCIF is given a buffer to move to the destination, and this is particularly noticeable when the buffers are not cacheline aligned and there is a head-body-tail type of situation, the driver programs the DMA engine to transfer the body (cacheline aligned) of the buffer and moves the head/tail with the CPU if present. Now if the body is large and the DMA takes a while, it is possible for the tail to appear before the DMA is actually done. To help with this situation, SCIF added the SCIF_RMA_ORDERED flag, which basically guarantees that the last byte of a buffer will arrive last. Clearly SCIF does some serialization to make this work and there is a performance penalty associated with it. I’m not sure what else is expected in terms of ordering when using the DMA engines.
  • scif_vwriteto()/scif_vreadfrom() also benefit from a feature called “registration caching”, where they avoid registering/unregistering memory every time by remembering the virtual address and the previously registered offset. In such cases SCIF has to handle the situation where an application mallocs memory, calls scif_vwriteto() (for example) and then frees the memory, but then calls malloc again, gets the exact same virtual address and calls scif_vwriteto() again. In this case SCIF cannot simply use the pages that were backing the original virtual allocation.
  • Having said all of this, the highest-BW path through the system is via scif_readfrom/scif_writeto. Memcpy-based RMAs (if the SCIF_RMA_USECPU flag is specified) outperform DMA-based transfers for small buffers (and there is a crossover point). For large transfers, DMA is clearly better, approaching the 7 GB/s peak on PCIe Gen2. One limitation, which was pointed out by you, is that the buffers have to be cacheline aligned (and a multiple of the cacheline size in length) to get the best BW. If this is not the case, SCIF will still work, but a combination of DMA and CPU is needed for correctness (and, if the RMA_ORDERED flag is specified, to ensure that the entire buffer is done before the last byte is updated). This simply comes from a limitation of our DMA hardware, whereby the length of a descriptor has to be a multiple of cachelines.
  • Another consideration that helps here is the use of huge pages – 2MB pages. DMA is far more efficient in this case.
  • Lastly, it is worth noting that the DMA engine on MIC (via SCIF) can be programmed from the host or the card. Using SCIF APIs on the host provides some additional performance only because the single-threaded performance of a Xeon is much higher and it can keep the DMA engine fed with descriptors faster than the MIC can. So if an application can use SCIF APIs on the host to pull/push data from/to the card, that is recommended.
  • scif_mmap() followed by a memcpy offers the lowest-latency method for small buffers. For example, to write an 8B pointer across the PCIe bus, a ptr = scif_mmap() followed by a *ptr works really well – the latency is of the order of 500-600 ns in this case. If this is done on the host, then since the host maps MIC memory write-combined (WC), a flush of the write-combining buffer is needed to make the data visible on the card. On the host this can be done with a fence/sfence/cpuid type of serializing operation (a short sketch follows after this list). The card, on the other hand, sees host memory as uncacheable (UC). Using Intel’s fast memcpy (you can verify this by building your app, objdumping it and looking for intel_fast_memcpy – it is vectorized) provides the best performance, as the cores are able to write cachelines’ worth of posted writes across PCIe (we support 256B TLP payloads across PCIe).
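
A short sketch of that last bullet, as seen from the host side (epd and remote_offset are placeholders for a connected endpoint and a page-aligned offset registered on the card; the mapping would normally be set up once and reused):

#include <scif.h>
#include <stdint.h>
#include <sys/types.h>
#include <immintrin.h>

/* Write a single 8-byte value into card memory mapped via scif_mmap().
   The host maps the card's memory write-combined, so the store must be
   followed by a fence to flush the WC buffer and make it visible on the card. */
void post_word(scif_epd_t epd, off_t remote_offset, uint64_t value)
{
    void *map = scif_mmap(NULL, 0x1000, SCIF_PROT_READ | SCIF_PROT_WRITE,
                          0, epd, remote_offset);
    if (map == SCIF_MMAP_FAILED)
        return;                        /* error handling omitted */

    volatile uint64_t *p = (volatile uint64_t *)map;
    *p = value;                        /* one posted write across PCIe */
    _mm_sfence();                      /* flush the write-combining buffer */
}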

In all of this, one point to note is that the performance between two MIC cards depends on how the two MICs are located relative to each other – i.e. whether they are on the same socket or across QPI. There are some peer-to-peer write limitations across QPI on SNB platforms. SCIF tries to proxy peer-to-peer writes into reads, but that only happens when you call scif_writeto/scif_readfrom or their v variants (not with scif_send/scif_recv).

Jens_K_
Beginner

Hello Nils,

I just found your interesting post now. I am also evaluating the performance of communication between host and MIC. I wanted to ask whether you could publish the code of your benchmark, please.

Thanks in advance.
Jens
