Software Archive
Read-only legacy content

Recommendation for a socket-like and efficient API to communicate with the MIC

Xu_F_
Beginner

I am porting a client-server program to the MIC; it has high concurrency and a massive amount of data to transmit.

The server side will run on the MIC and provide a computing service for the client on the host.

There are more than 100 threads that together transmit more than 10 GB of data in total, and the program was implemented with the socket API on clusters.

So I was wondering whether there is a socket-like and efficient API that would let me adapt this program to the MIC easily and efficiently.

Could you list some methods and give some references from which I can learn more?

Thanks a lot.

 

JJK
New Contributor III

I'd start out with "just porting" the client/server model - TCP network performance between host and Phi is on the order of 4 Gbps; if you change everything to SCIF or an InfiniBand-like link you'd manage up to 6.4 Gbps. Is that worth the effort?

Xu_F_
Beginner

JJK wrote:

I'd start out with "just porting" the client/server model - TCP network performance between host and Phi is on the order of 4 Gbps; if you change everything to SCIF or an InfiniBand-like link you'd manage up to 6.4 Gbps. Is that worth the effort?

Does 6.4 Gbps mean 0.8 GB/s?

I think that speed is not enough for my application. In fact, I want the transfer to be as fast as possible.

I care most about efficiency; any way to speed up the transfer is worth the effort.

So what is the proper transfer method for my app?

Thanks.

JJK
New Contributor III

Yes, that's 6.4 Gbps = 0.8 GB/s

The Xeon Phi is a PCI Express rev2 card, which gives you a theoretical maximum transfer rate of 8 Gbps = 1.0 GB/s; in practice you'll never achieve more than ~6.4 Gbps - this also applies to all other PCI Express rev2 cards such as GPUs.

 

McCalpinJohn
Honored Contributor III

Don't forget the width of the interface! The Xeon Phi has a PCIe gen2 x16 interface. PCIe gen2 has a raw bit rate of 5 Gbits/sec per lane per direction, which leaves 4 Gbits/sec/lane/direction after the 8b/10b encoding overhead. Multiplying by the width gives 4 x 16 = 64 Gbits/second/direction = 8 GB/second/direction.

This 8 GB/s per direction has to include PCIe commands, responses, and headers in addition to data, so the effective data bandwidth is limited to somewhere in the range of 6.5 GB/s (unidirectional) or around 5.5 GB/s in each direction simultaneously.  The exact values depend on packet size, data address (which determines whether 32-bit or 64-bit addresses are used in the headers), transaction types, the use of various optional PCIe header packets, etc.

Low-level (SCIF) benchmarks show user data transfer rates from host to Xeon Phi (or Xeon Phi to host) of well over 5 GB/s in one direction.  If I recall correctly, this is more than 10x faster than TCP/IP transfers.
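Those high SCIF rates typically come from the DMA-based RMA calls (scif_writeto/scif_readfrom on registered memory) rather than from scif_send/scif_recv. A rough sketch of that path, assuming an already-connected endpoint 'epd' and a remote registered offset obtained from the peer beforehand (the function name, transfer size, and offsets here are only illustrative examples, not production code):

/* Sketch: push one registered buffer to the peer with DMA via scif_writeto().
 * Assumes 'epd' is an already-connected SCIF endpoint and 'remote_off' is a
 * registered offset the peer has shared earlier (e.g. via scif_send()).
 * Build with -lscif. */
#include <scif.h>
#include <stdlib.h>
#include <sys/types.h>

#define XFER_BYTES (64UL * 1024 * 1024)   /* example transfer size: 64 MiB */

int push_buffer(scif_epd_t epd, off_t remote_off)
{
    void *buf = NULL;

    /* SCIF registration works on whole pages, so page-align the buffer */
    if (posix_memalign(&buf, 0x1000, XFER_BYTES) != 0)
        return -1;

    /* expose the local buffer in this endpoint's registered address space */
    off_t local_off = scif_register(epd, buf, XFER_BYTES, 0,
                                    SCIF_PROT_READ | SCIF_PROT_WRITE, 0);
    if (local_off == SCIF_REGISTER_FAILED) {
        free(buf);
        return -1;
    }

    /* DMA from the local registered window to the peer's registered window */
    int rc = scif_writeto(epd, local_off, XFER_BYTES, remote_off, 0);

    /* a real program would fence (scif_fence_signal/scif_fence_mark) before
     * reusing or unregistering the buffer; omitted here for brevity */
    scif_unregister(epd, local_off, XFER_BYTES);
    free(buf);
    return rc;
}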

JJK
New Contributor III

I humbly bow my head... I've mixed GB/s and Gbps in my tests. For my 5110P I get

  • TCP throughput: ~4 Gbps
  • offload throughput: 6.5 GiB/s
  • OpenCL bandwidth (pinned memory): 6.3 GiB/s

Sorry about the noise.

 

Frances_R_Intel
Employee

You might want to look at:

http://www.intel.com/content/www/us/en/benchmarks/server/xeon-phi/xeon-phi-pci-transfer.html

The numbers there were obtained by running an optimized, internal test program. A couple of interesting points: the data transfer rates differ depending on the coprocessor version being used, and going from the coprocessor to the host is a little bit faster than going from the host to the coprocessor. And speaking of going faster, you will notice that the highest speed is 6.98 GB/s (5120D, coprocessor to host) and the slowest is 6.70 GB/s (5120D, host to coprocessor), both a bit faster than the 6.5 GB/s that JJK found with his informal testing.

I don't generally like to recommend that people use the SCIF API, but if you really want the best transfer rate, that is what you will end up with. You can find the SCIF User Guide in the docs directory that comes with the MPSS.
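To give an idea of how close SCIF is to the socket model, here is a minimal host-side client sketch using the SCIF endpoint calls, which map almost one-to-one onto socket()/connect()/send()/recv(). The node number, port number, and message below are arbitrary examples; a server on the coprocessor would use scif_bind(), scif_listen(), and scif_accept() just like bind/listen/accept:

/* Host-side SCIF client sketch (the host is SCIF node 0, the first
 * coprocessor is node 1). Port 2050 and the message are made-up examples.
 * Build with: gcc scif_client.c -lscif */
#include <scif.h>
#include <stdio.h>

#define SERVER_NODE 1      /* first Xeon Phi card */
#define SERVER_PORT 2050   /* arbitrary user port */

int main(void)
{
    scif_epd_t epd = scif_open();                      /* like socket() */
    if (epd == SCIF_OPEN_FAILED) {
        perror("scif_open");
        return 1;
    }

    struct scif_portID dst = { .node = SERVER_NODE, .port = SERVER_PORT };
    if (scif_connect(epd, &dst) < 0) {                 /* like connect() */
        perror("scif_connect");
        return 1;
    }

    char msg[] = "hello from the host";
    if (scif_send(epd, msg, sizeof(msg), SCIF_SEND_BLOCK) < 0)    /* like send() */
        perror("scif_send");

    /* like recv(); SCIF_RECV_BLOCK waits for the full sizeof(reply) bytes,
     * so a real protocol would frame its messages */
    char reply[64] = {0};
    if (scif_recv(epd, reply, sizeof(reply), SCIF_RECV_BLOCK) < 0)
        perror("scif_recv");
    else
        printf("server replied: %s\n", reply);

    scif_close(epd);
    return 0;
}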

JJK
New Contributor III

For the record: I ran the 'micprun' test on my 5110P card and got 6.97 GB/s as well.

This is (again) a minor unit-conversion thing: my simplistic benchmark reports 6.5 GiB/s, which is 6.98 GB/s. A GiB is 1024**3 bytes, whereas a GB is 1 billion bytes, so 6.5 * 1024**3 bytes/s is about 6.98 * 10**9 bytes/s. Thus, the upper limit on host-to-device and device-to-host bandwidth seems to be around 6.5 GiB/s == 6.98 GB/s. This is also what I'd expect from a 16-lane PCI Express Rev2 card (5.0 GT/s max).

 
