
MIC to MIC to HOST MPI bandwidth issue

marek_kaletka
Beginner

Running Intel's IMB benchmark (Intel MPI version 4.1.0.024), I've got some strange results.

mpirun -genv I_MPI_FABRICS=shm:dapl -np 2 -ppn 1 -hosts mic0,mic1 ./IMB-MPI1 PingPong 

   36 us latency for 0-byte messages, max 868 MB/s for 4 MB messages.

Using tcp instead of dapl (I have an external bridge config for the MICs' Ethernet ports with an MTU of 1500):

 mpirun -genv I_MPI_FABRICS=shm:tcp -np 2 -ppn 1 -hosts mic0,mic1 ./IMB-MPI1 PingPong

  496 us latency for 0-byte messages and only 16 MB/s max throughput for 4 MB messages!

I expected much better numbers (especially for tcp). Anyone have an idea what's wrong?
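As a sanity check (a sketch, not a command from the runs above), Intel MPI's debug output shows which fabric and DAPL provider actually get selected; re-running the same PingPong with I_MPI_DEBUG added prints that information at startup:

 mpirun -genv I_MPI_DEBUG=5 -genv I_MPI_FABRICS=shm:dapl -np 2 -ppn 1 -hosts mic0,mic1 ./IMB-MPI1 PingPong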

7 Replies
Dale_Wang
Beginner

I ran into the same problem when I tested the bandwidth between the MIC and the host directly over a plain TCP socket: it is 18 MB/s.
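A raw TCP check of that kind can be sketched with nc and dd (host name, port and transfer size are placeholders, the nc option syntax depends on the netcat variant installed, and this is not the exact test that produced the 18 MB/s figure):

 # on the host: receive and discard the data
 nc -l -p 5001 > /dev/null
 # on the coprocessor: stream 1 GB of zeros to the host; dd reports the achieved rate
 dd if=/dev/zero bs=1M count=1024 | nc host0 5001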

TimP
Honored Contributor III

Current releases of Intel MPI should improve DAPL performance over such older ones; the latest MVAPICH, with MIC-to-MIC communication routed over the host QPI, may be better yet.

Vladimir_Dergachev

I see the same problem, which cripples NFS performance:

http://software.intel.com/en-us/forums/topic/404743#comment-1746053

I'm about to write a custom library for file access over SCIF. But if I had the time, the right way would be to fix the network driver or write an Ethernet-over-SCIF driver.
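In the meantime, forcing large NFS transfer sizes over TCP when mounting on the card is sometimes worth a try (the export path, mount point and sizes below are placeholders, not a configuration from this thread):

 mount -t nfs -o tcp,rsize=1048576,wsize=1048576 host0:/export /mnt/nfs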

Gregg_S_Intel
Employee

Those latencies are too high and bandwidth too low.

I get DAPL latency close to 10 us, bandwidth greater than 1300 MB/s.

For TCP, latency is around 300 us, bandwidth close to 80 MB/s.

I'm using Intel(R) MPI 4.1.1.036.

See this article for cluster configuration tips:  http://software.intel.com/en-us/articles/configuring-intel-xeon-phi-coprocessors-inside-a-cluster

marek_kaletka
Beginner

I switched to Intel(R) MPI 4.1.1.036 and the latest MPSS and got slightly better results for dapl (16-20 us and 885 MB/s), but tcp is still very slow.
Using dd to benchmark read/write speed from/to an NFS share (a filer known to be able to stream > 800 MB/s) gives 20/21 MB/s.
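The dd runs look roughly like this (mount point and sizes are placeholders, not the exact commands used):

 # write 1 GB to the NFS share and flush it before dd reports the rate
 dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024 conv=fsync
 # read it back (drop or bypass the local page cache first, or the result is meaningless)
 dd if=/mnt/nfs/testfile of=/dev/null bs=1M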

IMO something is wrong with the virtual NICs and/or the IP stack implementation in MPSS, or the MIC's cores are simply not powerful enough to handle more IP traffic.

IMB PingPong between two machines hosting MIC coprocessors, over a plain 1 Gbit Ethernet (i350) connection, gives a minimum latency of 50 us and a maximum bandwidth of 112 MB/s (as expected, the 1 Gbit limit). That's roughly 10x faster than between two MIC cards connected through PCIe and MPSS's virtual network stack.

Gregg, what MTU size do you use in your environment? I've double-checked my config, but can't get beyond 16 MB/s using TCP.

Gregg_S_Intel
Employee

The article links to configuration notes directly from the administrator who set up the cluster whose latencies and bandwidth I quoted. It's good, first-hand information.  From the notes, "The MTU in this network is generally set to 9000.  Please adapt this to your settings."
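As a quick, non-persistent check (interface names br0 and mic0 are assumptions; the article covers making the change permanent in the MPSS configuration), the MTU can be raised on both ends like this:

 # on the host: host-side card interface first, then the bridge
 ifconfig mic0 mtu 9000
 ifconfig br0 mtu 9000
 # on the coprocessor itself
 ssh mic0 "ifconfig mic0 mtu 9000"

Every endpoint attached to the bridge (including any external switch in between) has to agree on the MTU, otherwise large packets are simply dropped.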

aazue
New Contributor I

Hi,
Regarding "I see the same problem which cripples NFS performance" and the mounted share:
I don't know whether it's possible for you to compile CIFS (Samba) for the Phi, but it sometimes gives better results than NFS.
An advantage is that the latest version can read and write in deferred mode, and you also have more parameters in its smb.conf to tune when you run into weird performance.
The downside of Samba is that it's a bit complex, with a gigantic number of options.
Personally I always use it, over fiber, copper and also wireless, and I am very satisfied with it.
I have my doubts, though, that correcting the MTU will work a miracle in your particular case.
Regards
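For illustration, mounting such a Samba share from the coprocessor could look like the line below; the server address, share name, user and buffer sizes are placeholders:

 mount -t cifs //172.31.1.254/scratch /mnt/scratch -o user=micuser,rsize=130048,wsize=130048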
