Frederik_C_
Beginner

Using multiple Xeon Phi together

Hi,

I'm currently trying to evaluate the performance of Xeon Phi cards for CFD applications. I was able to compile OpenFOAM (openfoam.org) to be used with a Xeon Phi card. I'm currently able to run a single CFD simulation in parallel (using multiple processes) either on a single card or on multiple cards.

For example, I can successfully run both:

  • mpirun -np 8 -host mic0 icoFoam -parallel
  • mpirun -np 8 -host mic0,mic1 icoFoam -parallel

When using the first command, I get good parallel efficiency. The problem is that with the second command the calculation becomes really slow; it is actually slower than running on a single core of a single card. I also noticed that when transferring files between the host and the MICs, the rate hovers around 10 MB/s. Why is it so low?
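For reference, the usual way to put numbers on this is speedup S = T1 / Tp and parallel efficiency E = S / p, where T1 is the single-process wall time and Tp the wall time on p processes. A minimal Python sketch; the timings in it are made-up placeholders, not measurements from this setup:

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S = T1 / Tp (a value below 1.0 means the parallel run is slower)."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial: float, t_parallel: float, n_procs: int) -> float:
    """Parallel efficiency E = T1 / (p * Tp); 1.0 means perfect scaling."""
    return speedup(t_serial, t_parallel) / n_procs


if __name__ == "__main__":
    # Hypothetical wall times in seconds, for illustration only.
    t1 = 800.0            # 1 process on a single card
    t_one_card = 120.0    # 8 processes on mic0
    t_two_cards = 1000.0  # 8 processes on mic0,mic1
    for label, tp in (("one card", t_one_card), ("two cards", t_two_cards)):
        print(f"{label}: S={speedup(t1, tp):.2f}, E={parallel_efficiency(t1, tp, 8):.2f}")
```

A speedup below 1.0 on two cards, as described above, means the extra communication cost outweighs the extra compute.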

One of the reasons I want to add more cards to a single simulation is to increase the RAM available for a single problem. So far, I have been able to fit around 5,700,000 cells in the card's 8 GB of RAM.
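As a rough back-of-the-envelope check on how far more cards could take this: 8 GB for about 5.7 million cells is roughly 1.5 kB per cell. A minimal Python sketch, assuming memory per cell stays constant when the mesh is split and treating the full 8 GB as usable (in practice the card's own OS takes some, so the true per-cell figure is lower); `overhead_fraction` is a hypothetical knob for memory lost to halo cells and MPI buffers:

```python
GIB = 1024 ** 3  # the "8 GB" above taken as 8 GiB


def bytes_per_cell(ram_bytes: float, n_cells: int) -> float:
    """Average memory footprint per mesh cell."""
    return ram_bytes / n_cells


def estimated_capacity(n_cards: int, cells_per_card: int,
                       overhead_fraction: float = 0.0) -> int:
    """Rough upper bound on total cells across n_cards.

    Assumes each card holds as many cells as a single card does today,
    minus a hypothetical fraction lost to halo cells and MPI buffers.
    """
    return round(n_cards * cells_per_card * (1.0 - overhead_fraction))


per_cell = bytes_per_cell(8 * GIB, 5_700_000)
print(f"~{per_cell:.0f} bytes per cell")
for cards in (1, 2, 8):
    total = estimated_capacity(cards, 5_700_000, overhead_fraction=0.1)
    print(cards, "cards:", total, "cells")
```

With all 8 cards this is on the order of 45 million cells at best; whether that is usable depends on making inter-card communication fast enough.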

From what I understand, the MPI messages are passed through shared memory within a card and through virtual TCP between cards (I'm using I_MPI_FABRICS=shm:tcp). I think the slowness is caused by the virtual TCP network between the cards.

Following that idea, I searched the web a bit and found references to using OFED to pass these MPI messages faster. Is that notion correct?

Unfortunately, I was not able to test this, since every attempt to install a version of OFED on my cluster failed. I tried the following versions:

  • Intel TSR (reports missing glibc, although yum says it is installed)
  • OFED 3.12-1 (throws an error while compiling)
  • OFED 3.18-1 (reports missing libudev-devel, although yum says it is installed)
  • Mellanox OFED (OS not supported)
  • OFED 3.18-2 (requires a newer kernel)

My installation consists of the following:

  • CentOS 7.2 (uname -r: 3.10.0-327.36.1.el7.x86_64)
  • MPSS 3.7.2 (installed from packages, MPSS user guide section 3.4)
  • OpenFOAM 2.3.0 (as compiled following these instructions http://machls.cc.oita-u.ac.jp/kenkyu/ryuki/wpjp/openfoam-v2-3-0-on-xeon-phi)
  • 8 Xeon Phi cards

So I would like to know:

  1. Is it possible to increase the MPI communication speed between MICs?
  2. Is it possible to increase the scp transfer rate?
  3. Do I have to install an OFED package? How? Any prerequisites?
  4. Is it possible to share the host RAM to the MIC so I can fit larger simulations on the card? Do I have to expect a large performance hit?

Best regards,

Frederik

JJK
New Contributor III

If you've only just started evaluating the first generation of Xeon Phi cards (KNC) for CFD, then my advice would be: don't. Try to get your hands on the second generation (Knights Landing, aka KNL) instead. The first-generation cards will be obsolete within a year or two. KNL has numerous advantages over KNC, although there's no card form factor yet.

As for your questions:

  1. Is it possible to increase the MPI communication speed between MICs?

You'd need to use an OFED driver shim for this. I have never attempted to install it on CentOS 7, however.

  2. Is it possible to increase the scp transfer rate?

TCP performance between the host and the card is fairly bad, although the OFED driver route might help you there as well.

  3. Do I have to install an OFED package? How? Any prerequisites?

As I said, I've never done this on CentOS 7, but IIRC the Xeon Phi admin guide does include fairly extensive instructions.

  4. Is it possible to share the host RAM to the MIC so I can fit larger simulations on the card? Do I have to expect a large performance hit?

Yes, this is theoretically possible (by using host RAM as swap space on the MIC), but you'll see a huge performance hit: so bad that you might not want to use the MIC in the first place.

areid2
New Contributor I

Have you tried running an MPI benchmark to get a better idea of the performance for both latency and bandwidth for various message sizes? I've found this one useful when trying to optimize MPI applications that are sensitive to communication delays:

https://software.intel.com/en-us/node/561902

That'll at least give you a baseline that you could compare with other clusters, or if you manage to install OFED.
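Once you have those numbers, a common way to summarize them is the latency/bandwidth (Hockney) model t(n) = α + n / β, where α is the per-message latency and β the asymptotic bandwidth. A minimal Python sketch that fits α and β by ordinary least squares to (message size, time) pairs from a ping-pong style test; the data in the example is synthetic, not a measured result:

```python
from typing import Sequence, Tuple


def fit_latency_bandwidth(sizes_bytes: Sequence[float],
                          times_s: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of t = alpha + n / beta.

    Returns (alpha, beta): latency in seconds and bandwidth in bytes/s.
    """
    n = len(sizes_bytes)
    if n < 2 or n != len(times_s):
        raise ValueError("need at least two (size, time) pairs of equal length")
    mean_x = sum(sizes_bytes) / n
    mean_y = sum(times_s) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes_bytes)
    if sxx == 0:
        raise ValueError("message sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes_bytes, times_s))
    slope = sxy / sxx                 # seconds per byte, i.e. 1 / bandwidth
    if slope <= 0:
        raise ValueError("time does not grow with message size; check the data")
    alpha = mean_y - slope * mean_x   # intercept: time for a zero-byte message
    return alpha, 1.0 / slope


if __name__ == "__main__":
    # Synthetic data: 50 us latency, 100 MB/s bandwidth (not measured values).
    sizes = [0, 1024, 65536, 1048576, 4194304]
    times = [50e-6 + s / 100e6 for s in sizes]
    alpha, beta = fit_latency_bandwidth(sizes, times)
    print(f"latency ~ {alpha * 1e6:.1f} us, bandwidth ~ {beta / 1e6:.1f} MB/s")
```

Small messages are dominated by α and large ones by β, so comparing both terms for intra-card and card-to-card runs shows which one is hurting the two-card case.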
