I'm currently trying to evaluate the performance of Xeon Phi cards for CFD applications. I was able to compile OpenFOAM (openfoam.org) to be used with a Xeon Phi card. I'm currently able to run a single CFD simulation in parallel (using multiple processes) either on a single card or on multiple cards.
For example, both of the following run successfully:
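(The exact commands aren't shown above; the following is only a sketch of the two kinds of launch, assuming Intel MPI's `mpirun` with its `-hosts`/`-perhost` options, and using hypothetical placeholder names `mic0`/`mic1` for the coprocessors and `simpleFoam` for the solver.)

```shell
# Single card: all ranks run on mic0 (placeholder solver and host names)
mpirun -host mic0 -n 60 simpleFoam -parallel

# Two cards: the same rank count split evenly across mic0 and mic1
mpirun -hosts mic0,mic1 -perhost 30 -n 60 simpleFoam -parallel
```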
When using the first command, I get good parallel efficiency. The problem is that the second command gives extremely slow calculation times, slower even than using a single core on a single card. I also noticed that when transferring files between the host and the MICs, the rate lingers around 10 MB/s. Why is it so low?
One of the reasons I want to add more cards to a single simulation is to increase the RAM available for a single problem. So far, I have been able to fit around 5,700,000 cells within a card's 8 GB of RAM.
From what I understand, the MPI messages are passed through shared memory within a card and through a virtual TCP network between cards (I'm using I_MPI_FABRICS=shm:tcp). I think the slowness is caused by this virtual TCP network between the cards.
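For reference, a minimal sketch of the fabric selection involved (the variable names are Intel MPI's; the `shm:dapl` setting assumes a working OFED/DAPL stack on the MICs, which is exactly what the install attempts below were aiming for):

```shell
# Current setup: shared memory within a card, emulated TCP between cards
export I_MPI_FABRICS=shm:tcp

# With a working MIC OFED stack, DAPL could be used between cards instead
export I_MPI_FABRICS=shm:dapl

# Verbose startup reporting, to confirm which fabric is actually selected
export I_MPI_DEBUG=5
```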
Starting from that idea, I searched a bit on the web and found references to using OFED to pass these MPI messages faster. Is that correct?
Unfortunately, I was not able to test this, since every attempt to install a version of OFED on my cluster failed. I tried the following versions:
My installation consists of the following:
So I would like to know:
If you've only started evaluating the first generation of Xeon Phi cards (KNC) for CFD, then my advice would be: don't. Try to get your hands on the second generation (Knights Landing, aka KNL) instead. The first generation of cards will be obsolete within a year or two, and KNL has numerous advantages over KNC, although there's no card form factor for it yet.
As for your questions:
You'd need to use an OFED driver shim for this; I have never attempted to install it on CentOS 7, however.
TCP performance between the host and the card is fairly bad, although the OFED driver route might help you there as well.
As I said, I've never done this on CentOS 7; the admin guide for the Xeon Phi does list some extensive instructions, IIRC.
Yes, this is theoretically possible (by using host RAM as swap space for the MIC), but you'll see a huge performance hit, so bad that you might not want to use the MIC at all.
Have you tried running an MPI benchmark to get a better idea of the performance for both latency and bandwidth for various message sizes? I've found this one useful when trying to optimize MPI applications that are sensitive to communication delays:
That'll at least give you a baseline that you can compare against other clusters, or against your own system again once you manage to install OFED.