
MPI issues with 2 KNCs on a single host

Simon_H_2
Beginner

Hi,

I am testing my MPI application on 2 KNCs attached to the same host CPU. I observe *strongly* fluctuating performance, by a factor of 10 or even more: for example, between 10 and 160 Gflop/s (per card). The variation occurs within a loop that does the same computation in every iteration. When it runs at 160 Gflop/s, one loop iteration takes around 0.05 seconds, so the fluctuations occur on a timescale longer than that.
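Schematically, the timing loop looks like this (a simplified sketch, not my actual code; do_iteration() and FLOPS_PER_ITER stand in for the real kernel and its flop count):

    #include <mpi.h>
    #include <stdio.h>

    #define FLOPS_PER_ITER 2.0e8   /* flop count of the stand-in kernel below */

    /* Stand-in for the real compute kernel: identical work on every call. */
    static void do_iteration(void)
    {
        volatile double x = 1.0;
        for (long i = 0; i < 100000000L; ++i)
            x = x * 1.0000001 + 1e-9;   /* 2 flops per trip */
        (void)x;
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int it = 0; it < 100; ++it) {
            MPI_Barrier(MPI_COMM_WORLD);   /* align ranks before timing */
            double t0 = MPI_Wtime();
            do_iteration();                /* same work every iteration */
            MPI_Barrier(MPI_COMM_WORLD);
            double dt = MPI_Wtime() - t0;
            if (rank == 0)
                printf("iter %3d: %.3f s, %6.1f Gflop/s\n",
                       it, dt, FLOPS_PER_ITER / dt / 1.0e9);
        }

        MPI_Finalize();
        return 0;
    }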

I am using:
I_MPI_FABRICS_LIST=dapl
I_MPI_DAPL_PROVIDER_LIST=ofa-v2-scif0

Observations:

  • If I use the InfiniBand card instead (I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u), still with both coprocessors on the same host, the performance is consistent.
  • If I reduce the number of cores used by my application, the performance becomes more stable: with 56 cores I still see fluctuations; with 48 cores the performance is mostly stable, though fluctuations are still visible.
  • I did not observe anything peculiar with the "osu" bandwidth benchmark, apart from a "dip" at 8 kB. The dip can be reduced by changing I_MPI_DAPL_DIRECT_COPY_THRESHOLD, but that parameter shows no influence for my actual application. (A minimal ping-pong sketch of this kind of size sweep follows this list.)
  • I tried 2 hardware setups: (1) a dual-socket server board, where (I think) the data has to pass through the southbridge and/or QPI(?), and (2) a system with a PLX PCIe switch. The fluctuations happen on both systems.
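As referenced above, here is a bare-bones ping-pong sweep in the spirit of the osu bandwidth test (a simplified stand-in, not the actual OSU code), to be run with exactly 2 ranks, one per coprocessor, under the same DAPL settings:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int reps = 1000;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(64 * 1024);
        memset(buf, 0, 64 * 1024);

        /* Sweep message sizes from 1 kB to 64 kB, bracketing the 8 kB dip. */
        for (int size = 1024; size <= 64 * 1024; size *= 2) {
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; ++i) {
                if (rank == 0) {
                    MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double dt = MPI_Wtime() - t0;
            if (rank == 0)   /* 2 transfers of `size` bytes per rep */
                printf("%6d B: %8.1f MB/s\n", size,
                       2.0 * size * reps / dt / 1.0e6);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }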

Is there anything wrong with my configuration? Is this a known issue? Any suggestions?

Thanks
Simon

Frances_R_Intel
Employee

And you have mpxyd running in both cases (ofa-v2-scif0 and ofa-v2-mlx4_0-1u), right?

With ofa-v2-mlx4_0-1u, you are using the buffers for the adapter regardless of whether you are going from a coprocessor to another node or to another coprocessor on the same node. With ofa-v2-scif0, you are using RDMA buffers set up in host memory. At least, that is my understanding. In any event, you are definitely following a different path through the host.

I suspect the answer to what is going on would show up if you set log_level to 0x4 (log data operations) and/or 0x10 (log perf) in /etc/mpxyd.conf (presumably 0x14 for both), if you want to try that. And the solution will probably be modifying some of the buffer settings in that file. I will see if I can find someone who knows more about this to provide some guidance.

Frances_R_Intel
Employee

One more thing - the developers I talk to are going to want to know the host OS version, MPSS version and IMPI version. Could you let me know?

Simon_H_2
Beginner
Hi Frances,

Regarding mpxyd: I actually tried with and without mpxyd and could not see a significant difference. If I understand you correctly, you think the data path is the issue. But how can that explain that the performance gets better when I use fewer cores on the MICs?

From micinfo:
                HOST OS                  : Linux
                OS Version               : 2.6.32-358.el6.x86_64
                Driver Version           : 3.3-1
                MPSS Version             : 3.3
                Flash Version            : 2.1.02.0390
                SMC Firmware Version     : 1.16.5078
                SMC Boot Loader Version  : 1.8.4326
                uOS Version              : 2.6.38.8+mpss3.3
Intel MPI version 4.1.3.