I am experiencing a severe performance loss when using multiple rails with Intel MPI 5.0 on the KNC (Xeon Phi coprocessor) with an mlx5 adapter (which has two ports). With Intel MPI 4.1 the performance was much better.
Let me give an example of the performance of our application (per KNC):
- Intel MPI 4.1, single-rail (I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx5_0-1u): 220 Gflop/s
- Intel MPI 4.1, dual-rail (-IB I_MPI_OFA_ADAPTER_NAME=mlx5_0 I_MPI_OFA_NUM_PORTS=2): 270 Gflop/s
- Intel MPI 5.0, single-rail (I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx5_0-1u): 220 Gflop/s
- Intel MPI 5.0, dual-rail (-IB I_MPI_OFA_ADAPTER_NAME=mlx5_0 I_MPI_OFA_NUM_PORTS=2): 150 Gflop/s
- Intel MPI 5.0, single-rail (-IB I_MPI_OFA_ADAPTER_NAME=mlx5_0 I_MPI_OFA_NUM_PORTS=1): 150 Gflop/s
With DAPL the performance is unchanged between versions, but apparently there is no way to use DAPL with dual-rail support. With OFA I got the best performance in v4.1, but with v5.0 it is extremely low; in particular, it is the same whether 1 or 2 ports are used.
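For reference, the runs above were launched roughly along these lines (a sketch reconstructed from the variables listed; `./app`, the rank count, and host names are placeholders):

```shell
# Single-rail over DAPL: pin the provider explicitly to the mlx5 device
# (ofa-v2-mlx5_0-1u is the user-space DAPL provider for port 1 of mlx5_0).
mpirun -n 16 -genv I_MPI_DAPL_PROVIDER_LIST ofa-v2-mlx5_0-1u ./app

# Dual-rail over OFA: select the OFA fabric (-IB) and use both ports
# of the mlx5_0 adapter.
mpirun -n 16 -IB \
       -genv I_MPI_OFA_ADAPTER_NAME mlx5_0 \
       -genv I_MPI_OFA_NUM_PORTS 2 \
       ./app
```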
Is there anything I am overlooking in the documentation?
Thanks for your reply. To reproduce, you can use, for example, the OSU bandwidth benchmark: http://mvapich.cse.ohio-state.edu/benchmarks/. My original tests were done on the KNC, but the same problem shows up on the Xeon (Haswell) host.
You can see the results in the attached figure: for message sizes of roughly 100 kB and above, Intel MPI 4.1 with dual rail is by far the best (blue solid squares), while Intel MPI 5.0 is much worse.
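For completeness, the reproduction steps look roughly like this (a sketch; the tarball name, host names, and build paths are placeholders):

```shell
# Build the OSU micro-benchmarks against Intel MPI.
tar xzf osu-micro-benchmarks.tar.gz && cd osu-micro-benchmarks
./configure CC=mpiicc && make

# Point-to-point bandwidth between two nodes, one rank per node,
# using the dual-rail OFA settings from the original report.
mpirun -hosts node1,node2 -n 2 -ppn 1 -IB \
       -genv I_MPI_OFA_ADAPTER_NAME mlx5_0 \
       -genv I_MPI_OFA_NUM_PORTS 2 \
       ./mpi/pt2pt/osu_bw
```

Comparing the reported bandwidth for message sizes above ~100 kB across the four configurations should show the regression directly.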
Could you please specify the exact versions of the Intel MPI Library (4.x, 5.x) and of the OS, MPSS, OFED, and DAPL?
Also, could you please describe the test scenarios you used: which compute nodes were involved in each run (MPI ranks only on the HOST, only on the KNC, or on both HOST and KNC)?
Regarding DAPL: try running the same scenarios with the default DAPL provider (i.e., without I_MPI_DAPL_PROVIDER_LIST).
I used two scenarios; the issue shows up in both cases:
- HOST <-> HOST
- KNC <-> KNC
- Intel MPI 4.1.3.045 and 5.0.2.044
- OS is Linux (CentOS)
- OFED 3.5.2
- DAPL 2.1.2
- MPSS 3.3.3 (I guess this is irrelevant, since the issue also shows up when only the hosts are involved)
I believe I tried running without I_MPI_DAPL_PROVIDER_LIST in the past, but Intel MPI defaulted to an mlx4 device (which does not exist on our system) and would not use the mlx5 device, so setting I_MPI_DAPL_PROVIDER_LIST was mandatory. I will try again.
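To check which devices and DAPL providers a node actually exposes, and which fabric Intel MPI ends up selecting, something like the following can help (`/etc/dat.conf` is the usual provider registry location but may differ on your system; `./app` is a placeholder):

```shell
# List the RDMA devices present on the node (mlx4 vs. mlx5).
ibv_devinfo | grep hca_id

# List the DAPL provider names configured on this node; without
# I_MPI_DAPL_PROVIDER_LIST, Intel MPI picks from these entries.
grep -v '^#' /etc/dat.conf | awk '{print $1}'

# Run with debug output to see which provider/fabric was chosen.
mpirun -n 2 -genv I_MPI_DEBUG 2 ./app
```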
Remark: should this topic be moved to the general forum, since we now know that it is not MIC-specific?
Loc and I have been communicating internally about this since you initially submitted it.
Just as an FYI, I'm moving this issue over to the regular Intel® Clusters and HPC Technology forum since it's not Phi-specific. That way I can keep track of the internal bug I submitted and update you on current status.
We've made several fixes to the Intel MPI Library with regard to multi-rail support. Have you tried the latest Intel MPI 5.1.2, and if so, did you see better performance?