We have an application whose domain decomposition ends with essentially two sequences of actions:
One set of tasks (subset a) calls
  call mpi_win_lock(some_rank_from_subset_b)
  call mpi_get(some_rank_from_subset_b)
  call mpi_win_unlock(some_rank_from_subset_b)
The others (subset b) wait in the MPI_Barrier at the end of the domain decomposition. This performs well (the domain decomposition completes within seconds) with MVAPICH on our new Intel Xeon machine and on another machine with IBM BlueGene/Q hardware.
Unfortunately, with Intel MPI on the same machine that breezes through with MVAPICH, the same code performs significantly worse: it hangs at this stage of the domain decomposition for approximately 10 minutes. The hardware in question is equipped with Mellanox ConnectX-3 IB HCAs. We run RHEL6.
How can I improve passive target performance?