the others (subset b) are stuck in the MPI_Barrier at the end of the domain decomposition. This performs well (the domain decomposition completes within seconds) with MVAPICH on our new Intel Xeon machine and on another machine with IBM BlueGene/Q hardware.
Unfortunately, with Intel MPI on the same machine that breezes through with MVAPICH, the same code performs significantly worse: it hangs at this stage of the domain decomposition for approximately 10 minutes. The hardware in question is equipped with Mellanox ConnectX-3 IB HCAs, and we run RHEL 6.