A customer is asking for more references on parallel scalability on 4-CPU machines such as the E5-4620 (4 x 8 cores). I noted on his box that the sibling logical processors under Hyper-Threading are numbered 32 apart under Linux. At least up to 8 threads, I don't run into any issue with spreading threads across consecutive cores, although there seems to be an unexpected HT speedup for 1 thread. With his application, I could get at best a 20% performance gain by using all cores under OpenMP on 2 CPUs vs. 1 CPU, which doesn't entirely surprise me. We have a 3x performance gain from 1 to 8 threads, on top of the best the production ifort can do with AVX vectorization. Needless to say, he didn't hear about this when deciding on the purchase (and maybe the sales force doesn't care to know this).
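The sibling numbering noted above can be turned into a one-thread-per-physical-core placement. This is only a minimal sketch, assuming logical CPUs 0-31 are the 32 physical cores and 32-63 are their Hyper-Threading siblings, as seen on this one box; the layout should be checked with lscpu or /proc/cpuinfo before relying on it. The helper names `ht_sibling` and `places_one_per_core` are hypothetical; OMP_PLACES and OMP_PROC_BIND are the standard OpenMP environment variables.

```python
# Sketch only: assumes logical CPUs 0..31 are distinct physical cores and
# 32..63 are their Hyper-Threading siblings, as observed on this one box.
N_CORES = 32  # 4 sockets x 8 cores (taken from the post)


def ht_sibling(cpu, n_cores=N_CORES):
    """Return the logical CPU that shares a physical core with `cpu`."""
    return (cpu + n_cores) % (2 * n_cores)


def places_one_per_core(n_threads, n_cores=N_CORES):
    """Build an OMP_PLACES value with one place per physical core,
    using consecutive cores and leaving the HT siblings idle."""
    if not 1 <= n_threads <= n_cores:
        raise ValueError("n_threads must be between 1 and n_cores")
    return ",".join("{%d}" % c for c in range(n_threads))


if __name__ == "__main__":
    # Print shell settings for an 8-thread run on cores 0..7.
    print("export OMP_NUM_THREADS=8")
    print("export OMP_PROC_BIND=close")
    print("export OMP_PLACES=" + repr(places_one_per_core(8)))
```

Listing only the low-numbered logical CPUs keeps two threads from landing on the same physical core, which makes thread-count comparisons easier to interpret.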
I pointed out that crossing between CPUs can involve a double remote memory hop, increasing latency and reducing bandwidth, and that this might be alleviated by using MPI ranks pinned to individual CPUs. Are there other likely methods for gaining parallel performance?
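One way to picture the rank-per-CPU idea is one MPI rank per socket, each running its OpenMP threads only on that socket's cores. The following is a rough sketch, assuming cores are numbered consecutively within a socket (0-7 on socket 0, 8-15 on socket 1, and so on); the post does not confirm that numbering and it differs between systems. `cores_for_rank` and `pin_current_process` are hypothetical helpers; `os.sched_setaffinity` is the Python standard-library call on Linux. In practice the MPI launcher's own pinning options would normally do this job.

```python
import os

SOCKETS = 4           # E5-4620 box from the post
CORES_PER_SOCKET = 8


def cores_for_rank(rank, sockets=SOCKETS, cores_per_socket=CORES_PER_SOCKET):
    """Logical CPUs for one rank, ASSUMING consecutive numbering per socket."""
    if not 0 <= rank < sockets:
        raise ValueError("one rank per socket: rank must be in [0, sockets)")
    first = rank * cores_per_socket
    return set(range(first, first + cores_per_socket))


def pin_current_process(rank):
    """Restrict the calling thread (and threads it creates afterwards)
    to one socket's cores. Linux only; raises OSError if those CPUs
    do not exist on this machine."""
    os.sched_setaffinity(0, cores_for_rank(rank))


if __name__ == "__main__":
    # Show the assumed rank-to-core mapping without changing any affinity.
    for r in range(SOCKETS):
        print(r, sorted(cores_for_rank(r)))
```

With each rank confined to one socket, memory it touches first is likely to be allocated locally, which is the point of the suggestion above.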
We found that ifort 16.0.3 needs -fno-inline-functions to avoid hidden race conditions (which are flagged indirectly by turning on subscript range checking). The application passes the range check (and incurs none of the 16.0 crashes) with all flags on with the 15.0 and 17.0 compilers. Needless to say, I used Inspector to fix the OpenMP code to its satisfaction.
On a Westmere 2x6-core Linux box (HT disabled) it scaled well to 8 threads, but no further. None of my usual hypotheses about pinning to favorite cores (there are 2 varieties of cache connection) made any difference. In a back-to-back Windows vs. Linux comparison on a single Nehalem CPU, Linux gave a 25% performance increase, which I attribute to Transparent Huge Pages (THP). Haswell page prefetch seems to reduce but not eliminate the importance of THP.
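Since THP is the suspected cause of the Linux advantage, it may be worth confirming its mode on each test box. Here is a small sketch that parses the kernel's sysfs entry, where the active mode is the bracketed word (for example `always [madvise] never`); `parse_thp_mode` and `current_thp_mode` are hypothetical helper names.

```python
THP_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def parse_thp_mode(text):
    """Return the bracketed (active) mode from the sysfs line, or None."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def current_thp_mode(path=THP_PATH):
    """Read the active THP mode; returns None if the file is unavailable."""
    try:
        with open(path) as f:
            return parse_thp_mode(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    print("THP mode:", current_thp_mode())
```

Recording this value alongside each timing makes it easier to tell whether a difference between boxes comes from THP or from something else.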