How are applications run on hyper-threading enabled multi-core machines?
I'm trying to gain a better understanding of how hyper-threading-enabled multi-core processors work. Let's say I have an app which can be compiled with MPI, OpenMP, or MPI+OpenMP. I wonder how it will be scheduled on a CentOS 5.3 box with four Xeon X7560 @ 2.27GHz processors, each core with Hyper-Threading enabled.
The processors are numbered from 0 to 63 in /proc/cpuinfo. As I understand it, there are FOUR 8-core physical processors, so 32 PHYSICAL cores in total; with Hyper-Threading enabled on every core, that gives 64 LOGICAL processors.
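One way to verify that numbering is to read the sysfs topology files (a sketch; it assumes the 2.6-kernel sysfs layout, which CentOS 5.3's kernel provides):

```shell
# show_topology: for each logical CPU, print its physical socket, its
# core id, and the list of HT siblings sharing that core.
# Sketch: assumes /sys/devices/system/cpu/cpuN/topology/ exists (2.6 kernels).
show_topology() {
  for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    n=${cpu##*cpu}
    echo "cpu$n socket=$(cat "$cpu/topology/physical_package_id" 2>/dev/null)" \
         "core=$(cat "$cpu/topology/core_id" 2>/dev/null)" \
         "siblings=$(cat "$cpu/topology/thread_siblings_list" 2>/dev/null)"
  done
}
show_topology
```

Two logical CPUs reporting the same socket and core id are HT siblings of one physical core.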
1. Compiled with MPICH2: how many physical cores will be used if I run with mpirun -np 16? Does the job get divided up among 16 PHYSICAL cores, or 16 LOGICAL processors (8 PHYSICAL cores using hyper-threading)?
2. Compiled with OpenMP: how many physical cores will be used if I set OMP_NUM_THREADS=16? Will it use 16 LOGICAL processors?
3. Compiled with MPICH2+OpenMP: how many physical cores will be used if I set OMP_NUM_THREADS=16 and run with mpirun -np 16?
4. Compiled with OpenMPI
OpenMPI has two runtime options: -cpu-set, which specifies the logical CPUs allocated to the job, and -cpu-per-proc, which specifies the number of CPUs to use for each process.
If run with mpirun -np 16 -cpu-set 0-15, will it use only 8 PHYSICAL cores? If run with mpirun -np 16 -cpu-set 0-31 -cpu-per-proc 2, how will it be scheduled?
You'll have to read up on the treatment each implementation of OpenMP or MPI gives to these cases. In my limited experience with MPICH2, it has no way of its own to deal with affinity or to show its view of your core and HT layout, though perhaps this has changed. You could run /sbin/irqbalance -debug to verify the numbering of cores and logical processors. You would need to figure out which logical processors belong to each physical CPU, and which are siblings, so as to assign the ranks that communicate most heavily to the same CPU and spread the assignments out across physical cores. taskset used to be the tool for this.
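As an illustration of the taskset approach (a sketch only; the CPU numbers assume a layout where 0-3 are distinct physical cores, which must be verified on the actual box):

```shell
# Run a command pinned to logical CPUs 0-3; `sh -c ...` stands in for
# the rank you would really launch, and the inner `taskset -p $$`
# prints back the affinity mask the child actually received.
taskset -c 0-3 sh -c 'taskset -p $$'
```

Pinning each rank this way keeps the scheduler from migrating it onto an HT sibling of another busy rank.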
On the Xeon 7560 systems I have seen, the primary number for each core stays the same when HT is disabled, while the secondary number for its HT sibling disappears. This is a different scheme from the one normally used on quad-core CPUs.
OpenMP hasn't standardized a way of setting affinity. libgomp should recognize a combination such as export OMP_NUM_THREADS=16 with export GOMP_CPU_AFFINITY=, while Intel libiomp5 gives you additional options, including a verbose option to show its interpretation of your affinity setting. The OpenMP jobs I have tested didn't scale beyond 16 cores (2 of the 4 CPUs) on the Xeon 7560. So you would run separate jobs simultaneously, each assigned to a different CPU (or combine MPI FUNNELED with OpenMP). Of course, if your OpenMP programming is perfect, or the application is embarrassingly parallel, you might expect to do better.
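For example, confining a 16-thread libgomp job to one full socket might look like the following (a sketch: it assumes logical CPUs 0-7 are the cores of socket 0 and 32-39 their HT siblings; the actual numbering must be checked in /proc/cpuinfo first):

```shell
# Sketch: 16 OpenMP threads on the 16 logical processors of socket 0.
# libgomp binds thread i to the i-th entry of the GOMP_CPU_AFFINITY list.
export OMP_NUM_THREADS=16
export GOMP_CPU_AFFINITY="0-7 32-39"
# then launch the OpenMP binary from this shell
```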
With a combination of MPICH2 and OpenMP, you would have to give each rank the appropriate OpenMP affinity mask for efficient assignment of cores within a single CPU, and figure out how to place each rank on a different group of cores. In principle, the current Intel MPI supports automatic placement in concert with Intel OpenMP, but I haven't heard of much practical usage on this platform.
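A minimal sketch of the per-rank half of this, assuming MPICH2 exports the rank number in PMI_RANK and that libgomp is the OpenMP runtime (both are assumptions; the variable names differ across implementations):

```shell
# set_rank_affinity RANK CORES_PER_RANK
# Give one MPI rank a contiguous block of logical CPUs and export it as
# the libgomp affinity mask. Assumes ranks own disjoint contiguous
# ranges; a sibling-aware layout would need an explicit CPU list instead.
set_rank_affinity() {
  rank=$1
  cores_per_rank=$2
  first=$(( rank * cores_per_rank ))
  last=$(( first + cores_per_rank - 1 ))
  export GOMP_CPU_AFFINITY="${first}-${last}"
  export OMP_NUM_THREADS="$cores_per_rank"
}

# e.g. rank 2 with 8 logical CPUs per rank gets CPUs 16-23
set_rank_affinity 2 8
echo "$GOMP_CPU_AFFINITY"   # prints 16-23
```

A launcher wrapper would call set_rank_affinity "$PMI_RANK" with the chosen width and then exec the real binary, so each rank lands on its own core group.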
Current OpenMPI, like Intel MPI, provides a separate tool to display its view of your platform's topology, and a canned (optional) way of setting affinity which is fairly effective on a cluster with HT disabled. Judging by their mailing list, they are only beginning to consider specific support for HT, and I doubt that Xeon 7560-specific handling will be implemented in OpenMPI.