We are currently setting up an 8-node SGI Altix XE 310 cluster using Intel compilers and libraries. Each node contains two quad-core Intel Xeon processors.
We use a SUSE Linux 2.6 distribution, Intel C compiler version 10.01.13, and the Intel MPI libraries.
We successfully compiled a C++ code (already tested on many architectures) and linked it with the Intel MPI libraries. A job runs correctly on one node: the MPI processes are distributed over the available CPUs. However, when we run two parallel jobs at once, for instance two jobs (job1 and job2) using 4 CPUs each, both jobs are placed on the first 4 CPUs. On an 8-CPU node, only 4 CPUs work at 100% while the other 4 do nothing, and each working CPU spends 50% of its time on job1 and 50% on job2.
Here is a copy of the output of the top command:
Tasks: 233 total, 10 running, 212 sleeping, 0 stopped, 11 zombie
Cpu0 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 99.3%us, 0.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si,
Cpu2 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si,
Cpu3 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si,
Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si,
Cpu5 : 0.0%us, 0.3%sy, 0.0%ni, 97.0%id, 2.7%wa, 0.0%hi, 0.0%si,
Cpu6 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si,
Cpu7 : 0.3%us, 2.0%sy, 0.0%ni, 95.0%id, 2.3%wa, 0.0%hi, 0.3%si,
Mem: 16425276k total, 16333376k used, 91900k free, 442112k buffers
Swap: 4200988k total, 20k used, 4200968k free, 14319964k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
30845 combe 25 0 31040 8672 3916 R 51 0.1 0:13.78
30698 combe 25 0 30976 8816 4056 R 51 0.1 0:55.35
30699 combe 25 0 30992 8896 4112 R 51 0.1 0:39.45
30842 combe 25 0 31044 8604 3840 R 50 0.1 0:13.68
30697 combe 25 0 30976 8616 3840 R 50 0.1 0:39.33
30844 combe 25 0 31064 8896 4112 R 49 0.1 0:13.77
30700 combe 25 0 30980 8692 3916 R 49 0.1 0:55.33
30843 combe 25 0 31044 8816 4056 R 49 0.1 0:13.75
I am really confused by this problem and would be very grateful if someone could help me solve it.
Thank you very much in advance.
If you are using a scheduler to start 2 independent jobs on a cluster, there should be an exclusivity option to lock out other jobs from sharing the nodes assigned to a job, even though the job doesn't use all the cores on its nodes. If you are starting the jobs without a scheduler, you would specify separate groups of nodes in your mpirun or mpiexec command, or start from separate accounts, with machine files edited to comment out nodes used by other accounts. You would restart mpd on an account after changing the machine file (mpirun will start its own mpd, if mpdallexit has been used when necessary to terminate an mpd for that account).
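As a rough illustration of the "separate groups of nodes" idea, the sketch below splits a node list into disjoint groups, one per job, and prints what each machine file might contain. The hostnames `node1` to `node8` and the file names `machines.jobN` are hypothetical; substitute the names used on your cluster.

```python
def split_nodes(nodes, n_jobs):
    """Split a list of hostnames into n_jobs disjoint, contiguous groups."""
    if n_jobs <= 0 or n_jobs > len(nodes):
        raise ValueError("need between 1 and len(nodes) jobs")
    base, extra = divmod(len(nodes), n_jobs)
    groups, start = [], 0
    for j in range(n_jobs):
        # The first `extra` groups get one additional node.
        size = base + (1 if j < extra else 0)
        groups.append(nodes[start:start + size])
        start += size
    return groups


# Hypothetical hostnames for an 8-node cluster.
nodes = [f"node{i}" for i in range(1, 9)]
groups = split_nodes(nodes, 2)

for j, group in enumerate(groups, start=1):
    print(f"# contents of machines.job{j} (hypothetical file name)")
    print("\n".join(group))
```

Each job would then be started with its own machine file, so that no node appears in more than one group.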
You may be confused by the BIOS numbering of the cores on Xeon platforms. The numbering is arranged so that the first 4 cores, according to BIOS numbering, are spread so that one core is in use on each L2 cache on the node (2 cores from each socket). /proc/cpuinfo will show part of the story; if your system has implemented /usr/sbin/irqbalance -debug, you can check that the first 4 are spread across 4 caches.
This does imply that a taskset assignment of a job to the first 4 logical cores may optimize performance on Xeon dual quad core, but not on other brands. But this may not be of interest to you, if the question is simply how to stop independent jobs from sharing nodes.
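To see which logical processors sit on which socket, one option is to group the `processor` entries of /proc/cpuinfo by their `physical id` field. The following is a minimal sketch, assuming the x86 Linux layout of /proc/cpuinfo; the `SAMPLE` text is made up for illustration and is not taken from the poster's machine. As noted above, this shows only part of the story: it does not reveal which cores share an L2 cache.

```python
from collections import defaultdict


def cores_by_socket(cpuinfo_text):
    """Map each socket ('physical id') to its logical processor numbers."""
    sockets = defaultdict(list)
    proc = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            proc = int(value)
        elif key == "physical id" and proc is not None:
            sockets[int(value)].append(proc)
    return dict(sockets)


# Made-up excerpt: logical processors alternate between two sockets.
SAMPLE = """\
processor   : 0
physical id : 0

processor   : 1
physical id : 1

processor   : 2
physical id : 0

processor   : 3
physical id : 1
"""

print(cores_by_socket(SAMPLE))
```

On a real node you would pass the contents of /proc/cpuinfo instead of `SAMPLE`, for example `cores_by_socket(open("/proc/cpuinfo").read())`.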
I think you may be referring to cores as CPUs; various viewpoints on terminology conflict here.
Yes, you are right. By default, two jobs started on the same node (or set of nodes) get an identical process layout, even when the node is undersubscribed. As a result the jobs share the same set of processor cores and half of the cores stand idle, because MPI processes are 'pinned' to processor cores for performance reasons.
To separate the processes of two different jobs on the same node, you need to set the pinning strategy for each job manually.
Intel MPI Library 3.1.038 provides advanced facilities for intelligent process pinning, described in section 3.2.2 of the Library Reference Manual. The main instrument is the I_MPI_PIN_PROCESSOR_LIST variable, which defines the pinning strategy.
For the particular case of two 4-rank jobs running on the same 8-core node, the pinning strategy for the second job would be "allcores:grain=sock,offset=4", meaning "count all processor cores within processor socket boundaries with an offset of 4 cores". The corresponding command lines might look like this:
> mpirun -genv I_MPI_PIN_PROCESSOR_LIST allcores:grain=sock,offset=0 -np 4 job1
> mpirun -genv I_MPI_PIN_PROCESSOR_LIST allcores:grain=sock,offset=4 -np 4 job2
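If more than two jobs share a node, the same pattern can be generated rather than typed by hand. The helper below is only a sketch: it assumes every job uses the same number of ranks and that the offset for job i is simply i times that number, which generalizes the two-job example above. The `allcores:grain=sock,offset=N` syntax is copied from those command lines and may not apply to other Intel MPI releases.

```python
def pinned_commands(jobs, ranks_per_job, cores_per_node=8):
    """Build one mpirun command line per job, each with its own core offset."""
    if len(jobs) * ranks_per_job > cores_per_node:
        raise ValueError("jobs would oversubscribe the node")
    commands = []
    for i, job in enumerate(jobs):
        # Assumption: job i starts at core offset i * ranks_per_job.
        offset = i * ranks_per_job
        commands.append(
            "mpirun -genv I_MPI_PIN_PROCESSOR_LIST "
            f"allcores:grain=sock,offset={offset} -np {ranks_per_job} {job}"
        )
    return commands


for cmd in pinned_commands(["job1", "job2"], ranks_per_job=4):
    print(cmd)
```

For two 4-rank jobs this reproduces the two command lines shown above.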
To choose the best pinning strategy for your application based on processor cache sharing, you can use the 'cpuinfo' utility shipped with Intel MPI Library and the variety of intelligent pinning strategies built into the Library.
You can check the actual process pinning by enabling the library's debug messages, for example:
> mpirun -genv I_MPI_DEBUG 2 -np 4 job1 | grep CPU
PS: the process pinning interface in older releases differs from the one described here, so please let me know which version you use and I can provide a corresponding update.
PPS: we do recommend Intel MPI Library build 3.1.038, as it offers higher application performance than ever.
Thank you very much.
I now understand how to manually pin jobs to cores.
My next question is: can schedulers manage this pinning automatically, to optimize the load on each core and thus on each node? Or is it useless to attempt this optimization?
Thank you very much
I suppose that PBS, with non-exclusive scheduling, would be more successful in assigning 2 jobs per node, when the platform defaults to a pinning strategy equivalent to the one Grigory recommended, without requiring PBS to insert an offset to avoid conflict with another job scheduled on the same node.
As far as I know, PBS-like job managers can't schedule by core. They define the node count and the processes per node, but not the cores per process.
In principle, the MPD daemons could do this core load balancing for two or more jobs running on the same node. We'll take this as a feature request for Intel MPI Library.
I am having the same problem. You say:
"The solution you proposed that lock out a job from sharing the nodes assigned to a job is actually possible, but, waste to my point of view".
Could you tell me how to do that?