Job distribution problem

combe · ‎04-25-2008

Dear all,

We are currently setting up a 8 nodes SGI altix (xe 310) cluster using intel compilers and librairies. Each node contains 2 quad-core processors intel Xeon.

We use a Linux SUZE 2.6 distribution, C intel compiler version 10.01.13, intel mpi librairies.

We correctly suceed in compiling a C++ code (already tested on many architectures) and linking it with the intel MPI librairies. The job correctly run on one node : each MPI process is correctly distributed on the available CPU. However when running two parallel jobs, for instance, two jobs (job1 and job2) using 4 CPUs, the two jobs only distribute on the first 4 CPUs : which means that on a 8 CPU node, only 4 CPUs works at 100%, and 4 do nothing. And each working CPUs spend 50% of its time for job1 and 50% for job2.
Here is a copy of the output of the top command :

Tasks: 233 total, 10 running, 212 sleeping, 0 stopped, 11 zombie
Cpu0 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 99.3%us, 0.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si,
0.0%st
Cpu2 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si,
0.0%st
Cpu3 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si,
0.0%st
Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si,
0.0%st
Cpu5 : 0.0%us, 0.3%sy, 0.0%ni, 97.0%id, 2.7%wa, 0.0%hi, 0.0%si,
0.0%st
Cpu6 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si,
0.0%st
Cpu7 : 0.3%us, 2.0%sy, 0.0%ni, 95.0%id, 2.3%wa, 0.0%hi, 0.3%si,
0.0%st
Mem: 16425276k total, 16333376k used, 91900k free, 442112k buffers
Swap: 4200988k total, 20k used, 4200968k free, 14319964k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
30845 combe 25 0 31040 8672 3916 R 51 0.1 0:13.78
lmp_job2
30698 combe 25 0 30976 8816 4056 R 51 0.1 0:55.35
lmp_job1
30699 combe 25 0 30992 8896 4112 R 51 0.1 0:39.45
lmp_job1
30842 combe 25 0 31044 8604 3840 R 50 0.1 0:13.68
lmp_job2
30697 combe ; 25 0 30976 8616 3840 R 50 0.1 0:39.33
lmp_job1
30844 combe 25 0 31064 8896 4112 R 49 0.1 0:13.77
lmp_job2
30700 combe 25 0 30980 8692 3916 R 49 0.1 0:55.33
lmp_job1
30843 combe 25 0 31044 8816 4056 R 49 0.1 0:13.75
lmp_job2

I am really confused about this problem and I would be very greatful if someone could help me to solve it.

Thank you very much in advance.

N. Combe

TimP · ‎04-25-2008

Intel MPI option -perhost is the simplest way to distribute a job without using all cores. On a dual quad core node, -perhost 8 is the default (except for a bug in the first Intel MPI 3.1 release). -perhost 4 will assign 4 ranks to the first node, 4 to the second, ....The current release (3.1.038) has improved defaults in these modes of operation. You may have chosen an appropriate method here; your description is confusing.
If you are using a scheduler to start 2 independent jobs on a cluster, there should be an exclusivity option to lock out other jobs from sharing the nodes assigned to a job, even though the job doesn't use all the cores on its nodes. If you are starting the jobs without a scheduler, you would specify separate groups of nodes in your mpirun or mpiexec command, or start from separate accounts, with machine files edited to comment out nodes used by other accounts. You would restart mpd on an account after changing the machine file (mpirun will start its own mpd, if mpdallexit has been used when necessary to terminate an mpd for that account).
You may be confused by the BIOS numbering of the cores on Xeon platforms. The numbering is arranged so that the first 4 cores, according to BIOS numbering, are distributed so that one core is in use on each L2 cache on the node (2 cores from each socket). /proc/cpuinfo will show part of the story; if your system has implemented /usr/sbin/irqbalance -debug, you can check that the first 4 are spread across 4 cache.
This does imply that a taskset assignment of a job to the first 4 logical cores may optimize performance on Xeon dual quad core, but not on other brands. But this may not be of interest to you, if the question is simply how to stop independent jobs from sharing nodes.
I think you may be referring to cores as CPUs; various viewpoints on terminology conflict here.

combe · ‎04-28-2008

Thank you very much Tim,

Indeed, my description is may be confusing.

We are actuallyplanning to use a sheduler (PBS pro)but our first trials have failed for the same reason : The sheduler considers 8 core per node and starts jobsonone nodeuntil the node is full. Thus two 4 core jobs start on the same node and use the same cores of the nodes.

The solution you proposed that lock out a job from sharing the nodes assigned to a job is actually possible, but, waste to my point of viewa lot of computer facilities (especially if I plan to send sequential jobs on a cluster: each sequential job would monopolize one full node !!). I guess I have missed something.

Since our trials with the sheduler failed, we started to start job interactively.

I am very astonished that if I send interactively a 8core job (mpirun -np 8...) on one node, all of the cores work at 100% but that if I send two 4 core jobs (mpirun -np4 job1; mpirun -np 4 job2)without any caution, only 50% of the nodefacilities are used. Do you mean there is no solution to this problem except tousetwo nodes(working at 50%)?

My question is not how "how to stop independent jobs from sharing nodes" but how to stop independent jobs from sharing cores on the same node ?

Moreover, I am not sure to fully understand your answer (and unfortunatly, I am not able to make the test now), but would using two different user accounts for sending interactively independent jobs on the SAME nodesolvethe problem ?

user1@node1 > mpirun -np4 job1

user2@node1 > mpirun -np4 job2

In this case, would job1 and job2 use the same cores or not ?

Thank you very for your help

Nicolas

grigoryzagorodnev · ‎04-28-2008

Hi Nicolas,

Yes, you are right. By default two jobs started on the same node (set of nodes) would have identical process layout even undersubscribed. That leads to jobs sharing same set of processor cores and half of cores standing idle - MPI processes are 'pinned' to the processor cores for performance benefit reasons.

To move apart process of two different jobs on the same node, you'd need to manually set pinning strategy for job running.

Intel MPI Library 3.1.038 provides advanced facilities for intelligent process pining described in section 3.2.2 of Library Reference Manual. The major instrument is I_MPI_PIN_PROCESSOR_LIST variable that defines pining strategy.

For particular case of two 4-rank jobs running on the same 8-core node, pinning strategy would be "allcores:grain=sock,offset=4" meaning "count all processor cores within processor socket boundaries with offset of 4 cores". Respecting command line may be like this:

> mpirun -genv I_MPI_PIN_PROCESSOR_LIST allcores:grain=sock,offset=0 -np 4 job1

> mpirun -genv I_MPI_PIN_PROCESSOR_LIST allcores:grain=sock,offset=4 -np 4 job2

To choose best pinning strategy for your application basing on processor cache sharing, you can use 'cpuinfo' utility shipping with Intel MPI Library and variety of intelligent pinning strategies built in the Library.

You may control actual process pining enabling level 3 of library debug messaging, for example:

> mpirun -genv I_MPI_DEBUG 2 -np 4 job1 | grep CPU

Best regards,

- Grigory

PS: as process pinning interface for older releases differs from the one described, please let me know the version you use so I can provide respective update.

PPS: we do recommend Intel MPI Library build 3.1.038 as it offers higher application performance then ever.

combe · ‎04-29-2008

Hi Gregory,

Thank you very much.
I now understand how to manually pin jobs to cores.

My next question is then : can schedulers automatically manage this pinning process to optimize the load on each core and thus on each node ? Or is it useless to try this optimization ?

Thank you very much
Best regards.

Nicolas

TimP · ‎04-29-2008

When you run a 4 process job on an 8 core node, taking the favored pinning layout, you are consuming 70% or more of the resources of that node, so there is fairly little incentive to schedule another job on the remaining cores. A better strategy for 2 separate jobs might be to pin each to its own socket, so as to even out the distribution of resources, and running each job with up to 70% of the performance it would have running by itself with the best layout. You would likely need more memory per node to make such strategies work, and I'm not aware of them being carried out successfully.
I suppose that PBS, with non-exclusive scheduling, would be more successful in assigning 2 jobs per node, when the platform defaults to a pinning strategy equivalent to the one Grigory recommended, without requiring PBS to insert an offset to avoid conflict with another job scheduled on the same node.

grigoryzagorodnev · ‎04-29-2008

Nicolas,

As far as I know, PBS-like job managers can't schedule for cores. They define node count and processes per nodes, but not cores per processes.

In general, MPD daemons can do this cores load balancing for two and more jobs running same node. We'd take this as feature request for Inetl MPI Library.

Thank you!

- Grigory

grenfell · ‎06-13-2008

Hello,

I am having the same problem. You say:

"The solution you proposed that lock out a job from sharing the nodes assigned to a job is actually possible, but, waste to my point of view".

Could you tell me how to do that?

Thanks, Grenfell