Hi,
We have a small Linux cluster running OSCAR/CentOS 5.5: a master and 4 nodes. We compute only on the nodes. The nodes are identical: 2 hexa-core X5650 CPUs (so 2*6 = 12 cores per node) and 24 GB of RAM per node. The whole cluster is connected with InfiniBand, using the OpenFabrics driver. The Intel Cluster Toolkit is installed on the master and on all the nodes.
The problem:
- with the Intel Cluster Toolkit (Intel MPI):
-- 2 jobs of 2x8 are running, so all the nodes are busy, but there are 4 free cores per node and enough free RAM on every node.
-- I start a 2x2 job (a CFD program) and note the duration of a time step: ~2.0x10^-1 s (very good!)
-- I start the same job as 1x4 instead of 2x2. The duration of a time step is ~1.2 s (very, very bad! 6x slower)
- with OpenMPI:
-- the same 2x8 jobs are running.
-- I start the same 2x2 job: the duration of a time step is ~3.0x10^-1 s (good, but Intel MPI does better)
-- I start the same 1x4 job: the duration of a time step is ~3.0x10^-1 s (so much better than Intel MPI)
I have probably made a configuration error, but I can't find it. Does anyone have an idea? Where should I start looking?
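For clarity, NxM here means N nodes x M processes per node. The runs above are launched roughly as follows (the machinefile contents and the solver binary name are placeholders):
$ mpirun -machinefile ./two_nodes -perhost 2 -np 4 ./cfd_solver   # 2x2: 2 nodes, 2 processes each
$ mpirun -machinefile ./one_node -np 4 ./cfd_solver               # 1x4: 4 processes on one node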
Thanks a lot,
Best regards
I have tested with -env I_MPI_DEBUG 5:
For the case 1x4 it tells me:
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[0] MPI startup(): I_MPI_DEBUG=5
[1] MPI startup(): set domain to {4,5,6,7} on node n04
[2] MPI startup(): set domain to {8,9,10,11} on node n04
[0] MPI startup(): set domain to {0,1,2,3} on node n04
[0] Rank Pid Node name Pin cpu
[0] 0 23415 n04 {0,1,2,3}
[0] 1 23413 n04 {4,5,6,7}
[0] 2 23414 n04 {8,9,10,11}
So it is using shared memory, and that's good.
For the case 2x2 it reports shm and ofa data transfer, so that is good too (I set I_MPI_FABRICS=shm:ofa).
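For completeness, such a run is launched roughly like this (the solver binary name is a placeholder):
$ mpirun -genv I_MPI_FABRICS shm:ofa -env I_MPI_DEBUG 5 -np 4 ./cfd_solver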
I have tested with another program (the IMB-MPI1 benchmark provided with the Intel Cluster Toolkit).
The results are quite surprising (I do not copy/paste the whole log):
-- for the job 1x4:
#----------------------------------------------------------------
# Benchmarking Scatterv
# #processes = 4
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.11 0.12 0.11
1 1000 447.40 447.73 447.64
2 1000 0.98 0.98 0.98
4 1000 0.97 0.97 0.97
8 1000 1.08 1.08 1.08
16 1000 0.97 0.97 0.97
32 1000 0.97 0.97 0.97
64 1000 44.12 44.12 44.12
128 1000 1.09 1.09 1.09
256 1000 1.19 1.19 1.19
512 1000 1.24 1.24 1.24
1024 1000 40.89 41.59 41.42
2048 1000 1.80 1.80 1.80
4096 1000 326.15 326.97 326.76
8192 1000 1393.64 1393.65 1393.64
16384 1000 1064.20 1162.20 1100.94
32768 1000 1434.28 1443.94 1441.52
65536 640 10194.96 10227.67 10210.60
131072 320 7958.28 8018.32 7973.31
262144 160 10653.16 10797.76 10750.83
524288 80 19088.16 19260.49 19199.91
1048576 40 15891.75 16334.65 16115.53
2097152 20 29414.09 29636.35 29540.07
4194304 10 96673.39 104203.61 101227.36
-- for the 2x2 job:
#----------------------------------------------------------------
# Benchmarking Scatterv
# #processes = 4
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.14 0.14 0.14
1 1000 1.91 1.91 1.91
2 1000 1.92 1.92 1.92
4 1000 1.88 1.88 1.88
8 1000 1.91 1.91 1.91
16 1000 1.91 1.91 1.91
32 1000 1.97 1.97 1.97
64 1000 1.99 1.99 1.99
128 1000 2.21 2.21 2.21
256 1000 2.89 2.90 2.90
512 1000 3.17 3.18 3.18
1024 1000 3.79 3.79 3.79
2048 1000 5.04 5.05 5.05
4096 1000 7.42 7.44 7.43
8192 1000 14.30 14.32 14.31
16384 1000 27.08 27.11 27.09
32768 1000 57.79 57.91 57.85
65536 640 110.94 111.05 111.00
131072 320 173.11 173.54 173.37
262144 160 471.64 473.28 472.65
524288 80 885.68 893.69 890.77
1048576 40 1809.90 1836.95 1826.64
2097152 20 3471.26 3581.45 3539.99
4194304 10 6918.19 7352.30 7189.42
So we see that in the 1x4 (shm) case the timings are totally unstable and slow. I only pasted the Scatterv benchmark, but it is the same with the other tests.
The previous IMB-MPI1 tests were started in a batch job with TORQUE/MAUI, but I get exactly the same behaviour when I start mpirun directly on the node without TORQUE/MAUI.
Any ideas?
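For reference, the direct (non-batch) 1x4 run on the node is roughly this (the path to the benchmark binary is a placeholder; IMB-MPI1 takes the benchmark name as an argument, so only Scatterv is run):
$ mpirun -np 4 ./IMB-MPI1 Scatterv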
Best regards
I'm not familiar with the means for setting up OpenMPI to run in this fashion (with cores on each node busy with other jobs). However, one of the big differences in defaults between OpenMPI and Intel MPI is that the former doesn't attempt any affinity settings unless you ask for them, while the latter sets affinity by default (on Intel CPUs), as you quoted, without regard to what might already be running. Your quoted pin assignments don't agree with your assertion that the jobs are being submitted 1 per node (-perhost 1, or maybe the more recent scatter option), which also conflicts with your statement that you want shared memory. For any chance of useful results, you would have to disable affinity or, preferably, restrict each job to its own set of CPUs (better yet, to its own nodes, and let it use the standard affinity scheme).
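For example, the small job could be confined to the otherwise idle cores with something like the following (the core numbers and the application name are only an illustration; they depend on the core numbering of your nodes and on which cores the other job already occupies):
$ mpirun -genv I_MPI_PIN_PROCESSOR_LIST 8-11 -np 4 ./your_application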
Hi,
Thanks for your answer. I made a mistake when I copied/pasted the I_MPI_DEBUG 5 report: it was for a 1x3 job (3 processes on one node). I will try to disable the affinity in order to compare.
Regards
OK, I misunderstood what you meant by 2x2 and 1x4, but it does seem likely that performance suffered from multiple jobs being pinned to the same cores.
Your point about the usefulness of shared memory is well taken; as more powerful multi-core nodes are introduced without major improvements in inter-node communication, it becomes more important.
Hi,
I have installed "htop" on the master and the nodes. With htop we can see which cores are in use and which are not:
-- The 2x8 jobs are running, so on each node I have 4 free cores.
-- I start a 1x4 IMB-MPI1 job on one node (so 4 processes on one node).
-- I open htop and see that 8 cores are in use by the "big" job and only 2 cores by IMB-MPI1. So the 4 IMB-MPI1 processes are running on just 2 cores, and I have 2 cores doing nothing. I think that is the problem. Any idea how to solve it?
Thanks for your help!
It's really strange:
Let's say I have 2 nodes that are totally free, so 12 free cores on each node. I start a 1x4 job, i.e. 4 processes on just one node (mpirun -np 4 IMB-MPI1). When I open htop, I see that 4 cores are busy, one per process. So no problem! I stop the job.
Now I start the "big" 2x8 job on the nodes. When I open htop, I see that 8 cores are busy on each node. So that's correct: the big job is running and I have 4 free cores on each node. Then I decide to start the 1x4 IMB-MPI1 job, so I log onto the node and type mpirun -np 4 IMB-MPI1. When I open htop, I see that only 2 cores are busy and 2 are free. So that's not correct... strange...
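To double-check where each process is actually allowed to run, the affinity mask of a rank can also be inspected directly (<pid> stands for the process id of one of the MPI ranks):
$ taskset -p <pid>
This prints the current CPU affinity mask of that process.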
When you write "disable the affinity", do you mean I_MPI_PIN=0?
When I disable process pinning and start my 1x4 job, I get 4 busy cores, so the correct behaviour:
mpirun -genv I_MPI_PIN 0 -np 4 IMB-MPI1
Is it really risk-free to disable process pinning?
Or is there a better solution?
I tried I_MPI_PIN_DOMAIN=auto but it isn't optimal either... mpirun -genv I_MPI_PIN_DOMAIN auto -np 4 IMB-MPI1 starts on 3 cores...
But mpirun -genv I_MPI_PIN_DOMAIN node -np 4 IMB-MPI1 starts on 4 cores... so it seems good. Is there any problem with using I_MPI_PIN_DOMAIN=node?
Regards
The default (auto) options of I_MPI_PIN_DOMAIN and I_MPI_PIN_PROCS assume that all visible cores are available to your job, so they aren't suitable when running multiple jobs on the same nodes, as you found. The default action of the OS scheduler may be better, if you can't take care to set the affinity of each job to a non-overlapping group of cores.
You already take risks by running multiple MPI jobs on the same nodes; I don't see that you increase them by improving the scheduling.
Hi Guillaume,
>Is it really risk-free to disable process pinning?
Yes, it is. You can disable pinning and the operating system will place the processes on different cores. Setting I_MPI_PIN_DOMAIN to 'node' means the same thing - pinning is disabled.
Unfortunately, there is no better solution for now if you are going to run several applications at the same time on a node.
Regards!
Dmitry
OK. I played a little bit with I_MPI_PIN_DOMAIN but didn't find a better solution... so pinning is now disabled. Too bad :'( - with pinning the calculations are faster :'(
Regards,
Guillaume
You must pin all the jobs running on the same nodes so as to minimize mutual interference. It's hardly worth the trouble until platforms like Xeon EX become cost effective for cluster computing.
Wouldn't running each job on its own nodes be a better solution?
Hi,
We don't have a lot of nodes (just 4) at the moment, so when there are free cores on our nodes, we would like to use them...
I don't clearly understand the pinning at the moment:
For example, I set I_MPI_PIN_DOMAIN=core and start a 1x4 job (1 node, 4 cores). The first cores are selected and the processes are pinned to them (0,1,2,3). Just as a test, I start another 1x4 job on the same node with the same I_MPI_PIN_DOMAIN=core, and it selects cores (0,1,2,3) too... so I have 2 jobs running on the same first 4 cores while the other 8 are totally free. Why doesn't mpirun see that the first 4 cores are busy and select the next 4 instead?
Regards
Hi Guillaume,
I_MPI_PIN_DOMAIN is mainly used for hybrid applications (MPI + OpenMP), where each MPI process may create several threads. Setting I_MPI_PIN_DOMAIN to 'core' means that you create a 'domain' consisting of a single core, and the OpenMP threads of a process will be executed on that core.
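As an illustration only (the binary name and thread count are placeholders), a hybrid job on one node could be launched so that each MPI rank gets a whole socket for its OpenMP threads:
$ mpirun -genv I_MPI_PIN_DOMAIN socket -genv OMP_NUM_THREADS 6 -np 2 ./hybrid_app
Each of the 2 ranks is then pinned to one socket, and its 6 OpenMP threads inherit that mask and stay within the socket.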
It seems to me that the explanation in the Reference Manual is good enough.
The Intel MPI Library was designed to run in exclusive mode and cannot check the workload of a system. There is a feature request to add such functionality, but it has not been implemented yet.
The easiest way for you is to disable pinning. I think you'll benefit from running two (1x4) applications on the same node, which will then run in parallel. What is the performance degradation with pinning disabled (one 1x4 application)?
Regards!
Dmitry
Hi,
Yeah, I have read the Reference Manual. I don't know if I understood it correctly... but that's another point ;)
OK. I didn't know that "the Intel MPI Library was designed to run in exclusive mode and cannot check the workload of a system".
The performance degradation without pinning is:
-- with pinning, a time step of our fluid solver takes ~2x10^-1 s
-- without pinning, a time step takes ~3x10^-1 s.
So it is relatively acceptable. But with htop I see that the processes jump from one core to another. I will take a look at the taskset command... perhaps I can do the pinning myself :)
Best Regards,
Guillaume
Guillaume,
The Intel MPI Library was designed to run one MPI application at a time.
taskset will only work if pinning is disabled. You should be very careful, because the CPU numbers used by this utility follow the BIOS ordering and may differ between clusters.
I think you can run the taskset command explicitly. For instance:
$ mpiexec -perhost 4 -n 4 -env I_MPI_PIN disable taskset -c 0-3 application_name1
$ mpiexec -perhost 4 -n 4 -env I_MPI_PIN disable taskset -c 4-7 application_name2
Give it a try.
Regards!
Dmitry
Thanks! I will try it.
If I start the command explicitly, I can use I_MPI_PIN=enable with I_MPI_PIN_PROCESSOR_LIST set to the right cores, can't I?
I found an interesting link on the web:
autopin - Automated Optimization of Thread-to-Core Pinning on Multicore Systems
http://www.hipeac.net/system/files?file=carsten.pdf
I don't know where I can find this tool... but it seems interesting.
>If I start the command explicitly, I can use I_MPI_PIN=enable with I_MPI_PIN_PROCESSOR_LIST set to the right cores, can't I?
Yes, of course you can.
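For instance, by analogy with the taskset example above (the core ranges are again only an example):
$ mpiexec -perhost 4 -n 4 -env I_MPI_PIN_PROCESSOR_LIST 0-3 application_name1
$ mpiexec -perhost 4 -n 4 -env I_MPI_PIN_PROCESSOR_LIST 4-7 application_name2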
>I don't know where I can find this tool... but it seems interesting.
Yeah, the tool looks quite interesting. You probably need to get in contact with the author; I'm not sure it is available as a product.
Best wishes,
Dmitry