Hi everyone. I have a question about the Barrier collective communication. Has anyone tested Barrier in the Intel MPI Benchmarks with a large number of processes (1 process per node, on more than 100 nodes)? I would like to know the average time (usec). Is there any chance that the average time decreases at large scale?
Thank you.
Phonlawat
On one of our internal clusters, I tested with 128 nodes, 1 rank per node. Configuration:
[plain]Dual Intel® Xeon™ E5-2697 v2
8*8GB 1600MHz Reg ECC DDR3
Mellanox* MCX353A-FCAT adapter
Mellanox* MSX6025F-1BFR switch
Red Hat* Enterprise Linux* 6.4
Intel® C++ Composer XE 2013 SP1 Update 2
Intel® MPI Library 4.1 Update 3
Intel® MPI Benchmarks 3.2.4[/plain]
I don't know exactly where the nodes were in the routing.
My numbers:
[plain]2 ranks - 1.44 usec
4 ranks - 3.51 usec
8 ranks - 5.53 usec
16 ranks - 7.29 usec
32 ranks - 9.61 usec
64 ranks - 12.29 usec
128 ranks - 14.65 usec[/plain]
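For what it's worth, those numbers grow roughly linearly in log2(ranks), which is what you would expect from a tree-based barrier algorithm (one extra tree level per doubling of ranks). A quick sanity check on the figures above, just arithmetic on the posted data:

```python
# Measured MPI_Barrier averages (ranks -> usec) from the post above.
times = {2: 1.44, 4: 3.51, 8: 5.53, 16: 7.29, 32: 9.61, 64: 12.29, 128: 14.65}

# Increase per doubling of ranks, i.e. per extra level of a binary tree.
ranks = sorted(times)
increments = [times[b] - times[a] for a, b in zip(ranks, ranks[1:])]
avg = sum(increments) / len(increments)
print(f"avg increase per doubling: {avg:.2f} usec")  # prints 2.20
```

The increments stay roughly constant (about 2.2 usec per doubling), which is consistent with barrier time scaling as O(log2 N) rather than linearly in the node count.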
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
Thank you for the information. It makes me fairly confident that the average time increases reasonably with the number of processes. But suppose the results had come out like this instead:
[plain]2 ranks - 1.44 usec
4 ranks - 3.51 usec
8 ranks - 5.53 usec
16 ranks - 20 usec
32 ranks - 35 usec
64 ranks - 55 usec
128 ranks - 12 usec[/plain]
What would you think of this problem? Could something be wrong with some of the nodes?
Thank you.
Phonlawat
Interesting. How busy is your cluster? I noticed similar behavior in one run here, but only one. Our cluster was in use by others at the time (and usually is), so I'm willing to attribute that one run to the IB switches being busy with someone else's traffic. If you are getting this consistently, then we can dig further.
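To confirm the anomaly is reproducible, a loop like the following repeats the Barrier test several times (a sketch assuming Intel MPI's mpirun and the IMB-MPI1 binary are on your PATH; the hostfile name is a placeholder):

```shell
# Run the IMB Barrier benchmark 10 times, 128 ranks, 1 rank per node.
# "./hosts" is a placeholder hostfile listing one node per line.
for i in $(seq 1 10); do
    mpirun -n 128 -ppn 1 -hostfile ./hosts IMB-MPI1 Barrier
done
```

If every round shows the same spike at the same rank counts, the cause is likely in the nodes or fabric rather than transient congestion.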
Wow, same situation. I ran the Intel MPI Benchmarks with MPI_Barrier many times (about 10 rounds, all giving the same result) while others were running benchmark and scientific programs on other machines. I have two IB switches, and the nodes span both the same and different switches. I tried to measure the effect of multiple programs running simultaneously over the same FDR InfiniBand switches (see the picture attached below) and found that the InfiniBand network guarantees full bisection bandwidth, so it should isolate the effect of the other programs. That leaves me with two assumptions about this problem.
First assumption: I have run High Performance Linpack (HPL), built with the Intel compiler, on single nodes, and the results differ considerably between machines. Some machines give performance close to theoretical peak (~97% efficiency), while others perform poorly (~89% efficiency). I should try other sets of 16 and 64 nodes, but I am still not sure.
Second assumption: on some machines, a previous user's scientific program may still be running (it never stopped), or it stopped but left something wrong on those machines.
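One way to narrow down the first assumption is to collect the single-node HPL efficiencies and flag the underperforming nodes. A minimal sketch (the node names, efficiency values, and the 3-point threshold are made up for illustration, not measured data):

```python
# Hypothetical per-node HPL efficiencies (fraction of theoretical peak).
eff = {"node01": 0.97, "node02": 0.96, "node03": 0.89, "node04": 0.97}

# Use the median as the baseline so one bad node doesn't skew it.
median = sorted(eff.values())[len(eff) // 2]

# Flag nodes more than 3 points below the median as suspects.
suspects = [name for name, e in eff.items() if median - e > 0.03]
print(suspects)  # -> ['node03']
```

Any node flagged this way would be a good candidate for checking leftover processes (second assumption) before rerunning the barrier test on the remaining nodes.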