host to mic bandwidth using MPI

Mian_L_ — Thu, 09 May 2013 08:57:49 GMT

Hi, has anyone measured the host <-> MIC bandwidth using MPI? On my machine the bandwidth is quite low (~0.4 GB/s). I simply send data from the host to the MIC card with a blocking call and measure the time. The download speed test in the SHOC benchmark reaches up to 10 GB/s. Any idea why the bandwidth is so low with MPI? Thanks a lot!

Mian_L_ — Thu, 09 May 2013 09:51:06 GMT

By the way, I downloaded a third-party benchmark: http://mvapich.cse.ohio-state.edu/benchmarks/

The result is similar to that of my own program. I suspect there is an issue in my configuration. Does anyone have ideas?

Ruibang_L_ — Thu, 09 May 2013 10:01:01 GMT

I was able to achieve ~6.5 GB/s unidirectionally and ~12 GB/s bidirectionally.

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 20.54 0.00
16777216 2 2606.99 6137.35
33554432 1 5063.06 6320.29
67108864 1 9898.54 6465.60

#-----------------------------------------------------------------------------
# Benchmarking Exchange
# #processes = 2
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 81.96 81.97 81.97 0.00
16777216 2 5644.44 5648.49 5646.47 11330.45
33554432 1 10926.96 10939.12 10933.04 11701.12
67108864 1 21586.89 21597.86 21592.38 11853.02
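
The Mbytes/sec column in the two tables above can be reproduced from the other columns. The following is a minimal sketch, not taken from the thread: it assumes (inferred from the numbers themselves) that the benchmark counts a megabyte as 2^20 bytes, that PingPong divides the message size by the reported time, and that Exchange counts four messages per sample against t_max. The function names are made up for illustration.

```python
MIB = 1 << 20  # assumption: "Mbytes" in the output means 2**20 bytes


def pingpong_mbytes_per_sec(nbytes, t_usec):
    """Throughput for one message of nbytes moved in t_usec microseconds."""
    return nbytes / MIB / (t_usec * 1e-6)


def exchange_mbytes_per_sec(nbytes, t_max_usec):
    """Throughput assuming 4 messages of nbytes per sample, timed by t_max."""
    return 4 * nbytes / MIB / (t_max_usec * 1e-6)


# (bytes, t[usec], reported Mbytes/sec) copied from the PingPong table
pingpong_rows = [
    (16777216, 2606.99, 6137.35),
    (33554432, 5063.06, 6320.29),
    (67108864, 9898.54, 6465.60),
]

# (bytes, t_max[usec], reported Mbytes/sec) copied from the Exchange table
exchange_rows = [
    (16777216, 5648.49, 11330.45),
    (33554432, 10939.12, 11701.12),
    (67108864, 21597.86, 11853.02),
]

for nbytes, t, reported in pingpong_rows:
    computed = pingpong_mbytes_per_sec(nbytes, t)
    print(f"PingPong {nbytes:>9}: computed {computed:9.2f}, reported {reported:9.2f}")

for nbytes, t, reported in exchange_rows:
    computed = exchange_mbytes_per_sec(nbytes, t)
    print(f"Exchange {nbytes:>9}: computed {computed:9.2f}, reported {reported:9.2f}")
```

Under these assumptions the computed values agree with the reported column to within rounding of the printed times, which is why the bidirectional figure is roughly twice the one-way figure.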

Mian_L_ — Thu, 09 May 2013 10:27:00 GMT

Hi Ruibang (I think I know you :-) ), would you be willing to share your benchmark? Did you use any special configuration or optimization? Thanks!!

Mian_L_ — Thu, 09 May 2013 10:36:25 GMT

I found the benchmark, thanks.

QIAOMIN_Q_ — Thu, 09 May 2013 10:48:55 GMT

Hi Mian,

Benchmarks usually underestimate the bandwidth, sometimes by about a quarter compared with what the hardware can actually deliver. To see what is really happening on your MIC processor at run time, I think you should get to know the VTune tool, which monitors processor events while your application runs and helps you tune its performance.

For your bandwidth measurement, VTune can show you the events it sampled on the memory bus, in and out, while your application was running.

1. Read bandwidth (bytes/clock) = (L2_DATA_READ_MISS_MEM_FILL + L2_DATA_WRITE_MISS_MEM_FILL + HWP_L2MISS) * 64 / CPU_CLK_UNHALTED

2. Write bandwidth (bytes/clock) = (L2_VICTIM_REQ_WITH_DATA + SNP_HITM_L2) * 64 / CPU_CLK_UNHALTED

3. Total bandwidth (GB/s) = (Read bandwidth + Write bandwidth) * freq (in GHz)

So you can easily work out the bandwidth you want from the event counts. I hope this link gives you more insight: http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding
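
As a rough illustration of the three formulas above, here is a small sketch that turns event counts into a bandwidth figure. The function names and all counts are hypothetical and chosen only to make the arithmetic easy to follow; the factor 64 is read here as the cache line size in bytes, and the clock frequency of 1.1 GHz is an assumed value, not one from the thread.

```python
CACHE_LINE_BYTES = 64  # the "* 64" factor in the formulas, read as bytes per cache line


def bytes_per_clock(event_counts, clk_unhalted):
    """Sum the given event counts and convert them to bytes per unhalted clock."""
    return sum(event_counts) * CACHE_LINE_BYTES / clk_unhalted


def total_bandwidth_gb_per_s(read_counts, write_counts, clk_unhalted, freq_ghz):
    """(read bytes/clock + write bytes/clock) * frequency in GHz gives GB/s."""
    read_bw = bytes_per_clock(read_counts, clk_unhalted)
    write_bw = bytes_per_clock(write_counts, clk_unhalted)
    return (read_bw + write_bw) * freq_ghz


# Hypothetical sampled counts, in the order the formulas list the events.
read_counts = [3_000_000, 1_000_000, 500_000]   # the three read-side events
write_counts = [1_500_000, 250_000]             # the two write-side events
clk_unhalted = 1_000_000_000                    # hypothetical CPU_CLK_UNHALTED
freq_ghz = 1.1                                  # assumed core frequency

bandwidth = total_bandwidth_gb_per_s(read_counts, write_counts, clk_unhalted, freq_ghz)
print(f"total memory bandwidth: {bandwidth:.3f} GB/s")
```

Note that this measures traffic to the card's own memory, not traffic across the link to the host.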

Mian_L_ — Fri, 10 May 2013 01:19:00 GMT

Thanks, QIAOMIN. But I want to measure the bandwidth between the host and the MIC card (over PCIe), not the memory bandwidth. Here is my output from the Intel MPI Benchmarks. Compared with Ruibang's result, the bandwidth is very low. Does anyone have suggestions? Thanks very much!

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 129.99 0.00
1 1000 139.75 0.01
2 1000 131.26 0.01
4 1000 126.89 0.03
8 1000 134.76 0.06
16 1000 129.93 0.12
32 1000 131.42 0.23
64 1000 133.16 0.46
128 1000 131.38 0.93
256 1000 131.24 1.86
512 1000 132.47 3.69
1024 1000 139.96 6.98
2048 1000 169.95 11.49
4096 1000 151.08 25.85
8192 1000 167.11 46.75
16384 1000 215.50 72.51
32768 1000 309.60 100.94
65536 640 464.70 134.50
131072 320 654.15 191.09
262144 160 1099.20 227.44
524288 80 2159.97 231.48
1048576 40 3675.76 272.05
2097152 20 7585.00 263.68
4194304 10 13317.50 300.36

#-----------------------------------------------------------------------------
# Benchmarking Exchange
# #processes = 2
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 372.48 372.55 372.52 0.00
1 1000 362.01 362.11 362.06 0.01
2 1000 377.90 377.93 377.92 0.02
4 1000 380.43 380.44 380.43 0.04
8 1000 370.37 370.40 370.38 0.08
16 1000 366.83 366.84 366.84 0.17
32 1000 380.09 380.14 380.11 0.32
64 1000 379.95 379.96 379.96 0.64
128 1000 367.15 367.37 367.26 1.33
256 1000 358.58 358.66 358.62 2.72
512 1000 385.53 385.55 385.54 5.07
1024 1000 393.08 393.11 393.10 9.94
2048 1000 401.01 401.16 401.08 19.47
4096 1000 385.83 385.88 385.85 40.49
8192 1000 412.42 412.48 412.45 75.76
16384 1000 466.70 466.78 466.74 133.90
32768 1000 595.14 595.34 595.24 209.96
65536 640 1217.73 1217.80 1217.76 205.29
131072 320 1897.78 1898.52 1898.15 263.36
262144 160 3511.84 3520.93 3516.38 284.02
524288 80 7320.48 7332.59 7326.53 272.76
1048576 40 12666.30 12708.85 12687.58 314.74
2097152 20 23141.99 23311.20 23226.59 343.18
4194304 10 48067.19 48803.71 48435.45 327.84

Ruibang_L_ — Tue, 14 May 2013 03:07:01 GMT

Hi Mian, sorry for the late reply. Yes, we should know each other from Hong Kong, via BGI.

I guess you are using TCP as the fabric between the MIC card and the host, so about 450 MB/s at most is what you get, and it is also what I got with TCP.

Installing the OFED stack from the MPSS driver package enables the direct memory access feature. I ran the benchmark with "mpiexec.hydra -genv I_MPI_FABRICS=shm:dapl -n 1 -host bio-xinyi ~/tmp/imb/imb/3.2.4/src/IMB-MPI1 -off_cache 12,64 -npmin 64 -msglog 24:28 -time 10 -mem 1 PingPong Exchange : -n 1 -host mic0 /tmp/IMB-MPI1.mic", so it's fast.

By default (which I guess is your case), it uses I_MPI_FABRICS=shm:tcp.
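
To make the structure of the launch line quoted above easier to see, here is a small sketch that assembles it from its parts: one host rank and one MIC rank separated by ":", with the fabric passed through I_MPI_FABRICS. The helper name build_imb_command is made up for illustration; every token it emits is copied from the command in this post, and the sketch only builds the string, it does not run anything.

```python
def build_imb_command(fabrics, host, host_binary, mic_host, mic_binary, benchmarks):
    """Assemble the two-part (host : MIC) mpiexec.hydra command line as a token list."""
    # IMB options exactly as they appear in the quoted command.
    imb_options = ["-off_cache", "12,64", "-npmin", "64",
                   "-msglog", "24:28", "-time", "10", "-mem", "1"]
    host_part = ["-n", "1", "-host", host, host_binary] + imb_options + list(benchmarks)
    mic_part = ["-n", "1", "-host", mic_host, mic_binary]
    return (["mpiexec.hydra", "-genv", "I_MPI_FABRICS=" + fabrics]
            + host_part + [":"] + mic_part)


# Host name, card name and binary paths are the ones from the post.
dapl_command = build_imb_command(
    fabrics="shm:dapl",
    host="bio-xinyi",
    host_binary="~/tmp/imb/imb/3.2.4/src/IMB-MPI1",
    mic_host="mic0",
    mic_binary="/tmp/IMB-MPI1.mic",
    benchmarks=["PingPong", "Exchange"],
)
print(" ".join(dapl_command))

# Same launch with the TCP fabric, for a side-by-side comparison run.
tcp_command = build_imb_command(
    fabrics="shm:tcp",
    host="bio-xinyi",
    host_binary="~/tmp/imb/imb/3.2.4/src/IMB-MPI1",
    mic_host="mic0",
    mic_binary="/tmp/IMB-MPI1.mic",
    benchmarks=["PingPong", "Exchange"],
)
print(" ".join(tcp_command))
```

Running both variants on the same machine should show directly whether the fabric accounts for the gap between the two sets of results.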

Mian_L_ — Wed, 15 May 2013 01:23:05 GMT

Hi Ruibang, thanks very much! Yes, I think you are right. The DAPL fabric is not available on our server; when I try to run your command, it outputs:

MPI startup(): dapl fabric is not available and fallback fabric is not enabled

Do you know how to install the OFED package? Is it supposed to be installed together with MPSS? If it can be installed separately, could you give me a link, please? I googled it but could not find the right information. Thanks very much.

Ruibang_L_ — Thu, 16 May 2013 06:02:23 GMT

The OFED RPMs are distributed with the MPSS driver. The installation guide is at http://registrationcenter.intel.com/irc_nas/3156/readme-en.txt

You should check whether the kernel version required by the precompiled driver matches the one on your server; if it does not, you have to recompile using rpmbuild --rebuild on the source RPMs from the driver sources. Please note that "2.6.32-358.el6.x86_64" is not the same as "2.6.32-358.6.1.el6.x86_64".
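
The kernel check can be scripted. This is a minimal sketch, assuming that an exact match of the full release string is what matters, as the two strings quoted above suggest; the helper name kernel_matches is made up, and the required string has to come from the driver's own documentation.

```python
import platform


def kernel_matches(required, running=None):
    """Return True only if the running kernel release equals the required one exactly."""
    if running is None:
        # Same release string that `uname -r` reports.
        running = platform.release()
    return running == required


# The two release strings from the post differ only by the ".6.1" update level.
required = "2.6.32-358.el6.x86_64"
print(kernel_matches(required, running="2.6.32-358.6.1.el6.x86_64"))
print(kernel_matches(required, running="2.6.32-358.el6.x86_64"))
print("this machine:", platform.release(), "matches:", kernel_matches(required))
```

A prefix or "same major version" comparison would wrongly accept the first pair, which is why the sketch compares the whole string.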

BTW, it seems that to use MPI on Xeon Phi one has to install the proprietary Intel® MPI package (the same goes for the compiler). This is not good. I'm a poor researcher who can only afford the card, lol :>
