Hi, does anyone have results from using MPI to test the host<->MIC bandwidth? I tried it on my machine and the bandwidth is quite low (~0.4 GB/s). I just send data from the host to the MIC card using a blocking send and measure the time. The download-speed test in the SHOC benchmark can reach up to 10 GB/s. Any idea about the low bandwidth with MPI? Thanks a lot!
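For reference, here is a minimal sketch of the kind of blocking-send timing loop described above. This is my reconstruction under assumptions, not the original code: the 64 MiB message size, the repetition count, and the host/mic0 launch line are made up for illustration.

/* Minimal host<->MIC MPI bandwidth sketch (a reconstruction, not the poster's code).
 * Assumed launch: one rank on the host, one on the card, e.g.
 *   mpiexec.hydra -n 1 -host <host> ./bw : -n 1 -host mic0 /tmp/bw.mic
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const size_t nbytes = 64u * 1024 * 1024;   /* 64 MiB per message (assumed) */
    const int reps = 20;                       /* number of timed sends (assumed) */
    int rank, i;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);               /* start both sides together */
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0)                          /* host rank pushes data ... */
            MPI_Send(buf, (int)nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else                                    /* ... MIC rank receives it */
            MPI_Recv(buf, (int)nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("uni-directional bandwidth: %.2f MB/s\n",
               (double)nbytes * reps / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}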
BTW, I downloaded a third-party benchmark: http://mvapich.cse.ohio-state.edu/benchmarks/
The results are similar to my program's. I suspect there is an issue with my configuration; does anyone have ideas?
I was able to achieve ~6.5 GB/s uni-directionally and ~12 GB/s bi-directionally.
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 20.54 0.00
16777216 2 2606.99 6137.35
33554432 1 5063.06 6320.29
67108864 1 9898.54 6465.60
#-----------------------------------------------------------------------------
# Benchmarking Exchange
# #processes = 2
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 81.96 81.97 81.97 0.00
16777216 2 5644.44 5648.49 5646.47 11330.45
33554432 1 10926.96 10939.12 10933.04 11701.12
67108864 1 21586.89 21597.86 21592.38 11853.02
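As a quick check on how to read the PingPong numbers above (this is my understanding of the IMB output format, namely that t[usec] is half the measured round trip and that one Mbyte here means 2^20 bytes), the last PingPong row reproduces its own Mbytes/sec column:

/* Recompute the last PingPong row above from its bytes and t[usec] columns.
 * Assumes IMB's conventions: 1 Mbyte = 2^20 bytes, t[usec] = half round trip. */
#include <stdio.h>

int main(void)
{
    double bytes  = 67108864.0;   /* message size from the table */
    double t_usec = 9898.54;      /* one-way time from the table */
    double mbytes_per_sec = (bytes / (1 << 20)) / (t_usec * 1e-6);
    printf("%.2f Mbytes/sec\n", mbytes_per_sec);   /* prints ~6465.60 */
    return 0;
}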
Hi Ruibang (I think I know you :-) ), are you willing to share your benchmark? Any special configuration or optimization? Thanks!!
Ruibang L. wrote:
I was able to achieve ~6.5 GB/s uni-directionally and ~12 GB/s bi-directionally.
Hi Mian,
Benchmarks usually underestimate the bandwidth, sometimes by about a quarter relative to the hardware's actual bandwidth. To see what is really happening at runtime on your MIC processor, I think you should get to know the VTune tool, which lets you monitor processor events while your application runs and helps you tune its performance.
In your case, for bandwidth measurement, VTune can show you the events it sampled on the memory bus while your application was running:
1. Read bandwidth (bytes/clock) = (L2_DATA_READ_MISS_MEM_FILL + L2_DATA_WRITE_MISS_MEM_FILL + HWP_L2MISS) * 64 / CPU_CLK_UNHALTED
2. Write bandwidth (bytes/clock) = (L2_VICTIM_REQ_WITH_DATA + SNP_HITM_L2) * 64 / CPU_CLK_UNHALTED
3. Total bandwidth (GB/s) = (Read bandwidth + Write bandwidth) * freq (in GHz)
So you can easily work out the bandwidth you want from the event counts. I hope this link gives you more insight: http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding
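For illustration, here is a small sketch that just evaluates the three formulas above. All the event totals and the 1.1 GHz clock are made-up placeholders, not real VTune samples.

/* Evaluate the read/write/total bandwidth formulas above with placeholder
 * event counts (the numbers are invented for illustration only). */
#include <stdio.h>

int main(void)
{
    double l2_data_read_miss_mem_fill  = 1.0e9;   /* placeholder count */
    double l2_data_write_miss_mem_fill = 5.0e8;   /* placeholder count */
    double hwp_l2miss                  = 2.0e8;   /* placeholder count */
    double l2_victim_req_with_data     = 4.0e8;   /* placeholder count */
    double snp_hitm_l2                 = 1.0e7;   /* placeholder count */
    double cpu_clk_unhalted            = 2.0e10;  /* placeholder count */
    double freq_ghz                    = 1.1;     /* assumed coprocessor clock */

    double read_bw  = (l2_data_read_miss_mem_fill + l2_data_write_miss_mem_fill
                       + hwp_l2miss) * 64.0 / cpu_clk_unhalted;   /* bytes/clock */
    double write_bw = (l2_victim_req_with_data + snp_hitm_l2) * 64.0
                      / cpu_clk_unhalted;                         /* bytes/clock */
    double total    = (read_bw + write_bw) * freq_ghz;            /* GB/s */

    printf("read  = %.3f bytes/clock\n", read_bw);
    printf("write = %.3f bytes/clock\n", write_bw);
    printf("total = %.2f GB/s\n", total);
    return 0;
}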
Hi Mian, sorry for the late reply. Yes, we should know each other from Hong Kong via BGI.
I guess you are using tcp as the fabric between the MIC card and the host, so ~450 MB/s at maximum is what you got, which is also what I got at first.
Installing the OFED stack from the MPSS driver package enables the direct memory access path. I run the benchmark with "mpiexec.hydra -genv I_MPI_FABRICS=shm:dapl -n 1 -host bio-xinyi ~/tmp/imb/imb/3.2.4/src/IMB-MPI1 -off_cache 12,64 -npmin 64 -msglog 24:28 -time 10 -mem 1 PingPong Exchange : -n 1 -host mic0 /tmp/IMB-MPI1.mic", so it is fast.
By default (which I guess is your case), it uses I_MPI_FABRICS=shm:tcp.
Mian L. wrote:
Hi Ruibang (I think I know you :-) ), are you willing to share your benchmark? Any special configuration or optimization? Thanks!!
Hi Ruibang, thanks very much! Yes, I think you are right. The dapl fabric is not available on our server; when I try to run your command, it outputs:
MPI startup(): dapl fabric is not available and fallback fabric is not enabled
Do you know how to install the OFED package? Is it supposed to be installed together with MPSS? If it can be installed separately, could you give me a link, please? I googled it but could not find the right information. Thanks very much.
Ruibang L. wrote:
Hi Mian, sorry for the late reply. Yes, we should know each other from Hong Kong via BGI.
I guess you are using tcp as the fabric between the MIC card and the host, so ~450 MB/s at maximum is what you got, which is also what I got at first.
The OFED rpms are distributed with the MPSS driver. The installation guide is at http://registrationcenter.intel.com/irc_nas/3156/readme-en.txt
You'd better check whether the kernel version required by the precompiled driver matches your server's kernel; otherwise you have to rebuild it with rpmbuild --rebuild on the src rpms from the driver sources. Please note that "2.6.32-358.el6.x86_64" is completely different from "2.6.32-358.6.1.el6.x86_64".
BTW, it seems that to use MPI on the Xeon Phi you have to install the proprietary Intel® MPI package (the same goes for the compiler). This is not good. I'm a poor researcher who can only afford the card, lol :>
Mian L. wrote:
Hi Ruibang, thanks very much! Yes, I think you are right. The dapl fabric is not available on our server; when I try to run your command, it outputs:
MPI startup(): dapl fabric is not available and fallback fabric is not enabled
Do you know how to install the OFED package? Is it supposed to be installed together with MPSS? If it can be installed separately, could you give me a link, please? I googled it but could not find the right information. Thanks very much.
