Hi,
I am running my application on a cluster with 2 MICs in each node (Salomon cluster). However, I've noticed that the MPI communication between the two MICs is extremely slow whenever the message size is somewhere between 4KB and 128KB. I've been able to reproduce the problem with the Intel MPI Benchmarks using these arguments: "PingPong -off_cache 16". This is the output for the benchmark with 2 ranks, one on each MIC of the same node:
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 4.1 Update 1, MPI-1 part
#------------------------------------------------------------
# Date                  : Tue Nov 8 16:37:10 2016
# Machine               : k1om
# System                : Linux
# Release               : 2.6.38.8+mpss3.7.1
# Version               : #1 SMP Tue May 24 06:35:05 EDT 2016
# MPI Version           : 3.1
# MPI Thread Environment:
# New default behavior from Version 3.2 on:
# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
# Calling sequence was:
# /home/chaulio/IMB-MPI1.mic PingPong -off_cache 16
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         5.06         0.00
            1         1000         5.48         0.17
            2         1000         5.54         0.34
            4         1000         5.49         0.69
            8         1000         5.46         1.40
           16         1000         5.58         2.73
           32         1000         6.17         4.95
           64         1000         6.20         9.84
          128         1000         6.81        17.93
          256         1000         7.91        30.86
          512         1000         9.72        50.23
         1024         1000        13.86        70.46
         2048         1000        22.39        87.24
         4096         1000        37.53       104.07
         8192         1000      2130.88         3.67   <----- Problem starts here
        16384         1000      2104.48         7.42
        32768         1000      2235.86        13.98
        65536          640      1027.19        60.85
       131072          320      3057.71        40.88
       262144          160       257.40       971.26
       524288           80       325.60      1535.62
      1048576           40       459.33      2177.11
      2097152           20       737.25      2712.78
      4194304           10      1188.40      3365.88
# All processes entering MPI_Finalize
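In case anyone wants to reproduce this, the launch looked roughly like the following sketch (with the fabrics variables listed below already exported; the <node>-mic0 and <node>-mic1 hostnames are placeholders for the two coprocessors of a node, so substitute your own):

$ mpirun -n 1 -host <node>-mic0 /home/chaulio/IMB-MPI1.mic PingPong -off_cache 16 : \
         -n 1 -host <node>-mic1 /home/chaulio/IMB-MPI1.mic PingPong -off_cache 16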
Some additional information that might be helpful for those reading this:
- I am using the fabrics configuration suggested in the cluster documentation, i.e.:
$ export I_MPI_FABRICS=shm:dapl
$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0,ofa-v2-mcm-1
- I have experienced the same problem when running the same benchmark on a different cluster, also with 2 MICs per node (SuperMIC cluster).
- I have the same problem when trying Host-MIC communication (although it is a bit faster).
- If I omit the flag "-off_cache 16" when running the benchmark, this problem doesn't happen. But I am using it intentionally, so that the communication pattern is very similar to that of my application. (Without this flag the benchmark reuses a very small buffer, which is not the case in my application.)
- At first I thought it was a cache issue, but I also tried the benchmark with two MPI ranks running on the same MIC, and I didn't experience any slow-down then.
- I also experimented with setting I_MPI_EAGER_THRESHOLD to different values such as 512, 16KB, and 64KB (see the example right after this list), and the slow-down for messages larger than 4KB happened in all cases. So it seems that the Eager protocol threshold is not the issue.
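For completeness, each of those threshold experiments was just the same benchmark relaunched with the variable exported beforehand, for example:

$ export I_MPI_EAGER_THRESHOLD=65536    # 64KB; 512 and 16384 (16KB) were set the same way

followed by the same mpirun launch shown above.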
Has anyone experienced the same problem? Could you please try to reproduce it in other environments and check whether you are able to solve it? Maybe there is an Intel MPI option that could fix this? I would appreciate any advice on this issue.
Thank you very much,
Chaulio