We develop MPI algorithms on the SuperMUC supercomputer and compile them with Intel MPI 2018. Unfortunately, a message transfer between two processes that have not exchanged a message before appears to be slower, by a factor of up to 1000, than a transfer between two processes that have already communicated.
Here are three examples:
1. Let benchmark A perform the following operations: "First, execute a barrier on MPI_COMM_WORLD. Second, start a timer. Third, process 0 sends 256 messages of size 32 kB each. Message i is sent to process i + 1. Finally, stop the timer." The first execution of benchmark A takes about 686 microseconds on an instance of 2048 processes (2048 cores on 128 nodes). Subsequent executions of A take only about 0.85 microseconds each.
Insight: The first time we perform a communication pattern (here a 'partial' broadcast), the execution is slower than subsequent executions by a factor of about 800. Unfortunately, if we execute benchmark A again with different communication partners, e.g., process 1 sends messages to processes 1..256, benchmark A is slow again. Thus, an initial warm-up phase that executes benchmark A once does not speed up communication in general.
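For readers who want to reproduce this, benchmark A can be sketched roughly as follows. This is a minimal sketch in Python with mpi4py, not our actual benchmark code: the function names `targets` and `run_benchmark_a`, the use of non-blocking sends, and the wrap-around rule for senders other than process 0 are assumptions made for illustration.

```python
def targets(sender, count, world_size):
    """Ranks that receive one message each: sender+1, sender+2, ...

    Wrapping around at world_size is an assumption; the post only
    defines the pattern for sender 0 (message i goes to process i + 1).
    """
    count = min(count, world_size - 1)  # a rank never sends to itself
    return [(sender + 1 + i) % world_size for i in range(count)]


def run_benchmark_a(sender=0, count=256, msg_bytes=32 * 1024):
    """Barrier, start timer, sender sends `count` messages, stop timer."""
    from mpi4py import MPI  # imported lazily so targets() works without MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    dests = targets(sender, count, size)
    buf = bytearray(msg_bytes)

    comm.Barrier()
    start = MPI.Wtime()
    if rank == sender:
        # Post all sends, then wait for every one of them to complete.
        reqs = [comm.Isend([buf, MPI.BYTE], dest=d, tag=0) for d in dests]
        MPI.Request.Waitall(reqs)
    elif rank in dests:
        comm.Recv([buf, MPI.BYTE], source=sender, tag=0)
    return MPI.Wtime() - start  # elapsed seconds on the calling rank
```

Calling `run_benchmark_a()` several times in one job and printing the value returned on the sending rank shows the first-run versus later-run difference; calling it afterwards with `sender=1` exercises a different set of partners.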
2. Let benchmark B perform the following operations: "First, execute a barrier on MPI_COMM_WORLD. Second, start a timer. Third, invoke MPI_Alltoall with messages of size 32 kB each. Finally, stop the timer." The first execution of benchmark B takes about 42.41 seconds (!) on an instance of 2048 processes (2048 cores on 128 nodes). The second execution of B takes only 0.12 seconds.
Insight: The first time we perform a communication pattern (here MPI_Alltoall), the execution is slower than subsequent executions by a factor of about 353. Worse, the first MPI_Alltoall is unbelievably slow in absolute terms and gets much slower still on larger machine instances.
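Benchmark B can be sketched in the same style. Again this is only an illustration under assumptions: the helper names are hypothetical, and the buffers are plain zero-filled byte arrays rather than whatever payload a real application would send.

```python
def alltoall_buffer_bytes(world_size, msg_bytes=32 * 1024):
    """Bytes each rank needs for its send buffer (and for its receive buffer)."""
    return world_size * msg_bytes


def slowdown(first_run, later_run):
    """Ratio between the first and a subsequent execution time."""
    return first_run / later_run


def run_benchmark_b(msg_bytes=32 * 1024):
    """Barrier, start timer, one MPI_Alltoall, stop timer."""
    from mpi4py import MPI  # imported lazily so the helpers work without MPI

    comm = MPI.COMM_WORLD
    nbytes = alltoall_buffer_bytes(comm.Get_size(), msg_bytes)
    sendbuf = bytearray(nbytes)
    recvbuf = bytearray(nbytes)

    comm.Barrier()
    start = MPI.Wtime()
    # Each rank sends msg_bytes bytes to every rank and receives as many.
    comm.Alltoall([sendbuf, MPI.BYTE], [recvbuf, MPI.BYTE])
    return MPI.Wtime() - start  # elapsed seconds on the calling rank
```

With the numbers reported above, `slowdown(42.41, 0.12)` gives the factor of about 353, and on 2048 processes each rank holds two buffers of 64 MiB.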
3. Let benchmark C perform the following operations: "First, execute a barrier on MPI_COMM_WORLD. Second, start a timer. Third, execute an all-to-all collective operation with messages of size 32 kB each. Finally, stop the timer." The all-to-all collective operation we use in benchmark C is our own implementation of the MPI_Alltoall interface. We now execute benchmark C first and benchmark A afterwards. Benchmark C takes about 40 seconds and benchmark A takes about 0.85 microseconds.
Insight: The first execution of our all-to-all implementation performs similarly to the first execution of MPI_Alltoall. Surprisingly, the subsequent execution of benchmark A is very fast (0.85 microseconds, the same as a repeated run in example 1) compared to the case without a preceding all-to-all. It seems that the all-to-all collective operation sets up the connections between all pairs of processes, which results in a fast execution of benchmark A. However, as the all-to-all collective operation (MPI_Alltoall as well) is unbelievably slow, we do not want to execute it as a warm-up at large scale.
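A small back-of-the-envelope calculation illustrates why an all-to-all would warm up benchmark A. It assumes that the all-to-all exchanges messages directly between every pair of processes and that each pair needs its own connection; neither assumption has been verified for Intel MPI or for our implementation, and the function names are hypothetical.

```python
def pairs_alltoall(world_size):
    """Distinct process pairs touched by a direct pairwise all-to-all."""
    return world_size * (world_size - 1) // 2


def pairs_benchmark_a(count=256):
    """Distinct process pairs touched by benchmark A (one sender, `count` receivers)."""
    return count


if __name__ == "__main__":
    p = 2048
    print("all-to-all pairs: ", pairs_alltoall(p))
    print("benchmark A pairs:", pairs_benchmark_a())
```

Under these assumptions the 256 pairs used by benchmark A are a small subset of the pairs an all-to-all touches on 2048 processes, which is consistent with benchmark A being fast afterwards and with the warm-up cost growing quickly with the number of processes.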
We have already found that setting the environment variable I_MPI_USE_DYNAMIC_CONNECTIONS=no avoids these slow running times at small scale (up to 2048 cores). However, setting I_MPI_USE_DYNAMIC_CONNECTIONS to 'no' has no effect on larger machine instances (more than 2048 cores).
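To make the configuration explicit, the setting can be expressed as a small launcher helper. This is a sketch only: the helper names, the plain `mpirun -n` launch line, and the example debug level are assumptions, and on a batch system the variables would normally be set in the job script instead.

```python
import os


def intel_mpi_env(dynamic_connections="no", debug_level=None, base=None):
    """Return a copy of the environment with the Intel MPI settings applied."""
    env = dict(os.environ if base is None else base)
    env["I_MPI_USE_DYNAMIC_CONNECTIONS"] = dynamic_connections
    if debug_level is not None:
        env["I_MPI_DEBUG"] = str(debug_level)  # level is chosen by the caller
    return env


def launch_command(nprocs, program):
    """Build an mpirun command line (hypothetical launcher invocation)."""
    return ["mpirun", "-n", str(nprocs)] + list(program)
```

For example, `subprocess.run(launch_command(2048, ["./benchmark"]), env=intel_mpi_env(debug_level=5))` would start a hypothetical `./benchmark` binary with dynamic connections disabled and debug output enabled.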
We think that these benchmarks give interesting insights into the running time of Intel MPI. Our supercomputer might be configured incorrectly. We tried to adjust several environment variables but did not find a satisfying configuration. We also want to mention that these fluctuations in running time do not occur with IBM MPI on the same machine.
If you have further suggestions for handling this problem, please let us know. If required, we can run additional benchmarks, apply different configurations, and provide debug output, e.g. I_MPI_DEBUG=xxx.