On Lomonosov-2 supercomputer (http://hpc.msu.ru/node/159, partition "Pascal"), I observe strange slowdowns in processing with I_MPI_ASYNC_PROGRESS=1/I_MPI_PIN_PROCESSOR_LIST=... parameters combination.
The reproducer code is like:
---
#include <mpi.h>
#include <iostream>

int main(int argc, char **argv)
{
    MPI_Request request[1];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    constexpr int size = 512;
    constexpr size_t N = 10000;
    //--- dummy window, only to work around the segfault described below
    char wbuf[size];
    MPI_Win win;
    MPI_Win_create(wbuf, size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    //--- timed loop: N back-to-back nonblocking allreduces, each completed immediately
    char sbuf[size], rbuf[size];
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();
    for (size_t i = 0; i < N; i++) {
        MPI_Iallreduce(sbuf, rbuf, size, MPI_CHAR, MPI_SUM, MPI_COMM_WORLD, request);
        MPI_Wait(request, &status);
    }
    double t2 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);
    //--- report the average time per iteration, in microseconds
    if (rank == 0) {
        std::cout << (t2 - t1) * 1e6 / (double)N << std::endl;
    }
    MPI_Finalize();
    return 0;
}
-----
The MPI_Win_create part is there only to prevent the crash described here: https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/I-MPI-ASYNC-PROGRESS-segfault/m-p/1219077#M7247
Running it like this:
# export I_MPI_ASYNC_PROGRESS=1
# mpiexec.hydra -f hosts -np 4 -ppn 2 --errfile-pattern=err.%r --outfile-pattern=out.%r ./simple-iallreduce
I get timing results on stdout from 3 consecutive executions like: 10.2775; 10.1705; 10.243
If I add explicit pinning like this:
# export I_MPI_ASYNC_PROGRESS=1
# export I_MPI_PIN_PROCESSOR_LIST=0,1,2,3
# mpiexec.hydra -f hosts -np 4 -ppn 2 --errfile-pattern=err.%r --outfile-pattern=out.%r ./simple-iallreduce
The timing results change to something like: 331.826; 10.4; 134.422
I'm not sure why the timing changes like this, and at such a scale, for this way of execution. Is there anything incorrect with the I_MPI_PIN_PROCESSOR_LIST=... setting?
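To double-check where each rank actually lands, the reproducer could be extended with a small affinity dump called right after MPI_Comm_rank. This is just a sketch assuming Linux/glibc (sched_getaffinity, gethostname) and is not part of the timings above:
---
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          // needed for sched_getaffinity with some compilers
#endif
#include <mpi.h>
#include <sched.h>           // sched_getaffinity, CPU_* macros (glibc)
#include <unistd.h>          // gethostname
#include <iostream>
#include <sstream>

// Print the set of CPUs the calling MPI rank is allowed to run on.
static void print_affinity(int rank)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);   // 0 = calling process

    std::ostringstream cpus;
    for (int c = 0; c < CPU_SETSIZE; c++)
        if (CPU_ISSET(c, &mask))
            cpus << c << ' ';

    char host[256] = {0};
    gethostname(host, sizeof(host) - 1);
    std::cout << "rank " << rank << " on " << host
              << " bound to cpus: " << cpus.str() << std::endl;
}
---
Comparing this output with and without I_MPI_PIN_PROCESSOR_LIST would at least show how the rank placement differs between the two runs.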
--
Regards,
Alexey
Hi Alexey,
As suggested in the previous post (https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/I-MPI-ASYNC-PROGRESS-segfault/m-p/1219077#M7247), use MPI_Waitall(N, request, &status) instead of MPI_Wait(request, &status), and place the waitall outside the for loop. Then the wait is called only once and the time taken will be reduced considerably.
When you use I_MPI_PIN_PROCESSOR_LIST, only the specified CPUs will be used to launch processes. So the overall time will increase, since you are using only the given CPUs instead of all the available CPUs.
We will get back to you with more information. Meanwhile, please refer to the following link for more information on using nonblocking collectives: https://techdecoded.intel.io/resources/hiding-communication-latency-using-mpi-3-non-blocking-collectives/#gs.j5khtt
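To illustrate the overlap idea from that article, a minimal sketch (with a placeholder compute() routine standing in for application work, not code from your reproducer) could look like this:
---
#include <mpi.h>
#include <vector>

// Placeholder for application work that does not depend on the reduction result.
static void compute(std::vector<double> &v)
{
    for (double &x : v)
        x *= 1.0000001;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int n = 1 << 20;
    std::vector<double> in(n, 1.0), out(n, 0.0), work(n, 2.0);

    MPI_Request req;
    // Start the reduction, do independent work while it progresses in the
    // background, then complete it before touching 'out'.
    MPI_Iallreduce(in.data(), out.data(), n, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);
    compute(work);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
---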
Regards
Prasanth
Hi Prasanth,
1. Changing to MPI_Waitall(N, request, &status) is not an equivalent transformation of the program I have. Look at two variants:
----
1.
MPI_Iallreduce(&in1, &out1,..., &req1);
do_smth(out0);
MPI_Wait(&req1);
MPI_Iallreduce(&in2, &out2,..., &req2);
do_smth(out1);
MPI_Wait(&req2);
MPI_Iallreduce(&in3, &out3,..., &req3);
do_smth(out2);
MPI_Wait(&req3);
----
2.
MPI_Iallreduce(&in1, &out1,..., &req1);
do_smth(out0);
MPI_Iallreduce(&in2, &out2,..., &req2);
do_smth(out1);
MPI_Iallreduce(&in3, &out3,..., &req3);
do_smth(out2);
MPI_Wait(&req1);
MPI_Wait(&req2);
MPI_Wait(&req3);
----
Variant 2 is buggy, since the MPI_Wait for request <N> must come before any use of out<N> because of the data dependency; the only correct variant is 1. So the way MPI_Iallreduce and MPI_Wait are used in this reproducer is quite realistic and close to real usage scenarios (a compilable sketch of this pattern is given below). Moreover, it is very close in structure to the IMB-NBC benchmarks.
I don't really see the idea behind changing it to MPI_Waitall(N...) outside the loop.
2. The argument about "using only the given CPUs instead of all the available CPUs" seems doubtful. You can see from the command line that I run the test with ppn=2, so setting an explicit processor list of 4 cores should not reduce the in-node parallelism, as far as I can see.
3. Thanks for the link to the article; I have read it several times before, and I don't think anything in this reproducer contradicts the article or common sense.
The question persists: why does the combination of I_MPI_ASYNC_PROGRESS and I_MPI_PIN_PROCESSOR_LIST result in such volatile performance figures?
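For reference, here is a compilable sketch of the variant 1 pattern (with a placeholder do_smth() and dummy buffers, just to show the structure):
---
#include <mpi.h>
#include <vector>

// Placeholder for work on an already-completed reduction result.
static void do_smth(std::vector<int> &v)
{
    for (int &x : v)
        x += 1;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int count = 512;
    const int steps = 3;
    std::vector<int> in(count, 1);
    // out[0] plays the role of "out0"; reduction i writes into out[i + 1].
    std::vector<std::vector<int>> out(steps + 1, std::vector<int>(count, 0));

    MPI_Request req;
    for (int i = 0; i < steps; i++) {
        // Start reduction i, work on the previous result, then wait:
        // the wait for request i must come before any use of out[i + 1].
        MPI_Iallreduce(in.data(), out[i + 1].data(), count, MPI_INT, MPI_SUM,
                       MPI_COMM_WORLD, &req);
        do_smth(out[i]);   // depends only on an already-completed result
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
---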
--
Regards,
Alexey
Hi Prasanth,
I tried switching I_MPI_KIND to "release_mt" (the output described above was taken with the default "release" kind).
The volatile times disappeared; everything is now stable and smooth.
It seems I_MPI_ASYNC_PROGRESS=1 works well only with the release_mt and debug_mt kinds, even though the "release" and "debug" kinds work fine in MPI_THREAD_MULTIPLE mode. In IMPI 2019, "release_mt" must be selected explicitly (this seems different from IMPI 2018). I didn't get that from the docs, and there is no diagnostic for wrong usage (at least in 2019.4), which is misleading.
--
Regards,
Alexey
Hi Alexey,
It has been mentioned in the developer reference (https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/additional-supported-features/asynchronous-progress-control.html) that only release_mt and debug_mt versions support asynchronous progress threads.
"Intel® MPI Library supports asynchronous progress threads that allow you to manage communication in parallel with application computation and, as a result, achieve better communication/computation overlapping. This feature is supported for the release_mt and debug_mt versions only."
Since your issue has been resolved, could you please confirm so we can close this thread?
Regards
Prasanth
Hi Alexey,
Since your issue has been resolved, we are closing this issue for now. Please raise a new thread for any further queries.
Any further interaction in this thread will be considered community-only.
Regards
Prasanth
