Solved: On an MPI environment setting for using MPI-3* Non-Blocking Collectives

Viet · ‎12-03-2020

Dear Devcloud administrator and supporter,

I would like to test the MPI_Iallreduce call (non-blocking communication) as described on the page below.

https://techdecoded.intel.io/resources/hiding-communication-latency-using-mpi-3-non-blocking-collectives/

I am going to run the code on a university supercomputer which is based on Intel CPUs.

The code will run through a job scheduling system where I can't specify the CPU list because the job scheduling system sets it automatically.

The following setting is explained in the above link.

export I_MPI_ASYNC_PROGRESS_PIN=<CPU list>

Is the above setting necessary?

Is there a problem if this setting is not used?

Thank you very much for any help you can provide.

Viet.

ArunJ_Intel · ‎12-03-2020

Hi Viet,

As this question is not about using Devcloud but about running an MPI application, we are moving it to the HPC Toolkit forum.

Thanks

Arun

Viet · ‎12-03-2020

Hi Arun,

Thank you!

Viet

PrasanthD_intel · ‎12-04-2020

Hi Viet,

As mentioned in the article "there is an overhead associated with non-blocking communication from making it asynchronous. Although asynchronous progress improves communication-computation overlap, it requires an additional thread per MPI rank. This thread consumes CPU cycles and, ideally, must be pinned to an exclusive core."

Each additional thread will use an additional CPU core, and you can pin it to a specific core using I_MPI_ASYNC_PROGRESS_PIN=<CPU list>, just as you pin MPI processes to specific cores using I_MPI_PIN_PROCESSOR_LIST.

For example, if I do not set the I_MPI_ASYNC_PROGRESS_PIN variable, the async threads still occupy cores, but the MPI library chooses those cores itself.

eg: I_MPI_ASYNC_PROGRESS=1 I_MPI_DEBUG=10 mpirun -n 4 -host epb602 ./org

[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20200923 (id: abd58e492)

[0] MPI startup(): library kind: release_mt

[0] MPI startup(): libfabric version: 1.10.1-impi

......

[0] MPI startup(): I_MPI_ASYNC_PROGRESS=1

[0] MPI startup(): I_MPI_DEBUG=10

[0] MPI startup(): threading: mode: handoff

[0] MPI startup(): threading: vcis: 1

[0] MPI startup(): threading: progress_threads: 0

[0] MPI startup(): threading: is_threaded: 1

[0] MPI startup(): threading: async_progress: 1

[0] MPI startup(): threading: num_pools: 64

[0] MPI startup(): threading: lock_level: nolock

[0] MPI startup(): threading: enable_sep: 0

[0] MPI startup(): threading: direct_recv: 0

[0] MPI startup(): threading: zero_op_flags: 1

[0] MPI startup(): threading: num_am_buffers: 1

[0] MPI startup(): threading: library is built with per-vci thread granularity

[0] MPI startup(): global_rank 0, local_rank 0, local_size 4, threads_per_node 4

[0] MPI startup(): threading: thread: 0, processor: 95

[0] MPI startup(): threading: thread: 1, processor: 94

[0] MPI startup(): threading: thread: 2, processor: 93

[0] MPI startup(): threading: thread: 3, processor: 92

My node has 96 cores, and it selected the last 4 cores (92-95) for the async threads because I launched 4 processes.

I can select the cores to use for async threads with I_MPI_ASYNC_PROGRESS_PIN=<CPU list>

Eg: I_MPI_ASYNC_PROGRESS_PIN=81,82,83,84 I_MPI_ASYNC_PROGRESS=1 I_MPI_DEBUG=10 mpirun -n 4 -host epb602 ./org

[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20200923 (id: abd58e492)

[0] MPI startup(): library kind: release_mt

[0] MPI startup(): libfabric version: 1.10.1-impi

....

[0] MPI startup(): I_MPI_ASYNC_PROGRESS=1

[0] MPI startup(): I_MPI_ASYNC_PROGRESS_PIN=81,82,83,84

[0] MPI startup(): I_MPI_DEBUG=10

[3] MPI startup(): global_rank 3, local_rank 3, local_size 4, threads_per_node 4

[0] MPI startup(): threading: mode: handoff

[0] MPI startup(): threading: vcis: 1

[0] MPI startup(): threading: progress_threads: 0

[0] MPI startup(): threading: is_threaded: 1

[0] MPI startup(): threading: async_progress: 1

[0] MPI startup(): threading: num_pools: 64

[0] MPI startup(): threading: lock_level: nolock

[0] MPI startup(): threading: enable_sep: 0

[0] MPI startup(): threading: direct_recv: 0

[0] MPI startup(): threading: zero_op_flags: 1

[0] MPI startup(): threading: num_am_buffers: 1

[0] MPI startup(): threading: library is built with per-vci thread granularity

[0] MPI startup(): global_rank 0, local_rank 0, local_size 4, threads_per_node 4

[0] MPI startup(): threading: thread: 0, processor: 81

[0] MPI startup(): threading: thread: 1, processor: 82

[0] MPI startup(): threading: thread: 2, processor: 83

[0] MPI startup(): threading: thread: 3, processor: 84

Here you can see that cores 81-84 were used.

Hope this helps, let us know if you need any further assistance.

Regards

Prasanth

Viet · ‎12-04-2020

Dear Prasanth,

Thank you very much for your useful information.
I now understand the important meaning of the I_MPI_ASYNC_PROGRESS_PIN variable.

As mentioned by the following explanation in the article, I now want to specify one or two additional threads per node.

"Exclusive thread pinning for each rank results in half of the cores being assigned just to accelerate the progress of non-blocking MPI calls. Therefore, through careful experimentation, we must select a certain number of cores per node to be assigned for asynchronous progress without causing a considerable compute penalty."

In your example, you ran your program on only one node.

If I want to run on multiple nodes and specify one or two additional threads per node, what is the correct syntax to define the I_MPI_ASYNC_PROGRESS_PIN variable?

Thank you!

Viet.

Viet · ‎12-08-2020

Dear Prasanth,

Do you know how to set the MPI release_mt mode in the oneAPI HPC for using MPI Non-blocking communication?

The method explained in the following link did not match the MPI in the oneAPI HPC.

https://scc.ustc.edu.cn/zlsc/tc4600/intel/2016.0.109/mpi/User_Guide/Intel_MPI_Library_Configurations.htm

Thank you for any help you can provide.
Viet.

PrasanthD_intel · ‎12-09-2020

Hi Viet,

Sorry for the delay in response,

Q) If I want to run on multiple nodes and specify one or two additional threads per node, what is the correct syntax to define the I_MPI_ASYNC_PROGRESS_PIN variable?

A) The syntax is the same as for a single node, but the processes are divided across the nodes, and so are the async threads.

For example, if you launch 10 processes across 2 nodes, you only have to provide 5 cores to I_MPI_ASYNC_PROGRESS_PIN, because only 5 processes run on each node.
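
The arithmetic above can be sketched in a few lines of bash. This is only an illustration: the helper name async_pin_list and the starting core 19 are hypothetical, and it assumes the ranks divide evenly across the nodes and that the same core IDs are free on every node. The mpirun line is printed rather than executed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a comma-separated list of consecutive core IDs,
# one per async progress thread on a node.
# Usage: async_pin_list <total_ranks> <num_nodes> <first_core>
async_pin_list() {
    local total_ranks=$1 num_nodes=$2 first_core=$3
    # Ranks per node, assuming the ranks divide evenly across the nodes.
    local per_node=$(( total_ranks / num_nodes ))
    local list="" c
    for (( c = first_core; c < first_core + per_node; c++ )); do
        list+="${list:+,}$c"   # add a comma before every entry except the first
    done
    printf '%s\n' "$list"
}

# 10 ranks on 2 nodes -> 5 ranks, and therefore 5 progress threads, per node.
pin=$(async_pin_list 10 2 19)
echo "I_MPI_ASYNC_PROGRESS=1 I_MPI_ASYNC_PROGRESS_PIN=$pin mpirun -n 10 -ppn 5 ./a.out"
```

The core range is arbitrary here; on a real node you would pick cores that the MPI ranks themselves are not using.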

Q) Do you know how to set the MPI release_mt mode in the oneAPI HPC for using MPI Non-blocking communication?

A) It's the same as you mentioned: you provide the library configuration (release_mt) as an argument to the mpivars.sh script.

For more information, please refer to: Selecting a Library Configuration (intel.com)
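
One way to confirm which configuration was actually loaded is to look for the "library kind:" line that I_MPI_DEBUG prints at startup, as in the debug output earlier in this thread. A minimal sketch follows; the helper name library_kind is hypothetical, and the sample line is copied from that output rather than produced by a real run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read I_MPI_DEBUG startup output on stdin and print
# the value of the first "library kind:" line.
library_kind() {
    sed -n 's/^\[[0-9]*\] MPI startup(): library kind: //p' | head -n 1
}

# Sample line copied from the debug output shown in this thread.
sample='[0] MPI startup(): library kind: release_mt'
kind=$(printf '%s\n' "$sample" | library_kind)

if [ "$kind" = "release_mt" ]; then
    echo "multithreaded library loaded"
else
    echo "loaded library kind: ${kind:-unknown}"
fi
```

In a real job you would pipe the output of the MPI run (with I_MPI_DEBUG set) into library_kind instead of the sample line.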

Let us know if you face any issues.

Regards

Prasanth

Viet · ‎12-09-2020

Hi Prasanth,

I was able to set the MPI release_mt mode, as explained under the link you sent.

Thank you very much.

I still don't know how to set the I_MPI_ASYNC_PROGRESS_PIN variable.

Suppose I want to run a non-blocking program on two nodes with node names: node_id1, node_id2. Suppose I want to tie additional threads to cores 1, 2 on node_id1 and cores 3,4 on node_id2.

I think it should be something like this:

export I_MPI_ASYNC_PROGRESS_PIN = node_id1:1, node_id1:2, node_id2:3, node_id2:4

Please let me know the correct setting for this variable.

Thank you!
Viet

PrasanthD_intel · ‎12-16-2020

Hi Viet,

To answer your question.

Q) Suppose I want to run a non-blocking program on two nodes with node names: node_id1, node_id2. Suppose I want to tie additional threads to cores 1, 2 on node_id1 and cores 3,4 on node_id2.

A) You can use argument sets for this. The command is

mpiexec.hydra -n 2 -host node_id1 -env I_MPI_ASYNC_PROGRESS_PIN=1,2 ./<exec> : -n 2 -host node_id2 -env I_MPI_ASYNC_PROGRESS_PIN=3,4 ./<exec>

Hope this helps. Let me know if you have any other queries.
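
For more than two nodes, writing the argument sets by hand gets tedious. The following bash sketch assembles the same kind of command from host=pin-list pairs. The helper name build_mpmd_cmd is hypothetical, the host names and core lists are the ones from the question, and the command is only printed, not run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an mpiexec.hydra command with one argument set
# per node, each carrying its own I_MPI_ASYNC_PROGRESS_PIN list.
# Usage: build_mpmd_cmd <ranks_per_node> <exec> <host>=<pin_list> [...]
build_mpmd_cmd() {
    local ranks=$1 exe=$2
    shift 2
    local cmd="mpiexec.hydra" sep="" spec host pins
    for spec in "$@"; do
        host=${spec%%=*}   # text before the first '='
        pins=${spec#*=}    # text after the first '='
        cmd+="$sep -n $ranks -host $host -env I_MPI_ASYNC_PROGRESS_PIN=$pins $exe"
        sep=" :"           # argument sets are separated by ' : '
    done
    printf '%s\n' "$cmd"
}

build_mpmd_cmd 2 ./a.out node_id1=1,2 node_id2=3,4
```

The printed line can be pasted into a job script once the host names are replaced with the nodes the scheduler actually assigned.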

Regards

Prasanth

Viet · ‎12-17-2020

Hi Prasanth,

Thank you very much.
I'll confirm this command on Devcloud and let you know the result soon.

mpiexec.hydra -n 2 -host node_id1 -env I_MPI_ASYNC_PROGRESS_PIN=1,2 ./<exec> : -n 2 -host node_id2 -env I_MPI_ASYNC_PROGRESS_PIN=3,4 ./<exec>

Is it true that this command uses eight threads where four pinned threads are used for non-blocking communication?

Sincerely,

Viet.

Viet · ‎12-18-2020

Hi Prasanth,

Do you know how to run your MPI command on the Intel Devcloud system?

mpiexec.hydra -n 2 -host node_id1 -env I_MPI_ASYNC_PROGRESS_PIN=1,2 ./<exec> : -n 2 -host node_id2 -env I_MPI_ASYNC_PROGRESS_PIN=3,4 ./<exec>

The PBS queuing system in the Devcloud automatically assigns the nodes to the MPI job when the job is run.
I don't know how to get specified node names when running a qsub command.

Sincerely,

Viet.

PrasanthD_intel · ‎12-18-2020

Hi Viet,

The command launches 4 processes/ranks and 4 async progress threads (eight threads in total), each of which needs to run on a separate core.

Let us know if the command works for you.

Regards

Prasanth

Viet · ‎12-18-2020

Hello Prasanth,
Thank you for your response.
I will confirm this with the VTune Profiler.
I am still having problems executing your command on Devcloud through the PBS system.
Best regards,
Viet.

PrasanthD_intel · ‎12-18-2020

Hi Viet,

Could you please let us know the errors you were facing while running in Devcloud?

Regards

Prasanth

Viet · ‎12-20-2020

Hi Prasanth,

Thank you for your response.

$ qsub -l nodes=2:ppn=2 -d . run_async.sh

Below is the error message when I ran the above command

$ cat run_async.sh.e767999 
[mpiexec@s001-n020] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on s001-n144 (pid 7432, exit code 65280)
[mpiexec@s001-n020] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@s001-n020] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@s001-n020] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:772): error waiting for event
[mpiexec@s001-n020] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1955): error setting up the boostrap proxies

The file run_async.sh contains an MPI command that follows your specified syntax.

$ cat run_async.sh
#!/usr/bin/bash
mpiexec.hydra -n 2 -host s001-n144 -env I_MPI_ASYNC_PROGRESS_PIN=1,2 ./a.out ./mtx/hcircuit.mtx : -n 2 -host s001-n143 -env I_MPI_ASYNC_PROGRESS_PIN=3,4 ./a.out ./mtx/hcircuit.mtx

Thank you for anything you can provide.

Sincerely,

Viet.

Viet · ‎12-22-2020

Hi Prasanth,

I added the setting to go into the release_mt mode, but still no success.

$ cat run_async.sh
#!/usr/bin/bash
source /opt/intel/inteloneapi/setvars.sh release_mt --force 
echo $LD_LIBRARY_PATH
mpiexec.hydra -n 2 -host s001-n144 -env I_MPI_ASYNC_PROGRESS_PIN=1,2 ./a.out ./mtx/hcircuit.mtx : -n 2 -host s001-n143 -env I_MPI_ASYNC_PROGRESS_PIN=3,4 ./a.out ./mtx/hcircuit.mtx

$ cat run_async.sh.e769054 
[mpiexec@s001-n008] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on s001-n144 (pid 22676, exit code 65280)
[mpiexec@s001-n008] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@s001-n008] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@s001-n008] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:772): error waiting for event
[mpiexec@s001-n008] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1955): error setting up the boostrap proxies

Thank you for anything you can provide.

Best, Viet.

Viet · ‎12-22-2020

Hi Prasanth,

. /opt/intel/inteloneapi/mpi/2021.1.1/env/vars.sh release_mt

I also ran the above command, but it seems that the release_mt mode is not loaded in the Intel MPI Library version 2021.

Best, Viet.

Viet · ‎12-22-2020

Hello Prasanth,

Now I can load the MPI release_mt mode and run the command you explained.

$ cat run_async.sh

#!/usr/bin/bash
. /opt/intel/inteloneapi/mpi/2021.1.1/env/vars.sh --i_mpi_library_kind=release_mt
echo $LD_LIBRARY_PATH
#echo "+ The nodefile for this job is stored at ${PBS_NODEFILE}"
uniq ${PBS_NODEFILE} node_list.txt
mapfile -t nodes < node_list.txt
np=$(wc -l < ${PBS_NODEFILE})
echo "+ Number of cores assigned: ${np}"
echo "+ node list:" ${nodes[0]} ${nodes[1]}
I_MPI_ASYNC_PROGRESS=1 I_MPI_DEBUG=10 mpiexec.hydra -n 2 -host ${nodes[0]} -env I_MPI_ASYNC_PROGRESS_PIN=5,6 ./a.out ./mtx/hcircuit.mtx : -n 2 -host ${nodes[1]} -env I_MPI_ASYNC_PROGRESS_PIN=1,2 ./a.out ./mtx/hcircuit.mtx

The result of this command is as follows.

[0] MPI startup(): Intel(R) MPI Library, Version 2021.1  Build 20201112 (id: b9c9d2fc5)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release_mt
[0] MPI startup(): libfabric version: 1.11.0-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[1] MPI startup(): global_rank 1, local_rank 1, local_size 2, threads_per_node 2
[3] MPI startup(): global_rank 3, local_rank 1, local_size 2, threads_per_node 2
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       18729    s001-n056  {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 1       18730    s001-n056  {6,7,8,9,10,11,18,19,20,21,22,23}
[0] MPI startup(): 2       21267    s001-n023  {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 3       21268    s001-n023  {6,7,8,9,10,11,18,19,20,21,22,23}
[0] MPI startup(): I_MPI_ROOT=/glob/development-tools/versions/oneapi/gold/inteloneapi/mpi/2021.1.1
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_ASYNC_PROGRESS=1
[0] MPI startup(): I_MPI_ASYNC_PROGRESS_PIN=5,6
[0] MPI startup(): I_MPI_DEBUG=10
[0] MPI startup(): threading: mode: handoff
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: is_threaded: 1
[0] MPI startup(): threading: async_progress: 1
[0] MPI startup(): threading: num_pools: 64
[0] MPI startup(): threading: lock_level: nolock
[0] MPI startup(): threading: enable_sep: 0
[0] MPI startup(): threading: direct_recv: 0
[0] MPI startup(): threading: zero_op_flags: 0
[0] MPI startup(): threading: num_am_buffers: 8
[0] MPI startup(): threading: library is built with per-vci thread granularity
[0] MPI startup(): global_rank 0, local_rank 0, local_size 2, threads_per_node 2
[0] MPI startup(): threading: thread: 0, processor: 5
[0] MPI startup(): threading: thread: 1, processor: 6
[2] MPI startup(): global_rank 2, local_rank 0, local_size 2, threads_per_node 2

The two additional threads pinned to cores 5 and 6, specified in the first part of the command, appear in the output, but the two pinned to cores 1 and 2, specified in the last part of the command, do not.

Please let me know what was wrong with the command.

Thank you.

Viet.

PrasanthD_intel · ‎12-23-2020

Hi Viet,

Yes, I too have observed that on Devcloud the usual way of setting the library configuration wasn't working. I will forward this issue to the internal team. Thanks for reporting it.

Coming to your other question:

Q) Why isn't the pinning showing for the other node?

I_MPI_ASYNC_PROGRESS=1 I_MPI_DEBUG=10 mpiexec.hydra -n 2 -host ${nodes[0]} -env I_MPI_ASYNC_PROGRESS_PIN=5,6 ./a.out ./mtx/hcircuit.mtx : -n 2 -host ${nodes[1]} -env I_MPI_ASYNC_PROGRESS_PIN=1,2 ./a.out ./mtx/hcircuit.mtx

[0] MPI startup(): threading: thread: 0, processor: 5

[0] MPI startup(): threading: thread: 1, processor: 6

A) In the debug output, the number in square brackets at the start of each line (here [0]) is the rank that printed the line, so these two lines come from rank 0.

Generally, rank 0 isn't aware of the pinning that happens on another node, which is why cores 1,2 that you pinned on the second node (${nodes[1]}) aren't shown.

It doesn't mean the pinning is not occurring.

If you want to check, swap the order of the two argument sets in the MPMD command so that rank 0 starts on the other node.
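
When comparing the two runs, it can help to reduce the debug output to just the pinning lines and the rank that reported them. Here is a minimal bash sketch; the helper name async_pins is hypothetical, and the sample log is copied from the output above rather than from a new run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read I_MPI_DEBUG output on stdin and print
# "<reporting rank> <thread> <processor>" for each async-thread pinning line.
async_pins() {
    sed -n 's/^\[\([0-9]*\)\] MPI startup(): threading: thread: \([0-9]*\), processor: \([0-9]*\)$/\1 \2 \3/p'
}

# Sample lines copied from the output above.
log='[0] MPI startup(): threading: async_progress: 1
[0] MPI startup(): threading: thread: 0, processor: 5
[0] MPI startup(): threading: thread: 1, processor: 6
[2] MPI startup(): global_rank 2, local_rank 0, local_size 2, threads_per_node 2'

printf '%s\n' "$log" | async_pins
```

With the real job, the mpiexec.hydra output would be piped into async_pins in place of the sample text. The helper only summarizes what the library prints, so it shows the second node's cores only in the run where rank 0 starts there.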

Thanks for reporting the issue.

Let us know if you have any other issues.

Regards

Prasanth

Viet · ‎12-25-2020

Hello Prasanth,
It's good to know that debug information comes from only one rank.
I now know how to set up the MPI environment to use the MPI-3 non-blocking collective functions.
Explicitly pinning additional threads requires complicated affinity settings.
I'm now thinking about using offloaded MPI Non-Blocking Collectives functions on InfiniBand. I will open questions about this in a new thread.
This thread can be closed here.
Thank you very much for your valuable answers.
Hope you have a great New Year's holiday!
With best regards,
Viet.

PrasanthD_intel · ‎12-28-2020

Hi Viet,

Glad we could be of help.

As the issue has been resolved, we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Regards

Prasanth