Respected Sir/Madam,
I am working on an HPC system, using Intel MPI from the Intel 2018.5.274 compilers.
When I run my code (which contains non-blocking calls) with
export I_MPI_ASYNC_PROGRESS=0
it takes less time than with
export I_MPI_ASYNC_PROGRESS=1
The following commands are in my job script:
$module load compiler/intel/2018.5.274
$source /opt/ohpc/pub/apps/intel/2018/compilers_and_libraries_2018.5.274/linux/mpi/intel64/bin/mpivars.sh release_mt
If you can help me regarding this, I would be very thankful.
Thanks & Regards,
Mohit Kumar,
CSE Department,
IIT Kanpur
Hi,
Thanks for posting in Intel Communities.
Could you please provide sample reproducer code and the steps to reproduce your issue at our end?
Could you also provide your OS details and version?
Thanks & Regards,
Hemanth.
Hi Hemanth,
Output of $ cat /etc/os-release:
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
Sample reproducer code (non_block.c):
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int myrank, size;
    double start_time, time, max_time;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Status status[size-1];
    MPI_Request request[size-1];

    int BUFSIZE = atoi(argv[1]);   /* message size (number of ints), taken from the command line */
    int arr[BUFSIZE];

    start_time = MPI_Wtime();
    if (myrank < size-1)
    {
        /* every rank except the last sends one message to the last rank */
        MPI_Send(arr, BUFSIZE, MPI_INT, size-1, 99, MPI_COMM_WORLD);
    }
    else
    {
        /* the last rank posts non-blocking receives for all senders, then waits for them */
        int recvarr[size][BUFSIZE];
        for (int i = 0; i < size-1; i++)
        {
            MPI_Irecv(recvarr[i], BUFSIZE, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &request[i]);
        }
        MPI_Waitall(size-1, request, status);
    }
    time = MPI_Wtime() - start_time;

    /* report the maximum elapsed time over all ranks */
    MPI_Reduce(&time, &max_time, 1, MPI_DOUBLE, MPI_MAX, size-1, MPI_COMM_WORLD);
    if (myrank == size-1) printf("%lf\n", max_time);

    MPI_Finalize();
    return 0;
}
Sample compile step:
$mpicc non_block.c -o nb_p2p
Sample job script:
#!/bin/bash
#SBATCH -N 2
#SBATCH --tasks-per-node=8
#SBATCH --error=err.out
#SBATCH --output=out.out
#SBATCH --time=00:05:00
#SBATCH --mem-per-cpu=4000MB
module load compiler/intel/2018.5.274
source /opt/ohpc/pub/apps/intel/2018/compilers_and_libraries_2018.5.274/linux/mpi/intel64/bin/mpivars.sh release_mt
export I_MPI_ASYNC_PROGRESS=0
for d in 4 512 16384 100000;
do
echo "without i_mpi_async_progress" >> result.txt
mpirun -np 16 ./nb_p2p $d >> result.txt
done
export I_MPI_ASYNC_PROGRESS=1
for d in 4 512 16384 100000;
do
echo "with i_mpi_async_progress" >> result.txt
mpirun -np 16 ./nb_p2p $d >> result.txt
done
Thanks & regards,
Mohit Kumar
Hi Mohit,
Were you able to solve your issue of performance getting worse with I_MPI_ASYNC_PROGRESS=1 instead of I_MPI_ASYNC_PROGRESS=0? I am having exactly the same issue and have posted it at https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/Asynchronous-progress-slows-down-my-program/m-p/1367494#M9284 .
Please let me know if you are able to fix it.
Regards,
Manasi
Hi Manasi,
Sorry to inform you that the issue is still not fixed.
Thanks,
Mohit Kumar.
Hi,
We are working on your issue and will get back to you soon.
Thanks & Regards,
Hemanth.
Hi,
We are still investigating your issue and will get back to you soon.
Thanks & Regards,
Hemanth
Hi,
We have reported this issue to the relevant development team; they are looking into it.
Thanks & Regards,
Hemanth
Hi Mohit,
There are a few things to note.
1) Asynchronous progress actually spawns extra threads dedicated to progressing the communication, which might end up oversubscribing your resources and thus slowing down the execution. By default, that is one progress thread per MPI rank. You can change this behaviour with explicit pinning, for example:
export I_MPI_PIN_PROCESSOR_LIST=1-4,6-9
export I_MPI_ASYNC_PROGRESS_PIN=0,0,0,0,5,5,5,5
export I_MPI_ASYNC_PROGRESS=1
export I_MPI_DEBUG=5
mpirun -np 8 ./nb_p2p 100000
The I_MPI_DEBUG=5 flag shows the pinning of the progress threads.
2) In your example there should be no benefit from asynchronous progress threads, because nothing happens between the creation of the requests (the MPI_Irecv loop) and the wait (the MPI_Waitall). In this case the main MPI process is just waiting for the progress thread to complete the communication and return the results. I have attached a modification of your code which basically includes some computation between the send/receive and the waitall (a sketch along those lines is shown below). In my tests there is a performance gain from using asynchronous progress for BUFSIZE=100000.
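Rafael's attachment is not reproduced in this thread, so the following is only a minimal sketch of the kind of modification described above: independent computation is inserted between posting the MPI_Irecv calls and the MPI_Waitall, which is where a dedicated progress thread can actually overlap communication with computation. The local_work() helper and the iteration counts are purely illustrative and not part of the original code.
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

/* CPU-only work that does not touch the communication buffers */
static double local_work(long n)
{
    double s = 0.0;
    for (long i = 1; i <= n; i++)
        s += 1.0 / (double)i;
    return s;
}

int main(int argc, char *argv[])
{
    int myrank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* assumes at least 2 ranks */

    int BUFSIZE = (argc > 1) ? atoi(argv[1]) : 100000;
    int *arr = calloc((size_t)BUFSIZE, sizeof(int));

    double start_time = MPI_Wtime();
    if (myrank < size-1)
    {
        MPI_Send(arr, BUFSIZE, MPI_INT, size-1, 99, MPI_COMM_WORLD);
    }
    else
    {
        MPI_Request request[size-1];
        MPI_Status  status[size-1];
        int *recvarr = malloc((size_t)(size-1) * BUFSIZE * sizeof(int));

        for (int i = 0; i < size-1; i++)
            MPI_Irecv(recvarr + (size_t)i * BUFSIZE, BUFSIZE, MPI_INT,
                      MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &request[i]);

        /* independent computation overlapped with the pending receives:
           with I_MPI_ASYNC_PROGRESS=1 the progress thread can move the
           messages along while the main thread runs this loop */
        double w = local_work(10000000L);
        if (w < 0.0) printf("%f\n", w);   /* keep the work from being optimized away */

        MPI_Waitall(size-1, request, status);
        free(recvarr);
    }
    double time = MPI_Wtime() - start_time, max_time;

    MPI_Reduce(&time, &max_time, 1, MPI_DOUBLE, MPI_MAX, size-1, MPI_COMM_WORLD);
    if (myrank == size-1) printf("%lf\n", max_time);

    free(arr);
    MPI_Finalize();
    return 0;
}
Built with the same mpicc command as before, this overlap pattern is where a measurable difference between I_MPI_ASYNC_PROGRESS=0 and 1 would be expected for large BUFSIZE.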
Let me know if that clarifies the issue!
Rafael
Hi Rafael,
I tried benchmarking IMB-NBC with Iallreduce using the latest intel-oneapi-mpi version, 2021.6.0.
Below is my SLURM script:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48 ## Max. Cores = 48
#SBATCH -p cpu ## gpu/standard
#SBATCH --exclusive
echo "SLURM_JOBID="$SLURM_JOBID
echo "SLURM_JOB_NODELIST"=$SLURM_JOB_NODELIST
echo "SLURM_NNODES"=$SLURM_NNODES
ulimit -s unlimited
ulimit -c unlimited
export I_MPI_DEBUG=5
source /home/apps/spack/opt/spack/linux-centos7-cascadelake/oneapi-2022.1.0/intel-oneapi-mpi-2021.6.0-rok3uz443uve4qyrm5t7uyojjyxuqrit/mpi/2021.6.0/env/vars.sh -i_mpi_ofi_internal=1 -i_mpi_library_kind=release_mt
#time I_MPI_ASYNC_PROGRESS=0 I_MPI_FABRICS=shm:ofi I_MPI_PIN_PROCESSOR_LIST=0-95 mpiexec.hydra -np 96 -ppn 48 IMB-NBC Iallreduce
time mpiexec.hydra -np 24 IMB-NBC Iallreduce
#With ASYC
#export I_MPI_PIN_PROCESSOR_LIST=1-4,6-9
#export I_MPI_ASYNC_PROGRESS_PIN=0,0,0,0,5,5,5,5
#export I_MPI_PIN_PROCESSOR_LIST=1-23,25-47
#export I_MPI_ASYNC_PROGRESS_PIN=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24
export I_MPI_PIN_PROCESSOR_LIST=0-23
export I_MPI_ASYNC_PROGRESS_PIN=24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47
export I_MPI_ASYNC_PROGRESS=1
export I_MPI_DEBUG=5
time mpiexec.hydra -np 24 IMB-NBC Iallreduce
I get the error below:
# Benchmarking Iallreduce
# #processes = 2
# ( 22 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_ovrl[usec] t_pure[usec] t_CPU[usec] overlap[%]
0 1000 7.74 6.35 6.23 76.19
4 1000 6.41 5.08 5.08 73.76
8 1000 6.40 5.12 5.06 73.84
16 1000 6.51 5.09 5.05 71.32
32 1000 6.51 5.15 5.04 71.37
64 1000 6.52 5.17 5.05 71.65
128 1000 6.86 5.42 5.29 71.01
256 1000 6.97 5.56 5.55 74.44
512 1000 6.84 5.43 5.30 71.49
1024 1000 7.07 5.78 5.78 77.74
2048 1000 7.71 6.26 6.24 76.45
4096 1000 9.98 8.71 8.68 85.03
8192 1000 10.75 9.56 9.40 85.82
16384 1000 12.41 11.10 11.06 87.81
32768 1000 17.78 16.46 16.35 91.32
65536 640 23.05 21.76 21.63 93.49
131072 320 34.74 33.47 33.40 96.01
262144 160 56.53 55.49 55.16 97.54
524288 80 124.26 123.22 122.89 98.88
1048576 40 260.57 259.91 259.19 99.47
2097152 20 473.92 472.98 472.38 99.67
4194304 10 963.34 963.10 961.07 99.76
Abort(472992015) on node 10 (rank 10 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(495)................: MPI_Comm_split(MPI_COMM_WORLD, color=-32766, key=10, new_comm=0x2548004) failed
PMPI_Comm_split(476)................:
MPIR_Comm_split_impl(244)...........:
MPIR_Get_contextid_sparse_group(484):
MPIR_Allreduce_intra_auto_safe(235).:
MPIR_Bcast_intra_auto(85)...........:
MPIR_Bcast_intra_binomial(131)......: message sizes do not match across processes in the collective routine: Received 256 but expected 4100
Abort(271665423) on node 13 (rank 13 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(495)................: MPI_Comm_split(MPI_COMM_WORLD, color=-32766, key=13, new_comm=0x1b06b24) failed
PMPI_Comm_split(476)................:
MPIR_Comm_split_impl(244)...........:
MPIR_Get_contextid_sparse_group(484):
MPIR_Allreduce_intra_auto_safe(235).:
MPIR_Bcast_intra_auto(85)...........:
MPIR_Bcast_intra_binomial(131)......: message sizes do not match across processes in the collective routine: Received 64 but expected 4100
Abort(271665423) on node 15 (rank 15 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(495)................: MPI_Comm_split(MPI_COMM_WORLD, color=-32766, key=15, new_comm=0x2319e74) failed
PMPI_Comm_split(476)................:
MPIR_Comm_split_impl(244)...........:
MPIR_Get_contextid_sparse_group(484):
MPIR_Allreduce_intra_auto_safe(235).:
MPIR_Bcast_intra_auto(85)...........:
MPIR_Bcast_intra_binomial(131)......: message sizes do not match across processes in the collective routine: Received 64 but expected 4100
Abort(3229967) on node 20 (rank 20 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(495)................: MPI_Comm_split(MPI_COMM_WORLD, color=-32766, key=20, new_comm=0x24f2e84) failed
PMPI_Comm_split(476)................:
MPIR_Comm_split_impl(244)...........:
MPIR_Get_contextid_sparse_group(484):
MPIR_Allreduce_intra_auto_safe(235).:
MPIR_Bcast_intra_auto(85)...........:
MPIR_Bcast_intra_binomial(131)......: message sizes do not match across processes in the collective routine: Received 256 but expected 4100
I tried the different combinations of async pinning that you can see commented out in the SLURM script.
Thanks
Samir Shaikh
Hi Samir,
Thank you for reporting. The error "message sizes do not match across processes in the collective routine" when asynchronous progress is enabled is a known bug in Intel MPI 2021.6. It happens when some of the topology-aware implementations of MPI_Bcast are called (in this case MPIR_Bcast_intra_binomial). You can try working around the problem by enforcing another algorithm, e.g.
export I_MPI_BCAST_ADJUST=3
Another workaround is to disable topology-aware collectives with:
export I_MPI_CBWR=1
See more here:
https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-windows/top/environment-variable-reference/i-mpi-adjust-family-environment-variables.html
Cheers!
Rafael
The environment variable is actually I_MPI_ADJUST_BCAST.
Maybe also try I_MPI_CBWR=2.