Respected Sir/Madam,
I am working on an HPC system, using Intel MPI from the Intel 2018.5.274 compilers.
When I run my code (which contains non-blocking calls) with
export I_MPI_ASYNC_PROGRESS=0
it takes less time than with
export I_MPI_ASYNC_PROGRESS=1
The following commands are in my job script:
$module load compiler/intel/2018.5.274
$source /opt/ohpc/pub/apps/intel/2018/compilers_and_libraries_2018.5.274/linux/mpi/intel64/bin/mpivars.sh release_mt
If you can help me regarding this, I would be very thankful.
Thanks & Regards,
Mohit Kumar,
CSE Department,
IIT Kanpur
Hi,
Thanks for posting in Intel Communities.
Could you please provide sample reproducer code and the steps to reproduce your issue at our end?
Could you also provide your OS details and version?
Thanks & Regards,
Hemanth.
Hi Hemanth,
Output of $ cat /etc/os-release:
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
Sample reproducer code (non_block.c):
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int myrank, size;
    double start_time, time, max_time;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Status status[size-1];
    MPI_Request request[size-1];

    int BUFSIZE = atoi(argv[1]);   /* message size (number of ints), taken from the command line */
    int arr[BUFSIZE];

    start_time = MPI_Wtime();
    if (myrank < size-1)
    {
        /* every rank except the last sends one message to the last rank */
        MPI_Send(arr, BUFSIZE, MPI_INT, size-1, 99, MPI_COMM_WORLD);
    }
    else
    {
        /* the last rank posts non-blocking receives for all senders, then waits for them */
        int recvarr[size][BUFSIZE];
        for (int i = 0; i < size-1; i++)
        {
            MPI_Irecv(recvarr[i], BUFSIZE, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &request[i]);
        }
        MPI_Waitall(size-1, request, status);
    }
    time = MPI_Wtime() - start_time;

    /* report the maximum elapsed time over all ranks */
    MPI_Reduce(&time, &max_time, 1, MPI_DOUBLE, MPI_MAX, size-1, MPI_COMM_WORLD);
    if (myrank == size-1) printf("%lf\n", max_time);

    MPI_Finalize();
    return 0;
}
Sample compile step:
$mpicc non_block.c -o nb_p2p
Sample job script:
#!/bin/bash
#SBATCH -N 2
#SBATCH --tasks-per-node=8
#SBATCH --error=err.out
#SBATCH --output=out.out
#SBATCH --time=00:05:00
#SBATCH --mem-per-cpu=4000MB
module load compiler/intel/2018.5.274
source /opt/ohpc/pub/apps/intel/2018/compilers_and_libraries_2018.5.274/linux/mpi/intel64/bin/mpivars.sh release_mt
export I_MPI_ASYNC_PROGRESS=0
for d in 4 512 16384 100000;
do
echo "without i_mpi_async_progress" >> result.txt
mpirun -np 16 ./nb_p2p $d >> result.txt
done
export I_MPI_ASYNC_PROGRESS=1
for d in 4 512 16384 100000;
do
echo "with i_mpi_async_progress" >> result.txt
mpirun -np 16 ./nb_p2p $d >> result.txt
done
Thanks & regards,
Mohit Kumar
Hi Mohit,
Were you able to solve your issue of performance getting worse with I_MPI_ASYNC_PROGRESS=1 instead of I_MPI_ASYNC_PROGRESS=0? I am having exactly the same issue and have posted it at https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/Asynchronous-progress-slows-down-my-program/m-p/1367494#M9284 .
Please let me know if you are able to fix it.
Regards,
Manasi
Hi Manasi,
Sorry to inform you that the issue is still not fixed.
Thanks,
Mohit Kumar.
Hi,
We are working on your issue and will get back to you soon.
Thanks & Regards,
Hemanth.
Hi,
We are still investigating your issue and will get back to you soon.
Thanks & Regards,
Hemanth
Hi,
We have reported this issue to the relevant development team; they are looking into it.
Thanks & Regards,
Hemanth
Hi Mohit,
There are a few things to note.
1) Asynchronous progress actually spawns extra threads dedicated to progressing the communication, which might end up oversubscribing your resources and thus slowing down the execution. By default, that is one progress thread per MPI rank. You can change this behaviour with explicit pinning, for example:
export I_MPI_PIN_PROCESSOR_LIST=1-4,6-9
export I_MPI_ASYNC_PROGRESS_PIN=0,0,0,0,5,5,5,5
export I_MPI_ASYNC_PROGRESS=1
export I_MPI_DEBUG=5
mpirun -np 8 ./nb_p2p 100000
The I_MPI_DEBUG=5 flag shows the pinning of the progress threads.
2) In your example there should be no benefit from asynchronous progress threads, because nothing happens between the creation of the requests (the MPI_Irecv loop) and the wait (the MPI_Waitall). In this case the main MPI process is just waiting for the progress thread to complete the communication and return the results. I have attached a modification of your code which basically includes some computation between the send/receive and the waitall (a sketch along those lines is shown below). In my tests there is a performance gain from using asynchronous progress for BUFSIZE=100000.
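Rafael's attachment is not reproduced in this thread, so the following is only a minimal sketch of the kind of modification described above: independent computation is inserted between posting the MPI_Irecv calls and the MPI_Waitall, which is where a dedicated progress thread can actually overlap communication with computation. The local_work() helper and the iteration counts are purely illustrative and not part of the original code.
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

/* CPU-only work that does not touch the communication buffers */
static double local_work(long n)
{
    double s = 0.0;
    for (long i = 1; i <= n; i++)
        s += 1.0 / (double)i;
    return s;
}

int main(int argc, char *argv[])
{
    int myrank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* assumes at least 2 ranks */

    int BUFSIZE = (argc > 1) ? atoi(argv[1]) : 100000;
    int *arr = calloc((size_t)BUFSIZE, sizeof(int));

    double start_time = MPI_Wtime();
    if (myrank < size-1)
    {
        MPI_Send(arr, BUFSIZE, MPI_INT, size-1, 99, MPI_COMM_WORLD);
    }
    else
    {
        MPI_Request request[size-1];
        MPI_Status  status[size-1];
        int *recvarr = malloc((size_t)(size-1) * BUFSIZE * sizeof(int));

        for (int i = 0; i < size-1; i++)
            MPI_Irecv(recvarr + (size_t)i * BUFSIZE, BUFSIZE, MPI_INT,
                      MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &request[i]);

        /* independent computation overlapped with the pending receives:
           with I_MPI_ASYNC_PROGRESS=1 the progress thread can move the
           messages along while the main thread runs this loop */
        double w = local_work(10000000L);
        if (w < 0.0) printf("%f\n", w);   /* keep the work from being optimized away */

        MPI_Waitall(size-1, request, status);
        free(recvarr);
    }
    double time = MPI_Wtime() - start_time, max_time;

    MPI_Reduce(&time, &max_time, 1, MPI_DOUBLE, MPI_MAX, size-1, MPI_COMM_WORLD);
    if (myrank == size-1) printf("%lf\n", max_time);

    free(arr);
    MPI_Finalize();
    return 0;
}
Built with the same mpicc command as before, this overlap pattern is where a measurable difference between I_MPI_ASYNC_PROGRESS=0 and 1 would be expected for large BUFSIZE.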
Let me know if that clarifies the issue!
Rafael
Hi Rafael,
I tried benchmarking IMB-NBC with Iallreduce using the latest intel-oneapi-mpi version, 2021.6.0.
Below is my SLURM script:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48 ## Max. Cores = 48
#SBATCH -p cpu ## gpu/standard
#SBATCH --exclusive
echo "SLURM_JOBID="$SLURM_JOBID
echo "SLURM_JOB_NODELIST"=$SLURM_JOB_NODELIST
echo "SLURM_NNODES"=$SLURM_NNODES
ulimit -s unlimited
ulimit -c unlimited
export I_MPI_DEBUG=5
source /home/apps/spack/opt/spack/linux-centos7-cascadelake/oneapi-2022.1.0/intel-oneapi-mpi-2021.6.0-rok3uz443uve4qyrm5t7uyojjyxuqrit/mpi/2021.6.0/env/vars.sh -i_mpi_ofi_internal=1 -i_mpi_library_kind=release_mt
#time I_MPI_ASYNC_PROGRESS=0 I_MPI_FABRICS=shm:ofi I_MPI_PIN_PROCESSOR_LIST=0-95 mpiexec.hydra -np 96 -ppn 48 IMB-NBC Iallreduce
time mpiexec.hydra -np 24 IMB-NBC Iallreduce
#With ASYC
#export I_MPI_PIN_PROCESSOR_LIST=1-4,6-9
#export I_MPI_ASYNC_PROGRESS_PIN=0,0,0,0,5,5,5,5
#export I_MPI_PIN_PROCESSOR_LIST=1-23,25-47
#export I_MPI_ASYNC_PROGRESS_PIN=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24
export I_MPI_PIN_PROCESSOR_LIST=0-23
export I_MPI_ASYNC_PROGRESS_PIN=24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47
export I_MPI_ASYNC_PROGRESS=1
export I_MPI_DEBUG=5
time mpiexec.hydra -np 24 IMB-NBC Iallreduce
I get the error below:
# Benchmarking Iallreduce
# #processes = 2
# ( 22 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_ovrl[usec] t_pure[usec] t_CPU[usec] overlap[%]
0 1000 7.74 6.35 6.23 76.19
4 1000 6.41 5.08 5.08 73.76
8 1000 6.40 5.12 5.06 73.84
16 1000 6.51 5.09 5.05 71.32
32 1000 6.51 5.15 5.04 71.37
64 1000 6.52 5.17 5.05 71.65
128 1000 6.86 5.42 5.29 71.01
256 1000 6.97 5.56 5.55 74.44
512 1000 6.84 5.43 5.30 71.49
1024 1000 7.07 5.78 5.78 77.74
2048 1000 7.71 6.26 6.24 76.45
4096 1000 9.98 8.71 8.68 85.03
8192 1000 10.75 9.56 9.40 85.82
16384 1000 12.41 11.10 11.06 87.81
32768 1000 17.78 16.46 16.35 91.32
65536 640 23.05 21.76 21.63 93.49
131072 320 34.74 33.47 33.40 96.01
262144 160 56.53 55.49 55.16 97.54
524288 80 124.26 123.22 122.89 98.88
1048576 40 260.57 259.91 259.19 99.47
2097152 20 473.92 472.98 472.38 99.67
4194304 10 963.34 963.10 961.07 99.76
Abort(472992015) on node 10 (rank 10 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(495)................: MPI_Comm_split(MPI_COMM_WORLD, color=-32766, key=10, new_comm=0x2548004) failed
PMPI_Comm_split(476)................:
MPIR_Comm_split_impl(244)...........:
MPIR_Get_contextid_sparse_group(484):
MPIR_Allreduce_intra_auto_safe(235).:
MPIR_Bcast_intra_auto(85)...........:
MPIR_Bcast_intra_binomial(131)......: message sizes do not match across processes in the collective routine: Received 256 but expected 4100
Abort(271665423) on node 13 (rank 13 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(495)................: MPI_Comm_split(MPI_COMM_WORLD, color=-32766, key=13, new_comm=0x1b06b24) failed
PMPI_Comm_split(476)................:
MPIR_Comm_split_impl(244)...........:
MPIR_Get_contextid_sparse_group(484):
MPIR_Allreduce_intra_auto_safe(235).:
MPIR_Bcast_intra_auto(85)...........:
MPIR_Bcast_intra_binomial(131)......: message sizes do not match across processes in the collective routine: Received 64 but expected 4100
Abort(271665423) on node 15 (rank 15 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(495)................: MPI_Comm_split(MPI_COMM_WORLD, color=-32766, key=15, new_comm=0x2319e74) failed
PMPI_Comm_split(476)................:
MPIR_Comm_split_impl(244)...........:
MPIR_Get_contextid_sparse_group(484):
MPIR_Allreduce_intra_auto_safe(235).:
MPIR_Bcast_intra_auto(85)...........:
MPIR_Bcast_intra_binomial(131)......: message sizes do not match across processes in the collective routine: Received 64 but expected 4100
Abort(3229967) on node 20 (rank 20 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(495)................: MPI_Comm_split(MPI_COMM_WORLD, color=-32766, key=20, new_comm=0x24f2e84) failed
PMPI_Comm_split(476)................:
MPIR_Comm_split_impl(244)...........:
MPIR_Get_contextid_sparse_group(484):
MPIR_Allreduce_intra_auto_safe(235).:
MPIR_Bcast_intra_auto(85)...........:
MPIR_Bcast_intra_binomial(131)......: message sizes do not match across processes in the collective routine: Received 256 but expected 4100
I tried the different combinations of async pinning that you can see commented out in the SLURM script.
Thanks
Samir Shaikh
Hi Samir,
Thank you for reporting. The error "message sizes do not match across processes in the collective routine" when asynchronous progress is enabled is a known bug in Intel MPI 2021.6. It happens when some of the topology-aware implementations of MPI_Bcast are called (in this case MPIR_Bcast_intra_binomial). You can try working around the problem by enforcing another algorithm, e.g.
export I_MPI_BCAST_ADJUST=3
Another workaround is to disable topology-aware collectives with:
export I_MPI_CBWR=1
See more here:
https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-windows/top/environment-variable-reference/i-mpi-adjust-family-environment-variables.html
Cheers!
Rafael
The environment variable is actually I_MPI_ADJUST_BCAST.
Maybe also try I_MPI_CBWR=2.