I want to apply the autotuner to an HPC application. The following is what I do:
# run it 10 times to get a baseline
I_MPI_COLL_EXTERNAL=1 mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} ./application.exe
# get the tuned dat file
I_MPI_COLL_EXTERNAL=1 I_MPI_TUNING_MODE=auto I_MPI_TUNING_BIN_DUMP=/path/to/dumpfile I_MPI_TUNING_AUTO_ITER_NUM=1000 mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} ./application.exe
# run 10 times to get tuned result
I_MPI_COLL_EXTERNAL=1 I_MPI_TUNING_BIN=/path/to/dumpfile mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} ./application.exe
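For reference, this is roughly how I collect the 10 baseline and tuned timings. It is only a sketch of my measurement loop; the /usr/bin/time call and the log file names are my own additions, not part of the autotuner workflow:
# minimal sketch of the 10-run measurement
for i in $(seq 1 10); do
    I_MPI_COLL_EXTERNAL=1 /usr/bin/time -o baseline_run_${i}.time \
        mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} ./application.exe
done
# repeat the same loop with I_MPI_TUNING_BIN=/path/to/dumpfile set for the tuned runs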
But the results show that the performance with the tuned MPI is slightly worse than with the defaults.
Is there something I am missing?
Why does the autotuner have a negative effect?
APS info about the application:
| Summary information
|--------------------------------------------------------------------
Application : pw.x
Report creation date : 2021-05-31 17:27:19
Number of ranks : 384
Ranks per node : 48
OpenMP threads number per rank: 1
Used statistics : aps_result_20210531/
|
| Your application is MPI bound.
| This may be caused by high busy wait time inside the library (imbalance), non-optimal communication schema or MPI library settings. Use MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore performance bottlenecks.
|
Elapsed time: 469.42 sec
MPI Time: 382.09 sec 81.53%
| Your application is MPI bound. This may be caused by high busy wait time
| inside the library (imbalance), non-optimal communication schema or MPI
| library settings. Explore the MPI Imbalance metric if it is available or use
| MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore
| possible performance bottlenecks.
MPI Imbalance: 217.13 sec 46.33%
| The application workload is not well balanced between MPI ranks. For more
| details about the MPI communication scheme use Intel(R) Trace Analyzer and
| Collector available as part of Intel(R) Parallel Studio XE Cluster Edition.
Top 5 MPI functions (avg time):
Alltoall 202.84 sec (43.21 %)
Barrier 106.51 sec (22.69 %)
Bcast 35.63 sec ( 7.59 %)
Allreduce 14.54 sec ( 3.10 %)
Reduce 7.50 sec ( 1.60 %)
Disk I/O Bound: 0.06 sec ( 0.01 %)
Data read: 653.5 MB
Data written: 2.5 GB
Hi,
Thanks for reaching out to us.
Could you please provide the reproducer code so that we can investigate your issue further?
Could you also provide the steps to reproduce the issue?
Thanks & Regards
Shivani
How to reproduce:
# The HPC application is openfoam
# Download:
1. https://sourceforge.net/projects/openfoam/files/v1912/OpenFOAM-v1912.tgz
2. https://sourceforge.net/projects/openfoam/files/v1912/ThirdParty-v1912.tgz
# Compile
# source the oneAPI environment variables. The version is 2021.2
export AppDir=/path/to/install/openfoam
mkdir -p ${AppDir}
tar xf OpenFOAM-v1912.tgz -C ${AppDir}
tar xf ThirdParty-v1912.tgz -C ${AppDir}
source ${AppDir}/OpenFOAM-v1912/etc/bashrc
cd ${AppDir}/OpenFOAM-v1912
# change compiler and MPI to icc and INTELMPI
sed -i "s/^export WM_COMPILER=.*/export WM_COMPILER=Icc/" etc/bashrc
sed -i "s/^export WM_MPLIB=.*/export WM_MPLIB=INTELMPI/" etc/bashrc
cd ${AppDir}/ThirdParty-v1912
source ../env.sh
./Allwmake -j
cd ${AppDir}/OpenFOAM-v1912
./Allwmake -j
# how to run
# the workload is miracar
export OMP_NUM_THREADS=1
export I_MPI_HYDRA_IFACE=ib0
export I_MPI_TUNING_AUTO_STORAGE_SIZE=10485760
source /opt/compiler/intel/oneapi/setvars.sh
# for every host, echo 3 > /proc/sys/vm/drop_caches (see the sketch after these steps)
# how to tune
I_MPI_COLL_EXTERNAL=1 I_MPI_TUNING_MODE=auto I_MPI_TUNING_BIN_DUMP=${tuning_dat} I_MPI_TUNING_AUTO_ITER_NUM=200 mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} simpleFoam -parallel
# how to get tuned result
I_MPI_COLL_EXTERNAL=1 I_MPI_TUNING_BIN=${tuning_dat} mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} simpleFoam -parallel
# how to get the baseline
I_MPI_COLL_EXTERNAL=1 mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} simpleFoam -parallel
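To drop the page cache on every host before a measured run, I use something like the following. It is a minimal sketch: it assumes passwordless ssh as root to each host and that ${hostfile} lists one hostname per line.
# drop the page cache on all hosts (assumes passwordless root ssh, one hostname per line in ${hostfile})
while read -r host; do
    ssh "${host}" "echo 3 > /proc/sys/vm/drop_caches"
done < "${hostfile}"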
Hi,
Could you please follow the steps below and provide us the APS report so that we can investigate your issue further:
1. export APS_STAT_LEVEL=5
2. mpiexec.hydra -n 4 aps IMB-MPI1 alltoall -msglog 2:8 -npmin 4
--> You will see a statement like the one below at the end of the execution.
"Intel(R) oneAPI VTune(TM) Profiler 2021.2.0 collection completed successfully. Use the "aps --report /home/u67125/aps_result_20210607" command to generate textual and HTML reports for the profiling session."
Now, as suggested above, run:
3. aps-report /home/u67125/aps_result_20210607 -mPDF
Thanks & Regards
Shivani
Output of IMB:
[root@6248r-node121 osu]# mpiexec.hydra -n 4 aps IMB-MPI1 alltoall -msglog 2:8 -npmin 4
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.2, MPI-1 part
#----------------------------------------------------------------
# Date : Wed Jun 16 09:04:21 2021
# Machine : x86_64
# System : Linux
# Release : 3.10.0-957.el7.x86_64
# Version : #1 SMP Thu Nov 8 23:39:32 UTC 2018
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# IMB-MPI1 alltoall -msglog 2:8 -npmin 4
# Minimum message length in bytes: 0
# Maximum message length in bytes: 256
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Alltoall
#----------------------------------------------------------------
# Benchmarking Alltoall
# #processes = 4
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.22 0.24 0.23
4 1000 2.25 2.42 2.33
8 1000 2.24 2.50 2.36
16 1000 2.30 2.64 2.40
32 1000 2.31 2.64 2.42
64 1000 2.27 2.45 2.37
128 1000 2.29 2.52 2.40
256 1000 2.47 2.69 2.59
# All processes entering MPI_Finalize
Intel(R) oneAPI VTune(TM) Profiler 2021.2.0 collection completed successfully. Use the "aps --report /home/software/intelmpi_test/osu/aps_result_20210616" command to generate textual and HTML reports for the profiling session.
Output of aps:
[root@6248r-node121 osu]# aps --report /home/software/intelmpi_test/osu/aps_result_20210616 -mPDF
Loading 100.00%
| Message Sizes summary for all Ranks
|--------------------------------------------------------------------------------------------------------------------------
| Function Message size(B) Volume(MB) Volume(%) Transfers Time(sec) Time(%)
|--------------------------------------------------------------------------------------------------------------------------
MPI_Alltoall ALL 12.21 99.98 67232 0.15 65.46
4 0.10 0.79 8404 0.03 20.40
16 0.38 3.15 8404 0.02 15.14
256 6.16 50.39 8404 0.02 13.78
32 0.77 6.30 8404 0.02 12.67
128 3.08 25.20 8404 0.02 12.66
64 1.54 12.60 8404 0.02 12.41
8 0.19 1.57 8404 0.02 12.36
0 0.00 0.00 8404 0.00 0.58
|--------------------------------------------------------------------------------------------------------------------------
MPI_Barrier ALL 0.00 0.00 67520 0.08 33.94
0 0.00 100.00 67520 0.08 100.00
|--------------------------------------------------------------------------------------------------------------------------
MPI_Allreduce ALL 0.00 0.02 128 0.00 0.42
4 0.00 33.33 64 0.00 82.03
8 0.00 66.67 64 0.00 17.97
|--------------------------------------------------------------------------------------------------------------------------
MPI_Bcast ALL 0.00 0.00 32 0.00 0.13
4 0.00 100.00 32 0.00 100.00
|--------------------------------------------------------------------------------------------------------------------------
MPI_Gather ALL 0.00 0.00 32 0.00 0.05
8 0.00 100.00 32 0.00 100.00
|==========================================================================================================================
| TOTAL 12.22 100.00 134944 0.23 100.00
|
If I run IMB on 8 hosts, the output of IMB is:
[root@6248r-node121 osu]# mpiexec.hydra --hostfile /home/software/hostfiles/hostfile8 -n 4 aps IMB-MPI1 alltoall -msglog 2:8 -npmin 4
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.2, MPI-1 part
#----------------------------------------------------------------
# Date : Wed Jun 16 09:12:09 2021
# Machine : x86_64
# System : Linux
# Release : 3.10.0-957.el7.x86_64
# Version : #1 SMP Thu Nov 8 23:39:32 UTC 2018
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# IMB-MPI1 alltoall -msglog 2:8 -npmin 4
# Minimum message length in bytes: 0
# Maximum message length in bytes: 256
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Alltoall
#----------------------------------------------------------------
# Benchmarking Alltoall
# #processes = 4
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.22 0.23 0.22
4 1000 2.20 2.43 2.32
8 1000 2.18 2.39 2.32
16 1000 2.24 2.59 2.35
32 1000 2.27 2.62 2.37
64 1000 2.24 2.50 2.35
128 1000 2.27 2.43 2.35
256 1000 2.43 2.75 2.57
# All processes entering MPI_Finalize
Intel(R) oneAPI VTune(TM) Profiler 2021.2.0 collection completed successfully. Use the "aps --report /home/software/intelmpi_test/osu/aps_result_20210616" command to generate textual and HTML reports for the profiling session.
Output of aps:
[root@6248r-node121 osu]# aps --report /home/software/intelmpi_test/osu/aps_result_20210616 -mPDF
Loading 100.00%
| Message Sizes summary for all Ranks
|--------------------------------------------------------------------------------------------------------------------------
| Function Message size(B) Volume(MB) Volume(%) Transfers Time(sec) Time(%)
|--------------------------------------------------------------------------------------------------------------------------
MPI_Alltoall ALL 12.21 99.98 67232 0.15 70.29
4 0.10 0.79 8404 0.03 21.06
256 6.16 50.39 8404 0.02 13.94
16 0.38 3.15 8404 0.02 13.80
128 3.08 25.20 8404 0.02 12.80
64 1.54 12.60 8404 0.02 12.66
32 0.77 6.30 8404 0.02 12.65
8 0.19 1.57 8404 0.02 12.50
0 0.00 0.00 8404 0.00 0.59
|--------------------------------------------------------------------------------------------------------------------------
MPI_Barrier ALL 0.00 0.00 67520 0.06 29.06
0 0.00 100.00 67520 0.06 100.00
|--------------------------------------------------------------------------------------------------------------------------
MPI_Allreduce ALL 0.00 0.02 128 0.00 0.46
4 0.00 33.33 64 0.00 80.02
8 0.00 66.67 64 0.00 19.98
|--------------------------------------------------------------------------------------------------------------------------
MPI_Bcast ALL 0.00 0.00 32 0.00 0.14
4 0.00 100.00 32 0.00 100.00
|--------------------------------------------------------------------------------------------------------------------------
MPI_Gather ALL 0.00 0.00 32 0.00 0.05
8 0.00 100.00 32 0.00 100.00
|==========================================================================================================================
| TOTAL 12.22 100.00 134944 0.21 100.00
|
I have posted the output of IMB and APS.
But I don't think the key is the performance of my server. The autotuner works well and achieves about a 1.7% performance improvement on another HPC application, WRF.
I think the key is a property of the application itself.
Are there conditions under which the autotuner achieves better performance, and conditions under which it achieves roughly zero improvement or even a negative effect?
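To check whether a difference on the order of 1.7% is outside run-to-run noise, I compare the mean and spread of the 10 elapsed times for each configuration, roughly like this (a sketch; a times.txt file with one elapsed time per line is just my own convention):
# mean and sample standard deviation of the elapsed times (one value per line in times.txt)
awk '{ n++; sum += $1; sumsq += $1 * $1 }
     END { mean = sum / n;
           printf "mean=%.2f s  stddev=%.2f s\n", mean, sqrt((sumsq - n * mean * mean) / (n - 1)) }' times.txt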
Hi,
Thank you for providing the requested details. Could you please provide the same details for your application?
1. export APS_STAT_LEVEL=5
2. mpiexec.hydra -n 4 aps <app name> alltoall -msglog 2:8 -npmin 4
--> You will see a statement like the one below at the end of the execution.
"Intel(R) oneAPI VTune(TM) Profiler 2021.2.0 collection completed successfully. Use the "aps --report /home/u67125/aps_result_20210607" command to generate textual and HTML reports for the profiling session."
Now, as suggested above, run:
3. aps-report /home/u67125/aps_result_20210607 -mPDF
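For your application this would look roughly like the sketch below. It is only an illustration: the hostfile, ppn, and the simpleFoam binary are taken from your earlier posts, and the IMB-specific arguments (alltoall -msglog 2:8 -npmin 4) are presumably dropped when profiling your own binary.
export APS_STAT_LEVEL=5
mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} aps simpleFoam -parallel
aps-report <result directory printed at the end of the run> -mPDF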
Thanks & Regards
Shivani
Hi,
As we didn't hear back from you, is your issue resolved? If not, please provide the details requested in my previous post.
Thanks & Regards
Shivani
Hi,
As we have not heard back from you, we assume that your issue has been resolved and that we have answered all your queries, so we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.
Have a Good day!
Thanks & Regards
Shivani
