I want to apply the autotuner to an HPC application. The following is what I do:
# run it 10 times to get a baseline
I_MPI_COLL_EXTERNAL=1 mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} ./application.exe
# get the tuned dat file
I_MPI_COLL_EXTERNAL=1 I_MPI_TUNING_MODE=auto I_MPI_TUNING_BIN_DUMP=/path/to/dumpfile I_MPI_TUNING_AUTO_ITER_NUM=1000 mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} ./application.exe
# run 10 times to get tuned result
I_MPI_COLL_EXTERNAL=1 I_MPI_TUNING_BIN=/path/to/dumpfile mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} ./application.exe
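For reference, this is roughly how I collect the 10 baseline and tuned timings. It is only a sketch of my measurement loop; the /usr/bin/time call and the log file names are my own additions, not part of the autotuner workflow:
# minimal sketch of the 10-run measurement
for i in $(seq 1 10); do
    I_MPI_COLL_EXTERNAL=1 /usr/bin/time -o baseline_run_${i}.time \
        mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} ./application.exe
done
# repeat the same loop with I_MPI_TUNING_BIN=/path/to/dumpfile set for the tuned runs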
But the results show that the performance with the tuned MPI is slightly worse than with the defaults.
Is there something I am missing?
Why does the autotuner have a negative effect?
APS info about the application:
| Summary information
|--------------------------------------------------------------------
Application : pw.x
Report creation date : 2021-05-31 17:27:19
Number of ranks : 384
Ranks per node : 48
OpenMP threads number per rank: 1
Used statistics : aps_result_20210531/
|
| Your application is MPI bound.
| This may be caused by high busy wait time inside the library (imbalance), non-optimal communication schema or MPI library settings. Use MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore performance bottlenecks.
|
Elapsed time: 469.42 sec
MPI Time: 382.09 sec 81.53%
| Your application is MPI bound. This may be caused by high busy wait time
| inside the library (imbalance), non-optimal communication schema or MPI
| library settings. Explore the MPI Imbalance metric if it is available or use
| MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore
| possible performance bottlenecks.
MPI Imbalance: 217.13 sec 46.33%
| The application workload is not well balanced between MPI ranks. For more
| details about the MPI communication scheme use Intel(R) Trace Analyzer and
| Collector available as part of Intel(R) Parallel Studio XE Cluster Edition.
Top 5 MPI functions (avg time):
Alltoall 202.84 sec (43.21 %)
Barrier 106.51 sec (22.69 %)
Bcast 35.63 sec ( 7.59 %)
Allreduce 14.54 sec ( 3.10 %)
Reduce 7.50 sec ( 1.60 %)
Disk I/O Bound: 0.06 sec ( 0.01 %)
Data read: 653.5 MB
Data written: 2.5 GB
Hi,
Thanks for reaching out to us.
Could you please provide the reproducer code so that we can investigate your issue further?
Could you also provide the steps to reproduce the issue?
Thanks & Regards
Shivani
How to reproduce:
# The HPC application is openfoam
# Download:
1. https://sourceforge.net/projects/openfoam/files/v1912/OpenFOAM-v1912.tgz
2. https://sourceforge.net/projects/openfoam/files/v1912/ThirdParty-v1912.tgz
# Compile
# source the oneAPI environment variables. The version is 2021.2
export AppDir=/path/to/install/openfoam
mkdir -p ${AppDir}
tar xf OpenFOAM-v1912.tgz -C ${AppDir}
tar xf ThirdParty-v1912.tgz -C ${AppDir}
source ${AppDir}/OpenFOAM-v1912/etc/bashrc
cd ${AppDir}/OpenFOAM-v1912
# change compiler and MPI to icc and INTELMPI
sed -i "s/^export WM_COMPILER=.*/export WM_COMPILER=Icc/" etc/bashrc
sed -i "s/^export WM_MPLIB=.*/export WM_MPLIB=INTELMPI/" etc/bashrc
cd ${AppDir}/ThirdParty-v1912
source ../env.sh
./Allwmake -j
cd ${AppDir}/OpenFOAM-v1912
./Allwmake -j
# how to run
# the workload is miracar
export OMP_NUM_THREADS=1
export I_MPI_HYDRA_IFACE=ib0
export I_MPI_TUNING_AUTO_STORAGE_SIZE=10485760
source /opt/compiler/intel/oneapi/setvars.sh
# for every host, echo 3 > /proc/sys/vm/drop_caches (see the sketch after these steps)
# how to tune
I_MPI_COLL_EXTERNAL=1 I_MPI_TUNING_MODE=auto I_MPI_TUNING_BIN_DUMP=${tuning_dat} I_MPI_TUNING_AUTO_ITER_NUM=200 mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} simpleFoam -parallel
# how to get tuned result
I_MPI_COLL_EXTERNAL=1 I_MPI_TUNING_BIN=${tuning_dat} mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} simpleFoam -parallel
# how to get the baseline
I_MPI_COLL_EXTERNAL=1 mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} simpleFoam -parallel
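To drop the page cache on every host before a measured run, I use something like the following. It is a minimal sketch: it assumes passwordless ssh as root to each host and that ${hostfile} lists one hostname per line.
# drop the page cache on all hosts (assumes passwordless root ssh, one hostname per line in ${hostfile})
while read -r host; do
    ssh "${host}" "echo 3 > /proc/sys/vm/drop_caches"
done < "${hostfile}"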
Hi,
Could you please follow the steps below and provide us the APS report so that we can investigate your issue further:
1. export APS_STAT_LEVEL=5
2. mpiexec.hydra -n 4 aps IMB-MPI1 alltoall -msglog 2:8 -npmin 4
--> You will see a statement like the one below at the end of the execution.
"Intel(R) oneAPI VTune(TM) Profiler 2021.2.0 collection completed successfully. Use the "aps --report /home/u67125/aps_result_20210607" command to generate textual and HTML reports for the profiling session."
Now, as suggested above, run:
3. aps-report /home/u67125/aps_result_20210607 -mPDF
Thanks & Regards
Shivani
Output of IMB:
[root@6248r-node121 osu]# mpiexec.hydra -n 4 aps IMB-MPI1 alltoall -msglog 2:8 -npmin 4
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.2, MPI-1 part
#----------------------------------------------------------------
# Date : Wed Jun 16 09:04:21 2021
# Machine : x86_64
# System : Linux
# Release : 3.10.0-957.el7.x86_64
# Version : #1 SMP Thu Nov 8 23:39:32 UTC 2018
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# IMB-MPI1 alltoall -msglog 2:8 -npmin 4
# Minimum message length in bytes: 0
# Maximum message length in bytes: 256
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Alltoall
#----------------------------------------------------------------
# Benchmarking Alltoall
# #processes = 4
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.22 0.24 0.23
4 1000 2.25 2.42 2.33
8 1000 2.24 2.50 2.36
16 1000 2.30 2.64 2.40
32 1000 2.31 2.64 2.42
64 1000 2.27 2.45 2.37
128 1000 2.29 2.52 2.40
256 1000 2.47 2.69 2.59
# All processes entering MPI_Finalize
Intel(R) oneAPI VTune(TM) Profiler 2021.2.0 collection completed successfully. Use the "aps --report /home/software/intelmpi_test/osu/aps_result_20210616" command to generate textual and HTML reports for the profiling session.
Output of aps:
[root@6248r-node121 osu]# aps --report /home/software/intelmpi_test/osu/aps_result_20210616 -mPDF
Loading 100.00%
| Message Sizes summary for all Ranks
|--------------------------------------------------------------------------------------------------------------------------
| Function Message size(B) Volume(MB) Volume(%) Transfers Time(sec) Time(%)
|--------------------------------------------------------------------------------------------------------------------------
MPI_Alltoall ALL 12.21 99.98 67232 0.15 65.46
4 0.10 0.79 8404 0.03 20.40
16 0.38 3.15 8404 0.02 15.14
256 6.16 50.39 8404 0.02 13.78
32 0.77 6.30 8404 0.02 12.67
128 3.08 25.20 8404 0.02 12.66
64 1.54 12.60 8404 0.02 12.41
8 0.19 1.57 8404 0.02 12.36
0 0.00 0.00 8404 0.00 0.58
|--------------------------------------------------------------------------------------------------------------------------
MPI_Barrier ALL 0.00 0.00 67520 0.08 33.94
0 0.00 100.00 67520 0.08 100.00
|--------------------------------------------------------------------------------------------------------------------------
MPI_Allreduce ALL 0.00 0.02 128 0.00 0.42
4 0.00 33.33 64 0.00 82.03
8 0.00 66.67 64 0.00 17.97
|--------------------------------------------------------------------------------------------------------------------------
MPI_Bcast ALL 0.00 0.00 32 0.00 0.13
4 0.00 100.00 32 0.00 100.00
|--------------------------------------------------------------------------------------------------------------------------
MPI_Gather ALL 0.00 0.00 32 0.00 0.05
8 0.00 100.00 32 0.00 100.00
|==========================================================================================================================
| TOTAL 12.22 100.00 134944 0.23 100.00
|
If I run IMB on 8 hosts, the output of IMB is:
[root@6248r-node121 osu]# mpiexec.hydra --hostfile /home/software/hostfiles/hostfile8 -n 4 aps IMB-MPI1 alltoall -msglog 2:8 -npmin 4
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.2, MPI-1 part
#----------------------------------------------------------------
# Date : Wed Jun 16 09:12:09 2021
# Machine : x86_64
# System : Linux
# Release : 3.10.0-957.el7.x86_64
# Version : #1 SMP Thu Nov 8 23:39:32 UTC 2018
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# IMB-MPI1 alltoall -msglog 2:8 -npmin 4
# Minimum message length in bytes: 0
# Maximum message length in bytes: 256
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Alltoall
#----------------------------------------------------------------
# Benchmarking Alltoall
# #processes = 4
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.22 0.23 0.22
4 1000 2.20 2.43 2.32
8 1000 2.18 2.39 2.32
16 1000 2.24 2.59 2.35
32 1000 2.27 2.62 2.37
64 1000 2.24 2.50 2.35
128 1000 2.27 2.43 2.35
256 1000 2.43 2.75 2.57
# All processes entering MPI_Finalize
Intel(R) oneAPI VTune(TM) Profiler 2021.2.0 collection completed successfully. Use the "aps --report /home/software/intelmpi_test/osu/aps_result_20210616" command to generate textual and HTML reports for the profiling session.
Output of aps:
[root@6248r-node121 osu]# aps --report /home/software/intelmpi_test/osu/aps_result_20210616 -mPDF
Loading 100.00%
| Message Sizes summary for all Ranks
|--------------------------------------------------------------------------------------------------------------------------
| Function Message size(B) Volume(MB) Volume(%) Transfers Time(sec) Time(%)
|--------------------------------------------------------------------------------------------------------------------------
MPI_Alltoall ALL 12.21 99.98 67232 0.15 70.29
4 0.10 0.79 8404 0.03 21.06
256 6.16 50.39 8404 0.02 13.94
16 0.38 3.15 8404 0.02 13.80
128 3.08 25.20 8404 0.02 12.80
64 1.54 12.60 8404 0.02 12.66
32 0.77 6.30 8404 0.02 12.65
8 0.19 1.57 8404 0.02 12.50
0 0.00 0.00 8404 0.00 0.59
|--------------------------------------------------------------------------------------------------------------------------
MPI_Barrier ALL 0.00 0.00 67520 0.06 29.06
0 0.00 100.00 67520 0.06 100.00
|--------------------------------------------------------------------------------------------------------------------------
MPI_Allreduce ALL 0.00 0.02 128 0.00 0.46
4 0.00 33.33 64 0.00 80.02
8 0.00 66.67 64 0.00 19.98
|--------------------------------------------------------------------------------------------------------------------------
MPI_Bcast ALL 0.00 0.00 32 0.00 0.14
4 0.00 100.00 32 0.00 100.00
|--------------------------------------------------------------------------------------------------------------------------
MPI_Gather ALL 0.00 0.00 32 0.00 0.05
8 0.00 100.00 32 0.00 100.00
|==========================================================================================================================
| TOTAL 12.22 100.00 134944 0.21 100.00
|
I have posted the output of IMB and APS.
But I don't think the key is the performance of my server. The autotuner works well and achieves about a 1.7% performance improvement on another HPC application, WRF.
I think the key is a property of the application itself.
Are there conditions under which the autotuner achieves better performance, and conditions under which it achieves roughly zero improvement or even a negative effect?
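To check whether a difference on the order of 1.7% is outside run-to-run noise, I compare the mean and spread of the 10 elapsed times for each configuration, roughly like this (a sketch; a times.txt file with one elapsed time per line is just my own convention):
# mean and sample standard deviation of the elapsed times (one value per line in times.txt)
awk '{ n++; sum += $1; sumsq += $1 * $1 }
     END { mean = sum / n;
           printf "mean=%.2f s  stddev=%.2f s\n", mean, sqrt((sumsq - n * mean * mean) / (n - 1)) }' times.txt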
Hi,
Thank you for providing the requested details. Could you please provide the same details for your application?
1. export APS_STAT_LEVEL=5
2. mpiexec.hydra -n 4 aps <app name> alltoall -msglog 2:8 -npmin 4
--> You will see a statement like the one below at the end of the execution.
"Intel(R) oneAPI VTune(TM) Profiler 2021.2.0 collection completed successfully. Use the "aps --report /home/u67125/aps_result_20210607" command to generate textual and HTML reports for the profiling session."
Now, as suggested above, run:
3. aps-report /home/u67125/aps_result_20210607 -mPDF
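For your application this would look roughly like the sketch below. It is only an illustration: the hostfile, ppn, and the simpleFoam binary are taken from your earlier posts, and the IMB-specific arguments (alltoall -msglog 2:8 -npmin 4) are presumably dropped when profiling your own binary.
export APS_STAT_LEVEL=5
mpiexec.hydra -hostfile ${hostfile} -genvall -ppn ${ppn} aps simpleFoam -parallel
aps-report <result directory printed at the end of the run> -mPDF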
Thanks & Regards
Shivani
Hi,
As we didn't hear back from you, is your issue resolved? If not, please provide the details requested in my previous post.
Thanks & Regards
Shivani
Hi,
As we have not heard back from you, we assume that your issue has been resolved and that we have answered all your queries, so we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.
Have a Good day!
Thanks & Regards
Shivani
