Solved: The ifort compiler option '

Bruns__Marco · ‎03-23-2019

Hi,

I have just received my copy of the Intel Fortran compiler (Linux) as an Open Source Contributor. The first idea I had was to compare it to the gfortran compiler. As a benchmark I tried the following code:

https://github.com/marcobruns/fortran_performance_for_neural_networks/blob/master/fortran_matmul.f90

With gfortran the compiled code took 24.5 s to execute, and with ifort it took 219.4 s! The BLAS version of the code (same repository as above) ran in nearly identical times with both compilers, close to 3.5 s.

Why does it take so much longer when compiled with ifort?

I am using the following compiler versions:

gfortran: GNU Fortran (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0

Intel Fortran: Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.0.3.199 Build 20190206

Thank you very much in advance for any kind of constructive criticism.

Best Wishes

Marco

Johannes_Rieke · ‎03-26-2019

The ifort compiler option '-parallel' does a great job on your code. With that option ('-O3 -fast -parallel') I could reduce the execution time from 21 s to 1.4 s, compared to '-O3 -fast' only (on my CPU, PSXE 2019 u3).

I think '-qopt-matmul' can also be used. In that case one has to specify -mkl:parallel also. '-O3 -parallel' triggers '-qopt-matmul'.

PS: It might be a good idea to use modules instead of interfaces in your code.

Juergen_R_R · ‎03-26-2019

Dear Marco,

indeed I can confirm your timings: for gfortran 5.4 they are very similar to ifort v19.0.3, namely ca. 290-300 s, while for gfortran 9.0.1 I get 50 s. I know that between v6 and v7 of gcc/gfortran there was some massive work on performance and optimization (which also caused a couple of optimization regressions), so that might be a reason. I didn't look into the details to find out what exactly is going on.

Cheers,

JRR

FortranFan · ‎03-26-2019

Bruns, Marco wrote:
.. Why does it take so much longer when compiled with ifort. ..
Thank you very much in advance for any kind of constructive criticism.

@Bruns, Marco,

You may also want to submit a support request about this at Intel Support Center: https://supporttickets.intel.com/servicecenter?lang=en-US

My hunch is that they will request details such as the compiler options (especially those for optimization) used in your 2 comparisons, and it will be worth sharing them here as well.

See this as to how they list the compiler options they employed in the comparisons: https://www.fortran.uk/fortran-compiler-comparisons/polyhedron-benchmarks-linux64-on-intel/

Are you using the same set of options?

Devorah_H_Intel · ‎03-26-2019

ifort perf1.f90 -o perf -O3 -parallel
./perf
mat1 created!
mat2 created!
matrix multiplication took :    1.472000

What options were used with gfortran build?

jimdempseyatthecove · ‎03-26-2019

FWIW

The sample code is serial. However, depending on the libraries linked, the parallel version of the MKL matmul may be called, and if so, the first call has the additional overhead of instantiating the OpenMP thread pool (or other thread pool if this has changed).

For smaller arrays, you can link in the non-threaded MKL library.

Jim Dempsey
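
One way to probe the threading effect described above is to pin the thread count through the standard OMP_NUM_THREADS and MKL_NUM_THREADS environment variables and compare the reported times. The sketch below assumes a binary named ./perf (the name used in the build command earlier in the thread) that was built with '-parallel' or linked against the threaded MKL; the thread counts 1, 2 and 4 are arbitrary choices, and the helper name run_with_threads is made up for illustration.

```shell
#!/bin/bash
# Sketch only: run the same binary under several thread limits.

# run_with_threads BINARY N : run BINARY with OpenMP and MKL limited to N threads
run_with_threads() {
    local bin=$1 n=$2
    if [ ! -x "$bin" ]; then
        # Nothing to run; report it and carry on instead of failing.
        echo "skip: $bin not found"
        return 0
    fi
    echo "== $bin with $n thread(s) =="
    OMP_NUM_THREADS=$n MKL_NUM_THREADS=$n "$bin"
}

# Thread counts are illustrative; adjust them to the machine at hand.
for n in 1 2 4; do
    run_with_threads ./perf "$n"
done
```

If the time printed by the program barely changes between runs, the MATMUL call is probably not going through a threaded path at all.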

Johannes_Rieke · ‎03-27-2019

FYI

'-qopt-matmul' without '-parallel' and with '-mkl:sequential' creates link errors. The documentation is not clear about this. Intuitively, I would expect this combination to link (it fails with PSXE 2017 up to 2019 u3 on Windows; the Linux version seems to behave differently, PSXE 2017 u6 links). Might this be a bug?

In an older thread of mine I encountered a similar issue for PSXE 2015/2016, which was solved in PSXE 2016 u3 (https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/606632). However, MKL/BLAS offload of matmul has not been performed in this case.

Maybe it would be a nice feature to let the compiler choose between the intrinsic matmul and MKL/BLAS via '-mkl=sequential' for the case that you strictly need sequential code?

TimP · ‎03-27-2019

I haven't checked the latest ifort, but past versions did not vectorize matmul effectively until you set -O3 (which implies -qopt-matmul, which you may or may not want, either at -O2 or -O3). Surely gfortran performance also depends on your compiler settings, but not in the same way.

-qopt-matmul is implemented by linking to an internal entry point in the MKL library. If you wished to set -O3 for good single thread performance of MATMUL and did not want to link MKL, you would turn off opt-matmul explicitly. opt-matmul is probably required for threaded MATMUL; -qparallel would imply -qopt-matmul -mkl.

Past versions of ifort have MKL_DIRECT options to optimize MATMUL for moderate size problems. Unless the release notes indicate a change, I would expect the latest version to work with the documentation of earlier versions. Perhaps the latest version has done away with the need to consider these options.
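
The flag combinations discussed above can be listed side by side with a small dry-run script. This is a sketch under assumptions: it only prints ifort command lines, it uses the original poster's file name fortran_matmul.f90, and the output names and the helper print_variant are made up for illustration. '-qno-opt-matmul' is the negated form of '-qopt-matmul'; whether -O3 alone enables opt-matmul for a given compiler version is the recollection stated above, not something this script verifies.

```shell
#!/bin/bash
# Dry-run sketch: print one ifort command line per MATMUL-related flag set.
SRC=fortran_matmul.f90

# print_variant LABEL FLAGS... : print the build command for one flag set
print_variant() {
    local label=$1
    shift
    echo "ifort $* $SRC -o matmul_$label"
}

print_variant o2          -O2                     # baseline optimization
print_variant o3          -O3                     # may pull in opt-matmul
print_variant o3_nomkl    -O3 -qno-opt-matmul     # -O3 without the MKL call
print_variant o3_parallel -O3 -parallel           # auto-parallel build
```

Piping the output to sh would run the four builds on a machine where ifort is installed; timing each resulting binary then shows which flag set actually matters for MATMUL.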

Johannes_Rieke · ‎03-27-2019

I can confirm, after installing PSXE 2019 u3 on my GNU/Linux system, that linking against mkl:sequential works fine (ifort -qopt-matmul -mkl:sequential matmultest.f90). The same fails on Windows for the PSXE 2019 family (ifort /Qopt-matmul /Qmkl:sequential matmultest.f90), while /Qmkl:parallel works fine. I don't know if it's an issue with my system. Nevertheless, I will open a ticket.

Bruns__Marco · ‎03-28-2019

Hi Johannes,

thank you very much for your input (and I would also like to thank everybody who replied to my question).

johannes k. wrote:
The ifort compiler option '-parallel' does a great job at your code. With that option ('-O3 -fast -parallel') I could reduce execution time from 21 sec. to 1.4 sec. compared to '-O3 -fast' only (for my cpu, PSXE 2019 u3).

Sorry for not mentioning it, since it is vital information: I used no compiler options for optimization. The commands for compiling the code that produced my results are:

ifort fortran_matmul.f90 -o fortran_matmul

gfortran fortran_matmul.f90 -o fortran_matmul

But regarding your answer, I will definitely look into the compiler options for ifort (and also for gfortran), since there are obviously some promising options available to speed up the execution of the code.
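
A like-for-like comparison could be scripted roughly as follows. This is a minimal sketch, not a definitive benchmark: the ifort flags '-O3 -parallel' come from the replies above, while the gfortran flags '-O3 -march=native', the output names and the helper build_and_run are assumptions made for illustration. The script relies on the program printing its own timing, as in the output shown earlier in the thread.

```shell
#!/bin/bash
# Sketch only: build the same source with default and optimized flags, then run each.
SRC=fortran_matmul.f90

# build_and_run COMPILER OUTPUT [FLAGS...]
build_and_run() {
    local fc=$1 out=$2
    shift 2
    if ! command -v "$fc" >/dev/null 2>&1; then
        echo "skip: $fc not found"      # compiler not installed
        return 0
    fi
    if [ ! -f "$SRC" ]; then
        echo "skip: $SRC not found"     # source file missing
        return 0
    fi
    echo "== $fc $* =="
    # Run the binary only if the build succeeded.
    "$fc" "$@" "$SRC" -o "$out" && "./$out"
}

build_and_run ifort    matmul_ifort_default
build_and_run ifort    matmul_ifort_opt        -O3 -parallel
build_and_run gfortran matmul_gfortran_default
build_and_run gfortran matmul_gfortran_opt     -O3 -march=native
```

Comparing the two default builds with each other, and the two optimized builds with each other, keeps the compiler settings from being mixed up with the compiler itself.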

Why is matmul so much slower when compiled with ifort (compared to gfortran)