ifort versus gfortran: binary produced by Intel compilers runs 2 times slower than binary produced by GNU compilers

hladkyjiri · 08-14-2008

Hi all,

I'm working with ARPREC package downloaded from David H. Bailey's web page:
http://crd.lbl.gov/~dhbailey/mpdist/

ARPREC is a C++/Fortran-90 arbitrary precision package. I have compiled it on a 64-bit Linux system (openSUSE 10.2) with an Intel Core2 CPU T7400 running at 2.16 GHz. I used Intel compiler 10.1.015 and gcc version 4.1.2. I see a big difference in runtime: the binary produced by the Intel compilers runs 2 times slower! Do you see similar results with other code? Any clue what's wrong? I just can't believe that Intel performs so badly. The source code is actually fairly simple and short.

How to reproduce it:

Download http://crd.lbl.gov/~dhbailey/mpdist/arprec-2.2.2.tar.gz
Unpack. cd arprec-2.2.2
./configure
make
make check
make toolkit
cd toolkit; ./mathinit

=============INTEL COMPILERs==========
./configure reports:
C++ Compiler = icpc
C++ Flags = -O2 -mp -wd1572 -DHAVE_CONFIG_H
F90 Compiler = ifort
F90 Flags = -O2 -FR -mp
F90 Libs = -L/opt/intel/fce/10.1.015/lib -L/usr/lib64/gcc/x86_64-suse-linux/4.1.2/ -L/usr/lib64/gcc/x86_64-suse-linux/4.1.2/../../../../lib64 -lifport -lifcore -limf -lsvml -lm -lipgo -lirc -lirc_s -ldl

./mathinit reports
total cpu time = 116.309994816780
=======================================

===========GNU COMPILERs=============
./configure reports
C++ Compiler = c++
C++ Flags = -O2 -Wall -DHAVE_CONFIG_H
F90 Compiler = gfortran
F90 Flags = -O2 -ffree-form
F90 Libs = -L/usr/lib64/gcc/x86_64-suse-linux/4.1.2 -L/usr/lib64/gcc/x86_64-suse-linux/4.1.2/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/usr/lib64/gcc/x86_64-suse-linux/4.1.2/../../../../x86_64-suse-linux/lib -L/usr/lib64/gcc/x86_64-suse-linux/4.1.2/../../.. -lgfortranbegin -lgfortran -lm

./mathinit reports
total cpu time = 57.6636048406363
=======================================

Any idea why I see such a big difference in runtimes?

Thanks for any hints!
Jiri

TimP · 08-14-2008

In case you haven't tried Intel compiler options similar to the gfortran options, replace -mp with
-fno-inline-functions -assume protect_parens,minus0 -prec-div -prec-sqrt -no-ftz
For icpc, you would replace -mp with -fp-model source -fno-inline-functions; the same may work for ifort.
Evidently, you should not need all those options, but it may be worthwhile to start with the closest equivalents.
The -mp option was never designed for performance, and compilers from 9.1 on have much better alternatives.
It seems unlikely that -fno-inline-functions would speed up the code, but the code may have been designed for that option, as it is working for you with g++/gfortran. I mention the other options because you seem to be concerned about the ways in which ifort defaults are less accurate than the options you chose for gfortran.
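
A minimal sketch of how the flags suggested above might be handed to the build. It assumes that ARPREC's configure script honors the usual autoconf variables CXX, FC, CXXFLAGS and FCFLAGS, which has not been checked against the package. The function name configure_cmd is made up for illustration, and the command is only printed, not run, so the sketch works without the Intel compilers installed.

```shell
# Flags from the suggestion above, with -mp removed.
# -wd1572 and -FR are kept because configure reported them for icpc/ifort;
# configure may well add them on its own.
ICPC_FLAGS="-O2 -fp-model source -fno-inline-functions -wd1572"
IFORT_FLAGS="-O2 -FR -fno-inline-functions -assume protect_parens,minus0 -prec-div -prec-sqrt -no-ftz"

# Print the configure command line instead of executing it (dry run).
# ASSUMPTION: configure accepts CXX/FC/CXXFLAGS/FCFLAGS as arguments.
configure_cmd() {
    printf './configure CXX=icpc FC=ifort CXXFLAGS="%s" FCFLAGS="%s"\n' \
        "$ICPC_FLAGS" "$IFORT_FLAGS"
}

configure_cmd
```

If the printed line looks right, it can be pasted into the arprec-2.2.2 directory and followed by the usual make steps.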

TimP · 08-14-2008

-mp may be used where you would use -mfpmath=387 with gfortran or g++, but you didn't mention that option, so I mention this only for completeness.

hladkyjiri · 08-14-2008

Hi Tim!

Thanks for your hints! I will give it a try and post my results here. So far, I have just used the standard configure/make procedure. I think that inline functions are used quite heavily in the C++ source code.

Thanks
Jiri

joseph-krahn · 08-14-2008

In my experience, Intel Fortran is almost always faster than gfortran, and often significantly faster. So, your example is unusual. Gfortran is still fairly new, and its developers are putting a big effort into standards compliance. There are many situations where gfortran has not yet had time to optimize. I think they are doing a good job just getting close to Intel Fortran performance.

I haven't checked recently, but my general expectation is for gfortran to be around 5-20% slower. That is pretty good for a free compiler. The obvious problem here is that "-mp" forces more precise IEEE math, which excludes a lot of significant math optimizations, which you are letting gfortran get away with. If Intel Fortran is giving bad numbers without -mp, try -mp1, which improves IEEE compliance with less speed impact.

TimP · 08-14-2008

-mp1 is an obsolete option, covered by the options I recommended. You would want that option if you used ifort 9.0 or earlier, but it may fall short.

Gfortran, with the options OP has quoted, doesn't perform optimizations contrary to Fortran or IEEE standards. ifort 10.1 supports equivalent options, as I've suggested.

There is no uniform performance comparison factor between gfortran and ifort. Cases where ifort is slower are rare when equivalent options are in use. They are worth reporting if you are able to provide specific test cases.

hladkyjiri · 08-14-2008

Hi all!

Thanks for your hints! I have used the following options with the Intel compilers (-fno-inline-functions had a negative impact, so it's not included here):
====INTEL COMPILERs: run #1===============
C++ Compiler = icpc
C++ Flags = -O2 -fp-model source -axT -wd1572 -DHAVE_CONFIG_H
F90 Compiler = ifort
F90 Flags = -O2 -FR -fp-model source -axT
./mathinit reports
total cpu time = 56.8500003069639
=======================================

====INTEL COMPILERs: run #2===============
C++ Compiler = icpc
C++ Flags = -O2 -fp-model source -axT -wd1572 -DHAVE_CONFIG_H
F90 Compiler = ifort
F90 Flags = -O2 -FR -assume protect_parens,minus0 -prec-div -prec-sqrt -no-ftz -axT
./mathinit reports
total cpu time = 57.0899981707335
=======================================

I'm a little puzzled that -fp-model source is slightly faster than -assume protect_parens,minus0 -prec-div -prec-sqrt -no-ftz. What is the real difference between these two sets? -fp-model source seems more elegant; I don't like too many options.

ifort documentation says:
-fp-model source is equivalent to -fp-model precise
icpc documentation says:
precise -- Enables value-safe optimizations on floating-point data.
source -- Rounds intermediate results to source-defined precision and enables value-safe optimizations.
Is there any difference or are they indeed the same? (Just curious.)

On the GNU side, I have added -finline-functions and -mssse3 (to match -axT) to make the comparison fair:
===========GNU COMPILERs run#1============
C++ Compiler = c++
C++ Flags = -O2 -finline-functions -mssse3 -Wall -DHAVE_CONFIG_H
F90 Compiler = gfortran
F90 Flags = -O2 -finline-functions -mssse3 -ffree-form
./mathinit reports
total cpu time = 57.2395766973495
=======================================
Adding -finline-functions had only a small impact on run time, probably because all simple functions are marked as inline in the source code (and are already inlined by gcc with -O), so pushing inlining further does not help much.
===========GNU COMPILERs run#2============
C++ Compiler = c++
C++ Flags = -O2 -mssse3 -Wall -DHAVE_CONFIG_H
F90 Compiler = gfortran
F90 Flags = -O2 -mssse3 -ffree-form
./mathinit reports
total cpu time = 57.3195801824331
===========================================

Final result: Best Intel 56.85s, best GNU 57.24s
Intel wins!! :-)

Remaining questions: should I go for -fp-model source or -assume protect_parens,minus0 -prec-div -prec-sqrt -no-ftz? Is there any difference between -fp-model source and -fp-model precise?

Thanks a lot for quick help!
Jiri
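
For readers who want the numbers side by side, here is a small sketch that tabulates the timings reported in this thread and prints each one relative to the fastest run. The timings are copied from the posts; the short labels and the function name report are made up for illustration.

```shell
# Timings in seconds, copied from the posts in this thread (label:seconds).
timings="ifort/icpc -mp:116.309994816780
ifort/icpc -fp-model source -axT:56.8500003069639
ifort -assume/-prec-* set -axT:57.0899981707335
gfortran/c++ -finline-functions -mssse3:57.2395766973495
gfortran/c++ -mssse3:57.3195801824331"

# Print each run's time and its ratio to the fastest run.
report() {
    printf '%s\n' "$timings" | awk -F: '
        {
            name[NR] = $1
            t[NR] = $2 + 0                      # force numeric value
            if (NR == 1 || t[NR] < best) best = t[NR]
        }
        END {
            for (i = 1; i <= NR; i++)
                printf "%-42s %8.2f s  x%.2f\n", name[i], t[i], t[i] / best
        }'
}

report
```

Run as written, the -mp build comes out at about 2.05 times the fastest run, and the other four builds all fall within about 1% of each other.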

TimP · 08-14-2008

For ifort, -fp-model source and -fp-model precise are much the same, unlike the situation for C and C++, where -fp-model precise promotes float expressions to double.
The -fp-model options disable certain vectorizations, including those involving sum reduction and most math functions, and possibly other optimizations involving algebraically equivalent expressions. If your code has none of those, the -fp-model options may give full performance. It's even possible, when your code has been written carefully so that it is always optimized with a preference for left-to-right evaluation, that the -fp-model options may be faster.
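
The reason a value-safe model has to block vectorized sum reductions is that floating-point addition is not associative, so regrouping the terms can change the result. A tiny illustration using awk, which does its arithmetic in double precision; the three values are chosen for illustration and have nothing to do with ARPREC.

```shell
# The same three numbers are added in two different groupings.
# With a = 1e20, the value c = 1 is far too small to survive being added to -a.
left_to_right=$(awk 'BEGIN { a = 1e20; b = -1e20; c = 1; printf "%g", (a + b) + c }')
regrouped=$(awk 'BEGIN { a = 1e20; b = -1e20; c = 1; printf "%g", a + (b + c) }')

echo "(a + b) + c = $left_to_right"
echo "a + (b + c) = $regrouped"
```

A vectorized sum reduction accumulates partial sums in a different order than the source code specifies, which is exactly this kind of regrouping.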

hladkyjiri · 08-15-2008

Hi Tim,

Thanks for the clarification! I'm now using -fp-model source in both the C/C++ and Fortran compilers. I will recommend to the author of the ARPREC package that ./configure be changed to use these options.

Summary of this thread:
==================
-mp may be used where you would use -mfpmath=387; otherwise it should be replaced with -fp-model source.

In this particular example, the numerical results with -mp and -fp-model source were the same. The run time of the program compiled with -mp was 2 times longer than that of the same program compiled with -fp-model source (on an Intel Core2 Duo CPU).

Thanks
Jiri
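
One possible way to check that two builds give the same numerical results is to save each program's output and compare the files while ignoring the timing line. This is only a sketch: it assumes the program prints its results to standard output and that the line containing "total cpu time" is the only one that legitimately differs between runs. The function name same_results and the file names in the usage comment are hypothetical.

```shell
# Succeeds (exit status 0) when two saved outputs match apart from the timing line.
# ASSUMPTION: only the "total cpu time" line differs between otherwise equal runs.
same_results() {
    a=$(grep -v 'total cpu time' "$1")
    b=$(grep -v 'total cpu time' "$2")
    [ "$a" = "$b" ]
}

# Hypothetical usage, after saving the output of each build to a file:
#   same_results mp.out fp-model-source.out && echo "results identical"
```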