luca_l_
Beginner
252 Views

icpc slower than gcc?

I'm trying to make an optimized parallel version of [OpenCV SURF][1], and in particular of [surf.cpp][2], using the Intel C++ compiler.

I'm using Intel Advisor to locate inefficient and unvectorized loops. In particular, it suggests rebuilding the code with the `icpc` compiler (instead of `gcc`) and then using the `-xCORE-AVX2` flag, since it's available on my hardware.

So my original `cmake` command for building OpenCV with `g++` was:

    cmake -D CMAKE_BUILD_TYPE=RelWithDebInfo -D CMAKE_INSTALL_PREFIX=... -D OPENCV_EXTRA_MODULES_PATH=... -DWITH_TBB=OFF -DWITH_OPENMP=ON

I then built the application that uses SURF with `g++ ... -O3 -g -fopenmp`.

With `icpc` the command becomes:

    cmake -D CMAKE_BUILD_TYPE=RelWithDebInfo -D CMAKE_INSTALL_PREFIX=... -D OPENCV_EXTRA_MODULES_PATH=... -DWITH_TBB=OFF -DWITH_OPENMP=ON -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc -DCMAKE_CXX_FLAGS="-debug inline-debug-info -parallel-source-info=2 -ipo -parallel -xCORE-AVX2 -Bdynamic"

(in particular notice `-DCMAKE_C_COMPILER -DCMAKE_CXX_COMPILER -DCMAKE_CXX_FLAGS`)

And compiled the SURF application with: `-g -O3 -ipo -parallel -qopenmp -xCORE-AVX2` and `-shared-intel -parallel` for linking

I thought that the `icpc` build was going to be faster than the `g++` one, but it isn't: `icpc` takes 0.15s while `g++` takes 0.12s (I ran the experiments several times and these numbers are reliable).

Why does this happen? Am I doing something wrong with `icpc`?

 

  [1]: http://docs.opencv.org/3.0-beta/doc/py_tutorials/py_feature2d/py_surf_intro/py_surf_intro.html
  [2]: https://github.com/opencv/opencv_contrib/blob/master/modules/xfeatures2d/src/surf.cpp

13 Replies
SergeyKostrov
Valued Contributor II

Repeat your tests after optimizations are completed and remove option -g ( Generate debug information in default format ). Any binaries with debug information are always slower.
luca_l_
Beginner

Sergey Kostrov wrote:

Repeat your tests after optimizations are completed and remove option -g ( Generate debug information in default format ).

Any binaries with debug information are always slower.

Thanks for your answer. However, -g is highly suggested by all the Intel tools, such as Advisor or VTune. In addition, both g++ and icpc have the -g, so both should be slower and icpc should still be faster.

McCalpinJohn
Black Belt

Are the 0.15s and 0.12s whole-program times? 

If so, they are probably dominated by process startup time, including loading required shared libraries. It seems likely that there are systematic differences between the two compiler families, and compiler optimization is not going to change this.

SergeyKostrov
Valued Contributor II

>>However, -g is highly suggested by all the Intel tools, such as Advisor or VTune.

That is true for cases when profiling is needed. Final performance tests should be done on Release executables without any debug information. By default, option -g ( = 3 ) creates complete debug information.

>>In addition, both g++ and icpc have the -g, so both should be slower and icpc should still be faster.

Intel executables built with debug information are always slower ( I've been using versions 7.x, 8.x, 12.x, 13.x, 16.x and 17.x ) when compared to executables built with debug information by other C++ compilers. All final performance verifications should be done with a clean Release configuration.
luca_l_
Beginner

McCalpin, John wrote:

Are the 0.15s and 0.12s whole-program times? 

If so, they are probably dominated by process startup time, including loading required shared libraries. It seems likely that there are systematic differences between the two compiler families, and compiler optimization is not going to change this.

Thanks for your answer. Actually, these time measurements refer only to the function that I want to parallelize.
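
For readers who want to reproduce a function-level measurement of this kind, here is a minimal sketch. It is not the harness used in the original post: `time_calls` and `median_of_sorted` are hypothetical helper names, and the sketch assumes that wall-clock time from `std::chrono::steady_clock` is the metric of interest and that the function under test can safely be called several times.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of OpenCV or any Intel tool): call `fn`
// `repeats` times and return the per-call wall-clock times in seconds,
// sorted in ascending order.
template <typename Fn>
std::vector<double> time_calls(Fn&& fn, int repeats) {
    assert(repeats > 0);
    std::vector<double> seconds;
    seconds.reserve(static_cast<std::size_t>(repeats));
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();  // only this call is timed, not process startup or library loading
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    std::sort(seconds.begin(), seconds.end());
    return seconds;
}

// Median of a non-empty sample that is already sorted in ascending order.
inline double median_of_sorted(const std::vector<double>& sorted) {
    assert(!sorted.empty());
    const std::size_t n = sorted.size();
    return (n % 2 != 0) ? sorted[n / 2]
                        : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
}
```

A call such as `auto t = time_calls([&] { run_surf(); }, 20);` (where `run_surf` is a placeholder for the function being compared) gives `t.front()` as the best time and `median_of_sorted(t)` as a typical time. Reporting both makes it easier to tell a 0.03s difference apart from run-to-run noise.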

jimdempseyatthecove
Black Belt

>>And compiled the SURF application with: `-g -O3 -ipo -parallel -qopenmp -xCORE-AVX2` and `-shared-intel -parallel` for linking

You typically do not combine auto-parallelism (-parallel) with OpenMP (-qopenmp). Whether this has an effect here, I cannot say. If you have VTune, looking at the disassembly might shed some light on what is going on.
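
To illustrate what explicit OpenMP threading looks like, here is a minimal sketch; `sum_of_squares` is a made-up kernel, not code from surf.cpp. A loop annotated this way is already threaded once OpenMP is enabled (`-qopenmp` for icpc, `-fopenmp` for g++), which is why asking the compiler to auto-parallelize on top of it is unusual. Whether the combination matters for the SURF code is not something this sketch can show.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel: sum of squares with an explicit OpenMP reduction.
// When OpenMP is enabled at compile time the iterations are split across
// threads; when it is not, the pragma is ignored and the loop runs serially.
double sum_of_squares(const double* data, long n) {
    assert(n >= 0 && (data != nullptr || n == 0));
    double total = 0.0;
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += data[i] * data[i];
    }
    return total;
}
```

Note that a floating-point reduction may sum in a different order when threaded, so results can differ in the last bits between serial and parallel builds.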

Jim Dempsey

McCalpinJohn
Black Belt

It is certainly not impossible for gcc to be faster than the Intel C compiler... VTune may be useful in locating the differences.

SergeyKostrov
Valued Contributor II

>>...It is certainly not impossible for gcc to be faster than the Intel C compiler...

I've been using different versions ( 3.4.2, 4.8.1, 4.9.0, 4.9.2, 5.1.0 and 6.1.0 ) of the GCC-based MinGW C++ compiler, and I can say it gets better and better in terms of performance, vectorization, etc., when compared to the Intel C++ compiler. In some cases it is 5%-10% slower than the Intel C++ compiler; in other cases it is 5%-10% faster. It depends on the algorithm, and I would say that the performance of the GCC C++ compiler doesn't concern me any longer.
TimP
Black Belt

There are just a few instances where gcc performance doesn't compete with icc, and a few where icc optimizes different source code than gcc does. Among the more egregious is the case where gcc depends on fmax|fmin et al. and takes shortcuts, whereas icc used to require std::max|min (and now also optimizes carefully written code using the `?:` operator) but does not optimize the code gcc prefers.
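
For reference, the three source forms mentioned above can be sketched as follows. The function names are hypothetical and the sketch makes no claim about which form a given compiler version vectorizes; checking the optimization report or the disassembly is the only way to know.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Three ways to write the same max reduction over a non-empty array.

// Form 1: C math library fmax.
float max_fmax(const float* a, std::size_t n) {
    assert(a != nullptr && n > 0);
    float m = a[0];
    for (std::size_t i = 1; i < n; ++i) m = std::fmax(m, a[i]);
    return m;
}

// Form 2: std::max from <algorithm>.
float max_std(const float* a, std::size_t n) {
    assert(a != nullptr && n > 0);
    float m = a[0];
    for (std::size_t i = 1; i < n; ++i) m = std::max(m, a[i]);
    return m;
}

// Form 3: explicit conditional (?:) expression.
float max_ternary(const float* a, std::size_t n) {
    assert(a != nullptr && n > 0);
    float m = a[0];
    for (std::size_t i = 1; i < n; ++i) m = (a[i] > m) ? a[i] : m;
    return m;
}
```

The forms agree on ordinary inputs but are not strictly interchangeable: `std::fmax` returns the non-NaN operand when exactly one operand is NaN, while the other two keep the current `m` whenever the comparison is false.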

Going further than what Sergey mentioned, gcc 7.1 has improved performance of max|min index reduction, so it may be faster than some methods, such as Cilk Plus or user-defined reducers, which have been publicized for icc.
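
A max index reduction is a loop that tracks where the largest element is, not just its value. A minimal sketch, with hypothetical function names and no claim about how any particular compiler version handles it:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Plain-loop form: index of the first largest element of a non-empty array.
std::size_t argmax_loop(const float* a, std::size_t n) {
    assert(a != nullptr && n > 0);
    std::size_t best = 0;
    for (std::size_t i = 1; i < n; ++i) {
        if (a[i] > a[best]) best = i;  // strict '>' keeps the first maximum
    }
    return best;
}

// Standard-library form of the same reduction.
std::size_t argmax_stl(const float* a, std::size_t n) {
    assert(a != nullptr && n > 0);
    return static_cast<std::size_t>(std::max_element(a, a + n) - a);
}
```

The dependency between the compared value and the stored index is what makes this pattern harder to vectorize than a plain max.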

So you would need to be specific about where you see slowdown in icc if you are looking for advice (and don't solve the problem by looking into it).

jimdempseyatthecove
Black Belt

I've seen many cases where inclusion or exclusion of statements that are NOT used (not even close to the function) affects performance in the 3%-5% range (YMMV). On close inspection, the culprit seems to be a change in the placement of code, in particular if the movement causes a critical loop to span an additional cache line, if an instruction containing a branch spans a cache line, or if the major loop spans a page boundary (requiring an additional TLB entry).

Jim Dempsey

SergeyKostrov
Valued Contributor II

>>...gcc 7.1 has improved performance...

It is good news for me that it is finally released.
SergeyKostrov
Valued Contributor II

>>>>...gcc 7.1 has improved performance...
>>
>>it is a good news for me that it is finally released.

Tim, where did you see version 7.1?
TimP
Black Belt

The gcc 7.0 source was released. 7.1 is the trunk development version, which seems stable. There seem to be more changes than usual from 7.0 to 7.1, but it appears to build OK for the usual Windows targets if you deal with some strangeness: gfortran configures for 64-bit mingw if 64-bit Windows is detected, and the gfortran dynamic library has a name version change due to an ABI change, besides the C++ changes on the official page. I wouldn't be surprised if there are no good pre-built Windows binaries for gcc 7.x.
