
luca_l_ · ‎02-14-2017

I'm trying to make an optimized parallel version of [opencv SURF][1] and in particular [surf.cpp][2] using Intel C++ compiler.

I'm using Intel Advisor to locate inefficient and unvectorized loops. In particular, it suggests rebuilding the code with the `icpc` compiler (instead of `gcc`) and then using the `-xCORE-AVX2` flag, since it's available on my hardware.

So my original `cmake` for building opencv using `g++` was:

    cmake -D CMAKE_BUILD_TYPE=RelWithDebInfo -D CMAKE_INSTALL_PREFIX=... -D OPENCV_EXTRA_MODULES_PATH=... -DWITH_TBB=OFF -DWITH_OPENMP=ON

And built the application which uses SURF with `g++ ... -O3 -g -fopenmp`

Using `icpc` instead is:

    cmake -D CMAKE_BUILD_TYPE=RelWithDebInfo -D CMAKE_INSTALL_PREFIX=... -D OPENCV_EXTRA_MODULES_PATH=... -DWITH_TBB=OFF -DWITH_OPENMP=ON -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc -DCMAKE_CXX_FLAGS="-debug inline-debug-info -parallel-source-info=2 -ipo -parallel -xCORE-AVX2 -Bdynamic"

(in particular notice `-DCMAKE_C_COMPILER -DCMAKE_CXX_COMPILER -DCMAKE_CXX_FLAGS`)

And compiled the SURF application with: `-g -O3 -ipo -parallel -qopenmp -xCORE-AVX2` and `-shared-intel -parallel` for linking

I thought that the `icpc` build was going to be faster than the `g++` one, but it isn't: `icpc` takes 0.15s while `g++` takes 0.12s (I ran the experiments several times and these numbers are reliable).

Why does this happen? Am I doing something wrong with `icpc`?

[1]: http://docs.opencv.org/3.0-beta/doc/py_tutorials/py_feature2d/py_surf_intro/py_surf_intro.html
[2]: https://github.com/opencv/opencv_contrib/blob/master/modules/xfeatures2d/src/surf.cpp

SergeyKostrov · ‎02-14-2017

Repeat your tests after optimizations are completed and remove option -g ( Generate debug information in default format ). Any binaries with debug information are always slower.

luca_l_ · ‎02-15-2017

Sergey Kostrov wrote:

Repeat your tests after optimizations are completed and remove option -g ( Generate debug information in default format ).

Any binaries with debug information are always slower.

Thanks for your answer. However, -g is highly suggested by all the Intel tools, such as Advisor or VTune. In addition, both g++ and icpc have the -g, so both should be slower and icpc should still be faster.

McCalpinJohn · ‎02-15-2017

Are the 0.15s and 0.12s whole-program times?

If so, they are probably dominated by process startup time, including loading required shared libraries. It seems likely that there are systematic differences between the two compiler families, and compiler optimization is not going to change this.

SergeyKostrov · ‎02-15-2017

>>However, -g is highly suggested by all the Intel tools, such as Advisor or VTune.

That is true for cases when profiling is needed. Final performance tests should be done on Release executables without any debug information. By default, option -g ( = 3 ) creates complete debug information.

>>In addition, both g++ and icpc have the -g, so both should be slower and icpc should still be faster.

Intel executables built with debug information are always slower ( I've been using versions 7.x, 8.x, 12.x, 13.x, 16.x and 17.x ) than executables built with debug information by other C++ compilers. All final performance verification should be done with a clean Release configuration.

luca_l_ · ‎02-20-2017

McCalpin, John wrote:

Are the 0.15s and 0.12s whole-program times?

If so, they are probably dominated by process startup time, including loading required shared libraries. It seems likely that there are systematic differences between the two compiler families, and compiler optimization is not going to change this.

Thanks for your answer. Actually, these time measurements refer only to the function that I want to parallelize.
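Because the numbers cover a single function rather than the whole program, a small in-process timing harness makes the comparison easier to repeat under both compilers. The following is only a sketch under stated assumptions, not the code used in this thread: `median_seconds` is a hypothetical helper name, and the callable passed to it stands in for the SURF call being measured.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical helper (not from the thread): calls `fn` `repeats` times and
// returns the median wall-clock duration of one call, in seconds.
template <typename Fn>
double median_seconds(Fn&& fn, int repeats = 21) {
    assert(repeats > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(repeats));
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();  // the code under test, e.g. the function to be parallelized
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(stop - start).count());
    }
    // nth_element moves the median sample to the middle position.
    auto mid = samples.begin() + samples.size() / 2;
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}
```

Reporting the median of many runs, instead of a single run, reduces the influence of one-off effects such as a cold cache or the first-call startup of a threading runtime.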

jimdempseyatthecove · ‎02-20-2017

>>And compiled the SURF application with: `-g -O3 -ipo -parallel -qopenmp -xCORE-AVX2` and `-shared-intel -parallel` for linking

You typically do not combine auto-parallelism (-parallel) with OpenMP (-qopenmp). Whether this has an effect here, I cannot say. If you have VTune, looking at the disassembly might shed some light on what is going on.
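One way to separate the two mechanisms is to build the same explicitly annotated loop once with `-qopenmp` only and once with `-parallel` only, then compare timings. The loop below is an illustrative stand-in, not code from surf.cpp, and `sum_of_squares` is a hypothetical name. With OpenMP enabled the pragma parallelizes the loop; without it the pragma is ignored and the loop is left to the compiler.

```cpp
#include <vector>

// Illustrative loop (an assumption, not taken from surf.cpp).
// With OpenMP enabled, the pragma splits the iterations across threads and
// combines the per-thread partial sums; without it, the loop runs serially
// unless the compiler's auto-parallelizer decides to transform it.
double sum_of_squares(const std::vector<double>& v) {
    double sum = 0.0;
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        sum += v[i] * v[i];
    }
    return sum;
}
```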

Jim Dempsey

McCalpinJohn · ‎02-20-2017

It is certainly not impossible for gcc to be faster than the Intel C compiler... VTune may be useful in locating the differences.

SergeyKostrov · ‎02-21-2017

>>...It is certainly not impossible for gcc to be faster than the Intel C compiler...

I've been using different versions ( 3.4.2, 4.8.1, 4.9.0, 4.9.2, 5.1.0 and 6.1.0 ) of the GCC-like MinGW C++ compiler, and I can say it keeps getting better in terms of performance, vectorization, etc., when compared to the Intel C++ compiler. In some cases it is 5%-10% slower than the Intel C++ compiler; in other cases it is 5%-10% faster. It depends on the algorithm, and I would say that the performance of the GCC C++ compiler doesn't concern me any longer.

TimP · ‎02-21-2017

There are just a few instances where gcc performance doesn't compete with icc, and there are a few where icc optimizes different source code than gcc does. Among the more egregious is where gcc depends on fmax|fmin et al. and takes shortcuts, whereas icc used to require std::max|min (and now also optimizes carefully written code using the ternary operator `?:`) but does not optimize the gcc-preferred code.
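For reference, the three spellings of a max reduction mentioned above look roughly like this. The function names are hypothetical, and which form a given compiler version vectorizes best is something to check in its optimization report rather than assume. The forms are not strictly equivalent: `std::fmax` returns the non-NaN operand when exactly one argument is NaN, while `std::max` and the ternary form simply follow the comparison.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// All three functions assume a non-empty input vector.

// C math library form (the style described above as preferred by gcc).
float max_fmax(const std::vector<float>& v) {
    float m = v[0];
    for (float x : v) m = std::fmax(m, x);
    return m;
}

// Standard-library form (the style icc is described as having required).
float max_std(const std::vector<float>& v) {
    float m = v[0];
    for (float x : v) m = std::max(m, x);
    return m;
}

// Explicit ternary form.
float max_ternary(const std::vector<float>& v) {
    float m = v[0];
    for (float x : v) m = (x > m) ? x : m;
    return m;
}
```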

Even further than Sergey mentioned, gcc 7.1 has improved performance of max|min index reduction so it may be faster than some methods like cilkplus or user defined reducers which have been publicized for icc.
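A max index reduction in its plainest form is a loop like the one below. This is a minimal sketch with a hypothetical function name (`argmax`); it only shows the pattern being discussed and makes no claim about how any particular compiler version handles it.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the first maximum element; assumes v is non-empty.
std::size_t argmax(const std::vector<float>& v) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < v.size(); ++i) {
        // Strict comparison keeps the earliest index when values tie.
        if (v[i] > v[best]) best = i;
    }
    return best;
}
```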

So you would need to be specific about where you see slowdown in icc if you are looking for advice (and don't solve the problem by looking into it).

jimdempseyatthecove · ‎02-22-2017

I've seen many cases where the inclusion or exclusion of statements NOT used (not even close to the function) affects performance in the 3%-5% range (YMMV). On close inspection, the culprit seems to be changes in the placement of code, in particular if the movement causes a critical loop to span an additional cache line, if an instruction containing a branch spans a cache line, or if the major loop spans a page boundary (requiring an additional TLB entry).

Jim Dempsey

SergeyKostrov · ‎02-23-2017

>>...gcc 7.1 has improved performance...

It is good news for me that it is finally released.

SergeyKostrov · ‎02-23-2017

>>>>...gcc 7.1 has improved performance...
>>
>>It is good news for me that it is finally released.

Tim, where did you see version 7.1?

TimP · ‎02-23-2017

gcc 7.0 source was released. 7.1 is trunk development version which seems stable. There seem to be more than usual changes from 7.0 to 7.1 but it appears to build OK for usual Windows targets, if you deal with strangeness such as gfortran configures for mingw 64-bit if 64-bit Windows is detected, and the gfortran dynamic library has a name version change due to ABI change, besides the C++ changes on the official page. I wouldn't be surprised if there are no good pre-built Windows binaries for gcc 7.x.

icpc slower than gcc?