I'm trying to build an optimized, parallel version of [OpenCV SURF][1], in particular [surf.cpp][2], using the Intel C++ compiler.
I'm using Intel Advisor to locate inefficient and unvectorized loops. In particular, it suggests rebuilding the code with the `icpc` compiler (instead of `gcc`) and then using the `-xCORE-AVX2` flag, since my hardware supports it.
My original `cmake` invocation for building OpenCV with `g++` was:
cmake -D CMAKE_BUILD_TYPE=RelWithDebInfo -D CMAKE_INSTALL_PREFIX=... -D OPENCV_EXTRA_MODULES_PATH=... -DWITH_TBB=OFF -DWITH_OPENMP=ON
I then built the application that uses SURF with `g++ ... -O3 -g -fopenmp`.
With `icpc` instead, the `cmake` invocation is:
cmake -D CMAKE_BUILD_TYPE=RelWithDebInfo -D CMAKE_INSTALL_PREFIX=... -D OPENCV_EXTRA_MODULES_PATH=... -DWITH_TBB=OFF -DWITH_OPENMP=ON -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc -DCMAKE_CXX_FLAGS="-debug inline-debug-info -parallel-source-info=2 -ipo -parallel -xCORE-AVX2 -Bdynamic"
(in particular notice `-DCMAKE_C_COMPILER -DCMAKE_CXX_COMPILER -DCMAKE_CXX_FLAGS`)
And compiled the SURF application with: `-g -O3 -ipo -parallel -qopenmp -xCORE-AVX2` and `-shared-intel -parallel` for linking
I expected the `icpc` build to be faster than the `g++` one, but it isn't: `icpc` takes 0.15s while `g++` takes 0.12s (I ran the experiments several times and these numbers are consistent).
Why does this happen? Am I doing something wrong with `icpc`?
[1]: http://docs.opencv.org/3.0-beta/doc/py_tutorials/py_feature2d/py_surf_intro/py_surf_intro.html
[2]: https://github.com/opencv/opencv_contrib/blob/master/modules/xfeatures2d/src/surf.cpp
Sergey Kostrov wrote:
Repeat your tests after optimizations are completed and remove option -g ( Generate debug information in default format ).
Any binaries with debug information are always slower.
Thanks for your answer. However, -g is strongly recommended by the Intel tools, such as Advisor and VTune. In addition, both the g++ and icpc builds use -g, so both should be slowed down and icpc should still be faster.
Are the 0.15s and 0.12s whole-program times?
If so, they are probably dominated by process startup time, including loading required shared libraries. It seems likely that there are systematic differences between the two compiler families, and compiler optimization is not going to change this.
McCalpin, John wrote:
Are the 0.15s and 0.12s whole-program times?
If so, they are probably dominated by process startup time, including loading required shared libraries. It seems likely that there are systematic differences between the two compiler families, and compiler optimization is not going to change this.
Thanks for your answer. Actually, these time measurements refer only to the function that I want to parallelize.
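For readers who want to reproduce a function-only measurement like this, here is a minimal sketch of one way to do it. The helper name `best_time_seconds` is hypothetical and not taken from the original post; it times only the callable it is given, so process startup and shared-library loading are excluded, and it reports the fastest of several runs to reduce noise.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not from the original post): calls `fn` `repeats`
// times and returns the smallest wall-clock time of a single call, in seconds.
// Only the call itself is timed, so process startup is not included.
template <typename Fn>
double best_time_seconds(Fn&& fn, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    double best = 0.0;
    for (int i = 0; i < repeats; ++i) {
        const auto start = clock::now();
        fn();
        const auto stop = clock::now();
        const double elapsed =
            std::chrono::duration<double>(stop - start).count();
        if (i == 0 || elapsed < best) {
            best = elapsed;
        }
    }
    return best;
}
```

One would wrap the SURF call in a lambda and pass it in, building the same harness once with `g++` and once with `icpc`. Taking the minimum rather than the mean is a design choice that favors the least-disturbed run.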
>>And compiled the SURF application with: `-g -O3 -ipo -parallel -qopenmp -xCORE-AVX2` and `-shared-intel -parallel` for linking
You typically do not combine auto-parallelization (-parallel) with OpenMP (-qopenmp). Whether this has an effect, I cannot say. If you have VTune, looking at the disassembly might shed some light on what is going on.
Jim Dempsey
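To illustrate the distinction, here is a small sketch of a loop parallelized explicitly with OpenMP. The function `sum_of_squares` is a made-up example, not code from surf.cpp. With a pragma like this the programmer decides which loop runs in parallel, so one would normally build with only `-qopenmp` (icpc) or `-fopenmp` (g++) and leave `-parallel` off.

```cpp
#include <vector>

// Illustrative only: sums the squares of `v`.
// The loop is parallelized explicitly through the OpenMP pragma. If the code
// is compiled without an OpenMP flag, the pragma is ignored and the loop
// simply runs serially with the same result.
double sum_of_squares(const std::vector<double>& v) {
    double sum = 0.0;
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) {
        sum += v[i] * v[i];
    }
    return sum;
}
```

Whether dropping `-parallel` changes the timing in the original poster's case is not something this sketch can answer; it only shows what "using OpenMP" means at the source level.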
It is certainly not impossible for gcc to be faster than the Intel C compiler... VTune may be useful in locating the differences.
There are just a few instances where gcc performance doesn't compete with icc, and there are a few where icc needs different source code than gcc in order to optimize. Among the more egregious is where gcc depends on fmax|fmin et al. and takes shortcuts, whereas icc used to require std::max|min (and now also optimizes carefully written code with the ?: operator) but does not optimize the gcc-preferred code.
Even further than Sergey mentioned, gcc 7.1 has improved the performance of max|min index reduction, so it may be faster than some methods, like Cilk Plus or user-defined reducers, which have been publicized for icc.
So you would need to be specific about where you see the slowdown with icc if you are looking for advice (and don't solve the problem yourself by looking into it).
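As a concrete picture of the source-level variants mentioned above, here is a sketch of the same max reduction written three ways, plus a max-index reduction. All function names are made up for illustration. Which form a given compiler version vectorizes best is exactly the point under discussion, so the sketch makes no claim about that; for inputs without NaNs the three value reductions return the same result, but they treat NaNs differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Max reduction via the C math function (the form gcc is said to prefer).
float max_with_fmax(const std::vector<float>& v) {
    assert(!v.empty());
    float m = v[0];
    for (std::size_t i = 1; i < v.size(); ++i) m = std::fmax(m, v[i]);
    return m;
}

// Max reduction via std::max (the form icc is said to have required).
float max_with_std_max(const std::vector<float>& v) {
    assert(!v.empty());
    float m = v[0];
    for (std::size_t i = 1; i < v.size(); ++i) m = std::max(m, v[i]);
    return m;
}

// Max reduction via the conditional (?:) operator.
float max_with_ternary(const std::vector<float>& v) {
    assert(!v.empty());
    float m = v[0];
    for (std::size_t i = 1; i < v.size(); ++i) m = (v[i] > m) ? v[i] : m;
    return m;
}

// Max-index reduction: position of the first largest element.
std::size_t index_of_max(const std::vector<float>& v) {
    assert(!v.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < v.size(); ++i) {
        if (v[i] > v[best]) best = i;
    }
    return best;
}
```

Comparing the vectorization reports or disassembly of these loops under each compiler is one way to be specific about where a slowdown comes from.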
I've seen many cases where the inclusion or exclusion of statements that are NOT used (not even close to the function) affects performance in the 3%-5% range (YMMV). On close inspection, the culprit seems to be changes in the placement of code. In particular, this happens if the movement causes a critical loop to span an additional cache line, if an instruction containing a branch spans a cache line, or if the major loop spans a page boundary (requiring an additional TLB entry).
Jim Dempsey
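The arithmetic behind this can be sketched as follows. The helpers are hypothetical, and the 64-byte cache line and 4096-byte page sizes used in the example are assumptions that are typical for x86 but are not stated in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if the byte range [addr, addr + size) crosses a
// multiple of `boundary` (e.g. 64 for a cache line, 4096 for a page).
bool crosses_boundary(std::uint64_t addr, std::uint64_t size,
                      std::uint64_t boundary) {
    assert(size > 0 && boundary > 0);
    return addr / boundary != (addr + size - 1) / boundary;
}

// Hypothetical helper: number of `boundary`-sized blocks that the byte range
// [addr, addr + size) touches.
std::uint64_t blocks_touched(std::uint64_t addr, std::uint64_t size,
                             std::uint64_t boundary) {
    assert(size > 0 && boundary > 0);
    return (addr + size - 1) / boundary - addr / boundary + 1;
}
```

For example, under the 64-byte assumption a 48-byte loop body starting at offset 0x10 occupies one cache line, while the same loop shifted by 16 bytes to offset 0x20 occupies two. An unrelated statement elsewhere in the file can cause that kind of shift.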
The gcc 7.0 source was released. 7.1 is the trunk development version, which seems stable. There seem to be more changes than usual from 7.0 to 7.1, but it appears to build OK for the usual Windows targets, if you deal with oddities such as gfortran configuring for mingw 64-bit when 64-bit Windows is detected, and the gfortran dynamic library having a version change in its name due to an ABI change, besides the C++ changes listed on the official page. I wouldn't be surprised if there are no good pre-built Windows binaries for gcc 7.x.

