Hello Experts,
I have a well-established code that essentially multiplies complex matrices. Compiling this code with icpc, I find that it runs at best two times slower than the version compiled with gcc 4.3. I have tried Intel 10.1, 11.0 and 11.1 and many optimization flags along the lines of
-O3 -Oi -fno_alias -ftz -funroll-all-qloops -ipo -parallel -rcd -fp fast
with little improvement.
To shed some light on this, I profiled the executables with gprof. For g++ it returns
>more gprof_gcc.output
Flat profile:
 %    cumulative   self                self    total
time    seconds   seconds     calls   s/call  s/call  name
21.35     11.70     11.70   596035584   0.00    0.00  cmplx::operator*(cmplx const&) const
19.69     22.49     10.79  1072674322   0.00    0.00  cmplx::cmplx(double const&, double const&)
 7.94     26.84      4.35   397357056   0.00    0.00  cmplx::operator+(cmplx const&) const
 7.76     31.09      4.25    22075392   0.00    0.00  matrix<3, cmplx >::operator*(matrix<3, cmplx > const&) const
 5.58     34.15      3.06    72899318   0.00    0.00  matrix<3, cmplx >::~matrix()
 5.53     37.18      3.03  1729259746   0.00    0.00  cmplx::~cmplx()
 4.64     39.73      2.55   449710128   0.00    0.00  cmplx::operator=(cmplx const&)
 4.64     42.27      2.54    48866662   0.00    0.00  matrix<3, cmplx >::matrix()
...
You see that the most time-consuming function is the multiplication of two complex values, which seems plausible to me, since my code does little else, as I said. In general, the algebraic functions occupy the top ranks, as I would expect.
Now, for Intel icpc we have
>more gprof_intel2.output
Flat profile:
 %    cumulative   self                self    total
time    seconds   seconds     calls   s/call  s/call  name
15.55     18.92     18.92  1073063440   0.00    0.00  cmplx::~cmplx()
13.37     35.19     16.27  1072674322   0.00    0.00  _ZN5cmplxIfEC9ERKdS2_
11.97     49.76     14.57   596035584   0.00    0.00  cmplx::operator*(cmplx const&) const
 9.58     61.43     11.66    22075392   0.00    0.00  matrix<3, cmplx >::operator*(matrix<3, cmplx > const&) const
 8.45     71.71     10.28  1729259748   0.00    0.00  _ZN5cmplxIfED9Ev
 7.73     81.12      9.41  1072674322   0.00    0.00  cmplx::cmplx(double const&, double const&)
 6.00     88.42      7.31   397357056   0.00    0.00  cmplx::operator+(cmplx const&) const
 3.97     93.25      4.83                              cmplx::cmplx()
 3.69     97.74      4.49   449710128   0.00    0.00  cmplx::operator=(cmplx const&)
 2.26    100.49      2.75   448016038   0.00    0.00  _ZN5cmplxIfEC9Ev
...
You see that complex multiplication only comes in third. The complex destructor and another, mangled function have moved up and seem to consume a lot of time!
Now I have two issues:
1) Do you see where the problem could lie? Can you give me hints on where to look to improve the icpc performance, whether through an alteration of the code or a compile flag I have not thought of?
2) "c++filt _ZN5cmplxIfEC9ERKdS2_" does not work for me. Can c++filt decipher the Intel name mangling, and if not, can you tell me another way to do so?
Thank you so much for your help in advance! If need be, I am eager to provide more information.
1 Reply
Complex multiplication would need to be expanded inline for satisfactory performance, including enabling vectorization.
icpc tends to reserve optimization of complex matrix multiplication for the options -O3 -xSSE3, with restrict qualifiers where appropriate, and a preference for C99 complex. When no information on data extents is provided, it optimizes for something on the order of 50x50; #pragma loop count min(10) avg(20) max(30), for example, can tune a designated for() for smaller sizes. Since MKL provides BLAS, that is the quick route to high performance; prefer it over elegance when the data extents are moderate to large.
You appear to be trying a random mixture of Linux and Windows compiler options.