Replacing gcc with icc, but running into a performance problem

zhang_s_2 · ‎07-22-2015

Hi, all,

Our team is trying to replace gcc, which we have used for years, with icc. But we find that the executable compiled with icc performs worse than the one compiled with gcc.

We used a script to generate some small demos to test this; they are all in the attached test.tar.gz.

Test machine CPU: Intel(R) Xeon(R) CPU E7-4850 v2 @ 2.30GHz
OS: Centos 6.6
GCC: 4.7.2
ICC: parallel_studio_xe_2015_update3

test.tar.gz contains the folders N10, N50, N100, N250 and N500. Each folder contains an f10A.c file, which is auto-generated by genf2A.pl. Running make builds f10Agcc and f10Aicc, and run.csh runs them and compares the results.

The results are as follows:

Test   GCC      ICC
N10    0.384    0.291
N50    2.107    2.155
N100   5.429    5.648
N250   15.397   23.575
N500   38.949   58.672

For small N, icc is faster (N10) or roughly on par with gcc (N50, N100), but when N>100, and especially at N=500, icc is much slower than gcc.

How can we improve the icc performance? Are there special compile options we should add?
If anyone knows, please tell me. Thanks very much!

zhang_s_2 · ‎07-23-2015

Can anyone help?

TimP · ‎07-24-2015

It's not running for me with either icc or gcc.

KitturGanesh · ‎07-24-2015

Hi Zhang/Tim,
I could reproduce the performance issue with the icc product release on an equivalent RHEL system (I don't have access to CentOS). It also shows up with the other optimization switches I tried, and it looks like an issue in HPO.

However, when I tried the latest ICC beta release (update 2), ICC was faster than GCC. So please download the latest beta release and try it out. Here is the run info I got:
-----------------------------------------------------
[%N500]$ icc -V
Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 16.0.0.069 Beta Build 20150527

[%N500]$ ./run.csh
time ./f10Agcc
result[998]=1.993884889683e+03
result[999998]=1.994003389023e+09
result[999999]=1.994007377033e+09
real 1m25.668s
user 1m25.461s
sys 0m0.016s

time ./f10Aicc
result[998]=1.993884889683e+03
result[999998]=1.994003389023e+09
result[999999]=1.994007377033e+09
real 0m51.074s
user 0m50.837s
sys 0m0.122s

[%N500]

===================================

_Kittur

zhang_s_2 · ‎07-26-2015

Hi Kittur,

Thanks for your reply! The icc version I use is 15.0.3, from Parallel Studio XE 2015. ~~Where can I download the latest icc version? And when will the new icc version be added to the studio package?~~ (I have found the beta package.)

Thanks,

Shaohua

zhang_s_2 · ‎07-26-2015

And what does HPO mean? Where can I find the related documentation? Could we work around this HPO problem by using an older icc version?

Shenghong_G_Intel · ‎07-27-2015

@Shaohua,

The page below explains how to register for the 2016 beta: https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2016-beta

Regarding HPO, it is a compiler optimization term (high performance optimization). I would not suggest using an old icc version. It is better to try the v16.0 beta and wait for the release, which should come out in a few months.

Thanks,

Shenghong

zhang_s_2 · ‎07-29-2015

Our code contains a lot of conditional operators, like:

__igtemp_raw_mt2_mb2_mb1_mb2_mb2_f18 = ((int)(__igtemp_raw_mt18 + 0.5)) ? (((int)(__igtemp_raw_mt2_mb2_mb1_mb2_mb2_f1_mt18 + 0.5)) ? __TRatio8 : 1.0e-38) : 0.0;
__T28 = (((int)((((- __MJD_va8) * log( (((int)((0.1 > 1.0e-38) + 0.5)) ? 0.1 : 1.0e-38) )) > 80.0) + 0.5)) ? (5.540622384e+34 * ((1.0 + ((- __MJD_va8) * log( (((int)((0.1 > 1.0e-38) + 0.5)) ? 0.1 : 1.0e-38) ))) - 80.0)) : (((int)((((- __MJD_va8) * log( (((int)((0.1 > 1.0e-38) + 0.5)) ? 0.1 : 1.0e-38) )) < (- 80.0)) + 0.5)) ? 1.804851387e-35 : exp( ((- __MJD_va8) * log( (((int)((0.1 > 1.0e-38) + 0.5)) ? 0.1 : 1.0e-38) )) )));

How can we speed up these conditional operations? If this can be resolved, I think the performance problem should go away.

Can anyone help?

TimP · ‎07-30-2015

You depend to a large extent on compiler constant propagation, so that the conditionals may be evaluated at compile time. It's a strange combination of int results of conditionals and doubles. I would expect any compiler to miss many opportunities to optimize so far off the beaten path. Even the possibly important compile-time evaluation of log(constant expression) may be missed.

Besides that, immediately following, you have a sequence of divides by double constant which could be optimized (in violation of C standard). If gcc replaces /scale by *(1./scale) that alone would make the difference.

Under Fortran rules, it would be conceivable to replace

    sum +=__T28/scale;
    sum +=ccc8/scale;
    sum +=__igtemp_raw_mt2_mb2_mb1_mb2_mb2_f18/scale;

by

    sum +=(__T28 +ccc8 +__igtemp_raw_mt2_mb2_mb1_mb2_mb2_f18)*(1./scale);

but in C these optimizations have to be applied selectively under options which encourage standard violations. Reciprocal replacement (gcc -freciprocal-math, icc -no-prec-div) is the most common one. I suppose icc doesn't do that outside loops which are candidates for vectorization. I don't think your options include gcc -fassociative-math (usually invoked indirectly by -ffast-math).

Such compile options are likely to go wrong in such a complicated expression. So it's far better if the source code passed to the compiler is already closer to what you intend.

icc has fairly stringent limits (many of them with options to modify) where optimizations may be discarded as the source file grows in size, particularly if compiling for 32-bit mode, where the compiler internal tables might grow beyond the addressing limit. I don't know whether all such situations will throw a compile-time warning. If that happens, you may get better results by setting a lower level of optimization from the beginning. icc -O1 -fp-model source may be better when vectorization isn't of practical use.

Kittur suggested use of the Intel64 16.0 compiler. No doubt, it may have overcome some of these issues.

KitturGanesh · ‎07-30-2015

@Tim's suggestions and comments make excellent sense and are worth considering for any further optimization. Yes, 16.0 has enhanced optimization implementations and should be used after making code changes where necessary per Tim's input, especially with the options he mentions, which is what I had in mind as well. (Thanks Tim)

_Kittur

zhang_s_2 · ‎08-05-2015

Hi Tim, Kittur,

Thanks for your replies!

I have checked the code and found something strange. The calculation part takes most of the run time. Taking N500 as an example, it has 500 calculation sections, each of which looks like this:

if (aaa370 > bbb370 ) {
  if (aaa370 < ddd370 ) {
    if (aaa370 > 0.001 ) {
      ccc370 += ((int)(aaa370-ddd370))?(ccc370*ddd370+aaa370):(0.0);
    }
    else {
      ccc370 -= ((int)(aaa370*ddd370))?(ccc370*ddd370+aaa370):(0.0);
    }
    ccc370 += ((int)(aaa370-ddd370))?(ccc370*ddd370+aaa370):(0.0);
    ccc370 += ((int)(ccc370+ddd370))?(ccc370+ddd370+aaa370):(0.0);
    ccc370 += ((int)(ccc370+2*ddd370))?(ccc370-ddd370+aaa370):(0.0);
    ccc370 += ((int)(ccc370*ddd370))?(ccc370+ddd370-aaa370):(0.0);
  }
  else {
    ccc370 += ((int)(ccc370+ddd370))?(ccc370*ddd370+aaa370):(0.0);
    ccc370 += ((int)(ccc370*ddd370))?(ccc370*ddd370-aaa370):(0.0);
    ccc370 += ((!((int)(ccc370+0.1)))&&(!((int)(ddd370+0.2))))?(ccc370*ddd370-aaa370):(0.0);
    ccc370 += ((!((int)(ccc370+0.5)))&&(!((int)(ddd370+0.5))))?(ccc370*ddd370-aaa370):(0.0);
    ccc370 += ((!((int)(ccc370+0.6)))&&(!((int)(ddd370+0.1))))?(ccc370+ddd370-aaa370):(0.0);
  }
  __igtemp_raw_mt2_mb2_mb1_mb2_mb2_f1370 = ((int)(__igtemp_raw_mt1370 + 0.5)) ? (((int)(__igtemp_raw_mt2_mb2_mb1_mb2_mb2_f1_mt1370 + 0.5)) ? __TRatio370 : 1.0e-38) : 0.0;
  __T2370 = (((int)((((- __MJD_va370) * log( (((int)((0.1 > 1.0e-38) + 0.5)) ? 0.1 : 1.0e-38) )) > 80.0) + 0.5)) ? (5.540622384e+34 * ((1.0 + ((- __MJD_va370) * log( (((int)((0.1 > 1.0e-38) + 0.5)) ? 0.1 : 1.0e-38) ))) - 80.0)) : (((int)((((- __MJD_va370) * log( (((int)((0.1 > 1.0e-38) + 0.5)) ? 0.1 : 1.0e-38) )) < (- 80.0)) + 0.5)) ? 1.804851387e-35 : exp( ((- __MJD_va370) * log( (((int)((0.1 > 1.0e-38) + 0.5)) ? 0.1 : 1.0e-38) )) )));
  sum +=__T2370/scale;
  sum +=ccc370/scale;
  sum +=__igtemp_raw_mt2_mb2_mb1_mb2_mb2_f1370/scale;
}

All the calculation sections have the same format. In my tests, icc is slower than gcc when all 500 sections are compiled, but if I comment out some sections and keep, say, only 100, icc is faster than gcc.

Sections   GCC      ICC
50         4.366    4.185
100        11.554   10.542
500        88.3     119.2

(With 500 sections, the icc compile time also becomes very long.)

Is the code size too big? I have also tested with -O1, but that is even worse.
You can take the code and try it yourself; I hope you can give me some feedback.

Thanks,
Shaohua

KitturGanesh · ‎08-06-2015

I tried again on my system (details below):
Sandy Bridge
RHEL 6.2
gcc (GCC) 4.8.1
%icc -V
Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 16.0.0.069 Beta Build 20150527

Adding the options -no-prec-div and -qopt-prefetch increased the speed a bit. So you can try on a similar system with the latest beta version and see if it helps. Otherwise, FYI, I need to be able to reproduce the problem before I can file an issue with the developers.

[%N500]$ make
gcc -O3 f10A.c -o f10Agcc -lm
icc -O3 f10A.c -qopt-prefetch -no-prec-div -o f10Aicc

[kganesh1@dpdmic09 N500]$ ./run.csh
time ./f10Agcc
result[998]=1.993884889683e+03
result[999998]=1.994003389023e+09
result[999999]=1.994007377033e+09

real 1m25.201s
user 1m25.044s
sys 0m0.016s
time ./f10Aicc
result[998]=1.993884889683e+03
result[999998]=1.994003389023e+09
result[999999]=1.994007377033e+09

real 0m50.358s
user 0m50.228s
sys 0m0.028s

Regards,
Kittur

zhang_s_2 · ‎08-06-2015

Hi Kittur,

Thanks for your reply. I will find a similar machine to test it on.

zhang_s_2 · ‎08-07-2015

Hi Kittur,

I have tried your suggestion, but '-qopt-prefetch -no-prec-div' does not help in my case.

1.
CPU: I7-4790
OS: centos5.9
$ make
gcc -O3 f10A.c -o f10Agcc -lm
icc -O3 f10A.c -o f10Aicc -lm
icc -O3 f10A.c -o f10Aicc2 -qopt-prefetch -no-prec-div -lm
$ ./run.csh
time ./f10Agcc
result[998]=1.993884889683e+03
result[999998]=1.994003389023e+09
result[999999]=1.994007377033e+09

real 0m20.674s
user 0m20.670s
sys 0m0.001s
time ./f10Aicc
result[998]=1.993884889683e+03
result[999998]=1.994003389023e+09
result[999999]=1.994007377033e+09

real 0m20.574s
user 0m20.572s
sys 0m0.001s
time ./f10Aicc2
result[998]=1.993884889683e+03
result[999998]=1.994003389023e+09
result[999999]=1.994007377033e+09

real 0m21.012s
user 0m21.003s
sys 0m0.005s
(It becomes worse on this i7 machine.)

2.
CPU: E7-4850 v2
OS: centos6.6
$ make
gcc -O3 f10A.c -o f10Agcc2 -lm
icc -O3 f10A.c -o f10Aicc -lm
icc -O3 f10A.c -qopt-prefetch -no-prec-div -o f10Aicc2 -lm
$./run.csh
time ./f10Agcc2
result[998]=1.993884889683e+03
result[999998]=1.994003389023e+09
result[999999]=1.994007377033e+09

real 0m54.031s
user 0m53.982s
sys 0m0.011s
time ./f10Aicc
result[998]=1.993884889683e+03
result[999998]=1.994003389023e+09
result[999999]=1.994007377033e+09

real 1m4.764s
user 1m4.703s
sys 0m0.016s
time ./f10Aicc2
result[998]=1.993884889683e+03
result[999998]=1.994003389023e+09
result[999999]=1.994007377033e+09

real 1m4.260s
user 1m4.207s
sys 0m0.009s
(Why is the performance so bad on the Xeon CPU?)

Could you please check whether this is a real performance issue? I will also try to get a Sandy Bridge machine with RHEL 6.2 to test on.

Thanks,
Shaohua

jimdempseyatthecove · ‎08-07-2015

Untested code:

if (aaa370 < ddd370 ) {
  // (int)(aaa370-ddd370) is tested twice below, so build its 0.0/1.0 mask once
  double temp = (double)(((int)(aaa370-ddd370))!=0);
  if (aaa370 > 0.001 ) {
    // ergo ddd370 > 0.001
    // ccc370 += ((int)(aaa370-ddd370))?(ccc370*ddd370+aaa370):(0.0);
    ccc370 += (ccc370*ddd370+aaa370) * temp;
  }
  else {
    // ccc370 -= ((int)(aaa370*ddd370))?(ccc370*ddd370+aaa370):(0.0);
    ccc370 -= (ccc370*ddd370+aaa370) * (double)(((int)(aaa370*ddd370))!=0);
  }
  // ccc370 += ((int)(aaa370-ddd370))?(ccc370*ddd370+aaa370):(0.0);
  ccc370 += (ccc370*ddd370+aaa370) * temp;
  // ccc370 += ((int)(ccc370+ddd370))?(ccc370+ddd370+aaa370):(0.0);
  temp = (double)(((int)(ccc370+ddd370))!=0);
  ccc370 += (ccc370+ddd370+aaa370) * temp;
  // ccc370 += ((int)(ccc370+2*ddd370))?(ccc370-ddd370+aaa370):(0.0);
  temp = (double)(((int)(ccc370+2*ddd370))!=0);
  ccc370 += (ccc370-ddd370+aaa370) * temp;
  // ccc370 += ((int)(ccc370*ddd370))?(ccc370+ddd370-aaa370):(0.0);
  temp = (double)((((int)(ccc370*ddd370))) != 0);
  ccc370 += (ccc370+ddd370-aaa370) * temp;
}
else
{
  ...

Jim Dempsey

jimdempseyatthecove · ‎08-07-2015

And I'd suggest extending the technique to the mess that is in the else clause ... above.

Jim Dempsey

KitturGanesh · ‎08-07-2015

Well put Jim, the if/else clause needs to be cleaned up to ensure the repeated expressions aren't executed multiple times. I am still not able to reproduce the issue on my system with the existing code vs. gcc :-(

Shaohua, could you please try on an equivalent SNB system with the gcc configuration etc. that I noted earlier, and also try again after making the code changes Jim suggested?
_Kittur

jimdempseyatthecove · ‎08-07-2015

Kittur,

It is not so much an issue of eliminating repeated expressions in the source code. Compiler optimizations will (should) do that for you already. What my code suggestion actually does, and where its benefit comes from, is produce straight-line code with no branching, or as little as possible. Note that the logical test should produce 0/1, and the conversion to 0.0 or 1.0 should happen without branching. Floating point multiplication is now a relatively low-cost instruction, whereas branches, though they take 1 cycle to execute, may induce a processor pipeline stall.

Zhang,

If you can extend the technique further and produce completely straight-line code, then this code can be vectorized, and if your code is suitable for vectorization this may yield additional performance improvements.

Jim Dempsey

KitturGanesh · ‎08-07-2015

I agree Jim, you're right: the compiler should do that automatically, and it is very important to write straight-line code. From a code readability point of view, factoring out the repeated expressions is what I meant to indicate but didn't!

_Kittur

zhang_s_2 · ‎08-10-2015

Hi Jim, Kittur,

Thanks for your suggestions, but for now we are not going to change our code. What we want is to find the root cause of icc being slower than gcc on our E7 v2 machine.
(Suppose we change the code and the icc-compiled version gets faster. I think the gcc-compiled version will get faster too, and if icc is still slower than gcc, why should I replace gcc?)

Hi Kittur,

We found a Sandy Bridge machine to test on, but the result is bad:
CPU: I7-2600
GCC: 4.7.2
OS: centos 5.5

$./run.csh
time ./f10Agcc
result[998]=1.993884889683e+03
result[999998]=1.994003389023e+09
result[999999]=1.994007377033e+09

real 0m33.987s
user 0m33.961s
sys 0m0.002s
time ./f10Aicc
result[998]=1.993884889683e+03
result[999998]=1.994003389023e+09
result[999999]=1.994007377033e+09

real 0m42.087s
user 0m42.072s
sys 0m0.007s

If you think I should install RHEL 6.2 and gcc 4.8.1, please let me know and I can try that.

Actually, the performance is fine on our i7-4790 (Haswell) machine, as the results in my earlier post show.

But the most common and most important run environment for our application is the E7-4850 v2 machine (Ivy Bridge), so please help us resolve the problem on this machine. (Or are there special compile options for Ivy Bridge CPUs?)

Thanks,
Shaohua

jimdempseyatthecove · ‎08-10-2015

>>Thanks for your suggestions, but for now we are not going to change our code. What we want is to find the root cause of icc being slower than gcc on our E7 v2 machine.
(Suppose we change the code and the icc-compiled version gets faster. I think the gcc-compiled version will get faster too, and if icc is still slower than gcc, why should I replace gcc?)<<

The time difference (~5%) can generally be accounted for by two causes:

a) code generation by the compiler (you verify this by looking at the disassembly, say from VTune or debugger)
or more nefariously
b) code placement (you verify this by looking at the disassembly, say from VTune or debugger)

I've experienced many cases where code placement affects performance by as much as 5%.
For code placement, you would inspect the placement to see how many cache lines the critical loop occupies, and whether those cache lines cross a page boundary. If that is the case, then you might be able to use the code alignment pragmas to correct the behavior.

The point of my optimization suggestion is not to fix the Intel code per se, rather, it is to improve your performance in general (for both compilers).

Jim Dempsey