Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Very slow compilation

eliosh
Beginner
1,234 Views
In some cases compilation is unreasonably slow.
For example, a simple compilation of the attached file takes about 650 times longer than gcc needs:

> time icc ibm2.c

real 1m3.930s
user 1m2.690s
sys 0m0.320s


> time gcc ibm2.c

real 0m0.091s
user 0m0.050s
sys 0m0.040s



This is the latest (11.1.072) version of icc for Linux (64-bit).


> icc -V
Intel C Intel 64 Compiler Professional for applications running on Intel 64, Version 11.1 Build 20100414 Package ID: l_cproc_p_11.1.072
Copyright (C) 1985-2010 Intel Corporation. All rights reserved.
FOR NON-COMMERCIAL USE ONLY

0 Kudos
13 Replies
Dale_S_Intel
Employee
1,236 Views
Well, sure enough, I'm seeing the same problem. Thanks for the self-contained test case; this is exactly the kind of problem we hope to find (and solve, cross my fingers :-). I will definitely look into it and update here.

Thanks!
Dale
0 Kudos
mecej4
Honored Contributor III
1,236 Views
I can reproduce the problem with 11.1.069 (x64) on SUSE 11.1. However, with -O0 or -O1 the compile time drops to less than 1 second.
0 Kudos
TimP
Honored Contributor III
1,236 Views
That's a fair point when the comparison is against gcc -O0. Even if you set the nearest gcc equivalents of the icc defaults, using a current gcc which implements auto-vectorization, something like
gcc -O3 -ffast-math -fno-cx-limited-range -fno-strict-aliasing -funroll-loops --param max-unroll-times=2
gcc still isn't going to attempt all the optimizations which the OP has requested under icc.
Going the other way, icc -O1 would be roughly equivalent to gcc -O2 -ffast-math -fno-cx-limited-range -fno-strict-aliasing.
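
To make the timing comparison a bit more apples-to-apples on the attached file, one could try something along these lines (the option mapping is only approximate):

> time icc ibm2.c
> time gcc -O3 -ffast-math -fno-cx-limited-range -fno-strict-aliasing -funroll-loops --param max-unroll-times=2 ibm2.c
> time icc -O1 ibm2.c
> time gcc -O2 -ffast-math -fno-cx-limited-range -fno-strict-aliasing ibm2.c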

It looks like the original code is not intended for any practical purpose, only to see whether a compiler can be provoked into attempting optimization beyond the bounds of sanity. You can't fault any compiler for taking longer with aggressive options set than gcc takes with optimizations disabled.
We have had continual arguments over whether it should be necessary, but the fact remains that large parts of some commercial applications have to be built with icc cut back to -O1, where gcc options -O3 or -ffast-math would never be considered.
0 Kudos
mecej4
Honored Contributor III
1,236 Views
Tim, I don't know if you remember: I made a similar complaint in comp.lang.fortran regarding IFort, and requested feedback concerning a switch that would say to the compiler, "do a great job of optimization, I know you are very capable, but don't kill yourself at it!" You shot the suggestion down, and it was harder for me to make a case since the source files in question were several hundred kilobytes long and were covered by a non-disclosure agreement.

In the present case, as well, I don't think that the user, who did not write down any compiler switches explicitly, was conscious of all the optimizations that 'he had requested'. Had he known what they were, and what the likely cost was going to be, he might have looked for switches to get within 95% of the best optimization possible, or something similar.

However, the fact that the C example is less than 1 kbyte long and took over a minute to compile casts new light on the issue. Perhaps, if the back ends of the C and Fortran compilers share some common portions, we shall all benefit when the developers work on it.
0 Kudos
TimP
Honored Contributor III
1,236 Views
I think there are conflicting priorities here which will never be resolved.
What will give 95% is highly application dependent.
gcc users are expected to know that they should set switches when they want optimized code.
icc marketers want the compiler to show to best advantage when used by the people who write magazine articles based on trivial benchmarks with default compiler options, and when running SPEC baseline benchmarks. Those goals conflict to some extent with usability, and gcc is in a better position, as it clearly puts those goals at a lower priority than usability.
Another conflict is posed by the desire to have consistent defaults between Linux and Windows. It is felt strongly that the icc default must be competitive with the VC default and include auto-vectorization, since there is no equivalent VC option. The inexplicable part is the decision to make /fp:fast inconsistent with VC.
At one time, there was an effort to persuade the Intel compiler team to adopt a simple option which would be consistent with typical gcc options like -O2. This failed. The picture has changed now that gcc includes auto-vectorization in -O3. I'd like to see more coverage in the documentation of icc options equivalent to the options normally used in the reference compiler, but that has become more difficult as those compilers have evolved to include more optimization.
On the Fortran side, Steve Lionel has agreed that some of the standard compliance options should be used in normal practice, and he has made an effort to consolidate them.
0 Kudos
barragan_villanueva_
Valued Contributor I
1,236 Views
Maybe the Intel compiler should introduce a new set of optimization options (say --gnu-O0 ... --gnu-O3) which somehow correspond to the analogous gcc optimization levels.
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,236 Views
Maybe the Intel compiler should introduce a new set of optimization options (say --gnu-O0 ... --gnu-O3) which somehow correspond to the analogous gcc optimization levels.


Sounds good on the surface, however...

Considering that the incentive for the compiler vendor is to produce

"Our -On produces faster code than their -On"

the incentive would then be for Vendor A's interpretation of Vendor B's equivalent options to be a selection of options that produces lower-performing code.

The comparable options would be best selected from an un-biased (neutral) party.

Please note, the interpretation of -On is

-O0 is no optimization
-O1 is more effort than -O0
-O2 is more effort than -O1
-O3 is more effort than -O2
-Oother is additional features beyond -O3

And the vendor is more or less free to choose which optimization features go into which level.

IMHO - it is the user's responsibility to select, from the available options, a set that trades off compile time against the performance returned (and quality of results).
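
For example (just a sketch; substitute your own source file for ibm2.c), you can get a feel for that trade-off by timing both the compile and the resulting executable at each level:

> time icc -O0 ibm2.c ; time ./a.out
> time icc -O1 ibm2.c ; time ./a.out
> time icc -O2 ibm2.c ; time ./a.out
> time icc -O3 ibm2.c ; time ./a.out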

It is likely that on rare occasions a given sample code may cause a compiler to choke or produce erroneous results. Last week I encountered a problem where g++ could not handle a simple template that both icc and msvc had no problems with.

Jim Dempsey


0 Kudos
dpeterc
Beginner
1,236 Views
Please note, the interpretation of -On is

-O0 is no optimization
-O1 is more effort than -O0
-O2 is more effort than -O1
-O3 is more effort than -O2
-Oother is additional features beyond -O3

And the vendor is more or less free to choose which optimization features go into which level.

IMHO - it is the user's responsibility to select, from the available options, a set that trades off compile time against the performance returned (and quality of results).


My problem with Intel's choice of optimizations grouped under O0, O1, O2, O3 is not with the effort but with the result.
I mostly get the best results with -O1, while higher optimization levels take more time to compile and produce code which is both slower and bigger. Or much bigger for a negligible speedup.
Like with every toy or gadget, when the initial experimentation phase is over, one sticks with the options which have worked best so far, and can't continuously search the n-dimensional space of compiler options for a jackpot.
So I wish more time were devoted to making sure that optimization at a higher level is actually better, not just a particularly smart trick which sometimes works and sometimes doesn't; it is the programmer's responsibility to check.
In comparison, GCC very rarely behaves worse with higher optimization.


0 Kudos
jimdempseyatthecove
Honored Contributor III
1,236 Views
>>I mostly get the best results with -O1, while higher optimization levels take more time to compile and produce code which is both slower and bigger. Or much bigger for a negligible speedup.

This indicates your programs tend to perform better without loop unrolling. A small percentage of applications behave this way.

Try -Os

Jim
0 Kudos
TimP
Honored Contributor III
1,236 Views
Intel compilers usually optimize for loop trip counts of about 100 when no clear information on that subject is present in the source code. Turning off vectorization as well as unrolling, as -O1 has done since the Intel 11.0 compilers, is likely to improve performance of short loops, as well as speed up compilation. In principle, profile guided feedback should induce the compiler to optimize for trip counts in the training data set, at least when they don't vary much (but certainly doesn't address compilation speed).
As you apparently leave unrolling off when you use gcc, Jim's suggestion may be on target. gcc actually makes it more difficult to get a useful level of unrolling for loop trip counts of 20 or more.
-Os is intended to reduce generated code size at the expense of performance in comparison with -O1. It's possible it may perform relatively well when loop trip counts are 0 to 1.
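
For reference, the usual profile-guided flow with icc is roughly: compile with -prof-gen, run the instrumented binary on representative input, then recompile with -prof-use (myprog.c below is just a placeholder):

> icc -prof-gen myprog.c -o myprog
> ./myprog
> icc -prof-use myprog.c -o myprog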
0 Kudos
Om_S_Intel
Employee
1,236 Views
Though icc takes more time to compile, the sample test case runs about 4 times faster when compiled with the Intel compiler.

$ time icc tstcase.cpp

real 0m33.722s
user 0m33.647s
sys 0m0.053s

$ time ./a.out

count = 991

real 0m0.005s
user 0m0.003s
sys 0m0.001s

$ time g++ tstcase.cpp

real 0m0.091s
user 0m0.061s
sys 0m0.029s

$ time ./a.out

count = 991

real 0m0.019s
user 0m0.017s
sys 0m0.003s

$ uname -a

Linux maya11 2.6.29.4-167.fc11.x86_64 #1 SMP Wed May 27 17:27:08 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

$ icc -V

Intel C Intel 64 Compiler Professional for applications running on Intel 64, Version 11.1 Build 20100203 Package ID: l_cproc_p_11.1.069
Copyright (C) 1985-2010 Intel Corporation. All rights reserved.

0 Kudos
Om_S_Intel
Employee
1,236 Views

I have reported the issue to the Intel compiler development team. I will update this forum thread when there is an update on this.

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,236 Views
Run times on the order of 3 ms are too short for meaningful results.
Rework the main() so that it takes an input argument to use as a major loop count.
Based on your run times I would expect an iteration count of 1000 to produce more meaningful results.
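
Something along these lines would do; a rough sketch only, where do_work() is just a stand-in for the actual kernel in the test case (which I have not reproduced):

#include <stdio.h>
#include <stdlib.h>

/* stand-in for the original benchmark kernel */
static int do_work(void)
{
    volatile int count = 0;
    int i;
    for (i = 0; i < 1000000; ++i)
        if (i % 7 == 0)
            count = count + 1;
    return count;
}

int main(int argc, char **argv)
{
    /* major loop count taken from the command line, default 1000 */
    long reps = (argc > 1) ? strtol(argv[1], NULL, 10) : 1000;
    long r;
    int count = 0;
    for (r = 0; r < reps; ++r)
        count = do_work();  /* repeat the kernel so the total run time is measurable */
    printf("count = %d (reps = %ld)\n", count, reps);
    return 0;
}

Then time it with the repetition count as the argument, e.g. time ./a.out 1000.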

Jim
0 Kudos