Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

gcc-6 vs icc 18.0 performance (seeing no gain)

oss_compiler
Beginner

Does anyone have anecdotal data on the realistic gains to expect from icc 18.0 over gcc-6? After some painstaking effort, I tried both on a couple of open source projects and found no gain. I used -O2 with -axAVX,SSE2,SSE4.1 etc. [actually all instruction sets] and auto dispatch. It looks like ICC might work best for loopy code rather than branchy code that lacks nice loops over large data sets. Is that a reasonable assumption?

6 Replies
TimP
Honored Contributor III

It's most reasonable not to assume anything if you don't have time to characterize the code, at least to the extent of gathering icc opt-report (and, more conveniently, processing through Intel Advisor).  As gcc doesn't have a convenient auto dispatch, it seems more relevant to compare the compilers for a specific target ISA, and not give icc the potential handicap of having to select execution paths at run time, with associated code bloat.  Quoted gains for icc often are based on the use of more aggressive compile options, or better selection of unroll option, but that may not be a factor in your comparison.

Both gcc and icc may depend to some extent on feedback from training runs to optimize branchy code, and in some scenarios that may be more convenient with icc.  Recent gcc versions are quite good at run-of-the-mill auto-vectorization of plain loops.  Some open source projects may already be tuned to avoid gcc performance pitfalls.

oss_compiler
Beginner

"Quoted gains for icc often are based on the use of more aggressive compile options"

Can you provide some pointers on this? I used -O3 and auto dispatch. A few more things to add: separate builds targeting specific ISAs are too cumbersome for my use case, and code bloat is of no concern. It would be too hard to maintain different builds like that, and performance portability would also be a concern, as binaries could be moved between different SKUs.

Vladimir_P_1234567890
TimP
Honored Contributor III

The closest equivalent gcc option to icc -xHost default [-fp-model fast=1] might be gcc -O3 -ffast-math -fno-cx-limited-range -funroll-loops --param max-unroll-times=2 -march=native

Needless to say, this is not such a popular combination of options for gcc. In particular, -ffast-math in the past had a reputation for lack of safety (due in part to its inclusion of the cx-limited-range and in part to the buggy mathinline include files).  The -ffast-math is needed for auto-vectorization of reduction loops.  With icc, either the default -fp-model fast or use of #pragma omp simd reduction will enable those simd optimizations.
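To make the reduction point concrete, here is a minimal sketch (the function names are made up for illustration). The first loop is the kind of reduction that generally needs relaxed floating-point rules (gcc -ffast-math, or icc's default fast fp-model) before the compiler will vectorize it; the second states the permission in source with the OpenMP SIMD pragma mentioned above. I am assuming the pragma is enabled at build time, e.g. with -fopenmp-simd on gcc; without that it is simply ignored and the loop stays scalar.

```cpp
#include <cstddef>

// Plain reduction: under strict floating-point rules the additions must
// happen in source order, which usually prevents vectorization.
float sum_plain(const float* x, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

// Same loop, but the pragma tells the compiler that reordering the
// additions into SIMD partial sums is acceptable for this loop only.
// (Hypothetical helper name; requires OpenMP SIMD support to take effect.)
float sum_simd(const float* x, std::size_t n) {
    float s = 0.0f;
#pragma omp simd reduction(+ : s)
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}
```

The pragma version keeps the relaxation local to one loop instead of applying -ffast-math to the whole translation unit. Because the order of additions can change, results may differ in the last bits from the plain loop.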

Also, it is unusual to try to optimize gcc unrolling by use of the max-unroll-times=2 (or 4) clause.  The amount of unrolling can be fairly critical for recent Intel CPUs.  gcc only now is adding a #pragma unroll in the development branch. icc also performs (sometimes excessive) aggressive riffling of reductions, which I haven't seen a way to duplicate with gcc.  As a result, icc may see a gain for AVX2 where gcc needs -mno-fma to avoid a loss.
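As I read it, "riffling" a reduction means splitting it across several independent accumulators so that consecutive adds (or FMAs) do not have to wait on each other. The sketch below writes that by hand for a dot product; the function name and the choice of four accumulators are assumptions for illustration, not what icc actually emits.

```cpp
#include <cstddef>

// Dot product with four independent partial sums ("riffled" by hand).
// Each accumulator forms its own dependency chain, so the CPU can keep
// several multiply-adds in flight at once.
double dot_riffled(const double* a, const double* b, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3);      // combine the partial sums
    for (; i < n; ++i) s += a[i] * b[i];   // remainder elements
    return s;
}
```

This reorders the floating-point additions, so it is only equivalent to the simple loop up to rounding, which is why a compiler needs a relaxed fp model before doing it automatically.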

Intel compilers also are more aggressive about auto loop nest switching than gcc.  This is a point where fully developed open source applications are likely to be optimized in source and not need the icc optimizations.
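A small sketch of what loop nest switching (loop interchange) means in source terms, assuming a row-major n x n matrix stored in a flat vector; the function names are hypothetical. A compiler that interchanges loops would turn the first form into the second on its own, whereas a tuned open source project has typically already written the second form.

```cpp
#include <cstddef>
#include <vector>

// Row-major storage: element (i, j) lives at m[i * n + j].

// Column-first traversal: the inner loop jumps n elements per step,
// which is cache-unfriendly and hard to vectorize.
double sum_col_first(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += m[i * n + j];
    return s;
}

// Interchanged nest: the inner loop now walks contiguous memory.
double sum_row_first(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += m[i * n + j];
    return s;
}
```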

Auto-dispatch builds are slower than a build targeted for the right single ISA.  The i-cache miss rate increases, and there is an overhead at each function call for selecting among multiple code paths.  If you are lucky, the difference may be negligible.  The compiler may prune your list down to 3 or fewer code paths, so you should give some thought to which code paths you may actually need (such as SSE3 and AVX).  We have seen situations where adding an SSE4 path to an auto-dispatch build actually slowed an application.  There are cases where #pragma vector [always|aligned] could make an SSE3 build run as fast as SSE4 (because the compiler thought that vectorization might be counter-productive for the earliest SSE3 platforms).
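For readers who have not looked inside a multi-path build, here is a rough hand-written sketch of the dispatch idea. It is not how icc -ax implements it; it assumes gcc or clang on x86 (it uses their `__builtin_cpu_supports` builtin and `target` attribute), and all function names are made up. It shows where the two costs come from: every path exists in the binary, and each call goes through a selection step.

```cpp
#include <cstddef>
#include <cstdint>

// Baseline path, compiled for the default target ISA of the build.
static std::int64_t sum_baseline(const std::int32_t* x, std::size_t n) {
    std::int64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define HAVE_AVX2_PATH 1
// Same source, but the compiler may use AVX2 instructions inside this
// one function. This duplicate body is the "code bloat" of dispatching.
__attribute__((target("avx2")))
static std::int64_t sum_avx2(const std::int32_t* x, std::size_t n) {
    std::int64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}
#endif

using sum_fn = std::int64_t (*)(const std::int32_t*, std::size_t);

// Choose a code path based on what the running CPU reports.
static sum_fn select_sum() {
#ifdef HAVE_AVX2_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_baseline;
}

std::int64_t sum_dispatch(const std::int32_t* x, std::size_t n) {
    static const sum_fn impl = select_sum();  // resolved on first call
    return impl(x, n);                        // indirect call on every use
}
```

With a single-ISA build (for example -xHost or -march=native) the compiler can call or inline the one implementation directly, which is the difference being described above.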

oss_compiler
Beginner

Thanks Tim. I appreciate you taking the time to provide a detailed response. I will do more experiments and get back to this thread.

Xu__Jianwei
Beginner

Hi, I have the same problem. I want to evaluate icc performance before buying it, so I downloaded "Parallel Studio XE Cluster Edition for Linux" as a trial. But after much effort, I don't see any performance gain over gcc.

I installed PSXE on Red Hat 7 (3.10.0-693.el7.x86_64) and ran the SPEC CPU2006 INT tests on it. For both gcc and icc, I used the default compile options for a fair comparison.

The following output excerpts are taken from the gcc and icc runs. For gcc:

                                  Estimated                       Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
-------------- ------  ---------  ---------    ------  ---------  ---------
400.perlbench      --      3.52          -- S
401.bzip2          --      4.34          -- S
403.gcc            --      0.951         -- S
429.mcf            --      2.09          -- S
445.gobmk          --     12.2           -- S
456.hmmer          --      1.95          -- S
458.sjeng          --      2.82          -- S
462.libquantum     --      0.058         -- S
464.h264ref        --      8.52          -- S
471.omnetpp        --      0.338         -- S
473.astar          --      6.85          -- S
483.xalancbmk     100         --            CE

For icc:

                                  Estimated                       Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
-------------- ------  ---------  ---------    ------  ---------  ---------
400.perlbench      --      3.49          -- S
401.bzip2          --      4.26          -- S
403.gcc            --      0.933         -- S
429.mcf            --      2.12          -- S
445.gobmk          --     12.0           -- S
456.hmmer          --      2.19          -- S
458.sjeng          --      2.79          -- S
462.libquantum     --      0.043         -- S
464.h264ref        --      8.58          -- S
471.omnetpp        --      0.317         -- S
473.astar          --      6.70          -- S
483.xalancbmk     100         --            CE

The detailed output is also attached. So my question is: why does icc show no gain over gcc? Is there anything more I need to do?
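One way to summarize the two tables is the per-benchmark ratio of gcc run time to icc run time and its geometric mean. The sketch below hard-codes the eleven run times that completed in the tables above (483.xalancbmk is left out because it shows CE in both runs); the struct and function names are made up for illustration.

```cpp
#include <cmath>
#include <cstddef>

struct Row {
    const char* name;
    double gcc_s;  // run time from the gcc table
    double icc_s;  // run time from the icc table
};

static const Row kRows[] = {
    {"400.perlbench",  3.52,  3.49},  {"401.bzip2",       4.34,  4.26},
    {"403.gcc",        0.951, 0.933}, {"429.mcf",         2.09,  2.12},
    {"445.gobmk",     12.2,  12.0},   {"456.hmmer",       1.95,  2.19},
    {"458.sjeng",      2.82,  2.79},  {"462.libquantum",  0.058, 0.043},
    {"464.h264ref",    8.52,  8.58},  {"471.omnetpp",     0.338, 0.317},
    {"473.astar",      6.85,  6.70},
};

// Geometric mean of gcc_time / icc_time.
// A value above 1 means icc was faster on balance; below 1, gcc was.
double geomean_speedup(const Row* rows, std::size_t n) {
    double log_sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        log_sum += std::log(rows[i].gcc_s / rows[i].icc_s);
    return std::exp(log_sum / static_cast<double>(n));
}
```

On these numbers the geometric mean comes out only slightly above 1, and most individual ratios are within a couple of percent. Several of the run times are well under a second, so differences of that size may be within measurement noise.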

Supplementary note: I have run the install procedure many times, but I always get the following error in the generated documentation:

The Help for Intel® C++ Compiler is not available

You have reached this page due to one of the following reasons:  

Reason: Intel® C++ Compiler is not installed.
What to do to resolve: Consult with your system administrator, or invoke the installer to install Intel C++ Compiler on your system.

Reason: Intel C++ Compiler is installed, but JavaScript is disabled on your browser.
What to do to resolve: Click: Intel C++ Compiler

Reason: Intel C++ Compiler is installed in an unexpected location.
What to do to resolve: Open the Compiler Help from the installation directory. Go to <install_dir>/documentation_2018/en/compiler_c/ps2018/get_started_lc.htm

Is this the cause? I can use icc without any error messages, though.

 

 

 

 
