Beginner

Benchmark problem: ICC compiler option '-xHost' slows down execution speed


While preparing two benchmarks on a 4-socket/60-core Xeon server (E7-4890v2, HyperThreading and TurboBoost enabled; RHEL 7, Transparent Huge Pages activated), cf. topic 515900, I observed that combining the optimization options '-O2 -ansi-alias' with '-xHost' (ICC 14.0.2) increases the execution time of the OpenMP-parallelized "Benchmark B":

 

==  'icpc -ansi-alias -O2 -xHost' ...  ====================================

----   60 threads  --------------------------------------------------------

         real   13m34.796s      user   805m26.873s      sys   1m33.328s

         real   13m35.425s      user   805m51.402s      sys   1m33.418s

         real   13m35.406s      user   805m39.981s      sys   1m35.493s

----  120 threads  --------------------------------------------------------

         real    13m8.188s      user  1553m54.815s      sys   2m52.075s

         real    13m3.077s      user  1544m28.512s      sys   2m45.633s

         real    13m2.473s      user  1542m21.130s      sys   2m52.306s

==  'icpc -ansi-alias -O2' ...  ===========================================

----   60 threads  --------------------------------------------------------

         real   13m23.987s      user   793m46.546s      sys   1m29.245s

         real   13m24.985s      user   795m35.378s      sys   1m28.225s

         real   13m27.984s      user   798m38.090s      sys   1m27.477s

----  120 threads  --------------------------------------------------------

         real   12m46.201s      user   1509m1.876s      sys   2m25.665s

         real   12m46.240s      user  1510m39.814s      sys   2m31.060s

         real   12m45.393s      user  1508m45.201s      sys   2m25.807s
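To put the 'real' times above in relative terms, here is a minimal C++ sketch that averages the three runs of each configuration and reports the slowdown in percent. The function names are made up for illustration and are not part of the benchmark; the input format is assumed to be the `XmY.YYYs` form printed by `time`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Convert a `time`-style wall-clock string such as "13m34.796s" to seconds.
double parse_real_time(const std::string& text) {
    const std::size_t m = text.find('m');
    assert(m != std::string::npos);
    const double minutes = std::stod(text.substr(0, m));
    const double seconds = std::stod(text.substr(m + 1));  // parsing stops at the trailing 's'
    return 60.0 * minutes + seconds;
}

// Arithmetic mean of several wall-clock strings, in seconds.
double mean_seconds(const std::vector<std::string>& runs) {
    assert(!runs.empty());
    double total = 0.0;
    for (const std::string& run : runs) total += parse_real_time(run);
    return total / static_cast<double>(runs.size());
}

// Relative slowdown of configuration `slow` versus `fast`, in percent.
double slowdown_percent(const std::vector<std::string>& slow,
                        const std::vector<std::string>& fast) {
    return 100.0 * (mean_seconds(slow) / mean_seconds(fast) - 1.0);
}
```

Fed with the six 'real' values per thread count listed above, this gives a slowdown for '-xHost' of a little over 1% at 60 threads and roughly 2.4% at 120 threads.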

 

 

Using IPO or PGO gives the same result: contrary to my expectations, the machine code gets slower. Is this effect already known for a particular kind of code? I would be very grateful for any hint! Please ask if more information is needed.

 

Thank you for reading.


Accepted Solutions
Black Belt

I don't think this is entirely surprising. While it's not one of the simpler issues to confirm, one might suspect that L1 cache capacity misses account for HyperThreading offering an apparent 4% advantage only in the SSE2 build. As Sandy Bridge CPUs and their v2 successors (Ivy Bridge), such as the one you apparently use, have only a 128-bit path to L2 cache, AVX-256 builds sometimes lose performance. One would expect 2 threads per core to expose this situation earlier. The Haswell CPUs benefit more consistently from AVX (or AVX2) compilation, but none of the Haswell CPUs I tested had HyperThreading.

You might be interested in trying -msse4.1, as that would employ some additional instructions without adding any that use 256-bit data, and it may be less dependent on directives to guide vectorization.
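When switching between options such as '-xHost' and '-msse4.1', it can help to confirm which instruction set the compiler was actually allowed to target. The following is a minimal sketch, assuming the compiler predefines the GCC-style macros `__AVX2__`, `__AVX__`, `__SSE4_2__`, `__SSE4_1__` and `__SSE2__` (GCC and Clang do; that icpc on Linux does the same is an assumption here and should be checked for your version). The function name is hypothetical.

```cpp
#include <cassert>
#include <string>

// Report the widest x86 SIMD instruction set enabled for this translation
// unit, judged only by the compiler's predefined macros.
std::string compiled_simd_level() {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_2__)
    return "SSE4.2";
#elif defined(__SSE4_1__)
    return "SSE4.1";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "unknown";
#endif
}

// True if the build may contain 256-bit (AVX or AVX2) instructions.
bool uses_256bit_simd() {
    const std::string level = compiled_simd_level();
    return level == "AVX" || level == "AVX2";
}
```

Printing `compiled_simd_level()` once at program start makes it easy to tell the builds apart when comparing timings.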

By the way, in some cases the advantage of the AVX-256 instruction set depends on 32-byte aligned data, which could involve adding alignment specifications in the source code. As far as I know, icc doesn't have a global option to make alignments default to 32 bytes, as ifort does.
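As an illustration of what such alignment specifications might look like, here is a minimal sketch using C++11 `alignas` for a static array and POSIX `posix_memalign` for heap memory. The names are hypothetical and not taken from the benchmark. Older compilers that lack `alignas` may need the GCC-style `__attribute__((aligned(32)))` instead, and `_mm_malloc`/`_mm_free` is a common alternative for heap memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign (POSIX), free

constexpr std::size_t kAvxAlignment = 32;  // bytes in one 256-bit AVX register

// True if `p` sits on a 32-byte boundary.
bool is_avx_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAvxAlignment == 0;
}

// Statically allocated buffer with an explicit alignment specification.
alignas(kAvxAlignment) static double g_static_buffer[1024];

double* static_buffer() { return g_static_buffer; }

// Heap buffer of `count` doubles on a 32-byte boundary.
// Returns nullptr on failure; release the memory with free().
double* alloc_avx_aligned(std::size_t count) {
    void* raw = nullptr;
    if (posix_memalign(&raw, kAvxAlignment, count * sizeof(double)) != 0) {
        return nullptr;
    }
    return static_cast<double*>(raw);
}
```

Aligned storage alone does not guarantee aligned vector code; the compiler usually also has to be told about the alignment at the loop, for example through a compiler-specific pragma or builtin.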

 

Beginner

I'm very thankful for your comments, and your competent analysis impresses me! To be honest, I have only limited knowledge of the details of Intel's current CPUs (e.g. data-path width, data alignment) and almost no experience with fine-tuning optimization options.

This number-crunching benchmark ("B") makes heavy use of the 'SVD::decompose()' function from Numerical Recipes; my (obviously naive) hope was that this code in particular would profit from the AVX instruction set enabled by '-xHost'. But I see that I have to dig more deeply into the sources and the optimization report… By the way, is there any recommended literature, perhaps from Intel?

Your advice to check the SSE4 options was helpful: the computing time didn't change significantly with '-msse4.1', but '-msse4.2' gains a little under 1% (which is effectively more than the equivalent of one CPU core). However, an additional PGO run has no influence…

Currently we expect 4.5 times the performance from a 240-core E7-x890v2 machine compared with the "old" 80-core E7-8870 (theory says a factor of 7 = 240/80 · (2.8 GHz)/(2.4 GHz) · c, where c = 2 from AVX, and the system architecture raises expectations of excellent scaling), but only a factor ≥ 5 is economically justifiable. So further suggestions for optimizations at the compiler level, if possible at all, are very welcome.
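For reference, the estimate above written out as a minimal C++ sketch. The function names are hypothetical; the model is just the naive product of core ratio, clock ratio and SIMD factor from the paragraph above, not a prediction of real scaling.

```cpp
#include <cassert>

// Naive peak-throughput scaling estimate: core ratio x clock ratio x SIMD factor.
double theoretical_speedup(double new_cores, double old_cores,
                           double new_ghz, double old_ghz,
                           double simd_factor) {
    assert(old_cores > 0.0 && old_ghz > 0.0);
    return (new_cores / old_cores) * (new_ghz / old_ghz) * simd_factor;
}

// Fraction of the theoretical estimate that a given speedup represents.
double fraction_of_theory(double speedup, double theoretical) {
    assert(theoretical > 0.0);
    return speedup / theoretical;
}
```

With 240 versus 80 cores, 2.8 versus 2.4 GHz and c = 2, `theoretical_speedup` returns 7 (up to floating-point rounding). The expected factor of 4.5 then corresponds to about 64% of that estimate, and the required factor of 5 to about 71%.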

Black Belt

http://www.sisoftware.co.uk/?d=qa&f=mem_hsw

discusses in unusual detail the comparative technical characteristics of the v2/v3/v4 mobile CPUs.

As far as I know, much of this applies to the server CPUs as well, but the Haswell server CPUs presumably remain under non-disclosure. 

I suppose a sales rep trying to sell the E7 upgrade ought to have contacts beyond what my web search tool sees.

Intel tends not to release such information in an easily searchable form even when the product is released and the red cover document restrictions could be lifted.  Sometimes, as seems to be the case here, it's not discussed publicly until it can be advertised as a feature of a new CPU.  I've never seen a discussion of how much the hardware architects knew initially about these bottlenecks or whether it was left to us software people to discover limitations and try to understand them.

This one confirms progressive relief of some data path width limitations from v2 to succeeding CPUs:

http://www.realworldtech.com/haswell-cpu/5/

As Intel presumably wanted to be able to brag about AVX2 features, relieving some of the bottlenecks that would otherwise become evident more frequently contributes to that.

These authors seem to have pried out information you will be hard pressed to find in Intel docs.

My experience comparing the dual 10- and 12-core E5 CPUs confirmed the tendency of customers to prefer the 10-core, as changes in utilization, such as adopting hybrid MPI/OpenMP, were needed to overcome the price and power consumption increase.

Few applications have sufficient L1 locality to expect a fully proportional benefit from a clock speed increase.

 
