While preparing two benchmarks on a 4-socket/60-core Xeon server (E7-4890 v2; HyperThreading and Turbo Boost enabled; RHEL 7 with Transparent Huge Pages active; cf. topic 515900), I observed that adding '-xHost' to the optimization options '-O2 -ansi-alias' (ICC 14.0.2) actually increases the execution time of the OpenMP-parallelized "Benchmark B":
== 'icpc -ansi-alias -O2 -xHost' ... ====================================
---- 60 threads --------------------------------------------------------
real 13m34.796s user 805m26.873s sys 1m33.328s
real 13m35.425s user 805m51.402s sys 1m33.418s
real 13m35.406s user 805m39.981s sys 1m35.493s
---- 120 threads --------------------------------------------------------
real 13m8.188s user 1553m54.815s sys 2m52.075s
real 13m3.077s user 1544m28.512s sys 2m45.633s
real 13m2.473s user 1542m21.130s sys 2m52.306s
== 'icpc -ansi-alias -O2' ... ===========================================
---- 60 threads --------------------------------------------------------
real 13m23.987s user 793m46.546s sys 1m29.245s
real 13m24.985s user 795m35.378s sys 1m28.225s
real 13m27.984s user 798m38.090s sys 1m27.477s
---- 120 threads --------------------------------------------------------
real 12m46.201s user 1509m1.876s sys 2m25.665s
real 12m46.240s user 1510m39.814s sys 2m31.060s
real 12m45.393s user 1508m45.201s sys 2m25.807s
Using IPO or PGO shows the same result: against my expectations, the generated machine code gets slower. Is this effect already known for a particular kind of code? I would be very grateful for any hint; please ask if more information is needed.
Thank you for reading.
I don't think this is entirely surprising. While it's not one of the simpler issues to confirm, one might suspect that L1 cache capacity misses account for HyperThreading offering an apparent 4% advantage only in the SSE2 build. Since Sandy Bridge -v2 CPUs (and also Ivy Bridge) such as you apparently use have only a 128-bit path to the L2 cache, AVX-256 builds sometimes lose performance, and running 2 threads per core can be expected to expose this situation earlier. Haswell CPUs benefit more consistently from AVX (or AVX2) compilation, but none of the Haswell CPUs I tested had HyperThreading.
You might be interested in trying -msse4.1, as it enables some additional instructions without introducing any that use 256-bit data, and it may be less dependent on directives to drive vectorization.
By the way, in some cases the benefit of the AVX-256 instruction set depends on 32-byte aligned data, which may require adding alignment specifications in the source code. As far as I know, icc has no global option to make alignment default to 32 bytes, as ifort permits.
I'm very thankful for your comments, and your competent analysis impresses me. To be honest, I have only little knowledge of the details of Intel's current CPUs (e.g. data-path width, data alignment) and almost no experience with fine-tuning optimization options.
This number-crunching benchmark ("B") makes heavy use of the 'SVD::decompose()' function from Numerical Recipes; my (obviously naive) hope was that this code in particular would profit from the AVX instruction set enabled by '-xHost'. But I see I have to dig more deeply into the sources and the optimization report. By the way, is there any recommended literature, perhaps from Intel?
Your advice to check the SSE4 options was helpful: the computing time did not change significantly with '-msse4.1', but '-msse4.2' gains at least a bit less than 1% (which is effectively more than the equivalent of one CPU core). An additional PGO run, however, has no influence.
Currently we expect 4.5 times more performance from a 240-core E7-x890v2 machine than from the "old" 80-core E7-8870 (theory predicts a factor of 7 = 240/80 · (2.8 GHz)/(2.4 GHz) · c, where c = 2 from AVX, and the system architecture suggests excellent scaling), but only a factor ≥ 5 is economically justifiable. So, further suggestions for optimizations at the compiler level, if possible at all, are very welcome.
http://www.sisoftware.co.uk/?d=qa&f=mem_hsw
discusses in unusual detail the comparative technical characteristics of the v2/v3/v4 mobile cpus.
As far as I know, much of this applies to the server CPUs as well, but the Haswell server CPUs presumably remain under non-disclosure.
I suppose a sales rep trying to sell the E7 upgrade ought to have contacts beyond what my web search tool sees.
Intel tends not to release such information in an easily searchable form even when the product is released and the red cover document restrictions could be lifted. Sometimes, as seems to be the case here, it's not discussed publicly until it can be advertised as a feature of a new CPU. I've never seen a discussion of how much the hardware architects knew initially about these bottlenecks or whether it was left to us software people to discover limitations and try to understand them.
This one confirms progressive relief of some data path width limitations from v2 to succeeding CPUs:
http://www.realworldtech.com/haswell-cpu/5/
Since Intel presumably wanted to be able to brag about AVX2 features, part of the improvement comes from relieving some of the bottlenecks that would otherwise become evident more frequently.
These authors seem to have pried out information you will be hard pressed to find in Intel docs.
My experience comparing the dual 10- and 12-core E5 CPUs confirmed the tendency of customers to prefer the 10-core part, as changes in utilization, such as adopting hybrid MPI/OpenMP, were needed to overcome the price and power-consumption increase.
Few applications have sufficient L1 locality to expect fully proportional benefit from clock speed increase.