After a bad experience a while ago when trying to upgrade to icc 13, I've been trying version XE 2015. But again I'm seeing a loss in performance. This time I've been looking into it with VTune, and what I see doesn't really make sense.
First, IPP. The new IPP (8.2, coming from 6.1) seems to be almost twice as fast, which gives a considerable boost in performance. Great!
Then my code. Most of the functions in it have roughly the same performance with Intel compiler 10.1 and XE 2015. There are a few where I see a big improvement (up to a factor of 4), but also a few where I see a big degradation, and unfortunately that happens in what was already one of the heaviest functions in my code.
Now I have been trying to optimize this really heavy function, and I managed to get rid of a number of memory accesses and shave off a few instructions. With these optimizations, the new code is several instructions shorter than what the 10.1 compiler generated, and the number of memory accesses is down from 11 to 7. Still, I'm seeing a factor-2.5 loss in speed.
There's something else. I have two nearly identical versions of this function in my code. One version does the same thing as the one I profiled except for a single step; in the assembly that shows up as one missing memory write, and otherwise the code is identical. Yet that version - again, identical except for that one removed instruction - is also slow with the old compiler. (?)
Can there be anything I'm missing in the conversion from 10.1 to XE 2015? Something like "Flush denormals to 0" (which is enabled). Looking at the assembly code it really should be faster, but for some reason it's not.
As you seem to be grasping at straws, here's one I discovered after months of confusion:
When I build a parallel job (OpenMP or Cilk(tm) Plus) with the Intel compilers and run with Hyper-Threading enabled (there's no BIOS option to disable it on my Ultrabook) and the default number of threads or workers, subsequent single-threaded regions take about 30% longer than they do when I set the number of threads to the number of cores. This effect seems to be peculiar to the recent Intel development tools; gcc builds run fine with hyperthreads and don't slow down the serial regions.
Another effect that doesn't appear to be well documented: the default, where the compiler chooses the amount of unrolling, isn't as effective for me as specifying it. XE 2015 has fixed a number of cases where unrolling caused time to be wasted in remainder loops, so I usually start with /Qunroll4. Past compiler updates often seemed to require changing unroll options.
This week I was doing a lot of VTune analysis on code which (we suspect) was generated using ICL as a back end; we had neither the source code nor the intervening tool available. VTune reported a lot of front-end-bound performance penalties, which may be associated with non-vectorized code with no unrolling.
You didn't say which platform you are testing on, and whether it's the same one with the same ISA target for both compiler versions. Even on the Core i7 platform, which is fairly old and doesn't have any 256-bit hardware data types, current compilers often get a big advantage from 32-byte data alignment.