
Harald_Deischinger · ‎09-09-2013

When upgrading from version 11 to 13 I notice a big performance loss (3x).

Versions in use:
- Composer XE 2011 update 11 build 344
- Composer XE 2013 build 089 (we also tried a newer build, with the same result)
For the test, both are used from Visual Studio 2008.

I could create a relatively small test program, which I could provide to an Intel engineer, but I cannot post it to the public forum.
The code is mainly hand-written SSE code using F32vec4 and intrinsics, plus a little bit of log.

The runtimes are worse with /Qipo, also on 2011:
- compiler 1210, /Qip: ~0.42 sec
- compiler 1210, /Qipo: ~1.30 sec
- compiler 1300, /Qip: ~1.20 sec
- compiler 1300, /Qipo: ~1.70 sec

Any suggestions, ideas or magic compiler options that could help?

regards
Harald

Harald_Deischinger · ‎09-09-2013

One (or maybe the only) reason for the difference seems to be inlining: the code contains (for "historical reasons") quite a lot of __forceinline annotations. At least for this one example, removing one of them helps to get equivalent speed for the two compilers.
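
To make this kind of experiment repeatable, the annotation can be routed through one macro so that a single define switches every forced inline on or off. This is only a minimal sketch under assumptions: FORCE_INLINE, NO_FORCE_INLINE, scale_add and accumulate_scaled are hypothetical names for illustration, not taken from the real code, and the scalar loop stands in for the actual SSE kernels.

```cpp
#include <cassert>
#include <cstddef>

// FORCE_INLINE and NO_FORCE_INLINE are hypothetical project macros.
// Compiling with NO_FORCE_INLINE defined hands the inlining decision
// back to the compiler's own heuristics.
#if defined(NO_FORCE_INLINE)
#  define FORCE_INLINE inline
#elif defined(_MSC_VER)   // also defined by the Intel compiler on Windows
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline
#endif

// Small helper of the kind that tends to carry a forced-inline annotation.
FORCE_INLINE float scale_add(float x, float a, float b) {
    return x * a + b;
}

// Hot loop that calls the helper once per element.
float accumulate_scaled(const float* data, std::size_t n, float a, float b) {
    assert(data != nullptr || n == 0);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += scale_add(data[i], a, b);
    }
    return sum;
}
```

Building the same sources once with and once without NO_FORCE_INLINE gives two binaries per compiler version, which is enough to compare the forced and the heuristic inlining.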

Some more information:
This loss of performance affects several of the algorithms in use. For two of them I ran some experiments.

Intel 11, with lots of forceinlines: 630, 137 (throughput of the two algorithms)
Intel 13, with lots of forceinlines: 500, 30
Intel 11, without forceinlines: 650, 98
Intel 13, without forceinlines: 650, 77
Intel 13, without forceinlines, with interprocedural optimization enabled: 620, 95

So I still cannot reach the original speed, but at least the loss of speed is no longer so dramatic.

As far as I remember, a long time ago there were some options to control the inlining heuristics. Are any of those options still available? Or any other hints?
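
As far as I know, the Intel compiler documents a family of /Qinline-* switches (for example /Qinline-factor and /Qinline-max-size) for adjusting the heuristics; the exact set and defaults should be checked in the documentation of the version in use. Independent of the command line, inlining can also be steered per function in source. The following is a minimal sketch under assumptions: NO_INLINE, clamp_slow, square and sum_clamped_squares are hypothetical names for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// NO_INLINE is a hypothetical macro name; it expands to the MSVC-style or
// GCC-style "never inline" attribute, depending on the compiler.
#if defined(_MSC_VER)   // also defined by the Intel compiler on Windows
#  define NO_INLINE __declspec(noinline)
#elif defined(__GNUC__)
#  define NO_INLINE __attribute__((noinline))
#else
#  define NO_INLINE
#endif

// Rarely taken path: kept out of line so it does not bloat the hot loop.
NO_INLINE float clamp_slow(float x, float lo, float hi) {
    assert(lo <= hi);
    return x < lo ? lo : (x > hi ? hi : x);
}

// Hot helper: plain inline, so the compiler's heuristics (and any size
// limits given on the command line) decide whether it is inlined.
inline float square(float x) {
    return x * x;
}

float sum_clamped_squares(const float* data, std::size_t n, float lo, float hi) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        float s = square(data[i]);
        // Only out-of-range values pay for the out-of-line call.
        sum += (s < lo || s > hi) ? clamp_slow(s, lo, hi) : s;
    }
    return sum;
}
```

Marking only the rare paths as never-inline keeps the hot loop small without forcing anything, which may behave more consistently across compiler versions than a blanket forced inline.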

harald

Richard_Nutman · ‎09-13-2013

It seems inlining is broken under certain circumstances with ICL 13.0. I experienced the same thing while testing ICL 13.0 for our uses: massive performance drops compared to MSVC 2012, because small template functions in hot loops were not being inlined.

Worse performance with version 13 - worsens with /Qipo