Solved: Massive speedup of integer SSE2 code using AVX1(!)

diamantis__nikos · ‎09-06-2015

Hello.

I've recently recompiled an old C program, which mainly uses integers, with the VC 2013 and ICC 13.1 compilers. It was initially compiled for the SSE2 architecture using the VC 2010 compiler.

I targeted AVX (version 1.0) architecture in the compiler options for my SandyBridge and I saw a massive speedup of 68% with both compilers (VC 2013 and ICC 13.1) compared to old SSE2 optimizations.

I then recompiled the C code targeting AVX2 for my Haswell, but on Haswell the speedup was the same as with AVX for both compilers.

My questions:

1) How is it possible that, for integer-based C code, both the VC 2013 and ICC 13.1 compilers produce a huge speedup of approximately 70% when targeting AVX1?

2) Is there a way for a compiler to see the 256 bit AVX units as integer SIMD units and almost double the performance of integer code compared to SSE2 ? Or is it the SSE2 code running on 256 bit execution units ? Or is it something else ?

3) Why was there no speedup (1%) for AVX2 compared to AVX optimizations for integer code on a Haswell processor?

Thanks!

andysem · ‎09-07-2015

You will have the exact answer if you examine the disassembly of your compiled binaries. With the data you provided we can only speculate.

As far as speculations go, there are several possible avenues of speedup:

- AVX implies support for all extensions up to SSE4.1. The speed up may be achieved with an extension prior to AVX, not AVX itself.

- AVX includes a new instruction encoding, which allows specifying a separate output operand so that inputs are not destroyed. Depending on your code specifics and the number of xmm registers used, this may reduce the pressure on the register file and avoid register spills on the stack in performance critical parts of the code.

- The new instruction encoding also tends to shorten the code as many moves between registers can be removed. This may result in the code becoming small enough to fit into L1 cache or be easier to inline by the compiler.

- You are bottlenecked by some other part of the program, not the integer part.

>>Is there a way for a compiler to see the 256 bit AVX units as integer SIMD units and almost double the performance of integer code compared to SSE2 ? Or is it the SSE2 code running on 256 bit execution units ? Or is it something else ?

The compiler may use 256-bit ymm registers to implement some integer operations - mostly logical ones, some shuffles and blends. But it can't convert your integers to floats and operate on them like that. I bet it would not speed the code up anyway. On Haswell and later the compiler could use AVX/AVX2 to speed up some memory operations, as the cache interface has been widened to 256 bits on that architecture. To my knowledge, Sandy/Ivy Bridge CPUs don't provide doubled throughput for 128-bit operations compared to 256-bit ones, so you're not getting the benefit from instruction level parallelism (in other words, no, the extra execution units that are used with 256-bit instructions are not speeding up the 128-bit instructions).
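To make the non-destructive encoding point concrete, here is a minimal sketch. The function name `add_xor4` and the computation itself are hypothetical, chosen only so that one input is needed twice. With legacy SSE2 encoding, PADDD overwrites one of its sources, so the compiler typically has to copy `va` first. When the same source is built for an AVX target, the VEX form VPADDD can write to a third register and the copy can disappear. Whether that matters in your program depends on how many registers the hot loop keeps live.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  /* SSE2 intrinsics */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0     /* scalar fallback on non-x86 targets */
#endif

/* Hypothetical kernel: out[i] = (a[i] + b[i]) ^ a[i] for four 32-bit lanes.
   'a' is used twice, so a destructive 2-operand PADDD needs an extra
   register copy; a VEX-encoded 3-operand VPADDD does not. */
void add_xor4(const int32_t *a, const int32_t *b, int32_t *out)
{
    assert(a && b && out);
#if HAVE_SSE2
    __m128i va  = _mm_loadu_si128((const __m128i *)a);
    __m128i vb  = _mm_loadu_si128((const __m128i *)b);
    __m128i sum = _mm_add_epi32(va, vb);   /* va must stay live after this */
    __m128i res = _mm_xor_si128(sum, va);  /* second use of va */
    _mm_storeu_si128((__m128i *)out, res);
#else
    for (int i = 0; i < 4; ++i)
        out[i] = (int32_t)(((uint32_t)a[i] + (uint32_t)b[i]) ^ (uint32_t)a[i]);
#endif
}
```

Compiling this once for SSE2 and once for AVX and comparing the disassembly of `add_xor4` should show whether the register copy is present in each build.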

diamantis__nikos · ‎09-07-2015

Thank you for your replies.

I took another look this morning at the initial code and the initial disassembled binary. It's an x86 app using the SSE2 and 80x87 instruction sets, compiled with VC 2010.

The speedups from recompilation on SandyBridge are as follows:

ICC 11.1 x86, SSE2 to SSE4.2 optimizations: 17% speedup (disassembly shows SSE2 to SSE4.2 and 80x87 instructions)

ICC 11.1 x64, SSE2 to SSE4.2 optimizations: 19% speedup (disassembly shows SSE2 to SSE4.2 and 80x87 instructions)

ICC 13.1 x64, SSE2 to SSE4.2 optimizations: 29% speedup (disassembly shows SSE2 to SSE4.2 but no 80x87 instructions)

VC 2013 x64, SSE2 to SSE4.2 optimizations: 36% speedup (disassembly shows SSE2 to SSE4.2 but no 80x87 instructions)

VC 2013 x64, AVX optimizations: 69% speedup (disassembly shows AVX but no 80x87 instructions)

ICC 13.1 x64, AVX optimizations: 71% speedup (disassembly shows AVX but no 80x87 instructions)

The only 80x87 floating-point arithmetic in the initial compilation is 3 FADDP, 3 FMUL and 3 FDIVP instructions on single-precision 32-bit floats. They are used to calculate and display on screen, at 1 sec intervals, the performance of the main task, which is integer calculation.

That on-screen counter disappeared at run time in all builds where the recent compilers (ICC 13.1 and VC 2013) avoided 80x87 instructions and left the app with only vector code, but the counter is still there in the ICC 11.1 and VC 2010 builds, which kept the 80x87 instructions!

I'm not sure if the disappearance of a speed counter, can be called "optimization" though :)

It's unbelievable if the speedup mainly comes from a disappeared on-screen counter of performance of the main integer task!

Still, there is a significant speedup going from optimized SSE4.2 code without 80x87 instructions to AVX code without 80x87 instructions for both compilers ICC 13.1 and VC 2013.

It's 24% for VC 2013 and 32% for ICC 13.1 which is pure AVX speedup over SSE4.2 code without (?) further 80x87 optimizations.
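The arithmetic behind those two figures can be sketched as follows. This assumes that every percentage in the list above is measured against the same baseline build, and that "speedup of X%" means throughput equal to the baseline times (1 + X/100). The function name `relative_gain_percent` is made up for illustration.

```c
#include <assert.h>

/* Gain of build B over build A, in percent, when both speedups are given
   in percent against one common baseline (an assumption about how the
   numbers in the post were measured). */
double relative_gain_percent(double a_percent, double b_percent)
{
    assert(a_percent > -100.0);  /* baseline ratio must stay positive */
    return ((100.0 + b_percent) / (100.0 + a_percent) - 1.0) * 100.0;
}
```

Under those assumptions, `relative_gain_percent(36.0, 69.0)` gives about 24.3 for VC 2013 and `relative_gain_percent(29.0, 71.0)` gives about 32.6 for ICC 13.1, which matches the quoted 24% and 32% after rounding down.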

What about the 1% only speedup of Haswell using AVX2-FMA3 over AVX ?

Vladimir_Sedach · ‎09-07-2015

If your code is integer and 128-bit only, 1% is a wonderful speedup when switching AVX -> AVX2 )
The only thing you can do is to use 256-bit instructions instead of 128-bit ones. I'd also try GCC.

diamantis__nikos · ‎09-07-2015

Thank you for your reply.

Actually, AVX -> AVX2 is 0% faster using VC 2013 and 0.5% faster using ICC 13.1. I haven't tried GCC because I thought it was inferior to the other two.

One interesting comment:

SandyBridge is 5% faster than Haswell running AVX code at the same clock (turbo disabled) when 4 threads are used, using DDR3-1333MHz CL9 for Sandy and DDR3-1600MHz CL9 for Haswell.

Haswell is 13% faster than Sandy if 8 threads (hyper-threading) are used for the same clock.

andysem · ‎09-07-2015

x87 legacy code is slow and should generally be avoided. Apart from being slow by itself, it also forces the kernel to save/restore the FPU state when switching contexts. It is prohibited in Windows kernel mode already. All 64-bit compilers should (and do) generate SSE2 instructions for floating point math for 64-bit x86 targets. Whether that is the source of your measured speedup or not - I can't tell, but it should certainly add to it. You can use a profiler to identify the pieces of code that matter most for performance.

It's not clear what you mean by saying that the "on-screen counter disappeared in run-time". Provided that the code is not changed (e.g. because of some preprocessor conditions), its behavior should be the same. Otherwise you may be experiencing a compiler bug, although I find it hard to believe since multiple compilers behave similarly.

>>What about the 1% only speedup of Haswell using AVX2-FMA3 over AVX ?

As the majority of your code is integer-related, FMA won't help you. AVX2 could potentially help, if your data processing pattern suits well. If your code is scalar then not seeing any benefit means that the compiler has failed to perform the optimization for some reason. This is not uncommon.
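As a rough illustration of where AVX2 could matter, consider a scalar integer loop like the sketch below. The function `add_i32` is hypothetical and not taken from the program under discussion. AVX (version 1) has 256-bit arithmetic only for floating point, so for an AVX-only target the compiler can at best vectorize this loop with 128-bit VPADDD on xmm registers. The 256-bit ymm form of VPADDD exists only with AVX2. If the AVX and AVX2 builds of your hot loop disassemble to the same xmm code, a 0-1% difference is what you would expect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hot loop: element-wise 32-bit integer add.
   'restrict' (C99) tells the compiler the arrays do not overlap, which
   makes auto-vectorization more likely; it is not a guarantee. */
void add_i32(const int32_t *restrict a, const int32_t *restrict b,
             int32_t *restrict out, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; ++i) {
        /* unsigned add avoids undefined behavior on overflow */
        out[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
    }
}
```

Building this with /arch:AVX versus /arch:AVX2 (or -mavx versus -mavx2 with GCC) and looking for xmm versus ymm operands in the loop body is one way to check what the compiler actually did.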

Vladimir_Sedach · ‎09-07-2015

GCC is way better than VC on average. In my opinion it is at least no worse than ICC.

diamantis__nikos · ‎09-07-2015

@andysem

Thank you for your reply.

ICC 11.1 x64 kept 80x87 instructions and didn't convert them to SSE2, but the other two x64 compilers (ICC 13.1 and VC 2013) did it.

The program changed its behavior in the two builds that converted the x87 instructions to SSE2: three real-time (1 sec interval) counters, a progress counter (%), a time counter (32 bit float) and a speed counter (32 bit float), disappeared from the screen, and the results are displayed only when the task finishes.

The compilation of ICC 11.1 x64 which kept x87, works fine.

I have to see again the warnings of the compilers to understand what is going on with that.

@Vladimir

Thanks, I'll give it a try.

andysem · ‎09-07-2015

>>Haven't tried GCC because I thought is inferior to the other two.

In my experience, gcc is certainly better than MSVC when it comes to SSE/AVX programming (and probably general programming as well). It is also generally better than Intel in terms of language support.

TimP · ‎09-07-2015

According to my observations, the VS2010 compiler uses only AVX128, no AVX256. VS2013 and VS2015 give me performance improvements for /arch:AVX2 (but not /arch:AVX).

I think you may still find performance comparisons posted by Intel where the versions of gcc and MSVC used are quite old and don't provide AVX optimization (even though it seems that improved AVX2 support in recent gcc and MSVC may have been funded or negotiated by Intel). It's difficult to choose consistent optimizations among the various compilers, and the selections tend to be biased. I've had problems with the gcc combination -fopenmp -ffast-math which might be used to justify turning off important optimizations.

Differences in performance among compilers sometimes come down to rather obscure options. I'm looking at one now where gcc uses conditional loop code alignment with maxskip=10 (and misses an important alignment by 1 byte) where ICL changed recently to unconditional alignment.

McCalpinJohn · ‎09-08-2015

Looking at the Intel 13.1 results, almost 1/2 of the overall speedup is from the SSE2 to SSE4.1 optimizations. This includes ~13 new instructions from SSE3, ~16 new instructions in SSSE3, and ~47 new instructions in SSE4.1. It would be interesting to see which instructions provide the biggest gain. Can you look at the code's primary hot spots using VTune? Can you tell if the primary hot spots are vectorized?

As noted by andysem, AVX provides a more register-efficient 3-operand instruction encoding. This can significantly improve performance if the SSE code is running out of registers. In addition, AVX allows unaligned memory operands in arithmetic instructions, which can further reduce instruction count and register pressure.
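The unaligned-operand point can be sketched like this. The function `sum_i32_unaligned` is a made-up example. In legacy SSE encoding, PADDD with a memory operand requires that operand to be 16-byte aligned, so an unaligned load has to be a separate MOVDQU into a register. With VEX encoding, VPADDD accepts an unaligned memory operand, so the compiler is free to fold the load into the add. Whether it does so is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  /* SSE2 intrinsics */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0     /* scalar fallback on non-x86 targets */
#endif

/* Sum n 32-bit integers read through a pointer with no 16-byte alignment
   guarantee. Arithmetic wraps around on overflow. */
int32_t sum_i32_unaligned(const int32_t *p, size_t n)
{
    uint32_t total = 0;
    size_t i = 0;
    assert(p != NULL || n == 0);
#if HAVE_SSE2
    __m128i acc = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4) {
        /* Separate MOVDQU in an SSE2 build; may become the memory
           operand of VPADDD in an AVX build. */
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        acc = _mm_add_epi32(acc, v);
    }
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; i < n; ++i)   /* remaining elements (all of them without SSE2) */
        total += (uint32_t)p[i];
    return (int32_t)total;
}
```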

The 5% slowdown on Haswell may or may not be significant. Haswell has higher latency for instructions that move data between the upper and lower 128 bits of the 256-bit AVX registers (e.g. VPERM), so some codes that depend on these "swizzles" may run slower.

diamantis__nikos · ‎09-09-2015

Thanks for your replies.

My first comment is about SSE4.1 instruction set that both andysem and John D. McCalpin referred to.

Is there a reason that you both stopped at SSE4.1 ? The code optimizations are always up to SSE4.2...Is SSE4.2 insignificant or never actually used ?

Regarding VTune and other similar tools, I'm not familiar with them and never used that before.

One thing I could add is a similar project with integer code that seems to give slightly different results. I will give you more information once I've put the numbers into Excel, because I can't get a clear picture unless I see the results side by side and measure the differences.

Richard_Nutman · ‎09-09-2015

Hi Nikos,

There's not much in SSE4.2 that would be of use. It's mostly string and CRC functions.

You can see all SSE/AVX and other intrinsics here;

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

It sounds to me like the most likely culprit is unaligned memory access having a much lower penalty with AVX, as mentioned by John.

andysem · ‎09-10-2015

>>Is there a reason that you both stopped at SSE4.1 ? The code optimizations are always up to SSE4.2...Is SSE4.2 insignificant or never actually used ?

It is not insignificant. As Richard N. mentioned, SSE4.2 is mostly helpful for string operations; it's not a generic extension like SSE4.1. The reason I didn't include it is that (a) it is unlikely that compilers will generate those instructions from scalar code (I haven't seen that) and (b) I'm not sure SSE4.2 is implied by AVX.

Christopher_H_ · ‎09-10-2015

I could not see that anyone had mentioned this: with VEX encoding of SSE instructions, you get 3-operand instructions, which reduces register pressure if that was the bottleneck. I have seen 20-30% speedups with SSE intrinsic code just by compiling with AVX to use VEX encoding.

Vladimir_Sedach · ‎09-10-2015

Nikos,

Could you reveal the critical sections of your code, so that we could suggest something more realistic without wondering what we're talking about?

jimdempseyatthecove · ‎09-10-2015

>>with VEX encoding of SSE instructions, you will have 3 operand instructions

Is that documented? Meaning Intel assures it will (continue to) work on future designs.

Jim Dempsey

diamantis__nikos · ‎09-10-2015

Thanks for all your replies.

@Richard. Excellent link, I didn't know

@Christopher. I'll search for that VEX encoding in my disassembled executables.

@Vladimir. As a matter of fact, the replies have covered what I needed. My main question was basically informative, because I found it weird for mostly integer code to get such a speedup from recompiling with AVX and not with AVX2.

McCalpinJohn · ‎09-10-2015

The 3-operand instructions are a critical part of the AVX (and following) instruction set extensions, so they will certainly be maintained.

I don't understand the instruction encoding well enough to know whether the expression "VEX encoding of SSE instructions" is completely accurate. It is clear, however, that AVX includes 3-operand instructions that perform the same functions on 128-bit SIMD registers as almost all of the available SSE instructions. These are shown as VEX-encoded instructions with "xmm" register arguments in the instruction descriptions of Volume 2 of the Intel Architectures Software Developer's Manual (document 325383).

As an example, the entry for PADDB/PADDW/PADDD shows the 2-operand SSE encodings, the 3-operand 128-bit AVX encodings, and the 3-operand 256-bit AVX2 encodings.

Christopher_H_ · ‎09-10-2015

I simply meant the avx-128 equivalents of sse instructions by "VEX encoding of SSE instructions" . Where avx-128 has 3 operand and sse (meaning sse-sse4.1) has 2 operand instructions.

TimP · ‎09-10-2015

AVX128 offers a reduction in the number of instructions compared with SSE4, but a real performance increase only in rare cases. The useful side would be the proof that instructions per clock is a poor measure. The main reason for compilers to auto-promote SSE intrinsics to AVX128 is to avoid the need for vzeroupper.