Intel® ISA Extensions

Massive speedup of integer SSE2 code using AVX1(!)

diamantis__nikos

Hello.

I recently recompiled an old C program, which mainly uses integers, with the VC 2013 and ICC 13.1 compilers. It was originally built with the VC 2010 compiler targeting the SSE2 architecture.

I targeted the AVX (version 1.0) architecture in the compiler options for my Sandy Bridge and saw a massive speedup of 68% with both compilers (VC 2013 and ICC 13.1) compared to the old SSE2 build.

I then recompiled the C code targeting AVX2 for my Haswell, but for both compilers the speedup on Haswell was the same as with AVX.

My questions:

1) How is it possible for integer-based C code, with both the VC 2013 and ICC 13.1 compilers, to gain a huge speedup of approximately 70% when targeting the AVX1 architecture?

2) Is there a way for a compiler to use the 256-bit AVX units as integer SIMD units and almost double the performance of integer code compared to SSE2? Or is it the SSE2 code running on 256-bit execution units? Or is it something else?

3) Why was there almost no speedup (1%) for AVX2 compared to AVX optimizations for integer code on a Haswell processor?

Thanks!

1 Solution
andysem
New Contributor III

You will have the exact answer if you examine the disassembly of your compiled binaries. With the data you provided we can only speculate.

As far as speculations go, there are several possible avenues of speedup:

- AVX implies support for all extensions up to SSE4.2. The speedup may be achieved with an extension prior to AVX, not AVX itself.

- AVX introduces a new instruction encoding (VEX), which allows specifying a separate output operand so that the inputs are not destroyed. Depending on your code specifics and the number of xmm registers used, this may reduce the pressure on the register file and avoid register spills to the stack in performance-critical parts of the code.

- The new encoding also tends to shorten the code, since many moves between registers can be removed. This may result in the code becoming small enough to fit into the L1 instruction cache, or easier for the compiler to inline.

- You are bottlenecked by some other part of the program, not the integer part.

diamantis__nikos wrote:

Is there a way for a compiler to see the 256 bit AVX units as integer SIMD units and almost double the performance of integer code compared to SSE2 ? Or is it the SSE2 code running on 256 bit execution units ? Or is it something else ?

The compiler may use 256-bit ymm registers to implement some integer operations - mostly logical ones, some shuffles and blends. But it can't convert your integers to floats and operate on them like that. I bet it would not speed the code up anyway. On Haswell and later the compiler could use AVX/AVX2 to speed up some memory operations, as the cache interface has been widened to 256 bits on that architecture. To my knowledge, Sandy/Ivy Bridge CPUs don't provide doubled throughput for 128-bit operations compared to 256-bit ones, so you're not getting the benefit from instruction level parallelism (in other words, no, the extra execution units that are used with 256-bit instructions are not speeding up the 128-bit instructions).

 

andysem
New Contributor III

Tim Prince wrote:

AVX128 offers a reduction in number of instructions compared with sse4 but a real performance increase only in rare cases.

In my experience the improvement from recompiling for AVX is usually marginal or small but rather consistent. I've never seen a regression.

 

diamantis__nikos

@ Christopher and all

Yes, all of the opcodes are in "VEX encoding" for both recompilations of ICC 13.1 and MS VC 2013 using AVX architecture.

The AVX2 recompilation from both compilers also uses VEX-encoded instructions with "xmm" registers, which probably explains why there is no gain for AVX2 compared to AVX.

 

TimP
Honored Contributor III

Setting AVX2 gives compilers some opportunities to translate code into AVX2 equivalents, which may improve performance. For example, FMA may replace plain AVX code, which should improve performance (except in contexts such as sum reduction).  More often, there is no change from AVX to AVX2 in either 128-bit or 256-bit mode.

SSE intrinsics are translated to AVX-128, which is most easily seen by the use of xmm registers.  As we were just discussing, this usually results in little change in performance from SSE code on the same platform. 

I believe gcc has a command line switch to implement AVX and AVX2 in 128-wide mode, which might have been beneficial to supporting more threads on AMD platforms.

McCalpinJohn
Honored Contributor III

I have recently run across a case where "-O3 -xCORE-AVX2" provides much better optimization than "-O3 -xAVX".  This is unrelated to the actual instruction set, since the code generated in both cases consisted only of AVX instructions.

The inner loop was a simple sum reduction on a 2MiB-aligned vector of doubles with a known/fixed length (divisible by 64).

  • With "-O3 -xCORE-AVX2" the compiler generated excellent code, using 8 256-bit registers as (4-wide) partial sums.  These 8 partial sums were combined very neatly at the end of the loop with a tree-based sum that was able to hide most of the latency of the 7 required VADDPD instructions.
  • With "-O3 -xAVX" the compiler vectorized the code, but only used 2 256-bit registers as partial sums.  This is not enough to tolerate the 3-cycle latency of the VADDPD instruction, so the code took 1.5 cycles per VADDPD instead of 1.0.

The compiler version is "icc (ICC) 15.0.1 20141023".  It has been doing a great job on Haswell, but this difference in generated code may explain why some of my Sandy Bridge tests did not run as well as I expected....

jimdempseyatthecove
Honored Contributor III

John,

That is an interesting example of, and more importantly a good description of, the compiler optimization team adding optimization methods for the newer (latest) ISA (AVX2 in this example), but not taking the time to verify whether the optimization method would also apply to the earlier ISA(s).

Jim Dempsey

diamantis__nikos

OK, so I recompiled a different but similar source code based on 128 bit integers and the results were a lot different.

First of all, VC 2013 x64 was the slowest of all the compilers, even slower than VC 2010 x86 (!), across all CPU architectures (Core 2 Duo, Sandy, Haswell). And VC 2010 is slower than ICC 11.1 and ICC 13.1.

Also, ICC 11.1 x86 has almost the same speed as ICC 13.1 x64 on Core 2 Duo and Sandy using SSEx (SSE2 to SSE4.2) compilations. AVX compilation is 20% faster than SSEx on Sandy.

Haswell behaves a little differently up to 4 threads, favoring the x64 SSEx compilations a little more than previous generations did, but not dramatically (about 10% from x86 to x64). But with HT and all 8 threads, the gap grows to 20% for x64 SSEx.

AVX has exactly the same speed as AVX2, and is only 1% to 4% faster than SSEx, using the same compiler (ICC 13.1 x64) and up to 4 threads.

HT and 8 threads give an extra boost to AVX/AVX2 of 13% over SSEx.

In the previous project, Sandy was 5% faster than Haswell clock for clock using 4 threads, but Haswell was 13% faster using 8 threads.

In the new project, Sandy is even 7% faster than Haswell using 4T, but the huge gains from HT in this code make Haswell 31% faster than Sandy using 8T.

jimdempseyatthecove
Honored Contributor III

Nikos,

When running performance tests, you must keep in mind that code placement can affect performance significantly. I've personally experienced over 10% difference in runtime from a change in an unrelated section of code (e.g. inserting a printf in front of the timed section). Using VTune or the debugger you can verify the address of the top of your loop. When the top of the loop is at (or very near) the start of a cache line, it may run better, principally because the loop then resides in one fewer cache line. When coding in C/C++, you can insert a #pragma to adjust the alignment of the code.

As John mentioned, selecting AVX2 may still produce AVX1 code, but with different optimization analysis within the compiler (thus producing better AVX1 code). A caution to other readers: selecting AVX2 code generation for systems without AVX2 is not advised. Should the compiler elect to use AVX2 instructions, your program will crash with an illegal-instruction fault.

Jim Dempsey

Bernard
Valued Contributor I

>>>HT and 8 threads give an extra boost to AVX/AVX2 of 13% over SSEx>>>

I think it depends on actual port saturation; at least, enabling HT will usually improve the result by ~10%. When both HW threads issue the same code (uops) to the execution ports, contention appears and uops need to wait for completion. For example, a few vdivpd instructions decoded into their corresponding uops and tagged as belonging to both HW threads may have to wait to be scheduled for execution.
