I recently recompiled an old C program, which mainly uses integers, with the VC 2013 and ICC 13.1 compilers; it was originally compiled with VC 2010 targeting the SSE2 architecture.
I targeted the AVX (version 1.0) architecture in the compiler options for my Sandy Bridge and saw a massive speedup of 68% with both compilers (VC 2013 and ICC 13.1) compared to the old SSE2 build.
I then recompiled the C code targeting AVX2 for my Haswell, but with both compilers the speedup on Haswell was the same as with AVX.
1) How is it possible for both the VC 2013 and ICC 13.1 compilers to produce a huge speedup of approximately 70% on integer-based C code just by targeting the AVX1 architecture?
2) Is there a way for a compiler to use the 256-bit AVX units as integer SIMD units and almost double the performance of integer code compared to SSE2? Or is it the SSE2 code running on 256-bit execution units? Or is it something else?
3) Why was there essentially no speedup (1%) for AVX2 compared to AVX optimizations for integer code on a Haswell processor?
Tim Prince wrote:
AVX-128 offers a reduction in the number of instructions compared with SSE4, but a real performance increase only in rare cases.
In my experience the improvement from recompiling for AVX is usually marginal or small but rather consistent. I've never seen a regression.
@Christopher and all
Yes, all of the opcodes use "VEX encoding" in both the ICC 13.1 and MS VC 2013 recompilations targeting the AVX architecture.
For the AVX2 recompilations, both compilers emit "VEX encoding" opcodes on "xmm" registers, which probably explains why there is no gain for AVX2 compared to AVX.
Setting AVX2 gives the compilers some opportunities for translation into AVX2 equivalents, which may improve performance. For example, there may be optimizations where FMA replaces AVX code, which should improve performance (except in contexts such as sum reductions). More often, there is no change from AVX to AVX2 in either 128-bit or 256-bit mode.
SSE intrinsics are translated to AVX-128, which is most easily seen by the use of xmm registers. As we were just discussing, this usually results in little change in performance from SSE code on the same platform.
I believe gcc has a command line switch to implement AVX and AVX2 in 128-wide mode, which might have been beneficial to supporting more threads on AMD platforms.
I have recently run across a case where "-O3 -xCORE-AVX2" provides much better optimization than "-O3 -xAVX". This is unrelated to the actual instruction set, since the code generated in both cases consisted only of AVX instructions.
The inner loop was a simple sum reduction on a 2MiB-aligned vector of doubles with a known/fixed length (divisible by 64).
The compiler version is "icc (ICC) 15.0.1 20141023". It has been doing a great job on Haswell, but this difference in generated code may explain why some of my Sandy Bridge tests did not run as well as I expected....
That is an interesting example of, and more importantly a good description of, the compiler optimization team adding optimization methods for the newest ISA (AVX2 in this example) without taking the time to verify whether the same methods would also apply to the earlier ISA(s).
OK, so I recompiled a different but similar source code based on 128 bit integers and the results were a lot different.
First of all, VC 2013 x64 was the slowest of all the compilers, even slower than VC 2010 x86 (!), across all CPU architectures (Core 2 Duo, Sandy Bridge, Haswell). And VC 2010 is slower than ICC 11.1 and ICC 13.1.
Also, ICC 11.1 x86 is almost exactly as fast as ICC 13.1 x64 on Core 2 Duo and Sandy Bridge for the SSEx (SSE2 to SSE4.2) compilations. The AVX compilation is 20% faster than SSEx on Sandy Bridge.
Haswell behaves a little differently up to 4 threads, favoring the x64 SSEx compilations slightly more than previous generations, but not dramatically (about 10% from x86 to x64). With HT and all 8 threads, however, the gap grows to 20% for x64 SSEx.
AVX is exactly as fast as AVX2, and only 1% to 4% faster than SSEx, using the same compiler (ICC 13.1 x64) and up to 4 threads.
HT and 8 threads give AVX/AVX2 an extra boost of 13% over SSEx.
In the previous project, Sandy Bridge was 5% faster than Haswell clock for clock using 4 threads, but Haswell was 13% faster using 8 threads.
In the new project, Sandy Bridge is even further ahead, 7% faster than Haswell using 4 threads, but the huge HT gains in this code make Haswell 31% faster than Sandy Bridge using 8 threads.
When running performance tests, keep in mind that code placement can affect performance significantly. I've personally seen over 10% difference in runtime from a change in an unrelated section of code (e.g. inserting a printf in front of the timed section). Using VTune or the debugger you can verify the address of the top of your loop. When the top of the loop is at (or very near) the start of a cache line, it may run better, principally because the loop may then reside in one fewer cache line. When coding in C/C++, you can insert a #pragma to adjust the alignment of the code.
As John mentioned, selecting AVX2 may still produce AVX1 code, but with different optimization analysis within the compiler (thus producing better AVX1 code). A caution to other readers: selecting AVX2 code generation for systems without AVX2 is not advised. Should the compiler elect to use AVX2 instructions, your program will break.
>>>HT and 8 threads give an extra boost to AVX/AVX2 of 13% over SSEx>>>
I think it depends on the actual port saturation; at a minimum, enabling HT will usually improve the result by ~10%. When both HW threads are issuing the same code (uops) to the execution ports, contention appears and uops have to wait for completion. For example, a few vdivpd instructions, decoded into their corresponding uops and tagged as belonging to both HW threads, may be competing to be scheduled for execution at the same time.