
AVX Optimizations and Performance: VisualStudio vs GCC

James_S_7
Beginner

Greetings,

   I have recently written some code using AVX function calls to perform a convolution in my software. I have compiled and run this code on two platforms with the following compilation settings of note:

1. Windows 7 w/ Visual Studio 2010 on a i7-2760QM

   Optimization: Maximize Speed (/O2)

   Inline Function Expansion: Only __inline (/Ob1)

   Enable Intrinsic Functions: No

   Favor Size or Speed: Favor fast code (/Ot)

2. Fedora Linux 15 w/ gcc 4.6 on a i7-3612QE

   Flags: -O3 -mavx -m64 -march=corei7-avx -mtune=corei7-avx

For my testing I ran the C implementation and the AVX implementation on both platforms and got the following timing results:

In Visual Studio:

C Implementation: 30ms

AVX Implementation: 5ms

In GCC:

C Implementation: 9ms

AVX Implementation: 57ms

As you can see, my AVX numbers on Linux are very large by comparison. My concern, and the reason for this post, is that I may not have a proper understanding of using AVX and of the settings needed to use it properly in both scenarios. For example, take my Visual Studio run: if I change the Enable Intrinsic Functions flag to Yes, my AVX time goes from 5ms to 59ms. Does that mean that keeping the compiler from optimizing with intrinsics and writing them manually gives that much better results in Visual Studio? Last I checked there is nothing similar in gcc. Could Microsoft's compiler really produce that much better code than gcc in this case? Any ideas why my AVX numbers under gcc are so much larger? Any help is most appreciated. Cheers.
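For reference, here is roughly how the two builds are driven (the cl line is my reconstruction of the IDE settings above, and the source file name is made up):

    rem Visual Studio 2010 ("Enable Intrinsic Functions: No" means /Oi is omitted)
    cl /O2 /Ob1 /Ot convolution.c

    # Fedora 15 / gcc 4.6
    gcc -O3 -mavx -m64 -march=corei7-avx -mtune=corei7-avx -o convolution convolution.c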

1 Solution
SergeyKostrov
Valued Contributor II
>>...2. Fedora Linux 15 w/ gcc 4.6 on a i7-3612QE

I recommend upgrading GCC to version 4.8.1 Release 4.

AVX Performance Tests [ Microsoft C++ compiler VS 2010 ( AVX ) ]
...
Matrix Size: 8192 x 8192

Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 62.50000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 58.50000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 62.25000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 62.50000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 58.50000 ticks
Add - 1D-based - Passed

Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 62.50000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 66.25000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 62.25000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 62.50000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 62.50000 ticks
Sub - 1D-based - Passed
...


Bernard
Valued Contributor I

Sorry, but I am confused. Did you use inline AVX assembly in your code, or SIMD AVX intrinsics?

James_S_7
Beginner

My apologies for not being more specific. I used SIMD AVX intrinsics, more specifically the functions _mm256_loadu_ps, _mm256_mul_ps, _mm256_add_ps, and _mm256_storeu_ps.
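In outline, the kernel looks something like this (a simplified sketch rather than my actual code; the names are made up and taps is assumed to be a multiple of 8):

    #include <immintrin.h>

    /* Sketch of one output sample of the convolution (illustrative only). */
    float convolve_point(const float *in, const float *coef, int taps)
    {
        __m256 acc = _mm256_setzero_ps();
        for (int k = 0; k < taps; k += 8) {
            __m256 x = _mm256_loadu_ps(in + k);    /* 8 input samples */
            __m256 c = _mm256_loadu_ps(coef + k);  /* 8 coefficients  */
            acc = _mm256_add_ps(acc, _mm256_mul_ps(x, c));
        }
        float partial[8];
        _mm256_storeu_ps(partial, acc);            /* spill the 8 partial sums */
        float sum = 0.0f;
        for (int j = 0; j < 8; j++)                /* scalar horizontal sum */
            sum += partial[j];
        return sum;
    }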

Bernard
Valued Contributor I

First question: I see that you are comparing compiled code on two different processor generations. How do you measure your code's performance?

James_S_7
Beginner

I am measuring performance by timing the operation (the operation being the convolution of the data). I am using native libraries to grab a timestamp and determine the length in milliseconds. Yes, they are different generations, but I would presume the newer generation would give better AVX numbers than the older one. This is why I think something is wrong with the gcc version or with how I have set its optimization flags.
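For what it's worth, the measurement looks roughly like this on the Linux side (a sketch; run_convolution is a placeholder name, and the Windows build uses QueryPerformanceCounter instead):

    #include <time.h>

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    run_convolution();                      /* the operation under test */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3
              + (t1.tv_nsec - t0.tv_nsec) / 1e6;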

Bernard
Valued Contributor I

Have you looked at the disassembled code generated by the two compilers? Some intrinsics are not translated directly into a single machine instruction, but I presume you are doing convolution on digital data, so the intrinsics used should mainly be load, store, add, and mul. Moreover, there are additional factors such as memory and cache performance and the overall load of the system at the time of measurement.

Bernard
Valued Contributor I

There are also additional uncertainties, such as the thread being swapped out in the middle of the code being measured. When the thread's execution resumes, its wait time can end up included in the measurement.

James_S_7
Beginner

iliyapolak,

   Thank you very much for your responses. I retrieved the assembly code from gcc and Visual Studio for both the AVX and C implementations of what I am doing. The Visual Studio comparison was fairly clear: the AVX implementation showed the following assembly where my AVX calls were made:

; Line 190
    vmovups    ymm3, YMMWORD PTR [eax]
; Line 192
    vmulps    ymm3, ymm3, YMMWORD PTR [ecx]
    add    eax, edi
    add    ecx, 32                    ; 00000020H
; Line 194
    vaddps    ymm0, ymm3, ymm0

The C implementation was much larger by comparison (I will not post it) and contained a plethora of moves, adds, and multiplies. Thus, it was clear that the Visual Studio compiler utilized the AVX intrinsics and reduced my code size considerably. The gcc assembly, however, was not as clear. The AVX version contains what I believe to be the AVX assembly, but it differs from what Visual Studio produced:

vmulps    %ymm1, %ymm6, %ymm1

vmulps    %ymm1, %ymm5, %ymm1

and so on, five times over. I do notice that in Visual Studio the vmulps call referenced a memory location with "YMMWORD PTR [ecx]", whereas gcc operates on registers directly. The C implementation from gcc did not contain any of the AVX assembly, yet it was shorter overall than the AVX version.

In regards to your second question, the code running on Linux with gcc has its affinity set to avoid context switching, if that is what you were referring to. Thanks again for all of your help.
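Concretely, the pinning is done roughly like this (a sketch; the core number is an arbitrary choice):

    #define _GNU_SOURCE
    #include <sched.h>

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                         /* run only on core 2 */
    sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling process */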

Bernard
Valued Contributor I

In the VS assembly snippet, line 190 loads (dereferences a pointer to) what is probably an input to your convolution function. At line 192 there is a multiplication by the convolution coefficients, which is part of a loop not shown in the snippet, and two lines below that there is pointer arithmetic. At line 194 there is a summation into ymm0, whose load is not shown in the snippet. The GCC implementation probably preloads the ymm registers and does the multiplication on registers directly.

James_S_7
Beginner

Do you think that this ("the GCC implementation probably preloads the ymm registers and does the multiplication on registers directly") is the reason gcc is performing so much slower than its Visual Studio counterpart?

Bernard
Valued Contributor I

Hi James

I cannot answer that, because you did not upload the full disassembly of the GCC-generated code. But I suppose that the ymm register(s) must have been loaded with either the convolution function's input or its coefficients. On Haswell, two loads can be performed in parallel. In the VS code you have a load of one data stream and a mul of that stream with another stream loaded from memory or cache; I think those two operations can be performed in parallel by using the physical registers of the register file. The last operation is dependent on the previous two.
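As an aside (my illustration, not taken from James's code): a standard way to loosen that dependency chain is to unroll with two independent accumulators, so that consecutive vaddps instructions do not serialize on a single register:

    /* Two accumulator registers hide the vaddps latency
       (sketch; taps assumed to be a multiple of 16). */
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    for (int k = 0; k < taps; k += 16) {
        acc0 = _mm256_add_ps(acc0, _mm256_mul_ps(_mm256_loadu_ps(in + k),
                                                 _mm256_loadu_ps(coef + k)));
        acc1 = _mm256_add_ps(acc1, _mm256_mul_ps(_mm256_loadu_ps(in + k + 8),
                                                 _mm256_loadu_ps(coef + k + 8)));
    }
    __m256 acc = _mm256_add_ps(acc0, acc1);  /* combine once at the end */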

SergeyKostrov
Valued Contributor II
>>In Visual Studio:
>>C Implementation: 30ms
>>AVX Implementation: 5ms
>>
>>In GCC:
>>C Implementation: 9ms
>>AVX Implementation: 57ms

In essence, your results are very different from my results, which are based on performance evaluation of some linear algebra algorithms. I would rate the three most widely used C++ compilers as follows:

1. Intel C++ compiler ( versions 12.x and 13.x )

2. GCC-like MinGW ( version 4.8.1 )

3. Microsoft C++ compiler ( VS 2010 )

Take into account that the core parts of these linear algebra algorithms are individually optimized for every C++ compiler in order to get the best possible performance, because every compiler uses different techniques to optimize code, do vectorization, etc. Another factor is compiler options, and I have tuned those as well as possible too.
Bernard
Valued Contributor I

Did you rate the compilers according to their code optimization techniques?

SergeyKostrov
Valued Contributor II
Faster execution is better than slower. For example, the older Intel C++ v12.x outperforms the latest MinGW v4.8.1 by ~10-15%.
TimP
Honored Contributor III

When comparing the performance of AVX intrinsics against the compiler's own choice of AVX instructions, you must observe the recommendation that _mm256_loadu_ps be used only on aligned data on Sandy Bridge. Even on the newer generations, splitting unaligned loads, as the AVX compilation options do, will frequently run faster. _mm256_storeu_ps requires aligned data for satisfactory performance on both Sandy Bridge and Ivy Bridge CPUs, so compilers will peel loops for alignment or split the stores to AVX-128 when permitted to do so.

The CPU architects were aware of the tendency of VS2010 coders to use _mm256_loadu_ps, and so put a fix into Ivy Bridge to alleviate the penalty for unaligned data.
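For illustration, one way to guarantee the 32-byte alignment that the aligned forms want (a sketch; the buffer name and size are arbitrary):

    #include <immintrin.h>

    /* 32-byte aligned allocation so _mm256_load_ps can replace
       the unaligned form (sketch): */
    float *buf = (float *)_mm_malloc(n * sizeof(float), 32);
    /* ... fill buf ... */
    __m256 x = _mm256_load_ps(buf);   /* aligned load */
    /* ... */
    _mm_free(buf);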

VS2012 introduced a limited degree of auto-vectorization as an alternative to vectorization by intrinsics. gcc 4.6, likewise, is a bit too old for use in evaluating AVX auto-vectorization.

We never found out why so much emphasis was placed on reducing instruction counts with AVX when it was well known that this would produce little performance gain in many situations.

Bernard
Valued Contributor I

Sergey Kostrov wrote:

Faster execution is better than slower. For example, the older Intel C++ v12.x outperforms the latest MinGW v4.8.1 by ~10-15%.

Was the performance of Intel C++ version 12.x better than that of the MS VC++ compiler?

I bet that the Intel compiler writers' expertise lets them outperform competing compilers mainly in code optimization targeted at a specific microarchitecture, and in code parallelization and vectorization.

TimP
Honored Contributor III

You could make up a benchmark entirely within the range of situations where MSVC++ (VS2012 or 2013) auto-vectorizes, and find that compiler performing fully as well as the others.

You could set ground rules, as many people do, where you enable aggressive optimizations on one compiler and not another.

Any percentage performance rankings are highly dependent on benchmark content.

You might perhaps set up a table of which compilers perform selected categories of optimizations, according to compilation flags.

Bernard
Valued Contributor I

When I receive my Parallel Studio licence file I plan to test the Intel, MSVC++ and MinGW compilers.

Thanks for the interesting advice on how to perform such a test.

SergeyKostrov
Valued Contributor II
>>...Was the performance of Intel C++ version 12.x better than MS VC++ compiler?

Yes.
TimP
Honored Contributor III

Sergey Kostrov wrote:

>>...Was the performance of Intel C++ version 12.x better than MS VC++ compiler?

Yes.

MSVC++ sometimes optimizes loop-carried data dependency recursions and switch statements better than ICL.

On the other side, in auto-vectorization (first implemented in VS2012, whereas ICL has had it for well over a decade), the following optimizations seem to be missing in MSVC++:

taking advantage of __RESTRICT to enable vectorization (see the sketch at the end of this post)

simd optimization of sum and inner_product reductions

simd optimization based on assertions to overcome "protects exception"

simd optimization of OpenMP for loops (some of these not introduced in ICL or gcc until this year)

simd optimization of non-unit strides

vectorizable math functions

simd optimization of STL transform()

optimizations depending on non-overlapping array sections (for which ICL requires assertions, but gcc optimizes without assertion)

simd optimizations depending on in-lining

optimization based on "node splitting"

optimization of std::max and min (g++ doesn't optimize these, although it seemingly could use gfortran machinery to do so)

       g++ can optimize fmax/fmin when -ffinite-math-only is set (so why not std::max/min?)

optimization based on data alignment assertion

Of course, most of these optimizations are more relevant to floating-point and parallelizable applications than to those at which MSVC++ is more directly targeted. Even in floating-point applications, MSVC++ is likely to optimize at least 50% of vectorizable loops.
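To illustrate the __RESTRICT item above (a minimal sketch, not tied to any one compiler; MSVC, ICL and gcc all accept the __restrict spelling):

    /* With the restrict qualifiers the compiler may assume the three
       arrays never overlap, which is what permits vectorizing the loop: */
    void vadd(float *__restrict a,
              const float *__restrict b,
              const float *__restrict c, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }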

Bernard
Valued Contributor I

iliyapolak wrote:

Sergey Kostrov wrote:

Faster execution is better than slower. For example, the older Intel C++ v12.x outperforms the latest MinGW v4.8.1 by ~10-15%.

Was the performance of Intel C++ version 12.x better than that of the MS VC++ compiler?

I bet that the Intel compiler writers' expertise lets them outperform competing compilers mainly in code optimization targeted at a specific microarchitecture, and in code parallelization and vectorization.

I should have asked how much faster the Intel compiler was than its Microsoft counterpart.
