Solved: AVX Optimizations and Performance: VisualStudio vs GCC

James_S_7 · ‎10-01-2013

Greetings,

I have recently written some code using AVX function calls to perform a convolution in my software. I have compiled and run this code on two platforms with the following compilation settings of note:

1. Windows 7 w/ Visual Studio 2010 on an i7-2760QM

Optimization: Maximize Speed (/O2)

Inline Function Expansion: Only __inline (/Ob1)

Enable Intrinsic Functions: No

Favor Size or Speed: Favor fast code (/Ot)

2. Fedora Linux 15 w/ gcc 4.6 on an i7-3612QE

Flags: -O3 -mavx -m64 -march=corei7-avx -mtune=corei7-avx

For my testing I ran the C implementation and the AVX implementation on both platforms and got the following timing results:

In Visual Studio:

C Implementation: 30ms

AVX Implementation: 5ms

In GCC:

C Implementation: 9ms

AVX Implementation: 57ms

As you can tell, my AVX numbers on Linux are very large by comparison. My concern, and the reason for this post, is that I may not properly understand how to use AVX and which settings are needed to use it properly in both scenarios. For example, take my Visual Studio run. If I change the Enable Intrinsic Functions setting to Yes, my AVX time goes from 5ms to 59ms. Does that mean that stopping the compiler from optimizing with intrinsics, and writing them manually in my code, gives that much better results in Visual Studio? Last I checked there is nothing similar in gcc. Could Microsoft's compiler really be that much better than gcc in this case? Any ideas why my AVX numbers on gcc are so much larger? Any help is most appreciated. Cheers.
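For readers following along, here is a minimal sketch of the kind of comparison described above: a plain C loop next to an AVX-intrinsics version of the same 1D sliding dot product (the kernel is not flipped, so strictly speaking it is a correlation). The original code was not posted, so the function names `conv_scalar` and `conv_avx`, the `AVX_TARGET` macro, and the "valid region only" behaviour are all assumptions made for illustration and not the poster's implementation.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper macro: lets GCC/Clang build the AVX function without
   passing -mavx globally. With MSVC, build with /arch:AVX instead. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Scalar reference: out[i] = sum over k of in[i + k] * kernel[k].
   Writes n - klen + 1 outputs (the "valid" region only). */
void conv_scalar(const float *in, size_t n,
                 const float *kernel, size_t klen, float *out)
{
    if (klen == 0 || n < klen)
        return;
    for (size_t i = 0; i + klen <= n; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < klen; ++k)
            acc += in[i + k] * kernel[k];
        out[i] = acc;
    }
}

/* AVX version: produces 8 adjacent outputs per outer iteration. */
AVX_TARGET
void conv_avx(const float *in, size_t n,
              const float *kernel, size_t klen, float *out)
{
    if (klen == 0 || n < klen)
        return;
    size_t nout = n - klen + 1;
    size_t i = 0;
    for (; i + 8 <= nout; i += 8) {
        __m256 acc = _mm256_setzero_ps();
        for (size_t k = 0; k < klen; ++k) {
            __m256 x = _mm256_loadu_ps(in + i + k); /* 8 inputs, unaligned */
            __m256 c = _mm256_set1_ps(kernel[k]);   /* broadcast one tap   */
            acc = _mm256_add_ps(acc, _mm256_mul_ps(x, c));
        }
        _mm256_storeu_ps(out + i, acc);
    }
    /* Scalar tail for the outputs that do not fill a whole vector. */
    for (; i < nout; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < klen; ++k)
            acc += in[i + k] * kernel[k];
        out[i] = acc;
    }
}
```

Timing both functions on the same input with each compiler, and checking that they produce the same output, is one way to separate a slow AVX code path from a measurement or build-setting problem.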

SergeyKostrov · ‎12-06-2013

>>...2. Fedora Linux 15 w/ gcc 4.6 on a i7-3612QE

I recommend that you upgrade GCC to version 4.8.1 Release 4.

AVX Performance Tests [ Microsoft C++ compiler VS 2010 ( AVX ) ]

...
Matrix Size: 8192 x 8192
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 62.50000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 58.50000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 62.25000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 62.50000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 58.50000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 62.50000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 66.25000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 62.25000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 62.50000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 62.50000 ticks
Sub - 1D-based - Passed
...


Bernard · ‎02-21-2014

It seems that ecx contains a pointer to aligned data that is accessed linearly (the array index is incremented linearly), which is probably why the instruction

vmulps ymm3,ymm3,ymmword ptr[ecx]

is used.
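As a rough illustration of where an instruction of that shape can come from, here is a small C sketch in which one multiply operand is loaded from a 32-byte-aligned array that is walked linearly. The function name `mul_arrays` and the `AVX_TARGET` macro are made up for this example. Whether the load is actually folded into the multiply as a memory operand depends on the compiler and its settings, so treat this as an assumption to check in the generated assembly and not as a guarantee.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper macro: lets GCC/Clang build the AVX function without
   passing -mavx globally. With MSVC, build with /arch:AVX instead. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* dst[i] = a[i] * b[i]. Assumes b is 32-byte aligned; a and dst may be
   unaligned. The pointer into b advances by 8 floats per iteration. */
AVX_TARGET
void mul_arrays(const float *a, const float *b, float *dst, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        /* Aligned load. A compiler may fold it into the multiply, giving
           something like: vmulps ymm, ymm, ymmword ptr [reg]. */
        __m256 vb = _mm256_load_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(va, vb));
    }
    /* Scalar tail for the elements that do not fill a whole vector. */
    for (; i < n; ++i)
        dst[i] = a[i] * b[i];
}
```

Compiling with `gcc -O3 -mavx -S` (or looking at the disassembly view in Visual Studio) shows whether the multiply takes its second source straight from memory or from a register loaded by a separate `vmovaps`.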