Greetings,
I have recently written some code using AVX function calls to perform a convolution in my software. I have compiled and run this code on two platforms with the following compilation settings of note:
1. Windows 7 w/ Visual Studio 2010 on a i7-2760QM
Optimization: Maximize Speed (/O2)
Inline Function Expansion: Only __inline(/Ob1)
Enable Intrinsic Functions: No
Favor Size or Speed: Favor fast code (/Ot)
2. Fedora Linux 15 w/ gcc 4.6 on a i7-3612QE
Flags: -O3 -mavx -m64 -march=corei7-avx -mtune=corei7-avx
For my testing I ran the C implementation and the AVX implementation on both platforms and got the following timing results:
In Visual Studio:
C Implementation: 30ms
AVX Implementation: 5ms
In GCC:
C Implementation: 9ms
AVX Implementation: 57ms
As you can tell, my AVX numbers on Linux are very large by comparison. My concern, and the reason for this post, is that I may not properly understand how to use AVX and the settings that control it in both scenarios. For example, take my Visual Studio run: if I change Enable Intrinsic Functions to Yes, my AVX time goes from 5ms to 59ms. Does that mean that disabling the compiler's intrinsic optimization and calling the intrinsics manually in Visual Studio gives that much better results? Last I checked there is nothing similar in gcc. Could Microsoft's compiler really be that much better than gcc in this case? Any ideas why my AVX numbers under gcc are so much larger? Any help is most appreciated. Cheers.
Sorry, but I am confused. Did you use inline AVX assembly in your code, or SIMD AVX intrinsics?
My apologies for not being more specific. I used SIMD AVX intrinsics, more specifically the functions _mm256_loadu_ps, _mm256_mul_ps, _mm256_add_ps, and _mm256_storeu_ps.
First question: I see that you are comparing compiled code on two different processor generations. How do you measure your code's performance?
I am measuring performance by timing the operation (the operation being the convolution over the data). I use native libraries to grab timestamps and determine the length of the operation in milliseconds. Yes, they are different generations, but I would presume the newer generation would give better AVX numbers than the older one. This is why I think something is wrong with the gcc version or with how I have set its optimization flags.
Have you looked at the disassembled code generated by those two compilers? Some of the intrinsics are not translated directly into a single machine instruction. I presume you are doing convolution on digital data, so the intrinsics used should mainly be load, store, add, and mul. Moreover, there are additional factors like memory and cache performance, and the overall load on the system at the time of measurement.
There are also additional sources of uncertainty, such as your thread being swapped out in the middle of the code being measured. Basically, when the thread's execution is resumed, the wait time can also end up included in the measurement.
iliyapolak,
Thank you very much for your responses. I retrieved the assembly code from gcc and Visual Studio for both the AVX and C implementations of what I am doing. The Visual Studio comparison was fairly clear: the AVX implementation showed the following assembly where my AVX calls were made:
; Line 190
vmovups ymm3, YMMWORD PTR [eax]
; Line 192
vmulps ymm3, ymm3, YMMWORD PTR [ecx]
add eax, edi
add ecx, 32 ; 00000020H
; Line 194
vaddps ymm0, ymm3, ymm0
The C implementation was much larger by comparison (I will not post it) and contained a plethora of moves, adds, and multiplies. Thus it was clear that the Visual Studio compiler utilized the AVX intrinsics and reduced my code size considerably. The gcc assembly, however, was not as clear. The AVX version contains what I believe to be the AVX assembly, but it differs from what Visual Studio produced:
vmulps %ymm1, %ymm6, %ymm1
vmulps %ymm1, %ymm5, %ymm1
etc., as this occurs 5 times over. I do notice that in Visual Studio the vmulps instruction referenced a memory location with "YMMWORD PTR [ecx]", whereas gcc operates on registers directly. The C implementation from gcc did not contain any AVX assembly; however, it was shorter overall than the AVX version.
Regarding your second point: the code running on Linux with gcc has its affinity set to avoid context switching, if that is what you were referring to. Thanks again for all of your help.
In the VS assembly snippet, line 190 loads (dereferences a pointer to) what is probably an input to your convolution function. At line 192 there is a multiplication by the convolution coefficients, which is part of a loop not shown in the snippet, and two lines below that there is pointer arithmetic. At line 194 there is a summation into ymm0, whose load is also not shown in the snippet. The gcc implementation probably preloads the ymm registers and does the multiplication on registers directly.
Do you think that this ("the gcc implementation probably preloads the ymm registers and does the multiplication on registers directly") is the reason gcc performs so much slower than its Visual Studio counterpart?
Hi James
I cannot answer that, because you did not upload a full disassembly of the gcc-generated code. But I suppose that the ymm register(s) must have been loaded with either the convolution function's input or its coefficients. On Haswell, two loads can be performed in parallel. In the VS code you have a load of one data stream and a mul of that stream with another stream loaded from memory or cache; I think those two operations can be performed in parallel by using the physical registers of the register file. The last operation is dependent on the previous two.
Did you rate the compilers according to their code optimization techniques?
When comparing the performance of AVX intrinsics against the compiler's own choice of AVX instructions, you must observe the recommendation that _mm256_loadu_ps be used only on aligned data on Sandy Bridge. Even on newer generations, splitting unaligned loads, as the AVX compilation options do, will frequently run faster. _mm256_storeu_ps requires aligned data for satisfactory performance on both Sandy Bridge and Ivy Bridge CPUs, so compilers will peel for alignment or split such stores into AVX-128 when permitted to do so.
The CPU architects were aware of the tendency of VS2010 coders to use _mm256_loadu_ps and so put in a fix in Ivy Bridge to alleviate the penalty for unaligned data.
VS2012 introduced a limited degree of auto-vectorization as an alternative to vectorization by intrinsics. gcc 4.6 as well is a bit too old for use in evaluating AVX auto-vectorization.
We never found out why so much emphasis was placed on reduced numbers of instructions with AVX when it was well known that this would produce little performance gain in many situations.
Sergey Kostrov wrote:
The fastest execution is better than slower execution. For example, the older Intel C++ v12.x outperforms the latest MinGW v4.8.1 by ~10-15%.
Was the performance of Intel C++ version 12.x better than MS VC++ compiler?
I bet that the Intel compiler writers' expertise lets them outperform competing compilers mainly in the areas of microarchitecture-specific code optimization, parallelization, and vectorization.
You could make up a benchmark entirely within the range of situations where MSVC++ (VS2012 or 2013) auto-vectorizes, and find that compiler performing fully as well as the others.
You could set ground rules, as many people do, where you enable aggressive optimizations on one compiler and not another.
Any percentage performance rankings are highly dependent on benchmark content.
You might perhaps set up a table of which compilers perform selected categories of optimizations, according to compilation flags.
When I receive my Parallel Studio licence file, I plan to test the Intel, MSVC++, and MinGW compilers.
Thanks for the interesting advice on how to perform such a test.
Sergey Kostrov wrote:
>>...Was the performance of Intel C++ version 12.x better than MS VC++ compiler?
Yes.
MSVC++ sometimes optimizes loop-carried data dependency recurrences and switch statements better than ICL.
On the other hand, in auto-vectorization (first implemented in VS2012, whereas ICL has had it for well over a decade), the following optimizations seem to be missing in MSVC++:
- taking advantage of __RESTRICT to enable vectorization
- SIMD optimization of sum and inner_product reductions
- SIMD optimization based on assertions to overcome "protects exception"
- SIMD optimization of OpenMP for loops (some of these not introduced in ICL or gcc until this year)
- SIMD optimization of non-unit strides
- vectorizable math functions
- SIMD optimization of STL transform()
- optimizations depending on non-overlapping array sections (for which ICL requires assertions, but gcc optimizes without assertion)
- SIMD optimizations depending on inlining
- optimization based on "node splitting"
- optimization of std::max and min (g++ doesn't optimize these, although it seemingly could use gfortran machinery to do so)
- optimization of fmax/fmin when -ffinite-math-only is set, which g++ does perform (so why not std::max/min?)
- optimization based on data alignment assertions
Of course, most of these optimizations are more relevant to floating-point and parallelizable applications than to those MSVC++ more directly targets. Even in floating-point applications, MSVC++ is likely to optimize at least 50% of vectorizable loops.
iliyapolak wrote:
Sergey Kostrov wrote: The fastest execution is better than slower execution. For example, the older Intel C++ v12.x outperforms the latest MinGW v4.8.1 by ~10-15%.
Was the performance of Intel C++ version 12.x better than MS VC++ compiler?
I bet that the Intel compiler writers' expertise lets them outperform competing compilers mainly in the areas of microarchitecture-specific code optimization, parallelization, and vectorization.
I should have asked how much faster the Intel compiler was than its Microsoft counterpart.