Greetings,
I have recently written some code using AVX function calls to perform a convolution in my software. I have compiled and run this code on two platforms with the following compilation settings of note:
1. Windows 7 w/ Visual Studio 2010 on a i7-2760QM
Optimization: Maximize Speed (/O2)
Inline Function Expansion: Only __inline (/Ob1)
Enable Intrinsic Functions: No
Favor Size or Speed: Favor fast code (/Ot)
2. Fedora Linux 15 w/ gcc 4.6 on a i7-3612QE
Flags: -O3 -mavx -m64 -march=corei7-avx -mtune=corei7-avx
For my testing I ran the C implementation and the AVX implementation on both platforms and got the following timing results:
In Visual Studio:
C Implementation: 30ms
AVX Implementation: 5ms
In GCC:
C Implementation: 9ms
AVX Implementation: 57ms
As you can tell, my AVX numbers on Linux are very large by comparison. My concern, and the reason for this post, is that I may not have a proper understanding of using AVX and of the settings needed to use it properly in both scenarios. For example, take my Visual Studio run. If I change the flag Enable Intrinsics to Yes, my AVX numbers go from 5ms to 59ms. Does that mean that preventing the compiler from optimizing with intrinsics and writing them manually gives that much better results in Visual Studio? Last I checked there is nothing similar in gcc. Could Microsoft's compiler really produce that much better code than gcc in this case? Any ideas why my AVX numbers on gcc are just that much larger? Any help is most appreciated. Cheers.
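For reference, an AVX-intrinsics convolution inner loop typically looks something like the sketch below. This is hypothetical code, not my actual kernel (the names convolve, in, kernel, out are illustrative); the scalar path is compiled when AVX is unavailable:

```c
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>   /* AVX intrinsics, only when compiled with -mavx */
#endif

/* Hypothetical 1-D convolution: out[i] = sum_k in[i+k] * kernel[k].
   Caller guarantees in[] has n + klen - 1 valid elements. */
void convolve(const float *in, const float *kernel, float *out,
              size_t n, size_t klen)
{
    size_t i = 0;
#ifdef __AVX__
    /* Process 8 output samples per iteration. */
    for (; i + 8 <= n; i += 8) {
        __m256 acc = _mm256_setzero_ps();
        for (size_t k = 0; k < klen; ++k) {
            __m256 v = _mm256_loadu_ps(in + i + k);   /* unaligned load */
            __m256 c = _mm256_set1_ps(kernel[k]);     /* broadcast tap  */
            acc = _mm256_add_ps(acc, _mm256_mul_ps(v, c));
        }
        _mm256_storeu_ps(out + i, acc);
    }
#endif
    /* Scalar tail (and the full loop when AVX is not compiled in). */
    for (; i < n; ++i) {
        float s = 0.0f;
        for (size_t k = 0; k < klen; ++k)
            s += in[i + k] * kernel[k];
        out[i] = s;
    }
}
```

Note that with gcc, a kernel like this must be compiled with -mavx (or a -march that implies it) or the intrinsic calls will not compile.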
OK, situations where I see gcc out-performing icc:
Unrolled source, requiring re-roll optimization, where the compiler replaces the source code unrolling by its own optimization. icc dropped re-roll back around version 10.0. We can argue that source code unrolling is an undesirable practice, given that modern software and hardware techniques eliminate the need for it in trivial cases. Unroll-and-jam is another story.
Variable stride (even when written with CEAN so as to get AVX-128 in icc) e.g.
for (i__ = *n1; i__ <= i__2; i__ += i__3)
    a[i__] += b[*n - (k += j) + 1];
Some cases where intrinsics are used to dictate code generation, or where vectorization isn't possible, and gcc's superior automatic unrolling facilities come into play, if you are willing to use them (and adjust the unrolling limit to your CPU), e.g.
CFLAGS = -O3 -std=c99 -funroll-loops --param max-unroll-times=2 -ffast-math -fno-cx-limited-range -march=corei7-avx -fopenmp -g3 -gdwarf-2
Those debug options are recommended for using Windows gcc with Intel(r) VTune(tm), in case you missed the hint.
gcc -ffast-math -fno-cx-limited-range is roughly equivalent to icl -fp:fast=1, the latter being a default.
If you don't study the options, you won't get the best out of gcc.
Intel corei7-4 CPUs want less software unrolling than their predecessors, while gcc's aggressive unlimited unrolling was better suited to the Harpertown generation. I'm not getting consistently satisfactory results with avx2 with either compiler; avx2 seems to expose bugs in gcc with openmp, while icc drops some i7-2 optimizations which remain better on i7-4. I'll wait to optimize for corei7-5 when it arrives, if my retirement permits.
It's not usually difficult to discover and correct situations where icc doesn't match gcc performance, while there are many situations where it's easier to get full performance with icc.
Aggressively unrolling by more than two, besides increasing register pressure, will still only utilize two execution ports (depending on the instructions) in parallel per cycle. When coupled with prefetching, outstanding decoded machine-code instructions (micro-ops) that correspond to the prefetched data can be pipelined inside the SIMD execution stack. Here the instruction cache can speed up execution by caching frequently used decoded machine-code instructions.
@Sergey
Thanks for posting the results of compiler comparison.
Sergey,
I recompiled my software on a machine that had gcc 4.8.2 and even updated my compiler flags to reflect the following:
-O3 -march=core-avx-i -mtune=core-avx-i
I am, however, getting on average the exact same timing numbers as before... which to me is very odd. I can't help but think I am missing something trivial...
Thanks again for your help in this matter.
Can you post a full disassembly of the GCC-generated code? Moreover, I would advise you to profile the FIR code with the help of VTune.
Sergey Kostrov wrote:
Also, a priority boost to High or Realtime will improve performance by ~1.5 percent (applicable to code compiled with any C++ compiler):
Hi Sergey
Did you try to disable some hardware like NICs and rerun your tests?
It seems that Visual Studio has enough hints to fold your "kernel" (YMMWORD [ecx]) right into the instruction, which means it knows that the [ecx] pointer is aligned. It is hard to say more without the source code. But I guess that on g++ it does an additional vmovups to load the kernel register; even worse, it might load from somewhere that is actually not aligned, and worse still, maybe it does this for every pack of pixels where it could do it once for the whole loop.
Maybe Visual Studio is being more aggressive with inlining. Have you tried the g++ flags that force deep inlining (which is critical in a convolution algorithm if you wrote it as multiple functions/functors)?
Is it a good idea to use flags like -mtune=corei7-avx that might perform optimizations that counter yours?
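To make the alignment point concrete: _mm256_load_ps faults unless the pointer is 32-byte aligned, while _mm256_loadu_ps accepts any address (at some cost on older cores). A small sketch, with hypothetical names (sum8, alloc_floats); the scalar path compiles when AVX is off:

```c
#include <stdint.h>
#include <stdlib.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Sum 8 floats starting at p, using an aligned AVX load when possible. */
float sum8(const float *p)
{
#ifdef __AVX__
    __m256 v;
    if (((uintptr_t)p & 31u) == 0)
        v = _mm256_load_ps(p);    /* aligned: requires 32-byte alignment */
    else
        v = _mm256_loadu_ps(p);   /* unaligned: works for any address   */
    /* Horizontal add via the two 128-bit halves. */
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
#else
    float s = 0.0f;
    for (int i = 0; i < 8; ++i) s += p[i];
    return s;
#endif
}

/* Allocate a 32-byte-aligned float buffer (C11 aligned_alloc requires
   the size to be a multiple of the alignment, hence the rounding). */
float *alloc_floats(size_t n)
{
    return aligned_alloc(32, ((n * sizeof(float) + 31) / 32) * 32);
}
```

Allocating the kernel and image rows with such an aligned allocator (or _mm_malloc) lets the compiler keep aligned loads and memory operands, which may be part of the difference Visual Studio is exploiting.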