1) I spent more than a day playing with AVX intrinsics, only to find out that although I got almost as fast as my assembler code (with ICC actually slightly faster), ICC itself produced even better code! So it seems I'm going for ICC after all, but:
- I need the software to be working on everything from SSE2 upwards, hence /arch:SSE2
- I want an auto-dispatcher for AVX, since I found the AVX code is faster on Sandy Bridge and very much faster on Haswell
So I used /QaxCORE-AVX, but there was no difference, and in the debugger I verified it didn't create (or run) any AVX code; it was using just SSE2. It did create great AVX code with /arch:AVX, but that wouldn't work on older CPUs, so it is not usable. So how can I enable the dispatching?
2) My software is full of vector loops such as
for (int i = 0; i < cnt; i++) dst[i] = (a[i] + b[i]) * c[i];
In these cases I know that I want this particular part of the code dispatched into multiple architectures (perhaps even FMA and newer in some cases). So should I mark these parts of the code somehow? Or how does this work? Is there some guide about writing code so that it is easier to vectorize?
3) Does the vectorization (and other optimizations) work the same way on OS X as on Windows? I'll need both, and I'm a little scared, as things are usually much more problematic on OS X.
4) I actually compiled a big project with ICC and compared the real-time performance, and sadly the difference compared to MSVC is hardly noticeable. The code ICC produced is about 30% bigger, which makes me wonder whether, despite ICC producing better vectorized code, the code is so big that code-cache misses become frequent enough to degrade performance back to the original level.
5) Can I use profile-guided optimization with just ICC, without buying VTune? Is it worth the trouble at all?
Thanks in advance!
You may have meant to use QaxAVX. I don't know why some of the newer variants include the CORE string, but it's spelled the same way with or without the "a" clause.
I suppose it's probably not worthwhile to request both AVX and AVX2 code paths. The method to request 3 or more is due to be fully documented in the near future.
I don't know why it should be any more risky on the Mac, e.g. -axAVX.
I've tried to address the question of recommended patterns for vectorization by posting example code and commentary on Google Sites (tprincesite). There is good advice in the docs installed with the compiler, and more posted on the Intel site and elsewhere.
Thank you, it does work indeed! A few questions though:
1) Will it work on AMD CPUs? There are some ambiguous comments at the bottom of the page.
2) Is there a way to somehow mark in the code that I want "this particular" part of the code dispatched?
Your other post also has some related questions on auto-vec/avx etc., with links to articles. The one on how to manually dispatch below might come in handy as well.
Partly. I now understand that AMD processors will be supported unless, for example, "-xAVX" is used. But about the dispatching - I was actually asking about a different thing (sorry I didn't rephrase it) - I was wondering if there's a pragma or something that would make the compiler "force autodispatch". The thing is, my software is a huge monster, but only small fragments are really "the processing core". ICC seems to target these well, but for my own peace of mind I was thinking it could be handy to mark the parts of the code that actually need maximum performance. Optimizing everything is often overkill, which is probably why profile-guided optimization doesn't really lead to a performance gain (or only a negligible one) - but it minimizes the executable size, A LOT! So apparently PGO makes the compiler understand which blocks really need the maximum "care" - something I could mark myself. But maybe that's nonsense; it works well automatically, so...
If your target is a CPU which supports AVX (including AMD), you would set -mavx. For AMD CPUs with FMA3 support, as well as Intel AVX2, you could set -mCORE-AVX2 (which disagrees with the gcc spelling -march=core-avx2). The -m options avoid run-time dispatching and don't depend on recognizing the CPU architecture (they simply fail if the instructions aren't supported).
Intel's dispatching on AMD CPUs has been intended to select the default path, although certain options such as -axAVX may work; in the past it has also been necessary to select -imf-arch-consistency=true to disable some math-library architecture options which came out wrong on certain AMD CPUs.
You certainly can set the source files which aren't performance-critical to a single architecture such as SSE3, and remove ipo options, so as to reduce build time and object size. As you indicate, code sections which might benefit from, for example, both AVX and AVX2 paths could be isolated.
Well put, Tim. Additionally, if the user wants to use ipo, they may try applying it just to the very critical hotspots, and that shouldn't affect the build time much either. You can always try it out and test for any performance gain, which indeed depends on the context of the code/app.
Is there a way to make the Intel Compiler (let's say version 18.1 and above) create a few code paths which will be dispatched at run time based on features?
Namely, choosing the AVX code path on any CPU which supports AVX (including AMD), or the SSE path, according to the CPU's features (and not its maker).