Hans_Van_Zutphen
Beginner

Multiple code paths and intrinsics

I have some code where I'm using a combination of automatically vectorized code (with many different possible CPU paths, including SSE2, SSE4.2, AVX and AVX2) and some hand-written intrinsics.

One function may contain both types of loops. So, what I would like is to be able to tell the compiler which of the hand-written types of code to use (they are written using SSE2 and AVX2). But I really don't want to have to write separate dispatcher functions for each loop - including having to make all the variables that are needed available in the dispatched function.

Is there an easier way to do this? At the moment I have some #defines and can compile 2 different versions of my code, which means that - unless I know a specific build will definitely run on an AVX2 CPU - I just use the SSE2 version.

To make things worse, I now need to compile my code for an ARM chip as well, and - since there is no Intel compiler for ARM - I have to use GCC, which unfortunately doesn't vectorize most of the code that the Intel compiler does, and the performance is not good enough. So I'm kinda forced to manually vectorize this code. The problem is that this means vectorizing dozens, potentially hundreds of loops, and if I do that, I still want the Intel-compiled version to support both SSE2 and AVX2. The difference in performance between the SSE2 and AVX2 compiled versions is about 7% for the whole program, and I really cannot afford losing that extra performance.

The easiest way to vectorize those loops would be a simple class that encapsulates the SSE/AVX intrinsics, so I hardly need to change the code at all - but I would still need to get both SSE2 and AVX2 code out of it without writing everything twice. (Copying the code is not really an option either, and in the future I also want to be able to add AVX512 support.)
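A minimal sketch of such a wrapper class (the names Vec, Vec8f, add_arrays are hypothetical, and the AVX2 branch assumes an AVX2-enabled build; the unguarded scalar fallback is what compiles everywhere): the loop body is written once against the wrapper's interface, and the vector width is a compile-time constant of the wrapper.

```cpp
#include <cassert>

#if defined(__AVX2__)
  #include <immintrin.h>
  // AVX2 build: the wrapper holds an __m256 and maps onto intrinsics.
  struct Vec8f {
      __m256 v;
      static const int size = 8;
      void load(const float* p)  { v = _mm256_loadu_ps(p); }
      void store(float* p) const { _mm256_storeu_ps(p, v); }
      void add(const Vec8f& o)   { v = _mm256_add_ps(v, o.v); }
  };
  typedef Vec8f Vec;
#else
  // Scalar fallback with the identical interface, so the loop below
  // compiles unchanged on any target (including ARM with GCC).
  struct Vec1f {
      float v;
      static const int size = 1;
      void load(const float* p)  { v = *p; }
      void store(float* p) const { *p = v; }
      void add(const Vec1f& o)   { v += o.v; }
  };
  typedef Vec1f Vec;
#endif

// The loop body is written only once, against the wrapper interface.
// n is assumed to be a multiple of Vec::size.
void add_arrays(float* dst, const float* a, const float* b, int n) {
    for (int i = 0; i < n; i += Vec::size) {
        Vec x, y;
        x.load(a + i);
        y.load(b + i);
        x.add(y);
        x.store(dst + i);
    }
}
```

An SSE2 variant of the wrapper (four floats in an __m128) would slot into the same #if ladder; the open question from the post - getting two widths into one binary - is what the dispatch discussion below is about.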

 

4 Replies
TimP
Black Belt

You may have seen code such as my publicly posted stuff using a lot of #if __AVX2__ and #if __INTEL_COMPILER as a result of a lot of research about how to vectorize with gcc as well as icc.  With those #if conditions, choosing the compiler and architecture switches automatically makes choices between source code paths at compile time.  I don't understand whether you like or dislike that method.
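For concreteness, a small sketch of that compile-time selection (the function name is illustrative): the architecture switch the file is built with (-mavx2, -xCORE-AVX2, /arch:..., and so on) predefines __AVX2__, __AVX__, __SSE4_2__, etc., and the #if ladder picks one source path at compile time.

```cpp
#include <cassert>
#include <cstring>

// Each branch would hold a different hand-tuned loop in real code;
// here they just report which path the compiler flags selected.
const char* active_path() {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_2__)
    return "SSE4.2";
#else
    return "SSE2/scalar";
#endif
}
```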

I suppose AVX512 should work well as a promotion of loops designed for AVX2, and there should be a few additional opportunities specific to AVX512. As compilers have, up to now, set all the pre-defined macro symbols of each older ISA included in a new one, any AVX2-guarded code would be chosen automatically for AVX512, just as AVX-guarded code now becomes active under AVX2.

I have cases where I wrote code with AVX intrinsics but don't allow the compiler to use it except under AVX2 (due to the improved unaligned support); likewise a few cases where code which is valid but slow for SSE2 becomes active only with SSE4.1.

In my code there are more #if __MIC__ relics than are now necessary, as the Intel compilers have resolved a fair number of those cases where MIC required differences in the setting of #pragma omp simd and the like.  There are still cases where the compiler handles vector relative misalignment only when compiling for MIC; maybe it will do that for the AVX512 target as well.  Apparently, there will be no AVX512 client CPUs, at least not until long after such server CPUs come out.

A part of this effort is in choosing to avoid vectorization where compilers are excessively aggressive.  With icc, #pragma novector is available, but with gcc, #pragma simd safelen(1) seems the most frequently useful (and it sometimes works with icc as well).

icc, more than gcc, sometimes needs a barrier to avoid fusing an effectively vectorizable loop with a non-vectorizable or incompatibly aligned one. The main choices are to put #pragma omp simd on the vectorizable loop, or to place the Intel-specific #pragma nofusion between the loops.
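A sketch of those two workarounds (function and data are illustrative): the first loop has independent iterations and carries #pragma omp simd; the second has a loop-carried dependence, and the Intel-specific #pragma nofusion, shown here as a comment, could be placed between them instead.

```cpp
#include <cassert>

// First loop: independent iterations, safe to vectorize.
// Second loop: each b[i] depends on b[i-1], so it must stay scalar;
// fusing the two would block vectorization of the first.
void scale_then_prefix_sum(float* a, float* b, int n, float s) {
    #pragma omp simd          // mark only this loop as vectorizable
    for (int i = 0; i < n; i++)
        a[i] *= s;

    // #pragma nofusion       // icc-only alternative: forbid fusing here

    for (int i = 1; i < n; i++)
        b[i] += b[i - 1];     // loop-carried dependence
}
```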

Hans_Van_Zutphen
Beginner

#if __AVX2__ won't work at run time. And I want to have multiple code paths included in my executable.

So, now I have something like this (extremely simplified sample):

#pragma vector aligned
for (int c=0; c<4096; c++)
{
  a[c] = 0;
}

This works fine with the Intel compiler, but when things get a bit more complicated than this, gcc for ARM won't vectorize it. I *really* don't want to have to replace all loops like these in my code by both SSE(x) and AVX (and in the future AVX512) loops.

What would be acceptable for me would be to make some class and then write something like this (pseudo-code!):

for (int c=0; c<4096; c+=VECSIZE)
{
  ((VECTYPE*)&a[c])->SetZero();
}

But I would need the compiler to automatically generate multiple code paths for different VECSIZE values.
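One way to get that without writing a dispatcher per loop is to template the loop body on VECSIZE and pick the instantiation once through a function pointer. A sketch under those assumptions (the widths, names, and CPU check are illustrative; the inner k loop stands in for the single vector store that real intrinsics would perform):

```cpp
#include <cassert>

// One template generates the loop body for every width.
template <int VECSIZE>
void zero_array(float* a, int n) {
    for (int c = 0; c < n; c += VECSIZE)
        for (int k = 0; k < VECSIZE; k++)   // stand-in for one vector store
            a[c + k] = 0.0f;
}

typedef void (*zero_fn)(float*, int);

// Chosen once at startup; with real intrinsics, the width-8 instantiation
// would be the AVX2 body and the width-4 one the SSE2 body.
zero_fn pick_zero() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2"))
        return &zero_array<8>;
#endif
    return &zero_array<4>;                  // SSE2-width fallback
}
```

The price is one indirect call per loop, not per iteration, so for loops over thousands of elements the dispatch cost is negligible; adding AVX512 later would just mean instantiating the same template at width 16.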

TimP
Black Belt

Intel C++ doesn't like to vectorize your example either, as it recognizes the opportunity for memset.  Sorry to quibble.

I'm not about to suggest that there is a generally practical way to use intrinsics of varying widths.

I haven't checked thoroughly whether #if __AVX2__ and the like work correctly with ICL compile options like -QaxHost -arch:SSE3, but I guess you're implying you don't like the multi-ISA target option.  I can think of reasons but can't guess yours.  I'd be surprised if you really need to target SSE2 when all CPUs of the last decade are at least SSE3.

Hans_Van_Zutphen
Beginner

Lots of people are still using my software on older systems, partly because the intention is that you run *only* my software on the machine and nothing else, so many people use an old PC for it. In fact, when I dropped support for SSE (without the 2!) about 2 years ago, some people started to complain!
