Could we have an official

Hans_v_ · ‎12-19-2018

I'm running into a very annoying compiler optimization issue that's causing crashes on older systems (CPU, OS). I'm just using an example here to demonstrate the issue:

switch (codepath)
{
    case AVX:
        __m256 bla = _mm256_setzero_ps();
        *x = bla;
        break;
    case SSE2:
        __m128 bla = _mm_setzero_ps();
        *x = bla;
        break;
    default:
        float bla = 0;
        *x = bla;
        break;
}

The problem is, that for some reason the compiler "thinks" that it's a good idea to move certain instructions outside of the switch() statement. So, in assembly code, it can do a _mm256_setzero_ps() before checking for CPU type - I'm guessing that it works for both the AVX and SSE2 case here because half of the register is shared. Also, in the SSE2 code, probably due to the fact that I'm also using AVX code in the same function, movps instructions are replaced by vmovps.

What I want:

1 executable that targets multiple instruction sets
No "Genuine Intel" checks. This needs to work on AMD as well.
Easy to use, if possible I don't want to write separate functions for separate targets (I need this in dozens of places in my code)
For debugging purposes, a way to dynamically choose other code paths runtime

In debug mode everything works as expected, and in release mode, it appears to work fine for 42 of the 44 places where I'm doing this. But those other two are causing crashes, and I obviously don't want to have code that might break on each new compiler version.

Using the dispatch-behavior doesn't really work because it doesn't appear to work for AMD, and I can't overrule it when I run the binary.

Using #if __AVX__ as was suggested elsewhere in this forum doesn't work either, because the #if is parsed once for the entire build (not per code path, and it evaluates to 0, cannot be changed during runtime and probably having multiple code paths won't work well for AMD either).

Hans_v_ · ‎12-19-2018

Update: Moving the code to a separate function that's not inlined seems to fix it. But, that's quite ugly (especially if I have to use that in 44 places).

jimdempseyatthecove · ‎12-19-2018

switch (codepath)
{
    case AVX:
        *((__m256*)x) =  _mm256_setzero_ps(); // requires x to point to 256-bit vector
        break;
    case SSE2:
        *((__m128*)x) =_mm_setzero_ps(); // requires x to point to 128-bit vector
        break;
    default:
        *x = 0.0f; // implies x points to 32-bit float
        break;
}

Jim Dempsey

jimdempseyatthecove · ‎12-19-2018

Note, for the above, the compiler should .NOT. be able to xor a single ymm register outside the switch due to when the CPU not having AVX, this would cause a illegal instruction fault. IMHO to do so would break code.

Jim Dempsey

jimdempseyatthecove · ‎12-19-2018

extern void _allow_cpu_features(unsigned __int64);
...

switch (codepath)
{
    case AVX:
       {
        _allow_cpu_features(_FEATURE_AVX);
        __m256 bla = _mm256_setzero_ps();
        *x = bla;
        break;
      }
    case SSE2:
        {
          _allow_cpu_features(_FEATURE_SSE3);
        __m128 bla = _mm_setzero_ps();
        *x = bla;
        break;
       }
    default:
       {
        _allow_cpu_features(_FEATURE_GENERIC_IA32);
        float bla = 0;
        *x = bla;
        break;
     }
}

Jim Dempsey

Hans_v_ · ‎12-19-2018

_allow_cpu_features sounded like exactly what I need, but unfortunately the problem isn't solved by it. Even if I put an explicit _allow_cpu_features(_FEATURE_GENERIC_IA32) above the switch, it still crashes. I'm guessing that this call just lets you write intrinsics that would otherwise cause a compile error (how???) but doesn't actually control optimization?

(The other idea with the x isn't a solution, I just typed some placeholder code to explain the issue. Even if I just use a _mm_setzero_ps() somewhere in the middle of a loop, it's putting a ymm-xor instruction outside the switch.)

(I'm using Intel Composer 2017, 4.210).

jimdempseyatthecove · ‎12-19-2018

Can you tell if the crash is caused by lifting of the AVX2 instruction outside of the switch statement?

Also, keep in mind that your target platform must not be that of AVX2

Use /Qm32 .OR. /arch:SSE2 .or. /QxSSE2 .OR. potentially /arch:IA32

I do not think that you can select anything earlier than SSE2 with the newer compilers.

IOW specify the non-_allow_cpu_features(...) to be that with the least supported instruction set.

Jim Dempsey

Hans_v_ · ‎12-19-2018

Target platform is "Generic" with alternative paths for among others SSE2, AVX and AVX2 (a sequence of /Qax options). What do you mean by "lifting"? If I replace the AVX2 code by a noninlined function everything works, it's just that one (and it's really just one) register initialization instruction is moved to before the switch.

Either removing the AVX2 code or replacing it by a non-inlined function fixes this, in that case the SSE2 code is executed and works fine. Debug mode is always ok.

Obviously, if this can go wrong, the same thing can happen with the SSE2-code on a machine that doesn't support SSE2. I can't imagine that there are a lot of people who still have one of those, but 2 years ago I stopped building the non-SSE2 (Pentium 3 and earlier!) version of my software and I got multiple complaints from people. So I turned them back on (I need to maintain my non-vectorized code anyway for reference and testing/debugging).

TimP · ‎12-20-2018

Support for SSE1 (without SSE2) was removed from Intel compilers about 8 years ago (for a while the option was still there but not supported). Few applications which needed SSE for performance were still running on Pentium 3. ISVs with whom I dealt stopped supporting any non-SSE2 platform. More recently, Intel stopped supporting non-SSE2 as a development platform.

Hans_v_ · ‎12-20-2018

In case anyone else runs into the same issue: I've found a workaround. It's clumsy but it works.

Because I wanted to keep my code inside the function (instead of having to create separate sub-functions for each implementation separately), I ended up creating a class inside each case: statement that contains a __noinline static function that contains the code. Ideally I would have wanted to write my code like this, using some macros:

    CODEPATH_AVX2
        // AVX code
    CODEPATH_SSE2
        // SSE2 code
    CODEPATH_GENERIC
        // Unoptimized reference code
    CODEPATH_END

The problem with this is that I have to pass all the parameters to the class function, because otherwise I can't access them. With that, I ended up with the following, which is a bit clumsy as I said before, but it's workable:

        #define DEFPARS int size, float* restrict data
        #define USEPARS     size                  data
        CODEPATH_OPEN
        CODEPATH_AVX(DEFPARS)
            // AVX code
        CODEPATH_END(USEPARS)
        CODEPATH_SSE2(DEFPARS)
            // SSE2 / NEON code
        CODEPATH_END(USEPARS)
        CODEPATH_GENERIC_NEON(DEFPARS)
            // Reference code
        CODEPATH_CLOSE(USEPARS)
        #undef DEFPARS
        #undef USEPARS

One cool trick here is that by #defining the parameter list, when using it it's seen as a single parameter by the other macros. So no need to use a variable number of parameters etc.

These defines do the trick:

#define CODEPATH_OPEN

#define CODEPATH_CLOSE(args)                               \
                    }                                      \
                };                                         \
                OPTIMIZED::optimized(args);                \
            }                                              \

#define CODEPATH_AVX(args)                                 \
            if (CpuFeatures::hasAVX2)                      \
            {                                              \
                struct OPTIMIZED                           \
                {                                          \
                    __NOINLINE static void optimized(args) \
                    {

#define CODEPATH_SSE2(args)                                \
            if (CpuFeatures::hasSSE2)                      \
            {                                              \
                struct OPTIMIZED                           \
                {                                          \
                    __NOINLINE static void optimized(args) \
                    {

#define CODEPATH_GENERIC(args)                             \
            {                                              \
                struct OPTIMIZED                           \
                {                                          \
                    __NOINLINE static void optimized(args) \
                    {

#define CODEPATH_END(args)                                 \
                    }                                      \
                };                                         \
                OPTIMIZED::optimized(args);                \
            }                                              \
            else

This looks quite horrible, but it works, the #defines are located in 1 place, and it's needed to work around something that appears to be (if I can call it that) a compiler bug. So it's ugly but it does the job. My code is now working on systems without AVX.

jimdempseyatthecove · ‎12-21-2018

>> Target platform is "Generic" with alternative paths for among others SSE2, AVX and AVX2 (a sequence of /Qax options).

Not what I stated. Generate a target platform with only "Generic" specified on the command line, but then in the code section use _allow_cpu_features to enable the additional code paths as needed.

>>What do you mean by "lifting"?

Lifting is a term that was derived from code optimization that moves (lifts) loop invariant code to above the loop (executes only once, before the loop). In your case, a suspicion is that the zeroing of the temporary is lifted to before the switch statement .AND. because you have stated to generate multiple code paths (/Qax...) that the choice for the lifted code picked AVX2.

Please redo the experiment with generic target and without /Qax...

The resultant code will be much cleaner.

Jim Dempsey

Royi · ‎12-28-2018

Using the dispatch-behavior doesn't really work because it doesn't appear to work for AMD, and I can't overrule it when I run the binary.

I have the same problem.
It seems that if I use /arch:SSE3 /QaxAVX2 the generated code won't work on AMD CPU which is a contradiction to documentation.
Has anyone got official answer form Intel on that?

rwg · ‎02-07-2019

Same problem here. Code runs with /Qx:SSE2 but not with /Qx:SSE3 on following CPUs:

AMD Ryzen 7 2700X

AMD FX 8350

AMD FX 6300

AMD FX 4300

AMD X 640

and probably more.

Compiler Version: Intel(R) Visual Fortran Compiler 18.0.3.210 [IA-32]

Is this fixed in a newer Compiler Version?

Royi · ‎04-15-2019

Could we have an official Intel comment on this?

The automated dispatch feature of ICC is one of the main reasons to buy it yet it seems to malfunction on AMD CPU's.
I can't find a way to build code which runs on CPU's based on their features and not based on their manufacture in a reliable way.

Intel, please remove all those code path which are specific to Intel and just use the regular standard features of CPU's.

Runtime selection of SSE/AVX code paths, overridable