topic Incorrect dependence detection prevents vectorization in Analyzers

Incorrect dependence detection prevents vectorization

matthieuschaller — Mon, 18 May 2020 15:45:36 GMT

Hi,

I am trying to get an approximate math function to auto-vectorize in C. The code uses a union to work with a float on the one hand and with its bit representation, viewed as an int, on the other. This creates a spurious dependency, and the compiler refuses to vectorize the loop even though everything else is safe.

The code of the function looks like:

__attribute__((always_inline, const)) inline static float optimized_expf(const float x) {

  const float i = rintf(x * ((float)M_LOG2E));
  const float f = x - ((float)M_LN2) * i;

  float exp_f = 0.041944388f;
  exp_f = exp_f * f + 0.168006673f;
  exp_f = exp_f * f + 0.499999940f;
  exp_f = exp_f * f + 0.999956906f;
  exp_f = exp_f * f + 0.999999642f;

  union {
    int i;
    float f;
  } e;

  e.f = exp_f;
  e.i += ((int)i) << 23;  // Spurious ANTI dependence here

  return e.f;
}

The compiler optimisation report tells me the function cannot be vectorized because of an ANTI dependence between e.i and e.f in the last line of the function. This is true, but only in the same sense that each of the previous operations also needs to run in order.
The use of the union seems to confuse the compiler: it does not understand that I am only manipulating the same bits I wrote on the line above.
This is the C equivalent of a C++ reinterpret_cast<>, nothing more.

Is there a way to tell the compiler that everything happening here is fine and that it need not worry?
There are clearly vector instructions for each of the lines of code here, so (auto-)vectorization should not be an issue.
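
For readers unfamiliar with the trick in the last lines of the function: adding n << 23 to the integer view of a single-precision float adds n to its exponent field, which multiplies the value by 2^n. Below is a minimal sketch of just that step, written with memcpy instead of a union. The name scale_by_pow2 is hypothetical and not part of the original code, and whether this form changes the compiler's dependence report has not been checked here.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original post): multiply x by 2^n by
 * adding n to the exponent field of its IEEE-754 single-precision encoding.
 * Assumes 32-bit IEEE-754 floats and that both x and the result are normal
 * numbers (no handling of zero, overflow, underflow, infinity or NaN). */
static inline float scale_by_pow2(float x, int n) {
  uint32_t bits;
  memcpy(&bits, &x, sizeof bits); /* view the float's bits as an integer */
  /* The exponent field starts at bit 23. Unsigned arithmetic wraps around,
   * so a negative n correctly subtracts from the exponent. */
  bits += (uint32_t)n << 23;
  memcpy(&x, &bits, sizeof x);    /* view the bits as a float again */
  return x;
}
```

For example, scale_by_pow2(1.5f, 3) yields 12.0f and scale_by_pow2(8.0f, -2) yields 2.0f.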

Thanks in advance for your help and suggestions!

Zakhar_M_Intel1 — Mon, 18 May 2020 18:41:58 GMT

Hi Matthieu,

The formal answer to your question is to use

#pragma omp simd

before your for-loop (the one invoking the function)

and

#pragma omp declare simd

before your function.

Some explanation of OpenMP pragma declare simd (and how to combine it with #pragma omp simd) can be found at e.g. https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/vectorization/explicit-vector-programming/simd-enabled-functions.html

OpenMP "pragma simd" overrides all of the compiler's dependence assumptions and forces the compiler to "vectorize at any cost".

BTW, you don't need the OpenMP runtime (-qopenmp) to activate pragma simd. It is enough to use the SIMD-specific compile-time flag (-qopenmp-simd), so no OpenMP runtime will be used unless you already use it for other reasons.
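
As a rough illustration of how the two pragmas fit together, here is a minimal sketch. The function cubic and the wrapper apply_cubic are hypothetical stand-ins, not code from this thread. The compile lines in the comment assume the Intel flag named above and the corresponding GCC/Clang flag -fopenmp-simd; without such a flag the pragmas are simply ignored and the code still runs as scalar code.

```c
/* Possible compile lines (no OpenMP runtime is linked in either case):
 *   icc -O2 -qopenmp-simd example.c
 *   gcc -O2 -fopenmp-simd example.c
 */

/* Hypothetical example function: a cubic polynomial in Horner form.
 * "declare simd" asks the compiler to also generate vector variants of the
 * function that process several x values per call. */
#pragma omp declare simd
float cubic(float x) {
  return ((0.5f * x + 1.0f) * x + 2.0f) * x + 3.0f;
}

/* Apply cubic() to every element of in[]. "omp simd" tells the compiler
 * that the iterations are independent, so it does not rely on its own
 * dependence analysis when deciding whether the loop can be vectorized. */
void apply_cubic(const float *in, float *out, int n) {
#pragma omp simd
  for (int i = 0; i < n; ++i) {
    out[i] = cubic(in[i]);
  }
}
```

Because the pragma is a promise made by the programmer, it is only safe when the iterations really are independent, as they are here: each out[i] depends on in[i] alone.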

That being said, if the compiler got confused by the union usage, there might be a small issue with the final vector code generation anyway, so please double-check the output correctness after forcing vectorization with SIMD.

If questions remain after using SIMD, feel free to post a follow-up question in this same thread.

Zakhar

matthieuschaller — Mon, 18 May 2020 19:16:25 GMT

Hi Zakhar,

Thanks for the quick response.

In general I am not a big fan of setting this pragma because, as you said, it will vectorize at all costs even when doing so may be problematic. I especially worry about having this pragma on when I am on a platform where the instructions required by my function either don't exist or are inefficient.
I worry that by forcing the compiler's hand I'll end up in a silly situation.

Thanks,
M.

Zakhar_M_Intel1 — Mon, 18 May 2020 19:38:07 GMT

Matthieu,

OpenMP is a standard, and OpenMP 5.0 has been one for a long time. It is supported by multiple compilers to date: GCC, MSFT (from 2019), Intel, IBM, Clang, even PGI and Cray (afair), and some others. So, first of all, it is not a proprietary thing; it is a cross-industry standard, used for x86 and beyond.

Secondly, the huge advantage of OpenMP is portability from ISA to ISA and from CPU to CPU. If your CPU is SSE-only, OpenMP compilers will generate SSE code for the pragma simd case, and on an AVX512 CPU they will leverage AVX512 under the hood.

So OMP SIMD is not just yet another proprietary pragma trick, but pretty much a TRUE cross-vendor and cross-industry standard.

BTW, you probably don't need #pragma omp declare simd for the very first experiment: your function is inlined, and in most cases that is enough (so only #pragma omp simd is required). For full portability, though, I'd follow the standard OMP SIMD approach (i.e. use both pragmas in the final version).

Zakhar_M_Intel1 — Mon, 18 May 2020 19:50:00 GMT

... and for your last (perhaps most important) concern, regarding the "profitability" of SIMD: in general I cannot imagine performance degrading when moving from older to newer architectures. So unless you "move back" (totally outside x86, or down to an SSE machine from AVX), you should expect better performance with newer hardware.

You can also double-check the Advisor Vector Efficiency metric if you are worried about it.

And if you care about AVX512 "frequencies", there are centralized flags in both Intel and GCC to control this behavior, and those are totally orthogonal to omp simd anyway (so non-omp-simd codes equally suffer, or don't suffer, from those issues).

Finally, if nothing above works for you, we can try to re-post this question in the Compiler forums to check how they deal with union tricks like yours.

matthieuschaller — Tue, 19 May 2020 15:42:44 GMT

Hi,

Thanks! I wasn't worried about it not being standard but more about the second point you mention.
The main workhorse machine we use for daily calculations only offers AVX (sadly!), so I am a bit worried that by "blindly" forcing vectorization (because it is all fine on AVX512 machines) I may create problems on these older systems.

I can confirm that in this specific case using this pragma works smoothly.

Thanks again,
M.

Zakhar_M_Intel1 — Tue, 19 May 2020 18:37:59 GMT

Hi,

Glad it worked for you. I would not be worried about AVX vs AVX512, especially for codes like YOURS. The only possible (imaginary) issue would be more or less the opposite of what you said: with NON-DEFAULT compiler flags you may end up with AVX512 performance lower than AVX. But this is really orthogonal to OMP SIMD, and it does not happen with default compilation flags anyway.

Zakhar.