Vectorization with SIMD-enabled functions works from functions, not from main()

Andrey_Vladimirov · ‎11-06-2015

Hello,

I have run into a situation that I cannot explain. I have a loop that calls a SIMD-enabled function, and I put #pragma simd before it. This loop vectorizes if it is placed in a separate function, but does not vectorize if it is inside main(). I am using Intel C++ compiler 16.0.0.109. Please see the code and vectorization reports below. Can anyone explain what is happening and whether there is a way to work around this?

This is loop-in-main.cc:

__attribute__((vector)) void SimdEnabledFunction(double);

int main() {
  int n = 10000;
  double a[n];
#pragma simd
  for(int i = 0 ; i < n ; i++)
      SimdEnabledFunction(a[i]);
}

This is the optimization report for it (loop does not vectorize):

[avladim@cfx-0 ~]$ icpc -qopenmp -c -qopt-report -qopt-report-stdout loop-in-main.cc
Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.


    Report from: Interprocedural optimizations [ipo]

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000


Begin optimization report for: main()

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (main()) [1] loop-in-main.cc(3,12)

loop-in-main.cc(7): (col. 3) warning #13379: loop was not vectorized with "simd"

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at loop-in-main.cc(7,3)
   remark #15520: simd loop was not vectorized: loop with multiple exits cannot be vectorized unless it meets search loop idiom criteria
   remark #13379: loop was not vectorized with "simd"
LOOP END
===========================================================================
[avladim@cfx-0 ~]$

This is the other code, loop-in-func.cc, where the loop is in a separate function:

__attribute__((vector)) void SimdEnabledFunction(double);

void UserFunction(int n, double* a) {
#pragma simd
  for(int i = 0 ; i < n ; i++)
      SimdEnabledFunction(a[i]);
}

int main() {
  int n = 10000;
  double a[n];
  UserFunction(n, a);
}

This is the optimization report for it (SIMD LOOP WAS VECTORIZED):

[avladim@cfx-0 ~]$ icpc -qopenmp -c -qopt-report -qopt-report-stdout loop-in-func.cc
Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.


    Report from: Interprocedural optimizations [ipo]

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000


Begin optimization report for: main()

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (main()) [1] loop-in-func.cc(9,12)
  -> INLINE: (12,3) UserFunction(int, double *)

loop-in-func.cc(5): (col. 3) warning #13379: loop was not vectorized with "simd"

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at loop-in-func.cc(5,3) inlined into loop-in-func.cc(12,3)
   remark #15520: simd loop was not vectorized: loop with multiple exits cannot be vectorized unless it meets search loop idiom criteria
   remark #13379: loop was not vectorized with "simd"
LOOP END
===========================================================================

Begin optimization report for: UserFunction(int, double *)

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (UserFunction(int, double *)) [2] loop-in-func.cc(3,37)


    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at loop-in-func.cc(5,3)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at loop-in-func.cc(5,3)
   remark #15301: SIMD LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at loop-in-func.cc(5,3)
<Remainder loop for vectorization>
   remark #15335: remainder loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override 
LOOP END
===========================================================================
[avladim@cfx-0 ~]$

Andrey

KitturGanesh · ‎11-10-2015

Hi Andrey,
This is an interesting issue: it's a bug in the compiler, and I'll file it with our developers. As a workaround, you can use the following:

Add the "nothrow" clause to the code snippet as below:

__attribute__((vector, nothrow)) void SimdEnabledFunction(double);

And it should vectorize the loop. I tried with the latest 16.0 release and it works fine:

% cat loop-main.cpp

__attribute__((vector, nothrow)) void SimdEnabledFunction(double);

int main() {
  int n = 10000;
  double a[n];
//#pragma simd
  for(int i = 0 ; i < n ; i++)
      SimdEnabledFunction(a[i]);
}

%icpc -O3 -qopenmp -c -qopt-report -qopt-report-stdout loop-main.cpp
...

....

INLINE REPORT: (main()) [1] loop-main.cpp(3,12)

Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at loop-main.cpp(7,3)

remark #15300: LOOP WAS VECTORIZED
LOOP END
===============================================

Thanks,
Kittur

Andrey_Vladimirov · ‎11-11-2015

Thank you, Kittur!

KitturGanesh · ‎11-11-2015

Pleasure, Andrey. BTW, I've filed the issue with the developers and will keep you updated when the release with a fix is out, thanks.

_Kittur

KitturGanesh · ‎11-12-2015

Hi Andrey,
OK, after investigating this issue, it turns out that this is not a bug after all! The reason is as follows:
--------------------------------------------------------------------------------------
The compiler automatically generates a try block for a program block (i.e. the code inside {}) when it sees any local object or array created in that block, because those objects/arrays must be deallocated if an exception is thrown. That said:

In the first case, the function main contains an array allocation, so the try block is created, and unless the called routine is marked as nothrow, the loop cannot be vectorized.
In the second case, the function containing the loop has nothing that requires a try block. BTW, the first part of the report for the second compilation does contain the message that the loop from the inlined function is not vectorized in main; that part of the report was simply not mentioned in your description.

--------------------------------------------------------------------------------------

So, the workaround I suggested earlier is the correct workaround for this case for vectorizing the loop. Hope this helps...

Kittur

Andrey_Vladimirov · ‎11-19-2015

Hi Kittur,

this is very interesting!

For completeness of the picture, can you also explain the result below? I am using the same code as in loop-in-main.cc, but this time instead of "int n = 10000", I have "const int n = 10000". In this case the compiler vectorizes the loop. The only change is adding the const qualifier. Why does it change the result of vectorization?

__attribute__((vector)) void SimdEnabledFunction(double);

int main() {
  const int n = 10000;
  double a[n];
#pragma simd
  for(int i = 0 ; i < n ; i++)
      SimdEnabledFunction(a[i]);
}

[avladim@cfx-0 ~]$ icpc -qopenmp -c -qopt-report -qopt-report-stdout loop-in-main-const.cc
Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.


    Report from: Interprocedural optimizations [ipo]

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000


Begin optimization report for: main()

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (main()) [1] loop-in-main-const.cc(3,12)


    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at loop-in-main-const.cc(7,3)
   remark #15301: SIMD LOOP WAS VECTORIZED
LOOP END
===========================================================================
[avladim@cfx-0 ~]$

Andrey

KitturGanesh · ‎11-19-2015

Yes, that's interesting, Andrey. One of the rules for vectorizing a loop is that the trip count must be countable: it is known on entry to the loop at run time and does not change while the loop executes, which implies that the exit from the loop is not data dependent. That said, the trip count is indeed known even without the const qualifier. I'll have to look into this and get back to you, thx.

_Kittur

KitturGanesh · ‎11-20-2015

Hi Andrey,
OK, here's why it works when you add the const qualifier. When no const is specified, the array (double a[n]) is allocated by a call to a special routine and must be freed when leaving the block where the array is accessible. If an exception is possible, the compiler has to create a special try-catch block around the live range of such an array, with the catch part of that block containing the call vla_free(a).

Adding the const qualifier to n causes the local array to be allocated statically on the stack, so it does not need to be freed when an exception is thrown. Thus no try-catch block is created, and the compiler does not see a possible early exit from the loop.

Hope the above helps understand why the loop vectorizes now!

Regards,
Kittur

KitturGanesh · ‎11-20-2015

Andrey, BTW, with the latest Update 1 release (which you can download from the Intel Registration Center), the vectorizer now also reports a message to that effect:
LOOP BEGIN at loop-main.cpp(7,3)
remark #15333: loop was not vectorized: exception handling for a call prevents vectorization [ loop-main.cpp(8,7) ]
LOOP END

_Kittur

Andrey_Vladimirov · ‎11-25-2015

Kittur Ganesh (Intel) wrote:

When no const is specified, the allocation of array (double a[n]) is done by the call to a special routine to allocate that array and must be freed when leaving the block where the array is accessible.

This is fascinating! I wonder why the compiler needs to call a function for this. Does it mean that an allocation like "double a[n]" in C/C++ may not end up on the stack?

KitturGanesh · ‎11-30-2015

Hi Andrey,
Good question. This looks like an issue, and I've filed it with our developers. The reason is that the call does allocate on the stack, but there may be a problem with constant propagation in the front end. I'll keep you updated on the outcome of the issue I've filed (constant propagation/stack), which is an interesting one. Again, I appreciate you bringing this up.

_Kittur