Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.
7873 Discussions

Single Entry and Single Exit Criteria for loop vectorization.

amit_b_
Beginner
336 Views

I was reading A guide to vectorization with Intel C++ compilers: https://software.intel.com/sites/default/files/8c/a9/CompilerAutovectorizationGuide.pdf  

I am referring to the "Single Entry and Single Exit" criterion on page 8. I tried two variants: A) break and B) continue.

A) Break

void no_vec(float a[], float b[], float c[])
{
    int i = 0;
    while(i < 100)
    {
        a[i] = b[i] * c[i];

        if(a[i] < 0.0)
            break;
        ++i;
    }
}

===========================================================================

Begin optimization report for: no_vec(float *, float *, float *)

    Report from: Vector optimizations [vec]


LOOP BEGIN at breaktest.c(6,2)
   remark #15520: loop was not vectorized: loop with early exits cannot be vectorized unless it meets search loop idiom criteria
LOOP END
===========================================================================

B) Continue

void no_vec(float a[], float b[], float c[])
{
    int i = 0;
    while(i < 100)
    {
        a[i] = b[i] * c[i];

        if(a[i] < 0.0)
            continue;
        ++i;
    }
}

===========================================================================

Begin optimization report for: no_vec(float *, float *, float *)

    Report from: Vector optimizations [vec]

Non-optimizable loops:


LOOP BEGIN at continuetest.c(6,2)
   remark #15523: loop was not vectorized: cannot compute loop iteration count before executing the loop.
LOOP END
===========================================================================

My Questions :

1) What difference between continue and break causes the optimizer to report a different remark in the optimization report?

2) Is there any way to vectorize the loop when it must have a data-dependent continue condition?

4 Replies
TimP
Black Belt

Did you try something like

#pragma  simd firstprivate(i) lastprivate(i)

for(i=0; i<100; ++i) if((a[i]=b[i]*c[i]) <0)break;

with a compiler issued during the last year? If you are trying to find the limits of that "search loop idiom," you shouldn't get too fancy.

Your second case looks like it doesn't terminate: once a[i] < 0, the continue skips ++i, so the same element is tested forever. In any case, vectorization requires a counted loop, even with the recent dispensation that permits an early exit for a "search loop."
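
For the second question, here is a minimal sketch of a data-dependent continue inside a counted loop, where the trip count is known before the loop starts. The function name scale_nonneg, the parameter n, and the doubling step are hypothetical stand-ins for whatever work follows the continue; the sketch also assumes the arrays do not overlap. Whether a particular compiler version vectorizes it should be confirmed in the optimization report.

```c
#include <assert.h>

/* Hypothetical example: counted loop with a data-dependent continue.
   The increment lives in the for header, so continue cannot skip it
   and the trip count (n) is known before the loop starts. */
void scale_nonneg(float a[], const float b[], const float c[], int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
    {
        a[i] = b[i] * c[i];

        if (a[i] < 0.0f)
            continue;          /* skips only the rest of this iteration */

        a[i] = a[i] * 2.0f;    /* placeholder for the remaining work */
    }
}
```

Because the continue only guards the tail of the loop body, a compiler can in principle treat it as a per-element condition (a mask) rather than as control flow that changes the iteration count.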

Bernard
Black Belt

In example "A", the if statement with break, which can cause an early exit from the loop, prevents vectorization.

jimdempseyatthecove
Black Belt

The loops as stated, while not impossible to vectorize, are hard to vectorize.

a[i] = b[i] * c[i];

When vectorized on a 4-wide vector, this can be thought of as equivalent to:

a[i] = b[i] * c[i]; a[i+1] = b[i+1] * c[i+1]; a[i+2] = b[i+2] * c[i+2]; a[i+3] = b[i+3] * c[i+3];

All four are done in parallel. Your break (or continue), however, takes effect at the first occurrence of the condition when the lanes are read left to right.

Should index i satisfy the condition, the remainder of the vector must (at least) not be stored to a[i+1], a[i+2], a[i+3].

To vectorize and still perform only the operations the source specifies (as far as the program can observe), the generated code would have to do something like this (pseudocode):

temp[0] = b[i] * c[i]; temp[1] = b[i+1] * c[i+1]; temp[2] = b[i+2] * c[i+2]; temp[3] = b[i+3] * c[i+3];
mask[0] = (temp[0] < 0.0); mask[1] = (temp[1] < 0.0); mask[2] = (temp[2] < 0.0); mask[3] = (temp[3] < 0.0);
use vtestps (or vtestpd) to test all mask lanes for 0 and if true
  a[i] = temp[0]; a[i+1] = temp[1]; a[i+2] = temp[2]; a[i+3] = temp[3];
else
  for(j=0; j < 4; ++j) {
    a[i+j] = temp[j];
    if(mask[j] != 0) {
      i = i + j;
      (exit outer loop)
    }
  }
endif

On long runs, the vectorized code above should run faster. On short runs, it would be slower.

Currently the compiler optimization engineers haven't picked up this "high hanging" fruit.
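
The pseudocode above can be sketched in plain C as a hand-blocked loop. This is only an illustration under stated assumptions: the function name mul_until_negative, the parameter n, and the return convention (index of the first negative product, or n if none) are hypothetical, the arrays are assumed not to overlap, and all n elements of b and c are assumed readable, since a block is read in full before the condition is known.

```c
#include <assert.h>

/* Hypothetical helper: processes the arrays in blocks of 4 and returns the
   index of the first negative product, or n if there is none. Elements of a
   past that index are left untouched, matching the scalar loop with break. */
int mul_until_negative(float a[], const float b[], const float c[], int n)
{
    assert(n >= 0);
    int i = 0;
    for (; i + 4 <= n; i += 4)
    {
        float temp[4];
        int any_negative = 0;

        /* Branch-free block: a candidate for a 4-wide multiply and compare. */
        for (int j = 0; j < 4; ++j)
        {
            temp[j] = b[i + j] * c[i + j];
            any_negative |= (temp[j] < 0.0f);
        }

        if (!any_negative)
        {
            /* Fast path: no lane hit the condition, store the whole block. */
            for (int j = 0; j < 4; ++j)
                a[i + j] = temp[j];
            continue;
        }

        /* Slow path: store lane by lane up to and including the first hit. */
        for (int j = 0; j < 4; ++j)
        {
            a[i + j] = temp[j];
            if (temp[j] < 0.0f)
                return i + j;
        }
    }

    /* Scalar remainder for the last n % 4 elements. */
    for (; i < n; ++i)
    {
        a[i] = b[i] * c[i];
        if (a[i] < 0.0f)
            return i;
    }
    return n;
}
```

The slow path runs at most once, so its cost only matters when the exit comes early, which is the long-run versus short-run trade-off described above.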

Jim Dempsey

Bernard
Black Belt

So in order to vectorize that loop, the compiler would have to create a temporary vector (an XMM or YMM register) loaded with zeros and insert code for a floating-point comparison against the values of a[].
