Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Not vectorized loop with if statement

luca_l_
Beginner

This is the first time I have tried to vectorize a loop, and I'm trying to optimize this code.

In particular, according to Intel Advisor, this is the best candidate for vectorization:

   for (int j=-halfHeight; j<=halfHeight; ++j)
   {
      for(int i=-halfWidth; i<=halfWidth; ++i)
      {
         const float rx = ofsx + j * a12;
         const float ry = ofsy + j * a22;
         float wx = rx + i * a11;
         float wy = ry + i * a21;
         const int x = (int) floor(wx);
         const int y = (int) floor(wy);
         if (x >= 0 && y >= 0 && x < width && y < height)
         {
            // compute weights
            wx -= x; wy -= y;
            // bilinear interpolation
            *out++ =
               (1.0f - wy) * ((1.0f - wx) * im.at<float>(y,x)   + wx * im.at<float>(y,x+1)) +
               (       wy) * ((1.0f - wx) * im.at<float>(y+1,x) + wx * im.at<float>(y+1,x+1));
         } else {
            *out++ = 0;
         }
      }
   }

And this is the message in the optrpt file:

   remark #15522: loop was not vectorized: loop control flow is too complex. Try using canonical loop form from OpenMP specification

How can I solve this?

SergeyKostrov
Valued Contributor II
>> remark #15522: loop was not vectorized: loop control flow is too complex. Try using canonical loop form from OpenMP specification

The inner for-loop needs to be simplified.

>> How can I solve this?

Variant 1: Try to use the ternary operator. Instead of

   if( A < B ) { D = val1; }

use the conditional operator ( ? : ):

   D = ( A < B ) ? ( val1 ) : ( val2 );

Variant 2: Pre-filter the data before the core processing starts, and copy all values that satisfy the conditions of your if( ... ) statement to a helper array. In that case you will have an inner for-loop without an if-statement.
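
Both variants can be sketched on a simplified 1D stand-in for the loop. This is only an illustration: the function names, the `idx` array and the gather-style access are assumptions made for the example, not code from the original program, and whether a given compiler vectorizes either form still has to be checked in the optimization report.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Variant 1 (hypothetical helper): the if/else becomes a conditional
// expression. The index is clamped first so the load is safe for every i;
// the select then discards the loaded value for out-of-range indices.
void gather_or_zero(const float* src, int n, const int* idx,
                    float* out, int count)
{
    assert(n > 0 && count >= 0);
    for (int i = 0; i < count; ++i) {
        const int   k      = idx[i];
        const bool  inside = (k >= 0) && (k < n);
        const int   safe_k = inside ? k : 0;      // always a valid index
        out[i] = inside ? src[safe_k] : 0.0f;     // select instead of a branch
    }
}

// Variant 2 (hypothetical helper): a first pass records which iterations
// satisfy the condition; the core loop then runs over that helper array
// and contains no if-statement.
void gather_or_zero_prefiltered(const float* src, int n, const int* idx,
                                float* out, int count)
{
    assert(n > 0 && count >= 0);
    std::vector<int> valid;                       // helper array of positions
    valid.reserve(static_cast<std::size_t>(count));
    for (int i = 0; i < count; ++i) {
        out[i] = 0.0f;                            // default for rejected items
        if (idx[i] >= 0 && idx[i] < n) valid.push_back(i);
    }
    for (std::size_t v = 0; v < valid.size(); ++v) {
        const int i = valid[v];
        out[i] = src[idx[i]];                     // core work, no condition
    }
}
```

Variant 2 trades the branch for an extra pass and indirect accesses, so it tends to pay off only when the core work per element is expensive.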
Anoop_M_Intel
Employee

I generated the optimization report for helpers.cpp using Intel C++ Compiler 17.0 Update 2 on Linux and saw the following for the loop:

Begin optimization report for: interpolate(const cv::Mat &, float, float, float, float, float, float, cv::Mat &)

    Report from: Vector optimizations [vec]

LOOP BEGIN at helpers.cpp(219,4)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed OUTPUT dependence between *out (234:14) and *out (238:14)
   remark #15346: vector dependence: assumed OUTPUT dependence between *out (238:14) and *out (234:14)

   LOOP BEGIN at helpers.cpp(223,7)
      remark #15344: loop was not vectorized: vector dependence prevents vectorization
      remark #15346: vector dependence: assumed OUTPUT dependence between *out (234:14) and *out (238:14)
      remark #15346: vector dependence: assumed OUTPUT dependence between *out (238:14) and *out (234:14)
   LOOP END
LOOP END

By changing the code as shown below, you can successfully vectorize this loop:

      for(int i=-halfWidth; i<=halfWidth; ++i, out++)
      {
         float wx = rx + i * a11;
         float wy = ry + i * a21;
         const int x = (int) floor(wx);
         const int y = (int) floor(wy);
         if (x >= 0 && y >= 0 && x < width && y < height)
         {
            // compute weights
            wx -= x; wy -= y;
            // bilinear interpolation
            *out =
               (1.0f - wy) * ((1.0f - wx) * im.at<float>(y,x)   + wx * im.at<float>(y,x+1)) +
               (       wy) * ((1.0f - wx) * im.at<float>(y+1,x) + wx * im.at<float>(y+1,x+1));
         } else {
            *out = 0;
            ret =  true; // touching boundary of the input
         }
      }
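
For readers who want to experiment without OpenCV, here is a self-contained sketch of the same single-increment form. Everything about the wrapper is an assumption made for illustration: the name `interpolate_sketch`, the row-major `std::vector<float>` standing in for `cv::Mat`, the parameter order, and the hoisting of `rx` and `ry` into the outer loop. The sketch also tightens the bounds test to `x + 1 < width && y + 1 < height`, because the posted test (`x < width && y < height`) still lets the `x+1` and `y+1` neighbours fall outside the image on the last column and row.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the posted routine. `im` is a row-major
// width*height buffer. Returns true if any sample fell outside the input.
bool interpolate_sketch(const std::vector<float>& im, int width, int height,
                        float ofsx, float ofsy,
                        float a11, float a12, float a21, float a22,
                        int halfWidth, int halfHeight, float* out)
{
    assert(width > 0 && height > 0);
    assert(im.size() == static_cast<std::size_t>(width) * height);
    bool ret = false;
    for (int j = -halfHeight; j <= halfHeight; ++j) {
        const float rx = ofsx + j * a12;   // depends on j only
        const float ry = ofsy + j * a22;
        // `out` advances exactly once per iteration, in the loop header.
        for (int i = -halfWidth; i <= halfWidth; ++i, ++out) {
            float wx = rx + i * a11;
            float wy = ry + i * a21;
            const int x = static_cast<int>(std::floor(wx));
            const int y = static_cast<int>(std::floor(wy));
            // Assumption: tightened so the x+1 / y+1 neighbours are valid.
            if (x >= 0 && y >= 0 && x + 1 < width && y + 1 < height) {
                wx -= x; wy -= y;          // interpolation weights
                const float* p = &im[static_cast<std::size_t>(y) * width + x];
                *out = (1.0f - wy) * ((1.0f - wx) * p[0]     + wx * p[1]) +
                       wy          * ((1.0f - wx) * p[width] + wx * p[width + 1]);
            } else {
                *out = 0.0f;
                ret = true;                // touching boundary of the input
            }
        }
    }
    return ret;
}
```

The caller is expected to provide room for (2*halfWidth+1) * (2*halfHeight+1) floats in `out`.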

The corresponding optimization report after the change is:

Begin optimization report for: interpolate(const cv::Mat &, float, float, float, float, float, float, cv::Mat &)

    Report from: Vector optimizations [vec]

LOOP BEGIN at helpers.cpp(219,4)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at helpers.cpp(223,7)
      remark #15389: vectorization support: reference *out has unaligned access   [ helpers.cpp(234,14) ]
      remark #15389: vectorization support: reference *out has unaligned access   [ helpers.cpp(238,14) ]
      remark #15389: vectorization support: reference *out has unaligned access   [ helpers.cpp(238,14) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15328: vectorization support: indirect load was emulated for the variable <(im->data+*im->p*y)>, masked, 64-bit indexed, part of address is result of call to function   [ helpers.cpp(235,48) ]
      remark #15328: vectorization support: indirect load was emulated for the variable <(im->data+*im->p*y)[x+1]>, masked, 64-bit indexed, part of address is result of call to function   [ helpers.cpp(235,75) ]
      remark #15328: vectorization support: indirect load was emulated for the variable <(im->data+*im->p*(y+1))>, masked, 64-bit indexed, part of address is result of call to function   [ helpers.cpp(236,48) ]
      remark #15328: vectorization support: indirect load was emulated for the variable <(im->data+*im->p*(y+1))[x+1]>, masked, 64-bit indexed, part of address is result of call to function   [ helpers.cpp(236,75) ]
      remark #15305: vectorization support: vector length 4
      remark #15309: vectorization support: normalized vectorization overhead 0.069
      remark #15300: LOOP WAS VECTORIZED
      remark #15450: unmasked unaligned unit stride loads: 1
      remark #15451: unmasked unaligned unit stride stores: 1
      remark #15457: masked unaligned unit stride stores: 1
      remark #15458: masked indexed (or gather) loads: 4
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 262
      remark #15477: vector cost: 236.250
      remark #15478: estimated potential speedup: 1.100
      remark #15482: vectorized math library calls: 2
      remark #15487: type converts: 14
      remark #15488: --- end vector cost summary ---
   LOOP END

   LOOP BEGIN at helpers.cpp(223,7)
   <Remainder loop for vectorization>
   LOOP END
LOOP END

 

SergeyKostrov
Valued Contributor II
>> ...By changing the code as shown below, you can successfully vectorize this loop...

< Code from the previous post skipped >

It is not clear how the values for

   const float rx = ofsx + j * a12;
   const float ry = ofsy + j * a22;

are calculated. Is this what you are suggesting?

   for( int j = -halfHeight; j <= halfHeight; ++j )
   {
      const float rx = ofsx + j * a12;
      const float ry = ofsy + j * a22;
      for( int i = -halfWidth; i <= halfWidth; ++i, out++ )
      {
         float wx = rx + i * a11;
         float wy = ry + i * a21;
         const int x = (int) floor(wx);
         const int y = (int) floor(wy);
         if( x >= 0 && y >= 0 && x < width && y < height )
         {
            // compute weights
            wx -= x; wy -= y;
            // bilinear interpolation
            *out =
               (1.0f - wy) * ((1.0f - wx) * im.at<float>(y,x)   + wx * im.at<float>(y,x+1)) +
               (       wy) * ((1.0f - wx) * im.at<float>(y+1,x) + wx * im.at<float>(y+1,x+1));
         } else {
            *out = 0;
            ret = true; // touching boundary of the input
         }
      }
   }
Serge_P_
Beginner

The only problematic construct for vectorization in the original code is *out++ = ..., which is a conditional induction. Because it appears in both branches of the if statement, the loop is vectorizable in theory, but in practice the compiler needs sophisticated analysis to prove that the dependency on the value of 'out' does not cross an iteration boundary.

I don't currently have icc at my disposal, but something as simple as the change below should enable vectorization:

         *out =
            (1.0f - wy) * ((1.0f - wx) * im.at<float>(y,x)   + wx * im.at<float>(y,x+1)) +
            (       wy) * ((1.0f - wx) * im.at<float>(y+1,x) + wx * im.at<float>(y+1,x+1));
      } else {
         *out = 0;
      }
      ++out;

However, this code is quite poorly written from a performance standpoint. For code like this it is highly advisable to implement one of two techniques to get rid of the condition inside the inner loop:

  1. The inner loop may be split into 2: one for interior points (x>0 && x < width) and one for the boundary points (x == 0 || x == width).
  2. Or the input data may be padded on all sides so that the logic for interior points works for the entire data set.

The problem is that the boundary condition is dynamic, so each vector iteration of the inner loop pays the price of mask construction and blending, which is a waste of resources for interior iterations (the vast majority of iterations).
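
The first technique (loop splitting) can be sketched on a 1D analogue of the problem. This is an illustration under stated assumptions, not the poster's code: `resample_split` and its parameters are hypothetical, the interpolation is linear instead of bilinear, and the interior range is found by a scalar scan from both ends, which relies on the sample position `ofs + i * a` being monotonic in `i` and evaluated the same way in the scan and in the interior loop.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical 1D analogue: out[i] is the linear interpolation of src at
// position ofs + i*a, or 0 when that position has no valid neighbour pair.
void resample_split(const float* src, int n, float ofs, float a,
                    float* out, int count)
{
    assert(n >= 2 && count >= 0);
    // Same condition as floor(w) >= 0 && floor(w) + 1 < n, tested on the
    // float so that huge or NaN positions never reach the int conversion.
    auto valid = [&](int i) {
        const float w = ofs + i * a;
        return w >= 0.0f && w < static_cast<float>(n - 1);
    };

    // The position is monotonic in i, so valid iterations form one
    // contiguous run [lo, hi). Find it with a cheap scalar scan.
    int lo = 0;
    while (lo < count && !valid(lo)) ++lo;
    int hi = count;
    while (hi > lo && !valid(hi - 1)) --hi;

    for (int i = 0; i < lo; ++i) out[i] = 0.0f;      // leading boundary
    for (int i = lo; i < hi; ++i) {                  // interior: no condition
        const float w = ofs + i * a;
        const int   k = static_cast<int>(std::floor(w));
        const float t = w - k;
        out[i] = (1.0f - t) * src[k] + t * src[k + 1];
    }
    for (int i = hi; i < count; ++i) out[i] = 0.0f;  // trailing boundary
}
```

In the 2D case the x and y tests are each monotonic in `i` for a fixed `j`, so the valid iterations of one row again form a single contiguous range and the same split applies per row.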

luca_l_
Beginner

Could you please rewrite your answer with better code formatting?

Serge_P_
Beginner

I will try, but for some reason the code looks fine in the preview and poor once posted.

---

The only problematic construct for vectorization in the original code is *out++ = ..., which is a conditional induction. Because it appears in both branches of the if statement, the loop is vectorizable in theory, but in practice the compiler needs sophisticated analysis to prove that the dependency on the value of 'out' does not cross an iteration boundary. I don't currently have icc at my disposal, but something as simple as the change below should enable vectorization.

      if (...)
      {
         *out =
            (1.0f - wy) * ((1.0f - wx) * im.at<float>(y,x)   + wx * im.at<float>(y,x+1)) +
            (       wy) * ((1.0f - wx) * im.at<float>(y+1,x) + wx * im.at<float>(y+1,x+1));
      } else {
         *out = 0;
      }
      ++out;  // <<< This is the single induction increment

However, this code overall is quite poorly written from a performance standpoint. For code like this it is highly advisable to implement one of two techniques to get rid of the condition inside the inner loop.

  1. The inner loop may be split into 2: one for interior points (x>0 && x < width) and one for the boundary points (x == 0 || x == width).
  2. Or the input data may be padded on all sides so that the logic for interior points works for the entire data set.

The problem is that the boundary condition is dynamic, so each vector iteration of the inner loop pays the price of mask construction and blending, which is a waste of resources for interior iterations (the vast majority of iterations).
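
The second technique (padding) can also be sketched in 1D. The helper names `pad_with_zeros` and `resample_padded` are hypothetical, and the sketch makes one assumption worth stating loudly: positions just outside the data blend toward the zero border instead of being forced to exactly 0, so the output near the edges is not identical to the posted if/else version.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy src into a buffer with one zero in front and
// two zeros behind, so padded[k + 1] and padded[k + 2] exist for every
// k in [-1, n].
std::vector<float> pad_with_zeros(const float* src, int n)
{
    assert(n > 0);
    std::vector<float> padded(static_cast<std::size_t>(n) + 3, 0.0f);
    for (int k = 0; k < n; ++k) padded[k + 1] = src[k];
    return padded;
}

// Interior logic applied to every iteration: the position is clamped into
// the padded range, so no if-statement guards the loads.
void resample_padded(const std::vector<float>& padded, int n,
                     float ofs, float a, float* out, int count)
{
    assert(n > 0 && padded.size() == static_cast<std::size_t>(n) + 3);
    const float hi = static_cast<float>(n);
    for (int i = 0; i < count; ++i) {
        float w = ofs + i * a;
        w = (w > -1.0f) ? w : -1.0f;   // clamp low (also catches NaN)
        w = (w < hi) ? w : hi;         // clamp high
        const int   k = static_cast<int>(std::floor(w));   // -1 .. n
        const float t = w - k;
        out[i] = (1.0f - t) * padded[k + 1] + t * padded[k + 2];
    }
}
```

The padded copy costs one extra pass over the input, which is usually small compared with resampling it, and it can be reused across calls on the same image.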
