Advisor will always report

luca_l_ · ‎05-01-2017

I don't know if this is the right section, I'm sorry in that case.

I'm trying to optimize this function:

bool interpolate(const Mat &im, float ofsx, float ofsy, float a11, float a12, float a21, float a22, Mat &res)
{         
   bool ret = false;
   // input size (-1 for the safe bilinear interpolation)
   const int width = im.cols-1;
   const int height = im.rows-1;
   // output size
   const int halfWidth  = res.cols >> 1;
   const int halfHeight = res.rows >> 1;
   float *out = res.ptr<float>(0);
   const float *imptr  = im.ptr<float>(0);
   for (int j=-halfHeight; j<=halfHeight; ++j)
   {
      const float rx = ofsx + j * a12;
      const float ry = ofsy + j * a22;
      #pragma omp simd
      for(int i=-halfWidth; i<=halfWidth; ++i, out++)
      {
         float wx = rx + i * a11;
         float wy = ry + i * a21;
         const int x = (int) floor(wx);
         const int y = (int) floor(wy);
         if (x >= 0 && y >= 0 && x < width && y < height)
         {
            // compute weights
            wx -= x; wy -= y;
            int rowOffset = y*im.cols;
            int rowOffset1 = (y+1)*im.cols;
            // bilinear interpolation
            *out =
                (1.0f - wy) *
                ((1.0f - wx) * 
                imptr[rowOffset+x] +
                wx * 
                imptr[rowOffset+x+1]) +
                (       wy) *
                ((1.0f - wx) * 
                imptr[rowOffset1+x] + 
                wx *
                imptr[rowOffset1+x+1]);
         } else {
            *out = 0;
            ret =  true; // touching boundary of the input            
         }
      }
   }
   return ret;
}

I'm using Intel Advisor to optimize it and even though the inner for has already been vectorized, Intel Advisor detected inefficient memory access patterns:

60% of unit/zero stride access
40% of irregular/random stride access

In particular there are 4 gather (irregular) access in the following three instructions:

The problem of gather access from my understanding happens when the accessed element is of the type a, where b is unpredictable. This seems to be the case with imptr[rowOffset+x], where both rowOffset and x are unpredictable.

At the same time, I see this Vertical Invariant which should happen (again, from my understanding) when elements are accessed with a constant offset. But actually I don't see where this constant offset

So I have 3 questions:

Did I understood the problem of gather accesses correctly?
What about the Vertical Invariant access? I'm less sure about this point.
Finally, how can I improve/solve the memory access here?

Compiled with icpc 2017 update 3 with the following flags:

INTEL_OPT=-O3 -ipo -simd -xCORE-AVX2 -parallel -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2 -fma -align -finline-functions
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info=2 -ldl

TimP · ‎05-01-2017

Advisor will always report gather (indirect addressing) as a potentially inefficient memory access pattern. How bad it is would be data and platform dependent, and is not actually measured, other than by reported timings. If you cared to profile under VTune, you could try to assess how much time is spent on cache events, but it might not lead to coding improvements.

Your memory accesses spaced by im.cols appear to account for the vertical stride diagnosis. If the generated code is successfully vectorized with unity stride parallel stores, it may be the best you can do.

Inefficient memory access pattern and irregular stride accesses