Location of __assume affects performance

Igor_T_ · 01-30-2016

I am using an 8th-order finite-difference time-stepping function (for the 2D acoustic wave equation), shown below.

I am observing a substantial (up to 25%) performance increase from placing Intel's __assume statement inside the inner loop, compared to placing it at the beginning of the function body. (This happens regardless of the number of OpenMP threads.)

The code is compiled with the Intel 2016-update1 compiler on Linux, with the -O3 optimization option, for an AVX-capable architecture (Xeon E5-2695 v2). Compiler options I use: -std=c++11 -march=native -O3 -openmp

Is it a compiler problem?

/* Finite difference, 8-th order scheme for acoustic 2D equation.
    p       - current pressure
    q       - previous and next pressure
    c       - velocity
    n0 x n1 - problem size
    p1      - stride
*/

void fdtd_2d( float const* const __restrict__ p,
              float      * const __restrict__ q,
              float const* const __restrict__ c,
              int          const              n0,
              int          const              n1,
              int          const              p1 )
{
    // Stencil coefficients.
    static const float C[5] = { -5.6944444e+0f, 1.6000000e+0f, -2.0000000e-1f, 2.5396825e-2f, -1.7857143e-3f };

    // INTEL OPTIMIZER PROBLEM?
    //     PLACING THE FOLLOWING LINE INSIDE THE LOOP BELOW 
    //     INSTEAD OF HERE SPEEDS UP THE CODE!
    // __assume( p1 % 16 == 0 );

    #pragma omp parallel for default(none)
    for ( int i1 = 0; i1 < n1; ++i1 )
    {
        float  const* const __restrict__ ps = p + i1 * p1;
        float       * const __restrict__ qs = q + i1 * p1;
        float  const* const __restrict__ cs = c + i1 * p1;

        #pragma omp simd aligned( ps, qs, cs : 64 )
        for ( int i0 = 0; i0 < n0; ++i0 )
        {
            // INTEL OPTIMIZER PROBLEM?
            //     PLACING THE FOLLOWING LINE HERE 
            //     INSTEAD OF THE ABOVE SPEEDS UP THE CODE!
            __assume( p1 % 16 == 0 );

            auto lap = C[0] * ps[i0];
            for ( int r = 1; r <= 4; ++r )
                lap += C[r] * ( ps[i0 + r] + ps[i0 - r] + ps[i0 + r * p1] + ps[i0 - r * p1] );

            qs[i0] = 2.0f * ps[i0] - qs[i0] + cs[i0] * lap;
        }
    }
}

TimP · 01-31-2016

Although __assume was advertised as avoiding a need to repeat an assertion in each local scope where it may be useful, it hasn't worked out that way.

McCalpinJohn · 02-01-2016

I can't find a description of the "__assume" statement in the Intel 15 or Intel 16 compiler documentation. The search feature seems to ignore the leading underscores? Searching for "__assume_aligned" brings up two results: "Function Annotations and the SIMD Directive for Vectorization" and "Programming Guidelines for Vectorization" -- neither of which mentions the "__assume" statement.

The discussion at https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization says

Clauses such as __assume_aligned and __assume tell the compiler that the property holds at the particular point in the program where the clause appears.

The "const" property on p1 should enable the compiler to carry the assertion from the __assume() statement forward or backward through the whole routine, but there is no guarantee that the compiler will exploit this.

Did you try looking at the assembly code to see where the two versions differed?
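
One way to set up that comparison is to build both placements from a single source file and diff the generated assembly. The sketch below is only an illustration, not the original kernel: it reduces the stencil to one vertical pair of neighbours, and the names ASSUME, HINT_IN_LOOP and vertical_pair_sum are hypothetical, introduced here for the example. The macro falls back to a __builtin_unreachable idiom on compilers that do not provide __assume, which is an assumption about how one might port the hint rather than anything the Intel documentation prescribes.

```cpp
#include <cassert>

// Hypothetical portability macro (not from the original post).
// Intel's classic compiler and MSVC provide __assume; for GCC/Clang
// the same promise is expressed by marking the false branch unreachable.
#if defined(__INTEL_COMPILER) || defined(_MSC_VER)
  #define ASSUME(cond) __assume(cond)
#elif defined(__GNUC__)
  #define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
  #define ASSUME(cond) ((void)0)
#endif

// Hypothetical switch: 1 puts the hint inside the loop body,
// 0 puts it once at the top of the function.
#ifndef HINT_IN_LOOP
  #define HINT_IN_LOOP 1
#endif

// Simplified stand-in for the stencil: out[i0] = in[i0 - p1] + in[i0 + p1].
// `in` must point at an interior row so that both neighbours exist.
void vertical_pair_sum(float const* in, float* out, int n0, int p1)
{
    // Checked in debug builds, so a wrong promise is caught before
    // the optimizer is allowed to rely on it.
    assert(p1 % 16 == 0);

#if !HINT_IN_LOOP
    ASSUME(p1 % 16 == 0);      // hint at function scope
#endif

    for (int i0 = 0; i0 < n0; ++i0)
    {
#if HINT_IN_LOOP
        ASSUME(p1 % 16 == 0);  // hint restated at the point of use
#endif
        out[i0] = in[i0 - p1] + in[i0 + p1];
    }
}
```

Compiling the file twice with -S, once with -DHINT_IN_LOOP=0 and once with -DHINT_IN_LOOP=1, and diffing the two .s files should show whether the compiler treats the two placements differently. Whether the difference seen in the full kernel reproduces in a reduced case like this is not guaranteed.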