Intel® C++ Compiler

Loop gets vectorized in scalar code, but no vectorization inside parallel loop

Andrey_Vladimirov
New Contributor III

I am trying to reproduce the results of a legacy routine that does sparse matrix-vector multiplication with, well, legacy code like this:

void foo (const double * const a, const double * const x, double * const b, const int * const ja, int *const ia, const int n) {
  int i;
  const int bl = 2;

  for (i = 0; i < n; i++) {
    int l = (ia[i+1] - ia[i]);
    int c = ia[i] - 1;
    double s = 0.0;
    while ( l ) {  
      s += a[ c     ] * x[ ja[c    ] - 1 ];
      s += a[ c + 1 ] * x[ ja[c + 1] - 1 ];
      l -= bl;
      c += bl;
    }
    b[i] = s;
  }
}

When I compile this code with the Intel C compiler 16.0.1.150 using "icc -qopenmp -qopt-report=5 foo1.c", the inner loop at line 9 gets vectorized:

Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.


    Report from: Interprocedural optimizations [ipo]

  WHOLE PROGRAM (SAFE) [EITHER METHOD]: false
  WHOLE PROGRAM (SEEN) [TABLE METHOD]: false
  WHOLE PROGRAM (READ) [OBJECT READER METHOD]: false

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000

In the inlining report below:
   "sz" refers to the "size" of the routine. The smaller a routine's size,
      the more likely it is to be inlined.
   "isz" refers to the "inlined size" of the routine. This is the amount
      the calling routine will grow if the called routine is inlined into it.
      The compiler generally limits the amount a routine can grow by having
      routines inlined into it.

Begin optimization report for: foo(const double *const, const double *const, double *const, const int *const, int *const, const int)

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (foo(const double *const, const double *const, double *const, const int *const, int *const, const int)) [1/1=100.0%] src/foo1.c(1,127)


    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at src/foo1.c(5,3)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at src/foo1.c(9,5)
      remark #15389: vectorization support: reference a has unaligned access   [ src/foo1.c(11,7) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 2
      remark #15399: vectorization support: unroll factor set to 4
      remark #15309: vectorization support: normalized vectorization overhead 0.149
      remark #15300: LOOP WAS VECTORIZED
      remark #15450: unmasked unaligned unit stride loads: 1 
      remark #15458: masked indexed (or gather) loads: 2 
      remark #15460: masked strided loads: 4 
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 29 
      remark #15477: vector loop cost: 18.500 
      remark #15478: estimated potential speedup: 1.540 
      remark #15488: --- end vector loop cost summary ---
   LOOP END

   LOOP BEGIN at src/foo1.c(9,5)
   <Remainder loop for vectorization>
      remark #15389: vectorization support: reference a has unaligned access   [ src/foo1.c(11,7) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 2
      remark #15309: vectorization support: normalized vectorization overhead 0.780
      remark #15301: REMAINDER LOOP WAS VECTORIZED
   LOOP END

   LOOP BEGIN at src/foo1.c(9,5)
   <Remainder loop for vectorization>
   LOOP END
LOOP END
===========================================================================

After that, I insert "#pragma omp parallel for" before the outer loop:

void foo (const double * const a, const double * const x, double * const b, const int * const ja, int *const ia, const int n) {
  int i;
  const int bl = 2;
#pragma omp parallel for
  for (i = 0; i < n; i++) {
    int l = (ia[i+1] - ia[i]);
    int c = ia[i] - 1;
    double s = 0.0;
    while ( l ) {  
      s += a[ c     ] * x[ ja[c    ] - 1 ];
      s += a[ c + 1 ] * x[ ja[c + 1] - 1 ];
      l -= bl;
      c += bl;
    }
    b[i] = s;
  }
}

And the inner loop at line 9 is no longer vectorized:

Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.


    Report from: Interprocedural optimizations [ipo]

  WHOLE PROGRAM (SAFE) [EITHER METHOD]: false
  WHOLE PROGRAM (SEEN) [TABLE METHOD]: false
  WHOLE PROGRAM (READ) [OBJECT READER METHOD]: false

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000

In the inlining report below:
   "sz" refers to the "size" of the routine. The smaller a routine's size,
      the more likely it is to be inlined.
   "isz" refers to the "inlined size" of the routine. This is the amount
      the calling routine will grow if the called routine is inlined into it.
      The compiler generally limits the amount a routine can grow by having
      routines inlined into it.

Begin optimization report for: foo(const double *const, const double *const, double *const, const int *const, int *const, const int)

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (foo(const double *const, const double *const, double *const, const int *const, int *const, const int)) [1/1=100.0%] src/foo2.c(1,127)


    Report from: OpenMP optimizations [openmp]

src/foo2.c(4:1-4:1):OMP:foo:  OpenMP DEFINED LOOP WAS PARALLELIZED

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at src/foo2.c(5,3)
   remark #15541: outer loop was not auto-vectorized: consider using SIMD directive

   LOOP BEGIN at src/foo2.c(9,5)
      remark #15523: loop was not vectorized: loop control variable c was found, but loop iteration count cannot be computed before executing the loop
   LOOP END
LOOP END
===========================================================================

 

My question is: can someone from Intel help me understand WHY the inner loop is vectorized in my foo1 but not in foo2?

I have a humble request: please answer my question of "why" rather than diverting the discussion into how to write better vectorizable code than the snippets above. My goal is not to optimize performance but to write a routine that reproduces, bit for bit, the results of a legacy routine in which this loop is NOT vectorized. For that purpose, I would like the loop at line 9 NOT to be vectorized in my new code, because otherwise the vector reduction changes the order of operations, which slightly changes the numerical results. From the tests shown above, it looks like I can disable vectorization in one of two ways:

  1. "Parallelize" the i-loop, possibly with 1 thread, so that the compiler gives up on vectorization of the inner loop, but I don't know if this workaround will survive until the next compiler version.
  2. Put "pragma novector" before the j-loop, which does prevent vectorization, but does not recover bit-to-bit the results of the original code (possibly because the original code has some automatic unrolling).

Understanding why the compiler vectorizes the code in foo1 but not in foo2 will help me set up a better strategy for writing code that produces bitwise-identical results to the legacy code.

10 Replies
TimP
Honored Contributor III

Is there a difference between parallel for and parallel for simd?  I am concerned that you are looking for answers which may be invalidated by compiler updates, and I don't have your old version here.

Compile options such as -fp-model source are intended to address such requirements, so I'm not certain why you exclude that.

TimP
Honored Contributor III

Under the 2017 compiler, omp parallel for simd resulted in vectorization, but with a huge reported loss of expected performance. The report says indices have to be stored to and read back from memory, which must defeat the purpose of omp parallel. So it appears that newer Intel compilers do offer a more verbose explanation, if you are willing to try a few options. It's still possible to get the remark about failing to precompute the loop count. In fact, as you hinted, if enough performance to justify vectorization were required (even in the absence of threading), you would want to edit the loop into "modern code."

In this case, the inner loop might be written as a for loop with #pragma omp simd linear(c) reduction(+: s) to get a slight speedup from simd, but the compiler still complains about reading c from memory. Evidently, "modern code" by the marketers' definition also requires enough modification to get simd stride-1 memory access, but that appears likely to worsen the roundoff.
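
For concreteness, a sketch of that rewrite of the inner loop (niter is an illustrative name; I use linear(c:2) because c advances by bl = 2 per iteration, and, like the original while loop, this assumes ia[i+1] - ia[i] is a multiple of bl):

    /* Sketch only: the while loop recast as a countable for loop so an
       omp simd directive can be applied to it. */
    const int niter = (ia[i+1] - ia[i]) / bl;
    int c = ia[i] - 1;
    double s = 0.0;
#pragma omp simd linear(c:2) reduction(+: s)
    for (int k = 0; k < niter; k++) {
      s += a[ c     ] * x[ ja[c    ] - 1 ];
      s += a[ c + 1 ] * x[ ja[c + 1] - 1 ];
      c += bl;                     /* bl == 2, matching the linear step */
    }
    b[i] = s;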

As expected, -fp-model source produces:

      remark #15331: loop was not vectorized: precise FP model implied by the command line or a directive prevents vectorization. Consider using fast FP model

These options to prevent implicit simd optimization of sum reductions have been more popular in the gcc world, where for a long time attempting to combine omp parallel with simd reduction was problematic (although it seems to be fixed in current versions).

 

jimdempseyatthecove
Honored Contributor III

>>Put "pragma novector" before the j-loop, which does prevent vectorization, but does not recover bit-to-bit the results of the original code (possibly because the original code has some automatic unrolling).

The original code may have used x87 (FPU) instructions in place of SSE/AVX instructions. Also, depending on the data and the rounding modes enabled, denormals may have been flushed to zero differently.

Jim Dempsey

TimP
Honored Contributor III
I have encountered anomalies in the Intel 2016 and 2017 compilers as to whether underflow is handled according to the documented flags. Gradual underflow wasn't viable for KNC, but it should be OK on Sandy Bridge and newer hosts.
Andrey_Vladimirov
New Contributor III

Tim, thank you for the response! I checked -fp-model source, and it does disable vectorization, as do "#pragma novector" and "#pragma intel optimization_level 0", but none of them reproduces the original results.

Upon closer inspection, things seem to be more complicated than just requiring no vectorization. For the original code that I am trying to reproduce, the compiler issues this interesting remark:

LOOP BEGIN at mathsupport.c(3127,9)
      remark #15523: loop was not vectorized: loop control variable l was found, but loop iteration count cannot be computed before executing the loop
      remark #25478: While Loop Unrolled by 2
LOOP END

I think the automatic loop unroll (on top of the manual unroll) changes the order of operations. Perhaps the compiler is computing 2 running sums for the unrolled operations and then adding them together?
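
If that guess is right, the unrolled loop would effectively behave like the sketch below (s0 and s1 are illustrative names for the two partial sums; this is only my guess at the shape of the transformation, not actual compiler output, and remainder iterations are omitted):

    double s0 = 0.0, s1 = 0.0;
    while ( l ) {
      /* first copy of the manually unrolled body */
      s0 += a[ c     ] * x[ ja[c    ] - 1 ];
      s0 += a[ c + 1 ] * x[ ja[c + 1] - 1 ];
      /* second copy created by the automatic unroll by 2 */
      s1 += a[ c + 2 ] * x[ ja[c + 2] - 1 ];
      s1 += a[ c + 3 ] * x[ ja[c + 3] - 1 ];
      l -= 2 * bl;
      c += 2 * bl;
    }
    double s = s0 + s1;            /* the final combine changes the summation order */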

I also found (too late) that putting the outer loop under "#pragma omp parallel" does disable vectorization but does not reproduce the original results. This is my fault for not checking thoroughly before posting; it looks like I asked the wrong question.

And to answer myself (and possibly render this discussion completely useless), here is the tweak that I found that makes my new function produce the same results as the original function: I have to declare "l" as "unsigned int l" instead of "int l":

void foo (const double * const a, const double * const x, double * const b, const int * const ja, int *const ia, const int n) {
  int i;
  const int bl = 2;

  for (i = 0; i < n; i++) {
    unsigned int l = (ia[i+1] - ia[i]); // This reproduces the original code's behavior: failed vectorization and automatic unroll by 2
    int c = ia[i] - 1;
    double s = 0.0;
    while ( l ) {  
      s += a[ c     ] * x[ ja[c    ] - 1 ];
      s += a[ c + 1 ] * x[ ja[c + 1] - 1 ];
      l -= bl;
      c += bl;
    }
    b[i] = s;
  }
}

This makes vectorization fail, but retains the automatic unroll by 2 - as long as I compile with "-O3":

   LOOP BEGIN at foo3.c(8,5)
      remark #15523: loop was not vectorized: loop control variable l was found, but loop iteration count cannot be computed before executing the loop
      remark #25478: While Loop Unrolled by 2  
   LOOP END

I am sorry; it looks like this issue ventured into a dark area where compiler behavior and code quirks work in mysterious ways. Thanks for the tips!

Andrey_Vladimirov
New Contributor III

Jim,

Thank you for the tip!

I checked the assembly, and it seems to be using instructions such as "mulsd" and "addsd" on xmm registers - I guess these are not legacy instructions, or are they?

Also, do you have a tip on how to check how denormal numbers are handled by the code that the compiler produced?

Andrey

TimP
Honored Contributor III

Sorry about the auto-correction in my previous post; this is about underflow and ftz et al.

enquire.c still requires -fp-model source; protect-parens isn't sufficient. This sets ftz off, producing the run-time message "smallest numbers are not kept normalized" under the -v option, but ftz can be set on the command line to override that part of -fp-model source.

The Windows option /Qlong-double isn't working in 64-bit mode, although it works with gcc (as long as -ffast-math isn't set).

Paranoia for either the double or float data type doesn't work with the 2017 compiler when compiled as a single source file (but does work with the functions split into individual files), regardless of the -ip setting. There seems to be a violation of parentheses with the single-file compilation, violating the protect-parens setting. ftz seems to work as documented.

TimP
Honored Contributor III

The effect of the ftz options on generated code is the insertion of an MXCSR setting in main() to override the setting made by the OS before transferring control to your executable. I'd prefer to check expressions like 1.5*DBL_MIN - DBL_MIN.

I've been meaning to run paranoia in single and double precision over a range of compile options with ifort 15, 16, and 17. protect-parens is required.

If there's interest, I will post a working paranoia on GitHub. The original couldn't work with normal optimizations. I haven't seen a correct C version; enquire might do the job.

Linux traditionally doesn't tinker with MXCSR, so the typical Intel compiler setting of -ftz is done in main().

Windows x64 is expected to set ftz before launching an .exe. Windows releases have been known to overlook the expected policy, but that could break something in the run-time library, as could deviating from x87 53-bit precision.

Various situations might require an explicit simd MXCSR intrinsic to set the underflow behavior consistently.
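
As a concrete check (a sketch; the MXCSR query macros come from <xmmintrin.h> and <pmmintrin.h>), a small standalone program can report both the FTZ/DAZ mode the executable starts with and the value of 1.5*DBL_MIN - DBL_MIN:

#include <stdio.h>
#include <float.h>
#include <xmmintrin.h>   /* _MM_GET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_GET_DENORMALS_ZERO_MODE */

int main(void) {
  /* Report the flush-to-zero / denormals-are-zero bits of MXCSR at startup. */
  printf("FTZ %s, DAZ %s\n",
         _MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON ? "on" : "off",
         _MM_GET_DENORMALS_ZERO_MODE() == _MM_DENORMALS_ZERO_ON ? "on" : "off");

  /* 1.5*DBL_MIN - DBL_MIN is 0.5*DBL_MIN, a denormal: it prints as 0 when
     underflowed results are flushed to zero, and as roughly 1.11e-308 under
     gradual underflow. */
  volatile double t = 1.5 * DBL_MIN;
  printf("1.5*DBL_MIN - DBL_MIN = %g\n", t - DBL_MIN);
  return 0;
}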

 

 

 

Andrey_Vladimirov
New Contributor III

Tim, thank you for new pointers! When you say that paranoia does not work, do you mean that it reports false positive issues or does not report issues when it should?

TimP
Honored Contributor III

Andrey Vladimirov wrote:

Tim, thank you for new pointers! When you say that paranoia does not work, do you mean that it reports false positive issues or does not report issues when it should?

When compiling as a single source file, paranoia reports a discrepancy between two of its methods for assessing the number of mantissa bits, which is taken as a run-time failure. This indicates to me that protect-parens has failed at that point. -fp-model source corrects this problem, allowing use of a single source file, but in general one would want the default level of optimization rather than requiring so much use of omp simd directives specific to Intel compilers.
