OpenMP does not like fmax/fabs

Jon_U_ · ‎09-14-2013

We have a code that is exhibiting greatly different runtimes between a Fortran and C version. The problem has been isolated to one simple loop: #pragma omp parallel for reduction(max:dt) for(i = 1; i <= NR; i++){ for(j = 1; j <= NC; j++){ dt = fmax( fabs(t-t_old), dt); t_old = t; } } Which runs about 12 times slower than the equivalent Fortran loop: !$omp parallel do reduction(max:dt) Do j=1,NC Do i=1,NR dt = max( abs(t(i,j) - told(i,j)), dt ) Told(i,j) = T(i,j) Enddo Enddo !$omp end parallel do Removing the dt assignment eliminates the disparity. Also, running these as serial codes shows no disparity, do the problem is not that the actual C implementation is just so bad. Also, eliminating just the reduction does not close the gap, so it is not the reduction operation itself. All of those tests lead us to the conclusion that there is some terrible interaction between OpenMP and fmax/abs. Any help appreciated. Thanks in advance, Jon

TimP · ‎09-14-2013

I guess you're not using an Intel compiler, as C OpenMP reduction(max:) isn't supported by those compilers.

I haven't checked the implementation of fmax in non-Intel compilers, but I do suspect it's not optimizable, even if you make the effort to write your code explicitly to keep the OpenMP reduction in the outer loop and make a vector reduction in the inner loop. fmax is likely slower than Fortran max() or C++ std::max(); it's a more complicated operation.

Did you compare against something like plain old-fashioned C derived from your Fortran, with the options required for vectorization and vector report by your undisclosed compiler?

      for (i__ = 2; i__ <= i__2; ++i__)
          if (a[i__] > x)
              x = a[i__];

I haven't tested this with gcc-4.9 #pragma omp parallel simd reduction since some bugs in that were fixed this week. gcc has performed the vector reduction reasonably well in recent versions, except that gcc doesn't "riffle" the reduction as Intel C does (-O2 -ftree-vectorize -ffast-math -ftree-vectorizer-verbose=1). If you're using an ancient compiler, you can't expect much.

Up to now, C nested loop max|min reductions often had to be done with omp critical in the outer loop and vector reduction in the inner loop; this is still the case with Intel C. Intel C++ will auto-vectorize an inner loop written with std::max() (but g++ will not).

If writing for a many-core platform, straightforward use of omp critical probably won't work as well as some sort of tree reduction (which would be implied by a competent omp parallel reduction) or saving the results of the inner loops in an array with attention to avoidance of false sharing.

So this is a fairly simple example of how you must put forth several times as much effort to get performance with C as with Fortran, and it does matter which compiler and how you use it.

TimP · ‎09-14-2013

A big problem: you neglected to make j private. C99 syntax is preferable:

for(int i = 1;...

for(int j = 1;

OpenMP makes the parallel for induction variable private by default, but not (in C or C++) the others, contrary to Fortran, where they are all default private.

array definitions are important also; I'm trying to guess at what you might have done.

Even the possibility that you aren't taking into account the difference between C 0-based arrays and Fortran 1-based could be significant.

TimP · ‎09-14-2013

gcc (and probably other compilers) don't have an in-line intrinsic for fmax. The difference in timing between a scalar library function call and vectorized Fortran would overwhelm your gain from OpenMP parallel, particularly if you have that race condition in C but not Fortran.

I made a version using plain old C but needed -ffast-math for gcc to vectorize it by maxpd or vmaxpd instructions

Definitions like double **restrict t, double **restrict t_old proved sufficient to overcome the aliasing problem where C otherwise requires the compiler to protect against unknown overlaps among the arrays. Yes, the designers of Fortran understood this stuff over 50 years ago which C was forced to deal with as an afterthought 25 years ago.

I do know both reasonably expert Fortran and C programmers who don't believe in the aliasing rules, but you really can't afford to avoid learning about such things if you're dealing with parallel programming.

A still partly broken version:

#include <math.h>
#include <omp.h>
#define NR 1111
#define NC 2222
void
ju (double ** restrict t,double ** restrict t_old)
{
//double t[NR][NC];
//double t_old[NR][NC];
double dtm=0;
#pragma omp parallel for reduction(max:dtm)
for (int i = 0; i < NR; i++)
    {
    double dt=0;
      for (int j = 0; j < NC; j++)
        {
          double tmp = fabs (t - t_old);
          if(tmp > dt) dt = tmp;
          t_old = t;
        }
        if(dt>dtm)dtm=dt;
    }
}
gcc -O2 -fopenmp -ftree-vectorize -S -std=c99 -ffast-math -march=corei7-avx ju.c

Jon_U_ · ‎09-14-2013

> I guess you're not using an Intel compiler, as C OpenMP reduction(max:) isn't supported by those compilers

I am using icc Version 14.0.0. OpenMP 3.1 (which has a C max reduction) has been supported since Version 13.x. That is well documented! And I am comparing against ifort 14.0.0.

TimP · ‎09-14-2013

In fact, the current icc does accept this example of omp parallel reduction(max: ) although we've been told this OpenMP feature would not be fully supported until next year. #pragma omp simd reduction(max: ) is still rejected by current Intel compilers and not likely to be supported in 14.0.

With the addition of #pragma omp simd, icc seems to optimize the inner loop as I wrote it above (and also in a similar version with fmax), although I'd like to see successful testing.

I agree, it may be considered a bug when the _OPENMP macro implies OpenMP 3.1 support, if not all parts of that standard are satisfied. We've been informed Intel plans to claim OpenMP 4.0 support in a compiler which may implement all the omp simd reduction cases except for user defined reduction.

TimP · ‎09-20-2013

I've done some testing with fmax|fmin|fmaxf|fminf with Intel and gnu compilers. I find that these don't vectorize for icc without addition of pragmas, but when I add pragmas to promote vectorization, a majority of those involving reduction are giving wrong results.

I do find that OpenMP 3.1 parallel reduction is working well, even though we have had documents stating that max|min reduction is not yet supported. Even the case where I make an indexed max reduction using omp parallel reduction(max: ) lastprivate ( ) is working, so it may be there isn't a need for user defined reduction for that case.

OpenMP 4.0 omp simd is not yet supported for max|min reduction, so this is a factor in poor performance of fmax.

In my tests with gcc with -ffinite-math-only set, fmax et al. are working interchangeably with old-fashioned C code (vectorized with correct results). With current icc, good performance requires use of old-style C or C++ std::max(). I suppose -ffinite-math-only removes the distinction between fmax and std::max which gives icc trouble with vectorization of fmax.

It's a little strange to find that icc optimizes std::max better than gcc, while for fmax it's the other way around.

According to some web search references, maxpd instructions were defined to fit exactly the old C idiom if(b>a)b=a; which explains the asymmetric treatment of NaN operands. I've found no history to explain the no more comprehensible but incompatible treatment of fmax where either operand is NaN. So I don't see fmax and the like as a solution to the lack of a suitable max intrinsic in C89.

I remember this nit-pickiness all the way back to when compilers first introduced support for the fmax instruction on Macintosh.

jimdempseyatthecove · ‎09-29-2013

>>I've found no history to explain the no more comprehensible but incompatible treatment of fmax where either operand is NaN.

FWIW see: http://bytes.com/topic/c/answers/528404-max-nan-0-should-nan

(#6, TimP's comments)

My position is in line with #44. There is a requirement for a "not available" encoding, preferrably as a non-signaling type of encoding.

As for the return value for fmin/fmax with one NaN, at least in C, I would favor the definable value.
http://www.cplusplus.com/reference/cmath/fmax/?kw=fmax

Jim Dempsey

jimdempseyatthecove · ‎09-29-2013

An additional note with regards to "not-available". In a different IDZ forum post (which I won't take the time to look up for you), I discuss, and show by example, a use for "not-available-yet", alternatively "pending" use of QNAN. Briefly, when using parallel programming, often you deal with loops involving A(i+1) and/or A(i-1). When i is located at the boundary of a thread's segment, the programmer requires a low-cost means of communication between the threads executing the adjacent segments to assure proper sequencing. In some of the cases, inserting a/some QNAN's at the junctures prior to the loop can then be used for proper sequencing.

Jim Dempsey

TimP · ‎03-22-2014

Recent versions of gcc vectorize fmax|fmin by the aid of the -ffinite-math-only option (implied by -ffast-math).

icc can sometimes vectorize fmax but the code to preserve the peculiar functionality (which is lost in the gcc vectorization) is tortured and slow.

This creates a situation in c++ where fmax (fmaxf for float data type) is preferred by g++ but std::max is preferred by icpc.

As gfortran has a special hook for vectorization of max|min by use of maxps instructions and the like (producing similar performance as ifort), it would be preferable in my opinion if g++ would do the same for std::max.