Macro Bug?

Brouard__Nicolas · ‎09-13-2014

Hi,

I am in the Trial period of the Intel Parallel Studio 2015. We have a scientific software which works on Linux, Windows, and OS/X . Users were surprised by better performance on OS/X than on Windows (double time)! Thus instead of cross-compiling on OS/X (or Linux) using gcc for WIndows (32 or even current mingw64), I decided to test "native" compilers like Visual Studio 2013 as well as your Intel Parallel Studio 2015.

My current test concerns a Xeon 1225 V3 on Windows 8 64 bit. Clearly, the performance difference between OS/X vs Windows was linked to compilers!

I am not able to give precise comparisons yet because I could not find the same final results with the Intel 64. The program maximizes a Likelihood surface using a Powell algorithm. Up to now, any compiler on any machine 32 or 64 bit produced the same results with 34 iterations (http://euroreves.ined.fr/imach/wiki/index.php/Main_Page#Performance).

But compiled with intel ilcx64 it converges in 62 iterations. Strangely, final results from Intel are not wrong but even better and more accurate!

It deserved investigation and here is what I first found:

The optimization program, the Powell implementation, described in Numerical Recipes in C(powell function), defines a SQR macro:

#define SQR(a) ((sqrarg=(a)) == 0.0 ? 0.0 :sqrarg*sqrarg)

and then this macro is used in the statement:

      t = 2.0*(fp-2.0*(*fret)+fptt)*SQR(fp-(*fret)-del) - del*SQR(fp-fptt);
      if (t < 0.0){

the compiled version gives wrong values of t which are sometimes positive instead of being negative. The path to reach the maximum is then different from other compiled versions.

If the SQR macro is replaced by a standard multiplication, it will work. Another workaround consists in cutting in two parts:

      t = 2.0*(fp-2.0*(*fret)+fptt)*SQR(fp-(*fret)-del);
      t = t - del*SQR(fp-fptt);
      if (t < 0.0){

It means that the ILC compiler has some difficulties to manage two successive SQR macros on the same line.

I don't know how many people in the world are using the Numerical Recipes in C library, but I was wanting to let you know about this interesting fact.

Thanks for the trial anyway,

Nicolas

Alexander_W_Intel · ‎09-15-2014

Hi Nicolas,

thanks for your detailed explanation.

I don't think this issue is caused by the macros. Macros are simple text replacements. What you can try is to compile the file with the options "-E -P". You will get the preprocessed output before the compilation and can check if this is correct.
I assume this is more an optimization issue. Does Compiling with "-O0" helps with your issues?

Is it possible that you create a small reproducer of this issue that will compile and run?

Thanks,
Alex

jimdempseyatthecove · ‎09-15-2014

This may be a case of when at convergence, fp-fptt produces -0.0. And this may later affect the (t < 0.0) causing a difference in the number of iterations.

The macro may have been used in the old code to avoid a multiplication as a means of optimization. While that may have been appropriate 20-30 years ago, with today's processors this will produce slower code. The use of the macro may also introduce (expose) the issue of how to handle tiny differences that results in a negative 0.0 (which is a valid IEEE number).

Jim Dempsey

TimP · ‎09-15-2014

The unnecessary test for 0. would appear to be more of a performance issue than use of macro per se.

Brouard__Nicolas · ‎09-27-2014

Hi,

Thank you for your comments, here is a test.c and two kinds of results; you should get 4 lines OK with a negative value, but for the Intel compiler (either x64 or or 32) you get only 2 OK. For any other compiler that I tested, Visual Studio 2013 (32 or 64 bit), Apple compiler x64, gcc of whatever kind 32 or 64 Linux or crosscompiled, you get the 4 OK:

#include <stdio.h>
#include <math.h>
static double sqrarg;
#define SQR(a) ((sqrarg=(a)) == 0.0 ? 0.0 :sqrarg*sqrarg)

main(){

	double fp, fptt, fpx;
	double t, tSQR, tSQ, tS;
	double	*fret;
	double del;


	fret = &fpx;
	fp = 429817.002238171520;
	*fret = 62670.208497842141;
	fptt = 133712.749565220380;
	del = 233176.627080652750;

	if (fptt < fp) {
		t = 2.0*(fp - 2.0*(*fret) + fptt)*((sqrarg = (fp - (*fret) - del)) == 0.0 ? 0.0 : sqrarg*sqrarg) - del*((sqrarg = (fp - fptt)) == 0.0 ? 0.0 : sqrarg*sqrarg);

		tSQR = 2.0*(fp - 2.0*(*fret) + fptt)*SQR(fp - (*fret) - del) - del*SQR(fp - fptt);
		tSQ = 2.0*(fp - 2.0*(*fret) + fptt)*(fp - (*fret) - del)*(fp - (*fret) - del) - del*(fp - fptt)*(fp - fptt);
		tS = 2.0*(fp - 2.0*(*fret) + fptt)*SQR(fp - (*fret) - del);
		tS = tS -del*SQR(fp - fptt);
		printf("    t=%.12lf\n tSQR=%.12lf\n  tSQ=%.12lf\n   tS=%.12lf\n", t, tSQR, tSQ, tS);
		printf("Powell fp=%.12lf\n   *fret=%.12lf\n    fptt= %.12lf\n     del=%.12lf\n", fp, *fret, fptt, del);
		if (tSQR < 0.0)  printf(" tSQR negative OK\n");
		if (tSQ < 0.0)  printf(" tSQ negative OK\n");
		if (tS < 0.0)  printf(" tS negative OK\n");
		if (t < 0.0)  printf(" t negative OK\n");

	}
	getchar();
}

Here are the two kinds of results, with 2 OK (wrong) and 4 OK (good);

___ icl 64 or 32
    t=56394494024495904.000000000000
 tSQR=56394494024495904.000000000000
  tSQ=-4715147759914590.000000000000
   tS=-4715147759914590.000000000000
Powell fp=429817.002238171520
   *fret=62670.208497842141
    fptt= 133712.749565220380
     del=233176.627080652750
 tSQ negative OK
 tS negative OK

____ vs 64 or 32, gcc 32 or 64, Apple Linux, ming64

    t=-4715147759914590.000000000000
 tSQR=-4715147759914590.000000000000
  tSQ=-4715147759914590.000000000000
   tS=-4715147759914590.000000000000
Powell fp=429817.002238171520
   *fret=62670.208497842141
    fptt= 133712.749565220380
     del=233176.627080652750
 tSQR negative OK
 tSQ negative OK
 tS negative OK
 t negative OK

1.0E16 are big numbers I know, but not that big! It is true that if we approch the maximum, differences are smaller and we get the correct results using the Intel compilers. It is also true that it is easy to suppress the macro which is probably useless today. The macro is expanded (test.i /P) correctly (even if two on the same line) but the test with 0 seems problematic.

Here is a excerpt of the book Numerical Recipes in C, if you are interested in the origin of this macro:

1.2 Some C Conventions for Scientific Computing 27

We have already alluded to the problem of computing small integer powers of numbers, most notably the square and cube. The omission of this operation from C is perhaps the language’s most galling insult to the scientific programmer. All good FORTRAN compilers recognize expressions like (A+B)**4 and produce in-line code, in this case with only one add and two multiplies. It is typical for constant integer powers up to 12 to be thus recognized.

In C, the mere problem of squaring is hard enough! Some people “macro-ize” the operation as

#define SQR(a) ((a)*(a))

However, this is likely to produce code where SQR(sin(x)) results in two calls to the sine routine! You might be tempted to avoid this by storing the argument of the squaring function in a temporary variable:

static float sqrarg;
#define SQR(a) (sqrarg=(a),sqrarg*sqrarg)

The global variable sqrarg now has (and needs to keep) scope over the whole module, which is a little dangerous. Also, one needs a completely different macro to square expressions of type int. More seriously, this macro can fail if there are two SQR operations in a single expression. Since in C the order of evaluation of pieces of the expression is at the compiler’s discretion, the value of sqrarg in one evaluation of SQR can be that from the other evaluation in the same expression, producing nonsensical results. When we need a guaranteed-correct SQR macro, we use the following, which exploits the guaranteed complete evaluation of subexpressions in a conditional expression:

static float sqrarg;
#define SQR(a) ((sqrarg=(a)) == 0.0 ? 0.0 : sqrarg*sqrarg)

Nicolas

Alexander_W_Intel · ‎09-30-2014

Hi,

thanks for the code example. I played around with your code example and it looks more a general issue. If you for example split the statement

t = 2.0*(fp - 2.0*(*fret) + fptt)*((sqrarg = (fp - (*fret) - del)) == 0.0 ? 0.0 : sqrarg*sqrarg) - del*((sqrarg = (fp - fptt)) == 0.0 ? 0.0 : sqrarg*sqrarg);

into

t = 2.0*(fp - 2.0*(*fret) + fptt)*((sqrarg = (fp - (*fret) - del)) == 0.0 ? 0.0 : sqrarg*sqrarg); 
t -= del*((sqrarg = (fp - fptt)) == 0.0 ? 0.0 : sqrarg*sqrarg);

you get the correct result. Strange is that the issue is even showing up with the options -O0 -fp-model strict .

I escalated it to our engineering. The internal reference number is: DPD200545918

I will update the thread as soon as I have news.

Thanks, Alex

Alexander_W_Intel · ‎10-07-2014

Hi,

it looks like the statements you have are undefined by language standards. You can see this by enabling the warning -Wsequence-point in GCC. This option shows some, but not all, sequencing issues.

warning: operation on âsqrargâ may be undefined [-Wsequence-point]

         t = 2.0*(fp - 2.0*(fret) + fptt)*((sqrarg = (fp - (fret) - del)) == 0.0 ? 0.0 : sqrarg*sqrarg) - del*((sqrarg = (fp - fptt)) == 0.0 ? 0.0 : sqrarg*sqrarg);

From the GCC documentation about this problem:

-Wsequence-point

Warn about code that may have undefined semantics because of violations of sequence point rules in the C and C++ standards.

The C and C++ standards define the order in which expressions in a C/C++ program are evaluated in terms of sequence points, which represent a partial ordering between the execution of parts of the program: those executed before the sequence point, and those executed after it. These occur after the evaluation of a full expression (one which is not part of a larger expression), after the evaluation of the first operand of a &&, ||, ? : or , (comma) operator, before a function is called (but after the evaluation of its arguments and the expression denoting the called function), and in certain other places. Other than as expressed by the sequence point rules, the order of evaluation of subexpressions of an expression is not specified. All these rules describe only a partial order rather than a total order, since, for example, if two functions are called within one expression with no sequence point between them, the order in which the functions are called is not specified. However, the standards committee have ruled that function calls do not overlap.

It is not specified when between sequence points modifications to the values of objects take effect. Programs whose behavior depends on this have undefined behavior; the C and C++ standards specify that “Between the previous and next sequence point an object shall have its stored value modified at most once by the evaluation of an expression. Furthermore, the prior value shall be read only to determine the value to be stored.”. If a program breaks these rules, the results on any particular implementation are entirely unpredictable.

Examples of code with undefined behavior are a = a++;, a = b[n++] and a[i++] = i;. Some more complicated cases are not diagnosed by this option, and it may give an occasional false positive result, but in general it has been found fairly effective at detecting this sort of problem in programs.

The standard is worded confusingly, therefore there is some debate over the precise meaning of the sequence point rules in subtle cases. Links to discussions of the problem, including proposed formal definitions, may be found on the GCC readings page, at http://gcc.gnu.org/readings.html

We still track this as an issue with low priority to improve the compatibility with GCC also on cases which are undefined by the language standard.

Thanks Alex