Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

optimizer bug? sincos vs. sin, then cos... large speed differences

dghyams
Beginner
I have run into a strange optimizer quirk, and my best guess is that there are two different bugs causing it. I have attached a small test program below to demonstrate. Two functions that do the same thing (called powfast and powslow in the sample code) have very different run times:

$ /usr/local/intel/Compiler/11.1/038/bin/intel64/icc -msse4.1 -O3 intelbug.C
$ ./a.out
Result: 1.467233472551115E+07 -3.423849100156368E+05
Result: 1.467233472551115E+07 -3.423849100156368E+05
FAST: 0.477064 seconds.
SLOW: 0.490420 seconds.
$
$ /usr/local/intel/Compiler/11.1/069/bin/intel64/icc -msse4.1 -O3 intelbug.C
$ ./a.out
Result: 1.467233472551115E+07 -3.423849100156368E+05
Result: 1.467233472551115E+07 -3.423849100156368E+05
FAST: 0.022687 seconds.
SLOW: 0.432179 seconds.
$
$ /usr/local/intel/Compiler/11.1/072/bin/intel64/icc -msse4.1 -O3 intelbug.C
$ ./a.out
Result: 1.467233472551115E+07 -3.423849100156368E+05
Result: 1.467233472551115E+07 -3.423849100156368E+05
FAST: 0.025833 seconds.
SLOW: 0.685143 seconds.

This is on a Core 2 architecture, Linux. I looked at the assembly generated for powfast() and powslow(), and the main difference I can see is that the slow one uses the sincos() function, while the fast one uses sin() and then cos() to calculate the values required in the function. What I'm thinking is this (if that helps at all):
1) The same code really should be generated for both functions. The only difference in the C++ code is where n gets multiplied in. Surely dependency analysis should recognize that the same argument is being passed to sin and cos, and therefore sincos should get called in both cases.
2) The sincos() function itself has not gotten the major speed boost that apparently happened between revs. 038 and 069. In fact, it got slower still between revs. 069 and 072.

The code that reproduces the problem is attached below, followed by a sketch of calling sincos() directly.

[bash]#include <math.h>
#include <stdio.h>
#include <sys/time.h>

class Cplx
{
private:

public:
  // data.
  double a[2];

  Cplx(double r, double i) {
     a[0] = r;
     a[1] = i; 
  };

#if 1
  // fast
  Cplx powfast(const double n) const {
    const double r = ::pow(a[0]*a[0] + a[1]*a[1],n/2.0); // r^n

    const double theta = ::atan2(a[1],a[0]); // theta*n
    Cplx lval(r*::cos(theta*n),r*::sin(theta*n));

    return lval;
  }
#endif

#if 1
  // slow
  Cplx powslow(const double n) const {
    const double r = ::pow(a[0]*a[0] + a[1]*a[1],n/2.0); // r^n

    const double theta = ::atan2(a[1],a[0])*n; // theta*n
    Cplx lval(r*::cos(theta),r*::sin(theta));

    return lval;
  }
#endif

  const Cplx& operator+=(const Cplx &rval) {
    a[0] += rval.a[0];
    a[1] += rval.a[1];
    return *this;
  }


} ;


double wtime() {
struct timeval thetime;
  gettimeofday(&thetime,NULL);
  return (double)thetime.tv_sec + (double)thetime.tv_usec/1.0e+6;
}


int main(int argc, char * argv[])
{

   int i;
   int niter = 30000000;

   Cplx a(0.5,0.02);
   Cplx b(0.6,-0.01);
   Cplx c(0.0,0.0);

   // test the fast version
   double ck1 = wtime();
   for (i = 0; i < niter; i++)
   {
     c += b.powfast(1.4);
   }
   double ck2 = wtime();
   printf("Result: %.15E %.15E\n",c.a[0],c.a[1]);
   c.a[0] = c.a[1] = 0.0;

   // test the slow version
   double ck3 = wtime();
   for (i = 0; i < niter; i++)
   {
     c += b.powslow(1.4);
   }
   double ck4 = wtime();
   printf("Result: %.15E %.15E\n",c.a[0],c.a[1]);



   printf("FAST: %.6f seconds.\n",ck2-ck1);
   printf("SLOW: %.6f seconds.\n",ck4-ck3);


   return(0);
}


[/bash]
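For reference, here is a small sketch of calling sincos() explicitly instead of relying on the optimizer to fuse the separate sin()/cos() calls. This assumes glibc's sincos(), which is a GNU extension, so the _GNU_SOURCE requirement and the powsincos name here are my own additions, not part of the original test case:

[cpp]// Sketch only: call the library sincos() directly rather than hoping the
// optimizer recognizes that sin() and cos() share the same argument.
// sincos() is a GNU extension in glibc's <math.h>, hence _GNU_SOURCE.
#define _GNU_SOURCE
#include <math.h>

Cplx powsincos(const double n) const {
  const double r = ::pow(a[0]*a[0] + a[1]*a[1], n/2.0); // r^n
  const double theta = ::atan2(a[1], a[0]) * n;         // theta*n
  double s, c;
  ::sincos(theta, &s, &c);   // computes both sin and cos in one call
  return Cplx(r*c, r*s);
}
[/cpp]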

7 Replies
TimP
Honored Contributor III
In my copy of icc, both cases are "optimized" by calculating the trig values once (using __libm_sse2_sincos once ahead of both loops) and then summing up the result, taking 0.015 sec or less on Core i7, with identical code in both loops. I don't see how you could draw a conclusion about the speed of a single invocation of sincos() when most of the time is spent adding copies of its result without calling it again, at least in your "fast" case.
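In other words, roughly this transformation (a sketch of the effect, not the actual generated code):

[cpp]// With a loop-invariant argument, the expensive math can be evaluated once
// ahead of the loop; only the additions remain inside it.
Cplx t = b.powfast(1.4);        // pow/atan2/sincos computed a single time
for (int i = 0; i < niter; i++)
  c += t;
[/cpp]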
As you may have noticed, you're not the first to run into this sort of thing; 11.1.073 was introduced for several such reasons, even though 12.0 should come out next month.
jimdempseyatthecove
Honored Contributor III
A proper test would make sure the optimizer does not lift the invariant computation out of the loop:

double num = 1.4;
double delta = 1.0 / niter;
for (i = 0; i < niter; i++)
{
c += b.powfast(num);
num += delta;
}



Jim
dghyams
Beginner

OK, sorry about that... the example code has been corrected so that the argument to each variant of pow() actually varies (I just modified the loops as per Jim's post above). Here are the new results. There still seems to be a 10-15% bias in favor of what I would call the faster method (premultiplying by n), and it's at least good news that the code is performing as one might expect. What's still bothering me is that I would have figured the optimizer would generate identical code for both...

I don't read assembly very well, and the only reason I focused on the sincos issue is that, if I generate the assembler for both versions and run it through diff, that was the one major difference I noticed (and it's still the case with the updated test code). All other differences were changes in register naming, etc.

And the initial test was flawed, yes, but I still don't quite get why such a slight change in coding would cause such a change in the emitted code. The optimizer should be able to tell that neither n nor theta changes, and therefore it should use the sincos function in both cases.

The updated test code is pasted below.

[bash]#include <math.h>
#include <stdio.h>
#include <sys/time.h>

class Cplx
{
private:

public:
  // data.
  double a[2];

  Cplx(double r, double i) {
     a[0] = r;
     a[1] = i; 
  };

#if 1
  Cplx pownopre(const double n) const {
    const double r = ::pow(a[0]*a[0] + a[1]*a[1],n/2.0); // r^n

    const double theta = ::atan2(a[1],a[0]); // theta*n
    Cplx lval(r*::cos(theta*n),r*::sin(theta*n));

    return lval;
  }
#endif

#if 1
  Cplx powpre(const double n) const {
    const double r = ::pow(a[0]*a[0] + a[1]*a[1],n/2.0); // r^n

    const double theta = ::atan2(a[1],a[0])*n; // theta*n
    Cplx lval(r*::cos(theta),r*::sin(theta));

    return lval;
  }
#endif

  const Cplx& operator+=(const Cplx &rval) {
    a[0] += rval.a[0];
    a[1] += rval.a[1];
    return *this;
  }


} ;


double wtime() {
struct timeval thetime;
  gettimeofday(&thetime,NULL);
  return (double)thetime.tv_sec + (double)thetime.tv_usec/1.0e+6;
}


int main(int argc, char * argv[])
{

   int i;
   int niter = 30000000;

   Cplx a(0.5,0.02);
   Cplx b(0.6,-0.01);
   Cplx c(0.0,0.0);
   double arg = 1.4;
   double delta = 1.0/niter;

   // test the fast version
   double ck1 = wtime();
   for (i = 0; i < niter; i++)
   {
     c += b.powpre(arg);
     arg += delta;
   }
   double ck2 = wtime();
   printf("Result: %.15E %.15En",c.a[0],c.a[1]);
   c.a[0] = c.a[1] = 0.0;
   arg = 1.4;

   // test the slow version
   double ck3 = wtime();
   for (i = 0; i < niter; i++)
   {
     c += b.pownopre(arg);
     arg += delta;
   }
   double ck4 = wtime();
   printf("Result: %.15E %.15En",c.a[0],c.a[1]);



   printf("POW-PRE: %.6f seconds.n",ck2-ck1);
   printf("POW-NOPRE: %.6f seconds.n",ck4-ck3);


   return(0);
}


[/bash]

dghyams
Beginner
Just a friendly bump ;)
Dale_S_Intel
Employee
Sorry it took me a while to look into this. It seems you're correct. At first glance, it appears to me also that the two cases should be essentially identical. I assume the faster code is due to the optimizer using sincos in one case, dunno why it doesn't in the other. Also, the older version of 11.1 was faster. I'll look into it a little more carefully and let you know what I find.
Dale
Dale_S_Intel
Employee
I've submitted two issues on this case, DPD200162187 on the regression since 11.1.038 and DPD200162194 on the failure to use sincos in the one case. I'll post here when I have more info on them.
Thanks for pointing this out!
Dale
Alexander_W_Intel
Hi,

I wanted to give you an update. The issue (DPD200162194) with sin, cos, and sincos is fixed in our current beta compiler. The performance regression (DPD200162187) does not appear with the 12.0 compilers and later. I've done some performance measurements with your test case and different compiler versions:

intel64 cpro11.1 038
FAST: 0.290056 seconds.
SLOW: 0.220154 seconds.

intel64 cpro11.1 075
FAST: 0.020241 seconds.
SLOW: 0.223576 seconds.

intel64 compXE12.0 191
FAST: 0.018573 seconds.
SLOW: 0.012121 seconds.

intel64 latest beta compiler (with resolved bug(DPD200162194))
FAST: 0.014271 seconds.
SLOW: 0.012120 seconds.

With your test case, it seems to be a good idea to try the most recent compiler.

Alex

