$ /usr/local/intel/Compiler/11.1/038/bin/intel64/icc -msse4.1 -O3 intelbug.C

$ ./a.out

Result: 1.467233472551115E+07 -3.423849100156368E+05

Result: 1.467233472551115E+07 -3.423849100156368E+05

FAST: 0.477064 seconds.

SLOW: 0.490420 seconds.

$

$ /usr/local/intel/Compiler/11.1/069/bin/intel64/icc -msse4.1 -O3 intelbug.C

$ ./a.out

Result: 1.467233472551115E+07 -3.423849100156368E+05

Result: 1.467233472551115E+07 -3.423849100156368E+05

FAST: 0.022687 seconds.

SLOW: 0.432179 seconds.

$

$ /usr/local/intel/Compiler/11.1/072/bin/intel64/icc -msse4.1 -O3 intelbug.C

$ ./a.out

Result: 1.467233472551115E+07 -3.423849100156368E+05

Result: 1.467233472551115E+07 -3.423849100156368E+05

FAST: 0.025833 seconds.

SLOW: 0.685143 seconds.

This is on a Core 2 architecture, Linux. I looked at the assembly generated for powfast() and powslow(), and the main difference, as far as I can tell, is that the slow one calls the sincos() function, while the fast one calls sin() and then cos() to calculate the values required in the function. Here is what I'm thinking, in case it helps at all:

1) The same code should be generated for both functions, really. The only difference in the C++ code is where n gets multiplied. Surely dependency analysis should show that the same argument is being passed to sin and cos, and therefore sincos should get called.

2) the sincos() function itself has not gotten the major speed boost that apparently happened between rev. 038 and 069. In fact, it got slower still between revs 069 and 072.

The code that reproduces the problem is attached.

[bash]
#include <stdio.h>
#include <math.h>
#include <sys/time.h>

class Cplx {
private:
public:
  // data.
  double a[2];

  Cplx(double r, double i) { a[0] = r; a[1] = i; };

#if 1 // fast
  Cplx powfast(const double n) const {
    const double r = ::pow(a[0]*a[0] + a[1]*a[1],n/2.0); // r^n
    const double theta = ::atan2(a[1],a[0]); // theta; multiplied by n at each use
    Cplx lval(r*::cos(theta*n),r*::sin(theta*n));
    return lval;
  }
#endif

#if 1 // slow
  Cplx powslow(const double n) const {
    const double r = ::pow(a[0]*a[0] + a[1]*a[1],n/2.0); // r^n
    const double theta = ::atan2(a[1],a[0])*n; // theta*n
    Cplx lval(r*::cos(theta),r*::sin(theta));
    return lval;
  }
#endif

  const Cplx& operator+=(const Cplx &rval) {
    a[0] += rval.a[0];
    a[1] += rval.a[1];
    return *this;
  }
} ;

double wtime() {
  struct timeval thetime;
  gettimeofday(&thetime,NULL);
  return (double)thetime.tv_sec + (double)thetime.tv_usec/1.0e+6;
}

int main(int argc, char * argv[]) {
  int i;
  int niter = 30000000;
  Cplx a(0.5,0.02);
  Cplx b(0.6,-0.01);
  Cplx c(0.0,0.0);

  // test the fast version
  double ck1 = wtime();
  for (i = 0; i < niter; i++) {
    c += b.powfast(1.4);
  }
  double ck2 = wtime();
  printf("Result: %.15E %.15E\n",c.a[0],c.a[1]);

  c.a[0] = c.a[1] = 0.0;

  // test the slow version
  double ck3 = wtime();
  for (i = 0; i < niter; i++) {
    c += b.powslow(1.4);
  }
  double ck4 = wtime();
  printf("Result: %.15E %.15E\n",c.a[0],c.a[1]);

  printf("FAST: %.6f seconds.\n",ck2-ck1);
  printf("SLOW: %.6f seconds.\n",ck4-ck3);
  return(0);
}
[/bash]
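
For what it's worth, one way to avoid depending on whether the compiler fuses separate sin() and cos() calls is to ask the library for both values in one call. Below is a minimal sketch, assuming `<complex>` is acceptable for this code. The name `pow_polar` is hypothetical and not part of the posted test case. Whether `std::polar` ends up in a fused sincos routine is implementation-specific, and I have not checked that against any icc release.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Hypothetical free-function version of the Cplx power routine above.
// std::polar(rho, theta) returns rho*cos(theta) + i*rho*sin(theta), so the
// sine and cosine of the same angle are requested in a single call.
std::complex<double> pow_polar(const std::complex<double>& z, double n) {
    // std::norm(z) is re*re + im*im, so this is |z|^n.
    const double r = std::pow(std::norm(z), n / 2.0);
    // n * arg(z), premultiplied once as in the powslow() variant.
    const double theta = std::atan2(z.imag(), z.real()) * n;
    assert(r >= 0.0);  // std::polar expects a non-negative magnitude
    return std::polar(r, theta);
}
```

Calling `pow_polar(std::complex<double>(0.6, -0.01), 1.4)` should agree with both member functions to within rounding. It says nothing about speed, which would still have to be measured per compiler version.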

7 Replies

As you may have noticed, you're not the first to point out that 11.1.073 was introduced for several reasons, even though 12.0 should come out next month.

[bash]
double num = 1.4;
double delta = 1.0 / niter;
for (i = 0; i < niter; i++)
{
    c += b.powfast(num);
    num += delta;
}
[/bash]

Jim
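
The loop above changes the exponent on every iteration. Presumably the point is that a call with loop-invariant arguments can be hoisted out of the loop or folded, so the original test may time almost nothing. Here is a minimal sketch of a harness along those lines. The names `Timing` and `time_pow` are hypothetical, and `std::chrono` (C++11) is assumed in place of gettimeofday().

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Result of one timing run. Returning the accumulated sums keeps the work
// observable, so the loop body cannot be discarded as dead code.
struct Timing {
    double sum_re;
    double sum_im;
    double seconds;
};

// Hypothetical harness: raises (re + i*im) to a power niter times.
// delta is added to the exponent each iteration; delta == 0 reproduces the
// original loop-invariant test.
Timing time_pow(double re, double im, double n0, double delta, int niter) {
    assert(niter > 0);
    double n = n0, sr = 0.0, si = 0.0;
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < niter; ++i) {
        const double r = std::pow(re * re + im * im, n / 2.0);  // |z|^n
        const double theta = std::atan2(im, re) * n;            // n * arg(z)
        sr += r * std::cos(theta);
        si += r * std::sin(theta);
        n += delta;  // makes the argument loop-variant when delta != 0
    }
    const auto t1 = std::chrono::steady_clock::now();
    return {sr, si, std::chrono::duration<double>(t1 - t0).count()};
}
```

Comparing `time_pow(0.6, -0.01, 1.4, 0.0, niter)` with `time_pow(0.6, -0.01, 1.4, 1.0 / niter, niter)` on the same compiler gives a rough idea of how much of a measured speedup comes from hoisting rather than from faster math routines.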

OK, sorry about that... The example code has been corrected so that the argument to each variety of the pow() varies (I just modified the loops as per Jim's post above). Here are the new results. There still seems to be a 10-15% bias in favor of what I would call the faster method (premultiplying the n), which is at least good news: the code is performing as one might expect. What's bothering me is that I would have figured that the optimizer would generate identical code for both...

I don't read assembly very well, and the only reason that I focused on the sincos issue was that, when I generate the assembler for both versions and run it through diff, that was the one major difference that I noticed (and it's still the case in the updated test code as well). All other differences were changes in register naming, etc.

And the initial test was flawed, yes, but I still don't quite get why such a slight change in coding would cause such a change in the emitted code. The optimizer should be able to tell that neither n nor theta changes, and therefore it should use the sincos function in both cases.

The updated test code is pasted below.

[bash]
#include <stdio.h>
#include <math.h>
#include <sys/time.h>

class Cplx {
private:
public:
  // data.
  double a[2];

  Cplx(double r, double i) { a[0] = r; a[1] = i; };

#if 1
  Cplx pownopre(const double n) const {
    const double r = ::pow(a[0]*a[0] + a[1]*a[1],n/2.0); // r^n
    const double theta = ::atan2(a[1],a[0]); // theta; multiplied by n at each use
    Cplx lval(r*::cos(theta*n),r*::sin(theta*n));
    return lval;
  }
#endif

#if 1
  Cplx powpre(const double n) const {
    const double r = ::pow(a[0]*a[0] + a[1]*a[1],n/2.0); // r^n
    const double theta = ::atan2(a[1],a[0])*n; // theta*n
    Cplx lval(r*::cos(theta),r*::sin(theta));
    return lval;
  }
#endif

  const Cplx& operator+=(const Cplx &rval) {
    a[0] += rval.a[0];
    a[1] += rval.a[1];
    return *this;
  }
} ;

double wtime() {
  struct timeval thetime;
  gettimeofday(&thetime,NULL);
  return (double)thetime.tv_sec + (double)thetime.tv_usec/1.0e+6;
}

int main(int argc, char * argv[]) {
  int i;
  int niter = 30000000;
  Cplx a(0.5,0.02);
  Cplx b(0.6,-0.01);
  Cplx c(0.0,0.0);
  double arg = 1.4;
  double delta = 1.0/niter;

  // test the fast version
  double ck1 = wtime();
  for (i = 0; i < niter; i++) {
    c += b.powpre(arg);
    arg += delta;
  }
  double ck2 = wtime();
  printf("Result: %.15E %.15E\n",c.a[0],c.a[1]);

  c.a[0] = c.a[1] = 0.0;
  arg = 1.4;

  // test the slow version
  double ck3 = wtime();
  for (i = 0; i < niter; i++) {
    c += b.pownopre(arg);
    arg += delta;
  }
  double ck4 = wtime();
  printf("Result: %.15E %.15E\n",c.a[0],c.a[1]);

  printf("POW-PRE: %.6f seconds.\n",ck2-ck1);
  printf("POW-NOPRE: %.6f seconds.\n",ck4-ck3);
  return(0);
}
[/bash]

Just a friendly bump ;)

Dale

Thanks for pointing this out!

Dale

I wanted to give you some updates. The issue (DPD200162194) with sin, cos and sincos is fixed in our current beta compiler. The performance regression (DPD200162187) does not appear with 12.0 compilers and later. I've done some performance measurements with your testcase and different compiler versions:

**intel64 cpro11.1 038**

FAST: 0.290056 seconds.

SLOW: 0.220154 seconds.

**intel64 cpro11.1 075**

FAST: 0.020241 seconds.

SLOW: 0.223576 seconds.

**intel64 compXE12.0 191**

FAST: 0.018573 seconds.

SLOW: 0.012121 seconds.

**intel64 latest beta compiler (with resolved bug(DPD200162194))**

FAST: 0.014271 seconds.

SLOW: 0.012120 seconds.

With your testcase, it seems to be a good idea to test with the most recent compiler.

Alex
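
To make the comparison between versions easier to read, the SLOW/FAST ratio can be computed from the timings above. This is only a sketch: the numbers are copied from this post, and the names `Run`, `kRuns`, `slow_over_fast` and `print_ratios` are made up for illustration.

```cpp
#include <cassert>
#include <cstdio>

// One row of the timing table: compiler label and the two wall-clock times.
struct Run {
    const char* compiler;
    double fast_s;
    double slow_s;
};

// Timings copied from the measurements listed in this post.
static const Run kRuns[] = {
    {"11.1 038",    0.290056, 0.220154},
    {"11.1 075",    0.020241, 0.223576},
    {"12.0 191",    0.018573, 0.012121},
    {"latest beta", 0.014271, 0.012120},
};

// Ratio > 1 means the SLOW variant took longer than the FAST one.
double slow_over_fast(const Run& r) {
    assert(r.fast_s > 0.0);
    return r.slow_s / r.fast_s;
}

// Prints one line per compiler version with its SLOW/FAST ratio.
void print_ratios() {
    for (const Run& r : kRuns)
        std::printf("%-12s SLOW/FAST = %.2f\n", r.compiler, slow_over_fast(r));
}
```

By this measure the 11.1 075 build is the outlier, with SLOW roughly eleven times FAST, while for 12.0 and the beta the two variants are within a factor of two of each other.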
