Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

No matter how I adjust the optimization settings, it doesn't speed up at all.

losemind
Beginner
Hi all,

No matter how I adjust the optimization settings, I am unable to see any speed improvement. I am optimizing for ultimate speed. The PC I am working on has an Intel Xeon 2.4GHz (I tested it; it only supports up to SSE2). I am running a numerically intensive numerical integration program. What can I do?

Here is the original command line from MS VS2003.NET's C++ property panel in the project properties menu:

/c /O2 /I "C:\Program Files\MATLAB\R2007a\extern\include\win32" /D "WIN32" /D "NDEBUG" /D "_WINDOWS" /D "_USRDLL" /D "_MBCS" /D "try2_EXPORTS" /D "_WINDLL" /FD /EHsc /MT /Fo"Release/" /W3 /nologo /Wp64 /Zi /Gd /O3 /Ot /Og /QaxN /QParallel /QxN

----------

I've overridden it with:

/O3 /Ot /Og /QaxN /QParallel /QxN

And also tried

/QaxP /QParallel /QxP

--------------

But with no speedup at all.

I am also looking for a good cookbook/reference on speed optimization using the Intel C++ compiler, but I couldn't find a good optimization cookbook for this version, 10.1. In our applications we want ultimate speed for numerically intensive computations -- mostly numerical integration.

I am not sure how to perform PGO and the other high-level speed optimization techniques...

Thanks for your help!


Dale_S_Intel
Employee
When you say you get no speedup, what are you comparing against? Do you mean you get the same performance as the version built with MS VS2003.NET? Or do you mean that different optimization settings don't affect the performance?

Have you tried comparing -Od and -O2 as a sanity check?
Can you characterize your code or post a small example?
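
As a rough sketch of what I mean by a sanity check (kernel() here is just a hypothetical stand-in for your integration core, which I haven't seen), you can time the same code built with /Od and then with /O2:

#include <stdio.h>
#include <time.h>

extern void kernel(void);   /* stand-in for the numerical integration core */

int main(void)
{
    clock_t start, stop;
    int i;

    start = clock();
    for (i = 0; i < 100000; i++)    /* repeat enough times for a stable reading */
        kernel();
    stop = clock();
    printf("elapsed: %.3f s\n", (double)(stop - start) / CLOCKS_PER_SEC);
    return 0;
}

If the /Od and /O2 binaries give the same elapsed time, something other than the compiler (I/O, the 3rd party library, timing methodology) is dominating.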

Also, if you really want "ultimate speed" you might want to update your hardware a bit.

Dale
losemind
Beginner
Hi,

I just found out my PC actually has two Xeon 2.4GHz CPUs... each CPU supports up to SSE2...

Are there any ways that I can utilize these two CPUs?

Thanks!

------------------------

My program needs a 3rd party library, which is huge. So I guess it doesn't make sense for you to get the 3rd party library, especially since it is expensive. It's probably better for me to strip my program down a bit so it doesn't need that library, and then I will post it! Thanks!
Dale_S_Intel
Employee
>Are there any ways that I can utilize these two CPUs?

Yes, there are several possibilities, including OpenMP and various threading techniques.
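
As a very rough sketch of the OpenMP route (integrate_slice() is a hypothetical stand-in for your integrand; the real requirement is that the slices be independent of each other), built with /Qopenmp:

#include <omp.h>

/* hypothetical per-slice integrator; slices must not depend on each other */
double integrate_slice(double lo, double hi);

double integrate(double lo, double hi, int nslices)
{
    double sum = 0.0;
    double width = (hi - lo) / nslices;
    int i;

    /* each iteration integrates a disjoint sub-interval; the partial
       sums are combined by the reduction clause */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < nslices; i++)
        sum += integrate_slice(lo + i * width, lo + (i + 1) * width);

    return sum;
}

With two CPUs the runtime will split the iterations between two threads by default.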

If you get a small test case or kernel I can try to take a look at it.

I'm still curious, when you say you're not seeing a performance benefit, do you mean that you get the same performance as when you build with MS?

Dale

losemind
Beginner
MADschouten:
>Are there any ways that I can utilize these two CPUs?

Yes, there are several possibilities, including OpenMP and various threading techniques.

If you get a small test case or kernel I can try to take a look at it.

I'm still curious, when you say you're not seeing a performance benefit, do you mean that you get the same performance as when you build with MS?

Dale



Thanks a lot Dale.

Let me strip down my program a little bit so you don't have to download a 3rd party library, which is kind of big.

While I was playing with the optimization settings for my program, I got the following error message:


------------------
OMP abort: Initializing libguide.lib, but found libguide40.lib already initialized.
This may cause performance degradation and correctness issues.
Set environment variable KMP_DUPLICATE_LIB_OK=TRUE to ignore
this problem and force the program to continue anyway.
Please note that the use of KMP_DUPLICATE_LIB_OK is unsupported
and using it may cause undefined behavior.
For more information, please contact Intel Premier Support.

----------------

What's wrong?
losemind
Beginner
Okay, here is the core of my code -- I stripped it down to remove the references to the 3rd party library, so you will have no difficulty running it. It is a purely numerical evaluation, and with all the settings I've tried there was no improvement at all.

I want to make sure this part is as fast as possible.

The next step is to issue two copies of this core part to run in parallel, because I have two CPUs, although they are slow...

--------------------

inline void AA4(double t, double kappa, double *eta, double *rho, double y0, double theta,
double c, double g, double v, double eps, double x0, double xbar, double a,
double h, double khat, double lambda, double sigma2, double *AA1)
{
double t1;
double t10;
double t100;
double t102;
double t103;
double t105;
double t111;
double t113;
double t114;
double t115;
double t116;
double t117;
double t120;
double t121;
double t125;
double t129;
double t130;
double t131;
double t135;
double t136;
double t137;
double t14;
double t144;
double t145;
double t146;
double t147;
double t15;
double t150;
double t151;
double t152;
double t153;
double t154;
double t158;
double t159;
double t16;
double t160;
double t161;
double t162;
double t166;
double t168;
double t17;
double t170;
double t172;
double t174;
double t176;
double t177;
double t181;
double t182;
double t183;
double t184;
double t186;
double t188;
double t190;
double t191;
double t196;
double t197;
double t2;
double t20;
double t201;
double t205;
double t208;
double t21;
double t210;
double t216;
double t217;
double t218;
double t223;
double t225;
double t227;
double t228;
double t229;
double t23;
double t232;
double t233;
double t234;
double t235;
double t236;
double t239;
double t24;
double t240;
double t241;
double t243;
double t246;
double t247;
double t249;
double t25;
double t251;
double t255;
double t259;
double t260;
double t261;
double t267;
double t27;
double t275;
double t277;
double t28;
double t280;
double t284;
double t286;
double t290;
double t30;
double t32;
double t321;
double t325;
double t332;
double t35;
double t36;
double t37;
double t373;
double t374;
double t375;
double t376;
double t378;
double t379;
double t38;
double t380;
double t381;
double t386;
double t387;
double t388;
double t39;
double t390;
double t393;
double t394;
double t4;
double t40;
double t400;
double t401;
double t407;
double t408;
double t409;
double t41;
double t411;
double t414;
double t416;
double t419;
double t420;
double t422;
double t423;
double t426;
double t428;
double t429;
double t43;
double t432;
double t434;
double t440;
double t441;
double t45;
double t451;
double t452;
double t454;
double t455;
double t458;
double t459;
double t462;
double t463;
double t468;
double t469;
double t471;
double t474;
double t475;
double t476;
double t479;
double t480;
double t485;
double t486;
double t49;
double t493;
double t50;
double t500;
double t504;
double t505;
double t51;
double t510;
double t511;
double t516;
double t517;
double t519;
double t522;
double t524;
double t526;
double t53;
double t530;
double t54;
double t56;
double t6;
double t60;
double t61;
double t63;
double t67;
double t68;
double t69;
double t7;
double t73;
double t75;
double t77;
double t78;
double t79;
double t81;
double t85;
double t86;
double t88;
double t89;
double t91;
double t92;
double t94;
double t95;
double t97;
double t98;
{
t1 = g*eps;
t2 = -c+t1;
t4 = 1/h;
t6 = exp(-h*t);
t7 = 1.0-t6;
t10 = exp(t2*khat*t4*t7);
t14 = khat*t4*t7*g*v;
t15 = cos(t14);
t16 = t10*t15;
t17 = y0-theta;
t20 = exp(-kappa*t);
t21 = 1.0-t20;
t23 = t21/kappa;
t25 = t*theta;
t27 = rho[1];
t28 = eta[1];
t30 = t27*t28*lambda;
t32 = -t2;
t24 = t23/t28;
t35 = 1.0+t24*t32;
t36 = t35*t35;
t37 = t21*t21;
t38 = kappa*kappa;
t40 = t37/t38;
t41 = t28*t28;
t43 = g*g;
t45 = v*v;
t39 = t43*t45;
t49 = log(t36+t40/t41*t39);
t51 = c-t1+kappa*t28;
t53 = t51*t51;
t54 = t39;
t56 = 1/(t53+t54);
t60 = rho[2];
t61 = eta[2];
t63 = t60*t61*lambda;
t50 = t23/t61;
t67 = 1.0+t50*t32;
t68 = t67*t67;
t69 = t61*t61;
t75 = log(t68+t40/t69*t54);
t77 = c-t1+kappa*t61;
t79 = t77*t77;
t81 = 1/(t79+t54);
t85 = rho[0];
t86 = t85*lambda;
t88 = t45*t;
t89 = eta[0];
t91 = c-t1+kappa*t89;
t92 = t91*t91;
t94 = 1/(t92+t54);
t97 = rho[5];
t98 = t97*lambda;
t100 = eta[5];
t102 = c-t1+kappa*t100;
t103 = t102*t102;
t105 = 1/(t103+t54);
t73 = g*v;
t111 = atan2(t24*t73,t35);
t113 = v*t56;
t116 = t60*lambda;
t117 = t116*t32;
t125 = t97*t100*lambda;
t78 = t23/t100;
t129 = 1.0+t78*t32;
t130 = t129*t129;
t131 = t100*t100;
t137 = log(t130+t40/t131*t54);
t145 = atan2(t78*t73,t129);
t147 = v*t105;
t151 = rho[4];
t152 = eta[4];
t154 = t151*t152*lambda;
t95 = t23/t152;
t158 = 1.0+t95*t32;
t159 = t158*t158;
t160 = t152*t152;
t166 = log(t159+t40/t160*t54);
t168 = c-t1+kappa*t152;
t170 = t168*t168;
t172 = 1/(t170+t54);
t114 = t30*t49;
t115 = t51*t56;
t120 = t63*t75;
t121 = t77*t81;
t135 = t30*t111;
t136 = g*t113;
t144 = t125*t137;
t146 = t102*t105;
t150 = t125*t145;
t153 = g*t147;
t161 = t154*t166;
t162 = t168*t172;
t176 = t2*t17*t23+t25*t1+t114*t115/2.0+t120*t121/2.0-t86*t43*t88*t94-t98*
t43*t88*t105+t135*t136-t117*t*t121-t116*t43*t88*t81+t144*t146/2.0+t150*t153-t25
*c+t161*t162/2.0;
t177 = t151*lambda;
t181 = rho[3];
t182 = t181*lambda;
t183 = t182*t32;
t184 = eta[3];
t186 = c-t1+kappa*t184;
t188 = t186*t186;
t190 = 1/(t188+t54);
t196 = t27*lambda;
t197 = t196*t32;
t201 = t98*t32;
t208 = atan2(t95*t73,t158);
t210 = v*t172;
t216 = t177*t32;
t223 = atan2(t50*t73,t67);
t225 = v*t81;
t229 = t85*t89*lambda;
t174 = t23/t89;
t233 = 1.0+t174*t32;
t234 = t233*t233;
t235 = t89*t89;
t241 = log(t234+t40/t235*t54);
t249 = atan2(t174*t73,t233);
t251 = v*t94;
t255 = t181*t184*lambda;
t191 = t23/t184;
t259 = 1.0+t191*t32;
t260 = t259*t259;
t261 = t184*t184;
t267 = log(t260+t40/t261*t54);
t275 = atan2(t191*t73,t259);
t277 = v*t190;
t280 = t86*t32;
t205 = t186*t190;
t217 = t154*t208;
t218 = g*t210;
t227 = t63*t223;
t228 = g*t225;
t232 = t229*t241;
t236 = t91*t94;
t239 = t229*t249;
t240 = g*t251;
t243 = t255*t267;
t246 = t255*t275;
t247 = g*t277;
t284 = -t177*t43*t88*t172-t183*t*t205-t182*t43*t88*t190-t197*t*t115-t201*t*
t146+t217*t218-t196*t43*t88*t56-t216*t*t162+t227*t228+t232*t236/2.0+t239*t240+
t243*t205/2.0+t246*t247-t280*t*t236;
t286 = exp(t176+t284);
t290 = t73;
t321 = v*t;
t325 = t*g;
t332 = -t135*t115+t25*t290+t114*t136/2.0-t246*t205+t243*t247/2.0+t290*t17*
t23-t217*t162+t161*t218/2.0-t227*t121+t120*t228/2.0+t177*g*t321*t168*t172-t216*
t325*t210+t182*g*t321*t186*t190;
t373 = -t183*t325*t277+t196*g*t321*t51*t56-t197*t325*t113-t239*t236+t232*
t240/2.0-t150*t146+t144*t153/2.0+t86*g*t321*t91*t94+t98*g*t321*t102*t105-t280*
t325*t251+t116*g*t321*t77*t81-t201*t325*t147-t117*t325*t225;
t374 = t332+t373;
t375 = cos(t374);
t376 = t286*t375;
t378 = sin(t14);
t379 = t10*t378;
t380 = sin(t374);
t381 = t286*t380;
t386 = a*a;
t387 = t386*t386;
t388 = t386*sigma2;
t393 = sigma2*sigma2;
t394 = c*c;
t400 = t393*t43;
t401 = eps*eps;
t407 = sqrt(t387+4.0*t388*c-4.0*t388*t1+4.0*t393*t394-8.0*t393*c*t1+4.0*
t400*t401+4.0*t400*t45);
t408 = 2.0*t407;
t409 = 2.0*t386;
t411 = 4.0*sigma2*c;
t414 = 4.0*sigma2*g*eps;
t416 = sqrt(t408+t409+t411-t414);
t419 = exp(0.5*a*t);
t420 = t416*t419;
t422 = 0.25*t416*t;
t423 = cosh(t422);
t426 = sqrt(t408-t409-t411+t414);
t428 = 0.25*t426*t;
t429 = cos(t428);
t432 = sinh(t422);
t434 = sin(t428);
t440 = 0.25*t416*t423*t429-0.25*t426*t432*t434+0.5*a*t432*t429;
t441 = t440*t440;
t451 = 0.25*t426*t423*t429+0.25*t416*t432*t434+0.5*a*t423*t434;
t452 = t451*t451;
t454 = 1/(t441+t452);
t455 = t440*t454;
t458 = t426*t419;
t459 = t451*t454;
t462 = 0.25*t420*t455+0.25*t458*t459;
t463 = t462*t462;
t468 = 0.25*t458*t455-0.25*t420*t459;
t469 = t468*t468;
t471 = log(t463+t469);
t390 = a*xbar/sigma2;
t474 = exp(t390*t471);
t475 = (t16*t376-t379*t381)*t474;
t476 = atan2(t468,t462);
t479 = 2.0*t390*t476;
t480 = cos(t479);
t485 = (-t379*t376-t16*t381)*t474;
t486 = sin(t479);
t493 = t2*t432*t429+t290*t423*t434;
t500 = -t290*t432*t429+t2*t423*t434;
t504 = exp(t493*t440*t454+t500*t451*t454);
t505 = (t475*t480-t485*t486)*t504;
t510 = -t500*t440*t454+t493*t451*t454;
t511 = cos(t510);
t516 = (t485*t480+t475*t486)*t504;
t517 = sin(t510);
t519 = t505*t511+t516*t517;
t522 = c/g-eps;
t524 = t522*t522;
t526 = 1/(t524+t45);
t530 = t516*t511-t505*t517;
AA1[0] = t519*t522*t526+t530*t526*v;
AA1[1] = t530*t522*t526-t519*t526*v;
return;
}
}


gordan
Beginner
I'm afraid you're out of luck if that is really as organized as your code can be. No compiler will make that run faster. For a start, unless you are doing the same operation on multiple aligned variables (e.g. arrays), you'll see no benefit from vectorization. Without vectorization, you're not going to see much speed-up compared to the old 387 FPU.

Re-organize your similar data into arrays. Organize data so that where you are performing the same operation on multiple elements, it is done in a loop. Without that, you are out of luck, I'm afraid. Loops vectorize. Random combinations of operations don't.

Also have a look here:
http://lists.gnu.org/archive/html/help-gsl/2007-07/msg00057.html
http://lists.gnu.org/archive/html/help-gsl/2007-07/msg00058.html

We've had a reasonably good discussion on the subject of making things go faster on the GSL mailing list.
Dale_S_Intel
Employee
Yeah, I would also strongly suggest that you rewrite this. There are no loops in this kernel, so you're out of luck as far as automatically parallelized or vectorized loops go. It looks suspiciously like you're trying to do complex arithmetic, in which case I would suggest you avail yourself of the complex number support in C99. With that long list of variables, expressions, and assignments, you're making it difficult for the compiler to do much to help you. It almost looks like you're either trying to write asm in C or dusty Fortran in C.

If you want to parallelize this at a higher level, you could try OpenMP. Depending on what your call site looks like, you might have trouble making sure that there are no cross-iteration dependences.
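
For instance, if your call site evaluates AA4 over a grid of t values, and each call writes only to its own slot of the output array (an assumption on my part -- I haven't seen your caller, and tgrid[], n, and AA1out[] here are made-up names), a sketch would be:

/* sketch fragment: tgrid[], n, AA1out[], and the scalar parameters are
   assumed to be defined at the call site; iteration i owns AA1out[2*i..2*i+1] */
int i;
#pragma omp parallel for
for (i = 0; i < n; i++)
    AA4(tgrid[i], kappa, eta, rho, y0, theta, c, g, v, eps,
        x0, xbar, a, h, khat, lambda, sigma2, &AA1out[2*i]);

The scalar inputs are read-only and can be shared; the danger is only in any state the iterations write in common.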

I'm also curious about the numerical stability of this. I have no idea what valid input is for this function, but playing around with it in the debugger, there seemed to be a lot of values with wildly different orders of magnitude in a given expression, which is often a recipe for questionable output.

Dale
losemind
Beginner
Hi Dale,

Yes, it's complex arithmetic. You mentioned C99 -- how do I use it? We are using dual Xeon 2.4GHz CPUs supporting SSE2, and Intel Compiler 10.0 integrated with MS VS.NET 2003. Thank you very much!

Here is my colleague's reasoning for why we generate such chaotic code:

The question is about comparing four approaches: approaches 1 and 2 in C/C++ vs. approaches 1 and 2 in Fortran.


----------------------------

I am using the Intel C++ Compiler 10.1, which is integrated with MS VS 2003.NET, on dual Intel Xeon 2.4GHz CPUs which support only up to SSE2.

I have a question, which is a kind of preliminary study and thought experiment before I take pains to convert my program from C/C++ to Fortran.

My question is:

I read somewhere that Fortran has native support for complex numbers (double precision) and their operations, but I am not sure what that means in practice. To be more specific, I have a very complicated function f(x) which is complex-valued, and only the real part of f(x) is useful to me. x itself is real-valued.

I am trying to estimate the speed of two approaches:

1. Evaluate f(x) with all intermediate variables defined to be complex-valued (using a complex data structure or complex class), and finally extract and return the real part of the value;

2. Derive the closed-form expression for the real part of f(x). For my complicated function, I used Maple to automatically generate C code for the closed-form expression of the real part. The generated C code uses hundreds of intermediate variables to store common factors, and hundreds of multiplications. Although it looks very long, everything is done in the real domain.

In C/C++, my experiment shows that approach 2 outperforms approach 1 -- the speed-up is about 5-10x. Maybe that's just because I programmed approach 1 very badly.

--------------------------

Here is my question for the Fortran part (I am a little hesitant to convert my program into Fortran, because I am really green in Fortran, and what if, after so much extra work, the speedup is not worth the time):

Does the native support for complex numbers in Fortran mean that approach 2 in Fortran will have almost the same speed as approach 1 in Fortran?

I just wanted to hear some opinions before I spend much effort on the coding... Thanks a lot!

jimdempseyatthecove
Honored Contributor III

Almost all of your variables are short-lived. The persistence required is from 2 to a few statements in duration. I would suggest you reduce the number of short-lived variables.

losemind
Beginner
Why? I don't care about size, but I do care about speed...

JimDempseyAtTheCove:

Almost all of your variables are short-lived. The persistence required is from 2 to a few statements in duration. I would suggest you reduce the number of short-lived variables.

jimdempseyatthecove
Honored Contributor III

If you care about speed, you will reduce the number of temporary variables.

For example

If you have a function that has 100 temporary variables defined, but the code could be reduced to using and reusing 5 temporary variables, then the probability is good that most, if not all, of your temporary variables can be registerized (into the SSE XMM registers). This eliminates writing all that temp data to RAM (and then discarding it soon afterwards).
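
As a contrived sketch of the idea (not your actual expressions -- both fragments compute sin(a*b + c)*d):

double a = 1.0, b = 2.0, c = 3.0, d = 4.0;
double t1, t2, t3, t4, t;

/* single-use temporaries, in the style of the generated code */
t1 = a * b;
t2 = t1 + c;
t3 = sin(t2);
t4 = t3 * d;

/* the same computation with one reused temporary */
t = a * b;
t = t + c;
t = sin(t);
t = t * d;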

Jim Dempsey

losemind
Beginner
I agree. But if you notice, observing from the math, these temporary variables hold common factors, which means that if I don't create these temp variables, some common factors will have to be computed multiple times... I have no choice, right?
losemind
Beginner
Experts, please advise: what more C++ compiler optimization options/settings can I try?

For example, how do I tell the compiler to use as many registers as possible?

How do I instruct the compiler to maximize alignment? I don't care about space...
gordan
Beginner
I think you need to stop clutching at straws and accept that your code will not get any faster than it already is just by changing compiler options. If you are willing/able to re-write it in such a way that it vectorizes, you may see a speed-up of around an order of magnitude. You need to optimize your algorithm first, not your implementation.

Unfortunately, your code is extremely opaque and it is unclear what the overall purpose of it is. This makes it difficult to suggest any algorithms you could try.

Compilers use all suitable registers without being explicitly asked to. There is nothing significant you can do to improve on that short of changing your platform. x86 before x86-64 only has 8 general-purpose registers. x86-64 adds more, so that may help you if your OS and compiler can cope.

Alignment is only really an issue when vectorizing operations on arrays. You seem to have few arrays, and even fewer consistent operations on them. Having a quick glance at your code (and I have to say that is quite painful -- comparing Fortran to C is a bit like trying to write organisationally equivalent code in Java and assembly; it's a ridiculous idea), you could re-write some of it to gain performance. Consider this:

t27 = rho[1];
t28 = eta[1];
t30 = t27*t28*lambda;
...
t60 = rho[2];
t61 = eta[2];
t63 = t60*t61*lambda;
...
t85 = rho[0];
t86 = t85*lambda;
...
t97 = rho[5];
t98 = t97*lambda;
...
t151 = rho[4];
t152 = eta[4];
t154 = t151*t152*lambda;
...
t181 = rho[3];
t182 = t181*lambda;

You could re-write parts of this as follows:

static double temp1[6];
static double temp2[6];
static unsigned int i;
for (i = 0; i < 6; i++)
temp1[i] = rho[i] * lambda;

This will vectorize. You can then use the contents of temp1[] in your other calculations, in the above case something like:
for (i = 0; i < 6; i++)
temp2[i] = eta[i] * temp1[i];

You are also doing:
t27 = rho[1];
...
t30 = t27*t28*lambda;
...
t196 = t27*lambda;

You don't have to do (t27 * lambda) multiple times. If you do the loop above with temp1[], you can substitute it in. Partial caching is useful.

There are plenty of other similar things you can do, but you WILL have to rewrite your code to gain speed. There's plenty of performance to be gained there, but you will have to get your hands dirty. Optimize your algorithm. The compiler is a helpful tool, not a magic bullet that makes good code out of bad code -- and I'm afraid you are dealing with some extremely bad code here. You cannot optimize your implementation until you have optimized your underlying algorithm.
Dale_S_Intel
Employee
Agreed. The point here is that the code you've posted is essentially already compiled (by Maple?). It is very difficult for the compiler to go back, rediscover what the original source looked like, and do the right thing, so it's not surprising that changing switches doesn't help: the compiler doesn't have much to work with. You can do some tweaking, like converting some of the straight-line code into loops, but I don't know if that would help (e.g. a small 6-iteration loop is unlikely to benefit from vectorization). I would suggest you start over and implement it using C99's complex support. From 'icl -help' it looks like you can specifically enable that with
/Qstd=C99
but make sure it's being compiled as C source code, not as C++, which doesn't have the same complex support. Judicious googling for "C99 complex support" or similar should tell you how it works.
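
A minimal sketch of what that looks like (the function and expression here are made up, not your f(x)):

#include <complex.h>

/* toy complex-valued f(x) of a real x; only the real part is wanted */
double f_real(double x)
{
    double complex z = cexp(I * x) / (1.0 + I * x);   /* arbitrary example expression */
    return creal(z);
}

The compiler then sees the complex arithmetic as such, instead of hundreds of unrelated real-valued expressions.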

In general, you shouldn't spend time worrying about common subexpressions, short-lived variables, and the like, as the compiler does a good job of handling that sort of thing. It's when you try to do it yourself that you get into trouble.

I can't tell from this code whether it is likely to be parallelizable at a higher level (i.e. parallel calls to this function), but if it is, then OpenMP would be worth trying.

Dale