Is short vector math lib slow?

Olaf_Krzikalla · ‎10-29-2010

Hi,

in the following code snippet version 1 performs measurably faster than version 2:

--- begin of code ---
#define _mm_extract_pd(R, I) (*((double*)(&R)+I))
__m128d art_vect428;

// version 1:
ph[i] = atan(_mm_extract_pd(art_vect428, 0));
ph[i+1] = atan(_mm_extract_pd(art_vect428, 1));
// end of v1

// version 2:
_mm_storeu_pd(&ph[i], _mm_atan_pd(art_vect428));
// end of v2
// end of v2
--- end of code ---

Looking at the assembly shows that _mm_storeu_pd is decomposed into two writes (as is done in v1), so that can't be the reason. Are there any other explanations besides "_mm_atan_pd is slower than two calls to atan"? Other functions (sin, cos) seem to show similar behavior.

Best regards
Olaf Krzikalla

System: icc 11.1.48, Windows 7, Intel Core 2 Duo P8600
Cmdline: /c /O3 /Oi /Qipo /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MD /GS /Gy /fp:fast /Fo"Release/" /W3 /nologo /Zi

Feilong_H_Intel · ‎11-02-2010

Hi Olaf,

I'm interested in reproducing this issue. Could you please attach/upload your test program in this thread? You can make your reply private if needed.

Thanks,
Feilong

Olaf_Krzikalla · ‎11-02-2010

Hi Feilong,

while minimizing the code I found that the issue seems to be more complex. It happens only if the unroll pragma is active. If it is commented out, both versions run at about the same speed. The "not SLOW_VERSION" code achieves that speed with unrolling too, but the other one (SLOW_VERSION defined) is slower. Here is the code (Windows only); look for the SLOW_VERSION define:

--- begin of code ---

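// NOTE: the header names were lost from the post; the three below are
// assumed from what the code uses. _mm_atan_pd is an Intel SVML intrinsic
// and may need an additional compiler-specific header.
#include <stdio.h>
#include <math.h>
#include <emmintrin.h>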

#ifndef _mm_extract_pd
#define _mm_extract_pd(R, I) (*((double*)(&R)+I))
#endif

#define NOMINMAX
#include <windows.h>

double get_time(void)
{
static LARGE_INTEGER time64, ticksPerSecond;
if (ticksPerSecond.QuadPart == 0)
QueryPerformanceFrequency(&ticksPerSecond);

QueryPerformanceCounter(&time64);
return ((double)time64.QuadPart)/ticksPerSecond.QuadPart;
}

#define N 400 // number of sites
#define H 0.10 // space increment
#define TIME 1e-2 // total time_xx
#define TAU 1e-11 // time_xx increment

double s[1+N]; // arclength
__declspec(align(16)) double x[1+N]; // x-coordinate of grid points
__declspec(align(16)) double y[1+N]; // y-coordinate of grid points
__declspec(align(16)) double ph[1+N]; // angle with x-axis
double time_xx,tau;

void initialize(void);
void timeloop(void);

int main()
{
double startTime; int i;
initialize();
startTime = get_time();

while(time_xx<TIME) {
timeloop(); if(tau<4e-8) tau *= 1.001;
}
printf("time needed: %f sec\n", get_time() - startTime);
for(i=1;i<=5;i++) printf("%f, %f;", s[i], ph[i]);

return 0;
}

double func0(double x) { return tanh(2*x); }

void initialize(void)
{
int i; double dx,dy,ds=H; FILE *fp;
time_xx = 0.0; tau = TAU; x[0] = 0.0;
for(i=1;i<=N;i++) { x[i] = x[i-1]+ds; ds *= 1.00; }
for(i=0;i<=N;i++) { y[i] = func0(x[i]); }
s[0] = 0.0;
for(i=1;i<=N;i++) {
dx = x[i]-x[i-1]; dy = y[i]-y[i-1];
s[i] = s[i-1] + sqrt(dx*dx+dy*dy); ph[i] = atan(dy/dx); }
}

#define SLOW_VERSION

void timeloop(void)
{
__m128d dx_vect418, x_i_vect419, x_i_1_vect420, dy_vect421, y_i_vect422, y_i_1_vect423, temp_vect424, art_vect425, art_vect426, art_vect427, art_vect428, art_vect429;
double temp;
int i;
double dx, dy, vp, h, xp, yp, xpp, ypp;
time_xx += tau;
#pragma unroll(4)
for (i = 1; i <= N - 1; i += 2)
{
x_i_vect419 = _mm_loadu_pd(&(x[i]));
x_i_1_vect420 = _mm_loadu_pd(&(x[i - 1]));
dx_vect418 = _mm_sub_pd(x_i_vect419, x_i_1_vect420);
y_i_vect422 = _mm_loadu_pd(&(y[i]));
y_i_1_vect423 = _mm_loadu_pd(&(y[i - 1]));
dy_vect421 = _mm_sub_pd(y_i_vect422, y_i_1_vect423);
art_vect425 = _mm_mul_pd(dx_vect418, dx_vect418);
art_vect426 = _mm_mul_pd(dy_vect421, dy_vect421);
art_vect427 = _mm_add_pd(art_vect425, art_vect426);
temp_vect424 = _mm_sqrt_pd(art_vect427);
art_vect428 = _mm_div_pd(dy_vect421, dx_vect418);
#ifdef SLOW_VERSION
_mm_storeu_pd(&ph[i], _mm_atan_pd(art_vect428));
#else
ph[i] = atan(_mm_extract_pd(art_vect428, 0));
ph[i+1] = atan(_mm_extract_pd(art_vect428, 1));
#endif
s[i] = s[i - 1] + _mm_extract_pd(temp_vect424, 0);
s[(i + 1)] = s[(i - 1 + 1)] + _mm_extract_pd(temp_vect424, 1);
}
for (; i <= N; i++)
{
dx = x[i] - x[i - 1];
dy = y[i] - y[i - 1];
temp = sqrt(dx * dx + dy * dy);
ph[i] = atan(dy / dx);
s[i] = s[i - 1] + temp;
}
}

--- end of code ---

I hope you can shed some light on the issue

Best Olaf

/edit: I have changed the main routine to prevent IPO from optimizing away the atan calls entirely. Now the issue is reproducible regardless of the unroll pragma.

TimP · ‎11-02-2010

Performance variations with unrolling are common and are unlikely to be reproduced on different CPU models. A likely reason is whether or not the loop fits the Loop Stream Detector (LSD). I guess the largest loop body that fits the LSD might be achieved if it is possible to get 32-byte code alignment for the loop body. The worst situation may be when the loop doesn't activate the LSD but is not unrolled sufficiently to approach full performance.
I suspect yours may be one of the models that prefer massive unrolling, while the Core i7 style CPUs perform more consistently with unroll by 4. Westmere style CPUs sometimes appear to exhibit a higher unroll requirement again, something to do with the hardware register renaming.
The addition of a decoded instruction cache in future CPU models is supposed to help in avoiding problems with unfavorable amounts of unrolling.
Needless to say, unrolling makes it difficult to get performance when the loop count doesn't match the unroll factor.
Your code presents more complex issues than are likely to be dealt with effectively here.
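
The remark about loop counts that don't match the unroll factor can be sketched as follows. This is only an illustration with a hand-unrolled sum; the function name sum_unrolled4 and the factor 4 are assumptions, and a compiler's own unrolling may be structured differently:

```c
#include <stddef.h>

/* Sums n doubles with the main loop unrolled by 4 (illustrative factor).
   When n is not a multiple of 4, the last n % 4 elements are handled
   by the one-at-a-time remainder loop. */
double sum_unrolled4(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main body: four independent accumulators per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Remainder: runs 0 to 3 times. */
    for (; i < n; i++)
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}
```

For short trip counts the remainder loop can account for a noticeable share of the iterations, which is one way the benefit of unrolling gets lost.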

Olaf_Krzikalla · ‎11-03-2010

I don't think the issue is directly related to unrolling, as stated in my "/edit" note. The behavior is reproducible irrespective of "#pragma unroll".
Meanwhile I've traced both functions at the assembly level. atan does a lot of computation, while _mm_atan_pd loads some values from static memory. Maybe I've just encountered the dreaded cache issue again.
Of course this immediately raises the question of whether it's better to use math.h functions instead of SVML functions altogether.

Best Olaf

TimP · ‎11-03-2010

As far as I know, svml functions are intended primarily to support auto-vectorization. With IPP and MKL vector library functions also supported, the scope for explicit calling of svml functions is limited, and doesn't get much support priority. Ideally, is consistent with allowing icc to choose between scalar and svml functions. There is also if you are interested in more usage of Intel library functions.