
In the following code snippet, version 1 performs measurably faster than version 2:

--- begin of code ---

#define _mm_extract_pd(R, I) (*((double*)(&R)+I))

__m128d art_vect428;

// version 1:
ph[i]   = atan(_mm_extract_pd(art_vect428, 0));
ph[i+1] = atan(_mm_extract_pd(art_vect428, 1));
// end of v1

// version 2:
_mm_storeu_pd(&ph[i], _mm_atan_pd(art_vect428));
// end of v2

--- end of code ---

Looking at the assembly shows that _mm_storeu_pd is decomposed into two writes (just as in v1), so that can't actually be the reason. Are there any other explanations besides "_mm_atan_pd is slower than two calls to atan"? Other functions (sin, cos) seem to show similar behavior.

Best regards

Olaf Krzikalla

System: icc 11.1.48, windows 7, Intel Core 2 Duo P8600

Cmdline: /c /O3 /Oi /Qipo /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MD /GS /Gy /fp:fast /Fo"Release/" /W3 /nologo /Zi



5 Replies


I'm interested in reproducing this issue. Could you please attach/upload your test program in this thread? You can make your reply private if needed.

Thanks,

Feilong


While minimizing the code I found that the issue is more complex. It happens only if the unroll pragma is active; if it is commented out, both versions run at about the same speed. The non-SLOW_VERSION code achieves that speed with unrolling as well, but the SLOW_VERSION code is slower. Here is the code (Windows only); look for the SLOW_VERSION define:

--- begin of code ---

#include <stdio.h>
#include <math.h>
#include <emmintrin.h>

#ifndef _mm_extract_pd
#define _mm_extract_pd(R, I) (*((double*)(&R)+I))
#endif

#define NOMINMAX
#include <windows.h>

double get_time(void)
{
  static LARGE_INTEGER time64, ticksPerSecond;
  if (ticksPerSecond.QuadPart == 0)
    QueryPerformanceFrequency(&ticksPerSecond);
  QueryPerformanceCounter(&time64);
  return ((double)time64.QuadPart)/ticksPerSecond.QuadPart;
}

#define N 400      // number of sites
#define H 0.10     // space increment
#define TIME 1e-2  // total time_xx
#define TAU 1e-11  // time_xx increment

double s[1+N];                         // arclength
__declspec(align(16)) double x[1+N];   // x-coordinate of grid points
__declspec(align(16)) double y[1+N];   // y-coordinate of grid points
__declspec(align(16)) double ph[1+N];  // angle with x-axis

double time_xx, tau;

void initialize(void);
void timeloop(void);

int main()
{
  double startTime; int i;
  initialize();
  startTime = get_time();
  while(time_xx < TIME) { timeloop(); if(tau < 4e-8) tau *= 1.001; }
  printf("time needed: %f sec\n", get_time() - startTime);
  for(i=1;i<=5;i++) printf("%f, %f;", s[i], ph[i]);
  return 0;
}

double func0(double x) { return tanh(2*x); }

void initialize(void)
{
  int i; double dx, dy, ds = H; FILE *fp;
  time_xx = 0.0; tau = TAU; x[0] = 0.0;
  for(i=1;i<=N;i++) { x[i] = x[i-1]+ds; ds *= 1.00; }
  for(i=0;i<=N;i++) { y[i] = func0(x[i]); }
  s[0] = 0.0;
  for(i=1;i<=N;i++) {
    dx = x[i]-x[i-1]; dy = y[i]-y[i-1];
    s[i] = s[i-1] + sqrt(dx*dx+dy*dy); ph[i] = atan(dy/dx);
  }
}

#define SLOW_VERSION

void timeloop(void)
{
  __m128d dx_vect418, x_i_vect419, x_i_1_vect420, dy_vect421, y_i_vect422, y_i_1_vect423, temp_vect424, art_vect425, art_vect426, art_vect427, art_vect428, art_vect429;
  double temp;
  int i;
  double dx, dy, vp, h, xp, yp, xpp, ypp;
  time_xx += tau;
#pragma unroll(4)
  for (i = 1; i <= N - 1; i += 2)
  {
    x_i_vect419 = _mm_loadu_pd(&(x[i]));
    x_i_1_vect420 = _mm_loadu_pd(&(x[i - 1]));
    dx_vect418 = _mm_sub_pd(x_i_vect419, x_i_1_vect420);
    y_i_vect422 = _mm_loadu_pd(&(y[i]));
    y_i_1_vect423 = _mm_loadu_pd(&(y[i - 1]));
    dy_vect421 = _mm_sub_pd(y_i_vect422, y_i_1_vect423);
    art_vect425 = _mm_mul_pd(dx_vect418, dx_vect418);
    art_vect426 = _mm_mul_pd(dy_vect421, dy_vect421);
    art_vect427 = _mm_add_pd(art_vect425, art_vect426);
    temp_vect424 = _mm_sqrt_pd(art_vect427);
    art_vect428 = _mm_div_pd(dy_vect421, dx_vect418);
#ifdef SLOW_VERSION
    _mm_storeu_pd(&ph[i], _mm_atan_pd(art_vect428));
#else
    ph[i]   = atan(_mm_extract_pd(art_vect428, 0));
    ph[i+1] = atan(_mm_extract_pd(art_vect428, 1));
#endif
    s[i]     = s[i - 1] + _mm_extract_pd(temp_vect424, 0);
    s[i + 1] = s[(i - 1) + 1] + _mm_extract_pd(temp_vect424, 1);
  }
  for (; i <= N; i++)
  {
    dx = x[i] - x[i - 1];
    dy = y[i] - y[i - 1];
    temp = sqrt(dx * dx + dy * dy);
    ph[i] = atan(dy / dx);
    s[i] = s[i - 1] + temp;
  }
}

--- end of code ---

I hope you can shed some light on the issue

Best Olaf

/edit: I have changed the main routine to prevent IPO from optimizing away the atan calls altogether. Now the issue is reproducible even without the unroll pragma.



I suspect yours may be one of the models which prefer massive unrolling, while the Core i7 style CPUs perform more consistently with unroll by 4. Westmere style CPUs sometimes appear to exhibit an even higher unroll requirement. Something about the hardware register renaming action.

The addition of decoded instruction cache in future CPU models is supposed to help in avoiding problems with unfavorable amounts of unrolling.

Needless to say, unrolling makes it difficult to get good performance when the loop trip count doesn't match the unroll factor.

Your code presents more complex issues than are likely to be dealt with effectively here.


Meanwhile I've traced both functions at the assembly level: atan does a lot of computation, while _mm_atan_pd loads some values from static memory (lookup tables). Maybe I've just encountered the dreaded cache issue again.

Of course this immediately raises the question of whether it's better to use the math.h functions instead of the SVML functions at all.

Best Olaf

