Intel® C++ Compiler

Strange behaviour of libguide40.dll on AMD CPU

cat-i
Beginner

Hi,

there seems to be an issue with the Intel OpenMP implementation (libguide40.dll) running on AMD Athlon 64 X2 CPUs. Here is an overview of what happened and the workaround we found:

Our company has developed an image processing application that makes extensive use of OpenMP and is compiled with the Intel C++ compiler. After updating from icc version 10.0 to 10.1, the program suddenly crashed on the machines of one of our clients. It was really hard to figure out what actually caused the crash, since on ALL machines in our company the program worked perfectly, whilst on ALL machines of that client the program crashed (independently of any settings or working data).

After two weeks of searching, we figured out that the only difference between our machines and the client's was that ours had Intel processors (ranging from P3 up to C2D) whilst the client machines had AMD Athlon 64 X2 CPUs. We decided to buy an AMD machine with a similar CPU for our development team, and indeed the crash could be reproduced on that newly bought machine. It crashed even with all /Qx and /Qax compiler flags turned off; however, with OpenMP disabled it worked fine. (It was an exception thrown in libguide40.dll, so disabling OpenMP was one of the first things we tried.)

After enabling/disabling OpenMP for each source file, we finally found the culprit. It was actually quite a simple for loop that used a reduction with the + operator on an unsigned int. (Imagine a routine that counts black pixels in a binary image.) When we changed the code to use an atomic + operation instead of the reduction, it worked fine on all CPUs.
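To make the two variants concrete, here is a minimal sketch of the pattern described above. It is not the application code: the `BinaryImage` type and both function names are hypothetical, and "black" is assumed to mean a pixel value of 0.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a binary image: one byte per pixel, 0 = black.
using BinaryImage = std::vector<unsigned char>;

// Variant A: reduction on an unsigned int (the pattern that triggered the crash).
unsigned int count_black_reduction(const BinaryImage& img)
{
    // OpenMP 2.x requires a signed integer loop index, so the size must fit in int.
    assert(img.size() <= static_cast<std::size_t>(INT_MAX));
    const int n = static_cast<int>(img.size());
    unsigned int count = 0;

    #pragma omp parallel for reduction(+:count)
    for (int i = 0; i < n; ++i)
        if (img[i] == 0)
            count++;
    return count;
}

// Variant B: the workaround, a shared counter updated with an atomic increment.
unsigned int count_black_atomic(const BinaryImage& img)
{
    assert(img.size() <= static_cast<std::size_t>(INT_MAX));
    const int n = static_cast<int>(img.size());
    unsigned int count = 0;

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        if (img[i] == 0) {
            #pragma omp atomic
            count++;
        }
    }
    return count;
}
```

Both functions should return the same count for any input; the atomic variant simply pays for one synchronized increment per black pixel.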

Is this a bug in libguide40.dll? Mind you, when compiled with the 10.0 version of icc or with msvc, it worked fine on the AMD CPU.

I thought this short story might be helpful.

EDIT: The compiler that generated this bug was version 10.1.013 - I have not tried 10.1.014 yet. The target platform was IA32.

Dale_S_Intel
Employee
I haven't seen this problem personally.
It would be most helpful if you could submit an issue at premier.intel.com with a reproducible test case. Be sure to include the details of the AMD system on which you saw the problem. If you like, you can post the issue number here and I can track it.

Thanks!

Dale
cat-i
Beginner

I just tried to write a simple test case with a similar structure to the loop used in my application, but unfortunately it does not reproduce the error. I'm really clueless right now, so here is the code snippet from the application that produces the error:

    void BIT_IMG_CALL BIT_FindExpectation(XBitmap* pBmp,CRect* pRect,RGBQUAD back,unsigned int* exY,unsigned int* exU,unsigned int* exV,unsigned int* varY,unsigned int* varU,unsigned int* varV,CProgressCtrl* pProg)
    {
        float y,u,v;
        float vy,vu,vv;
        unsigned int count;
        int yy;
        int y0;
        int w;
        int h;

        count=0;
        w=pRect->right-pRect->left;
        h=pRect->bottom-pRect->top;

        if(pProg!=NULL)
        {
            pProg->SendNotifyMessage(PBM_SETRANGE32,0,h);
            pProg->SendNotifyMessage(PBM_SETPOS,0,0);
            pProg->SendNotifyMessage(PBM_SETSTEP,1,0);
        }

        y=0.0f;
        u=0.0f;
        v=0.0f;
        vy=0.0f;
        vu=0.0f;
        vv=0.0f;
        y0=pRect->top;

        #pragma omp parallel for schedule(guided) reduction(+:count)
        for(yy=y0;yy<pRect->bottom;yy++)
        {
            int xx;
            int count2;
            count2=0;
            float YL,UL,VL;
            float VYL,VUL,VVL;
            YL=0.0f;
            UL=0.0f;
            VL=0.0f;
            VYL=0.0f;
            VUL=0.0f;
            VVL=0.0f;

            /* The rest of this loop header was cut off in the post; the bound w is presumed. */
            #pragma omp parallel for schedule(dynamic) reduction(+:count2)
            for(xx=0;xx<w;xx++)
            {
                RGBQUAD pixel;
                RGBQUAD norm;
                pixel=pBmp->GetPixel(xx+pRect->left,yy);
                norm=BIT_NormalizeRGB(pixel);
                if(!BIT_EqualRGB(norm,back))
                {
                    float y2;
                    float u2;
                    float v2;
                    YUVQUAD yuv=BIT_rgb2yuv(pixel);
                    y2=(float)yuv.yuvY;
                    u2=-128.f+(float)yuv.yuvU;
                    v2=-128.f+(float)yuv.yuvV;
                    #pragma omp atomic
                    YL+=y2;
                    #pragma omp atomic
                    VYL+=y2*y2;
                    #pragma omp atomic
                    UL+=u2;
                    #pragma omp atomic
                    VUL+=u2*u2;
                    #pragma omp atomic
                    VL+=v2;
                    #pragma omp atomic
                    VVL+=v2*v2;
                    count2++;
                }
            }

            if(pProg!=NULL)
            {
                pProg->SendNotifyMessage(PBM_STEPIT,0,0);
            }

            count+=count2;
            #pragma omp atomic
            y+=YL;
            #pragma omp atomic
            u+=UL;
            #pragma omp atomic
            v+=VL;
            #pragma omp atomic
            vy+=VYL;
            #pragma omp atomic
            vu+=VUL;
            #pragma omp atomic
            vv+=VVL;
        }

        if(pProg!=NULL)
            pProg->SendNotifyMessage(PBM_SETPOS,h,0);

        if(count!=0)
        {
            float C=(float)count;
            y/=C;
            u/=C;
            v/=C;
            vy/=C;
            vu/=C;
            vv/=C;
            *exY=(unsigned int)y;
            *exU=(unsigned int)(128.f+u);
            *exV=(unsigned int)(128.f+v);
            *varY=(unsigned int)sqrt(vy-y*y);
            *varU=(unsigned int)sqrt(vu-u*u);
            *varV=(unsigned int)sqrt(vv-v*v);
        }
        else
        {
            YUVQUAD yuv;
            yuv=BIT_rgb2yuv(back);
            *exY=yuv.yuvY;
            *exU=yuv.yuvU;
            *exV=yuv.yuvV;
            *varY=0;
            *varU=0;
            *varV=0;
        }
    }

This function calculates the expectation values of "non-background" pixels in the YUV color space.

The above function causes libguide40.dll to throw an exception, however when I set up a similar function in a console application it does not throw an exception in libguide40.dll. :-(

If the reductions in the above loops are replaced by corresponding atomic statements (remove the "reduction(+:count)" and insert "#pragma omp atomic" before count+=count2) no exception is thrown.

Since this workaround does the trick for me, I do not want to invest much more time in reproducing the error. Hopefully this is helpful, though.

EDIT: The exact CPU used on our development AMD machine is an AMD Athlon 64 X2 3800+ (Manchester), Socket 939, Family F, Model B, Stepping 1, Revision BH-E4.

The CPU on the client machine that also reproduces this behaviour is an AMD Athlon 64 X2 3600+ (no further details available).

Both machines run Windows XP SP2 (standard 32-bit version).

jimdempseyatthecove
Honored Contributor III

Cat-i,

I am not from Intel but I use AMD Opteron CPUs.

Interesting, I haven't seen this problem but I do not normally use reduction operators.

If you find that the ATOMIC introduces too much overhead an alternative is to create thread private variables for your partial sums. Before your tally loop, enter a parallel region to zero out the thread private areas. Run the loop summing into the thread private variables. Then after the loop, enter a parallel region which uses the ATOMIC to sum into a global tally.
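The steps above can be sketched in C++ roughly as follows. This is a simplified sketch, not code from the thread: it uses a variable declared inside the parallel region as the per-thread partial sum (instead of a `threadprivate` variable) and folds the zeroing, the summing and the final tally into one parallel region. The names `sum_with_private_partials` and `values` are hypothetical.

```cpp
#include <vector>

// Sums `values` with one private partial sum per thread and a single
// atomic update per thread, instead of one atomic per element.
float sum_with_private_partials(const std::vector<float>& values)
{
    const int n = static_cast<int>(values.size());
    float total = 0.0f;

    #pragma omp parallel
    {
        // Declared inside the parallel region, so every thread gets its own copy.
        float partial = 0.0f;

        // The loop iterations are shared among the threads of the region.
        #pragma omp for nowait
        for (int i = 0; i < n; ++i)
            partial += values[i];

        // One synchronized update per thread into the shared tally.
        #pragma omp atomic
        total += partial;
    }
    return total;
}
```

With this layout the number of atomic operations depends on the thread count rather than on the number of pixels.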

Also, in my experience ATOMIC does not always work (in my case on REAL(8) in Intel Visual Fortran), so I have been resorting to named critical sections.
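In C++ a named critical section looks roughly like the sketch below. The section name `tally_update` and the function name are made up for illustration, and the example accumulates a double because that is the type mentioned above.

```cpp
#include <vector>

// Accumulates the sum of squares, guarding the shared update with a
// named critical section instead of an atomic directive.
double sum_squares_critical(const std::vector<double>& values)
{
    const int n = static_cast<int>(values.size());
    double total = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        // Computed outside the critical section to keep the guarded part short.
        const double sq = values[i] * values[i];

        // Only one thread at a time may execute a critical section with this name.
        #pragma omp critical(tally_update)
        total += sq;
    }
    return total;
}
```

A critical section is generally heavier than an atomic update, so it is usually worth combining it with per-thread partial sums.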

Failing that, you can also use the InterlockedCompareExchange functions.

Jim Dempsey

cat-i
Beginner

Jim,

thank you for those brilliant ideas. I will definitely give threadprivate variables a try (I kind of feel bad for not having thought of that myself :-] ).

I can't really comment on atomics not working. So far they seem to work pretty well for me; however, I think it's crucial to keep in mind that they only synchronize the assignment operation. For example, something like

    #pragma omp atomic
    VVL=v2*v2;

will not synchronize the multiplication. At least that's how I understood the documentation.

I'm not sure about the exact behaviour in statements like

    #pragma omp atomic
    VVL+=v2*v2;

since this could actually be compiled as

    VVL=VVL+v2*v2;

where the value of VVL might have changed between the memory read and the memory store (resulting in an erroneous calculation). Does anyone know the precise behaviour there?
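For reference, as far as the OpenMP specification goes, `#pragma omp atomic` applied to a statement of the form `x += expr` makes the read-modify-write of `x` atomic, while the evaluation of `expr` itself is not synchronized. So `VVL+=v2*v2;` should not lose updates as long as `v2` is private to each thread. A plain assignment such as `VVL=v2*v2;` is not one of the statement forms the plain `atomic` directive accepts. A minimal sketch (the function name and the input vector are hypothetical):

```cpp
#include <vector>

// Accumulates v[i]*v[i] into a shared float using an atomic update.
float sum_of_squares_atomic(const std::vector<float>& v)
{
    const int n = static_cast<int>(v.size());
    float acc = 0.0f;

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        // v[i] * v[i] is evaluated by the current thread without any
        // synchronization; only the load-add-store on acc is atomic.
        #pragma omp atomic
        acc += v[i] * v[i];
    }
    return acc;
}
```

The result can still differ in the last bits from a serial sum, because floating-point addition is not associative and the order of updates varies between runs.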

Thanks,

cat-i

jimdempseyatthecove
Honored Contributor III

cat-i

My problem with ATOMIC was in Intel Visual Fortran using real(8) or double in C-speak.

If you find that thread-local storage variables add too much overhead, a lower-overhead method is to use a parallel region for all threads and then divide up the loop yourself.
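A minimal sketch of dividing the loop by hand, assuming a simple contiguous split of the index range. The function name `count_nonzero_manual` and the chunking scheme are illustrative choices, not something prescribed in this thread; `omp_get_thread_num` and `omp_get_num_threads` are the standard OpenMP runtime calls.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Counts non-zero entries; each thread works on its own contiguous chunk.
unsigned int count_nonzero_manual(const std::vector<int>& data)
{
    const long long n = static_cast<long long>(data.size());
    unsigned int total = 0;

    #pragma omp parallel
    {
        long long tid = 0;
        long long nthreads = 1;   // fallback when compiled without OpenMP
#ifdef _OPENMP
        tid = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
        // Thread tid handles the index range [begin, end).
        const long long chunk = (n + nthreads - 1) / nthreads;
        const long long begin = std::min(n, tid * chunk);
        const long long end = std::min(n, begin + chunk);

        unsigned int partial = 0;  // stack variable, private to this thread
        for (long long i = begin; i < end; ++i)
            if (data[static_cast<std::size_t>(i)] != 0)
                ++partial;

        // One atomic update per thread.
        #pragma omp atomic
        total += partial;
    }
    return total;
}
```

A static split like this ignores load imbalance between rows, which may matter if the per-pixel work varies a lot.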

A second alternative is to use nested omp parallel for loops where the outer loop iterates with a larger stride, e.g. 128 indices. It can then clear a stack-local sum variable before entering the inner loop, which performs the iterations for the members of the stride (e.g. 128; the last outer iteration may cover fewer than the group size). Then, on exit from the inner loop, perform the reduction using the atomic. This will reduce the adverse cache interaction between threads (e.g. 1/128th the number of collisions/invalidations).
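The blocked variant might look like the sketch below. It is a simplification under stated assumptions: the inner loop is kept serial within each block rather than being a second parallel loop, the block size of 128 is taken from the example above, and the function name is hypothetical.

```cpp
#include <vector>

// Counts non-zero entries in blocks of 128 elements, with one atomic
// update per block instead of one per element.
unsigned int count_nonzero_blocked(const std::vector<int>& data)
{
    const int kBlock = 128;
    const int n = static_cast<int>(data.size());
    const int blocks = n / kBlock + (n % kBlock != 0 ? 1 : 0);
    unsigned int total = 0;

    #pragma omp parallel for
    for (int b = 0; b < blocks; ++b) {
        const int begin = b * kBlock;
        // The last block may hold fewer than kBlock elements.
        const int end = (n - begin < kBlock) ? n : begin + kBlock;

        unsigned int local = 0;   // stack-local sum, cleared for every block
        for (int i = begin; i < end; ++i)
            if (data[i] != 0)
                ++local;

        // The shared tally is touched once per block.
        #pragma omp atomic
        total += local;
    }
    return total;
}
```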

Jim
