Hi guys,
I was implementing an image processing function when I stumbled upon a problem I do not quite understand. Essentially, I apply a custom filter to an image, and I get noticeably different timings depending on the type of the input data.
The piece of code is fairly simple. For each pixel in an image, it processes that pixel's neighbours and outputs a float at that pixel. Most of the processing is done in floating point, as you can see from this excerpt of the code:
float sobelX, sobelY, G11=0.0f, G22=0.0f;
for(row=1 ; row<height-1 ; row++){
	for(col=1 ; col<width-1 ; col++){
		sobelX = 0.2f * image[row][col-1] -
		         0.2f * image[row][col+1];
		sobelY = 0.2f * image[row-1][col] -
		         0.2f * image[row+1][col];
		G11 = sobelX*sobelX;
		G22 = sobelY*sobelY;
		Output[row][col] = (float)sqrt(G11 + G22);
	}
}
Now, if the matrix image is made of int, this loop takes 0.28 seconds. However, if image is made of float, it takes only 0.13 seconds. Such a big difference! Can someone enlighten me as to why that is the case?
Thanks in advance
Alex
Dear Alex,
You fail to mention the compiler version and switches used, or to provide a complete example. Assuming you use CPU-specific switches, the innermost loop is vectorized for both int and float when I compile something that looks similar to your fragment. Perhaps, in the full context, float vectorizes but int does not? Without more information, I am obviously forced to guess. Can you please try /Qvec-report2 (Windows) or -vec-report2 (Linux) in that case, or send me the full context?
Aart Bik
Thanks a lot for the answer. It took me some time, but I now have a stand-alone version (given at the end of this message). As for the information you asked for, I am using the latest Intel compiler on a Windows XP machine with a Pentium 4 630 processor (classification 0F434). I am actually compiling from the Visual Studio interface rather than the command line. The compiler switches displayed in the Visual Studio project property dialog are given below.
The interesting thing is that the behaviour I was referring to last week does not always occur; rather, it occurs when it should not. If "Use Intel Processor Extensions" is set to "P4 with SSE3", the int/float pipeline is slower (0.14 vs. 0.25), whereas if this option is set to "None", the code is (almost) as fast in both cases (0.15 vs. 0.14). That does not make any sense to me. Any ideas why?
Thanks in advance
E.
#### COMPILER SWITCHES
compiler: /GL /c /Ox /Og /Ob2 /Oi /Ot /GA /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MD /GS /GR /Fo"Release/" /W3 /nologo /Wp64 /Zi /Gd
linker: /LTCG kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /OUT:"C:\ImSense\Code\Optimisation\Optim01\Release/Optim01.exe" /INCREMENTAL:NO /NOLOGO /TLBID:1 /DEBUG /PDB:"C:\ImSense\Code\Optimisation\Optim01\Release\Optim01.pdb" /SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /IMPLIB:"C:\ImSense\Code\Optimisation\Optim01\Release/Optim01.lib" /MACHINE:X86 /MANIFEST /MANIFESTFILE:"Release\Optim01.exe.intermediate.manifest"
##### CUT HERE #####
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

LARGE_INTEGER start, end, frequency;
float totalHi, totalLo;

void resetTimer(void){
	start.LowPart = start.HighPart = end.LowPart = end.HighPart = 0;
	totalHi = totalLo = 0.0f;
	QueryPerformanceCounter(&start);
}

void stopTimer(char* msg){
	float total_time=0.0f, frequency_time=0.0f;
	QueryPerformanceCounter(&end);
	total_time = ((float)end.HighPart - (float)start.HighPart) * (float)pow(2.0f,32) +
	             ((float)end.LowPart  - (float)start.LowPart);
	QueryPerformanceFrequency(&frequency);
	frequency_time = (float)((frequency.HighPart) * (pow(2.0f,32))) + (float)frequency.LowPart;
	printf("%s: %f\n", msg, total_time / frequency_time);
}

#define PIXEL(img, width, row, col) ((img) + ((row) * width) + (col))

void Benchmark1(unsigned char* input, int width, int height){
	int row=0, col=0;
	float sobelX, sobelY, G11=0.0f, G22=0.0f;
	float *tmp2=NULL, *output=NULL;
	int *tmp3=NULL;

	output = (float*)malloc(width*height*sizeof(float));
	tmp2   = (float*)malloc(width*height*sizeof(float));
	tmp3   = (int*)  malloc(width*height*sizeof(int));
	QueryPerformanceFrequency(&frequency);

	///// FLOAT PIPELINE WITH FLOAT INPUT
	// create a float copy of input to test the float pipeline
	for(row=0 ; row<height ; row++){
		for(col=0 ; col<width ; col++){
			*PIXEL(tmp2, width, row, col) = (float) *PIXEL(input, width, row, col);
		}
	}
	resetTimer();
	for(row=1 ; row<height-1 ; row++){
		for(col=1 ; col<width-1 ; col++){
			sobelX = 0.5f * *PIXEL(tmp2, width, row, col-1) -
			         0.5f * *PIXEL(tmp2, width, row, col+1);
			sobelY = 0.5f * *PIXEL(tmp2, width, row-1, col) -
			         0.5f * *PIXEL(tmp2, width, row+1, col);
			G11 += sobelX*sobelX;
			G22 += sobelY*sobelY;
			// then compute the gradient by square-rooting the accumulated value
			*PIXEL(output, width, row, col) = (float)sqrt(G11 + G22);
		}
	}
	stopTimer("bench 1, all float pipeline");

	///// FLOAT PIPELINE WITH INT INPUT
	// create an int copy of input to test the int pipeline
	for(row=0 ; row<height ; row++){
		for(col=0 ; col<width ; col++){
			*PIXEL(tmp3, width, row, col) = (int) *PIXEL(input, width, row, col);
		}
	}
	resetTimer();
	for(row=1 ; row<height-1 ; row++){
		for(col=1 ; col<width-1 ; col++){
			sobelX = 0.5f * *PIXEL(tmp3, width, row, col-1) -
			         0.5f * *PIXEL(tmp3, width, row, col+1);
			sobelY = 0.5f * *PIXEL(tmp3, width, row-1, col) -
			         0.5f * *PIXEL(tmp3, width, row+1, col);
			G11 += sobelX*sobelX;
			G22 += sobelY*sobelY;
			// then compute the gradient by square-rooting the accumulated value
			*PIXEL(output, width, row, col) = (float)sqrt(G11 + G22);
		}
	}
	stopTimer("bench 1, int/float pipeline");

	free(output); free(tmp2); free(tmp3);
	return;
}

int main(void){
	// variables representing an input image
	unsigned char* input;
	int width, height;
	// misc variable
	int loop;

	// 1st set up the image
	width  = 3650;
	height = 2730;
	input = (unsigned char*)malloc(width*height*sizeof(unsigned char));
	srand(rand());
	for(loop=0 ; loop<width*height ; loop++){
		input[loop] = (unsigned char) rand();
	}
	Benchmark1(input, width, height);

	// release resources
	free(input);
	return 0;
}
##### CUT HERE #####
Dear Alex,
Using the 9.1 compilers, plain command line, and running on a 2.4GHz Core 2 Duo processor, this is what I get:
/O2
bench 1, all float pipeline: 0.394522
bench 1, int/float pipeline: 0.400839
/QxP [could use /QxT as well on this machine]
joho.cpp(86) : (col. 1) remark: LOOP WAS VECTORIZED.
joho.cpp(132) : (col. 1) remark: LOOP WAS VECTORIZED.
bench 1, all float pipeline: 0.267814
bench 1, int/float pipeline: 0.276876
Ironically, I don't have a Pentium 4 processor available at the moment. In your output window, do you see the same two loops vectorized (ignoring line changes due to my cut-and-pasting)?
Aart Bik
http://www.aartbik.com/
I indeed saw the message "LOOP WAS VECTORIZED" twice. Could you give me the full command line you used for compiling? That would be very useful so I can try to replicate your results. Thanks in advance.
Also, and this is not really the point, but your timings seem rather slow. I was expecting a Core 2 Duo to outperform a P4 significantly.
Thanks again
Alex
From the command line, I simply use
icl -O2 joho.cpp // default optimization
icl -QxN joho.cpp // Pentium 4 optimization
By the way, I got my hands on a P4 here in the lab and can confirm something strange:
/O2
bench 1, all float pipeline: 0.162543
bench 1, int/float pipeline: 0.149550
/QxN
bench 1, all float pipeline: 0.143731
bench 1, int/float pipeline: 0.239317
Hi Alex,
Loops 86 and 132 are simple initialization loops outside the actual timed region of your kernels. When they are vectorized, register allocation changes downstream, which unfortunately causes the actual kernels to slow down. You can eliminate this effect by placing #pragma novector before those two loops.
But it's much more interesting to vectorize the core computations in this case. By using /Qvec-report3, you will find that the Intel compiler does not vectorize the core loops at lines 98 and 144 due to data dependences on G11 and G22, which implement a hard-to-vectorize running sum. Since sqrt is an expensive operation, I suspect you can still benefit a lot from automatic vectorization, albeit with some source-level rewriting. In simplified form, rewrite something of the form:
for (i = 0; i < 1024; i++) {
	G += a[i];
	b[i] = sqrt( G );
}
into this form, using a temporary buffer (assuming space allows):
for (i = 0; i < 1024; i++) {
	G += a[i];
	tmp[i] = G;
}
for (i = 0; i < 1024; i++) {
	b[i] = sqrt( tmp[i] );
}
Then the second loop will vectorize with potentially a large speedup. Please give this a try in your example. You will need two buffers, but can keep the sqrt and addition in the vector loop.
Also, even after this rewriting, you may want to use:
#pragma ivdep
for(col=1 ; col<width-1 ; col++){
...
}
to remove some assumed, but nonexistent, dependences between the pointers. In fact, if you use this pragma right now, the loop with the running sum will vectorize, giving a great speedup but incorrect results.
Hope this helps. Please let me know if vectorization worked for you.
Aart Bik
Hello Aart,
thanks for the quick answer. I tried to compile from the command line, but a dialog box pops up:
- Title: "link.exe - Unable to locate component"
- Message: "This application has failed to start because mspdb80.dll was not found. Re-installing the application may fix this problem."
Obviously, no .exe is produced (only an .obj is generated). Any idea what is wrong with my setup?
Thanks
Alexis
Hi Alex,
If you use the .NET command prompt, the environment should have been set up correctly. This is irrelevant for getting the loop vectorized, however, since all of this should work just as well from the Visual Studio interface if you set "P4 with SSE3" (which essentially adds /QxP to the command line). The important issue here is to get the core loops vectorized. Did you try the modifications I suggested?
Aart Bik
Hello again Aart,
You can ignore my previous message; I finally managed to compile from the command line (an environment variable problem).
I followed your suggestions, and the speed did improve. The most impressive gain came from the pragma directive, with a 2x improvement.
However, you mentioned that the running sums (the G11 and G22 variables) create problems for vectorization. Actually, the code I posted was only a portion of the full function; these variables are not running sums in the full algorithm, as they are re-initialized on each iteration of the inner loop. (A new version with the full algorithm may be found below.)
However, I still have the issue associated with the type of the input data. I do not obtain the same speed depending on whether the matrix is made of uchar or float.
> icl benchmark02.c
> benchmark02.exe
char input: 0.338497
float input: 0.228953
Even worse, not only does adding the vectorization switch not improve speed, it actually makes the uchar routine slower!
> icl -QxN -Qvec-report3 benchmark02.c
benchmark02.c(141) : (col. 2) remark: loop was not vectorized: contains unvectorizable statement at line 141.
benchmark02.c(148) : (col. 5) remark: LOOP WAS VECTORIZED.
benchmark02.c(45) : (col. 5) remark: loop was not vectorized: not inner loop.
benchmark02.c(48) : (col. 4) remark: loop was not vectorized: mixed data types.
benchmark02.c(87) : (col. 5) remark: loop was not vectorized: not inner loop.
benchmark02.c(88) : (col. 3) remark: loop was not vectorized: vectorization possible but seems inefficient.
> benchmark02.exe
char input: 0.452141
float input: 0.229860
I really do not understand why -QxN makes the uchar function slower! Also, is there a way to make loop 88 vectorize even if the compiler thinks it will be inefficient?
Thanks in advance
Alex
##### CUT HERE #####
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

LARGE_INTEGER start, end, frequency;
float totalHi, totalLo;

void resetTimer(void){
	start.LowPart = start.HighPart = end.LowPart = end.HighPart = 0;
	totalHi = totalLo = 0.0f;
	QueryPerformanceCounter(&start);
}

void stopTimer(char* msg){
	float total_time=0.0f, frequency_time=0.0f;
	QueryPerformanceCounter(&end);
	total_time = ((float)end.HighPart - (float)start.HighPart) * (float)pow(2.0f,32) +
	             ((float)end.LowPart  - (float)start.LowPart);
	QueryPerformanceFrequency(&frequency);
	frequency_time = (float)((frequency.HighPart) * (pow(2.0f,32))) + (float)frequency.LowPart;
	printf("%s: %f\n", msg, total_time / frequency_time);
}

#define PIXEL(img, width, row, col, cha) ((img) + ((row) * (width) * 3) + (col)*3 + (cha))
#define PIXELG(img, width, row, col) ((img) + ((row) * (width)) + (col))

void BenchmarkChar(unsigned char *input, float *output, int width, int height){
	int row=0, col=0;
	float sobelX, sobelY, G11=0.0f, G22=0.0f;

	resetTimer();
	//#pragma ivdep
	for(row=1 ; row<height-1 ; row++){
		for(col=1 ; col<width-1 ; col++){
			// red channel
			sobelX = 0.5f * *PIXEL(input, width, row, col-1, 0) -
			         0.5f * *PIXEL(input, width, row, col+1, 0);
			sobelY = 0.5f * *PIXEL(input, width, row-1, col, 0) -
			         0.5f * *PIXEL(input, width, row+1, col, 0);
			G11 = sobelX*sobelX;
			G22 = sobelY*sobelY;
			// green channel
			sobelX = 0.5f * *PIXEL(input, width, row, col-1, 1) -
			         0.5f * *PIXEL(input, width, row, col+1, 1);
			sobelY = 0.5f * *PIXEL(input, width, row-1, col, 1) -
			         0.5f * *PIXEL(input, width, row+1, col, 1);
			G11 += sobelX*sobelX;
			G22 += sobelY*sobelY;
			// blue channel
			sobelX = 0.5f * *PIXEL(input, width, row, col-1, 2) -
			         0.5f * *PIXEL(input, width, row, col+1, 2);
			sobelY = 0.5f * *PIXEL(input, width, row-1, col, 2) -
			         0.5f * *PIXEL(input, width, row+1, col, 2);
			G11 += sobelX*sobelX;
			G22 += sobelY*sobelY;
			*PIXELG(output, width, row, col) = (float)sqrt(G11 + G22);
		}
	}
	stopTimer("char input");
	return;
}

void BenchmarkFloat(float *tmp, float *output, int width, int height){
	int row=0, col=0;
	float sobelX, sobelY, G11=0.0f, G22=0.0f;

	resetTimer();
	//#pragma ivdep
	for(row=1 ; row<height-1 ; row++){
		for(col=1 ; col<width-1 ; col++){
			// red channel
			sobelX = 0.5f * *PIXEL(tmp, width, row, col-1, 0) -
			         0.5f * *PIXEL(tmp, width, row, col+1, 0);
			sobelY = 0.5f * *PIXEL(tmp, width, row-1, col, 0) -
			         0.5f * *PIXEL(tmp, width, row+1, col, 0);
			G11 = sobelX*sobelX;
			G22 = sobelY*sobelY;
			// green channel
			sobelX = 0.5f * *PIXEL(tmp, width, row, col-1, 1) -
			         0.5f * *PIXEL(tmp, width, row, col+1, 1);
			sobelY = 0.5f * *PIXEL(tmp, width, row-1, col, 1) -
			         0.5f * *PIXEL(tmp, width, row+1, col, 1);
			G11 += sobelX*sobelX;
			G22 += sobelY*sobelY;
			// blue channel
			sobelX = 0.5f * *PIXEL(tmp, width, row, col-1, 2) -
			         0.5f * *PIXEL(tmp, width, row, col+1, 2);
			sobelY = 0.5f * *PIXEL(tmp, width, row-1, col, 2) -
			         0.5f * *PIXEL(tmp, width, row+1, col, 2);
			G11 += sobelX*sobelX;
			G22 += sobelY*sobelY;
			*PIXELG(output, width, row, col) = (float)sqrt(G11 + G22);
		}
	}
	stopTimer("float input");
	return;
}

int main(void){
	// variables representing an input image
	unsigned char* input;
	int width, height;
	// output image
	float *output, *tmp;
	// misc variables
	int loop, row=0, col=0;

	QueryPerformanceFrequency(&frequency);

	// 1st set up the input image
	width  = 3650;
	height = 2730;
	input = (unsigned char*)malloc(width*height*3*sizeof(unsigned char));
	srand(rand());
	for(loop=0 ; loop<width*height*3 ; loop++){
		input[loop] = (unsigned char) rand();
	}
	output = (float*)malloc(width*height*sizeof(float));
	tmp    = (float*)malloc(width*height*3*sizeof(float));
	// copy input into a float buffer (for the all-float pipeline)
	for(row=0 ; row<height ; row++){
		for(col=0 ; col<width ; col++){
			*PIXEL(tmp, width, row, col, 0) = (float) *PIXEL(input, width, row, col, 0);
			*PIXEL(tmp, width, row, col, 1) = (float) *PIXEL(input, width, row, col, 1);
			*PIXEL(tmp, width, row, col, 2) = (float) *PIXEL(input, width, row, col, 2);
		}
	}

	// perform benchmark
	BenchmarkChar (input, output, width, height);
	BenchmarkFloat(tmp, output, width, height);

	// release resources
	free(input); free(output); free(tmp);
	return 0;
}
##### CUT HERE #####
Dear Alex,
I indeed noticed that in your original posting, G11 and G22 were private to each iteration, and the core loops vectorized without any problems (perhaps requiring a pragma ivdep). In your second posting, the variables are running sums, and those need special handling before automatic vectorization becomes applicable.
This posting is becoming very long with all the source code listings. Why don't we take this off-line? Please send your examples directly to me (aart.bik@intel.com) so all our source lines match. You can post the eventual findings after we have resolved all vectorization issues.
Aart Bik
PS: To other forum readers: even though I enjoy resolving vectorization issues, posting my email is not an open invitation to dump megabytes of projects into my mailbox, as has been done in the past; when unsolicited, I typically only address concise questions.
PPS: When I place the ivdep correctly, the second loop vectorizes; on a Core 2 Duo processor this yields a speedup of about 2.6x. Also, the efficiency heuristics can be overridden with #pragma vector always.
sequential:
char input: 0.286673
float input: 0.263570
vectorized:
char input: 0.251035
float input: 0.098048
Here is an update on what I have managed to achieve so far, thanks to Aart's tremendous help. We focused first on making sure the inner loops were vectorized for both the char and float functions.
For the BenchmarkFloat function, it was simply a matter of giving the compiler the appropriate hints to make the inner loop vectorize. Before the modification, the compiler reported "remark: loop was not vectorized: vectorization possible but seems inefficient". A "#pragma vector always" directive was added just before the 'for' statement that starts the inner loop, and the message (which you can get with -Qvec-report2) changed to "remark: LOOP WAS VECTORIZED". This sped things up from 0.22 to 0.12 (on a P4). We also added a "#pragma ivdep" at the same location, but that did not change the timings.
For the BenchmarkChar function, the problem was related to the memory layout. The pixels were stored in memory interleaved, as RGBRGBRGB..., which creates a non-unit stride between neighbouring values of any one channel: the code first performs some computation on the red channel, then does the same on the green, and then the blue. As far as I understand, vectorizing that is not efficient, because gathering four consecutive values of one channel into a vector means moving data around a lot; it would be much more efficient to have them next to each other already. So that is what Aart suggested: store the image in memory by colour planes, i.e. RRRR...GGGG...BBBB.... This means we now have three separate buffers, one per channel, but the code vectorizes without problems. Resulting speed-up: 0.12 vs. 0.32 initially! The pragma directives were also added before the inner loop. Finally, Aart suggested trying the same approach for the BenchmarkFloat function, which gained a little on the P4 (0.10 vs. 0.12) but nothing on the Core 2 Duo.
I am now working on memory alignment issues to hopefully improve things further. I will post my findings as well.
Alex