Hi guys,
I was implementing an image processing function when I stumbled upon a problem I do not quite understand. Essentially, I apply a custom filter to an image, and I get noticeably different timings depending on the type of the input data.
The piece of code is fairly simple. For each pixel in an image, it processes that pixel's neighbours and outputs a float at that pixel. Most of the processing is done in floating point, as you can see from this excerpt of the code:
float sobelX, sobelY, G11=0.0f, G22=0.0f;
for(row=1 ; row<height-1 ; row++){
	for(col=1 ; col<width-1 ; col++){
		sobelX = 0.2f * image[row][col-1] -
		         0.2f * image[row][col+1];
		sobelY = 0.2f * image[row-1][col] -
		         0.2f * image[row+1][col];
		G11 = sobelX*sobelX;
		G22 = sobelY*sobelY;
		Output[row][col] = (float)sqrt(G11 + G22);
	}
}
Now, if the matrix image is made of int, this loop takes 0.28 seconds. However, if image is made of float, it takes only 0.13 seconds. Such a big difference! Can someone enlighten me as to why that is the case?
Thanks in advance
Alex
Dear Alex,
You fail to mention the compiler version and switches used, or to provide a complete example. Assuming you use CPU-specific switches, the innermost loop is vectorized for both int and float when I compile something that looks similar to your fragment. Perhaps, in the full context, float vectorizes but int does not? Without more information, I am obviously forced to guess. Can you please try /Qvec-report2 (Windows) or -vec-report2 (Linux) in that case, or send me the full context?
Aart Bik
Thanks a lot for the answer. It took me some time, but I now have a stand-alone version (given at the end of this message). As for the information you asked for, I am using the latest Intel compiler on a Windows XP machine with a Pentium 4 630 processor (classification 0F434). I am actually compiling from the Visual Studio interface rather than the command line. The compiler switches displayed in the Visual Studio project property dialog are given below.
The interesting thing is that the behaviour I was referring to last week does not always occur; rather, it occurs when it should not. If "Use Intel Processor Extensions" is set to "P4 with SSE3", the int/float pipeline is slower (0.14 vs. 0.25), whereas if this option is set to "None", the code is (almost) as fast in both cases (0.15 vs. 0.14). That does not make any sense to me. Any ideas why?
Thanks in advance
E.
#### COMPILER SWITCHES
compiler: /GL /c /Ox /Og /Ob2 /Oi /Ot /GA /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MD /GS /GR /Fo"Release/" /W3 /nologo /Wp64 /Zi /Gd
linker: /LTCG kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /OUT:"C:\ImSense\Code\Optimisation\Optim01\Release/Optim01.exe" /INCREMENTAL:NO /NOLOGO /TLBID:1 /DEBUG /PDB:"C:\ImSense\Code\Optimisation\Optim01\Release\Optim01.pdb" /SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /IMPLIB:"C:\ImSense\Code\Optimisation\Optim01\Release/Optim01.lib" /MACHINE:X86 /MANIFEST /MANIFESTFILE:"Release\Optim01.exe.intermediate.manifest"
##### CUT HERE #####
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

LARGE_INTEGER start, end, frequency;
float totalHi, totalLo;

void resetTimer(void){
	start.LowPart = start.HighPart = end.LowPart = end.HighPart = 0;
	totalHi = totalLo = 0.0f;
	QueryPerformanceCounter(&start);
}

void stopTimer(char* msg){
	float total_time=0.0f, frequency_time=0.0f;
	QueryPerformanceCounter(&end);
	total_time = ((float)end.HighPart - (float)start.HighPart) * (float)pow(2.0f,32) +
	             ((float)end.LowPart  - (float)start.LowPart);
	QueryPerformanceFrequency(&frequency);
	frequency_time = (float)((frequency.HighPart) * (pow(2.0f,32))) + (float)frequency.LowPart;
	printf("%s: %f\n", msg, total_time / frequency_time);
}

#define PIXEL(img, width, row, col) ((img) + ((row) * width) + (col))

void Benchmark1(unsigned char* input, int width, int height){
	int row=0, col=0;
	float sobelX, sobelY, G11=0.0f, G22=0.0f;
	float *tmp2=NULL, *output=NULL;
	int *tmp3=NULL;

	output = (float*)malloc(width*height*sizeof(float));
	tmp2   = (float*)malloc(width*height*sizeof(float));
	tmp3   = (int*)  malloc(width*height*sizeof(int));
	QueryPerformanceFrequency(&frequency);

	///// FLOAT PIPELINE WITH FLOAT INPUT
	// create a float copy of input to test the float pipeline
	for(row=0 ; row<height ; row++){
		for(col=0 ; col<width ; col++){
			*PIXEL(tmp2, width, row, col) = (float) *PIXEL(input, width, row, col);
		}
	}
	resetTimer();
	for(row=1 ; row<height-1 ; row++){
		for(col=1 ; col<width-1 ; col++){
			sobelX = 0.5f * *PIXEL(tmp2, width, row, col-1) -
			         0.5f * *PIXEL(tmp2, width, row, col+1);
			sobelY = 0.5f * *PIXEL(tmp2, width, row-1, col) -
			         0.5f * *PIXEL(tmp2, width, row+1, col);
			G11 += sobelX*sobelX;
			G22 += sobelY*sobelY;
			// then compute the gradient by square-rooting the accumulated value
			*PIXEL(output, width, row, col) = (float)sqrt(G11 + G22);
		}
	}
	stopTimer("bench 1, all float pipeline");

	///// FLOAT PIPELINE WITH INT INPUT
	// create an int copy of input to test the int pipeline
	for(row=0 ; row<height ; row++){
		for(col=0 ; col<width ; col++){
			*PIXEL(tmp3, width, row, col) = (int) *PIXEL(input, width, row, col);
		}
	}
	resetTimer();
	for(row=1 ; row<height-1 ; row++){
		for(col=1 ; col<width-1 ; col++){
			sobelX = 0.5f * *PIXEL(tmp3, width, row, col-1) -
			         0.5f * *PIXEL(tmp3, width, row, col+1);
			sobelY = 0.5f * *PIXEL(tmp3, width, row-1, col) -
			         0.5f * *PIXEL(tmp3, width, row+1, col);
			G11 += sobelX*sobelX;
			G22 += sobelY*sobelY;
			// then compute the gradient by square-rooting the accumulated value
			*PIXEL(output, width, row, col) = (float)sqrt(G11 + G22);
		}
	}
	stopTimer("bench 1, int/float pipeline");

	free(output); free(tmp2); free(tmp3);
	return;
}

int main(void){
	// variables representing an input image
	unsigned char* input;
	int width, height;
	// misc variable
	int loop;

	// 1st set up the image
	width  = 3650;
	height = 2730;
	input = (unsigned char*)malloc(width*height*sizeof(unsigned char));
	srand(rand());
	for(loop=0 ; loop<width*height ; loop++){
		input[loop] = (unsigned char) rand();
	}
	Benchmark1(input, width, height);

	// release resources
	free(input);
	return 0;
}
##### CUT HERE #####
Dear Alex,
Using the 9.1 compilers, plain command line, and running on a 2.4GHz Core 2 Duo processor, this is what I get:
/O2
bench 1, all float pipeline: 0.394522
bench 1, int/float pipeline: 0.400839
/QxP [could use /QxT as well on this machine]
joho.cpp(86) : (col. 1) remark: LOOP WAS VECTORIZED.
joho.cpp(132) : (col. 1) remark: LOOP WAS VECTORIZED.
bench 1, all float pipeline: 0.267814
bench 1, int/float pipeline: 0.276876
Ironically, I don't have a Pentium 4 processor available at the moment. In your output window, do you see the same two loops vectorized (ignoring line changes due to my cut-and-pasting)?
Aart Bik
http://www.aartbik.com/
I indeed saw the message "LOOP WAS VECTORIZED" twice. Could you give me the full command line you used for compiling? That would be very useful so I can try to replicate your results. Thanks in advance.
Also, and this is not really the point, but your timings seem rather slow. I was expecting a Core 2 Duo to outperform a P4 significantly.
Thanks again
Alex
From the command line, I simply use
icl -O2 joho.cpp // default optimization
icl -QxN joho.cpp // Pentium 4 optimization
By the way, I got my hands on a P4 here in the lab and can confirm something strange:
/O2
bench 1, all float pipeline: 0.162543
bench 1, int/float pipeline: 0.149550
/QxN
bench 1, all float pipeline: 0.143731
bench 1, int/float pipeline: 0.239317
Hi Alex,
Loops 86 and 132 are simple initialization loops outside the actual timed region of your kernels. When they are vectorized, register allocation changes downstream, which unfortunately causes the actual kernels to slow down. You can eliminate this effect by placing #pragma novector before those two loops.
But it's much more interesting to vectorize the core computations in this case. By using /Qvec-report3, you will find that the Intel compiler does not vectorize the core loops at lines 98 and 144 due to data dependences on G11 and G22, which implement a hard-to-vectorize running sum. Since sqrt is an expensive operation, I suspect you can still benefit a lot from automatic vectorization, albeit with some source-level rewriting. In simplified form, rewrite something of the form:
for (i = 0; i < 1024; i++) {
	G += a[i];
	b[i] = sqrt( G );
}
into this form, using a temporary buffer (assuming space allows):
for (i = 0; i < 1024; i++) {
	G += a[i];
	tmp[i] = G;
}
for (i = 0; i < 1024; i++) {
	b[i] = sqrt( tmp[i] );
}
Then the second loop will vectorize with potentially a large speedup. Please give this a try in your example. You will need two buffers, but can keep the sqrt and addition in the vector loop.
Also, even after this rewriting, you may want to use:
#pragma ivdep
for(col=1 ; col<width-1 ; col++){
...
}
to remove some assumed, but nonexistent, dependences between the pointers. In fact, if you use this pragma right now, the loop with the running sum will vectorize, giving a great speedup but incorrect results.
Hope this helps. Please let me know if vectorization worked for you.
Aart Bik
Hello Aart,
thanks for the quick answer. I tried to compile from the command line, but a dialog box pops up:
- Title: "link.exe - Unable to locate component"
- Message: "This application has failed to start because mspdb80.dll was not found. Re-installing the application may fix this problem."
Obviously, no .exe is produced (only an .obj is generated). Any idea what is wrong with my setup?
Thanks
Alexis
Hi Alex,
If you use the .NET command prompt, the environment should have been set up correctly. This is irrelevant for getting the loop vectorized, however, since all of this should work just as well from the Visual Studio interface if you set "P4 with SSE3" (which essentially adds /QxP to the command line). The important issue here is to get the core loops vectorized. Did you try the modifications I suggested?
Aart Bik
Hello again Aart,
You can ignore my previous message; I finally managed to compile from the command line (an environment variable problem).
I followed your suggestions, and the speed did improve. The most impressive gain came from the pragma directive, with a 2x improvement.
However, you mentioned that the running sums (the G11 and G22 variables) create problems for vectorization. Actually, the code I posted was only a portion of the full function; these variables are not running sums in the full algorithm, as they are re-initialized on each iteration of the inner loop. (A new version with the full algorithm may be found below.)
However, I still have the issue associated with the type of the input data. I do not obtain the same speed depending on whether the matrix is made of uchar or float.
> icl benchmark02.c
> benchmark02.exe
char input: 0.338497
float input: 0.228953
Even worse, not only does adding the vectorization switch not improve speed, it actually makes the uchar routine slower!
> icl -QxN -Qvec-report3 benchmark02.c
benchmark02.c(141) : (col. 2) remark: loop was not vectorized: contains unvectorizable statement at line 141.
benchmark02.c(148) : (col. 5) remark: LOOP WAS VECTORIZED.
benchmark02.c(45) : (col. 5) remark: loop was not vectorized: not inner loop.
benchmark02.c(48) : (col. 4) remark: loop was not vectorized: mixed data types.
benchmark02.c(87) : (col. 5) remark: loop was not vectorized: not inner loop.
benchmark02.c(88) : (col. 3) remark: loop was not vectorized: vectorization possible but seems inefficient.
> benchmark02.exe
char input: 0.452141
float input: 0.229860
I really do not understand why -QxN makes the uchar function slower! Also, is there a way to make loop 88 vectorize even if the compiler thinks it will be inefficient?
Thanks in advance
Alex
##### CUT HERE #####
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

LARGE_INTEGER start, end, frequency;
float totalHi, totalLo;

void resetTimer(void){
	start.LowPart = start.HighPart = end.LowPart = end.HighPart = 0;
	totalHi = totalLo = 0.0f;
	QueryPerformanceCounter(&start);
}

void stopTimer(char* msg){
	float total_time=0.0f, frequency_time=0.0f;
	QueryPerformanceCounter(&end);
	total_time = ((float)end.HighPart - (float)start.HighPart) * (float)pow(2.0f,32) +
	             ((float)end.LowPart  - (float)start.LowPart);
	QueryPerformanceFrequency(&frequency);
	frequency_time = (float)((frequency.HighPart) * (pow(2.0f,32))) + (float)frequency.LowPart;
	printf("%s: %f\n", msg, total_time / frequency_time);
}

#define PIXEL(img, width, row, col, cha) ((img) + ((row) * (width) * 3) + (col)*3 + (cha))
#define PIXELG(img, width, row, col) ((img) + ((row) * (width)) + (col))

void BenchmarkChar(unsigned char *input, float *output, int width, int height){
	int row=0, col=0;
	float sobelX, sobelY, G11=0.0f, G22=0.0f;

	resetTimer();
	//#pragma ivdep
	for(row=1 ; row<height-1 ; row++){
		for(col=1 ; col<width-1 ; col++){
			// red channel
			sobelX = 0.5f * *PIXEL(input, width, row, col-1, 0) -
			         0.5f * *PIXEL(input, width, row, col+1, 0);
			sobelY = 0.5f * *PIXEL(input, width, row-1, col, 0) -
			         0.5f * *PIXEL(input, width, row+1, col, 0);
			G11 = sobelX*sobelX;
			G22 = sobelY*sobelY;
			// green channel
			sobelX = 0.5f * *PIXEL(input, width, row, col-1, 1) -
			         0.5f * *PIXEL(input, width, row, col+1, 1);
			sobelY = 0.5f * *PIXEL(input, width, row-1, col, 1) -
			         0.5f * *PIXEL(input, width, row+1, col, 1);
			G11 += sobelX*sobelX;
			G22 += sobelY*sobelY;
			// blue channel
			sobelX = 0.5f * *PIXEL(input, width, row, col-1, 2) -
			         0.5f * *PIXEL(input, width, row, col+1, 2);
			sobelY = 0.5f * *PIXEL(input, width, row-1, col, 2) -
			         0.5f * *PIXEL(input, width, row+1, col, 2);
			G11 += sobelX*sobelX;
			G22 += sobelY*sobelY;
			*PIXELG(output, width, row, col) = (float)sqrt(G11 + G22);
		}
	}
	stopTimer("char input");
	return;
}

void BenchmarkFloat(float *tmp, float *output, int width, int height){
	int row=0, col=0;
	float sobelX, sobelY, G11=0.0f, G22=0.0f;

	resetTimer();
	//#pragma ivdep
	for(row=1 ; row<height-1 ; row++){
		for(col=1 ; col<width-1 ; col++){
			// red channel
			sobelX = 0.5f * *PIXEL(tmp, width, row, col-1, 0) -
			         0.5f * *PIXEL(tmp, width, row, col+1, 0);
			sobelY = 0.5f * *PIXEL(tmp, width, row-1, col, 0) -
			         0.5f * *PIXEL(tmp, width, row+1, col, 0);
			G11 = sobelX*sobelX;
			G22 = sobelY*sobelY;
			// green channel
			sobelX = 0.5f * *PIXEL(tmp, width, row, col-1, 1) -
			         0.5f * *PIXEL(tmp, width, row, col+1, 1);
			sobelY = 0.5f * *PIXEL(tmp, width, row-1, col, 1) -
			         0.5f * *PIXEL(tmp, width, row+1, col, 1);
			G11 += sobelX*sobelX;
			G22 += sobelY*sobelY;
			// blue channel
			sobelX = 0.5f * *PIXEL(tmp, width, row, col-1, 2) -
			         0.5f * *PIXEL(tmp, width, row, col+1, 2);
			sobelY = 0.5f * *PIXEL(tmp, width, row-1, col, 2) -
			         0.5f * *PIXEL(tmp, width, row+1, col, 2);
			G11 += sobelX*sobelX;
			G22 += sobelY*sobelY;
			*PIXELG(output, width, row, col) = (float)sqrt(G11 + G22);
		}
	}
	stopTimer("float input");
	return;
}

int main(void){
	// variables representing an input image
	unsigned char* input;
	int width, height;
	// output image
	float *output, *tmp;
	// misc variables
	int loop, row=0, col=0;

	QueryPerformanceFrequency(&frequency);

	// 1st set up the input image
	width  = 3650;
	height = 2730;
	input = (unsigned char*)malloc(width*height*3*sizeof(unsigned char));
	srand(rand());
	for(loop=0 ; loop<width*height*3 ; loop++){
		input[loop] = (unsigned char) rand();
	}
	output = (float*)malloc(width*height*sizeof(float));
	tmp    = (float*)malloc(width*height*3*sizeof(float));
	// copy input into a float buffer (for the all-float pipeline)
	for(row=0 ; row<height ; row++){
		for(col=0 ; col<width ; col++){
			*PIXEL(tmp, width, row, col, 0) = (float) *PIXEL(input, width, row, col, 0);
			*PIXEL(tmp, width, row, col, 1) = (float) *PIXEL(input, width, row, col, 1);
			*PIXEL(tmp, width, row, col, 2) = (float) *PIXEL(input, width, row, col, 2);
		}
	}

	// perform benchmark
	BenchmarkChar (input, output, width, height);
	BenchmarkFloat(tmp, output, width, height);

	// release resources
	free(input); free(output); free(tmp);
	return 0;
}
##### CUT HERE #####
Dear Alex,
I indeed noticed that in your original posting, G11 and G22 were private to each iteration, and the core loops vectorized without any problems (perhaps requiring a pragma ivdep). In your second posting, the variables are running sums, and those need special handling before automatic vectorization becomes applicable.
This posting is becoming very long with all the source code listings. Why don't we take this off-line? Please send your examples directly to me (aart.bik@intel.com) so all our source lines match. You can post the eventual findings after we have resolved all vectorization issues.
Aart Bik
PS: To other forum readers: even though I enjoy resolving vectorization issues, posting my email is not an open invitation to dump megabytes of projects into my mailbox, as has been done in the past; when unsolicited, I typically only address concise questions.
PPS: When I place the ivdep correctly, the second loop vectorizes; on a Core 2 Duo processor this yields a speedup of about 2.6x. Also, the efficiency heuristics can be overridden with #pragma vector always.
sequential:
char input: 0.286673
float input: 0.263570
vectorized:
char input: 0.251035
float input: 0.098048
Here is an update on what I have managed to achieve so far, thanks to Aart's tremendous help. We focused first on making sure the inner loops were vectorized for both the char and float functions.
For the BenchmarkFloat function, it was simply a matter of giving the compiler the appropriate hints to make the inner loop vectorize. Before the modification, the compiler reported "remark: loop was not vectorized: vectorization possible but seems inefficient". A "#pragma vector always" directive was added just before the 'for' statement that starts the inner loop, and the message (which you can get with -Qvec-report2) changed to "remark: LOOP WAS VECTORIZED". This sped things up from 0.22 to 0.12 (on a P4). We also added a "#pragma ivdep" at the same location, but that did not change the timings.
For the BenchmarkChar function, the problem was related to the memory layout. The pixels were stored in memory interleaved, as RGBRGBRGB..., which creates a non-unit stride between neighbouring values of any one channel: the code first performs some computation on the red channel, then does the same on the green, and then the blue. As far as I understand, vectorizing that is not efficient, because gathering four consecutive values of one channel into a vector means moving data around a lot; it would be much more efficient to have them next to each other already. So that is what Aart suggested: store the image in memory by colour planes, i.e. RRRR...GGGG...BBBB.... This means we now have three separate buffers, one per channel, but the code vectorizes without problems. Resulting speed-up: 0.12 vs. 0.32 initially! The pragma directives were also added before the inner loop. Finally, Aart suggested trying the same approach for the BenchmarkFloat function, which gained a little on the P4 (0.10 vs. 0.12) but nothing on the Core 2 Duo.
I am now working on memory alignment issues to hopefully improve things further. I will post my findings as well.
Alex