Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
16597 Discussions

Improvement of self-written OpenCL-Funktion (GaussianBlur)

Altera_Forum
Honored Contributor II
1,210 Views

Hello, I have implemented a Gaussian Filter on the FPGA (Cyclone V SoC) using OpenCL which works ok (2.5 times faster as on the ARM) but i´m not quite sure if it´s optimal for the FPGA. 

 

host-code: 

... 

status = clsetkernelarg(kernel,0,sizeof(cl_mem),&buffer_img); // Matrix which holds the Kernelparameters 

status = clsetkernelarg(kernel,1,sizeof(cl_mem),&buffer_mask); // Matrix which holds an graysclae image, 

status = clsetkernelarg(kernel,2,sizeof(cl_mem),&buffer_outputimg); // Matrix for output 

status = clsetkernelarg(kernel,3,sizeof(int),&img.cols); 

status = clsetkernelarg(kernel,4,sizeof(int),&maskwidth); 

 

 

size_t globalWorkSize[2]; 

globalWorkSize[0] = output.cols; 

globalWorkSize[1] = output.rows; 

status = clEnqueueNDRangeKernel(cmdQueue,kernel,2,NULL, globalWorkSize, NULL,0, NULL,NULL); 

... 

 

kernel-code: 

 

 

__kernel void convolve(__global uchar * input, __global float * mask, __global uchar * output, 

const int inputWidth,const int maskWidth)  

const int x = get_global_id(0); 

const int y = get_global_id(1); 

 

 

float sum = 0; 

for (int r = 0; r < maskWidth; r++) 

//Inkrementieren rowindex with picturewidth  

const int idxrow = (y + r) * inputWidth + x; 

for (int c = 0; c < maskWidth; c++) 

//convolve 

sum += mask[(r * maskWidth) + c] * input[idxrow + c]; 

output[y * get_global_size(0) + x] = sum; 

 

 

Can someone tell me if and how it´s possible to improve the peroformance of the Gaussian Kernel on the FPGA?  

Thanks :) 

 

0 Kudos
4 Replies
Altera_Forum
Honored Contributor II
404 Views

Comparing with the ARM core is probably not very conclusive since the ARM core is extremely slow. 

 

The most obvious way to increase performance on the FPGA would be to unroll the loop on "c". Though since you are performing a floating-point reduction, you should either fully unroll that loop, or first optimize that loop to achieve an iteration interval of one by inferring a shift register as outlined in "Intel® FPGA SDK for OpenCL Best Practices Guide, 1.6.1.5 Removing Loop-Carried Dependency by Inferring Shift Registers" and then unroll it to achieve best performance. 

 

You should consider fully reading Intel's programming and best practices guides since all the basic optimization techniques are covered there.
0 Kudos
Altera_Forum
Honored Contributor II
404 Views

unrolling the loop on c improved the kernel. After reading the Best Practice Guide i tried to improve my GaussianBlur function like suggested in 1.6.1.5. 

 

With help from the example in the guide and the exampleimplementation of an Sobel-Filter on: https://www.altera.com/support/support-resources/design-examples/design-software/opencl/sobel-filter.html  

 

But my output of the kernel isn´t correct. 

 

Kernel-Code: 

#define maskWidth 7 # define COLS 640 __kernel void gaussneu(global uchar * restrict frame_in, global uchar * restrict frame_out, const int iterations/*, const int COLS*/) { // Filter coefficients float mask = { { 0.0049, 0.0092, 0.0134, 0.0152, 0.0134, 0.0092, 0.0049}, { 0.0092, 0.0172, 0.0250, 0.0283, 0.0250, 0.0172, 0.0092}, { 0.0134, 0.0250, 0.0364, 0.0412, 0.0364, 0.0250, 0.0134}, { 0.0152, 0.0283, 0.0412, 0.0467, 0.0412, 0.0283, 0.0152}, { 0.0134, 0.0250, 0.0364, 0.0412, 0.0364, 0.0250, 0.0134}, { 0.0092, 0.0172, 0.0250, 0.0283, 0.0250, 0.0172, 0.0092}, { 0.0049, 0.0092, 0.0134, 0.0152, 0.0134, 0.0092, 0.0049}, }; // Pixel buffer of 6 rows and 7 extra pixels int rows; // The initial iterations are used to initialize the pixel buffer. int count = -(6 * COLS + 7); while (count != iterations) { // Each cycle, shift a new pixel into the buffer. // Unrolling this loop allows the compile to infer a shift register. # pragma unroll for (int i = COLS * 6 + 6; i > 0; --i) { rows = rows; } rows = count >= 0 ? frame_in : 0; // if count >=0 -> true rows = frame_in float sum = 0; // With these loops unrolled, one convolution can be computed every cycle. # pragma unroll for (int i = 0; i < maskWidth; ++i) { # pragma unroll for (int j = 0; j < maskWidth; ++j) { uchar pixel = rows; sum += mask*pixel; } } if (count >= 0) { frame_out = sum; } count++; } } 

 

Can someone tell me whats the problem?  

Also I think there is an error on the sobel.cl file from the altera example in this part: 

 

int temp = abs(x_dir) + abs(y_dir); unsigned int clamped; if (temp > threshold) { clamped = 0xffffff; } else { clamped = 0; } if (count >= 0) { frame_out = clamped; } 

 

the computet value temp is never set, the output values only can take the values 0 or 0xffffffff
0 Kudos
Altera_Forum
Honored Contributor II
404 Views

Edit @ question 1: my kernel code works, i did a mistake with handling the boarderproblem in the Host-Code :x. But i still think that the sobel.cl file has an error

0 Kudos
Altera_Forum
Honored Contributor II
404 Views

 

--- Quote Start ---  

Edit @ question 1: my kernel code works, i did a mistake with handling the boarderproblem in the Host-Code :x. But i still think that the sobel.cl file has an error 

--- Quote End ---  

 

 

They are just setting the output to either 0 or 0xffffff (max value). Since this code works based on threshold, anything above the threshold is seen as an edge and anything below is black. You can change the threshold value to obtain different images and choose the best threshold.
0 Kudos
Reply