Parallel Image Processing in OpenMP - Image Blocks
I'm doing my first steps in the OpenMP world.
I have an image I want to apply a filter on.
Since the image is large I wanted to break it into blocks and apply the filter on each independently in parallel.
Namely, I'm creating 4 sub-images that I want different threads to process.
I'm using Intel IPP for the handling of the images and the function to apply on each sub image.
I described the code here:
The problem is that I tried both "sections" and "parallel for" and got only a 20% improvement.
What am I doing wrong?
How can I tell each "worker" that, although the data is taken from the same array, it is safe to read (the data won't change) and to write (each worker has exclusive access to its part of the result image)?
From a quick look at the code, one thing I noticed is that it does not handle the image border. Could that be a bug? See http://scc.qibebt.cas.cn/docs/compiler/intel/2011/ipp/ipp_manual/IPPI/ippi_ch9/ch9_borders.htm
Also, how is Intel IPP linked in the code: dynamically or statically?
To check the threading behavior further, I would also use a performance tool such as Intel VTune to see how this code runs on the processors and whether it really executes in parallel.
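On the border question, one common way to give a neighborhood filter valid pixels around the region it processes is to copy the image into a larger buffer first. The following is only a sketch in plain C: pad_replicate is a hypothetical helper, not an IPP function, and it assumes tightly packed float rows and a replicated (nearest edge pixel) border. IPP has its own border functions, described in the chapter linked above, which would normally be preferred.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an IPP function): copies a width x height float
   image into a buffer of (width + 2*pad) x (height + 2*pad) pixels and
   fills the border by repeating the nearest edge pixel.
   Assumption: both buffers are tightly packed (no padding at row ends). */
static void pad_replicate(const float *src, int width, int height,
                          float *dst, int pad)
{
    assert(width > 0 && height > 0 && pad >= 0);
    int dstWidth = width + 2 * pad;
    int dstHeight = height + 2 * pad;
    for (int y = 0; y < dstHeight; ++y) {
        /* Clamp the source row index to the valid range. */
        int sy = y - pad;
        if (sy < 0) sy = 0;
        if (sy > height - 1) sy = height - 1;
        for (int x = 0; x < dstWidth; ++x) {
            /* Clamp the source column index to the valid range. */
            int sx = x - pad;
            if (sx < 0) sx = 0;
            if (sx > width - 1) sx = width - 1;
            dst[(size_t)y * dstWidth + x] = src[(size_t)sy * width + sx];
        }
    }
}
```

The filter would then be run on the interior of the padded buffer, so that every pixel it reads lies inside allocated memory.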
Make sure that you use static linking with the non-threaded ippi library. That way you can use the following approach:
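int height = dstRoiSize.height;
int width = dstRoiSize.width;
Ipp32f *pSrc1, *pDst1;
int nThreads, cH, cT;
#pragma omp parallel shared( pSrc, pDst, nThreads, width, height, kernelSize,\
xAnchor, cH, cT ) private( pSrc1, pDst1 )
{
    /* The master thread computes the work split once, for all threads. */
    #pragma omp master
    {
        nThreads = omp_get_num_threads();
        cH = height / nThreads;   /* rows per thread */
        cT = height % nThreads;   /* leftover rows, given to the last thread */
    }
    /* "master" has no implied barrier, so wait until cH and cT are set. */
    #pragma omp barrier
    int id = omp_get_thread_num();
    int curH;
    /* Steps are in bytes, hence the Ipp8u* pointer arithmetic. */
    pSrc1 = (Ipp32f*)( (Ipp8u*)pSrc + id * cH * srcStep );
    pDst1 = (Ipp32f*)( (Ipp8u*)pDst + id * cH * dstStep );
    if( id != ( nThreads - 1 )) curH = cH;
    else curH = cH + cT;
    /* The ROI is passed as an IppiSize covering this thread's rows. */
    IppiSize roi = { width, curH };
    ippiFilterRow_32f_C1R( pSrc1, srcStep, pDst1, dstStep,
                           roi, pKernel, kernelSize, xAnchor );
}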
The "master" region is needed to calculate the amount of work for each thread; after the "barrier", each thread can independently calculate the region of the image it has to process. The OMP keywords "shared" and "private" in the parallel region declaration distinguish shared variables from private ones.
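The row split described above can be tried without IPP. What follows is a minimal sketch in plain C, not the IPP implementation: band_for_thread and scale_rows_parallel are hypothetical names, and the "filter" is a stand-in that just doubles each pixel. It uses "parallel for" over bands instead of an explicit master/barrier pair. The source is only read and each band writes its own rows of the destination, so nothing beyond the default sharing rules is needed to make this safe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an IPP function): computes the rows
   [*first, *first + *count) handled by band `id` when `height` rows are
   split into `nBands` bands. The last band also takes the
   height % nBands leftover rows, as in the snippet above. */
static void band_for_thread(int height, int nBands, int id,
                            int *first, int *count)
{
    assert(nBands > 0 && id >= 0 && id < nBands);
    int cH = height / nBands;
    int cT = height % nBands;
    *first = id * cH;
    *count = (id != nBands - 1) ? cH : cH + cT;
}

/* Stand-in for the real filter: doubles every pixel. src is read-only and
   each band writes a disjoint set of dst rows, so no locking is needed.
   Steps are in bytes, following the IPP convention. */
static void scale_rows_parallel(const float *src, int srcStep,
                                float *dst, int dstStep,
                                int width, int height, int nBands)
{
    /* Without OpenMP enabled the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for
    for (int id = 0; id < nBands; ++id) {
        int first, count;   /* declared inside the loop, so private per iteration */
        band_for_thread(height, nBands, id, &first, &count);
        for (int y = first; y < first + count; ++y) {
            const float *s = (const float *)
                ((const unsigned char *)src + (size_t)y * srcStep);
            float *d = (float *)((unsigned char *)dst + (size_t)y * dstStep);
            for (int x = 0; x < width; ++x)
                d[x] = 2.0f * s[x];
        }
    }
}
```

With 10 rows and 4 bands, for example, the first three bands get 2 rows each and the last band gets 4. Choosing nBands equal to the number of threads reproduces the split from the snippet above.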
Yes, this approach is used in the ippiFilterXXX functions; you can compare the threaded and non-threaded IPP libraries to see the gain. The IPP performance system (PS), which is part of each IPP release, can be used for this purpose. I see up to ~2.2x speedup on my laptop (HSWx2 + HT on) for ippiFilter_32f_C1R with a 3x3 mask size and a 720x480 ROI.