Parallel Image Processing in OpenMP - Image Blocks

Royi · 03-28-2015

Hello,
I'm doing my first steps in the OpenMP world.

I have an image I want to apply a filter to.
Since the image is large, I want to break it into blocks and apply the filter to each block independently, in parallel.
Namely, I'm creating 4 sub-images, each of which I want processed by a different thread.

I'm using Intel IPP both to handle the images and for the function applied to each sub-image.

I described the code here:

http://stackoverflow.com/questions/29319226/parallel-image-processing-in-openmp-splitting-image

The problem is that I tried both OpenMP sections and parallel for, and got only a 20% improvement.

What am I doing wrong?
How can I tell each "Worker" that, although the data is taken from the same array, it is safe to read (the data won't change) and to write (each worker has exclusive access to its part of the result image)?

Thank You.

Chao_Y_Intel · 03-31-2015

Hello,

From a quick look at the code, one thing I notice is that it does not handle the image border. Is that a bug? See http://scc.qibebt.cas.cn/docs/compiler/intel/2011/ipp/ipp_manual/IPPI/ippi_ch9/ch9_borders.htm

Also, how is Intel IPP linked in the code: dynamically or statically?

To check the threading behavior further, I would also suggest using a performance tool such as Intel VTune to see how the code runs on the processors and whether it really executes in parallel.

Thanks,
Chao

Igor_A_Intel · 04-01-2015

Hi Royi,

Make sure that you use static linking with the non-threaded ippi library. Then you can use the following approach:

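        int height = dstRoiSize.height;
        int width  = dstRoiSize.width;
        Ipp32f *pSrc1, *pDst1;
        int nThreads, cH, cT;

#pragma omp parallel shared( pSrc, pDst, nThreads, width, height, kernelSize,\
                             xAnchor, cH, cT ) private( pSrc1, pDst1 )
        {
            /* One thread computes the rows per thread (cH) and the remainder (cT). */
    #pragma omp master
            {
                nThreads = omp_get_num_threads();
                cH = height / nThreads;
                cT = height % nThreads;
            }
            /* All threads wait here until cH and cT are set. */
    #pragma omp barrier
            {
                int curH;
                IppiSize roi;
                int id = omp_get_thread_num();

                /* Steps are in bytes, hence the Ipp8u* pointer arithmetic. */
                pSrc1 = (Ipp32f*)( (Ipp8u*)pSrc + id * cH * srcStep );
                pDst1 = (Ipp32f*)( (Ipp8u*)pDst + id * cH * dstStep );
                /* The last thread also takes the leftover rows. */
                if( id != ( nThreads - 1 )) curH = cH;
                else curH = cH + cT;
                /* The public API takes the ROI as an IppiSize. */
                roi.width  = width;
                roi.height = curH;
                ippiFilterRow_32f_C1R( pSrc1, srcStep, pDst1, dstStep,
                            roi, pKernel, kernelSize, xAnchor );
            }
        }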

The "master" region is required to calculate the amount of work for each thread; after the "barrier", each thread can independently calculate the region of the image it will process. The OpenMP "shared" and "private" clauses in the parallel region declaration distinguish variables shared by all threads from those that are private to each thread.
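
For readers without IPP at hand, the same row-band split can be sketched in plain C. This is only an illustration under stated assumptions: band_rows, filter_rows and filter_rows_banded are hypothetical names, filter_rows is a simple stand-in for the IPP call (it clamps out-of-range columns to the edge pixel, which is not necessarily what IPP does), and the bands are distributed with "parallel for" instead of "master" plus "barrier" so that the code also builds when OpenMP is disabled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: first row and row count of band `id` out of `nBands`.
   The last band absorbs the remainder rows. */
static void band_rows(int height, int nBands, int id, int *firstRow, int *numRows)
{
    int cH = height / nBands;   /* rows per band */
    int cT = height % nBands;   /* leftover rows */
    *firstRow = id * cH;
    *numRows  = (id == nBands - 1) ? cH + cT : cH;
}

/* Plain-C stand-in for a row filter (not the IPP implementation).
   Steps are in bytes. Assumption: columns outside the image are
   clamped to the nearest edge pixel. */
static void filter_rows(const float *src, int srcStep, float *dst, int dstStep,
                        int width, int rows,
                        const float *kernel, int kernelSize, int xAnchor)
{
    for (int y = 0; y < rows; ++y) {
        const float *s = (const float *)((const unsigned char *)src + (size_t)y * srcStep);
        float *d = (float *)((unsigned char *)dst + (size_t)y * dstStep);
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int k = 0; k < kernelSize; ++k) {
                int xs = x + k - xAnchor;
                if (xs < 0) xs = 0;
                if (xs >= width) xs = width - 1;
                acc += kernel[k] * s[xs];
            }
            d[x] = acc;
        }
    }
}

/* Each band reads the shared source (const, never modified) and writes
   only its own rows of the destination, so no synchronization is needed. */
static void filter_rows_banded(const float *src, int srcStep, float *dst, int dstStep,
                               int width, int height,
                               const float *kernel, int kernelSize, int xAnchor,
                               int nBands)
{
    assert(nBands > 0);
    #pragma omp parallel for schedule(static)
    for (int id = 0; id < nBands; ++id) {
        int firstRow, numRows;
        band_rows(height, nBands, id, &firstRow, &numRows);
        const float *s = (const float *)((const unsigned char *)src + (size_t)firstRow * srcStep);
        float *d = (float *)((unsigned char *)dst + (size_t)firstRow * dstStep);
        filter_rows(s, srcStep, d, dstStep, width, numRows, kernel, kernelSize, xAnchor);
    }
}
```

Because a row filter only looks left and right within a row, the bands need no overlap; a filter with vertical extent would also have to read rows from neighbouring bands.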

regards, Igor

Royi · 04-03-2015

Hi Igor,

Is this the approach used in your multi-threaded versions?
What performance gain do you get with your code? Have you tried it?

This was very helpful.

Thank You!

Igor_A_Intel · 04-09-2015

Hi Royi,

Yes, this approach is used in the ippiFilterXXX functions; you can compare the IPP threaded and non-threaded libraries to see the gain. The IPP performance system (PS), which is part of each IPP release, can be used for this purpose. I see up to ~2.2x speedup on my laptop (HSWx2 + HT on) for ippiFilter_32f_C1R with a 3x3 mask size and a 720x480 ROI.

regards, Igor