Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Parallel Image Processing in OpenMP - Image Blocks

Royi
Novice

Hello,
I'm taking my first steps in the OpenMP world.

I have an image that I want to apply a filter to.
Since the image is large, I wanted to break it into blocks and apply the filter to each block independently, in parallel.
Namely, I'm creating 4 sub-images and I want each one to be processed by a different thread.

I'm using Intel IPP for the handling of the images and the function to apply on each sub image.

I described the code here:

http://stackoverflow.com/questions/29319226/parallel-image-processing-in-openmp-splitting-image

The problem is that I tried both "sections" and "parallel for" and got only a 20% improvement.

What am I doing wrong?
How can I tell each "Worker" that, although the data is taken from the same array, it is safe to read (the data won't change) and to write (each worker has exclusive access to its part of the result image)?

Thank You.

4 Replies
Chao_Y_Intel
Moderator

Hello, 

When I quickly checked the code, one thing I noticed is that it does not handle the image border. Is this a bug? See: http://scc.qibebt.cas.cn/docs/compiler/intel/2011/ipp/ipp_manual/IPPI/ippi_ch9/ch9_borders.htm
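
To make the border point concrete, here is a minimal plain-C sketch (no IPP calls) of which output columns a row filter can compute without reading outside the row. The tap convention used here, where output pixel x reads src[x - anchor + k], is an assumption chosen for illustration and may not match IPP's anchor definition; the names safe_columns and filter_row_interior are hypothetical.

```c
#include <assert.h>

/* Assumed convention (for illustration only, not necessarily IPP's):
   output pixel x reads src[x - anchor + k] for k = 0 .. kernelSize-1. */
typedef struct { int first; int count; } SafeColumns;

/* Output columns whose taps all fall inside [0, width). */
static SafeColumns safe_columns(int width, int kernelSize, int anchor)
{
    assert(kernelSize > 0 && anchor >= 0 && anchor < kernelSize);
    int left  = anchor;                   /* pixels read to the left of x  */
    int right = kernelSize - 1 - anchor;  /* pixels read to the right of x */
    SafeColumns s = { left, width - left - right };
    if (s.count < 0) s.count = 0;         /* row is narrower than the kernel */
    return s;
}

/* Filters one row, writing only the safe columns. Border columns of dst
   are left untouched for the caller to fill (for example from a padded copy). */
static void filter_row_interior(const float *src, float *dst, int width,
                                const float *kernel, int kernelSize, int anchor)
{
    SafeColumns s = safe_columns(width, kernelSize, anchor);
    for (int x = s.first; x < s.first + s.count; ++x) {
        float acc = 0.0f;
        for (int k = 0; k < kernelSize; ++k)
            acc += kernel[k] * src[x - anchor + k];
        dst[x] = acc;
    }
}
```

With a 3-tap kernel anchored at its center, one column on each side of the row cannot be computed from the row alone, so either the source must be padded or the destination ROI must be shrunk accordingly.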

Also, how is Intel IPP linked in the code: dynamically or statically?

Also, to check the threading behavior further, I suggest using a performance tool such as Intel VTune to understand how this code runs on the processors and whether it really executes in parallel.
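
Before reaching for a profiler, a quick sanity check can rule out the simplest cause: if the file is compiled without the compiler's OpenMP option (for example -fopenmp with GCC), the pragmas are silently ignored and everything runs on one thread. The sketch below is only such a check, not a replacement for a profiler; the function names are hypothetical.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of threads a parallel region may use,
   or 1 if this file was compiled without OpenMP support. */
static int max_worker_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Counts the threads that actually enter a parallel region.
   Each thread adds 1 to its private copy; the reduction sums the copies. */
static int threads_in_region(void)
{
    int count = 0;
    #pragma omp parallel reduction(+:count)
    count += 1;
    assert(count >= 1);
    return count;
}
```

If threads_in_region() returns 1 on a multi-core machine, the parallel region is not actually running in parallel, and that is worth fixing before tuning anything else.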

Thanks,
Chao  

Igor_A_Intel
Employee

Hi Royi,

Make sure that you use static linking with the non-threaded ippi library. That way you can use the following approach:

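        int height = dstRoiSize.height;
        int width  = dstRoiSize.width;
        Ipp32f *pSrc1, *pDst1;
        int nThreads, cH, cT;

#pragma omp parallel shared( pSrc, pDst, nThreads, width, height, kernelSize, \
                             xAnchor, cH, cT ) private( pSrc1, pDst1 )
        {
            /* One thread computes the band height (cH) and the leftover rows (cT). */
    #pragma omp master
            {
                nThreads = omp_get_num_threads();
                cH = height / nThreads;
                cT = height % nThreads;
            }
            /* All threads wait here until cH and cT are available. */
    #pragma omp barrier
            {
                IppiSize roi;
                int id = omp_get_thread_num();

                /* Steps are in bytes, so the offset is applied through an Ipp8u pointer. */
                pSrc1 = (Ipp32f*)( (Ipp8u*)pSrc + id * cH * srcStep );
                pDst1 = (Ipp32f*)( (Ipp8u*)pDst + id * cH * dstStep );
                /* The last thread also takes the leftover rows. */
                roi.width  = width;
                roi.height = ( id != ( nThreads - 1 )) ? cH : cH + cT;
                /* The public API takes the destination ROI as an IppiSize. */
                ippiFilterRow_32f_C1R( pSrc1, srcStep, pDst1, dstStep,
                            roi, pKernel, kernelSize, xAnchor );
            }
        }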

The "master" region is required to calculate the amount of work for each thread; after the "barrier", each thread can independently calculate the region of the image it will process. The OpenMP "shared" and "private" clauses in the parallel region declaration distinguish the shared variables from the private ones.

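The band arithmetic can also be tried without IPP. The following is a minimal sketch, assuming a contiguous float image (row step equal to width) and using a hypothetical scale_band function as a stand-in for the IPP call. It uses "parallel for" over band indices rather than the master/barrier pattern, which is a different way to express the same split. The source is only read and each destination band has exactly one writer, so no locking is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Rows [*start, *start + *rows) for worker id: every worker gets
   height / nWorkers rows, and the last one also takes the remainder. */
static void band_for_worker(int height, int nWorkers, int id,
                            int *start, int *rows)
{
    assert(nWorkers > 0 && id >= 0 && id < nWorkers);
    int cH = height / nWorkers;
    int cT = height % nWorkers;
    *start = id * cH;
    *rows  = (id == nWorkers - 1) ? cH + cT : cH;
}

/* Hypothetical stand-in for the per-band filter: scales each pixel by gain. */
static void scale_band(const float *src, float *dst,
                       int width, int rows, float gain)
{
    for (int i = 0; i < rows * width; ++i)
        dst[i] = gain * src[i];
}

/* Splits the image into nWorkers horizontal bands and processes them
   in parallel. Without OpenMP enabled, the loop simply runs serially. */
static void scale_image_parallel(const float *src, float *dst, int width,
                                 int height, int nWorkers, float gain)
{
    #pragma omp parallel for
    for (int id = 0; id < nWorkers; ++id) {
        int start, rows;
        band_for_worker(height, nWorkers, id, &start, &rows);
        size_t offset = (size_t)start * (size_t)width;
        scale_band(src + offset, dst + offset, width, rows, gain);
    }
}
```

For example, height = 10 with 4 workers gives bands of 2, 2, 2 and 4 rows. The last band can carry up to nWorkers - 1 extra rows, which is negligible for large images.
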
regards, Igor

Royi
Novice

Hi Igor,

Is this the approach used in your multi-threaded versions?
What performance gain do you get with your code? Have you tried it?

This was very helpful.

Thank You!

Igor_A_Intel
Employee

Hi Royi,

Yes, this approach is used in the ippiFilterXXX functions. You can compare the threaded and non-threaded IPP libraries to see the gain. For this purpose you can use the IPP performance system (PS), which is part of each IPP release. I see up to ~2.2x speedup on my laptop (HSWx2 + HT on) for ippiFilter_32f_C1R with a 3x3 mask size and a 720x480 ROI.

regards, Igor
