Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Parallelize single threaded IPP functions

steffenroeber
Beginner
1,186 Views

Hi,

Is there any documentation that describes how to use single-threaded IPP functions in a multithreaded environment?

'Simple' functions such as ippiHSVToRGB_8u_C3R that do not use buffers, specs or borders are no problem: each thread works on its own ROI. But what about ROI sizes (CPU caches, memory alignment and so on)?

And how do I use filters with borders (ippiFilterScharrHorizMaskBorder_8u16s_C1R), or FFT with a spec (ippiFFTFwd_RToPack_32f_C1R), or others using a buffer (ippiFastMarching_8u32f_C1R)?

17 Replies
Chao_Y_Intel
Moderator

Hello, 
Can you check the following IPP sample? It provides some examples showing how to call IPP functions with external threading:
ipp\examples\ipp-examples.zip

The examples\ipp-examples\examples\ipp_thread folder provides example code for filters and other functions. The document folder of that example has the steps to build and run the sample code.

Thanks,
Chao

Gennady_F_Intel
Moderator

Please see another example that shows how to call FilterScharr from a multithreaded environment.

steffenroeber
Beginner

What are the border flags ippBorderInMemTop and ippBorderInMemBottom used for?

steffenroeber
Beginner

In function ippiFilterScharrHorizMaskBorderGetBufferSize_mt of your example:

what's the logic of:

bufsize = (bufsize + 63) & ~63;

Sergey_K_Intel
Employee

steffenroeber wrote:

In function ippiFilterScharrHorizMaskBorderGetBufferSize_mt of your example:

what's the logic of:

bufsize = (bufsize + 63) & ~63;

Hi,

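It rounds bufsize up to the nearest multiple of 64, i.e. the smallest value that is divisible by 64 and not less than bufsize. It is like "bufsize += 64 - bufsize % 64;", except that a value already divisible by 64 is left unchanged.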
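As a minimal sketch of the rounding (the helper names round_up_64 and round_up_64_mod are made up for illustration and are not part of IPP or of the sample), the mask form and an equivalent modulo form could look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of 64. A value that is already a
   multiple of 64 is returned unchanged. The mask trick works because
   64 is a power of two. */
size_t round_up_64(size_t n)
{
    return (n + 63) & ~(size_t)63;
}

/* Equivalent modulo form. The explicit check for rem == 0 is what keeps
   exact multiples unchanged; without it, 64 would become 128. */
size_t round_up_64_mod(size_t n)
{
    size_t rem = n % 64;
    return rem == 0 ? n : n + (64 - rem);
}
```

For example, both helpers map 1 to 64, 64 to 64 and 65 to 128.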

steffenroeber
Beginner

Ok. Wrong question. Why must bufSize be divisible by 64?

steffenroeber
Beginner

And why is there "bufsize *= mxnthr;"? bufSize is already sized for the complete image ROI.

Ivan_Z_Intel
Employee

1.

What are the border flags ippBorderInMemTop and ippBorderInMemBottom used for?

In this example the source image is cut into stripes. For the top stripe (processed by thread #0) the bottom border consists of real pixels, so the border type must be ippBorderInMemBottom. Conversely, for the bottom stripe (processed by thread #(max_nums_thread-1)) the top border consists of real pixels, so the border type for this stripe must be ippBorderInMemTop. For the same reason, the border type for all other stripes must be ippBorderInMemTop+ippBorderInMemBottom.

2.

Why must bufSize be divisible by 64?

A separate buffer is needed for every thread. This bufSize defines the buffer size for one thread. If bufSize is divisible by 64 and the start of the whole buffer is itself 64-byte aligned, then the buffers for all threads will be aligned to 64 bytes (this is important for performance).

3.

Why is there "bufsize *= mxnthr;"? bufSize is already sized for the complete image ROI.

Yes, that is a mistake. Thanks for noticing.

The code below is better; it saves memory in this case:

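#include <omp.h>
#include <ipp.h>

/* Computes the work-buffer size for a striped, multithreaded Scharr filter.
   On return: *pBufferSize is the total size for all threads, *bufStep is the
   per-thread stride, and *numthr is the number of threads actually used. */
void ippiFilterScharrHorizMaskBorderGetBufferSize_mt(IppiSize dstRoiSize, IppiMaskSize mask,
        IppDataType srcDataType, IppDataType dstDataType, int numChannels,
        int *pBufferSize, int *bufStep, int *numthr)
{
    int bufsize;
    int mxnthr = 1;
    if (*numthr <= 1) {
        /* Single-threaded case: one buffer for the whole ROI. */
        ippiFilterScharrHorizMaskBorderGetBufferSize(dstRoiSize, mask, srcDataType, dstDataType, numChannels, &bufsize);
        *bufStep = bufsize;
    } else {
        int hd;
        int hr;
        /* Never use more threads than the caller asked for. */
        mxnthr = omp_get_max_threads();
        if (mxnthr > *numthr) {
            mxnthr = *numthr;
            omp_set_num_threads(mxnthr);
        }
        /* Size the buffer for the tallest stripe (base height plus remainder),
           not for the complete image. */
        hd = dstRoiSize.height / mxnthr;
        hr = dstRoiSize.height % mxnthr;
        dstRoiSize.height = hd + hr;
        ippiFilterScharrHorizMaskBorderGetBufferSize(dstRoiSize, mask, srcDataType, dstDataType, numChannels, &bufsize);
        /* Round up to a multiple of 64 so that every per-thread buffer keeps
           the alignment of the first one. */
        bufsize = (bufsize + 63) & ~63;
        *bufStep = bufsize;
        /* One buffer of bufStep bytes per thread. */
        bufsize *= mxnthr;
    }
    *pBufferSize = bufsize;
    *numthr = mxnthr;
}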
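To show how the stripes, border flags and per-thread buffers fit together, here is a minimal sketch that plans the work of one thread. It is not taken from the Intel sample. StripePlan and plan_stripe are hypothetical names, and the BORDER_* constants are local stand-ins so that the sketch compiles without ipp.h; real code would use ippBorderRepl, ippBorderInMemTop and ippBorderInMemBottom from the IPP headers. Two assumptions are made: the last stripe takes the remainder rows, and the in-memory flags are OR-ed onto a base border type.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the IPP border flags (assumption: illustrative only). */
enum {
    BORDER_REPL         = 0x01,
    BORDER_INMEM_TOP    = 0x10,
    BORDER_INMEM_BOTTOM = 0x20
};

typedef struct {
    int    firstRow;   /* first image row of the stripe */
    int    height;     /* number of rows in the stripe */
    int    border;     /* border type to pass to the filter call */
    size_t bufOffset;  /* byte offset of this thread's work buffer */
} StripePlan;

/* Plan the stripe handled by thread tid out of numThreads.
   bufStep is the per-thread buffer stride (a multiple of 64). */
StripePlan plan_stripe(int imageHeight, int numThreads, int tid, size_t bufStep)
{
    StripePlan p;
    int hd = imageHeight / numThreads;   /* base stripe height */
    int hr = imageHeight % numThreads;   /* remainder rows */

    assert(numThreads >= 1 && tid >= 0 && tid < numThreads);

    p.firstRow = tid * hd;
    p.height = (tid == numThreads - 1) ? hd + hr : hd;

    /* Rows above or below the stripe are real pixels unless the stripe
       touches the top or bottom edge of the image. */
    p.border = BORDER_REPL;
    if (tid > 0)
        p.border |= BORDER_INMEM_TOP;
    if (tid < numThreads - 1)
        p.border |= BORDER_INMEM_BOTTOM;

    p.bufOffset = (size_t)tid * bufStep;
    return p;
}
```

Each thread would then call the filter on its own stripe with pBuffer + bufOffset as the work buffer and the planned border type.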

Thanks!

steffenroeber
Beginner

Thank you for that explanation. Now it works.

Now the next questions:

What about linear transformations? For example: ippiFFTInit_R_32f. There are

IppiFFTSpec_R_32f* pFFTSpec

Ipp8u* pMemInit

Can these be shared between threads?

Is it possible to parallelize ippiHoughLine_Region_8u32f_C1R and similar functions?

Alexey_Tyndyuk
Beginner

Hi,

You can't share a single pMemInit buffer in multiple threads, but you can share FFTSpec.
If you need to run the same type and size of FFT in multiple threads, you can initialize FFTSpec only once, and then free pMemInit after initialization. Once you initialize FFTSpec (IppiFFTSpec_R_32f), you can share it between multiple threads since it is not modified by ippiFFT processing functions. But you will need a separate work buffer (pBuffer) for each ippiFFT function running in its own thread.
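The sharing pattern can be sketched without IPP itself. In the following illustration SharedSpec, process_block and process_all are hypothetical stand-ins (they are not IPP types or functions): the spec is read-only and shared, while every block gets its own slice of the work buffer. The OpenMP pragma is ignored if the compiler is not run with OpenMP enabled, in which case the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an initialized, read-only spec
   (the role played by IppiFFTSpec_R_32f). */
typedef struct {
    float scale;
} SharedSpec;

/* Hypothetical stand-in for a processing call: it only reads the spec
   and writes to dst and to its private work buffer. */
void process_block(const float *src, float *dst, int len,
                   const SharedSpec *spec, float *work)
{
    int i;
    for (i = 0; i < len; i++)
        work[i] = src[i] * spec->scale;   /* scratch use of the buffer */
    for (i = 0; i < len; i++)
        dst[i] = work[i];
}

/* Process numBlocks independent blocks of blockLen floats each.
   workAll must hold numBlocks * blockLen floats, so no two blocks
   ever touch the same scratch memory. */
void process_all(const float *src, float *dst, int numBlocks, int blockLen,
                 const SharedSpec *spec, float *workAll)
{
    int b;
    assert(numBlocks >= 0 && blockLen >= 0);
    #pragma omp parallel for
    for (b = 0; b < numBlocks; b++) {
        size_t off = (size_t)b * (size_t)blockLen;
        process_block(src + off, dst + off, blockLen, spec, workAll + off);
    }
}
```

In real IPP code the work buffer would usually be allocated once per thread rather than once per block; the per-block slices here just keep the sketch short.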

You can find an example for ippiFFTInit_R_32f here:
https://software.intel.com/en-us/node/504249

Best regards,
Alexey

steffenroeber
Beginner

What about the ippiHoughLine_Region_8u32f_C1R?

Andrey_B_Intel
Employee

steffenroeber wrote:

Hi Steffen.

is there any description available that describes how to use single threaded IPP functions in a multithreaded environment?

I am attaching an example of threading the morphology functions. You can compile and run it to understand how it works.

Thanks for using IPP.

Andrey_B_Intel
Employee

steffenroeber wrote:

What about the ippiHoughLine_Region_8u32f_C1R?

The result of this function is a sorted list of lines. You can split the region by angles and process the parts in parallel. But you need to keep the following in mind:

For example, the single-threaded version returns 10 lines sorted by weight from the whole image.

In parallel mode every thread returns 10 lines too, so the total number of lines is 10*(number of threads), e.g. 50 lines with 5 threads. After the multithreaded version finishes, you need to analyze these 50 lines and select the 10 strongest ones. They will be equal to the result of the single-threaded function. For example, you can count the number of pixels on every returned line. Of course this is overhead, but unfortunately the current API does not provide information about the weight of a line.
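The final selection step could be sketched as follows. ScoredLine, by_weight_desc and select_strongest are hypothetical names. The weight field is assumed to be something you compute yourself (for example the pixel count on each line), since the IPP function does not return it.

```c
#include <assert.h>
#include <stdlib.h>

/* A detected line plus a weight computed by the caller (assumption:
   e.g. the number of edge pixels lying on the line). */
typedef struct {
    float rho;
    float theta;
    int   weight;
} ScoredLine;

/* qsort comparator: larger weights first. */
int by_weight_desc(const void *a, const void *b)
{
    const ScoredLine *la = (const ScoredLine *)a;
    const ScoredLine *lb = (const ScoredLine *)b;
    return (lb->weight > la->weight) - (lb->weight < la->weight);
}

/* Sort the merged candidates from all threads by weight and return how
   many of the leading entries to keep (at most maxLineCount). */
int select_strongest(ScoredLine *candidates, int count, int maxLineCount)
{
    assert(count >= 0 && maxLineCount >= 0);
    qsort(candidates, (size_t)count, sizeof *candidates, by_weight_desc);
    return count < maxLineCount ? count : maxLineCount;
}
```

With 5 threads returning 10 lines each, you would copy all 50 lines into one array, fill in the weights, and call select_strongest(lines, 50, 10).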

steffenroeber
Beginner

What do you mean by "split region by angles"?

Andrey_B_Intel
Employee

steffenroeber wrote:

What do you mean by "split region by angles"?

Please look at the description of the function in the manual:

IppStatus ippiHoughLine_Region_8u32f_C1R(const Ipp8u* pSrc, int srcStep, IppiSize roiSize, IppPointPolar* pLine, IppPointPolar dstRoi[2], int maxLineCount, int* pLineCount, IppPointPolar delta, int threshold, Ipp8u* pBuffer);

"dstRoi Specifies the range of parameters of straight lines to be detected." This means that the function returns only lines whose angles lie between dstRoi[0].theta and dstRoi[1].theta. For a multithreaded version you can split the region into N parts with an angle step of (dstRoi[1].theta-dstRoi[0].theta)/N and call the function in every thread with its own dstRoi parameter. The code could be:

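deltaTheta = (dstRoiST[1].theta - dstRoiST[0].theta) / N;

for (n = 0; n < N; n++) {
    /* Every part keeps the full rho range. */
    dstRoiMT[0].rho = dstRoiST[0].rho;
    dstRoiMT[1].rho = dstRoiST[1].rho;
    /* Part n covers the n-th slice of the theta range; both bounds are
       measured from the lower bound dstRoiST[0].theta. */
    dstRoiMT[0].theta = dstRoiST[0].theta + n * deltaTheta;
    dstRoiMT[1].theta = dstRoiST[0].theta + (n + 1) * deltaTheta;
    /* ... call ippiHoughLine_Region_8u32f_C1R for part n with dstRoiMT ... */
}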
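As a self-contained sketch of the angle split, the helper below computes the range for part n of N. PointPolar is a local stand-in for IppPointPolar so that the code compiles without ipp.h, and split_theta_range is a hypothetical name. Pinning the last part to the original upper bound is an added precaution against floating-point drift, not something the IPP manual prescribes.

```c
#include <assert.h>

/* Local stand-in for IppPointPolar (assumption: two float fields). */
typedef struct {
    float rho;
    float theta;
} PointPolar;

/* Compute the Hough-space range for part n of N.
   full[0]/full[1] are the lower/upper bounds of the whole range. */
void split_theta_range(const PointPolar full[2], int n, int N,
                       PointPolar part[2])
{
    float deltaTheta;

    assert(N >= 1 && n >= 0 && n < N);
    deltaTheta = (full[1].theta - full[0].theta) / (float)N;

    /* Every part keeps the full rho range. */
    part[0].rho = full[0].rho;
    part[1].rho = full[1].rho;

    /* Both theta bounds are measured from the lower bound of the full
       range; the last part ends exactly at the original upper bound. */
    part[0].theta = full[0].theta + (float)n * deltaTheta;
    part[1].theta = (n == N - 1)
                  ? full[1].theta
                  : full[0].theta + (float)(n + 1) * deltaTheta;
}
```

Each thread n would call split_theta_range with its own index and pass the resulting part as the dstRoi argument.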

steffenroeber
Beginner

Ok. This function also works. Now the next one: ippiHoughLine_8u32f_C1R

Here I have a ROI. Can I use that for parallelization?

Andrey_B_Intel
Employee

steffenroeber wrote:

Ok. This function also works. Now the next one: ippiHoughLine_8u32f_C1R

Here I have a ROI. Can I use that for parallelization?

Sorry, but I don't understand the question.

You cannot parallelize ippiHoughLine_8u32f_C1R or ippiHoughLine_Region_8u32f_C1R by splitting the ROI into tiles. Both functions use the pixels of the whole image, so for a correct parallelization you can only split the Hough space. Therefore ippiHoughLine_8u32f_C1R cannot be parallelized, because its API has no way to split the Hough space. But you can replace ippiHoughLine_8u32f_C1R with ippiHoughLine_Region_8u32f_C1R, using lines from the range [0..2PI].

 
