Parallelize single threaded IPP functions

steffenroeber · ‎10-29-2014

Hi,

is there any description available that describes how to use single threaded IPP functions in a multithreaded environment?

'Simple' functions such as ippiHSVToRGB_8u_C3R that do not use buffers, specs or borders are no problem: each thread works on its own ROI. But how should the ROI sizes be chosen (with regard to CPU caches, memory alignment and so on)?

And how do I use filters with borders (ippiFilterScharrHorizMaskBorder_8u16s_C1R), an FFT with a spec (ippiFFTFwd_RToPack_32f_C1R), or other functions that use a buffer (ippiFastMarching_8u32f_C1R)?

Chao_Y_Intel · ‎10-30-2014

Hello,
Can you check the following IPP sample? It provides some examples showing how to call IPP functions with external threading:
ipp\examples\ipp-examples.zip

The examples\ipp-examples\examples\ipp_thread folder provides example code for filters and other functions. The document folder of that example contains the steps to build and run the sample code.

Thanks,
Chao

Gennady_F_Intel · ‎10-30-2014

Please see another example, which shows how to call FilterScharr from a multithreaded environment.

steffenroeber · ‎10-30-2014

What are the border flags ippBorderInMemTop and ippBorderInMemBottom used for?

steffenroeber · ‎10-30-2014

In function ippiFilterScharrHorizMaskBorderGetBufferSize_mt of your example:

what's the logic of:

bufsize = (bufsize + 63) & ~63;

Sergey_K_Intel · ‎10-30-2014

steffenroeber wrote:

In function ippiFilterScharrHorizMaskBorderGetBufferSize_mt of your example:

what's the logic of:

bufsize = (bufsize + 63) & ~63;

Hi,

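It rounds bufsize up to the nearest multiple of 64 (leaving it unchanged if it is already a multiple of 64). It is close to "bufsize += 64 - bufsize % 64;", except that the latter adds a full 64 when bufsize is already divisible by 64.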
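To make the difference between the two expressions concrete, here is a minimal sketch. The function names round_up_64 and bump_to_next_64 are made up for illustration and are not part of IPP or of the attached example.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of 64; n is returned unchanged
   when it is already a multiple of 64. Same arithmetic as
   "bufsize = (bufsize + 63) & ~63;". */
size_t round_up_64(size_t n)
{
    return (n + 63u) & ~(size_t)63u;
}

/* The "n += 64 - n % 64" variant always adds between 1 and 64,
   so an input that is already a multiple of 64 grows by 64. */
size_t bump_to_next_64(size_t n)
{
    assert(n <= (size_t)-1 - 64u);   /* guard against overflow */
    return n + (64u - n % 64u);
}
```

For any size that is not a multiple of 64 both functions give the same result; they differ only for exact multiples, where the masking form wastes no memory.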

steffenroeber · ‎10-30-2014

OK, wrong question. Why must bufSize be divisible by 64?

steffenroeber · ‎10-30-2014

And why is there a "bufsize *= mxnthr;"? bufSize is already computed for the complete image ROI.

Ivan_Z_Intel · ‎10-31-2014

1.

What are the border flags ippBorderInMemTop and ippBorderInMemBottom used for?

In this example the source image is cut into horizontal stripes. For the top stripe (processed by thread #0) the rows below it are real pixels, so its border type must include ippBorderInMemBottom. Conversely, for the bottom stripe (processed by thread #(max_nums_thread-1)) the rows above it are real pixels, so its border type must include ippBorderInMemTop. For all stripes in between, the rows on both sides are real pixels, so the border type must include ippBorderInMemTop | ippBorderInMemBottom.
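The rule above can be sketched as a small helper. The flag constants below are stand-ins defined only so that the snippet compiles without the IPP headers; real code would use ippBorderInMemTop and ippBorderInMemBottom, and the function name stripe_border_flags is hypothetical.

```c
#include <assert.h>

/* Stand-in flag values for illustration only. Real code would use
   ippBorderInMemTop and ippBorderInMemBottom from the IPP headers. */
enum {
    STRIPE_IN_MEM_TOP    = 0x10,
    STRIPE_IN_MEM_BOTTOM = 0x20
};

/* Returns the "in memory" flags for stripe idx out of count
   horizontal stripes, where stripe 0 is the top of the image. */
int stripe_border_flags(int idx, int count)
{
    int flags = 0;
    assert(count > 0 && idx >= 0 && idx < count);
    if (idx > 0)
        flags |= STRIPE_IN_MEM_TOP;      /* real rows exist above */
    if (idx < count - 1)
        flags |= STRIPE_IN_MEM_BOTTOM;   /* real rows exist below */
    return flags;
}
```

With a single stripe no flag is set, so the whole image keeps its ordinary border. The assumption here is that the returned flags are then combined with the base border type that the caller wants for the outer image edges.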

2.

Why must bufSize be divisible by 64?

Each thread needs its own buffer. This bufSize is the buffer size for one thread. If bufSize is divisible by 64 and the start of the whole allocation is 64-byte aligned, then every thread's buffer starts on a 64-byte boundary, which is important for performance.
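As an illustration of how one allocation can be carved into per-thread slices, here is a small sketch. The helper names thread_buffer and is_aligned_64 are invented for this example; buf_step plays the role of the bufStep value returned by the buffer-size function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Start of the slice that belongs to thread tid inside one shared
   allocation. buf_step is the (rounded) size of one thread's slice. */
unsigned char *thread_buffer(unsigned char *base, size_t buf_step, int tid)
{
    assert(base != NULL && tid >= 0);
    return base + (size_t)tid * buf_step;
}

/* Non-zero when p lies on a 64-byte boundary. */
int is_aligned_64(const void *p)
{
    return ((uintptr_t)p & 63u) == 0;
}
```

If base is 64-byte aligned and buf_step is a multiple of 64, is_aligned_64 holds for every slice. With a step such as 100 bytes the second slice would already start at an unaligned address.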

3.

Why is there a "bufsize *= mxnthr;"? bufSize is already computed for the complete image ROI.

Yes, that is a mistake. Thanks for noticing.

The code below is better. It saves memory because the per-thread buffer size is computed for a single stripe instead of the whole image:

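#include <omp.h>
#include <ipp.h>

/* Computes the work-buffer size for a striped, multithreaded Scharr filter.
   In:  *numthr is the requested number of threads.
   Out: *numthr is the number of threads actually used,
        *bufStep is the size of one thread's slice,
        *pBufferSize is the total size to allocate. */
void ippiFilterScharrHorizMaskBorderGetBufferSize_mt(IppiSize dstRoiSize, IppiMaskSize mask, IppDataType srcDataType, IppDataType dstDataType, int numChannels, int *pBufferSize, int *bufStep, int *numthr)
{
    int bufsize;
    int mxnthr = 1;
    if (*numthr <= 1) {
        /* Single thread: one buffer for the whole ROI. */
        ippiFilterScharrHorizMaskBorderGetBufferSize(dstRoiSize, mask, srcDataType, dstDataType, numChannels, &bufsize);
        *bufStep = bufsize;
    } else {
        int hd;
        int hr;
        mxnthr = omp_get_max_threads();
        if (mxnthr > *numthr) {
            /* Never use more threads than the caller asked for. */
            mxnthr = *numthr;
            omp_set_num_threads(mxnthr);
        }
        hd = dstRoiSize.height / mxnthr;   /* rows per stripe */
        hr = dstRoiSize.height % mxnthr;   /* leftover rows */
        dstRoiSize.height = hd + hr;       /* tallest stripe any thread may get */
        ippiFilterScharrHorizMaskBorderGetBufferSize(dstRoiSize, mask, srcDataType, dstDataType, numChannels, &bufsize);
        bufsize = (bufsize + 63) & ~63;    /* round up to a multiple of 64 */
        *bufStep = bufsize;                /* size of one thread's slice */
        bufsize *= mxnthr;                 /* one slice per thread */
    }
    *pBufferSize = bufsize;
    *numthr = mxnthr;
}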
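The buffer is sized for a stripe of hd + hr rows, which suggests that one stripe absorbs the leftover rows. A minimal sketch of such a row split follows, assuming the last stripe takes the remainder; the attached example may distribute the rows differently. The Stripe type and the stripe_rows function are hypothetical names.

```c
#include <assert.h>

/* Hypothetical helper type: first row and row count of one stripe. */
typedef struct {
    int y;
    int height;
} Stripe;

/* Rows handled by thread tid when total_height rows are split among
   nthreads threads. Every stripe gets hd rows; the last stripe also
   takes the hr leftover rows, so no stripe exceeds hd + hr rows. */
Stripe stripe_rows(int total_height, int nthreads, int tid)
{
    Stripe s;
    int hd, hr;
    assert(total_height > 0 && nthreads > 0);
    assert(tid >= 0 && tid < nthreads);
    hd = total_height / nthreads;
    hr = total_height % nthreads;
    s.y = tid * hd;
    s.height = (tid == nthreads - 1) ? hd + hr : hd;
    return s;
}
```

For 10 rows and 4 threads this gives stripes of 2, 2, 2 and 4 rows. When there are more threads than rows, hd is 0 and all rows end up in the last stripe, so a caller would normally clamp the thread count first.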

Thanks!

steffenroeber · ‎10-31-2014

Thank you for that explanation. Now it works.

But next questions:

What about linear transformations? For example: ippiFFTInit_R_32f. There are

IppiFFTSpec_R_32f* pFFTSpec

Ipp8u* pMemInit

Can these be shared between threads?

Is it possible to parallelize ippiHoughLine_Region_8u32f_C1R and similar functions?

Alexey_Tyndyuk · ‎10-31-2014

Hi,

You can't share a single pMemInit buffer in multiple threads, but you can share FFTSpec.
If you need to run the same type and size of FFT in multiple threads, you can initialize FFTSpec only once, and then free pMemInit after initialization. Once you initialize FFTSpec (IppiFFTSpec_R_32f), you can share it between multiple threads since it is not modified by ippiFFT processing functions. But you will need a separate work buffer (pBuffer) for each ippiFFT function running in its own thread.
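The sharing rule can be illustrated without IPP by a toy stand-in: one read-only spec object is passed to every task, while each task writes only to its own slice of a work buffer. ToySpec, toy_transform and process_rows are invented for this sketch and do not correspond to real IPP calls. The OpenMP pragma is simply ignored when the code is compiled without OpenMP support.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an initialised spec such as
   IppiFFTSpec_R_32f: it is only read during processing. */
typedef struct {
    float scale;
} ToySpec;

/* Hypothetical stand-in for a processing call. It reads spec and
   uses work as private scratch space before writing dst. */
void toy_transform(const float *src, float *dst, int n,
                   const ToySpec *spec, float *work)
{
    int i;
    for (i = 0; i < n; ++i)
        work[i] = src[i] * spec->scale;
    for (i = 0; i < n; ++i)
        dst[i] = work[i];
}

/* Processes ntasks rows of n floats each. The spec is shared by all
   tasks; work must hold ntasks * n floats so that every task has its
   own slice and no two tasks touch the same scratch memory. */
void process_rows(const float *src, float *dst, int n, int ntasks,
                  const ToySpec *spec, float *work)
{
    int t;
    assert(n > 0 && ntasks > 0);
    #pragma omp parallel for
    for (t = 0; t < ntasks; ++t) {
        size_t off = (size_t)t * (size_t)n;
        toy_transform(src + off, dst + off, n, spec, work + off);
    }
}
```

For simplicity the sketch gives every task its own work slice. In real code it is enough to keep one work buffer per thread and reuse it for all tasks that thread executes.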

You can find an example for ippiFFTInit_R_32f here:
https://software.intel.com/en-us/node/504249

Best regards,
Alexey

steffenroeber · ‎11-04-2014

What about the ippiHoughLine_Region_8u32f_C1R?

Andrey_B_Intel · ‎11-05-2014

steffenroeber wrote:

Hi Steffen.

is there any description available that describes how to use single threaded IPP functions in a multithreaded environment?

I am attaching an example of threading morphology functions. You can compile and run it to understand how it works.

Thanks for using IPP.

Andrey_B_Intel · ‎11-05-2014

steffenroeber wrote:

What about the ippiHoughLine_Region_8u32f_C1R?

The result of this function is a sorted list of lines. You can split the region by angles and process the parts in parallel. But you need to keep the following in mind:

For example, the single-threaded version returns 10 lines, sorted by weight, from the whole image.

In parallel mode every thread returns 10 lines too, so the total number of lines is 10*(number of threads), for example 50 lines with 5 threads. After the multithreaded version finishes, you need to analyze these 50 lines and select the 10 strongest. They will be equal to the result of the single-threaded function. For example, you can count the number of pixels on every returned line. Of course this is overhead, but unfortunately the current API does not provide information about the weight of a line.
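A rough sketch of that selection step is shown below. It assumes the usual normal form rho = x*cos(theta) + y*sin(theta) with the origin at pixel (0, 0), which may not match the convention IPP uses, and it counts a pixel as lying on a line when it is within half a pixel of it. ScoredLine, count_line_pixels and keep_strongest are hypothetical names; the caller is expected to pass cos(theta) and sin(theta) computed once per line.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record: one candidate line plus its computed weight. */
typedef struct {
    float rho;
    float theta;
    int weight;
} ScoredLine;

/* Counts non-zero pixels within half a pixel of the line
   rho = x*cos_t + y*sin_t, where cos_t and sin_t are the cosine
   and sine of the line's theta, supplied by the caller. */
int count_line_pixels(const unsigned char *img, int step, int width,
                      int height, double rho, double cos_t, double sin_t)
{
    int x, y, n = 0;
    for (y = 0; y < height; ++y) {
        for (x = 0; x < width; ++x) {
            double d = x * cos_t + y * sin_t - rho;
            if (d < 0.0)
                d = -d;
            if (img[y * step + x] != 0 && d <= 0.5)
                ++n;
        }
    }
    return n;
}

/* qsort comparator: larger weights first. */
static int by_weight_desc(const void *a, const void *b)
{
    int wa = ((const ScoredLine *)a)->weight;
    int wb = ((const ScoredLine *)b)->weight;
    return (wa < wb) - (wa > wb);
}

/* Sorts the merged candidates of all threads, strongest first, and
   returns how many of them to keep (at most max_keep). */
int keep_strongest(ScoredLine *cand, int count, int max_keep)
{
    assert(cand != NULL && count >= 0 && max_keep >= 0);
    qsort(cand, (size_t)count, sizeof *cand, by_weight_desc);
    return count < max_keep ? count : max_keep;
}
```

The caller would copy the lines returned by all threads into one ScoredLine array, fill each weight with count_line_pixels, and then call keep_strongest with max_keep set to the line count of the single-threaded call. A pixel count is only an approximation of the accumulator weight used inside the function, so ties or near-ties may come out in a different order.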

steffenroeber · ‎11-05-2014

What do you mean by "split region by angles"?

Andrey_B_Intel · ‎11-06-2014

steffenroeber wrote:

What do you mean by "split region by angles"?

Please look at the description of the function in the manual:

IppStatus ippiHoughLine_Region_8u32f_C1R(const Ipp8u* pSrc, int srcStep, IppiSize roiSize, IppPointPolar* pLine, IppPointPolar dstRoi[2], int maxLineCount, int* pLineCount, IppPointPolar delta, int threshold, Ipp8u* pBuffer);

"dstRoi Specifies the range of parameters of straight lines to be detected." This means that the function returns only lines whose angles lie between dstRoi[0].theta and dstRoi[1].theta. For a multithreaded version you can split the region into N parts with angle step (dstRoi[1].theta-dstRoi[0].theta)/N and call the function in every thread with its own dstRoi parameter. The code could be:

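deltaTheta = (dstRoiST[1].theta-dstRoiST[0].theta)/N;
for(n=0;n<N;n++){
    dstRoiMT[0].rho=dstRoiST[0].rho;
    dstRoiMT[1].rho=dstRoiST[1].rho;
    dstRoiMT[0].theta=dstRoiST[0].theta+n*deltaTheta;
    /* the upper bound is measured from the start of the range, not from its end */
    dstRoiMT[1].theta=dstRoiST[0].theta+(n+1)*deltaTheta;
    /* call ippiHoughLine_Region_8u32f_C1R for part n with dstRoiMT here */
}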
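The same split can be written as a helper that each thread calls with its own index, so that no dstRoi variable is shared between threads. PolarPoint is a stand-in for IppPointPolar so that the sketch compiles without the IPP headers, and split_theta_range is a hypothetical name.

```c
#include <assert.h>

/* Stand-in for IppPointPolar, for illustration only. */
typedef struct {
    float rho;
    float theta;
} PolarPoint;

/* Fills out[0..1] with part n of the range full[0..1], split into
   `parts` equal angle intervals. The rho range is copied unchanged. */
void split_theta_range(const PolarPoint full[2], int n, int parts,
                       PolarPoint out[2])
{
    float delta;
    assert(parts > 0 && n >= 0 && n < parts);
    delta = (full[1].theta - full[0].theta) / (float)parts;
    out[0].rho = full[0].rho;
    out[1].rho = full[1].rho;
    out[0].theta = full[0].theta + (float)n * delta;
    /* Pin the last part to the exact upper bound so that rounding
       in n * delta cannot shrink or extend the overall range. */
    out[1].theta = (n == parts - 1)
                       ? full[1].theta
                       : full[0].theta + (float)(n + 1) * delta;
}
```

Neighbouring parts share a boundary angle. Whether a line lying exactly on such a boundary is reported by both neighbours is not covered here, so the merge step may need to drop duplicates.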

steffenroeber · ‎11-06-2014

OK, this function also works. Now the next one: ippiHoughLine_8u32f_C1R

Here I have a roi. Can I use that for parallelization?

Andrey_B_Intel · ‎11-07-2014

steffenroeber wrote:

OK, this function also works. Now the next one: ippiHoughLine_8u32f_C1R

Here I have a roi. Can I use that for parallelization?

Sorry, but I don't understand the question.

You cannot parallelize ippiHoughLine_8u32f_C1R and ippiHoughLine_Region_8u32f_C1R by splitting the ROI into tiles. Both functions use the pixels of the whole image, so for a correct parallelization you can only split the Hough space. Therefore ippiHoughLine_8u32f_C1R cannot be parallelized directly, because its API offers no way to split the Hough space. But you can replace ippiHoughLine_8u32f_C1R with ippiHoughLine_Region_8u32f_C1R, requesting lines from the angle range [0..2PI].