Strange Behavior of ippi Integral

Emanon · ‎01-06-2022

Hello,

I have been trying ippiIntegral_8u32s_C1R, and it works fine. However, when I set the integral array to zero before calling the function, which I expected only to add time, the integral function itself becomes a lot faster. If I zero the array with a TBB parallel_for, the zeroing plus the integral together take about half the time that the integral alone takes without zeroing first. Since my work is performance-focused, I would like to understand what is happening. Does ippiIntegral set the output array to zero first, or is there some other reason? (Edit: the IPP version is 2021.5.0.)

VidyalathaB_Intel · ‎01-10-2022

Hi,

Thanks for reaching out to us.

>>i tried to set the Integral array to zero before doing Integral......If I set the array to zero with tbb, then the cost time of parallel_for set array to zero plus integral will be about half the time

Could you please share your OS details, a minimal reproducer (and steps to reproduce, if any) for the scenarios you mentioned, and the timings you are getting, so that we can work on it from our end?

Regards,

Vidya.

Emanon · ‎01-10-2022

My OS is Windows 10 Pro, CPU Intel i7-8750H, 16 GB RAM.

Image size is 16384*24000

The run takes about 520000 microseconds without memset; memset plus integral takes about 260000 microseconds.

Thanks.

INT32 Insp_Run(BYTE* SrcBuf,int SrcW,int SrcH,int RectX,int RectY,int RectW,int RectH)
{
    int dRes=0;
    auto stamp01 = std::chrono::high_resolution_clock::now();
    Ipp32s* ImageIntegrals = new Ipp32s[(SrcW+1)*(SrcH+1)];
    IppiSize roiSize = {RectW,RectH};
    int srcStep = SrcW;      // source step in bytes (8-bit pixels)
    int dstStep = SrcW+1;    // destination row length in elements; converted to bytes in the call below
    // The integral becomes faster when the buffer is simply zeroed with TBB first
    {
        _ImageIntegral_0 ImageIntegral_0;
        ImageIntegral_0.ImageIntegrals = ImageIntegrals;
        parallel_for(tbb::blocked_range<INT32>(0,SrcH+1),ImageIntegral_0);
    }
    dRes = ippiIntegral_8u32s_C1R(SrcBuf+RectY*SrcW+RectX,srcStep,ImageIntegrals+RectY*(SrcW+1)+RectX,dstStep*sizeof(Ipp32s),roiSize,0);
    auto stamp02 = std::chrono::high_resolution_clock::now();
    
    delete []ImageIntegrals;
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stamp02-stamp01);
    std::string strdur = std::to_string(duration.count());
    MessageBoxA(NULL,strdur.c_str(),"",MB_OK);
}

VidyalathaB_Intel · ‎01-11-2022

Hi,

Thanks for getting back to us.

It would be a great help if you share the complete sample reproducer so that it helps us to get more insights regarding this issue.

So, could you please share with us the complete working sample reproducer along with the command you have used to compile the code?

Regards,

Vidya.

Emanon · ‎01-11-2022

I think the code sample I posted should work. Sorry, I do not understand what you mean by a complete sample reproducer.

If you need the version without memset, just comment out everything from the _ImageIntegral_0 declaration to the tbb parallel_for.

It only does a memset of every row in parallel, so I think it is not a big deal. I have also tried a plain memset of the array without parallelism: the runtime of ippiIntegral_8u32s_C1R is still reduced, only the memset itself takes longer.

Edit: sorry, I forgot the return statement. Add "return dRes;" as the last line and it should work.

VidyalathaB_Intel · ‎01-18-2022

Hi,

Thanks for all the information that you have provided.

We are working on your issue and will get back to you soon.

Regards,

Vidya.

Andrey_B_Intel · ‎01-27-2022

Hi Emanon.

Could you please run the "tbb zeroing" in a single thread and provide the perf results?

Andrey B.

IPP

Emanon · ‎01-27-2022

Hi Andrey.

I tried zeroing in a single thread, and the time of ippiIntegral is almost the same as when zeroing with tbb: about 150000 microseconds (ippiIntegral only, not including the zeroing).

Andrey_B_Intel · ‎02-03-2022

Hi Emanon.

To estimate the performance of IPP functions, the following template is recommended:

ipp_func();                 // warm-up call, not timed
t0 = get_timer();
for (n = 0; n < N; n++)
    ipp_func();
t1 = get_timer();
func_time = (t1 - t0) / N;

The first call of the IPP function is skipped because a lot of events happen during it: physical memory allocation, data loading from memory into cache, branch predictor statistics updates, frequency scaling, and so on. In subsequent calls the CPU and data are in a "ready" state, and performance differs.
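The template above can be sketched in C++ roughly as follows. This is only an illustration under stated assumptions: `measure`, `BenchResult`, `naive_integral`, and `run_demo` are hypothetical names, and `naive_integral` is a plain stand-in workload, not the IPP implementation. The first call is timed separately so that its one-time costs can be compared with the steady-state average.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Timing of the first (warm-up) call versus the mean of the following calls.
struct BenchResult {
    double first_call_us;  // includes one-time costs such as page faults and cold caches
    double steady_us;      // mean time of the N calls made after the warm-up
};

// Hypothetical helper: call fn once as warm-up, then average N timed calls.
inline BenchResult measure(const std::function<void()>& fn, int N) {
    assert(N > 0);
    using clock = std::chrono::steady_clock;
    using us = std::chrono::duration<double, std::micro>;

    auto t0 = clock::now();
    fn();                                  // warm-up call, reported separately
    auto t1 = clock::now();
    for (int n = 0; n < N; ++n) fn();      // steady-state calls
    auto t2 = clock::now();

    return { us(t1 - t0).count(), us(t2 - t1).count() / N };
}

// Stand-in workload (an assumption, NOT the IPP code): naive integral image.
// dst has (w+1) x (h+1) elements; its first row and first column are zero.
inline void naive_integral(const std::uint8_t* src, int w, int h, std::int32_t* dst) {
    const std::size_t dw = static_cast<std::size_t>(w) + 1;
    for (std::size_t x = 0; x < dw; ++x) dst[x] = 0;
    for (int y = 0; y < h; ++y) {
        const std::size_t above = static_cast<std::size_t>(y) * dw;
        const std::size_t cur = above + dw;
        std::int32_t row_sum = 0;
        dst[cur] = 0;
        for (int x = 0; x < w; ++x) {
            row_sum += src[static_cast<std::size_t>(y) * w + x];
            dst[cur + x + 1] = dst[above + x + 1] + row_sum;
        }
    }
}

// Example use. Note that std::vector zero-initializes dst, so its pages are
// already touched before the first call, unlike a bare `new Ipp32s[n]`.
inline BenchResult run_demo(int w, int h, int N) {
    std::vector<std::uint8_t> src(static_cast<std::size_t>(w) * h, 1);
    std::vector<std::int32_t> dst((static_cast<std::size_t>(w) + 1) * (h + 1));
    return measure([&] { naive_integral(src.data(), w, h, dst.data()); }, N);
}
```

Comparing `first_call_us` with `steady_us` for a buffer that was never written before the first call gives a rough idea of how much of a single-shot measurement comes from one-time effects rather than from the function itself.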

Could you please redesign your reproducer according to this approach?

Thanks.

Andrey B

Emanon · ‎02-06-2022

Hi Andrey.

Does ippInit() count as an IPP function?

If so, then with ippInit() called first, the run time of ippiIntegral without zeroing the array is 520815.2 microseconds,

and the run time of ippiIntegral with the array zeroed first is 146848.9 microseconds. (Image size is 16384x24000.)

Andrey_B_Intel · ‎03-23-2022

Hi Emanon.

I am attaching a small benchmark that shows how to measure the performance of ippiIntegral (and other functions).

My system is a Xeon Silver 4116. "cpe" means clocks per element; lower is better.

1024 x 1024 start_val=0, num_loops=1, cpe= 0.8743
1024 x 1024 start_val=1, num_loops=1, cpe= 0.8735
1024 x 1024 start_val=0, num_loops=10, cpe= 0.6566
1024 x 1024 start_val=1, num_loops=10, cpe= 0.6469
1024 x 1024 start_val=0, num_loops=1000, cpe= 0.6024
1024 x 1024 start_val=1, num_loops=1000, cpe= 0.6179

The cpe for 1000 runs is about 0.61 and for 1 run about 0.87; we usually go by the performance of multiple runs.

Could you please run this code on your system and provide the results?
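For comparing wall-clock timings with cpe figures, a conversion along these lines may help. It is a sketch only: `clocks_per_element` is a hypothetical helper, and the clock frequency is an assumed nominal value, since the real frequency changes with turbo and power states. The attached benchmark may well count clocks differently.

```cpp
#include <cassert>
#include <cstdint>

// Convert an elapsed time in microseconds to clocks per element (cpe).
// freq_hz is an ASSUMED nominal CPU clock, not a measured one.
inline double clocks_per_element(double elapsed_us,
                                 std::int64_t width,
                                 std::int64_t height,
                                 double freq_hz) {
    assert(width > 0 && height > 0);
    const double clocks = elapsed_us * 1e-6 * freq_hz;  // seconds * cycles per second
    return clocks / (static_cast<double>(width) * static_cast<double>(height));
}
```

As a rough illustration, with the 16384 x 24000 image from this thread and an assumed 2.2 GHz clock, about 150000 microseconds corresponds to roughly 0.84 cpe, and about 520000 microseconds to roughly 2.9 cpe.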

Thanks.

Andrey
