Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

CrossCorrNorm performance issues Ipp17 and Ipp9

Herbert_K_
Beginner
2,332 Views

Hello,

I have the following performance issues with the CrossCorrNorm function.

The first issue concerns different sizes of the source and template, compared to the "old" version of the algorithm.

With a source size of 512, 512 and a template of 256, 256, the calculation time is 3 ms with the new version and 3.8 ms with the old version.

With a source size of 512, 512 and a template of 500, 500, the calculation time is 13 ms with the new version and 3.3 ms with the old version.

Where does this come from?

The second issue occurs with both versions of the algorithm.

With a source size of 512, 512 and a template of 256, 256, the calculation time is 3 ms with the new version and 3.8 ms with the old version.

With a source size of 513, 512 and a template of 256, 256, the calculation time is 5.5 ms with the new version and 6.55 ms with the old version.

The calculation time nearly doubles. What is happening here? The same happens going from 256 to 257 and from 1024 to 1025: the calculation time always doubles.

Here is the source code I used to test this behavior.

#include <iostream>
#include "ipp.h"
#include "ippi90legacy.h"   // legacy ippiCrossCorrValid_NormLevel_8u32f_C1R

// Timer is a user-defined stopwatch class (definition not shown).
void testCrossCorr()
{
    Timer timer;

    IppStatus status;
    IppiSize srcRoiSize = { 513, 512 };
    IppiSize tplRoiSize = { 500, 500 };
    // "Valid" output size: src - tpl + 1 in each dimension.
    IppiSize dstRoiSize = { srcRoiSize.width - tplRoiSize.width + 1, srcRoiSize.height - tplRoiSize.height + 1 };

    int stepBytesSrc = 0;
    Ipp8u* pSrc = ippiMalloc_8u_C1(srcRoiSize.width, srcRoiSize.height, &stepBytesSrc);
    int stepBytesTpl = 0;
    Ipp8u* pTpl = ippiMalloc_8u_C1(tplRoiSize.width, tplRoiSize.height, &stepBytesTpl);
    int stepBytesDst = 0;
    Ipp32f* pDst = ippiMalloc_32f_C1(dstRoiSize.width, dstRoiSize.height, &stepBytesDst);

    // New API: query and allocate the work buffer first.
    IppEnum funCfg = (IppEnum)(ippAlgAuto | ippiROIValid | ippiNorm);
    Ipp8u* pBuffer;
    int bufSize;
    status = ippiCrossCorrNormGetBufferSize(srcRoiSize, tplRoiSize, funCfg, &bufSize);
    if (status != ippStsNoErr) return;
    pBuffer = ippsMalloc_8u(bufSize);

    const int loopSize = 10;

    // Time the new function.
    timer.start();
    for (int i = 0; i < loopSize; i++)
    {
        status = ippiCrossCorrNorm_8u32f_C1R(pSrc, stepBytesSrc, srcRoiSize, pTpl, stepBytesTpl, tplRoiSize, pDst,
            stepBytesDst, funCfg, pBuffer);
    }
    timer.stop();
    std::cout << "\n::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted " <<
        timer.elapsed_time() << "ms\n";

    // Time the old (legacy) function.
    timer.start();
    for (int i = 0; i < loopSize; i++)
    {
        status = ippiCrossCorrValid_NormLevel_8u32f_C1R(pSrc, stepBytesSrc, srcRoiSize,
            pTpl, stepBytesTpl, tplRoiSize, pDst, stepBytesDst);
    }
    timer.stop();
    std::cout << "\n::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted " <<
        timer.elapsed_time() << "ms\n";

    ippiFree(pSrc);
    ippiFree(pTpl);
    ippiFree(pDst);
    ippsFree(pBuffer);
}

 

Regards

Herb

 

 

12 Replies

Igor_A_Intel
Employee

Hi Herbert,

This functionality is optimized using the convolution theorem. Therefore, if one of the dimensions crosses the next power-of-2 boundary, the next-order FFT is used, which obviously increases the time by about 2x.
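
To illustrate the power-of-2 effect, here is a minimal sketch and not the library's actual implementation. It assumes the FFT sides are padded up to the next power of two of the source dimensions alone; the real padding rule is not stated in this thread, and the names `nextPow2` and `paddedFftPoints` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of two that is >= n (n must be positive).
inline std::size_t nextPow2(std::size_t n)
{
    assert(n > 0);
    std::size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

// Hypothetical cost proxy: number of points in a 2-D FFT whose sides are
// padded up to powers of two. The padding rule is an assumption made for
// illustration only.
inline std::size_t paddedFftPoints(std::size_t width, std::size_t height)
{
    return nextPow2(width) * nextPow2(height);
}
```

Under this assumption a 512 x 512 source needs a 512 x 512 transform, while a 513 x 512 source needs 1024 x 512, i.e. twice as many points, which matches the roughly 2x jump reported at 256/257, 512/513 and 1024/1025.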

Regarding the performance differences between IPP versions, please provide the library version output for both:

    const IppLibraryVersion *lib;

    lib = ippiGetLibVersion();
    printf( "CPU       : %s\n", lib->targetCpu );
    printf( "Name      : %s\n", lib->Name );
    printf( "Version   : %s\n", lib->Version );
    printf( "Build date: %s\n", lib->BuildDate );
 

regards, Igor

Herbert_K_
Beginner

Hi Igor,

Thanks for the fast reply. Normally I have a linear processing time depending on the source size and template size. When the calculation time doubles at a specific boundary, that behavior is bad. Maybe an internal clustering would give a better scaling result.

The Ipp9 library is:

targetCPU: I9
Name: ippIP AVX2 (I9 threaded) --> I use this with setNumThreads(1) and also not the mt version of the library
Version: 9.0 Legacy (r48491) (-)
BuildDate: Oct 13 2015

The Ipp18 library is:

targetCPU: I9
Name: ippIP AVX2 (I9) --> I use this with setNumThreads(1) and also not the mt version of the library
Version: 2018.0.3 (r58644)
BuildDate: Apr 7 2018

regards, Herb

 

 

Igor_A_Intel
Employee

OK, got it, thank you.

Also, please tell me which operating system you use: Windows or Linux?

regards, Igor

Herbert_K_
Beginner

The behavior happens on both Windows and Linux. We develop on Windows and our target system is Linux.

Igor_A_Intel
Employee

Hi Herb,

I don't see any issues (same library versions, measured on my T470 laptop, same l9 code version):

static linking:

::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted 227858.505917 cpe

::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted 304642.631953 cpe
Press any key to continue . . .

dynamic linking:

::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted 232007.209467 cpe

::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted 276802.594083 cpe
Press any key to continue . . .

threaded dynamic libs, numThreads = 1:

::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted 51539.084615 cpe

::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted 58517.840237 cpe
Press any key to continue . . .


I slightly modified your code - see below:

#include <stdio.h>
#include "ippdefs.h"
#include "ipp.h"
#include "ippdefs90legacy.h"
#include "ippi90legacy.h"


int main(void)
{

    IppStatus status;
    IppiSize srcRoiSize = { 512,512 };
    IppiSize tplRoiSize = { 500,500 };
    IppiSize dstRoiSize = { srcRoiSize.width - tplRoiSize.width + 1, srcRoiSize.height - tplRoiSize.height + 1 };

    int stepBytesSrc = 0;
    Ipp8u* pSrc = ippiMalloc_8u_C1(srcRoiSize.width, srcRoiSize.height, &stepBytesSrc);
    int stepBytesTpl = 0;
    Ipp8u* pTpl = ippiMalloc_8u_C1(tplRoiSize.width, tplRoiSize.height, &stepBytesTpl);
    int stepBytesDst = 0;
    Ipp32f* pDst = ippiMalloc_32f_C1(dstRoiSize.width, dstRoiSize.height, &stepBytesDst);
    ippiImageJaehne_8u_C1R(pSrc, stepBytesSrc, srcRoiSize);
    ippiImageJaehne_8u_C1R(pTpl, stepBytesTpl, tplRoiSize);

    IppEnum funCfg = (IppEnum)(ippAlgAuto | ippiROIValid | ippiNorm);
    Ipp8u *pBuffer;
    int bufSize;
    status = ippiCrossCorrNormGetBufferSize(srcRoiSize, tplRoiSize, funCfg, &bufSize);
    if (status != ippStsNoErr) return 1;
    pBuffer = ippsMalloc_8u(bufSize);

    Ipp64u strt, stp;

    ippSetNumThreads(1);

    int loopSize = 10;

    status = ippiCrossCorrNorm_8u32f_C1R(pSrc, stepBytesSrc, srcRoiSize, pTpl, stepBytesTpl, tplRoiSize, pDst,
        stepBytesDst, funCfg, pBuffer);

    int nT;
    ippGetNumThreads(&nT);
    //printf("\nNum threads = %d\n", nT);
    strt = ippGetCpuClocks();
    for (int i = 0; i < loopSize; i++)
    {
        status = ippiCrossCorrNorm_8u32f_C1R(pSrc, stepBytesSrc, srcRoiSize, pTpl, stepBytesTpl, tplRoiSize, pDst,
            stepBytesDst, funCfg, pBuffer);

    }

    stp = ippGetCpuClocks();
    Ipp64f tmp = (Ipp64f)(stp - strt);
    tmp = tmp/loopSize;
    tmp = tmp/dstRoiSize.width;
    tmp = tmp/dstRoiSize.height;
    printf( "\n::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted %f cpe\n", tmp);

    strt = ippGetCpuClocks();
    for (int i = 0; i < loopSize; i++)
    {
        status = ippiCrossCorrValid_NormLevel_8u32f_C1R(pSrc, stepBytesSrc, srcRoiSize,
            pTpl, stepBytesTpl, tplRoiSize, pDst, stepBytesDst);

    }

    stp = ippGetCpuClocks();
    tmp = (Ipp64f)(stp - strt);
    tmp = tmp/loopSize;
    tmp = tmp/dstRoiSize.width;
    tmp = tmp/dstRoiSize.height;
    printf( "\n::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted %f cpe\n", tmp);

    ippiFree(pSrc);
    ippiFree(pTpl);
    ippiFree(pDst);
    ippsFree(pBuffer);
}
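
For readers comparing these numbers with the millisecond timings earlier in the thread: the "cpe" value printed by the code above is clocks per destination element, i.e. total clocks divided by the loop count and by the destination width and height. The following is a minimal sketch of that arithmetic without any IPP calls; `Size2D`, `validDstSize` and `clocksPerElement` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Plain width/height pair standing in for IppiSize (hypothetical type).
struct Size2D {
    int width;
    int height;
};

// Size of the "valid" correlation output: src - tpl + 1 per dimension.
inline Size2D validDstSize(Size2D src, Size2D tpl)
{
    assert(src.width >= tpl.width && src.height >= tpl.height);
    return { src.width - tpl.width + 1, src.height - tpl.height + 1 };
}

// Average clocks per destination element over `loops` calls.
inline double clocksPerElement(std::uint64_t totalClocks, int loops, Size2D dst)
{
    assert(loops > 0 && dst.width > 0 && dst.height > 0);
    return static_cast<double>(totalClocks) / loops / dst.width / dst.height;
}
```

With a 512 x 512 source and a 500 x 500 template the destination is only 13 x 13 = 169 elements, so each element carries a large share of the total work. That is why the cpe figures are so high, and why they should only be compared between runs that use identical sizes.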

regards, Igor

 

Herbert_K_
Beginner

Hi Igor,

Thank you for the fast testing. I used your code, also with static linking. The results are:

::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted 224359.924852 cpe
::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted 287181.53017 cpe

Also no performance issues. I have now also installed the threaded library for Ipp18 and get this result:

::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted 59510.233136 cpe
::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted 50513.827219 cpe

Also no performance problem. But I don't understand where this comes from. We both used ippSetNumThreads(1), so where does this roughly 4x performance boost come from? I also checked whether it runs on more than one CPU. It does not. I simply have no explanation for this behavior. Can you help?

Regards, Herb

Herbert_K_
Beginner

Hi Igor,

Do you have an update on this topic? I looked into the list of threaded functions. The CrossCorr function is not listed. I don't understand why the performance of this function depends on the library used. Also, the "threaded" library is no longer available under Linux.

Regards, Herb

Igor_A_Intel
Employee

Hi Herb,

At the moment I don't have a final update. I performed some investigations and found several interesting things in the behavior of the threaded libs that I can't explain just yet. Therefore this is in progress. Regarding your statement that "threaded" libs are no longer available under Linux: which IPP version do you mean? All IPP releases have threaded libs for Linux as well as for Windows.

regards, Igor.

Herbert_K_
Beginner

We used IPP 2019 Update 2. My colleagues said there was no rpm for the threaded version. On Windows the option was available.

Herbert_K_
Beginner

My colleagues found the threaded version. Sorry for the inconvenience.

ArtemMaklaev
Employee

Hi Herb,

The issue with the 4x performance gap between the sequential and threaded (launched with ippSetNumThreads(1)) versions of ippiCrossCorrNorm_8u32f_C1R has been resolved. The root cause was a difference in the algorithm's parameters between threaded and non-threaded modes.

The fix will be available in IPP 2020.

 

The fix for ippiCrossCorrValid_NormLevel_8u32f_C1R (legacy library) is still in progress.

BR, Artem.

Gennady_F_Intel
Moderator

We will update this thread as soon as the fix for the problem is available.
