ippiDFTInv_CToC_32fc_C1R function is slower in PSXE2016 than in PS2011

Pham_Minh_N_ · ‎06-14-2016

Hi everybody,

I migrated source code from VS2008 + PS2011 + Intel C++ Compiler (project [Before]) to VS2015 + PSXE2016 + Intel C++ Compiler (project [After]).

The problem is that the execution time of the ippiDFTInv_CToC_32fc_C1R function in project [After] is longer than in project [Before].

Details are below (for the sample source code, please refer to the attached file):

Before migration (ms) | After migration (ms) | Deviation (ms)
20.076                | 28.145               | 8.069

Note: PC configuration

- OS: Win7 Enterprise SP1 64bit

- CPU: Intel Core i3-3220 (3.30 GHz)

- RAM: 8GB.

Currently, I don't know why the execution time of the ippiDFTInv_CToC_32fc_C1R function in project [After] is longer than in project [Before].

Could you please help me understand the cause?

Best regards,

NhanPham.

Anoop_M_Intel · ‎06-14-2016

I have redirected this issue to the IPP forum for a better response.

Thanks and Regards
Anoop

Igor_A_Intel · ‎06-15-2016

Hello,

Which IPP version (arch, dll/static, st/mt, etc.) do you use? Could you provide the output of GetLibVersion()? Please insert this call just before the call to the DFT:

const IppLibraryVersion* lib;
lib = ippiGetLibVersion();
printf("%s %s %d.%d.%d.%d\n", lib->Name, lib->Version, lib->major, lib->minor, lib->majorBuild, lib->build);

regards, Igor.

Pham_Minh_N_ · ‎06-15-2016

Hi Igor Astakhov,

Thanks for your quick response.

Project [Before] uses IPP version:

ippsy8-7.0.dll+ 7.0 build 205.7 7.0.205.1008

Project [After] uses IPP version:

ippSP AVX (e9) 9.0.2 (r49912) 9.0.2.49912

Best regards,

NhanPham.

Igor_A_Intel · ‎06-15-2016

Hi Nhan Pham,

Each IPP release ships with the IPP PS (Performance System). I have checked both libraries, 7.0 (y8) and 9.0.2 (e9), and see that 9.0.2 is 1.5x to 7x faster, depending on size. You did not provide the output of ippiGetLibVersion: (a) your first post concerns an ippIP function, but your last reply refers to ippSP functionality; (b) for 7.0 you use dynamic linking (dynamic libraries in 7.0 are multithreaded by default), but it is not clear which linking (dynamic or static) you use for 9.0.2.

regards, Igor

Gennady_F_Intel · ‎06-15-2016

NhanPham, it seems to me that you put a printf inside the timed region. Was that done intentionally? Could you move this printf outside the timed region and check the performance again?

...................

   InitializeTimer();
   __int64 lstime = GetTimerCounter();
   for(int i =0; i< 100; i++)
   {
       DFTFunction(input, output, sizeX, sizeY, sizeZ);
   }
   printf("output[100] = (%f, %f)\n", output[100].re, output[100].im);
   printf("Execution time: %f\n", GetExecutionTime(lstime));

..................
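A minimal sketch of the measurement pattern being asked for, assuming a stand-in workload in place of DFTFunction: the timer is read before any output is printed. The names work and time_loop_ms are hypothetical, and the standard clock() call is used only because it is portable; it is not the timer from the attached project.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for DFTFunction: any CPU-bound workload will do. */
static double work(int n)
{
    double acc = 0.0;
    for (int i = 1; i <= n; i++)
        acc += 1.0 / (double)i;
    return acc;
}

/* Time `reps` calls of work() and return the elapsed milliseconds.
   The last result is stored through `out` so the caller can print it
   after the timer has been stopped. */
static double time_loop_ms(int reps, int n, double *out)
{
    assert(reps > 0 && out != NULL);
    double last = 0.0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        last = work(n);
    clock_t t1 = clock();          /* stop timing before any I/O happens */
    *out = last;
    return 1000.0 * (double)(t1 - t0) / (double)CLOCKS_PER_SEC;
}
```

A caller would invoke time_loop_ms(100, 100000, &result) and only then print both result and the returned time, so neither printf contributes to the measurement.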

Sergey_K_Intel · ‎06-15-2016

Besides, I see many 'malloc' calls within DFTFunction, as well as the GetSize and Init calls, so all of them are included in the measured time.

Could you benchmark only the ippiDFTInv_CToC_32fc_C1R call itself?
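To illustrate the separation suggested here, the following sketch splits a computation into one-time setup (allocation and table initialization) and a compute-only call, and times only the latter. Plan, plan_create, plan_run, plan_destroy and bench_run_ms are hypothetical names standing in for the spec, buffer and transform call; none of them are IPP functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for a DFT spec plus its work buffer. */
typedef struct {
    size_t  n;
    double *table;    /* precomputed once, like the spec filled in by Init */
    double *scratch;  /* work buffer, like pBuffer */
} Plan;

/* One-time setup: all allocation and initialization happens here. */
static Plan *plan_create(size_t n)
{
    assert(n > 0);
    Plan *p = malloc(sizeof *p);
    if (p == NULL) return NULL;
    p->n = n;
    p->table = malloc(n * sizeof *p->table);
    p->scratch = malloc(n * sizeof *p->scratch);
    if (p->table == NULL || p->scratch == NULL) {
        free(p->table); free(p->scratch); free(p);
        return NULL;
    }
    for (size_t i = 0; i < n; i++)
        p->table[i] = 1.0 / (double)(i + 1);
    return p;
}

static void plan_destroy(Plan *p)
{
    if (p != NULL) { free(p->table); free(p->scratch); free(p); }
}

/* Compute only: no allocation, so this is the part worth timing. */
static double plan_run(const Plan *p, const double *in)
{
    double acc = 0.0;
    for (size_t i = 0; i < p->n; i++) {
        p->scratch[i] = in[i] * p->table[i];
        acc += p->scratch[i];
    }
    return acc;
}

/* Milliseconds per plan_run call, excluding setup and teardown. */
static double bench_run_ms(const Plan *p, const double *in, int reps, double *sink)
{
    assert(reps > 0 && sink != NULL);
    *sink = plan_run(p, in);               /* warm-up call, not timed */
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        *sink += plan_run(p, in);
    clock_t t1 = clock();
    return 1000.0 * (double)(t1 - t0) / (double)CLOCKS_PER_SEC / (double)reps;
}
```

With this layout the cost of creating and destroying the plan can be measured separately if it matters, but it no longer hides inside the per-call figure.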

Pham_Minh_N_ · ‎06-16-2016

Hi Igor Astakhov,

Here is more information.

Project [Before] uses IPP version:

ippiy8-7.0.dll+ 7.0 build 205.7 7.0.205.1004 (dynamic library)

Project [After] uses IPP version:

ippIP AVX (e9) 9.0.0 (r47849) 9.0.0.47849 (dynamic library)

You said that you tested and that the 9.0 libraries are 1.5 to 1.7 times faster than the 7.0 libraries. However, that result does not come from the project I attached.

Here is part of the code from the attached project:

   roiSize.width = sizeX;
   roiSize.height = sizeY;

   if (ippStsNoErr == ippiDFTGetSize_C_32fc(roiSize, IPP_FFT_DIV_INV_BY_N, ippAlgHintAccurate, &nSpecSize, &nInitSize, &size))
   {
       // Allocate memory
       dftSpec = (IppiDFTSpec_C_32fc*)ippMalloc(nSpecSize);
       if (nInitSize > 0)
       {
           pbyInit = (Ipp8u*)ippsMalloc_8u(nInitSize);
       }
       // Initializes the context structure for the image DFT functions
       if (ippStsNoErr == ippiDFTInit_C_32fc(roiSize, IPP_FFT_DIV_INV_BY_N, ippAlgHintAccurate, dftSpec, pbyInit))
       {
           pBuffer = ippsMalloc_8u(((size > FTBuffSizeMin) ? size : FTBuffSizeMin));
           if (NULL != pBuffer)
           {
               for (int zc = 0; zc < sizeZ; zc++)
               {
                   ippiDFTInv_CToC_32fc_C1R(input, sizeX * sizeof(Ipp32fc), output, sizeX * sizeof(Ipp32fc), dftSpec, pBuffer);
               }
           }
       }
   }

I tried many size patterns, such as:

sizeX = 478, sizeY = 454, sizeZ = 64

sizeX = 478, sizeY = 454, sizeZ = 1

sizeX = 4780, sizeY = 454, sizeZ = 64

sizeX = 47800, sizeY = 454, sizeZ = 64

......

With these patterns, the DFT functions in the 9.0 libraries are always slower than the DFT functions in the 7.0 libraries.
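As a side note on the sizes listed above, the sketch below factors a transform length by trial division. DFT implementations are generally fastest when the length factors into small primes, and every sizeX in the list contains the prime 239 while sizeY = 454 contains the prime 227. The helper name largest_prime_factor is hypothetical, and this observation alone does not explain a difference between two library versions; it only shows that these sizes are not the easy case.

```c
#include <assert.h>

/* Returns the largest prime factor of n (n >= 2), found by trial division. */
static unsigned largest_prime_factor(unsigned n)
{
    assert(n >= 2);
    unsigned largest = 1;
    for (unsigned f = 2; f <= n / f; f++) {
        while (n % f == 0) {       /* strip every power of f */
            largest = f;
            n /= f;
        }
    }
    if (n > 1)                     /* whatever remains is itself prime */
        largest = n;
    return largest;
}
```

For the sizes in this post, largest_prime_factor gives 239 for 478, 4780 and 47800, and 227 for 454, compared with 2 for a power-of-two length such as 512.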

If you have not run this source code yet, please run it to check the results.

Thanks,

Pham Minh Nhan

Igor_A_Intel · ‎06-17-2016

Hi Pham Minh Nhan,

I measured DFT performance using very simple code:

#include <stdio.h>
#include "ipp.h"

#define N_LOOP 1000
#define WIDTH  454
#define HEIGHT 478

int main()
{
    int sizeBuf, sizeSpec, sizeIni, i, j, srcStep, dstStep;
    IppiDFTSpec_C_32fc *ctxDFT;
    IppStatus status;
    Ipp32fc *src, *dst, *tmpDst;
    Ipp32f *tmpSrc;
    Ipp8u *buf;
    IppiSize roi = { WIDTH, HEIGHT };
    Ipp64u c1, c2;
    double cpe;
    const IppLibraryVersion* lib;

    ippInit();
    lib = ippiGetLibVersion();
    printf("build = %d\n", lib->build);
    printf("targetCpu = %s\n", lib->targetCpu);
    printf("Name = %s\n", lib->Name);
    printf("Version = %s\n", lib->Version);
    printf("BuildDate = %s\n", lib->BuildDate);

    // one-time setup, kept outside the timed region
    status = ippiDFTGetSize_C_32fc( roi, IPP_FFT_DIV_INV_BY_N, ippAlgHintAccurate, &sizeSpec, &sizeIni, &sizeBuf );
    ctxDFT = (IppiDFTSpec_C_32fc*)ippMalloc( sizeSpec );
    // one buffer serves both the init call and the transform
    buf = (Ipp8u*)ippMalloc( IPP_MAX( sizeBuf, sizeIni ));
    status = ippiDFTInit_C_32fc( roi, IPP_FFT_DIV_INV_BY_N, ippAlgHintAccurate, ctxDFT, buf );
    src = ippiMalloc_32fc_C1( WIDTH, HEIGHT, &srcStep );
    dst = ippiMalloc_32fc_C1( WIDTH, HEIGHT, &dstStep );

    // init src: write a real test image into dst, then copy it into src as complex values
    status = ippiImageJaehne_32f_C1R( (Ipp32f*)dst, dstStep, roi );
    for( j = 0; j < HEIGHT; j++ ){
        tmpSrc = (Ipp32f*)((Ipp8u*)dst + j * dstStep );
        tmpDst = (Ipp32fc*)((Ipp8u*)src + j * srcStep );
        for( i = 0; i < WIDTH; i++ ){
            tmpDst[i].re = tmpSrc[i];
            tmpDst[i].im = -tmpSrc[i];
        }
    }

    // warm cache
    status = ippiDFTInv_CToC_32fc_C1R( src, srcStep, dst, dstStep, ctxDFT, buf );

    // measure perf: only the transform call is inside the timed loop
    c1 = ippGetCpuClocks();
    for( i = 0; i < N_LOOP; i++ ){
        status = ippiDFTInv_CToC_32fc_C1R( src, srcStep, dst, dstStep, ctxDFT, buf );
    }
    c2 = ippGetCpuClocks();

    // clocks per element
    cpe = ((double)c2 - (double)c1)/((double)WIDTH * (double)HEIGHT * (double)N_LOOP);
    printf ( "size = %d x %d, cpe = %f\n", WIDTH, HEIGHT, cpe );

    ippiFree( src );
    ippiFree( dst );
    ippFree( buf );
    ippFree( ctxDFT );
    return 0;
}

The results show that the performance of 8.2.3 and 9.0.3 is essentially the same:

build = 48108
targetCpu = h9
Name = ippIP AVX2 (h9)
Version = 8.2.3 (r48108)
BuildDate = Jul 23 2015
size = 454 x 478, cpe = 48.161625
Press any key to continue . . .

build = 51269
targetCpu = h9
Name = ippIP AVX2 (h9)
Version = 9.0.3 (r51269)
BuildDate = Apr 8 2016
size = 454 x 478, cpe = 49.619253
Press any key to continue . . .

build = 48108
targetCpu = h9
Name = ippIP AVX2 (h9)
Version = 8.2.3 (r48108)
BuildDate = Jul 23 2015
size = 512 x 512, cpe = 13.975082
Press any key to continue . . .

build = 51269
targetCpu = h9
Name = ippIP AVX2 (h9)
Version = 9.0.3 (r51269)
BuildDate = Apr 8 2016
size = 512 x 512, cpe = 14.034056
Press any key to continue . . .
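To make the comparison of these figures concrete, here is a small sketch of the arithmetic involved: the clocks-per-element formula used by the benchmark, and the relative change between two measurements. The function names clocks_per_element and relative_change_pct are hypothetical helpers, not part of IPP.

```c
#include <assert.h>

/* Clocks per element: total clocks / (width * height * loops). */
static double clocks_per_element(unsigned long long c1, unsigned long long c2,
                                 int width, int height, int loops)
{
    assert(c2 >= c1 && width > 0 && height > 0 && loops > 0);
    return ((double)c2 - (double)c1) /
           ((double)width * (double)height * (double)loops);
}

/* Relative change of `after` versus `before`, in percent. */
static double relative_change_pct(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (after - before) / before;
}
```

Applied to the numbers in this thread, relative_change_pct(48.161625, 49.619253) is about 3 percent for 454 x 478 and relative_change_pct(13.975082, 14.034056) is about 0.4 percent for 512 x 512, whereas the originally reported timings, relative_change_pct(20.076, 28.145), differ by about 40 percent.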

regards, Igor

Pham_Minh_N_ · ‎06-20-2016

Hi Igor Astakhov,

Thanks for your answer,

Currently, my business logic code is as in the attached file, so I cannot change it to match your sample code.

When migrating the IPP library to a higher version (from IPP 7 to IPP 9), I expected the performance to improve, but in my project the opposite happened.

I suspect that the configuration of my project (release mode) is not correct, but I don't know where the problem is.

Thanks,

NhanPham.