Solved: Re: High CPU Load with IPP FFTs

pwindle · ‎12-09-2008

I am having a problem when using IPP for performing FFTs. I have written a small test app that demonstrates the problem. Have a look at the following 3 variants of a function I have written:

1. Perform 50 * 32K FFTs in quick succession.

double TForm1::FFT(Ipp64u* pTimes)
{
IppsFFTSpec_C_32f* pFFTSpec = 0;
int iBufferSize = 0;
float* pFloatPointers[20];
char* pExtraBuffer = 0;
int iFFTOrder = 15;
int iFFTSize = 32768;
Ipp64u start, end;

int i = 0;
while(i < 20)
{
pFloatPointers[i] = (float*)ippMalloc(iFFTSize * sizeof(float));
ippsSet_32f(10.0, pFloatPointers[i++], iFFTSize);
}

double t1 = Now();

// Create the FFT Specification structure with the flags and hints we want
ippsFFTInitAlloc_C_32f(&pFFTSpec, iFFTOrder, IPP_FFT_NODIV_BY_ANY, ippAlgHintNone);

// Now work out the size of external buffer we should allocate
// This buffer isn't necessary but FFT calls are faster if it's calculated up front
ippsFFTGetBufSize_C_32f(pFFTSpec, &iBufferSize);

pExtraBuffer = (char*)(ippMalloc(iBufferSize));

for (int iSpectrum = 0; iSpectrum < 10; ++iSpectrum)
{
Sleep(0);
start = ippGetCpuClocks();
ippsFFTFwd_CToC_32f(pFloatPointers[0], pFloatPointers[1], pFloatPointers[10], pFloatPointers[11], pFFTSpec, pExtraBuffer);
end = ippGetCpuClocks();
*pTimes++ = end - start;

Sleep(0);
start = ippGetCpuClocks();
ippsFFTFwd_CToC_32f(pFloatPointers[2], pFloatPointers[3], pFloatPointers[12], pFloatPointers[13], pFFTSpec, pExtraBuffer);
end = ippGetCpuClocks();
*pTimes++ = end - start;

Sleep(0);
start = ippGetCpuClocks();
ippsFFTFwd_CToC_32f(pFloatPointers[4], pFloatPointers[5], pFloatPointers[14], pFloatPointers[15], pFFTSpec, pExtraBuffer);
end = ippGetCpuClocks();
*pTimes++ = end - start;

Sleep(0);
start = ippGetCpuClocks();
ippsFFTFwd_CToC_32f(pFloatPointers[6], pFloatPointers[7], pFloatPointers[16], pFloatPointers[17], pFFTSpec, pExtraBuffer);
end = ippGetCpuClocks();
*pTimes++ = end - start;

Sleep(0);
start = ippGetCpuClocks();
ippsFFTFwd_CToC_32f(pFloatPointers[8], pFloatPointers[9], pFloatPointers[18], pFloatPointers[19], pFFTSpec, pExtraBuffer);
end = ippGetCpuClocks();
*pTimes++ = end - start;
}

// Release the FFT Specification structure allocated earlier
ippsFFTFree_C_32f(pFFTSpec);

double t2 = Now();

ippFree(pExtraBuffer);

i = 0;
while(i < 20)
{
ippFree(pFloatPointers[i++]);
}

return double(t2 - t1) * 86400.0;
}

2. Perform 5 * 32K FFTs per second for 10 seconds. 1 Second sleep followed by 5 FFTs.

double TForm1::FFT(Ipp64u* pTimes)
{
IppsFFTSpec_C_32f* pFFTSpec = 0;
int iBufferSize = 0;
float* pFloatPointers[20];
char* pExtraBuffer = 0;
int iFFTOrder = 15;
int iFFTSize = 32768;
Ipp64u start, end;

int i = 0;
while(i < 20)
{
pFloatPointers[i] = (float*)ippMalloc(iFFTSize * sizeof(float));
ippsSet_32f(10.0, pFloatPointers[i++], iFFTSize);
}

double t1 = Now();

// Create the FFT Specification structure with the flags and hints we want
ippsFFTInitAlloc_C_32f(&pFFTSpec, iFFTOrder, IPP_FFT_NODIV_BY_ANY, ippAlgHintNone);

// Now work out the size of external buffer we should allocate
// This buffer isn't necessary but FFT calls are faster if it's calculated up front
ippsFFTGetBufSize_C_32f(pFFTSpec, &iBufferSize);

pExtraBuffer = (char*)(ippMalloc(iBufferSize));

for (int iSpectrum = 0; iSpectrum < 10; ++iSpectrum)
{
Sleep(1000);
start = ippGetCpuClocks();
ippsFFTFwd_CToC_32f(pFloatPointers[0], pFloatPointers[1], pFloatPointers[10], pFloatPointers[11], pFFTSpec, pExtraBuffer);
end = ippGetCpuClocks();
*pTimes++ = end - start;

Sleep(0);
start = ippGetCpuClocks();
ippsFFTFwd_CToC_32f(pFloatPointers[2], pFloatPointers[3], pFloatPointers[12], pFloatPointers[13], pFFTSpec, pExtraBuffer);
end = ippGetCpuClocks();
*pTimes++ = end - start;

Sleep(0);
start = ippGetCpuClocks();
ippsFFTFwd_CToC_32f(pFloatPointers[4], pFloatPointers[5], pFloatPointers[14], pFloatPointers[15], pFFTSpec, pExtraBuffer);
end = ippGetCpuClocks();
*pTimes++ = end - start;

Sleep(0);
start = ippGetCpuClocks();
ippsFFTFwd_CToC_32f(pFloatPointers[6], pFloatPointers[7], pFloatPointers[16], pFloatPointers[17], pFFTSpec, pExtraBuffer);
end = ippGetCpuClocks();
*pTimes++ = end - start;

Sleep(0);
start = ippGetCpuClocks();
ippsFFTFwd_CToC_32f(pFloatPointers[8], pFloatPointers[9], pFloatPointers[18], pFloatPointers[19], pFFTSpec, pExtraBuffer);
end = ippGetCpuClocks();
*pTimes++ = end - start;
}

// Release the FFT Specification structure allocated earlier
ippsFFTFree_C_32f(pFFTSpec);

double t2 = Now();

ippFree(pExtraBuffer);

i = 0;
while(i < 20)
{
ippFree(pFloatPointers[i++]);
}

return double(t2 - t1) * 86400.0;
}

3. Perform 5 * 32K FFTs per second for 10 seconds. 200ms sleep between each FFT.

double TForm1::FFT(Ipp64u* pTimes)
{
IppsFFTSpec_C_32f* pFFTSpec = 0;
int iBufferSize = 0;
float* pFloatPointers[20];
char* pExtraBuffer = 0;
int iFFTOrder = 15;
int iFFTSize = 32768;
Ipp64u start, end;

int i = 0;
while(i < 20)
{
pFloatPointers[i] = (float*)ippMalloc(iFFTSize * sizeof(float));
ippsSet_32f(10.0, pFloatPointers[i++], iFFTSize);
}

double t1 = Now();

// Create the FFT Specification structure with the flags and hints we want
ippsFFTInitAlloc_C_32f(&pFFTSpec, iFFTOrder, IPP_FFT_NODIV_BY_ANY, ippAlgHintNone);

// Now work out the size of external buffer we should allocate
// This buffer isn't necessary but FFT calls are faster if it's calculated up front
ippsFFTGetBufSize_C_32f(pFFTSpec, &iBufferSize);

pExtraBuffer = (char*)(ippMalloc(iBufferSize));

for (int iSpectrum = 0; iSpectrum < 10; ++iSpectrum)
{
Sleep(200);
start = ippGetCpuClocks();
ippsFFTFwd_CToC_32f(pFloatPointers[0], pFloatPointers[1], pFloatPointers[10], pFloatPointers[11], pFFTSpec, pExtraBuffer);
end = ippGetCpuClocks();
*pTimes++ = end - start;

Sleep(200);
start = ippGetCpuClocks();
ippsFFTFwd_CToC_32f(pFloatPointers[2], pFloatPointers[3], pFloatPointers[12], pFloatPointers[13], pFFTSpec, pExtraBuffer);
end = ippGetCpuClocks();
*pTimes++ = end - start;

Sleep(200);
start = ippGetCpuClocks();
ippsFFTFwd_CToC_32f(pFloatPointers[4], pFloatPointers[5], pFloatPointers[14], pFloatPointers[15], pFFTSpec, pExtraBuffer);
end = ippGetCpuClocks();
*pTimes++ = end - start;

Sleep(200);
start = ippGetCpuClocks();
ippsFFTFwd_CToC_32f(pFloatPointers[6], pFloatPointers[7], pFloatPointers[16], pFloatPointers[17], pFFTSpec, pExtraBuffer);
end = ippGetCpuClocks();
*pTimes++ = end - start;

Sleep(200);
start = ippGetCpuClocks();
ippsFFTFwd_CToC_32f(pFloatPointers[8], pFloatPointers[9], pFloatPointers[18], pFloatPointers[19], pFFTSpec, pExtraBuffer);
end = ippGetCpuClocks();
*pTimes++ = end - start;
}

// Release the FFT Specification structure allocated earlier
ippsFFTFree_C_32f(pFFTSpec);

double t2 = Now();

ippFree(pExtraBuffer);

i = 0;
while(i < 20)
{
ippFree(pFloatPointers[i++]);
}

return double(t2 - t1) * 86400.0;
}

The performance of these functions varies wildly as follows:

1.

Approx Function execution time: 0.016s

Average number of CPU Clocks per FFT: Approx 400,000

CPU Load in Task Manager: Does not register

2.

Approx Function execution time: 10s

Average number of CPU Clocks per FFT: Approx 650,000

CPU Load in Task Manager: Approx 5% for full 10 second duration

3.

Approx Function execution time: 10.15s

Average number of CPU Clocks per FFT: Varies from 750,000 to 1.5M

CPU Load in Task Manager: 50% for full 10 second duration
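
For reference, figures like the averages and ranges above could be derived from the values written through pTimes with a small helper along these lines. This is only a sketch: ClockStats and summarizeClocks are hypothetical names that do not appear in the test app, and it assumes Ipp64u is an unsigned 64-bit integer compatible with std::uint64_t.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimum, maximum and mean of a set of per-FFT clock counts.
struct ClockStats {
    std::uint64_t min;
    std::uint64_t max;
    double mean;
};

// Summarize `count` clock deltas, e.g. the 50 values each FFT() variant
// stores through pTimes (assumed here to be plain unsigned 64-bit values).
ClockStats summarizeClocks(const std::uint64_t* times, std::size_t count)
{
    assert(times != nullptr && count > 0);
    ClockStats s{times[0], times[0], 0.0};
    double sum = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        s.min = std::min(s.min, times[i]);
        s.max = std::max(s.max, times[i]);
        sum += static_cast<double>(times[i]);
    }
    s.mean = sum / static_cast<double>(count);
    return s;
}
```

Reporting the minimum and maximum alongside the mean makes a spread such as the one in variant 3 visible rather than hiding it in an average.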

So the problem is that when I space the FFTs evenly over the course of 10 seconds, the CPU gets hammered. Can anyone please tell me why this is occurring and what I am doing incorrectly?

The background to this is that I develop high bandwidth data acquisition and real-time signal processing solutions for measuring vibration. Currently I am using the NSP libraries for all signal processing work but I want to migrate to IPP. Because the processing is real-time, it more closely resembles the 3rd function (FFTs performed periodically as data is acquired) but to be using 50% of the CPU to perform 5 FFTs per second is useless.

The above testing was performed on my development machine which is:

Core 2 Duo E6750 2.66 GHz, 2 GB RAM, Win XP SP3
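
As a rough sanity check on these numbers, the share of one core that the FFT arithmetic itself should need can be estimated from the clock counts and the core frequency. The sketch below assumes the counts from ippGetCpuClocks tick at the nominal 2.66 GHz rate; fftLoadFraction is a hypothetical helper, not part of IPP or the test app.

```cpp
#include <cassert>

// Fraction of one core spent on FFT work, given the clocks one FFT takes,
// how many FFTs run per second, and the core frequency in Hz.
// Assumes the clock counter ticks at coreHz.
double fftLoadFraction(double clocksPerFft, double fftsPerSecond, double coreHz)
{
    assert(coreHz > 0.0);
    return clocksPerFft * fftsPerSecond / coreHz;
}
```

With the worst case measured in variant 3, fftLoadFraction(1.5e6, 5.0, 2.66e9) is about 0.0028, so roughly 0.28% of one core. Under that assumption, the 50% reading in Task Manager cannot be explained by the FFT computation alone.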

Thanks

Vladimir_Dudnik · ‎12-10-2008

OpenMP threads stay active for some time after they finish their real work, which is why you see higher CPU usage. The recommendation is to call ippSetNumThreads(1) to limit the number of threads created by IPP (or to link with the non-threaded IPP static libraries).

Regards,
Vladimir

pwindle · ‎12-09-2008

Sorry for lack of code formatting. Formatting tool didn't work so I just pasted the code in normally.

pwindle · ‎12-18-2008

Just wanted to say thanks very much, ippSetNumThreads(1) fixed it great.

Vladimir_Dudnik · ‎12-18-2008

You are welcome :)

brian-womack · ‎12-23-2008

To me, limiting execution to one thread is not a great solution, because I need to balance execution over 4 cores. With more multi-core processors coming within the next two years, this is a key thing.

It seems there is either an underlying defect or we need to call some function that 'cleans up' the OpenMP stuff. Is there such a call?

Vladimir_Dudnik · ‎12-23-2008

This is not a defect. When you combine OpenMP threading with the Win32 Sleep function, the system behaves this way. When you decide to use IPP internal threading, you need to be careful about using system threading on top of the IPP threads. Alternatively, you may disable IPP threading and parallelize your task on top of IPP calls with whatever threading API you prefer.
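
The last alternative, parallelizing on top of single-threaded library calls, might look roughly like the sketch below. It uses std::thread rather than any IPP facility, and processBuffer is a hypothetical stand-in for one FFT call. In the real application it would presumably wrap ippsFFTFwd_CToC_32f, with IPP threading disabled and a separate work buffer per thread, which is an assumption and not something confirmed in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for one single-threaded FFT call on one buffer.
// It only doubles the samples so that the sketch is self-contained.
void processBuffer(std::vector<float>& buf)
{
    for (float& x : buf) x *= 2.0f;
}

// Process independent buffers on numWorkers threads. Worker w handles
// buffers w, w + numWorkers, ... so no two threads share a buffer.
void processAll(std::vector<std::vector<float>>& buffers, unsigned numWorkers)
{
    assert(numWorkers > 0);
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&buffers, w, numWorkers] {
            for (std::size_t i = w; i < buffers.size(); i += numWorkers)
                processBuffer(buffers[i]);
        });
    }
    // Each worker thread exits once its share is done, so nothing is left
    // running between batches; join() waits for all of them to finish.
    for (std::thread& t : workers) t.join();
}
```

Creating threads per batch keeps the example short. A long-lived pool that blocks on a condition variable between batches would avoid the thread start-up cost while still leaving the cores idle during the sleeps.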