I am having a problem when using IPP to perform FFTs. I have written a small test app that demonstrates the problem. Have a look at the following 3 variants of a function I have written:

1. Perform 50 * 32K FFTs in quick succession.

    double TForm1::FFT(Ipp64u* pTimes)
    {
        IppsFFTSpec_C_32f* pFFTSpec = 0;
        int iBufferSize = 0;
        float* pFloatPointers[20];
        char* pExtraBuffer = 0;
        int iFFTOrder = 15;
        int iFFTSize = 32768;
        Ipp64u start, end;

        int i = 0;
        while(i < 20)
        {
            pFloatPointers[i] = (float*)ippMalloc(iFFTSize * sizeof(float));
            ippsSet_32f(10.0, pFloatPointers[i++], iFFTSize);
        }

        double t1 = Now();

        // Create the FFT specification structure with the flags and hints we want
        ippsFFTInitAlloc_C_32f(&pFFTSpec, iFFTOrder, IPP_FFT_NODIV_BY_ANY, ippAlgHintNone);

        // Now work out the size of external buffer we should allocate
        // This buffer isn't necessary but FFT calls are faster if it's calculated up front
        ippsFFTGetBufSize_C_32f(pFFTSpec, &iBufferSize);
        pExtraBuffer = (char*)(ippMalloc(iBufferSize));

        for (int iSpectrum = 0; iSpectrum < 10; ++iSpectrum)
        {
            Sleep(0);
            start = ippGetCpuClocks();
            ippsFFTFwd_CToC_32f(pFloatPointers[0], pFloatPointers[1], pFloatPointers[10], pFloatPointers[11], pFFTSpec, pExtraBuffer);
            end = ippGetCpuClocks();
            *pTimes++ = end - start;

            Sleep(0);
            start = ippGetCpuClocks();
            ippsFFTFwd_CToC_32f(pFloatPointers[2], pFloatPointers[3], pFloatPointers[12], pFloatPointers[13], pFFTSpec, pExtraBuffer);
            end = ippGetCpuClocks();
            *pTimes++ = end - start;

            Sleep(0);
            start = ippGetCpuClocks();
            ippsFFTFwd_CToC_32f(pFloatPointers[4], pFloatPointers[5], pFloatPointers[14], pFloatPointers[15], pFFTSpec, pExtraBuffer);
            end = ippGetCpuClocks();
            *pTimes++ = end - start;

            Sleep(0);
            start = ippGetCpuClocks();
            ippsFFTFwd_CToC_32f(pFloatPointers[6], pFloatPointers[7], pFloatPointers[16], pFloatPointers[17], pFFTSpec, pExtraBuffer);
            end = ippGetCpuClocks();
            *pTimes++ = end - start;

            Sleep(0);
            start = ippGetCpuClocks();
            ippsFFTFwd_CToC_32f(pFloatPointers[8], pFloatPointers[9], pFloatPointers[18], pFloatPointers[19], pFFTSpec, pExtraBuffer);
            end = ippGetCpuClocks();
            *pTimes++ = end - start;
        }

        // Release the FFT specification structure allocated earlier
        ippsFFTFree_C_32f(pFFTSpec);

        double t2 = Now();

        ippFree(pExtraBuffer);
        i = 0;
        while(i < 20)
        {
            ippFree(pFloatPointers[i++]);
        }

        return double(t2 - t1) * 86400.0;
    }

2. Perform 5 * 32K FFTs per second for 10 seconds. 1 Second sleep followed by 5 FFTs.

    double TForm1::FFT(Ipp64u* pTimes)
    {
        IppsFFTSpec_C_32f* pFFTSpec = 0;
        int iBufferSize = 0;
        float* pFloatPointers[20];
        char* pExtraBuffer = 0;
        int iFFTOrder = 15;
        int iFFTSize = 32768;
        Ipp64u start, end;

        int i = 0;
        while(i < 20)
        {
            pFloatPointers[i] = (float*)ippMalloc(iFFTSize * sizeof(float));
            ippsSet_32f(10.0, pFloatPointers[i++], iFFTSize);
        }

        double t1 = Now();

        // Create the FFT specification structure with the flags and hints we want
        ippsFFTInitAlloc_C_32f(&pFFTSpec, iFFTOrder, IPP_FFT_NODIV_BY_ANY, ippAlgHintNone);

        // Now work out the size of external buffer we should allocate
        // This buffer isn't necessary but FFT calls are faster if it's calculated up front
        ippsFFTGetBufSize_C_32f(pFFTSpec, &iBufferSize);
        pExtraBuffer = (char*)(ippMalloc(iBufferSize));

        for (int iSpectrum = 0; iSpectrum < 10; ++iSpectrum)
        {
            Sleep(1000);
            start = ippGetCpuClocks();
            ippsFFTFwd_CToC_32f(pFloatPointers[0], pFloatPointers[1], pFloatPointers[10], pFloatPointers[11], pFFTSpec, pExtraBuffer);
            end = ippGetCpuClocks();
            *pTimes++ = end - start;

            Sleep(0);
            start = ippGetCpuClocks();
            ippsFFTFwd_CToC_32f(pFloatPointers[2], pFloatPointers[3], pFloatPointers[12], pFloatPointers[13], pFFTSpec, pExtraBuffer);
            end = ippGetCpuClocks();
            *pTimes++ = end - start;

            Sleep(0);
            start = ippGetCpuClocks();
            ippsFFTFwd_CToC_32f(pFloatPointers[4], pFloatPointers[5], pFloatPointers[14], pFloatPointers[15], pFFTSpec, pExtraBuffer);
            end = ippGetCpuClocks();
            *pTimes++ = end - start;

            Sleep(0);
            start = ippGetCpuClocks();
            ippsFFTFwd_CToC_32f(pFloatPointers[6], pFloatPointers[7], pFloatPointers[16], pFloatPointers[17], pFFTSpec, pExtraBuffer);
            end = ippGetCpuClocks();
            *pTimes++ = end - start;

            Sleep(0);
            start = ippGetCpuClocks();
            ippsFFTFwd_CToC_32f(pFloatPointers[8], pFloatPointers[9], pFloatPointers[18], pFloatPointers[19], pFFTSpec, pExtraBuffer);
            end = ippGetCpuClocks();
            *pTimes++ = end - start;
        }

        // Release the FFT specification structure allocated earlier
        ippsFFTFree_C_32f(pFFTSpec);

        double t2 = Now();

        ippFree(pExtraBuffer);
        i = 0;
        while(i < 20)
        {
            ippFree(pFloatPointers[i++]);
        }

        return double(t2 - t1) * 86400.0;
    }

3. Perform 5 * 32K FFTs per second for 10 seconds. 200ms sleep between each FFT.

    double TForm1::FFT(Ipp64u* pTimes)
    {
        IppsFFTSpec_C_32f* pFFTSpec = 0;
        int iBufferSize = 0;
        float* pFloatPointers[20];
        char* pExtraBuffer = 0;
        int iFFTOrder = 15;
        int iFFTSize = 32768;
        Ipp64u start, end;

        int i = 0;
        while(i < 20)
        {
            pFloatPointers[i] = (float*)ippMalloc(iFFTSize * sizeof(float));
            ippsSet_32f(10.0, pFloatPointers[i++], iFFTSize);
        }

        double t1 = Now();

        // Create the FFT specification structure with the flags and hints we want
        ippsFFTInitAlloc_C_32f(&pFFTSpec, iFFTOrder, IPP_FFT_NODIV_BY_ANY, ippAlgHintNone);

        // Now work out the size of external buffer we should allocate
        // This buffer isn't necessary but FFT calls are faster if it's calculated up front
        ippsFFTGetBufSize_C_32f(pFFTSpec, &iBufferSize);
        pExtraBuffer = (char*)(ippMalloc(iBufferSize));

        for (int iSpectrum = 0; iSpectrum < 10; ++iSpectrum)
        {
            Sleep(200);
            start = ippGetCpuClocks();
            ippsFFTFwd_CToC_32f(pFloatPointers[0], pFloatPointers[1], pFloatPointers[10], pFloatPointers[11], pFFTSpec, pExtraBuffer);
            end = ippGetCpuClocks();
            *pTimes++ = end - start;

            Sleep(200);
            start = ippGetCpuClocks();
            ippsFFTFwd_CToC_32f(pFloatPointers[2], pFloatPointers[3], pFloatPointers[12], pFloatPointers[13], pFFTSpec, pExtraBuffer);
            end = ippGetCpuClocks();
            *pTimes++ = end - start;

            Sleep(200);
            start = ippGetCpuClocks();
            ippsFFTFwd_CToC_32f(pFloatPointers[4], pFloatPointers[5], pFloatPointers[14], pFloatPointers[15], pFFTSpec, pExtraBuffer);
            end = ippGetCpuClocks();
            *pTimes++ = end - start;

            Sleep(200);
            start = ippGetCpuClocks();
            ippsFFTFwd_CToC_32f(pFloatPointers[6], pFloatPointers[7], pFloatPointers[16], pFloatPointers[17], pFFTSpec, pExtraBuffer);
            end = ippGetCpuClocks();
            *pTimes++ = end - start;

            Sleep(200);
            start = ippGetCpuClocks();
            ippsFFTFwd_CToC_32f(pFloatPointers[8], pFloatPointers[9], pFloatPointers[18], pFloatPointers[19], pFFTSpec, pExtraBuffer);
            end = ippGetCpuClocks();
            *pTimes++ = end - start;
        }

        // Release the FFT specification structure allocated earlier
        ippsFFTFree_C_32f(pFFTSpec);

        double t2 = Now();

        ippFree(pExtraBuffer);
        i = 0;
        while(i < 20)
        {
            ippFree(pFloatPointers[i++]);
        }

        return double(t2 - t1) * 86400.0;
    }

The performance of these functions varies wildly as follows:

1.

Approx Function execution time: 0.016s

Average number of CPU Clocks per FFT: Approx 400,000

CPU Load in Task Manager: Does not register

2.

Approx Function execution time: 10s

Average number of CPU Clocks per FFT: Approx 650,000

CPU Load in Task Manager: Approx 5% for full 10 second duration

3.

Approx Function execution time: 10.15s

Average number of CPU Clocks per FFT: Varies from 750,000 to 1.5M

CPU Load in Task Manager: 50% for full 10 second duration

So the problem is that when I space the FFTs evenly over the course of 10 seconds, the CPU gets hammered. Can anyone please tell me why this is occurring and what I am doing incorrectly?

The background to this is that I develop high bandwidth data acquisition and real-time signal processing solutions for measuring vibration. Currently I am using the NSP libraries for all signal processing work but I want to migrate to IPP. Because the processing is real-time, it more closely resembles the 3rd function (FFTs performed periodically as data is acquired) but to be using 50% of the CPU to perform 5 FFTs per second is useless.

The above testing was performed on my development machine which is:

Core 2 Duo E6750 2.66 GHz, 2 GB RAM, Win XP SP3

Thanks


Sorry for lack of code formatting. Formatting tool didn't work so I just pasted the code in normally.


*OpenMP threads remain active for some time after they finish their real work, which is why you see higher CPU usage. The recommendation is to call ippSetNumThreads(1) to limit the number of threads created by IPP (or to link with the non-threaded static IPP libraries).
Regards,*

Vladimir

Vladimir

Just wanted to say thanks very much, ippSetNumThreads(1) fixed it great.


You are welcome :)


This is not a defect. When you combine OpenMP threading with the Win32 Sleep function, the system behaves this way. When you decide to use IPP internal threading, you need to be careful about using system threading on top of IPP threads. Alternatively, you may disable IPP threading and parallelize your task on top of IPP calls with whatever threading API you prefer.
