Re: FFT Performance Disappointment

beaupaisley · ‎09-09-2007

We recently upgraded from IPP 5.1 to IPP 5.2 and were disappointed to see that there were no measurable performance improvements in the FFTs and that they still have not been threaded to take advantage of multi-core machines.
1. Are efforts being made on core performance, and when will they be available?
2. Have efforts started to multithread the FFTs, and when will this be available?
3. Do the in-place versions, e.g. ippsFFTFwd_CToC_64f_I(), provide better performance? (It would have to be quite a bit better, because our application requires the result to be copied to a separate result buffer, so there is a penalty to pay.)

In particular, we are using the following functions, and their performance is starting to lag behind other alternatives, so we look forward to these enhancements.
ippsFFTFwd_CToC_64f()
ippsFFTInv_CToC_64f()
ippsDFTFwd_CToC_64f()
ippsDFTInv_CToC_64f()

Vladimir_Dudnik · ‎09-11-2007

Hello,

We are currently working on threading for the FFT functions. The expected performance gain on a dual-core system is about 80% for order 13. It might be available in IPP 5.3 (end of this year). Our experts did not detect a significant difference in performance between the in-place and not-in-place versions of the functions you mention.

Additionally, we are working on improving the performance of the single-threaded FFT for the Ipp64f data type (we expect to provide a 10..15% performance improvement in the next IPP version on the Core 2 architecture).

Regards,
Vladimir

levicki · ‎09-18-2007

Vladimir, last time I checked IPP 5.2 FFT performance I got something like this:

Buffer size : 288 bytes
irdft(rdft(  64)):  0.79 us
Buffer size : 544 bytes
irdft(rdft( 128)):  1.11 us
Buffer size : 1056 bytes
irdft(rdft( 256)):  1.86 us
Buffer size : 2080 bytes
irdft(rdft( 512)):  3.42 us
Buffer size : 4128 bytes
irdft(rdft(1024)):  6.83 us
Buffer size : 8224 bytes
irdft(rdft(2048)): 13.85 us

The test machine was a Core 2 Duo E6300 running at 2.24 GHz with 2 GB of DDR2-800 RAM. Yes, those execution times are for both the FFT and the inverse FFT.

Perhaps I made a mistake? Here is the test code I used:


#include <windows.h>
#include <stdio.h>
#if 1 // change to 0 for dynamic IPP performance test
#include <ipp_v8.h> // change v8 to your CPU type for static IPP performance test
#include <ipp.h>
#else
#include <ipp.h>
#endif

#pragma comment(lib, "ippsemerged.lib")
#pragma comment(lib, "ippsmerged.lib")
#pragma comment(lib, "ippcorel.lib")

typedef unsigned __int64 u64;

#define NAKED	__declspec(naked)
#define ALIGN	__declspec(align(16))

ALIGN	Ipp32f	src[2048];
ALIGN	Ipp32f	dst[2048];

DWORD CPUFrequency(void)
{
const	wchar_t *key = L"HARDWARE\\DESCRIPTION\\System\\CentralProcessor\\0";
const	wchar_t *value = L"~Mhz";
	DWORD	buflen = 4;
	HKEY	hKey;
	DWORD	freq;

	RegOpenKeyEx(HKEY_LOCAL_MACHINE, key, 0, KEY_READ, &hKey);
	RegQueryValueEx(hKey, value, NULL, NULL, (LPBYTE)&freq, &buflen);
	RegCloseKey(hKey);

	return freq;
}

NAKED u64 readtsc(void)
{
	__asm	{
		rdtsc
		ret
	}
}

int main(int argc, char *argv[])
{
	IppsFFTSpec_R_32f	*pFFTSpec;
	IppStatus		s;
	Ipp8u			*buf = NULL;
	u64			t0, t1;
	double			f;
	int			buf_len, i, n, iter = 100000;

	for (i = 0; i < 2048; i++) {
		src[i] = (i + 3) / 3.14f;
	}

	for (n = 6; n < 12; n++) {
		s = ippsFFTInitAlloc_R_32f(&pFFTSpec, n, IPP_FFT_NODIV_BY_ANY, ippAlgHintFast);

		if (s != ippStsNoErr) {
			printf("ippsFFTInitAlloc_R_32f() failed (%ld)\n", s);
			return 1;
		}

		s = ippsFFTGetBufSize_R_32f(pFFTSpec, &buf_len);

		if (s != ippStsNoErr) {
			printf("ippsFFTGetBufSize_R_32f() failed (%ld)\n", s);
		} else {
			if (buf_len == 0) {
				printf("ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!\n");
			} else {
				printf("Buffer size : %ld bytes\n", buf_len);
				buf = ippsMalloc_8u(buf_len);
			}
		}

		t0 = readtsc();
		for (i = 0; i < iter; i++) {
			ippsFFTFwd_RToCCS_32f(src, dst, pFFTSpec, buf);
			ippsFFTInv_CCSToR_32f(dst, src, pFFTSpec, buf);
		}
		t1 = readtsc();

		if (buf != NULL) {
			ippsFree(buf);
		}

		ippsFFTFree_R_32f(pFFTSpec);

		f = (double)(t1 - t0) / iter / CPUFrequency();

		printf("irdft(rdft(%4d)): %5.2f us\n", 1 << n, f);
	}

	return 0;
}
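
For readers following the timing arithmetic in the listing above: the TSC delta is divided by the iteration count and then by the clock in MHz, and because 1 MHz is one cycle per microsecond, the result is microseconds per forward+inverse pair. A minimal sketch of just that conversion, assuming the TSC ticks at the reported core frequency; us_per_iteration is a hypothetical helper name, not part of the posted code or of IPP:

```c
#include <assert.h>

/* Convert a TSC cycle delta into microseconds per loop iteration.
   freq_mhz is the clock in MHz, i.e. cycles per microsecond.
   Assumption: the TSC advances at exactly freq_mhz. */
double us_per_iteration(unsigned long long cycles, int iterations, double freq_mhz)
{
    assert(iterations > 0 && freq_mhz > 0.0);
    return (double)cycles / iterations / freq_mhz;
}
```

For example, 224,000,000 cycles over 100,000 iterations at 2240 MHz works out to 1.0 us per iteration.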

However, there is one serious problem: with 5.2.063 I am getting horrible results with dynamic IPP vs. static IPP.

The reason is that ippsFFTGetBufSize_R_32f() for some reason returns success and a buffer size of 0, and without a work buffer things get awfully slow.

ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!
irdft(rdft(  64)): 126.23 us
ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!
irdft(rdft( 128)): 304.70 us
ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!
irdft(rdft( 256)): 703.33 us
ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!
irdft(rdft( 512)): 1607.48 us
ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!
irdft(rdft(1024)): 3591.07 us
ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!
irdft(rdft(2048)): 7988.09 us

Could you please test this?

Vladimir_Dudnik · ‎10-01-2007

Igor, there are a number of issues in the code you provided, and some of them could lead to significant performance degradation. Please take a look at the comments provided by our experts:

1) if GetBufSize returns 0, it means that the PX code is working;
2) the scaling mode is wrong (for this measuring technique), which leads to indefinites (Inf/NaN values) in src and dst and to dramatically bad performance;
3) CPU frequency detection doesn't work properly on my machine; let's use ippGetCpuFreqMhz() instead;
4) there are a number of bugs in the attached example; have a look at my comments below;
5) not critical, but why is iter a double?

int main(int argc, char *argv[])
{
	IppsFFTSpec_R_32f *pFFTSpec;
	IppStatus s;
	Ipp8u *buf; // buf value is undefined here
	u64 t0, t1;
	double f, iter = 100000.0;
	int buf_len, i, n;

	for (i = 0; i < 2048; i++) {
		src[i] = (i + 3) / 3.14f;
	}

	for (n = 6; n < 12; n++) {
		s = ippsFFTInitAlloc_R_32f(&pFFTSpec, n, IPP_FFT_NODIV_BY_ANY, ippAlgHintFast);
		// IPP_FFT_NODIV_BY_ANY leads to a non-normalized result,
		// therefore src will not contain the same values after
		// the inverse FFT is applied. src grows uncontrollably
		// with each iteration, which leads to overflow,
		// indefinites in src and, as a result, to significant
		// performance degradation.
		// So IPP_FFT_DIV_INV_BY_N must be used, or the
		// inverse FFT must be called with a destination
		// other than src.
		if (s != ippStsNoErr) {
			printf("ippsFFTInitAlloc_R_32f() failed (%ld)\n", s);
			return 1;
		}
		ippsFFTGetBufSize_R_32f(pFFTSpec, &buf_len); // the status is not stored for later analysis
		if (s != ippStsNoErr) {
			// the status analyzed here does not come from
			// the GetBufSize call
			printf("ippsFFTGetBufSize_R_32f() failed (%ld)\n", s);
		} else {
			if (buf_len == 0) {
				printf("ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!\n");
			} else {
				printf("Buffer size : %ld bytes\n", buf_len);
				buf = ippsMalloc_8u(buf_len);
			}
		}
		t0 = readtsc();
		for (i = 0; i < iter; i++) {
			// if GetBufSize fails, buf is undefined;
			// it must be initialized at least to
			// NULL at the beginning
			ippsFFTFwd_RToCCS_32f(src, dst, pFFTSpec, buf);
			ippsFFTInv_CCSToR_32f(dst, src, pFFTSpec, buf);
		}

Regards,
Vladimir

levicki · ‎10-02-2007

if GetBufSize returns 0 it means that PX code is working;

Yes, but the real question you should have answered is why that is happening in the dynamic build on the Core 2 Duo.

CPU frequency detection doesn't work properly on my machine

Strange. That has worked on all machines I have tested so far. You just have to use a UNICODE build.

not critical but why iter is double?

Leftover.

IPP_FFT_NODIV_BY_ANY...

What you are saying is not so obvious from the IPP documentation. From the NODIV_BY_ANY name and from the table I understood that if both factors are 1 then there is no scaling and hence IFFT(FFT(A)) == A. Furthermore, another buffer is not an option here.
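
The scaling point can be checked without IPP. Below is a minimal sketch using a naive 4-point complex DFT with no division in either direction (this is not the IPP implementation; dft4_unscaled and roundtrip_first_sample are made-up names). With both factors equal to 1, one forward+inverse pass returns N times the input rather than the input itself, so writing the inverse back into the source array multiplies the data by N on every pass until the floats overflow:

```c
#include <assert.h>
#include <float.h>

#define LEN 4   /* small length so the twiddle factors are exactly 0 and +/-1 */

/* cos(2*pi*m/4) and sin(2*pi*m/4) for m = 0..3 */
static const float COS4[LEN] = { 1.0f, 0.0f, -1.0f,  0.0f };
static const float SIN4[LEN] = { 0.0f, 1.0f,  0.0f, -1.0f };

/* Naive 4-point complex DFT with no scaling in either direction.
   sign = -1 is the forward transform, sign = +1 the inverse. */
static void dft4_unscaled(const float *in_re, const float *in_im,
                          float *out_re, float *out_im, int sign)
{
    for (int k = 0; k < LEN; k++) {
        float sum_re = 0.0f, sum_im = 0.0f;
        for (int j = 0; j < LEN; j++) {
            int m = (j * k) % LEN;
            float c = COS4[m];
            float s = sign * SIN4[m];
            sum_re += in_re[j] * c - in_im[j] * s;
            sum_im += in_re[j] * s + in_im[j] * c;
        }
        out_re[k] = sum_re;
        out_im[k] = sum_im;
    }
}

/* Run `rounds` forward+inverse passes, writing the inverse back into the
   source array as the timing loop does, and return sample 0 afterwards. */
float roundtrip_first_sample(int rounds)
{
    float re[LEN], im[LEN], f_re[LEN], f_im[LEN];
    assert(rounds >= 0);
    for (int i = 0; i < LEN; i++) {
        re[i] = (i + 3) / 3.14f;   /* same fill as the test program */
        im[i] = 0.0f;
    }
    for (int r = 0; r < rounds; r++) {
        dft4_unscaled(re, im, f_re, f_im, -1);
        dft4_unscaled(f_re, f_im, re, im, +1);
    }
    return re[0];
}
```

roundtrip_first_sample(1) comes back about 4 times larger than roundtrip_first_sample(0), and after enough passes the value is no longer a finite float. With N between 64 and 2048 the same growth happens far sooner than with N = 4.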

analyzed status doesn't correspond to...

Syntax error, sorry. I am sure you know how to fix it.

buf is undefined

Oops... I wrote that test code in a hurry.

Now, let us return to the real problem here:

Why is PX code being called on a Core 2 Duo with dynamic linkage?
Why does ippsFFTGetBufSize_R_32f() have a "special case" return value?!?

#1 can be a bug or perhaps I forgot to initialize something.

#2 however, is a bad design. Zero is not a valid buffer size. Returning zero from ippsFFTGetBufSize_R_32f() is a failure, not a silent success.

Moreover, returning zero only when PX code is being run is a "special case return value" which is a Bad Thing™.

Furthermore, such a simple function does not deserve to return an error code — simply returning zero in case of an error would be sufficient. This way one has to make two tests to determine success instead of one.

Finally, not documenting that a function can return zero as a buffer size and call that a success is horrendous.

Vladimir_Dudnik · ‎10-05-2007

Hi Igor,

1. The IPP dispatcher should not use PX code on a Core 2 Duo processor. This is the first time we have heard of such a problem. Could you please double-check that this is the case?

2. The *GetBufSize functions return the size of the buffer. If in some particular case a buffer is not needed, its size is zero. Why do you think it is a failure?
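
One way to read that contract in calling code: a reported size of zero means there is nothing to allocate, and the caller carries on with a NULL work buffer. A minimal sketch under that assumption; acquire_work_buffer is a hypothetical helper, not an IPP function, and plain malloc stands in for ippsMalloc_8u:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not part of IPP: turn a reported work-buffer size
   into a pointer. A size of zero is treated as "no buffer needed", not as
   an error. Returns 0 on success, -1 on a negative size or failed malloc.
   On success with size 0, *out_buf is NULL. */
int acquire_work_buffer(int reported_size, unsigned char **out_buf)
{
    assert(out_buf != NULL);
    *out_buf = NULL;
    if (reported_size < 0)
        return -1;
    if (reported_size == 0)
        return 0;   /* success: the caller passes NULL as the work buffer */
    *out_buf = malloc((size_t)reported_size);
    return (*out_buf != NULL) ? 0 : -1;
}
```

The caller then needs only one check on the return value, and can release the result with free() in both cases, since free(NULL) is a no-op.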

Regards,
Vladimir

levicki · ‎10-05-2007

IPP dispatcher should not use PX code on Core2Duo processor. It is the first time we hear about such problem. Could you please double check that it is the case?

I have corrected the code as per your engineers' instructions, and I have already described how I am compiling it, whether static or dynamic (check the #if at the beginning of the code). I suggest you test it yourself; perhaps I have made an error.

GetBufSize function return size of buffer. If in some particular case buffer is not needed its size is zero. Why do you think it is a failure?

I believe I have already explained it but I will give you an analogy — if you ask me "What time is it?" what kind of answer do you expect to get? I presume you would like me to tell you the current time in AM/PM or 24hr format, right? What will you make of it if I tell you a number which doesn't represent a valid time? Would you consider that an error condition or not?

In my opinion, part of the problem lies in the fact that the PX version doesn't need a work buffer — it is not documented and it is a special case and frankly, I don't see a point of a work buffer being optional. If it improves performance it should be automatically created when you specify a problem size and used as needed.

Vladimir_Dudnik · ‎01-15-2008

Hello,

FFT functions working with Ipp32f and Ipp32fc data types in ipps library are threaded for orders 13..17. Functions working with Ipp64f and Ipp64fc are threaded for orders 12..16.

This is all true for the Core 2 specific IPP libraries (V8/U8).

Regards,
Vladimir