1. Are efforts being made to improve core performance, and when will the results be available?
2. Have efforts started to multithread the FFTs, and when will this be available?
3. Do the in-place versions, e.g. ippsFFTFwd_CToC_64f_I(), provide better performance? (They would have to be quite a bit better, because our application requires the result to be copied into a separate result buffer, so there is a penalty to pay.)
In particular, we are using the following functions, and their performance is starting to lag behind other alternatives, so we look forward to these enhancements.
ippsFFTFwd_CToC_64f()
ippsFFTInv_CToC_64f()
ippsDFTFwd_CToC_64f()
ippsDFTInv_CToC_64f()
Hello,
We are currently working on threading the FFT functions. The expected performance gain on a dual-core system is about 80% for order 13. It might be available in IPP version 5.3 (end of this year). Our experts did not detect a significant difference in performance between the in-place and not-in-place versions of the functions you mention.
Additionally, we are working on improving the performance of the single-threaded FFT for the Ipp64f data type (we expect to provide a 10 to 15% performance improvement in the next IPP version on the Core 2 architecture).
Regards,
Vladimir
Vladimir, last time I checked IPP 5.2 FFT performance I got something like this:
Buffer size : 288 bytes
irdft(rdft(  64)):  0.79 us
Buffer size : 544 bytes
irdft(rdft( 128)):  1.11 us
Buffer size : 1056 bytes
irdft(rdft( 256)):  1.86 us
Buffer size : 2080 bytes
irdft(rdft( 512)):  3.42 us
Buffer size : 4128 bytes
irdft(rdft(1024)):  6.83 us
Buffer size : 8224 bytes
irdft(rdft(2048)): 13.85 us
Test machine was Core 2 Duo E6300 running at 2.24GHz with 2GB of DDR2-800 RAM. Yes, those execution times are for both FFT and inverse FFT.
Perhaps I made a mistake? Here is the test code I used:
// NOTE: the header names were lost from the original post.
// <windows.h>, <stdio.h> and <ipp.h> follow from the calls used below;
// <ipp_v8.h> is assumed from the "change v8" comment.
#include <windows.h>
#include <stdio.h>

#if 1 // change to 0 for dynamic IPP performance test
#include <ipp_v8.h> // change v8 to your CPU type for static IPP performance test
#include <ipp.h>
#else
#include <ipp.h>
#endif

#pragma comment(lib, "ippsemerged.lib")
#pragma comment(lib, "ippsmerged.lib")
#pragma comment(lib, "ippcorel.lib")

typedef unsigned __int64 u64;

#define NAKED __declspec(naked)
#define ALIGN __declspec(align(16))

ALIGN Ipp32f src[2048];
ALIGN Ipp32f dst[2048 + 2]; // CCS output of an N-point real FFT holds N + 2 values

// Read the CPU clock in MHz from the registry (needs a UNICODE build).
DWORD CPUFrequency(void)
{
    const wchar_t *key = L"HARDWARE\\DESCRIPTION\\System\\CentralProcessor\\0";
    const wchar_t *value = L"~MHz";
    DWORD buflen = sizeof(DWORD);
    HKEY hKey;
    DWORD freq = 0;

    RegOpenKeyEx(HKEY_LOCAL_MACHINE, key, 0, KEY_READ, &hKey);
    RegQueryValueEx(hKey, value, NULL, NULL, (LPBYTE)&freq, &buflen);
    RegCloseKey(hKey);
    return freq;
}

// Return the time-stamp counter (32-bit MSVC inline assembly; result in EDX:EAX).
NAKED u64 readtsc(void)
{
    __asm {
        rdtsc
        ret
    }
}

int main(int argc, char *argv[])
{
    IppsFFTSpec_R_32f *pFFTSpec;
    IppStatus s;
    Ipp8u *buf = NULL;
    u64 t0, t1;
    double f;
    int buf_len, i, n, iter = 100000;

    for (i = 0; i < 2048; i++) {
        src[i] = (i + 3) / 3.14f;
    }

    for (n = 6; n < 12; n++) {
        s = ippsFFTInitAlloc_R_32f(&pFFTSpec, n, IPP_FFT_NODIV_BY_ANY, ippAlgHintFast);
        if (s != ippStsNoErr) {
            printf("ippsFFTInitAlloc_R_32f() failed (%d)\n", s);
            return 1;
        }

        s = ippsFFTGetBufSize_R_32f(pFFTSpec, &buf_len);
        if (s != ippStsNoErr) {
            printf("ippsFFTGetBufSize_R_32f() failed (%d)\n", s);
        } else {
            if (buf_len == 0) {
                printf("ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!\n");
            } else {
                printf("Buffer size : %d bytes\n", buf_len);
                buf = ippsMalloc_8u(buf_len);
            }
        }

        // Time iter forward + inverse transforms.
        t0 = readtsc();
        for (i = 0; i < iter; i++) {
            ippsFFTFwd_RToCCS_32f(src, dst, pFFTSpec, buf);
            ippsFFTInv_CCSToR_32f(dst, src, pFFTSpec, buf);
        }
        t1 = readtsc();

        if (buf != NULL) {
            ippsFree(buf);
            buf = NULL; // do not reuse a freed buffer in the next pass
        }
        ippsFFTFree_R_32f(pFFTSpec);

        // ticks / iterations / MHz = microseconds per iteration
        f = (double)(t1 - t0) / iter / CPUFrequency();
        printf("irdft(rdft(%4d)): %5.2f us\n", 1 << n, f);
    }
    return 0;
}
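The last statement of the timing loop turns a time-stamp-counter difference into microseconds per iteration. A minimal sketch of that arithmetic, assuming the counter ticks at exactly the frequency reported in MHz (the helper name `ticks_to_us` is hypothetical and not part of IPP):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of IPP: average time per iteration in
 * microseconds, given two TSC readings and the clock frequency in MHz.
 * Assumes the TSC advances at freq_mhz million ticks per second. */
double ticks_to_us(uint64_t t0, uint64_t t1, unsigned iterations, unsigned freq_mhz)
{
    assert(iterations > 0 && freq_mhz > 0);
    /* ticks / iterations      = ticks per iteration
     * ticks per iteration / MHz = microseconds per iteration */
    return (double)(t1 - t0) / iterations / freq_mhz;
}
```

For example, 224,000,000 ticks spread over 100,000 iterations on a 2240 MHz clock works out to 1.0 us per iteration. If the registry lookup returns a wrong frequency, every reported time is off by the same factor.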
However, there is one serious problem: with 5.2.063 I am getting horrible results with dynamic IPP vs. static IPP.
The reason is that ippsFFTGetBufSize_R_32f() for some reason returns success with a buffer size of 0, and without a work buffer things get awfully slow.
ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!
irdft(rdft(  64)): 126.23 us
ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!
irdft(rdft( 128)): 304.70 us
ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!
irdft(rdft( 256)): 703.33 us
ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!
irdft(rdft( 512)): 1607.48 us
ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!
irdft(rdft(1024)): 3591.07 us
ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!
irdft(rdft(2048)): 7988.09 us
Could you please test this?
Igor, there are a number of issues in the code you provided, and some of them could lead to significant performance degradation. Please take a look at the comments provided by our experts:
1) If GetBufSize returns 0, it means that the PX code is working.
2) The scaling mode is wrong for this measuring technique. It leads to indefinites (Inf/NaN values) in src and dst and to dramatically bad performance.
3) CPU frequency detection doesn't work properly on my machine; let's use ippGetCpuFreqMhz() instead.
4) There are a number of bugs in the attached example; have a look at my comments below.
5) Not critical, but why is iter a double?
int main(int argc, char *argv[])
{
    IppsFFTSpec_R_32f *pFFTSpec;
    IppStatus s;
    Ipp8u *buf; // buf value is undefined here
    u64 t0, t1;
    double f, iter = 100000.0;
    int buf_len, i, n;

    for (i = 0; i < 2048; i++) {
        src[i] = (i + 3) / 3.14f;
    }
    for (n = 6; n < 12; n++) {
        s = ippsFFTInitAlloc_R_32f(&pFFTSpec, n, IPP_FFT_NODIV_BY_ANY, ippAlgHintFast);
        // IPP_FFT_NODIV_BY_ANY leads to a non-normalized result,
        // so src will not contain the same values after the
        // inverse FFT is applied. src grows without control on
        // each iteration, which leads to overflow, indefinites
        // in src and, as a result, significant performance
        // degradation.
        // So IPP_FFT_DIV_INV_BY_N must be used, or the
        // inverse FFT must be called with a destination
        // other than src.
        if (s != ippStsNoErr) {
            printf("ippsFFTInitAlloc_R_32f() failed (%ld)\n", s);
            return 1;
        }
        ippsFFTGetBufSize_R_32f(pFFTSpec, &buf_len); // status is not saved for later analysis
        if (s != ippStsNoErr) { // the status analyzed here doesn't correspond to the GetBufSize call
            printf("ippsFFTGetBufSize_R_32f() failed (%ld)\n", s);
        } else {
            if (buf_len == 0) {
                printf("ippsFFTGetBufSize_R_32f() didn't fail but no buffer size!\n");
            } else {
                printf("Buffer size : %ld bytes\n", buf_len);
                buf = ippsMalloc_8u(buf_len);
            }
        }
        t0 = readtsc();
        for (i = 0; i < iter; i++) {
            // if GetBufSize fails then buf is undefined;
            // it must be initialized at least to NULL at the beginning
            ippsFFTFwd_RToCCS_32f(src, dst, pFFTSpec, buf);
            ippsFFTInv_CCSToR_32f(dst, src, pFFTSpec, buf);
        }
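The scaling comment can be checked without IPP. The sketch below is a hedged illustration, not IPP code: it uses a plain 4-point DFT with no scaling in either direction (all `demo_` names are hypothetical), writes the inverse result back over the input as the benchmark does, and counts how many round trips it takes before a value no longer fits in a 32-bit float such as Ipp32f. The arithmetic is done in double so the growth can be observed past the float range.

```c
#include <assert.h>
#include <float.h>

#define DEMO_N 4 /* 4-point transform: the twiddle factors are exactly 1, i, -1, -i */

/* cos and sin of 2*pi*m/4 for m = 0..3 */
static const double DEMO_COS[DEMO_N] = { 1.0, 0.0, -1.0, 0.0 };
static const double DEMO_SIN[DEMO_N] = { 0.0, 1.0, 0.0, -1.0 };

/* Unscaled 4-point complex DFT. sign = -1: forward, sign = +1: inverse. */
static void demo_dft4_unscaled(const double *in_re, const double *in_im,
                               double *out_re, double *out_im, int sign)
{
    int j, k;
    assert(sign == 1 || sign == -1);
    for (k = 0; k < DEMO_N; k++) {
        double sum_re = 0.0, sum_im = 0.0;
        for (j = 0; j < DEMO_N; j++) {
            double c = DEMO_COS[(j * k) % DEMO_N];
            double s = sign * DEMO_SIN[(j * k) % DEMO_N];
            sum_re += in_re[j] * c - in_im[j] * s;
            sum_im += in_re[j] * s + in_im[j] * c;
        }
        out_re[k] = sum_re;
        out_im[k] = sum_im;
    }
}

/* One forward + inverse pass written back into x. With no 1/N factor
 * anywhere, every element comes back multiplied by DEMO_N. */
void demo_round_trip_unscaled(double *x)
{
    double re[DEMO_N], im[DEMO_N], f_re[DEMO_N], f_im[DEMO_N];
    int i;
    for (i = 0; i < DEMO_N; i++) { re[i] = x[i]; im[i] = 0.0; }
    demo_dft4_unscaled(re, im, f_re, f_im, -1);
    demo_dft4_unscaled(f_re, f_im, re, im, +1);
    for (i = 0; i < DEMO_N; i++) x[i] = re[i];
}

/* Number of round trips after which some element exceeds the 32-bit
 * float range, or -1 if that does not happen within max_trips. */
int demo_trips_until_float_overflow(double *x, int max_trips)
{
    int trip, i;
    for (trip = 1; trip <= max_trips; trip++) {
        demo_round_trip_unscaled(x);
        for (i = 0; i < DEMO_N; i++) {
            if (x[i] > FLT_MAX || x[i] < -FLT_MAX) return trip;
        }
    }
    return -1;
}
```

With 4 points the data grows by a factor of 4 per round trip. With the 64- to 2048-point transforms in the benchmark the factor is 64 to 2048 per iteration, so the float range is exhausted long before the 100000 iterations complete, and the rest of the timing loop runs on non-finite values.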
Regards,
Vladimir
"if GetBufSize returns 0 it means that PX code is working;"
Yes, but the real question you should have answered is: why is that happening in the dynamic build on the Core 2 Duo?
"CPU frequency detection doesn't work properly on my machine"
Strange. That works on all the machines I have tested so far. You just have to use a UNICODE build.
"not critical but why iter is double?"
Leftover.
"IPP_FFT_NODIV_BY_ANY..."
What you are saying is not so obvious from the IPP documentation. From the NODIV_BY_ANY name and from the table I understood that if both factors are 1 then there is no scaling and hence IFFT(FFT(A)) == A. Furthermore, another buffer is not an option here.
"analyzed status doesn't correspond to..."
Syntax error, sorry. I am sure you know how to fix it.
"buf is undefined"
Oops... I wrote that test code in a hurry.
Now, let us return to the real problem here:
1. Why is PX code being called on a Core 2 Duo with dynamic linkage?
2. Why does ippsFFTGetBufSize_R_32f() have a "special case" return value?!?
#1 can be a bug, or perhaps I forgot to initialize something.
#2, however, is bad design. Zero is not a valid buffer size. Returning zero from ippsFFTGetBufSize_R_32f() is a failure, not a silent success.
Moreover, returning zero only when PX code is being run is a "special case return value", which is a Bad Thing™.
Furthermore, such a simple function does not deserve to return an error code; simply returning zero in case of an error would be sufficient. As it stands, one has to make two tests to determine success instead of one.
Finally, not documenting that a function can return zero as a buffer size, and calling that a success, is horrendous.
Hi Igor,
1. The IPP dispatcher should not use PX code on a Core 2 Duo processor. This is the first time we have heard of such a problem. Could you please double-check that this is the case?
2. The *GetBufSize functions return the size of the buffer. If in some particular case a buffer is not needed, its size is zero. Why do you think that is a failure?
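The convention described in point 2 can be handled on the caller's side by treating the status and the size as two separate answers. The sketch below is only an illustration under stated assumptions: every `demo_` name is hypothetical and not part of IPP, and the rule deciding which orders need a buffer is made up for the example.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { DEMO_OK = 0, DEMO_NULL_PTR = -1, DEMO_NO_MEM = -2 } demo_status;

/* Hypothetical stand-in for a *GetBufSize-style call: the status is the
 * return value and the size comes back through an out-parameter.
 * A size of zero means "no work buffer needed", not an error. */
demo_status demo_get_buf_size(int order, int *size)
{
    if (size == NULL) return DEMO_NULL_PTR;
    /* Made-up rule for illustration only: small transforms need no buffer. */
    *size = (order < 4) ? 0 : (1 << order) * (int)sizeof(float);
    return DEMO_OK;
}

/* Allocate a work buffer only when one is required. On success with a
 * zero size, *buf is left as NULL and no allocation takes place. */
demo_status demo_alloc_work_buffer(int order, unsigned char **buf)
{
    int size = 0;
    demo_status s;

    assert(buf != NULL);
    *buf = NULL;                      /* never leave the pointer undefined */
    s = demo_get_buf_size(order, &size);
    if (s != DEMO_OK) return s;       /* a real failure */
    if (size > 0) {
        *buf = malloc((size_t)size);
        if (*buf == NULL) return DEMO_NO_MEM;
    }
    return DEMO_OK;
}
```

In this shape the caller makes one status test, and the zero-size case needs no special branch beyond skipping the allocation. Whether a given library accepts a NULL work buffer in that case is something to confirm in its documentation.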
Regards,
Vladimir
"IPP dispatcher should not use PX code on Core2Duo processor. It is the first time we hear about such problem. Could you please double check that it is the case?"
I have corrected the code as per your engineers' instructions, and I already wrote how I am compiling it, whether static or dynamic (check the #if at the beginning of the code). I suggest you test it yourself; perhaps I have made an error.
"GetBufSize function return size of buffer. If in some particular case buffer is not needed its size is zero. Why do you think it is a failure?"
I believe I have already explained it, but I will give you an analogy. If you ask me "What time is it?", what kind of answer do you expect to get? I presume you would like me to tell you the current time in AM/PM or 24-hour format, right? What will you make of it if I tell you a number which doesn't represent a valid time? Would you consider that an error condition or not?
In my opinion, part of the problem lies in the fact that the PX version doesn't need a work buffer. That is not documented, it is a special case, and frankly I don't see the point of a work buffer being optional. If it improves performance, it should be created automatically when you specify a problem size and used as needed.
Hello,
FFT functions working with the Ipp32f and Ipp32fc data types in the ipps library are threaded for orders 13..17. Functions working with Ipp64f and Ipp64fc are threaded for orders 12..16.
This is all true for the Core 2-specific IPP libraries (V8/U8).
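As a quick way to read those ranges, the sketch below encodes them in a small lookup together with the order-to-length conversion (an FFT of order n has 2^n points). The `demo_` names are hypothetical helpers, not IPP functions, and the ranges are only the ones quoted above for the V8/U8 libraries.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { DEMO_F32, DEMO_F64 } demo_precision;

/* Hypothetical helper, not an IPP function: true when the order falls in
 * the threaded range quoted above (32f/32fc: 13..17, 64f/64fc: 12..16). */
bool demo_fft_order_is_threaded(demo_precision p, int order)
{
    int lo = (p == DEMO_F32) ? 13 : 12;
    int hi = (p == DEMO_F32) ? 17 : 16;
    return order >= lo && order <= hi;
}

/* FFT length in points for a given order (length = 2^order). */
long demo_fft_length(int order)
{
    assert(order >= 0 && order < 31);
    return 1L << order;
}
```

For example, order 13 corresponds to 8192 points. The transforms timed earlier in the thread (orders 6 to 11, Ipp32f) fall below both ranges, so by these figures they would run single-threaded.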
Regards,
Vladimir