I'm wondering if there's a solution for the following problem.
For a number of years, I've been working on an audio processing application.
Initially I created my own FFT implementation, SSE2-based, with data (always 2 channels stereo) ordered as left-right-left-right-...
The FFT implementation used the fact that I always had to do exactly the same thing for both channels, which made vectorizing easy.
So far, so good. More recently I discovered IPP, and replaced my own FFT by that from IPP. To be able to use it, I have to rearrange my memory before and after each FFT call (split to 2 channels, call FFT twice, then combine the 2 channels again). The rearranging of.memory is again written in sse2, and very efficient. But... It still takes time.
As things are progressing now, the number of FFT calls in my program keeps going up - and so does the overhead of rearranging the memory. So I was wondering if there's a better solution than rearranging the memory twice.
(i know ideally I should rearrange my data, but that would mean rewriting near 100 processing functions, many of which are partially optimized using sse2. That's why I'm trying to find an easier solution.
Hi, have you tried using the threaded versions of Intel IPP FFT functions? You would still have to rearrange memory but could spread the work over multiple cores. In addition you could make the calls to the FFT functions for the left and right channels on separarate threads. There is a threaded version of the Intel IPP static library and the the shared library version is threaded by default so if you are using the shared version of the library and not adjusting the number of threads then the FFT call is probably threaded; check the file ThreadedFunctionsList.txtwith the Intel IPP documentation to make sure the FFT function you are calling is threaded internally.
Considering the cost of re-arranging memory, I think the best solution is to re-architect your other functions to utilize two planar stereo channels, rather than the two interleaved channels you are now using. There are some functions that can be used to convert between the two arrangements: RealToCplx and CplxToReal can be used to translate between interleaved two channel data and two channel planar data.
But given that you're already a proficient SSE2 programmer I suspect you've already figured out how to re-arrange your date using the most efficient means.
I've tried to rewrite at least parts of my code to avoid having to convert the data, but it turns out that the fact that the left and right channels are interleaved is also benefitting the performance. So I gain some performance by not having to convert the data, but I in some cases I loose more due to the less optimal data storage than I gain from not having to convert the data.
(How is that possible? In some cases I have to combine the output of many different FFT's, and when the left and right channels are stored separately I have to read and write at twice as many different locations. So basically the number of reads and writes doubles, and the number of cache misses probably also increases).
Fortunately, I did find something useful: If I take an FFT of interleaved data, and I want to do something with amplitudes at different frequencies (for example bandpass), equal on both channels, I only have to mirror the behavior for the top end - so at 2^N-x I do the same thing that I do at x. This did help me to increase the performance for certain very simple filters.
Now what I was actually looking for - but I guess I should ask at a maths forum or something - is a more generic solution for other types of filtering as well.