We are working on a software synthesizer. During voice initialization, I'm performing an ippsDivCRev_32f_I on 65000 floats.
With IPP 5.2, this used a constant amount of CPU, no problem.
Then we switched to IPP 5.3, and it got weird: the *first* time the function is called, it eats a crazy amount of CPU, roughly 1000x what it should (which is bad for a synthesizer, since it causes audio buffer underruns). On subsequent calls it eats a normal amount of CPU, except that it fluctuates between half and twice what it was eating with the 5.2 version.
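For reference, this is roughly how the first call can be compared with the following ones. It's a minimal sketch, not our actual code: `div_c_rev_inplace` is a portable stand-in that does the same arithmetic as the IPP call (each element becomes `val / element`), and `CallTimings` and `time_calls` are names made up for this example. It measures wall-clock time per call, which is only a proxy for CPU usage.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Stand-in for the IPP call (assumption: same arithmetic, data[i] = val / data[i]).
// In the real test this line would be replaced by the IPP function under test.
inline void div_c_rev_inplace(float val, float* data, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) data[i] = val / data[i];
}

struct CallTimings {
    std::vector<double> micros;  // one wall-clock duration per call, in microseconds
    float checksum = 0.0f;       // uses the results so the work is not optimized away
};

// Runs the kernel `calls` times on a buffer of `len` floats and times each call separately,
// so the first call can be compared against the later ones.
inline CallTimings time_calls(std::size_t len, int calls) {
    CallTimings out;
    std::vector<float> buf(len);
    for (int c = 0; c < calls; ++c) {
        std::fill(buf.begin(), buf.end(), 2.0f);  // fresh input for every call
        auto t0 = std::chrono::steady_clock::now();
        div_c_rev_inplace(1.0f, buf.data(), buf.size());
        auto t1 = std::chrono::steady_clock::now();
        out.micros.push_back(std::chrono::duration<double, std::micro>(t1 - t0).count());
        if (len > 0) out.checksum += buf[0];
    }
    return out;
}
```

Calling `time_calls(65000, 10)` and printing `micros[0]` next to the other entries is enough to see whether the first call stands out.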
I found out that the crazy amount of CPU also scales 1:1 with the number of threads: with 2 threads, it eats half as much as with 4. So I thought IPP may not be pre-creating its threads, and instead creates them the first time a function needs them. Except that setting the number of threads to 1 should disable threading (according to the doc), yet it still eats a crazy amount of CPU (only 4x less than with 4 threads).
So could this be a bug/problem with the function the first time it's called? Or is it normal? I can probably work around it by making IPP process a dummy DivCRev beforehand, so that the next call is smooth.
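The workaround I have in mind would look something like this. It's only a sketch under assumptions: `warm_up_once` is a hypothetical helper (not an IPP function), and it only helps if the first-call cost really is paid once per process. The kernel is passed in as a callable, so in the real application it would wrap the IPP call, e.g. a lambda calling `ippsDivCRev_32f_I(1.0f, p, (int)n)`.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical helper: runs `kernel` exactly once on a scratch buffer, so that any
// one-time setup cost is paid at a safe moment (e.g. at plugin load) and not on the
// audio thread. Returns true only for the call that actually ran the kernel.
template <typename Kernel>
bool warm_up_once(std::once_flag& flag, Kernel kernel, std::size_t len = 65000) {
    bool ran = false;
    std::call_once(flag, [&] {
        std::vector<float> scratch(len, 1.0f);   // dummy data, result is discarded
        kernel(scratch.data(), scratch.size());  // same size as the real workload
        ran = true;
    });
    return ran;
}
```

It would be called once during application startup, before any voice is initialized, with a `std::once_flag` that lives as long as the process.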
Also, is it safe to call ippSetNumThreads all the time? I have a Q6600, and I find the performance of that DivCRev poor compared to IPP 5.2. I'd rather have a constant 1x CPU usage than something fluctuating between 0.5x and 2x. However, I will need more threads for FFT processing (I haven't checked the benefits of threading there yet, but I trust you).
I also found out that calling another function, like an FFT, on a big block also eats an insane amount of CPU the first time, but this cost is shared with the DivCRev (that is, if DivCRev has already been called, the FFT processing won't eat that insane amount of CPU).
So it doesn't look like it's something specific to DivCRev, but really the thread creation. If that's the case:
-can we ask IPP to pre-allocate its threads (at a safer time)?
-why does it have to allocate a thread when we called ippSetNumThreads(1)? Shouldn't this disable the threading? If not, why can't we call it with zero?
The FFT is also giving me poor results: it looks like the quad core actually does worse when multithreading. I'm already not a big fan of this multicore craze, and I was already disappointed to see only a 290% improvement at best out of a quad core in my own multithreaded implementations (I was expecting at least 350% or so, the rest being wasted on task management), but in these few functions, it's just worse than in 5.2.
-5.3 and beta 6 both behave the same way in their multithreaded versions. However, it's still the same when you call ippSetNumThreads(1). So it's pretty bad, as you can't disable the threading for certain functions only?
-5.2, 5.3 & 6 all behave the same in their non-threaded versions. For my own use, these versions seem to be more efficient: to start with, the CPU usage is constant (not fluctuating like the threaded versions), and lower. It's a quad core with not much running in the background (let's say only 5% is used).
So it's quite disappointing: given the huge first-time CPU hit and the unstable, higher average CPU usage, I see no reason to use the threaded versions right now.
I wasn't really expecting any improvement using IPP in my usual use, an audio application that has to process very small buffers; I wasn't even expecting the multithreading to kick in for small buffers (a couple of hundred items). But as I wrote above, I tested this on 65k items. If this still isn't big enough to see multithreading improvements, maybe you should consider raising the point at which multithreading kicks in (but then it won't be very useful for 1D stuff like signal processing, maybe still for 2D bitmaps).
It's Vista 64-bit, but the piece of code is 32-bit (and so is the DLL), running on a Q6600.
I could detail how the higher-priority threads are created in our app, but it doesn't really matter as I could replicate it in a test app.
Simply loading the dynamic builds results in:
-5.2, 5.3 & 6 beta: normal CPU usage all the time for the ippsDivCRev_32f_I
-5.3 threaded & 6 beta threaded versions: very high CPU usage on the first call. Really, this looks like the threads are created on the first call. I find that normal; we'd just need a function to pre-create those threads. However, it's strange that it does this when the number of threads is set to 1. As if it were creating... 1 thread.