Thank very much for VERY

meldaproduction · ‎03-14-2015

Hi,

I'm currently investigating IPP 8 and have a few problems with the biquad IIR support:

- ippsIIRInitAlloc is now deprecated, fine by me, we can allocate manually. However I need to be able to changes the IIR taps WITHOUT changing the filter state. ippsIIRInit seems to destroy the delay line, which makes it unusable. So the only way to change taps without using deprecated functions is to copy the delay line and set it back with ippsIIRInit, which is unacceptable. So what am I missing?

- In many cases I use multiple instances of the same filter - same taps, but used on multiple signals. Right now it seems I need to allocate the IppsIIRState for each of the signals, which means duplicate the taps. It's always just a single biquad, so no big deal, but...

- The requested "new" method using ippMalloc and ippsIIRInit made me check the sizes of the IppsIIRState structures and these are HUGE - like over 4k for a single biquad filter! For 1000 filters that makes 4MB, that looks hardly efficient, especially since I need just a single biquad, so state structure of "2 numbers" (plus taps, plus anything for optimizations, but that could hardly be 4k). Am I missing something or is it a bug?

Thanks in advance!

Igor_A_Intel · ‎03-16-2015

Hi,

why copying and setting delay line is unacceptable for you? Have you measured performance impact? IIR initialization is on 99% devoted to setting taps in an appropriate and the most suitable for getting maximum performance order:

Optimized algorithm for IIR - recalculation of "A" coefficients (before "y")
for unrolling on four the "y" calculations (to remove feedback latency):
The main formula (for the "k+1" order):
y = -A1*y[n-1]-A2*y[n-2]-...-Ak*y[n-k]+B0*x+B1*x[n-1]+...+Bk*x[n-k];
where A1=a1/a0, A2=a2/a0,...,Bk=bk/a0. (see input order of taps)
We shall do two-steps successive filtering: at first by "x", then by "y":

1) for the vector of length = m+1:
     Fx[0] = B0*x+B1*x[n-1]+...+Bk*x[n-k];
     Fx[1] = B0*x[n+1]+B1*x+...+Bk*x[n-k+1];
     .........................................
     Fx = B0*x[n+m]+B1*x[n+m-1]+...+Bk*x[n-k+m];

2) after the first step we will have:
     y   = -A1*y[n-1] -A2*y[n-2]- ...-Ak*y[n-k] +Fx[0];
     y[n+1] = -A1*y    -A2*y[n-1]- ...-Ak*y[n-k+1]+Fx[1];
     ................................................
     y[n+m] = -A1*y[n-1+m]-A2*y[n-2+m]-...-Ak*y[n-k+m]+Fx;

     To remove feedback latency, let us substitute the y from the first line
     to the next three lines, then y[n+1] from the second line to the next two
     lines, y[n+2] from the third - to the line number four. At the left of the
     first four lines we'll have new output points - y, y[n+1], y[n+2],
     y[n+3], and at the right - the functions from "old" points (y[n-1]...
     y[n-k]) and from Fx[0]...Fx[3] with some new coefficients, which are
     calculated during pState structure initializing in the functions
     ippsIIRInitAlloc(dType):

     For the "y" coefficients:
     line 1: {A1[0],...,Ak[0]};
     line 2: {A1[1],...,Ak[1]} = A1[0] * {A1[0],...,Ak[0]} +
                               + {0,A1[0],...,Ak-1[0]};
     line 3: {A1[2],...,Ak[2]} = A1[0] * {A1[1],...,Ak[1]} +
                               + A2[0] * {A1[0],...,Ak[0]} +
                               + {0,0,A1[0],...,Ak-2[0]};
     line 4: {A1[3],...,Ak[3]} = A1[0] * {A1[2],...,Ak[2]} +
                               + A2[0] * {A1[1],...,Ak[1]} +
                               + A3[0] * {A1[0],...,Ak[0]} +
                               + {0,0,0,A1[0],...,Ak-3[0]};

     For the "Fx" coefficients (for the vector {Fx[0], Fx[1], Fx[2], Fx[3]}):
     line 1:     1,     0,     0,     0;
     line 2: A1[0],     1,     0,     0;
     line 3: A1[1], A1[0],     1,     0;
     line 4: A1[2], A1[1], A1[0],     1;

regarding 4K state size - it is required for taps and for buffer to perform separate processing by B coefficients and by A coefficients, also it's required for in-place case. To filter n-channels signal it's better to use ippsIIR_32f_P function (deprecation already removed from this functionality). Summary: please use IIRInit to set new taps and GetDlyLine for saving the previous state - restoring delay line is invisible from the performance point of view. 4M for 1000 filters is ~0.1% of RAM in the not so modern computers. You always can change the approach - use 1 state and Get/SetDlyLine for each signal...

regards, Igor

meldaproduction · ‎03-16-2015

Thank very much for VERY exhaustive answer Igor!

Why copying a delay line as unacceptable - in our software the filter taps may be updated literally every sample, mostly there will be at least 32 samples being processed, but you never know. Now, besides it is completely unnecessary to "init" it (I assume SetTaps does exactly that - init the taps, but do NOT change the delay line, so I have really no idea why it is deprecated, since there is no other way to do that), copying the delay line means allocating a buffer, copying data to it and then back. It's obvious there will be quite a penalty for that and the only way to lower it is to preallocate a buffer for the delay line, which means yet more memory used by the software.

About the buffer sizes - after I made the post I actually checked the assembler code in the debugger and I noticed that there are the coefficients exactly as you say (though I didn't know what they are of course :) ), but the rest of the buffer doesn't seem to be used. It seems like the implementation is kind of planning for the worst case scenario, but you could easily use stack or let the user of the IIR functions provide a temporary buffer like it is in FFT routines. That way we could share the same temporary buffer. It's partly about memory (our users can have a LOT going on, so every byte counts in a way...), but also about cache misses etc.

One final thing - it seems that your implementation is quite an overkill. It assumes there could be stages of several filters with virtually any number of coefficients. I assume there is some subspecialization for biquad filters, but I'd just like to point out that in my field (audio processing) biquads are used 99.99% of the time (in fact we are using just biquads), more advanced filters can be converted to them etc. I'm just thinking that the performance could be even improved if there would be features dedicated to the most common case, which is "one single biquad". Right now I checked the performance improvement compared to plaing C++ implementation in "double" filters is about 30%, which is ok, it is actually very good, but compared to say "multiply a float vector", where there could be like 90% savings it's just not that much, at least for additional 4k space (in double it is actually 16k, which is a hell of a lot :) ).

Please just don't think I'm arguing or anything, I'm just dealing with this problem, as IIR filtering is a totally essential part of our software used in literally hundreds of cases, so I'm just trying to get it as fast and as "easy on memory" as possible.

Thanks again!

Igor_A_Intel · ‎03-17-2015

Hi,

IPP forum is a place for informal communication. I've understood all your requirements and specific. If you really interesting in making things better - please submit your request to IPS (Intel Premier Support): (a) ippsIIRSetTaps, (b) 1 biQuad special optimization (32f or other data types?) (c) consider a way to reduce memory requirements. The official way guarantees that all your issues will be considered and high probability - solved.

regard, Igor

meldaproduction · ‎03-17-2015

Thank you for the info Igor!

How to use IIR functions for variable filters with multiple states?