Hey Guys,Im an EE major just starting on my BSEE senior project. It is to create an FPGA based acoustic echo canceller. We have to use a 1024 tap adaptive FIR filter. It is required that the 1024 tap fir filter be split into 8 smaller filters. The filtering will be done in the time domain using buffers and MACs. How could this be achieved? Can you directly add the outputs of multiple filters in parallel?
--- Quote Start --- Im an EE major just starting on my BSEE senior project. It is to create an FPGA based acoustic echo canceller. We have to use a 1024 tap adaptive FIR filter. It is required that the 1024 tap fir filter be split into 8 smaller filters. The filtering will be done in the time domain using buffers and MACs. How could this be achieved? Can you directly add the outputs of multiple filters in parallel? --- Quote End --- What is the sample rate of the incoming acoustic signal? 100kHz? FPGAs can be clocked at 1000x this rate, so you essentially have 1000 FPGA clocks per incoming acoustic sample. Inside the FPGA, you can implement a ~1000 tap filter using a single MAC and two RAM blocks; one with samples and one with coefficients. The RAM blocks can be dual-ported and double-buffered, so that you can write new coefficients and data to one half of the RAM using one port, while the digital filter logic is reading from the other half of RAM using the other port. There's really no reason to decide that you have to split the design into 8 separate filters ... but if that is a 'requirement' then perhaps you'll have to do it. It is however possible that your instructions were based on a multi-rate implementation of the same filter. So the first thing I would suggest is getting a clarification on why you have to implement the 8 filter architecture. Regarding your last question; yes you can add lots of filter outputs in parallel, but you generally do not do this in one FPGA clock cycle, you add pairs of samples in an adder tree (the tree depth is log2 of the number of inputs to add). For your 8 inputs, you have 4 x 2-input adders with 4 outputs (that are 1-bit wider), feeding 2 x 2-input adders with 2 outputs (that are 1-bit wider), feeding a final 2-input adder stage (log2(8) = 3 stages of adders). The final sum is 3-bits wider than the input sample width. You need to be careful about rounding and truncation operations. Take a read of these notes: http://www.ovro.caltech.edu/~dwh/correlator/pdf/esc-100paper_hawkins.pdf (http://www.ovro.caltech.edu/%7edwh/correlator/pdf/esc-100paper_hawkins.pdf) http://www.ovro.caltech.edu/~dwh/correlator/pdf/esc-100slides_hawkins.pdf (http://www.ovro.caltech.edu/%7edwh/correlator/pdf/esc-100slides_hawkins.pdf) Cheers, Dave
Thanks for responding to the question and for the pdfs. I didn't mention that the echo cancellation algorithm is Block NLMS. That is why he wants us to split the filter into blocks of 128. So, we would take one block of input data, send it to a filter, take another block, send it to another filter and so on.Each block would have it's own unique coefficients. The thing i'm confused by is how to split up a fir filter and combine the outputs to obtain the equivalent filter.
I am not clear about the over all purpose of your filter, but if your main filter is meant to be 1024 taps and to be applied to the input stream continuously but in blocks of 128 stages x 8 then you will need to add up all the 8 outputs continuously and divide result by 8 for every single output sample.
The purpose of the project is to cancel echo. Think about two conferencing rooms. You are in one room, and you talk. That signal would be xn. That signal reaches the second room, comes out of the speaker, bounces off of the walls, and returns to the mic. This returned signal is dn. The signal dn coming from the other room not only contains the other person's voice, but also some returned echo.The goal is to create an adaptive FIR filter that takes your voice, xn, as an input, and produces an output yn. This output is subtracted from the returned signal, dn, to yield en, the error signal. This error, along with the input xn and the step size, is used to calculate a gradient. This gradient is added to the current coefficient, yielding new coefficients that point in the direction of minimal error. After this happens a few times, the filter will be adjusted, and the filter output when subtracted from the returned signal will completely cancel the echo. This could be done easily on a DSP processor. However, we are to implement it in an FPGA to take advantage of parallelism. A CPU will grab xn and dn from mic inputs on the FPGA board, and send it to the logic. The logic will then calculate en, and each en value will be sent to the output port via the CPU. Due to the length of the return path, it is required that a 1024 tap filter be used. But, the professor insisted that we break up the filter into 8 blocks, and calculate new coefficients for each block seperately. My initial idea was to use muxes, and connect the 8 FIR filters in parallel via the muxes. The ouputs would be summed together to produce the final value. Each time you would send 128 values to one filter, and then control the muxes to move to the next, and so forth. But, this idea seems so clunky. It seems that there could be a better way to achieve this. I tried reading the IEEE papers, but they really aren't that helpful, and are way to complex for this project. I understand Verilog really well. I'm just stuck at the architectural implementation.
I know you can divide a signal into blocks in the frequency domain then process each block then add up the signal. But in your case, the signal xn is in time domain. The concept of of 8 blocks either means 8 independent filters or it means one filter of 1024 taps broken for implementation sake.In case you have to pass one block of data into one delay pipe and not pass it through to other blocks in succession then your filter becomes actually 8 independant filters for same input but segragated sections. May be your professor wants then somehow to choose among them. Addition here breaks the meaning of filtering but can still be viewd as some sort of averaging if you add the 8 results in parallel. So overall I am not clear really just like you. Possibly modelling will help
He said that research showed that some blocks should have their coefficients updated more often than others.He said that rather than update 1024 coefficients each time a new sample arrives, just update 128 coefficients when a block of 128 arrives.
--- Quote Start --- He said that research showed that some blocks should have their coefficients updated more often than others. He said that rather than update 1024 coefficients each time a new sample arrives, just update 128 coefficients when a block of 128 arrives. --- Quote End --- The Quadrature Mirror Filter (QMF) stuff is probably in Vaidyanathan, "Multirate Systems and Filter Banks". Performing this type of division of the filter makes sense when the processor implementing the filters cannot just brute-force implement the filter. Given that you have 1000x more processing power, you should first check that the brute-force approach is not the 'best' approach. Cheers, Dave
In that case it could be he means apply 1024 taps as a standard FIR i.e. input stream enters a 1024 delay pipe continuously though 128 coefficients are to be update-able as one set. Each block of 128 input samples enters next block pipe and so on till the end. As such you can just add up the 8 stages in parallel to get overall sum of products per output sample. You will need to update the 128 sets smoothly without breaking the filter computation by some sort of buffering.
Thanks for the help Kaz. Do you mean to create 8 seperate filters, and add all of the outputs? Then shift each block of 128 to each successive filter?Each filter could consist of two buffers, one for the coefficients and one for the inputs, each 128 in length. These two buffers would feed into seperate MACs. And then I would add the outputs of all of the MACs, while at the same time shifting each previous 128 value block to the next input buffer. Am Is this what you are saying? DWH, are you saying that it is a waste of time to split the filter up? I agree. I don't understand why we are to split it up if we are working in an FPGA. Do you think there would be any performance decrease if we did just brute force it and implement the whole filter?
1) You will need a delay pipe of 1024 stages for input stream. Using fabric registers for this pipe will deplete fpga resource (imagine 1024 x 16 bits = a lot). Therefore you got a problem to start with but it can be implemented in ram blocks or ram ALUTs. ram based shifter is also possible but it has restrictions on tap distance, however you can set it to minimum tap distance e.g. 3 stages and speed up your stream at 3 times the input rate.2) coeffs will need also ram and can updated there and be read back (dual port ram). I don't see any reason now to split up outputs into 8 sections but just add up the whole sum of products in an accumulator. Whether you can add up all or part depends on your processing speed Vs input speed. Regarding my note on updating coeffs to avoid glitchy output, actually I wouldn't worry about since coeffs need be designed with gentle gradient anyway or else a glitch will occur.
--- Quote Start --- DWH, are you saying that it is a waste of time to split the filter up? I agree. I don't understand why we are to split it up if we are working in an FPGA. Do you think there would be any performance decrease if we did just brute force it and implement the whole filter? --- Quote End --- I'm not saying its a waste of time, I'm just questioning whether it is the correct solution. Take for example the following; lets say I have a signal that I sample at 1MHz and need to filter with a 100-tap filter and decimate the signal by 8 times. If you look in a multirate signal processing book, you will get lots of nice math that show how to take your filter taps, split them into 8 parallel filters and run them at 1/8th the input sample rate. Its very cool, but its a complete waste of time for this specific implementation (and possibly resources). The filter can be implemented using one MAC and a RAM and just run it at 100MHz (or whatever higher rate is appropriate for the filter and FPGA). Or you can get slightly trickier and implement the 8 parallel filters using that same MAC and RAM. The only difference then might be that you can run the logic slower - but if you were going to run the logic at 100MHz for your other logic, what difference does this 'savings' make? 'Optimal' is in the eye of the beholder. Cheers, Dave