DSP Pipelining

Altera_Forum · 02-11-2012

Hey guys,

I've been posting on here for a while now, asking questions related to the NLMS senior project I've been working on.

I now have the system working, with a fully functional RTL simulation. However, when I ran a gate-level simulation of the system, I received hold errors.

I showed the professor the Verilog, and he said the problem was that we were letting our state machine start calculations asynchronously. In other words, the posedge of control signals from the state machine triggered always blocks that began the coefficient update calculations. Such asynchronous behavior seems to cause timing issues and glitches.

Instead, he recommended that we perform the calculations on the clock edges, and then use if statements to check whether the calculations were ready.

After implementing these changes so that everything updates on the clock edge, Fmax dropped to 10 MHz.

We discovered that the calculations were taking too long, so we pipelined them. Now Fmax is up to 60 MHz.

However, we now have 10 clock cycles of delay to perform one calculation. The old state machine used to calculate a coefficient, perform a FIR MAC with that coefficient, and repeat L times, where L was the tap length of the FIR filter.

Now those 10 delay cycles have to be incorporated into the state machine. It already took 6 cycles just to perform one FIR MAC, so with 10 extra cycles it takes 16 cycles per FIR MAC. If you multiply that by the sampling rate, which is 16 kHz (audio), and the number of taps, which is 1024, you get about 262 MHz for the required filter clock frequency.
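As a quick check of that arithmetic, here is a small Python sketch. The figures are the ones quoted in this post; the variable names are only illustrative.

```python
# Back-of-the-envelope check of the required clock frequency.
# All numbers come from the post above; names are illustrative only.
CYCLES_PER_MAC = 6 + 10    # 6 original cycles plus 10 pipeline delay cycles
NUM_TAPS = 1024            # FIR tap length L
SAMPLE_RATE_HZ = 16_000    # audio sampling rate

required_clock_hz = CYCLES_PER_MAC * NUM_TAPS * SAMPLE_RATE_HZ
print(f"{required_clock_hz / 1e6:.1f} MHz")  # prints "262.1 MHz"
```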

But Fmax is only 60 MHz, even with pipelining!

Any suggestions what to do?

Altera_Forum · 02-11-2012

You can design your pipeline in such a way that a sample enters the pipeline on every clock. So while one sample enters at the front of the pipeline, as many samples as your pipeline is long are already travelling towards the output. The only thing you have to ensure is that the necessary coefficients, etc. are also propagated through the DSP chain.

That way you only need a clock of about 16 MHz, which should be easy to achieve.
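To put rough numbers on this, the sketch below compares the clock needed when each tap waits for the previous one to finish against a pipeline that accepts one tap per clock. It assumes the whole 16-cycle operation from the first post can be pipelined with a new tap issued every clock, which may not hold for the real design (memory ports, for example, could limit the issue rate). The function name is hypothetical.

```python
def required_clock_hz(num_taps, latency, sample_rate_hz, pipelined):
    """Clock needed to process every tap once per sample period."""
    if pipelined:
        # One tap is issued per clock; the last one drains `latency` clocks later.
        cycles_per_sample = num_taps + latency - 1
    else:
        # Each tap waits until the previous tap has finished completely.
        cycles_per_sample = num_taps * latency
    return cycles_per_sample * sample_rate_hz


for mode in (False, True):
    hz = required_clock_hz(1024, 16, 16_000, pipelined=mode)
    print("pipelined " if mode else "serialized", f"{hz / 1e6:.1f} MHz")
```

With these assumptions the pipeline latency is paid once per sample instead of once per tap, so the requirement drops from about 262 MHz to about 16.6 MHz.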

Altera_Forum · 02-13-2012

Thanks for responding.

I didn't mention that this was an adaptive filter.

I'm doing convolutions with a circular input buffer and a linear coefficient buffer. The state machine grabs the topmost input sample and the lowest coefficient and, along with the error output, calculates the new coefficient. Then it uses this new coefficient to calculate the next FIR MAC.

When you implement pipelining, each coefficient calculation takes several cycles to complete.

The next FIR MAC calculation requires the latest coefficient to be calculated first. That is the problem I'm facing. If I implement a 5 stage pipeline, for example, I will have to wait 5 cycles before I can perform the MAC with the newly calculated coefficient.

I still don't understand how I can get around this.

Altera_Forum · 02-13-2012

--- Quote Start ---

Thanks for responding.

I didn't mention that this was an adaptive filter.

--- Quote End ---

I took a quick look at Wikipedia, the maths are above me :)

--- Quote Start ---

I'm doing convolutions with a circular input buffer and a linear coefficient buffer. The state machine grabs the topmost input sample and the lowest coefficient and, along with the error output, calculates the new coefficient. Then it uses this new coefficient to calculate the next FIR MAC.

When you implement pipelining, each coefficient calculation takes several cycles to complete.

The next FIR MAC calculation requires the latest coefficient to be calculated first. That is the problem I'm facing. If I implement a 5 stage pipeline, for example, I will have to wait 5 cycles before I can perform the MAC with the newly calculated coefficient.

I still don't understand how I can get around this.

--- Quote End ---

Now if you need the results of the previous sample to calculate the next one, pipelining is not going to be much help.

If I get it right, you do a coefficient calculation and then do a FIR of 1024 taps, or does every FIR stage need a newly calculated coefficient? In the first case you could pipeline the FIR for as many stages as you see fit and repeat the FIR operation (L / stages) times. This way you can keep the required clock frequency in the 60 MHz region (which is easy to fit).
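For the second case, a small model may help. It assumes that the update for tap k depends only on that tap's old coefficient, the previous error, and one input sample, which is the usual LMS/NLMS form; the thread does not show the actual update equation, so this is a guess. Under that assumption the update for tap k+1 does not need the result for tap k, and a new update can be issued every clock while earlier ones are still in flight. All names here are hypothetical, and `step` stands for the already normalized step size.

```python
from collections import deque


def fused_sequential(w, x_prev, x, e_prev, step):
    """Update tap k, then immediately MAC with the new coefficient."""
    w_new, acc = list(w), 0.0
    for k in range(len(w)):
        w_new[k] = w[k] + step * e_prev * x_prev[k]  # coefficient update
        acc += w_new[k] * x[k]                       # MAC with the new value
    return w_new, acc


def fused_pipelined(w, x_prev, x, e_prev, step, depth):
    """Same arithmetic, but each update result appears `depth` clocks after issue."""
    in_flight = deque()  # entries: (clock when ready, tap index, new coefficient)
    w_new, acc, cycle = list(w), 0.0, 0
    while cycle < len(w) or in_flight:
        if cycle < len(w):
            # Issue one update per clock; it only needs tap `cycle`'s own data.
            k = cycle
            in_flight.append((cycle + depth, k, w[k] + step * e_prev * x_prev[k]))
        if in_flight and in_flight[0][0] == cycle:
            # A finished coefficient leaves the pipeline and feeds the MAC.
            _, k, coeff = in_flight.popleft()
            w_new[k] = coeff
            acc += coeff * x[k]
        cycle += 1
    return w_new, acc, cycle


w = [0.1 * k for k in range(8)]
x_prev = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.0, -2.0]
x = [1.0, 0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.0]
print(fused_sequential(w, x_prev, x, 0.3, 0.01)[1])
print(fused_pipelined(w, x_prev, x, 0.3, 0.01, depth=5)[1:])
```

Both functions produce the same coefficients and the same accumulated sum. The pipelined version finishes in `len(w) + depth` clocks, so the 5-cycle wait is paid once per sample and not once per tap.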

Altera_Forum · 02-15-2012

I think what you are dealing with is the filter rate and the update rate of your NLMS algorithm. Normally the FIR filter does not get updated each and every clock cycle in an adaptive filter. I would split the problem and work on them independently.
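One way to read "split the problem" is to run the filter and the update as two separate passes per sample: first the full FIR with the coefficients frozen, then the coefficient update using the single error value from that pass. The sketch below is a minimal floating-point NLMS step written that way. It is the textbook form and not the poster's actual design; `mu`, `eps`, and the function name are assumptions.

```python
def nlms_step(w, x, desired, mu=0.5, eps=1e-6):
    """One sample period of NLMS, split into a filter pass and an update pass."""
    # Pass 1: FIR filter with coefficients frozen for this sample.
    y = sum(wk * xk for wk, xk in zip(w, x))
    e = desired - y

    # Pass 2: update every coefficient from the same error value.
    # No tap depends on another tap's result, so this pass pipelines freely.
    norm = eps + sum(xk * xk for xk in x)
    gain = mu * e / norm
    w_next = [wk + gain * xk for wk, xk in zip(w, x)]
    return w_next, y, e
```

If each pass accepts one tap per clock, two passes over 1024 taps at 16 kHz need roughly 2 × 1024 × 16 kHz, or about 33 MHz plus pipeline fill time, which is below the 60 MHz Fmax mentioned earlier in the thread.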