FIR Filter Design in Arria V/Cyclone V DSP Block Using VHDL Inferring

Introduction

Altera’s 28-nm DSP architecture includes a host of features for optimizing FIR filter implementations:

- Hard, built-in pre-adders can be used when implementing symmetric filters to cut multiplier usage by half.
- Internal coefficient register storage allows the designer to store the filter coefficients inside the DSP block, which not only saves registers and memory but also allows a higher fMAX because coefficients do not have to be routed from the logic.
- Distributed output adder, output register and cascade path for implementing systolic FIR filters.

This document demonstrates how these features can be inferred by Quartus II from VHDL when designing filters. Four FIR filters are featured: multichannel interpolation, multichannel decimation, single-rate multichannel systolic symmetric, and single-channel single-rate systolic symmetric designs. The filter requirements and the DSP block features demonstrated are given in Table 1. The inference code can be downloaded here.

The filters included in this package have the following features:

- All FPGA modules are inferred without invoking MegaFunctions. Modules inferred include DSP blocks, parallel adders and memory-based shift tap registers.
- Fully parameterized design, allowing the same template to be used for other bit widths, filter sizes, numbers of channels, filter lengths, filter coefficients, sample rates and FPGA clock rates. Because of time constraints, the template has only been tested against the parameters listed in the filter requirements in Table 1.

Each filter type has the following deliverables:

- Source code in VHDL.
- Test bench.
- Input test data. Only a sine wave input is tested.
- Tcl scripts for running ModelSim simulation.
- Quartus project file.

RTL inference focuses on the DSP block functionality shown in Table 1. This project also uses arrays and multi-dimensional arrays to make the design more compact, readable and manageable. Because VHDL is strongly typed, this requires custom data type definitions, and type conversions may be needed in various places. For all four FIR designs, a package definition file is included that defines the necessary data types. The dimensions of all array types are fully parameterized.

All four filters in this example have the following interface signals:

The following table lists the interface signals.

The multi-channel input data follows the protocol used in the Altera FIR Compiler II: an aggregate of valid channel data followed by don’t care samples. Suppose the interpolation filter has four valid channels and an interpolation rate of four. Every 16 clock cycles there is then one valid sample for each input channel. The channel data take the following format at the input to the FIR:

where <X> marks a don’t care sample. Alternatively, the input data could take the following more symmetrical format:

The format in Figure 2 is more general and is the format adopted in this design. It requires a slightly more complex data delay tap structure in interpolation filters, but can easily be modified to support the format in Figure 3. The decimation and single-rate multi-channel filter types support both input data formats, since their inputs carry only multi-channel signals with no <X> inserted. The single-rate single-channel filter also supports both, since its input carries only a single-channel signal with no <X> inserted.
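The input protocol described above can be modeled in a few lines of Python. This is only a sketch: the figures are not reproduced in the text, so the layout below follows the prose description (all valid channel samples first, then don’t care slots), and the function name `build_input_frame` and the tuple layout are hypothetical.

```python
def build_input_frame(channel_samples, rate):
    """Return one input frame as (valid, channel, data) tuples, one per clock cycle.

    Assumed layout: all valid channel samples first, followed by don't-care
    slots that pad the frame to len(channel_samples) * rate cycles.
    """
    num_chan = len(channel_samples)
    frame = [(True, ch, data) for ch, data in enumerate(channel_samples)]
    # Don't-care slots carry no channel and no data in this model.
    padding = num_chan * (rate - 1)
    frame.extend((False, None, None) for _ in range(padding))
    return frame


# Four channels with an interpolation rate of four give a 16-cycle frame.
frame = build_input_frame([11, 22, 33, 44], rate=4)
```

With four channels and a rate of four, the frame holds four valid cycles followed by twelve don’t care cycles, which matches the 16-cycle period mentioned above.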

The filters are fully parameterized, with the key parameters listed below. Please refer to the source code, especially the package definition file, for the complete set of parameters.

The interpolation filter is implemented as a polyphase decomposed direct form FIR filter. The polyphase decomposition of an interpolation filter can be viewed as a commutator that samples the outputs of M parallel sub-filter paths, where M is the interpolation rate.

By using polyphase decomposition and running the FPGA at least M times faster than the input sample rate, we can reuse the multipliers on each polyphase sub-filter path. As a result, the number of multipliers required to run the FIR engine is reduced M-fold from the total filter size. For a multi-channel interpolation FIR filter, the FPGA must run at least M * num_chan_c times faster than the input sample rate. In the following description, rate_c is used to represent the interpolation rate (M).
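The equivalence behind the polyphase decomposition can be checked with a small behavioral model. The sketch below is plain Python, not the VHDL from the package, and all function names are illustrative. It compares zero-stuffing followed by a full-length FIR against M sub-filters whose outputs are read in turn by a commutator.

```python
def fir(x, h):
    """Direct-form FIR with zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def interpolate_direct(x, h, rate):
    """Insert rate-1 zeros after every sample, then filter at the high rate."""
    upsampled = []
    for sample in x:
        upsampled.append(sample)
        upsampled.extend([0] * (rate - 1))
    return fir(upsampled, h)


def interpolate_polyphase(x, h, rate):
    """Run `rate` sub-filters at the input rate; a commutator reads them in turn."""
    phases = [h[p::rate] for p in range(rate)]  # phase p holds h[p], h[p+rate], ...
    y = []
    for n in range(len(x)):
        for p in range(rate):  # commutator position
            sub = phases[p]
            y.append(sum(sub[k] * x[n - k] for k in range(len(sub)) if n - k >= 0))
    return y


x = [1, 2, 3, 4, 5]
h = [1, 2, 3, 4, 5, 6, 7, 8]
same = interpolate_polyphase(x, h, 4) == interpolate_direct(x, h, 4)
```

Each output sample in the polyphase form needs only len(h) / M multiplications, which is why the same multipliers can be time-shared across the M phases when the clock runs M times faster than the input rate.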

Figure 5 shows the architecture of an 8-tap direct form multi-channel interpolation FIR filter. It can easily be extended to other tap counts. Zc denotes a delay of c cycles, where c is equal to rate_c * num_chan_c.

Alternatively, you can implement the polyphase FIR in systolic mode, utilizing the chain-out data path available in Arria V or Cyclone V DSP blocks. Systolic mode is discussed in more detail later.

In the inference code, the interpolation rate is 4, the number of channels is 4 and the number of taps is 32. The filter coefficients are stored internally in the DSP blocks. Each DSP block stores 4 different coefficients, corresponding to the 4 polyphase components. A select signal controls both the data MUX and the coefficient selection, so that coefficients and channel data are properly aligned. The outputs of the multipliers are grouped into pairs, so that the sum-of-two mode of the DSP block can be used. The sum-of-two mode allows the output of the first DSP block to be sourced into the second DSP block through the chain-out adder path. The outputs from every other DSP block are then collected for the final summation, which therefore requires a 16-port pipelined adder.
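The idea of a per-multiplier coefficient bank driven by a select signal can be illustrated as follows. This is a hypothetical Python model, not the inferred VHDL: the mapping of select value to polyphase component, the single-channel simplification and the names `coefficient_banks` and `time_multiplexed_output` are assumptions made for the sketch.

```python
def coefficient_banks(h, rate):
    """Split h into one bank per multiplier; bank k holds h[k*rate + p] for phase p."""
    if len(h) % rate:
        raise ValueError("filter length must be a multiple of the rate")
    return [list(h[k * rate:(k + 1) * rate]) for k in range(len(h) // rate)]


def time_multiplexed_output(banks, taps, sel):
    """One output sample: every multiplier uses the coefficient chosen by sel.

    taps[k] is the input sample seen by multiplier k (x[n-k] in this model).
    """
    return sum(bank[sel] * x for bank, x in zip(banks, taps))


banks = coefficient_banks([1, 2, 3, 4, 5, 6, 7, 8], rate=4)
phase1 = time_multiplexed_output(banks, taps=[10, 20], sel=1)
```

In this model an 8-coefficient filter with a rate of 4 needs only two multipliers, each holding four coefficients, and the select value steps through the four phases while the tap data stays in place.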

As we will show shortly, to infer the tap delay line implementation, a few parameters are needed:

Given the filter requirement in Table 1, we have the following parameters for the interpolation filter:

Figure 5 shows that the VHDL consists of three parts: the shift register chain, the DSP block instances and the adder chain. The adder chain is implemented by the parallel_add block in Megafunctions/LPM, so only the code for the shift register chain and the DSP block instances is introduced here.

The altshift_taps MegaFunction creates a shift register chain with equally spaced taps in RAM. Details on inferring altshift_taps can be found in *Quartus II Handbook, Recommended HDL Coding Styles*.

There are two kinds of DSP blocks as shown in Figure 5. The functionality of DSP block and the corresponding inferring code are shown in Figure 7 and Figure 8.

The DSP block usage summary is shown in Figure 9. The number of DSP blocks and the used features match that described in Table 1.

The decimation filter is implemented as a polyphase decomposed direct form FIR filter. The polyphase decomposition of a decimation filter can be viewed as a commutator that delivers input samples sequentially to M parallel sub-filter paths, where M is the decimation rate.

By using polyphase decomposition and running the FPGA at least M times faster than the input sample rate, we can reuse the multipliers on each polyphase sub-filter path. As a result, the number of multipliers required to run the FIR engine is reduced M-fold from the total filter size. For a multi-channel decimation FIR filter, the FPGA must run at least M * num_chan_c times faster than the input sample rate. In the following description, rate_c is used to represent the decimation rate (M).

Note that the input is delivered to the last polyphase of the filter first.
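A behavioral check of the decimation case, again as a plain Python sketch with illustrative names: the polyphase form should match filtering at the high rate and keeping every M-th output. The inner loop walks the phases from last to first, mirroring the order in which the commutator delivers consecutive input samples.

```python
def fir(x, h):
    """Direct-form FIR with zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def decimate_direct(x, h, rate):
    """Filter at the input rate, then keep every rate-th output."""
    return fir(x, h)[::rate]


def decimate_polyphase(x, h, rate):
    """Sum `rate` sub-filters, each fed with every rate-th input sample."""
    phases = [h[p::rate] for p in range(rate)]
    n_out = (len(x) + rate - 1) // rate
    y = []
    for n in range(n_out):
        acc = 0
        # Inputs x[n*rate - p] arrive in time order p = rate-1, ..., 0,
        # so the last phase receives its sample first.
        for p in range(rate - 1, -1, -1):
            for k, coef in enumerate(phases[p]):
                i = (n - k) * rate - p
                if i >= 0:
                    acc += coef * x[i]
        y.append(acc)
    return y


x = list(range(1, 14))
h = [1, 2, 3, 4, 5, 6, 7, 8]
same = decimate_polyphase(x, h, 4) == decimate_direct(x, h, 4)
```

Only the outputs that survive decimation are ever computed, so each multiplier is again shared across M coefficients.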

Figure 10 shows the architecture of an 8-tap direct form multi-channel decimation FIR filter. As in the interpolation FIR, the filter coefficients are stored internally in the DSP blocks. The number of taps is also 32. Each DSP block stores 4 different coefficients, corresponding to the 4 polyphase components. A select signal controls coefficient selection. No MUX is needed along the tap delay line. The output signal follows the format shown in Figure 2, so the output valid signal is periodically deasserted. The outputs of the multipliers are grouped into pairs, so that the sum-of-two mode of the DSP block can be used. The sum-of-two mode allows the output of the first DSP block to be sourced into the second DSP block through the chain-out adder path. The outputs from every other DSP block are then collected for the final summation, which therefore requires a 16-port pipelined adder. An accumulator supporting multiple channels completes the final stage of the decimation filter. The DSP block internal accumulator cannot be used in this example because the multi-channel requirement makes the accumulation delay longer than 1 cycle. In a single-channel decimation FIR, however, you might be able to use the DSP block internal accumulator to further speed up your design.
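The reason a multi-channel accumulator needs a feedback delay longer than one cycle can be seen in a small model. The sketch below assumes a channel-interleaved ordering of partial sums (for each polyphase component, one value per channel in turn), which the text does not spell out, and the name `multichannel_accumulate` is hypothetical.

```python
from collections import deque


def multichannel_accumulate(partials, num_chan, rate):
    """Accumulate `rate` partial sums per channel from an interleaved stream.

    Assumed ordering: for each phase, one partial sum per channel in turn.
    The running sum of a channel reappears num_chan cycles later, so the
    feedback path is a num_chan-deep delay line rather than one register.
    """
    ring = deque([0] * num_chan, maxlen=num_chan)
    outputs = []
    for cycle, value in enumerate(partials):
        phase = (cycle // num_chan) % rate
        previous = ring[0]  # same channel, num_chan cycles ago
        total = value if phase == 0 else previous + value
        ring.append(total)  # maxlen drops the oldest entry
        if phase == rate - 1:
            outputs.append((cycle % num_chan, total))
    return outputs


results = multichannel_accumulate(
    [1, 10, 2, 20, 3, 30, 4, 40, 5, 50, 6, 60], num_chan=2, rate=3)
```

With a single channel the delay line collapses to one register, which is the case where an accumulator with single-cycle feedback would suffice.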

Given the filter requirement in Table 1, we have the following parameters for the decimation filter:

The inference code for the decimation FIR is very similar to that of the interpolation FIR. Please refer to the interpolation FIR section for details.

The compilation results for DSP block usage are the same as those of the interpolation FIR. Please refer to the interpolation FIR section for details.

For a single-rate FIR, the systolic structure utilizes the chain-out data path inside the DSP blocks and can therefore speed up the FIR significantly. If the filter has symmetric coefficients, we can halve the multiplier count by bending the tap delay line and pre-adding input data in pairs.
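The pre-adder saving can be verified with a short behavioral model. This Python sketch (function names are illustrative) folds an even-length symmetric filter so that each coefficient multiplies the sum of the two samples that share it, and compares the result with the unfolded direct form.

```python
def fir_direct(x, h):
    """Direct-form FIR with zero initial state: one multiply per tap."""
    return [sum(h[j] * x[n - j] for j in range(len(h)) if n - j >= 0)
            for n in range(len(x))]


def fir_symmetric_folded(x, h_half):
    """Even-length symmetric FIR using pre-adders: one multiply per tap pair."""
    half = len(h_half)
    taps = 2 * half

    def sample(i):
        return x[i] if i >= 0 else 0

    y = []
    for n in range(len(x)):
        acc = 0
        for k, coef in enumerate(h_half):
            # Taps k and taps-1-k share a coefficient, so add the data first.
            pre_add = sample(n - k) + sample(n - (taps - 1 - k))
            acc += coef * pre_add
        y.append(acc)
    return y


x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3]
h_half = [1, -2, 3, 4]
same = fir_symmetric_folded(x, h_half) == fir_direct(x, h_half + h_half[::-1])
```

The folded form uses four multiplications per output for this 8-tap filter, against eight in the direct form.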

Figure 12 shows an example architecture of a 16-tap multi-channel systolic symmetric FIR. This architecture can easily be extended to other tap counts. Parameter c reflects the time division multiplexing (TDM) factor of a multi-channel FIR filter. It is the ratio of the FPGA clock rate to the input data sample rate, assuming all channels have the same sample rate. The number of channels supported in your design should not exceed c. When c is 1, the design reduces to a single-channel FIR, which is discussed in more detail in the next section.

As shown in Figure 12, the output of DSP block is fed to the chain-in of next DSP block. Due to the additional pipeline on the chain out adder path, the tap delay line depth changes after it bends. The tap delay line needs to be implemented in on-chip RAM.

The inferring example of VHDL code has the following parameters:

Figure 12 shows that the VHDL consists of two parts: the tap delay line and the DSP block instances. The tap delay line can be inferred in a similar manner to the interpolation filter case. Please refer to Figure 6 for details. Be mindful that the forward, backward and folding point taps have different depths. As a result, three separate shift registers are instantiated and concatenated together.
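The three different tap depths can be motivated by a cycle-level model. The sketch below is a Python approximation, not the packaged VHDL, and its tap spacings are an assumption derived by retiming the filter equation rather than taken from the source: c+1 cycles between forward taps, c cycles across the folding point and c-1 cycles between backward taps, with one chain-out register per stage. Input and output pipeline registers of a real DSP block are not modeled.

```python
def fir_direct_tdm(u, h, c):
    """Reference: FIR on a stream where samples of one channel are c cycles apart."""
    return [sum(h[j] * u[t - j * c] for j in range(len(h)) if t - j * c >= 0)
            for t in range(len(u))]


def systolic_symmetric_tdm(u, h_half, c):
    """Cycle-level model of a folded systolic FIR with one chain register per stage."""
    half = len(h_half)
    taps = 2 * half

    def sample(i):
        return u[i] if 0 <= i < len(u) else 0

    chain = [0] * half  # chain-out registers, one per stage
    out = []
    for m in range(len(u) + half - 1):
        new_chain = []
        for k in range(half):
            forward = sample(m - k * (c + 1))                  # c+1 cycles per forward tap
            backward = sample(m - (taps - 1) * c + k * (c - 1))  # c-1 cycles per backward tap
            chain_in = chain[k - 1] if k else 0                # previous stage, one cycle old
            new_chain.append(chain_in + h_half[k] * (forward + backward))
        chain = new_chain
        out.append(chain[-1])
    return out[half - 1:]  # drop the pipeline latency of the chain


u = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, 5, -8, 9, 7, -9, 3, 2, 3, -8, 4]
h_half = [1, -2, 3]
h_full = h_half + h_half[::-1]
matches = all(systolic_symmetric_tdm(u, h_half, c) == fir_direct_tdm(u, h_full, c)
              for c in (1, 2, 4))
```

Under these assumptions, setting c to 1 gives two delays between forward taps and none between backward taps, which is consistent with the single-channel case described later in the document.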

There are two kinds of DSP blocks, as shown in Figure 12. The functionality of each DSP block and the corresponding inference code are shown in Figure 13 and Figure 14. The final data output comes directly from the chain-out adder output of the last DSP block.

The DSP block usage summary is shown in Figure 15. The number of DSP blocks and the used features match that described in Table 1.

As in the multi-channel case, the systolic structure of a single-channel single-rate FIR utilizes the chain-out data path inside the DSP blocks and can therefore speed up the FIR significantly. If the filter has symmetric coefficients, we can halve the multiplier count by bending the tap delay line and pre-adding input data in pairs.

The single-channel design has the same structure as the multi-channel systolic FIR, but its tap delay line has exactly two delays between taps. That is, the c parameter in Table 6 becomes 1. The DSP block scanout register can serve as the additional pipeline register required in the systolic FIR, since the scanout register is designed to balance the chain-out register.

If you do not rely on the symmetric coefficient property or use the pre-adder, a single-channel systolic FIR can potentially be mapped entirely inside DSP blocks. The tap delay lines are then implemented using the input pipeline registers inside the DSP blocks, which further improves the speed of the FIR.

Similar to the multiple channel systolic FIR, the multipliers and chain out adders are also mapped completely inside the DSP blocks, resulting in a very high speed FIR structure.

The only difference from the multi-channel systolic FIR design is how the tap delay line is implemented. Instead of inferring the altshift_taps MegaFunction, you should use simple two-cycle-delay registers to realize the tap delay line.

The pre-adders and chain-out adders can be inferred in the same manner as in the multi-channel systolic FIR. Using the correct chain-out data width is the key to successful inference.

The compilation results for DSP block usage are the same as those of the multi-channel single-rate systolic symmetric FIR. Please refer to that section for details.

This document describes the key VHDL inference elements for optimally designing FIR filters inside Arria V/Cyclone V DSP blocks. Although VHDL coding styles may vary, simple guidelines are provided to highlight the steps necessary for successful inference by the Quartus II software.
