Complex Multiplier Implementation in Arria 10

Description

This wiki page is for users implementing a complex multiplier in Arria 10 and other device families. A complex multiplier can be implemented in many ways, and the results differ in hardened DSP block packing and maximum frequency (fmax). In general, review any function targeted at the hardened DSP blocks in Arria 10 or other Intel FPGAs to confirm that it is packed into the DSP blocks as intended and not placed in the fabric, which can raise LUT/register utilization and reduce fmax. Migrating a DSP function from a previous family often results in suboptimal packing and logic utilization, especially if the DSP architecture changed in the new FPGA family. This article shows three implementations of a complex multiplier, how each one synthesizes, and the resulting fmax.

Three different implementations of a complex multiplier were created:

1. The altmult_complex IP, parameterized and instantiated.

2. One of the Quartus RTL DSP multiplier templates, adjusted to match a complex multiplier.

3. Direct parameterization of the Native Fixed-Point DSP IP.

The Complex Multiplier

Given two complex numbers:

x = a + bi

y = c + di

Multiplying x and y gives us the following:

x * y = (a + bi) * (c + di)

= ac + adi + bci – bd

= (ac – bd) + (ad + bc)i
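
As a quick numeric check of the expansion above, the following minimal sketch computes the real and imaginary parts with four real multiplications and compares them with Python's built-in complex type. The function name complex_multiply and the sample operands are illustrative only and are not part of the design example.

```python
def complex_multiply(a, b, c, d):
    """Return (real, imag) of (a + bi) * (c + di) using four real multiplies."""
    real = a * c - b * d  # ac - bd
    imag = a * d + b * c  # ad + bc
    return real, imag


# Illustrative operands: x = 3 + 2i, y = 1 + 4i
real, imag = complex_multiply(3, 2, 1, 4)

# Cross-check against Python's native complex arithmetic
reference = complex(3, 2) * complex(1, 4)
print(real, imag, reference)
```

Each output is a sum (or difference) of two products, which is why a DSP block that computes a sum of two multiplications is a natural fit for each half of the result.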

Arria 10 Complex Multiplier

The following document explains how a complex multiplier can be implemented in Arria 10.

a10_handbook.pdf

A10_Handbook_3.5.1.2_Complex_Mult.JPG

As shown, up to an 18x19 complex multiplier can be implemented using two Arria 10 hardened DSP blocks. However, not all implementations will map the complex multiplier to just the two DSP blocks.
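
To get a feel for the operand and result sizes involved, the sketch below works out how many bits the real and imaginary sums need in the worst case. It assumes, for illustration only, that x = a + bi has signed 18-bit parts and y = c + di has signed 19-bit parts; it is plain integer arithmetic and makes no claim about the internal widths of the DSP block. Because each sum is linear in every operand, its extremes occur at the corner values of the operand ranges, so only those corners are enumerated.

```python
from itertools import product


def signed_range(bits):
    """Return (min, max) of a two's-complement integer of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def signed_width(value):
    """Smallest two's-complement width that can hold value."""
    if value >= 0:
        return value.bit_length() + 1
    return (-value - 1).bit_length() + 1


# Assumption: a, b are signed 18-bit values and c, d are signed 19-bit values.
corners_18 = signed_range(18)
corners_19 = signed_range(19)

real_values = []
imag_values = []
for a, b, c, d in product(corners_18, corners_18, corners_19, corners_19):
    real_values.append(a * c - b * d)  # real part: ac - bd
    imag_values.append(a * d + b * c)  # imaginary part: ad + bc

real_bits = max(signed_width(v) for v in real_values)
imag_bits = max(signed_width(v) for v in imag_values)
print(real_bits, imag_bits)
```

Under these assumptions the imaginary sum is the wider of the two, because both of its products can reach their largest positive value at the same time.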

Design Example

A design example was created in Quartus Prime Standard and Quartus Prime Pro to show the results from three different implementations of a complex multiplier in Arria 10.

The Quartus Prime Standard 18.0 Build 614 example:

DSP Complex Mult A10 18 0 0 614 Std.qar - See attached at the bottom of this article

The Quartus Prime Pro 18.0 Build 219 example:

DSP Complex Mult A10 18 0 0 219 Pro.qar - See attached at the bottom of this article

Pro uses a different algorithm for packing pipeline registers that are connected to top-level ports. The Pro design therefore has extra pipeline stages in the top level for the directly instantiated DSP IP (the third implementation), and it applies the following QSF assignment, embedded as a Verilog attribute, to the top-level pipeline registers:

(* altera_attribute = {" -name SYNCHRONIZER_IDENTIFICATION OFF "} *)

Implementation 1, altmult_complex IP

Implementation 1 of a complex multiplier uses the altmult_complex IP parameterized as follows:

Altmult_complex_IP_parameter_editor.JPG

Implementation 2, using DSP RTL template

Quartus provides many RTL language templates that can be inserted into a design. The templates range from state machines to RAM storage elements and arithmetic operations. With an RTL file open in Quartus, all the template options are available from the “Edit -> Insert Template” menu.

Insert_Template.JPG

In some cases, an existing template can be adjusted to match the functionality needed in the design. After reviewing all the 20nm (Arria 10 FPGA family) DSP features, it was found that the “M18x19_sumof2 with Dynamic Sub and Dynamic Negate” template most closely matched a complex multiplier.

Insert_M18x19_sumof2_w_addsub.JPG

The template was inserted into a new RTL file and then saved as m18x19_sum_of_2_full_regs_dynSub_dynNegate.v for Quartus Standard and m18x19_sum_of_2_full_regs_dynSub_dynNegate_Pro.v for Quartus Pro.

The file m18x19_sum_of_2_full_regs_dynSub_dynNegate.v was then copied to m18x19_sum_of_2_full_regs_complex.v so that it could be edited to exactly match the complex multiplier being implemented.

The m18x19_sum_of_2_full_regs_complex.v file was edited to remove the last dynamic add/negate stage and to add a second output, s2_output_reg, so that the module produces both the real and imaginary results.

A tkdiff or diff will show the differences between m18x19_sum_of_2_full_regs_complex.v and m18x19_sum_of_2_full_regs_dynSub_dynNegate.v for a better understanding.

Implementation 3, direct Native Fixed-Point DSP Intel Arria 10 FPGA IP

The third implementation of a complex multiplier in Arria 10 was done by directly parameterizing the Native Fixed-Point DSP Intel Arria 10 FPGA IP block and instantiating the generated IP once for the real portion and once for the imaginary portion of the arithmetic. The first tab of the IP was parameterized as follows:

A10_Native_Fixed_Point_DSP_IP_Parameter.JPG

The IP can be downloaded as part of the qar file to see the remaining tabs. The same IP was used for both the real and imaginary arithmetic. Note that the “sub” port is set to 1 for the subtraction used in the real calculation and to 0 for the addition used in the imaginary calculation.

Test_A10_Native_Fixed_Point_DSP Test_A10_Native_Fixed_Point_DSP_real (
.ay (dataa_imag2), // ay.ay
.ax (datab_imag2), // ax.ax
.by (dataa_real2), // by.by
.bx (datab_real2), // bx.bx
.resulta (result_real2_pre), // resulta.resulta
.clk (mult_clock2), // clk.clk
.ena (1'b1), // ena.ena
.aclr (reset), // aclr.aclr
.sub (1'b1) // set to subtract
);
Test_A10_Native_Fixed_Point_DSP Test_A10_Native_Fixed_Point_DSP_imag (
.ay (dataa_real2), // ay.ay
.ax (datab_imag2), // ax.ax
.by (datab_real2), // by.by
.bx (dataa_imag2), // bx.bx
.resulta (result_imag2_pre), // resulta.resulta
.clk (mult_clock2), // clk.clk
.ena (1'b1), // ena.ena
.aclr (reset), // aclr.aclr
.sub (1'b0) // set to add
);
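
The wiring of the two instances can be sanity-checked with a small behavioral model. The sketch below assumes, based only on the port connections shown above, that each instance computes bx*by - ax*ay when sub is 1 and bx*by + ax*ay when sub is 0. The function names are hypothetical, the signal names drop the “2” suffix used in the example design, and pipeline registers, clocking and bit widths are not modeled.

```python
import random


def dsp_sum_of_2(ax, ay, bx, by, sub):
    """Assumed sum-of-two behavior: bx*by - ax*ay if sub else bx*by + ax*ay."""
    top = ax * ay
    bottom = bx * by
    return bottom - top if sub else bottom + top


def complex_mult_via_two_dsps(dataa_real, dataa_imag, datab_real, datab_imag):
    """Mirror the port connections of the _real and _imag instances."""
    result_real = dsp_sum_of_2(ax=datab_imag, ay=dataa_imag,
                               bx=datab_real, by=dataa_real, sub=1)
    result_imag = dsp_sum_of_2(ax=datab_imag, ay=dataa_real,
                               bx=dataa_imag, by=datab_real, sub=0)
    return result_real, result_imag


# Compare against Python's complex arithmetic on small random integers
random.seed(0)
mismatches = 0
for _ in range(1000):
    ar, ai, br, bi = (random.randint(-1000, 1000) for _ in range(4))
    expected = complex(ar, ai) * complex(br, bi)
    if complex_mult_via_two_dsps(ar, ai, br, bi) != (expected.real, expected.imag):
        mismatches += 1
print("mismatches:", mismatches)
```

With this mapping the real instance produces dataa_real*datab_real - dataa_imag*datab_imag and the imaginary instance produces dataa_imag*datab_real + dataa_real*datab_imag, which matches (ac – bd) + (ad + bc)i.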

Results of implementation 1, altmult_complex IP

The example was compiled and the results for altmult_complex were analyzed. The altmult_complex implementation of the complex multiplier used 3 DSP blocks, 72 LUTs, and 288 core registers. Fmax was reported as 206.48 MHz in a mid-speed grade device.

Chip Planner shows utilization as follows:

mik_Intel_0-1594324552857.png

The resource property viewer shows that the DSP blocks are not being packed well.

mik_Intel_1-1594324600449.png

A complex multiplier built with altmult_complex in Arria 10 is probably acceptable up to ~200 MHz, provided that plenty of core fabric resources are also available.

Fmax can be improved by configuring altmult_complex for 3 stages of output latency and then manually adding an input pipeline stage. This allows Quartus to pull the first stage of pipeline registers into the DSP block, which can raise fmax to ~350 MHz.

Results of implementation 2, using DSP RTL template

The example was compiled and the results for the m18x19_sum_of_2_full_regs_complex RTL code were analyzed. The m18x19_sum_of_2_full_regs_complex implementation of the complex multiplier used 2 DSP blocks and 2 core registers. Fmax was reported as 408.66 MHz in a mid-speed grade device.

Chip Planner shows utilization as follows:

mik_Intel_2-1594324695637.png

Almost the entire complex multiplier was implemented in two DSP blocks. However, the bits used to turn on the add operation on the real portion of the complex multiplier were registered outside the DSP block.

Running a complex multiplier in Arria 10 using the m18x19_sum_of_2_full_regs_complex DSP template yields good results from an fmax and DSP hardened resource packing perspective.

Results of implementation 3, using direct Native Fixed-Point DSP Intel Arria 10 FPGA IP

The example was compiled and the results for the directly instantiated Native Fixed-Point DSP Intel Arria 10 FPGA IP were analyzed. This implementation of the complex multiplier used 2 DSP blocks and no other fabric resources. Fmax was reported as 458.72 MHz in a mid-speed grade device.

Chip Planner shows utilization as follows:

mik_Intel_3-1594324769300.png

The entire complex multiplier was implemented in two DSP blocks.

The resource property viewer shows that the DSP blocks are packed well.

mik_Intel_0-1594325545102.png

Running a complex multiplier using the directly instantiated Native Fixed-Point DSP Intel Arria 10 FPGA IP yielded the best results from an fmax and DSP hardened resource packing perspective.

Conclusion

A complex multiplier can be implemented in a number of different ways in FPGAs. In Arria 10, three implementations were shown, each with different performance and core logic utilization. In general, it is best practice to review the hardened resources being targeted in the FPGA to make sure that the logic is packed as expected, which saves core logic resources. In the example provided for this article, directly instantiating the Native Fixed-Point DSP Intel Arria 10 FPGA IP gave the best results, both in fmax and in the ability of Quartus to pack the complex multiplier into the hardened Arria 10 DSP blocks. Other DSP functions may not yield the same results, so it is best to experiment with different implementations until the requirements are met.

Note: All three implementations in the example design were simulated and had matching results to verify functionality.

 

Attachments
Last update: 07-09-2020 01:40 PM