Solved: SIMD using DSP in Stratix 10

jjxichn · ‎04-12-2025

Hello all, I am new to Stratix 10. Does Altera's DSP block have a SIMD feature like Xilinx's? By SIMD I mean that one DSP performs 4 parallel additions. I checked the DSP features in the Verilog templates but unfortunately did not see it. If it exists, is there a way to implement it, or a specific coding style? Thanks

KennyTan_Altera · ‎04-15-2025

Unlike Xilinx DSP48E1 slices, Altera DSP blocks are primarily optimized for multiply-accumulate (MAC) operations; SIMD-style packed adders are not exposed as directly or as flexibly as they are in Xilinx's DSP48E1 slices.

Can you see if the code below helps?

Verilog RTL — 4×12-bit Parallel Adder (Using ALM Carry Chains)

module parallel_4x12b_adder (
    input  [11:0] a0, a1, a2, a3,        // 4× 12-bit inputs
    input  [11:0] b0, b1, b2, b3,        // 4× 12-bit inputs
    output [12:0] sum0, sum1, sum2, sum3 // 4× 13-bit outputs to handle carry
);
    assign sum0 = a0 + b0;
    assign sum1 = a1 + b1;
    assign sum2 = a2 + b2;
    assign sum3 = a3 + b3;
endmodule

Explanation:

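Each assign statement is an independent adder, which the Quartus synthesizer maps to the dedicated carry chains in the ALMs.

Outputs are 13 bits wide to hold the carry-out of each 12-bit addition.

This is efficient because the adders use the ALMs' dedicated carry logic rather than general-purpose LUT logic. No DSP block is involved.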

If You Want to Pack into a 48-bit DSP-friendly Add (More Complex)

If you really wanted to try packing them and using a DSP block (assuming 48-bit add support, e.g., Arria 10/Stratix 10), here’s a conceptual version:

module packed_4x12b_adder (
    input  [47:0] a_packed,  // 4× 12-bit packed inputs
    input  [47:0] b_packed,  // 4× 12-bit packed inputs
    output [47:0] sum_packed // 4× 12-bit packed results (overflow risk!)
);
    assign sum_packed = a_packed + b_packed;
endmodule

Notes:

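You'd need to align each 12-bit value properly in the 48-bit word.

This is a single 48-bit addition, so if a lane's sum exceeds 12 bits, its carry ripples into the next lane and corrupts that lane's result.

Masking after the add does not undo that on its own; the inputs must be constrained so no lane overflows, or guard bits must be left between lanes.

This is risky since Intel DSPs don't natively split this into SIMD lanes like Xilinx. Quartus might implement this in ALM logic anyway.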
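To make the lane-carry concern concrete, here is a minimal sketch of a packed add that keeps the lanes independent by widening each lane with one zero guard bit. The module name and the 52-bit packing are illustrative assumptions, not a Quartus template, and nothing here implies that Quartus will place the adder in a DSP block.

```verilog
// Sketch only: four 12-bit lanes packed into one wide add, with one guard
// bit per lane so a lane's carry cannot ripple into its neighbour.
// Module and port names are hypothetical.
module packed_4x12b_adder_guarded (
    input  [11:0] a0, a1, a2, a3,
    input  [11:0] b0, b1, b2, b3,
    output [12:0] sum0, sum1, sum2, sum3   // 12-bit sum plus carry-out
);
    // Each lane is widened to 13 bits; the zero in the top bit is the guard.
    wire [51:0] a_packed = {1'b0, a3, 1'b0, a2, 1'b0, a1, 1'b0, a0};
    wire [51:0] b_packed = {1'b0, b3, 1'b0, b2, 1'b0, b1, 1'b0, b0};

    // One 52-bit addition; the guard bit absorbs each lane's carry-out.
    wire [51:0] s_packed = a_packed + b_packed;

    assign sum0 = s_packed[12:0];
    assign sum1 = s_packed[25:13];
    assign sum2 = s_packed[38:26];
    assign sum3 = s_packed[51:39];
endmodule
```

With guard bits, four 12-bit lanes need 52 bits, which no longer fits a 48-bit word; staying within 48 bits would mean narrowing each lane to 11 data bits plus a guard bit.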

Recommendation

✔ Use the first version: Quartus maps those independent 12-bit additions onto the ALMs' fast carry chains, which is efficient and wastes no LUTs.

✔ Avoid the packed 48-bit add unless you're certain the device and toolchain will map it safely into a DSP block.

Let me know if the above helps to some extent. If not, we may have to leave this for a future enhancement in Quartus.


KennyTan_Altera · ‎04-13-2025

I checked our user guide, and we do not have this SIMD mode. May I know your use case? I may be able to request a feature enhancement for the future.

jjxichn · ‎04-14-2025

In Xilinx, one DSP can perform 4 parallel additions, from 10 bits all the way up to 14 bits. This helps reduce the number of LUTs used. I am not sure whether there is any better way to implement additions on the Altera platform than using LUTs.

KennyTan_Altera · ‎04-16-2025

Do you have any further questions?

jjxichn · ‎04-19-2025

Hello Kenny,

I tried the code snippet you provided, and it seems Quartus is not using a DSP block to implement it. I tried changing the bit width of the inputs and outputs, and the result does not change.

module parallel_4x12b_adder (

input [11:0] a0, a1, a2, a3, // 4× 12-bit inputs

..

assign sum3 = a3 + b3;

endmodule

Another question: how can I check whether the adder is implemented with the carry chain or with LUTs in the ALM? In the Resource Usage Summary, I only see the number of ALMs needed. It does not specify whether they are used as LUTs or as carry chains.

Thanks,

KennyTan_Altera · ‎04-20-2025

Hi,

Sorry, the simplified sample will not work because no clock is fed into a register.

To get the full implementation, right-click the Verilog (.v) file -> insert template -> verilog HDL -> full design -> arithmetic -> DSP feature.

Pick the template that matches your device.

To check whether the carry chain was used, go to Chip Planner -> Resource property editor; you will be able to check from there.
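As a rough illustration of the clocking point above, a registered variant of the four-adder module might look like the sketch below. The module name and the two-stage register arrangement are assumptions made for illustration; this is not one of the Quartus templates, and adding registers does not by itself guarantee that Quartus will move plain additions into a DSP block.

```verilog
// Sketch only: registered version of the 4x12-bit parallel adder.
// Module name and pipeline depth are hypothetical choices.
module parallel_4x12b_adder_reg (
    input             clk,
    input      [11:0] a0, a1, a2, a3,
    input      [11:0] b0, b1, b2, b3,
    output reg [12:0] sum0, sum1, sum2, sum3
);
    // Input registers
    reg [11:0] a0_r, a1_r, a2_r, a3_r;
    reg [11:0] b0_r, b1_r, b2_r, b3_r;

    always @(posedge clk) begin
        a0_r <= a0;  a1_r <= a1;  a2_r <= a2;  a3_r <= a3;
        b0_r <= b0;  b1_r <= b1;  b2_r <= b2;  b3_r <= b3;

        // Output registers; the 13-bit targets keep each carry-out.
        sum0 <= a0_r + b0_r;
        sum1 <= a1_r + b1_r;
        sum2 <= a2_r + b2_r;
        sum3 <= a3_r + b3_r;
    end
endmodule
```

Each sum appears two clock edges after its inputs are presented. Where the adders actually land still has to be confirmed in the fitter reports or in Chip Planner.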

jjxichn · ‎04-21-2025

Hi,

I have checked the Verilog templates for DSP features. The attached file shows all of the DSP feature templates, but I did not see SIMD or anything described similarly. Can you please point me to the right ones?

KennyTan_Altera · ‎04-22-2025

After a detailed analysis, it’s clear that Xilinx offers a distinct advantage in supporting SIMD-style operations through its DSP slices (such as the DSP48E1), which allow for parallel processing of multiple narrow-width additions—for example, four 12-bit additions per clock cycle. This is made possible by internal ALU partitioning specifically optimized for such use cases.

In contrast, Intel (Altera) DSP blocks—such as those in the Stratix 10 and Arria 10 families—do not provide a native SIMD adder mode. Intel's DSP architecture is instead focused on:

18x19 or 27x27 multipliers
Optional pre-adders
Accumulators and chainout logic
Dynamic add/subtract/negate features

Because of this, attempting to implement Xilinx-style 4-lane SIMD additions (e.g., 4 parallel 10–14-bit adders) on Intel FPGAs requires alternative approaches:

Soft logic (LUTs): Quartus can infer these adders from RTL, but this approach uses more logic resources and may impact performance.
DSP blocks using workaround modes, such as the plus36 mode (Templates 9 and 10), where multiplier outputs are combined using DSP-based post-adders. These methods approximate SIMD behavior only when combined with dummy or fixed multipliers; they do not provide standalone parallel additions.

Among the provided DSP templates:

Template 9 allows combining two multipliers with additional operands using dynamic add/sub/negate. Although not a direct SIMD implementation, it can be adapted by setting multiplier inputs to constants (e.g., 1) and injecting other values as summands.
Template 10 extends this with accumulation and preload support, enabling a pseudo-SIMD operation over multiple cycles.
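The constant-multiplier idea described for Template 9 can be sketched roughly as follows. This is a hypothetical illustration written from the description above, not the Quartus template itself: the module name, port widths, and the `ONE` constant are assumptions, and synthesis may simplify a multiply by a literal 1 back into a plain adder, so whether a DSP block is actually used has to be checked in the fitter report.

```verilog
// Sketch only: sum of two products with both coefficients fixed to 1,
// so the post-adder effectively computes x0 + x1.
// Names and widths are hypothetical, not taken from a Quartus template.
module dsp_sum_of_two_sketch (
    input                    clk,
    input      signed [17:0] x0, x1,
    output reg signed [36:0] y
);
    // Fixed "dummy" multiplier coefficient (assumption for illustration).
    localparam signed [18:0] ONE = 19'sd1;

    reg signed [17:0] x0_r, x1_r;

    always @(posedge clk) begin
        x0_r <= x0;
        x1_r <= x1;
        // Two 18x19 products combined by one adder.
        y <= x0_r * ONE + x1_r * ONE;
    end
endmodule
```

Note that this yields one addition per pair of multipliers, not four independent lanes, which is why the thread calls it an approximation rather than true SIMD.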

When exploring the Quartus IP Catalog, there is no current DSP IP that directly implements SIMD-style packed addition (e.g., 4x 12-bit additions per clock) similar to Xilinx. The closest configuration available is the dual 18x19 with post-add/subtract, but this still does not offer true SIMD parallelism within a single DSP block.

Given these limitations, we believe this is a candidate for an enhancement request in a future release of Quartus. Adding native SIMD-style addition support would significantly improve resource efficiency for common signal processing applications and enable better competitiveness.

To better support this request, we would appreciate your input on the specific application use case, including the target algorithm or workload, expected performance improvements, and the resource constraints currently being faced. This information will help justify the feature request and demonstrate the practical impact such an enhancement would provide.

jjxichn · ‎04-24-2025

Thank you for the detailed and insightful comparison between Xilinx and Intel DSP architectures regarding SIMD-style operations.

As you said, Altera lacks the SIMD feature that Xilinx has, but I can use Template 9 or Template 10 as a workaround. This is good enough to fix my current issue.

Thanks again for your deep analysis and detailed reply.

KennyTan_Altera · ‎04-24-2025

I'm glad that your question has been addressed. I will now transition this thread to community support. If you have a new question, please log in to ‘https://supporttickets.intel.com/s/?language=en_US’, view the details of the desired request, and post a feed/response within the next 15 days to allow me to continue to support you. After 15 days, this thread will be transitioned to community support, and community users will be able to help with your follow-up questions.
