
I need a scheme to max out the DSP blocks on a Stratix 10

Jacob11
New Contributor II
524 Views

Hello guys,

 

I am working on a design that uses all 3000 DSP blocks on a Stratix 10 FPGA. I thought the best solution would be to instantiate the Intel Native Fixed Point DSP IP as a multiplier:

 

https://www.intel.com/content/www/us/en/docs/programmable/683450/current/native-fixed-point-dsp-intel-stratix-51840.html

 

I am using a generate for loop to instantiate 3000 of these DSPs. The issue is that Quartus optimizes the blocks away unless I connect each of the 3000 DSPs directly to a top-level output.

I do not have enough I/O on the board to accommodate 3000 outputs, so I tried leaving the output signals unassigned to pins. Quartus then gives an error and will not compile the design, because it asks for more I/O than the device has.

So I tried to use combinational logic to do something like:

output = DSP_out[0] || DSP_out[1] || DSP_out[2]

This actually does work, but it only synthesizes 3 DSP blocks, so I would need one long combinational statement that includes all 3000 DSP_out signals. I also tried building it in a for loop, and that did not work.
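
One way to make every DSP output observable without writing a 3000-term expression is to flatten the result array into a single vector and collapse it with a reduction operator. The following is a minimal sketch that does not come from the thread: the module name xor_collapse, its parameters, and the flat results_flat bus are hypothetical. It is also an assumption, not verified here, that Quartus would then keep every block, particularly when all blocks are driven by identical inputs.

```verilog
// Hypothetical helper: collapses NUM results of W bits each into one
// registered output bit, so a single pin depends on every result bit.
module xor_collapse #(
    parameter NUM = 8,   // number of DSP results (3000 in the post)
    parameter W   = 37   // width of each result
) (
    input  wire             clk,
    input  wire [NUM*W-1:0] results_flat, // all results, concatenated
    output reg              out
);
    // Reduction XOR: 'out' is the parity of every bit in results_flat.
    always @(posedge clk)
        out <= ^results_flat;
endmodule
```

In the posted design, each 37-bit result could be copied into the flat bus inside the existing generate loop with an indexed part-select such as results_flat[i*37 +: 37].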

 

Here is my full code as it stands. I just need a way for all 3000 of the DSP blocks to synthesize. I hope my question makes sense.

 

`timescale 1ps/1ps
`default_nettype none

module power_test_design (
    input  wire         clk_i,
    output wire [1-1:0] outputa,
    output reg  [1-1:0] outputb
);
    localparam NUM_DSP_BLOCKS = 3000;

    genvar i;
    wire reset;
    integer k;

    reg [17:0] ay_r;
    reg [17:0] by_r;
    reg [17:0] ax_r;
    reg [17:0] bx_r;
    (* keep = "true" *) wire [36:0] resulta   [NUM_DSP_BLOCKS-1:0];
    (* keep = "true" *) reg  [36:0] resulta_r [NUM_DSP_BLOCKS-1:0];
    (* keep = "true" *) wire [36:0] resultb   [NUM_DSP_BLOCKS-1:0];
    (* keep = "true" *) reg  [36:0] resultb_r [NUM_DSP_BLOCKS-1:0];
    reg [2:0] ena_r;

    // Stratix10 system reset
    reset_release U_RESET (
        .ninit_done (reset) // output, width = 1, ninit_done.ninit_done
    );

    // DSP stimulus signals
    always @(posedge clk_i) begin : DSP_SET_FF
        if (reset) begin
            ay_r  <= {18{1'b0}};
            by_r  <= {18{1'b0}};
            ax_r  <= {18{1'b0}};
            bx_r  <= {18{1'b0}};
            ena_r <= {3{1'b0}};
        end else begin
            ena_r <= 3'b001;

            ay_r <= $unsigned(ay_r) + 1;
            by_r <= $unsigned(by_r) + 1;
            ax_r <= $unsigned(ax_r) + 2;
            bx_r <= $unsigned(bx_r) + 3;
        end
    end

    generate
        for (i=0; i<NUM_DSP_BLOCKS; i=i+1) begin : GEN_DSPS

            dsp_fixed U_DSP (
                .ay      (ay_r),       // input, width = 18, ay.ay
                .by      (by_r),       // input, width = 18, by.by
                .ax      (ax_r),       // input, width = 18, ax.ax
                .bx      (bx_r),       // input, width = 18, bx.bx
                .resulta (resulta[i]), // output, width = 37, resulta.resulta
                .resultb (resultb[i]), // output, width = 37, resultb.resultb
                .clk0    (clk_i),      // input, width = 1, clk0.clk
                .clk1    (),           // input, width = 1, clk1.clk not used
                .clk2    (),           // input, width = 1, clk2.clk not used
                .ena     (ena_r)       // input, width = 3, ena.ena
            );
            //assign output of DSP block to register
            assign resulta_r[i] = resulta[i];
            assign resultb_r[i] = resultb[i];

        end
    endgenerate

    assign outputa = resulta_r[0] || resulta_r[1] || resulta_r[2];

endmodule
`resetall


9 Replies
Nurina
Employee
495 Views

Hello,


You could leave the DSP's at top level and use virtual pin assignment to fix the I/O problem.

https://www.intel.com/content/www/us/en/docs/programmable/683641/21-3/virtual-pins.html
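
As a rough illustration of the virtual pin idea, the assignment can also be attached in the HDL source instead of the Assignment Editor. This sketch assumes that the altera_attribute synthesis attribute accepts the VIRTUAL_PIN assignment on a port in your Quartus version, which should be checked against the page linked above. The module name virtual_pin_sketch, the width N, and the Johnson-counter filler logic are hypothetical and only stand in for the DSP outputs.

```verilog
// Hypothetical top level: a wide output bus marked as virtual pins so
// it does not consume physical I/O. The attribute string is an
// assumption to verify against the Quartus documentation.
module virtual_pin_sketch #(
    parameter N = 175   // bus width, matching NUM_DSP_BLOCKS in the thread
) (
    input  wire         clk_i,
    (* altera_attribute = "-name VIRTUAL_PIN ON" *)
    output reg  [N-1:0] output0
);
    initial output0 = {N{1'b0}};

    // Johnson counter as placeholder logic, so the bus is not constant.
    always @(posedge clk_i)
        output0 <= {output0[N-2:0], ~output0[N-1]};
endmodule
```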


Regards,

Nurina


Jacob11
New Contributor II
485 Views

Hello Nurina,

 

The output issue is solved by this trick, but I am now bottlenecked at 468 DSP blocks.

 

I also tried chaining a lot of DSP blocks together (output from block1 as input of block2). This also works until I reach 468 DSP blocks. Then the place and route process fails with the message:

 

Error(184036): Cannot place the following 57 DSP cells -- a legal placement which satisfies all the DSP requirements could not be found
Info(184037): Node "GEN_DSP_2[0].U_DSP|s10_native_fixed_point_dsp_0|fourteennm_mac_component"
Info(184037): Node "GEN_DSP_2[132].U_DSP|s10_native_fixed_point_dsp_0|fourteennm_mac_component"
Info(184037): Node "GEN_DSP_2[133].U_DSP|s10_native_fixed_point_dsp_0|fourteennm_mac_component"
Info(184037): Node "GEN_DSP_2[134].U_DSP|s10_native_fixed_point_dsp_0|fourteennm_mac_component"
Info(184037): Node "GEN_DSP_2[135].U_DSP|s10_native_fixed_point_dsp_0|fourteennm_mac_component"
Info(184037): Node "GEN_DSP_2[136].U_DSP|s10_native_fixed_point_dsp_0|fourteennm_mac_component"
Info(184037): Node "GEN_DSP_2[137].U_DSP|s10_native_fixed_point_dsp_0|fourteennm_mac_component"
Info(184037): Node "GEN_DSP_2[138].U_DSP|s10_native_fixed_point_dsp_0|fourteennm_mac_component"
Info(184037): Node "GEN_DSP_2[139].U_DSP|s10_native_fixed_point_dsp_0|fourteennm_mac_component"
Info(184037): Node "GEN_DSP_2[140].U_DSP|s10_native_fixed_point_dsp_0|fourteennm_mac_component"
Info(18798): And 47 more similar nodes (full list omitted for brevity)

 

Here are the important parts of the updated code:

 

module power_test_design (
    input  wire                      clk_i,
    output wire [NUM_DSP_BLOCKS-1:0] output0,
    output wire [NUM_DSP_BLOCKS-1:0] output1,
    output wire [NUM_DSP_BLOCKS-1:0] output2
);

    localparam NUM_DSP_BLOCKS = 175;

    wire [36:0] result0_a [NUM_DSP_BLOCKS-1:0];
    wire [36:0] result0_b [NUM_DSP_BLOCKS-1:0];
    wire [36:0] result1_a [NUM_DSP_BLOCKS-1:0];
    wire [36:0] result1_b [NUM_DSP_BLOCKS-1:0];
    wire [36:0] result2_a [NUM_DSP_BLOCKS-1:0];
    wire [36:0] result2_b [NUM_DSP_BLOCKS-1:0];

    // Stage 0 of the DSP chain: driven by the stimulus registers
    generate
        for (i=0; i<NUM_DSP_BLOCKS; i=i+1) begin : GEN_DSP_0

            dsp_fixed U_DSP (
                .ay      ($unsigned(ay_r)), // input, width = 18, ay.ay
                .by      ($unsigned(by_r)), // input, width = 18, by.by
                .ax      ($unsigned(ax_r)), // input, width = 18, ax.ax
                .bx      ($unsigned(bx_r)), // input, width = 18, bx.bx
                .resulta (result0_a[i]),    // output, width = 37, resulta.resulta
                .resultb (result0_b[i]),    // output, width = 37, resultb.resultb
                .clk0    (clk_i),           // input, width = 1, clk0.clk
                .clk1    (),                // input, width = 1, clk1.clk not used
                .clk2    (),                // input, width = 1, clk2.clk not used
                .ena     (ena_r)            // input, width = 3, ena.ena
            );
            assign output0[i] = result0_a[i][0];
        end
    endgenerate

    // Stage 1: driven by the low 18 bits of the stage 0 results
    generate
        for (i=0; i<NUM_DSP_BLOCKS; i=i+1) begin : GEN_DSP_1

            dsp_fixed U_DSP (
                .ay      (result0_a[i][17:0]), // input, width = 18, ay.ay
                .by      (result0_b[i][17:0]), // input, width = 18, by.by
                .ax      (result0_a[i][17:0]), // input, width = 18, ax.ax
                .bx      (result0_b[i][17:0]), // input, width = 18, bx.bx
                .resulta (result1_a[i]),       // output, width = 37, resulta.resulta
                .resultb (result1_b[i]),       // output, width = 37, resultb.resultb
                .clk0    (clk_i),              // input, width = 1, clk0.clk
                .clk1    (),                   // input, width = 1, clk1.clk not used
                .clk2    (),                   // input, width = 1, clk2.clk not used
                .ena     (ena_r)               // input, width = 3, ena.ena
            );

            assign output1[i] = result1_a[i][0];
        end
    endgenerate

    // Stage 2: driven by the low 18 bits of the stage 1 results
    generate
        for (i=0; i<NUM_DSP_BLOCKS; i=i+1) begin : GEN_DSP_2

            dsp_fixed U_DSP (
                .ay      (result1_a[i][17:0]), // input, width = 18, ay.ay
                .by      (result1_a[i][17:0]), // input, width = 18, by.by
                .ax      (result1_b[i][17:0]), // input, width = 18, ax.ax
                .bx      (result1_b[i][17:0]), // input, width = 18, bx.bx
                .resulta (result2_a[i]),       // output, width = 37, resulta.resulta
                .resultb (result2_b[i]),       // output, width = 37, resultb.resultb
                .clk0    (clk_i),              // input, width = 1, clk0.clk
                .clk1    (),                   // input, width = 1, clk1.clk not used
                .clk2    (),                   // input, width = 1, clk2.clk not used
                .ena     (ena_r)               // input, width = 3, ena.ena
            );
            assign output2[i] = result2_a[i][0];
        end
    endgenerate
  
 
 
Then in the Assignment Editor it looks like this:
[Screenshot: assign.png]

Unfortunately, I still get the error message:
Error(184036): Cannot place the following 57 DSP cells -- a legal placement which satisfies all the DSP requirements could not be found

If I lower the NUM_DSP_BLOCKS localparam until the total number of DSP blocks is less than 468, the project compiles successfully through to the Assembler stage. If the total is more than 468, I get this error message no matter how I configure the design. It synthesizes fine, but fails place and route.

I have tried one big generate of 500 DSP blocks, and I have tried chaining together 20 generates of only 30 DSP blocks each. All of it bottlenecks at 468. There is some kind of limitation here, but the error message is not meaningful enough for me to understand it.
 
Any other ideas??
 
Thanks,
Jacob
 
Nurina
Employee
462 Views

Hi Jacob,


What device are you targeting? Can you give the OPN number?

Also, which Quartus version are you using?


Regards,

Nurina


Jacob11
New Contributor II
452 Views

Good morning Nurina.

 

We are using a Stratix 10 1SM21BEU2F55E2VG on a board of our own design.

 

Thanks,

Jacob

Jacob11
New Contributor II
442 Views

And Quartus Prime Pro 21.4.

Nurina
Employee
402 Views

Hi,


Sorry for the late response, can you share your .qar file? To do this, go to Project>Archive Project.


Regards,

Nurina


Jacob11
New Contributor II
395 Views

Hello Nurina,

 

Attached is the .qar file.

 

I have commented out the last 2 DSP blocks in the chain to give a total DSP load of 450 blocks. This design builds fine (although it fails timing).

 

I recommend that you start with this build, and then comment out lines 499-502 and uncomment the block from lines 504-554. This will bring the total DSP count to 510 and the build will fail.

 

I cannot figure out a way around this 468 bottleneck. I even tried instantiating another IP, the floating point DSP IP, to split the load. No matter how I combined the fixed point and floating point IPs, I could not get more than 468 DSP blocks.

 

Thanks

Jacob

Jacob11
New Contributor II
367 Views

SOLVED!!!!

 

I was assigning the output registers of each DSP block incorrectly.

 

What I had:

 


assign result1_a_r[i] = result1_a[i];
assign result1_b_r[i] = result1_b[i];

 

What it should be:

 

always @(posedge clk_i) begin
  result1_a_r[i] <= result1_a[i];
  result1_b_r[i] <= result1_b[i];
end
 
I replaced this code in each of the chained DSP blocks and now I have it running up to 3000 DSP blocks.
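
For readers who want to see the fix in context, here is a minimal sketch of two chained stages with a clocked register between them. It assumes a stand-in multiplier: mult_stub is a hypothetical replacement for the generated dsp_fixed IP, and the module name reg_chain_sketch, the port names, and the parameter NUM are likewise invented for illustration. It shows only the coding pattern and makes no claim about why the placement limit went away.

```verilog
// Hypothetical stand-in for the dsp_fixed IP: one registered 18x18 multiply.
module mult_stub (
    input  wire        clk,
    input  wire [17:0] ay,
    input  wire [17:0] ax,
    output reg  [36:0] resulta
);
    always @(posedge clk)
        resulta <= ay * ax;
endmodule

module reg_chain_sketch #(
    parameter NUM = 4   // chains to generate (kept small for the sketch)
) (
    input  wire           clk_i,
    input  wire [17:0]    a_i,
    input  wire [17:0]    b_i,
    output wire [NUM-1:0] bit_o
);
    wire [36:0] stage0   [NUM-1:0];
    reg  [36:0] stage0_r [NUM-1:0];  // reg array written in a clocked block
    wire [36:0] stage1   [NUM-1:0];

    genvar i;
    generate
        for (i = 0; i < NUM; i = i + 1) begin : GEN_CHAIN
            mult_stub U_S0 (.clk(clk_i), .ay(a_i), .ax(b_i),
                            .resulta(stage0[i]));

            // A reg is assigned in an always block, not with 'assign'.
            always @(posedge clk_i)
                stage0_r[i] <= stage0[i];

            mult_stub U_S1 (.clk(clk_i),
                            .ay(stage0_r[i][17:0]),
                            .ax(stage0_r[i][35:18]),
                            .resulta(stage1[i]));

            // One observable bit per chain (parity of the final result).
            assign bit_o[i] = ^stage1[i];
        end
    endgenerate
endmodule
```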
 
Thanks
Jacob
Nurina
Employee
335 Views

Hi Jacob,


I'm glad that your problem has been solved. I will now transition this thread to community support. If you have a new question, please log in to 'https://supporttickets.intel.com', view the details of the desired request, and post a response within the next 15 days to allow me to continue to support you. After 15 days, this thread will be transitioned to community support, where community users will be able to help you with follow-up questions.


P.S. If any answer from the community or Intel support is helpful, please feel free to mark it as a solution, give Kudos, and rate 4/5 on the survey.


Regards,

Nurina

