Hi,
While experimenting with a Terasic DE3 board I ran into a problem I cannot solve on my own, so I kindly ask this forum for advice.

I have a small Verilog module that reads N words from N infinite vectors v_1,...,v_N, and I need to compute all possible s_{i,j,k} = v_i^T D P^k v_j, where P is the permutation matrix that shifts a vector down by one entry and D is the diagonal matrix with (d_1, d_2, ...) on the diagonal, where d_1=1, d_2=(1-2^{-m}), d_3=(1-2^{-m})^2, ... In other words, I am implementing something similar to stable IIR filters. I pipeline the input data arriving from each vector (InDataA, InDataB; for simplicity I take an example with N=2), compute all products, and store them in the result outputs (ScalAA, ScalAB, ScalBA, ScalBB).

When I put this module into the standard Terasic DE3 environment, I get two issues that I cannot resolve:
1. All my data are reg signed [13:0], so one multiplication fits into an 18x18-bit multiplier. I am doing massively parallel multiplications and hope to use the so-called "Four Multiplier Adder Mode" as described in the Stratix III Device Handbook, Volume 1, but I cannot understand how to implement it. I urgently need it, otherwise I will run out of resources on my DE3 board.
2. The timing of this module is not very good: I achieve only 260-310MHz, whereas in the "Four Multiplier Adder Mode" I should achieve 600MHz. I need this as well, because in my design I expect input data rates of 400, 500 and probably 600 MHz.

My module is below. Please advise me:
1. how to switch the four multiplier adder mode on, and
2. how to improve the performance.

Thank you! Ilghiz
module DATA_Aq(InDataClkA, InDataA, InDataB, OutData);
// Bit ranges and array indices below are reconstructed. The 14-bit data width is
// stated in the post; ACCW, the counter width and the [0] index in the case
// block are assumptions.
parameter NBUF=16; // the maximum possible shift in the design; I should be able to run it with:
// 1) A,B,...,H = 8 channels and NBUF=6, or
// 2) A,B,C,D = 4 channels and NBUF=16, so the two designs
// need 384 or 256 18x18 multipliers in Four Multiplier Adder Mode
parameter UpdateSpeed=12;
parameter ACCW=40; // assumed accumulator width
input InDataClkA;
input signed [13:0] InDataA;
input signed [13:0] InDataB;
output reg signed [ACCW-1:0] OutData; // artificial output that keeps Quartus from optimizing away the main computation
// Memory Declaration
reg signed [13:0] DataA [0:NBUF-1];
reg signed [13:0] DataB [0:NBUF-1];
reg signed [ACCW-1:0] ScalAA [0:NBUF-1], Scal1AA [0:NBUF-1], Scal2AA [0:NBUF-1];
reg signed [ACCW-1:0] ScalAB [0:NBUF-1], Scal1AB [0:NBUF-1], Scal2AB [0:NBUF-1];
reg signed [ACCW-1:0] ScalBA [0:NBUF-1], Scal1BA [0:NBUF-1], Scal2BA [0:NBUF-1];
reg signed [ACCW-1:0] ScalBB [0:NBUF-1], Scal1BB [0:NBUF-1], Scal2BB [0:NBUF-1];
reg signed [ACCW-1:0] Tmp;
reg [7:0] InDataCounter; // assumed counter width
// Initialization
initial
begin : Init
  integer i;
  InDataCounter=0;
  for(i=0; i<NBUF; i=i+1)
  begin
    ScalAA[i]=0; Scal1AA[i]=0; Scal2AA[i]=0;
    ScalAB[i]=0; Scal1AB[i]=0; Scal2AB[i]=0;
    ScalBA[i]=0; Scal1BA[i]=0; Scal2BA[i]=0;
    ScalBB[i]=0; Scal1BB[i]=0; Scal2BB[i]=0;
    DataA[i]=0;
    DataB[i]=0;
  end
end
// Reading Data from Channels and Computation
always @(posedge InDataClkA)
begin : Compute
  integer i;
  InDataCounter<=InDataCounter+1;
  // shift registers holding the last NBUF samples of each channel
  for(i=0; i<NBUF-1; i=i+1)
  begin
    DataA[i+1]<=DataA[i];
    DataB[i+1]<=DataB[i];
  end
  DataA[0]<=InDataA;
  DataB[0]<=InDataB;
  for(i=0; i<NBUF; i=i+1)
  begin
    // product of the newest sample with each delayed sample
    Scal1AA[i]<=InDataA*DataA[i];
    Scal1AB[i]<=InDataA*DataB[i];
    Scal1BA[i]<=InDataB*DataA[i];
    Scal1BB[i]<=InDataB*DataB[i];
    // decay by (1-2^-UpdateSpeed); >>> keeps the sign of a signed value
    Scal2AA[i]<=ScalAA[i]-(ScalAA[i]>>>UpdateSpeed);
    Scal2AB[i]<=ScalAB[i]-(ScalAB[i]>>>UpdateSpeed);
    Scal2BA[i]<=ScalBA[i]-(ScalBA[i]>>>UpdateSpeed);
    Scal2BB[i]<=ScalBB[i]-(ScalBB[i]>>>UpdateSpeed);
    ScalAA[i]<=Scal1AA[i]+Scal2AA[i];
    ScalAB[i]<=Scal1AB[i]+Scal2AB[i];
    ScalBA[i]<=Scal1BA[i]+Scal2BA[i];
    ScalBB[i]<=Scal1BB[i]+Scal2BB[i];
  end
end
// This is an artificial always block to simulate that I am using the Scal?? data
always @(InDataCounter)
begin
  case(InDataCounter)
    0: Tmp=ScalAA[0];
    1: Tmp=ScalAB[0];
    2: Tmp=ScalBA[0];
    3: Tmp=ScalBB[0];
  endcase
  OutData=Tmp+Tmp+Tmp+Tmp+Tmp;
end
endmodule
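For readers following the math, the reason each lag needs only one multiplication, one shift and two additions per clock can be written as a recurrence. This is a sketch under assumptions that the post does not spell out: the first entry of each vector is taken to be the newest sample, $q = 1-2^{-m}$ with $m$ equal to `UpdateSpeed`, and $a[t]$, $b[t]$ are notation introduced here for the sample streams of two channels.

```latex
\begin{align}
s_{a,b,k}[t] &= \sum_{\tau \ge 0} q^{\tau}\, a[t-\tau]\, b[t-\tau-k], \qquad q = 1 - 2^{-m},\\
             &= a[t]\, b[t-k] + q \sum_{\tau \ge 0} q^{\tau}\, a[t-1-\tau]\, b[t-1-\tau-k]\\
             &= a[t]\, b[t-k] + q\, s_{a,b,k}[t-1]\\
             &= a[t]\, b[t-k] + s_{a,b,k}[t-1] - 2^{-m}\, s_{a,b,k}[t-1].
\end{align}
```

The last line is the form the module uses: the factor $2^{-m}$ becomes a right shift by `UpdateSpeed`, so the decay costs no multiplier. Which of the two streams carries the lag $k$ depends on how the shift direction of P is defined.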
Hi,
Quartus will infer the four multiplier adder mode from your Verilog, if it follows a suitable template. Check the Quartus manual, section 6-9, for details and examples. http://www.altera.com/literature/hb/qts/qts_qii51007.pdf

That said, I don't see how it will help you with your resource problem: a Stratix III DSP block can implement 4 18x18 multipliers, whether it's 4 independent multipliers or 4 multiplier-adders.

As for fMax, you need to take a look at your critical paths and see where the largest delay is. In such a case, adding some extra register stages might help.

PS: your "artificial block" looks like something that will be synthesized to latches.
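On the PS about latches: a combinational `always` block infers latches when some path through it leaves a variable unassigned, for example a `case` without a `default`. A minimal sketch of a latch-free version of such a selection block follows. The module name `scal_select`, the 2-bit `Sel` input and the width parameter `W` are hypothetical and not taken from the original design.

```verilog
// Hypothetical helper, not part of the original design.
module scal_select #(parameter W = 32) (
    input  wire                Clk,
    input  wire        [1:0]   Sel,      // assumed 2-bit select
    input  wire signed [W-1:0] ScalAA, ScalAB, ScalBA, ScalBB,
    output reg  signed [W-1:0] OutData
);
    // Clocked process: the result is always stored in a flip-flop,
    // and the default branch covers every remaining select value.
    always @(posedge Clk) begin
        case (Sel)
            2'd0:    OutData <= ScalAA;
            2'd1:    OutData <= ScalAB;
            2'd2:    OutData <= ScalBA;
            default: OutData <= ScalBB;
        endcase
    end
endmodule
```

Making the block clocked also adds a register stage after the multiplexer, which is usually harmless for a debug or readout path.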
Hi,
Thank you for your kind response. Let me comment on your answer, which may also clarify the main problem that made me ask on this forum.

--- Quote Start ---
Quartus will infer the four multiplier adder mode from your Verilog, if it follows a suitable template. Check the Quartus manual, section 6-9, for details and examples. http://www.altera.com/literature/hb/...s_qii51007.pdf (http://www.altera.com/literature/hb/qts/qts_qii51007.pdf)
--- Quote End ---
Yes, that is one of the reasons why I am asking on this forum.

--- Quote Start ---
That said, I don't see how it will help you with your resource problem: a Stratix III DSP block can implement 4 18x18 multipliers, whether it's 4 independent multipliers or 4 multiplier-adders.
--- Quote End ---
No! In the Altera document http://www.altera.com/literature/hb/stx3/stx3_siii51005.pdf, Table 5-1 on page 5-2 says that on the SL150 I can get 384 18x18 multipliers if I use the four multiplier mode, whereas with normal (independent) multipliers I get only 192 18x18 multipliers. I need more performance!!!

On the other hand, in http://www.altera.com/literature/hb/stx3/stx3_siii5v2.pdf, the fifth row of Table 1-21 on page 1-17 says I should achieve 600 MHz in 18x18 mode and 440 MHz in double mode (I have the C2 speed grade). I need more speed (FMax) for my project!!! Hence I am trying to find out how to organize my computation so that it achieves this performance.

--- Quote Start ---
PS: your "artificial block" looks like something that will be synthesized to latches.
--- Quote End ---
Please do not worry about it! In reality this "artificial block" contains a completely different algorithm, but it is about 2000 lines long and would distract from my main question.

Sincerely, Ilghiz
I see, I misinterpreted the doc. :)
Anyway, taking a second look at your code, if I'm reading this right, it can't be mapped to the 4-M-A mode. The 4-M-Adder performs the operation "output = (a0*b0) + (a1*b1) + (a2*b2) + (a3*b3)" with 3 levels of registers (for fMAX):
ra0 <= a0; ... rb3 <= b3;
rma01 <= (ra0 * rb0) + (ra1 * rb1);
rma23 <= (ra2 * rb2) + (ra3 * rb3);
rma <= rma01 + rma23;
output <= round_saturate(rma);
Or you can use it in 4-M-Accumulator mode:
ra0 <= a0; ... rb3 <= b3;
rma01 <= (ra0 * rb0) + (ra1 * rb1);
rma23 <= (ra2 * rb2) + (ra3 * rb3);
rma <= rma01 + rma23 + rma;
output <= round_saturate(rma);
You need to, somehow, convert your algorithm into one of these patterns. But looking at it, I can't figure out a way.
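As a concrete illustration of the adder pattern above, here is a minimal Verilog sketch of a registered sum of four products. The module name `four_mult_add`, the port names and the parameter `W` are hypothetical, and the rounding and saturation step is left out by making the result wide enough to hold the full sum. Whether Quartus actually packs this into a single DSP block in four multiplier adder mode is not guaranteed by the code and should be checked in the compilation report.

```verilog
// Sketch only: names are hypothetical, not from the original design.
module four_mult_add #(parameter W = 14) (
    input  wire                  Clk,
    input  wire signed [W-1:0]   a0, b0, a1, b1, a2, b2, a3, b3,
    output reg  signed [2*W+1:0] result   // 2W-bit products, 2 guard bits
);
    reg signed [W-1:0] ra0, rb0, ra1, rb1, ra2, rb2, ra3, rb3;
    reg signed [2*W:0] rma01, rma23;      // one guard bit per two-term sum

    always @(posedge Clk) begin
        // Level 1: input registers
        ra0 <= a0; rb0 <= b0;
        ra1 <= a1; rb1 <= b1;
        ra2 <= a2; rb2 <= b2;
        ra3 <= a3; rb3 <= b3;
        // Level 2: two products feed each first-stage adder
        rma01 <= ra0 * rb0 + ra1 * rb1;
        rma23 <= ra2 * rb2 + ra3 * rb3;
        // Level 3: second-stage adder
        result <= rma01 + rma23;
    end
endmodule
```

The result appears three clock edges after the inputs are applied, matching the three register levels described in the reply.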
Dear Rbuhalho,
Yes, you are right, thank you! It seems that my first question, about the usage of multipliers, is solved. I was able to convert the algorithm so that it computes A*B+C*D, and I immediately saw the multiplier usage drop by a factor of two! However, my performance is still far from the possible peak: right now I achieve only 330MHz instead of 440MHz (is it possible to reach 600MHz here on my hardware?). It seems that I need to tune my settings in Quartus or change something more in the algorithm. Here are the modified code and the Quartus settings:
module GenScal(A1, A2, B1, B2, C1, C2, D1, D2,
P1, P2, Q1, Q2, R1, R2, S1, S2,
AP, AQ, AR, AS,
BP, BQ, BR, BS,
CP, CQ, CR, CS,
DP, DQ, DR, DS, Clk);
parameter UpdateSpeed=12;
input Clk;
input A1, A2, B1, B2, C1, C2, D1, D2;
input P1, P2, Q1, Q2, R1, R2, S1, S2;
output AP, AQ, AR, AS;
output BP, BQ, BR, BS;
output CP, CQ, CR, CS;
output DP, DQ, DR, DS;
// Memory
reg ScalAP, ScalAQ, ScalAR, ScalAS;
reg ScalBP, ScalBQ, ScalBR, ScalBS;
reg ScalCP, ScalCQ, ScalCR, ScalCS;
reg ScalDP, ScalDQ, ScalDR, ScalDS;
reg AddAP, AddAQ, AddAR, AddAS;
reg AddBP, AddBQ, AddBR, AddBS;
reg AddCP, AddCQ, AddCR, AddCS;
reg AddDP, AddDQ, AddDR, AddDS;
reg MulAP1, MulAQ1, MulAR1, MulAS1;
reg MulBP1, MulBQ1, MulBR1, MulBS1;
reg MulCP1, MulCQ1, MulCR1, MulCS1;
reg MulDP1, MulDQ1, MulDR1, MulDS1;
reg MulAP2, MulAQ2, MulAR2, MulAS2;
reg MulBP2, MulBQ2, MulBR2, MulBS2;
reg MulCP2, MulCQ2, MulCR2, MulCS2;
reg MulDP2, MulDQ2, MulDR2, MulDS2;
reg SumAP, SumAQ, SumAR, SumAS;
reg SumBP, SumBQ, SumBR, SumBS;
reg SumCP, SumCQ, SumCR, SumCS;
reg SumDP, SumDQ, SumDR, SumDS;
assign AP=ScalAP;
assign AQ=ScalAQ;
assign AR=ScalAR;
assign AS=ScalAS;
assign BP=ScalBP;
assign BQ=ScalBQ;
assign BR=ScalBR;
assign BS=ScalBS;
assign CP=ScalCP;
assign CQ=ScalCQ;
assign CR=ScalCR;
assign CS=ScalCS;
assign DP=ScalDP;
assign DQ=ScalDQ;
assign DR=ScalDR;
assign DS=ScalDS;
// Initialization
initial
begin
//
MulAP1=0; MulAQ1=0; MulAR1=0; MulAS1=0;
MulBP1=0; MulBQ1=0; MulBR1=0; MulBS1=0;
MulCP1=0; MulCQ1=0; MulCR1=0; MulCS1=0;
MulDP1=0; MulDQ1=0; MulDR1=0; MulDS1=0;
//
MulAP2=0; MulAQ2=0; MulAR2=0; MulAS2=0;
MulBP2=0; MulBQ2=0; MulBR2=0; MulBS2=0;
MulCP2=0; MulCQ2=0; MulCR2=0; MulCS2=0;
MulDP2=0; MulDQ2=0; MulDR2=0; MulDS2=0;
//
SumAP=0; SumAQ=0; SumAR=0; SumAS=0;
SumBP=0; SumBQ=0; SumBR=0; SumBS=0;
SumCP=0; SumCQ=0; SumCR=0; SumCS=0;
SumDP=0; SumDQ=0; SumDR=0; SumDS=0;
//
AddAP=0; AddAQ=0; AddAR=0; AddAS=0;
AddBP=0; AddBQ=0; AddBR=0; AddBS=0;
AddCP=0; AddCQ=0; AddCR=0; AddCS=0;
AddDP=0; AddDQ=0; AddDR=0; AddDS=0;
//
ScalAP=0; ScalAQ=0; ScalAR=0; ScalAS=0;
ScalBP=0; ScalBQ=0; ScalBR=0; ScalBS=0;
ScalCP=0; ScalCQ=0; ScalCR=0; ScalCS=0;
ScalDP=0; ScalDQ=0; ScalDR=0; ScalDS=0;
end
// Main Computations
always @(posedge Clk)
begin
// 1*1
MulAP1<=A1*P1; MulAQ1<=A1*Q1; MulAR1<=A1*R1; MulAS1<=A1*S1;
MulBP1<=B1*P1; MulBQ1<=B1*Q1; MulBR1<=B1*R1; MulBS1<=B1*S1;
MulCP1<=C1*P1; MulCQ1<=C1*Q1; MulCR1<=C1*R1; MulCS1<=C1*S1;
MulDP1<=D1*P1; MulDQ1<=D1*Q1; MulDR1<=D1*R1; MulDS1<=D1*S1;
// 2*2
MulAP2<=A2*P2; MulAQ2<=A2*Q2; MulAR2<=A2*R2; MulAS2<=A2*S2;
MulBP2<=B2*P2; MulBQ2<=B2*Q2; MulBR2<=B2*R2; MulBS2<=B2*S2;
MulCP2<=C2*P2; MulCQ2<=C2*Q2; MulCR2<=C2*R2; MulCS2<=C2*S2;
MulDP2<=D2*P2; MulDQ2<=D2*Q2; MulDR2<=D2*R2; MulDS2<=D2*S2;
// Sum
SumAP<=MulAP1+MulAP2; SumAQ<=MulAQ1+MulAQ2; SumAR<=MulAR1+MulAR2; SumAS<=MulAS1+MulAS2;
SumBP<=MulBP1+MulBP2; SumBQ<=MulBQ1+MulBQ2; SumBR<=MulBR1+MulBR2; SumBS<=MulBS1+MulBS2;
SumCP<=MulCP1+MulCP2; SumCQ<=MulCQ1+MulCQ2; SumCR<=MulCR1+MulCR2; SumCS<=MulCS1+MulCS2;
SumDP<=MulDP1+MulDP2; SumDQ<=MulDQ1+MulDQ2; SumDR<=MulDR1+MulDR2; SumDS<=MulDS1+MulDS2;
// Scal: if I change A+B-C into two stage pipeline it does not improve the performance...
ScalAP<=ScalAP+SumAP-AP; ScalAQ<=ScalAQ+SumAQ-AQ; ScalAR<=ScalAR+SumAR-AR; ScalAS<=ScalAS+SumAS-AS;
ScalBP<=ScalBP+SumBP-BP; ScalBQ<=ScalBQ+SumBQ-BQ; ScalBR<=ScalBR+SumBR-BR; ScalBS<=ScalBS+SumBS-BS;
ScalCP<=ScalCP+SumCP-CP; ScalCQ<=ScalCQ+SumCQ-CQ; ScalCR<=ScalCR+SumCR-CR; ScalCS<=ScalCS+SumCS-CS;
ScalDP<=ScalDP+SumDP-DP; ScalDQ<=ScalDQ+SumDQ-DQ; ScalDR<=ScalDR+SumDR-DR; ScalDS<=ScalDS+SumDS-DS;
end
endmodule
Option | Setting | Default Value
Device | EP3SL150F1152C2 |
Top-level entity name | my_t2_DE3 | my_t2_DE3
Family name | Stratix III | Stratix II
Optimization Technique | Speed | Balanced
Use Generated Physical Constraints File | Off |
Use smart compilation | Off | Off
Enable parallel Assembler and TimeQuest Timing Analyzer during compilation | On | On
Enable compact report table | Off | Off
Restructure Multiplexers | Auto | Auto
Create Debugging Nodes for IP Cores | Off | Off
Preserve fewer node names | On | On
Disable OpenCore Plus hardware evaluation | Off | Off
Verilog Version | Verilog_2001 | Verilog_2001
VHDL Version | VHDL_1993 | VHDL_1993
State Machine Processing | Auto | Auto
Safe State Machine | Off | Off
Extract Verilog State Machines | On | On
Extract VHDL State Machines | On | On
Ignore Verilog initial constructs | Off | Off
Iteration limit for constant Verilog loops | 5000 | 5000
Iteration limit for non-constant Verilog loops | 250 | 250
Add Pass-Through Logic to Inferred RAMs | On | On
Parallel Synthesis | Off | Off
DSP Block Balancing | Auto | Auto
NOT Gate Push-Back | On | On
Power-Up Don't Care | On | On
Remove Redundant Logic Cells | Off | Off
Remove Duplicate Registers | On | On
Ignore CARRY Buffers | Off | Off
Ignore CASCADE Buffers | Off | Off
Ignore GLOBAL Buffers | Off | Off
Ignore ROW GLOBAL Buffers | Off | Off
Ignore LCELL Buffers | Off | Off
Ignore SOFT Buffers | On | On
Limit AHDL Integers to 32 Bits | Off | Off
Carry Chain Length | 70 | 70
Auto Carry Chains | On | On
Auto Open-Drain Pins | On | On
Perform WYSIWYG Primitive Resynthesis | Off | Off
Auto ROM Replacement | On | On
Auto RAM Replacement | On | On
Auto DSP Block Replacement | On | On
Auto Shift Register Replacement | Auto | Auto
Auto Clock Enable Replacement | On | On
Strict RAM Replacement | Off | Off
Allow Synchronous Control Signals | On | On
Force Use of Synchronous Clear Signals | Off | Off
Auto RAM Block Balancing | On | On
Auto RAM to Logic Cell Conversion | Off | Off
Auto Resource Sharing | Off | Off
Allow Any RAM Size For Recognition | Off | Off
Allow Any ROM Size For Recognition | Off | Off
Allow Any Shift Register Size For Recognition | Off | Off
Use LogicLock Constraints during Resource Balancing | On | On
Ignore translate_off and synthesis_off directives | Off | Off
Timing-Driven Synthesis | Off | Off
Show Parameter Settings Tables in Synthesis Report | On | On
Ignore Maximum Fan-Out Assignments | Off | Off
Synchronization Register Chain Length | 2 | 2
PowerPlay Power Optimization | Normal compilation | Normal compilation
HDL message level | Level2 | Level2
Suppress Register Optimization Related Messages | Off | Off
Number of Removed Registers Reported in Synthesis Report | 5000 | 5000
Number of Inverted Registers Reported in Synthesis Report | 100 | 100
Clock MUX Protection | On | On
Auto Gated Clock Conversion | Off | Off
Block Design Naming | Auto | Auto
SDC constraint protection | Off | Off
Synthesis Effort | Auto | Auto
Shift Register Replacement - Allow Asynchronous Clear Signal | On | On
Analysis & Synthesis Message Level | Medium | Medium
Disable Register Merging Across Hierarchies | Auto | Auto
Resource Aware Inference For Block RAM | On | On
Please suggest what I can still improve in the settings and/or in the code to get better performance! Sincerely, Ilghiz
Hi,
Dumb suggestion one: Try changing the Optimization Technique from Balanced to Speed.
Dumb suggestion two: Add two more register levels: register the inputs and the outputs.
How are you obtaining that 330MHz? Are you synthesizing your entire design or just that module? Which is the critical path?
Hi,
--- Quote Start ---
Try changing the Optimization Technique from Balanced to Speed.
--- Quote End ---
I have several (not all) optimizations switched on for speed. If I synthesize one instance of this module in the complete project, I get FMax=330MHz. If I have 10 instances, then FMax is only 310MHz :(

--- Quote Start ---
Add two more register levels: register the inputs and the outputs.
--- Quote End ---
Please help me with a short example of this, I did not get the idea!

--- Quote Start ---
How are you obtaining that 330MHz? Are you synthesizing your entire design or just that module? Which is the critical path?
--- Quote End ---
Actually, I measure FMax for the entire design, but it is not too complicated: right now the inputs come from HSTC LVDS data, and the output is pipelined through the "artificial part" to GPIO. I am a newbie in FPGA design; I just moved into this field after 20 years of experience in massively parallel numerical math. I tried to use "set_false_path", but it seems that I did not set it properly, and I cannot figure out where my critical path is.

PS: I can publish the entire project. It is just 600 lines: 100 lines are already here, 300 lines come from Terasic, and the rest just binds different inputs and outputs to each other.

Sincerely, Ilghiz
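The "register the inputs and the outputs" suggestion quoted above can be sketched as a thin wrapper that places one flip-flop level in front of the existing logic and one behind it. Everything named here is hypothetical: `core` is a one-multiplier stand-in for the real computation, and `core_registered`, `rA`, `rP` and `W` are illustrative names. Whether this raises FMax depends on where the critical path actually is, which the thread has not established.

```verilog
// Stand-in for the existing logic (hypothetical; the real module is larger).
module core #(parameter W = 14) (
    input  wire                  Clk,
    input  wire signed [W-1:0]   A, P,
    output reg  signed [2*W-1:0] AP
);
    always @(posedge Clk) AP <= A * P;
endmodule

// Wrapper that adds the two extra register levels.
module core_registered #(parameter W = 14) (
    input  wire                  Clk,
    input  wire signed [W-1:0]   A, P,
    output reg  signed [2*W-1:0] AP
);
    reg  signed [W-1:0]   rA, rP;    // input register level
    wire signed [2*W-1:0] coreAP;

    core #(.W(W)) u_core (.Clk(Clk), .A(rA), .P(rP), .AP(coreAP));

    always @(posedge Clk) begin
        rA <= A;                     // isolates the pin-to-logic routing delay
        rP <= P;
        AP <= coreAP;                // output register level
    end
endmodule
```

The wrapper adds two clock cycles of latency. It shortens the paths between the I/O pins and the arithmetic, but it cannot shorten a feedback path such as an accumulator that reads its own previous value.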