Please, advise me how to improve the performance of my verilog module

Altera_Forum
Honored Contributor II
1,491 Views

Hi, 

 

While experimenting with a Terasic DE3 board, I have run into a problem that I cannot solve on my own, so I kindly ask this forum for advice.

 

I have a small Verilog module that reads N words of N infinite vectors v_1,...,v_N, and I need to compute all possible s_{i,j,k}=v_i^T D P^k v_j, where P is the permutation matrix that shifts a vector down by one entry, and D is the diagonal matrix with (d_1, d_2, ...) on the diagonal, so that d_1=1, d_2=(1-2^{-m}), d_3=(1-2^{-m})^2, and so on. In other words, I am implementing something similar to stable IIR filters.
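Written out per element, the definition above turns into a running sum with a one-step update, which is where the resemblance to a first-order IIR filter comes from. This is a sketch under an assumption the post does not state: entry 1 of each vector is taken to be the most recent sample. The time index n and the symbol \lambda are introduced here only for illustration.

```latex
s_{i,j,k}(n) = \sum_{l \ge 0} \lambda^{l}\, v_i(n-l)\, v_j(n-l-k),
\qquad \lambda = 1 - 2^{-m}

s_{i,j,k}(n) = \lambda\, s_{i,j,k}(n-1) + v_i(n)\, v_j(n-k)
```

Under that reading, multiplying by \lambda needs no multiplier: \lambda s = s - 2^{-m} s, which is a shift and a subtraction.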

 

In my case I am trying to pipeline the input data arriving from each vector (InDataA and InDataB; for simplicity I take an example with N=2), compute all products, and store them in the result outputs (ScalAA, ScalAB, ScalBA, ScalBB).

 

When I put this module into the standard Terasic DE3 environment, I run into two issues that I cannot resolve:

 

1. All my data are reg signed [13:0], so one multiplication fits into an 18x18-bit multiplier. I am doing massively parallel multiplications and hope to use the so-called "Four Multiplier Adder Mode" described in the Stratix III Device Handbook, Volume 1, but I cannot understand how to implement it. I urgently need it; otherwise I will run out of resources on my DE3 board.

 

2. The timing of this module is not great: I achieve only 260-310 MHz, whereas in "Four Multiplier Adder Mode" I should achieve 600 MHz. I also need this because I expect input data rates of 400, 500, and probably 600 MHz in my design.

 

And now here is my module. Please advise me how to:

 

1. switch four multiplier adder mode on, 

2. improve the performance.

 

Thank you! 

 

Ilghiz 

 

module DATA_Aq(InDataClkA, InDataA, InDataB, OutData);
  parameter NBUF=16; // the maximum possible shift in the design, I should be able to run it with:
                     // 1) A,B,...H=8 channels, and NBUF=6, or
                     // 2) A,B,C,D=4 channels, and NBUF=16, so both designs
                     // need 384 or 256 18x18 multipliers in Four Multiplier Adder Mode
  parameter UpdateSpeed=12;

  input InDataClkA;
  input InDataA;
  input InDataB;
  reg OutData; // this is some artificial output that prevents Quartus to optimize out the main part of computations
  output OutData;

  // Memory Declaration
  reg signed DataA;
  reg signed DataB;
  reg signed ScalAA, Scal1AA, Scal2AA;
  reg signed ScalAB, Scal1AB, Scal2AB;
  reg signed ScalBA, Scal1BA, Scal2BA;
  reg signed ScalBB, Scal1BB, Scal2BB;
  // reg signed Scal3AA, Scal4AA, Scal5AA;
  reg signed Tmp;
  reg InDataCounter;

  // Initialization
  initial
  begin
    integer i;
    InDataCounter=0;
    for(i=0; i<NBUF; i=i+1)
    begin
      ScalAA=0; Scal1AA=0; Scal2AA=0;
      ScalAB=0; Scal1AB=0; Scal2AB=0;
      ScalBA=0; Scal1BA=0; Scal2BA=0;
      ScalBB=0; Scal1BB=0; Scal2BB=0;
      DataA=0;
      DataB=0;
    end
  end

  // Reading Data from Channels and Computation
  always @(posedge InDataClkA)
  begin
    integer i;
    InDataCounter<=InDataCounter+1;
    for(i=0; i<NBUF-1; i=i+1)
    begin
      DataA<=DataA;
      DataB<=DataB;
    end
    DataA<=InDataA;
    DataB<=InDataB;
    for(i=0; i<NBUF; i=i+1)
    begin
      Scal1AA<=InDataA*DataA;
      Scal1AB<=InDataA*DataB;
      Scal1BA<=InDataB*DataA;
      Scal1BB<=InDataB*DataB;
      Scal2AA<=ScalAA-(ScalAA>>UpdateSpeed);
      Scal2AB<=ScalAB-(ScalAB>>UpdateSpeed);
      Scal2BA<=ScalBA-(ScalBA>>UpdateSpeed);
      Scal2BB<=ScalBB-(ScalBB>>UpdateSpeed);
      ScalAA<=Scal1AA+Scal2AA;
      ScalAB<=Scal1AB+Scal2AB;
      ScalBA<=Scal1BA+Scal2BA;
      ScalBB<=Scal1BB+Scal2BB;
    end
  end

  // This is artificial always block to simulate that I am using Scal?? data
  always @(InDataCounter)
  begin
    case(InDataCounter)
      0: Tmp=ScalAA];
      1: Tmp=ScalAB];
      2: Tmp=ScalBA];
      3: Tmp=ScalBB];
    endcase
    OutData=Tmp+Tmp+Tmp+Tmp+Tmp;
  end
endmodule
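The listing above appears to have lost its bracketed bit ranges and array indices when it was posted (note the stray `]` in the case statement), so it does not compile as shown. The following is a minimal sketch of what the per-lag update presumably looks like, not a recovery of the original. It assumes 14-bit signed samples (as stated in the post) and covers a single pair (A against delayed B). The module name LeakyCorr, the accumulator width ACCW, the LAGW parameter, and the Lag/ScalOut readout port are all hypothetical.

```verilog
// Sketch only, not the original module: one correlation pair, NBUF lags.
module LeakyCorr #(
    parameter NBUF        = 16,  // number of lags kept
    parameter UpdateSpeed = 12,  // decay factor is 1 - 2^-UpdateSpeed
    parameter ACCW        = 40,  // accumulator width (assumed, not from the post)
    parameter LAGW        = 4    // bits needed to address NBUF lags
) (
    input  wire                   Clk,
    input  wire signed [13:0]     InDataA,
    input  wire signed [13:0]     InDataB,
    input  wire [LAGW-1:0]        Lag,      // must stay below NBUF
    output reg  signed [ACCW-1:0] ScalOut
);
    reg signed [13:0]     DataB [0:NBUF-1];  // DataB[i]: B sample from i+1 clocks ago
    reg signed [27:0]     Prod  [0:NBUF-1];  // registered 14x14 products
    reg signed [ACCW-1:0] Decay [0:NBUF-1];  // registered Scal - (Scal >>> UpdateSpeed)
    reg signed [ACCW-1:0] Scal  [0:NBUF-1];  // leaky accumulators
    integer i, j;

    // Power-up values, in the same spirit as the initial block in the post.
    initial begin
        ScalOut = 0;
        for (j = 0; j < NBUF; j = j + 1) begin
            DataB[j] = 0; Prod[j] = 0; Decay[j] = 0; Scal[j] = 0;
        end
    end

    always @(posedge Clk) begin
        // Delay line for channel B.
        DataB[0] <= InDataB;
        for (i = 0; i < NBUF-1; i = i + 1)
            DataB[i+1] <= DataB[i];

        for (i = 0; i < NBUF; i = i + 1) begin
            Prod[i]  <= InDataA * DataB[i];
            // >>> keeps the sign of a signed value; >> would shift in zeros.
            Decay[i] <= Scal[i] - (Scal[i] >>> UpdateSpeed);
            Scal[i]  <= Prod[i] + Decay[i];
        end

        ScalOut <= Scal[Lag];
    end
endmodule
```

The sketch keeps the structure of the posted code, in which the decay term is registered on its own. That shortens the adder path, but it also means the feedback from an accumulator back to itself takes two clocks rather than one.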
6 Replies
Altera_Forum
Honored Contributor II
560 Views

Hi,  

 

Quartus will infer the four multiplier adder mode from your Verilog if it follows a suitable template.

 

Check the Quartus manual, section 6-9, for details and examples. 

http://www.altera.com/literature/hb/qts/qts_qii51007.pdf 

 

That said, I don't see how it will help you with your resource problem: a Stratix III DSP block can implement four 18x18 multipliers, whether they are four independent multipliers or four multiplier-adders.

 

As for fMax, you need to take a look at your critical paths and see where the largest delay is. In such a case, adding some extra register stages might help. 

 

PS: your "artificial block" looks like something that will be synthesized to latches.
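To illustrate the latch remark: a combinational always block infers latches when some input value leaves an output unassigned, and a sensitivity list that names only the selector can make simulation differ from the synthesized hardware. A minimal sketch of a latch-free version follows, assuming a 2-bit selector and a hypothetical width W. The names OutMux, Sel, and W are illustrative and not taken from the original project.

```verilog
// Sketch only: latch-free selection among four accumulated values.
module OutMux #(
    parameter W = 32   // value width (assumed)
) (
    input  wire [1:0]          Sel,
    input  wire signed [W-1:0] ScalAA, ScalAB, ScalBA, ScalBB,
    output reg  signed [W-1:0] Tmp
);
    // always @* is sensitive to every signal read in the block, and the
    // default branch guarantees Tmp is assigned for every value of Sel.
    always @* begin
        case (Sel)
            2'd0:    Tmp = ScalAA;
            2'd1:    Tmp = ScalAB;
            2'd2:    Tmp = ScalBA;
            default: Tmp = ScalBB;
        endcase
    end
endmodule
```

If the selector is wider than the number of listed cases, the default branch is what prevents the latch.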
Altera_Forum
Honored Contributor II
560 Views

Hi, 

 

Thank you for your kind response. Let me comment on your answer, and perhaps clarify the main problem that made me ask on this forum.

 

 

--- Quote Start ---  

 

Quartus will infer the four multiplier adder mode from your Verilog, if it follows a suitable template. 

 

Check the Quartus manual, section 6-9, for details and examples. 

http://www.altera.com/literature/hb/qts/qts_qii51007.pdf

 

--- Quote End ---  

 

 

Yes, that is one reason why I am asking on this forum.

 

 

 

--- Quote Start ---  

 

That said, I don't see how it will help you with your resource problem: a Stratix III DSP block can implement four 18x18 multipliers, whether they are four independent multipliers or four multiplier-adders.

 

--- Quote End ---  

 

 

No! Actually, in the Altera document

http://www.altera.com/literature/hb/stx3/stx3_siii51005.pdf

on page 5-2, Table 5-1 says that on the SL150 I can get 384 18x18 multipliers in four multiplier adder mode, whereas with normal (independent) multipliers I get only 192 18x18 multipliers. I need more performance!!!

 

On the other hand, according to

http://www.altera.com/literature/hb/stx3/stx3_siii5v2.pdf

page 1-17, Table 1-21, 5th line, I should achieve 600 MHz in 18x18 mode and 440 MHz in double mode (I have the C2 speed grade). I need more speed (FMax) for my project!!!

 

Hence I am trying to find out how to organize my computation so that it achieves this performance.

 

 

--- Quote Start ---  

PS: your "artificial block" looks like something that will be synthesized to latches. 

--- Quote End ---  

 

 

Please do not worry about it! In reality there is a completely different algorithm in this "artificial block", but it is about 2000 lines long, and those lines would distract you from my main question.

 

Sincerely, 

 

Ilghiz
Altera_Forum
Honored Contributor II
560 Views

I see, I misinterpreted the doc. :) 

 

Anyway, taking a second look at your code, if I am reading this right, it can't be mapped to the 4-M-A mode.

 

The 4-M-Adder performs the operation "output = (a0*b0) + (a1*b1) + (a2*b2) + (a3*b3)" with 3 levels of registers (for fMAX) 

 

ra0 <= a0; ... rb3 <= b3;
rma01 <= (ra0 * rb0) + (ra1 * rb1);
rma23 <= (ra2 * rb2) + (ra3 * rb3);
rma <= rma01 + rma23;
output <= round_saturate(rma);

 

Or you can use it in 4-M-Accumulator mode:

 

ra0 <= a0; ... rb3 <= b3;
rma01 <= (ra0 * rb0) + (ra1 * rb1);
rma23 <= (ra2 * rb2) + (ra3 * rb3);
rma <= rma01 + rma23 + rma;
output <= round_saturate(rma);

 

You need to, somehow, convert your algorithm into one of these patterns. 

But looking at it, I can't figure out a way.
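For reference, the first pattern can be written as a complete module along the following lines. This is a sketch under assumptions: the module name Mult4Add and the port widths are illustrative, and the round/saturate step is left out. Whether Quartus actually packs it into a single DSP block in four multiplier adder mode depends on the tool version and settings, so the resource usage in the compilation report is the thing to check.

```verilog
// Sketch only: out = a0*b0 + a1*b1 + a2*b2 + a3*b3 with three register levels.
module Mult4Add (
    input  wire               Clk,
    input  wire signed [17:0] a0, b0, a1, b1, a2, b2, a3, b3,
    output reg  signed [37:0] out
);
    reg signed [17:0] ra0, rb0, ra1, rb1, ra2, rb2, ra3, rb3;  // input registers
    reg signed [36:0] rma01, rma23;                            // first adder stage

    always @(posedge Clk) begin
        // Level 1: register all operands.
        ra0 <= a0; rb0 <= b0;
        ra1 <= a1; rb1 <= b1;
        ra2 <= a2; rb2 <= b2;
        ra3 <= a3; rb3 <= b3;

        // Level 2: two products per adder; 36-bit products need 37 bits summed.
        rma01 <= (ra0 * rb0) + (ra1 * rb1);
        rma23 <= (ra2 * rb2) + (ra3 * rb3);

        // Level 3: final adder; four products need 38 bits.
        out <= rma01 + rma23;
    end
endmodule
```

With 14-bit data, as in the original post, the operands would simply be sign-extended or declared narrower; the structure stays the same.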
Altera_Forum
Honored Contributor II
560 Views

Dear Rbuhalho, 

 

Yes, you are right, thank you! It seems that my first question, regarding the usage of multipliers, is solved. I was able to convert the algorithm so that it computes A*B+C*D, and I immediately saw the multiplier usage drop by half!

 

However, my performance is still far from the possible peak: I am achieving only 330 MHz right now instead of 440 MHz (is it possible to reach 600 MHz here on my hardware?).

 

It seems that I need to tune my settings in Quartus or change something more in the algorithm. 

 

Here are the modified code and the Quartus settings:

 

 

module GenScal(A1, A2, B1, B2, C1, C2, D1, D2,
               P1, P2, Q1, Q2, R1, R2, S1, S2,
               AP, AQ, AR, AS, BP, BQ, BR, BS,
               CP, CQ, CR, CS, DP, DQ, DR, DS, Clk);
  parameter UpdateSpeed=12;

  input Clk;
  input A1, A2, B1, B2, C1, C2, D1, D2;
  input P1, P2, Q1, Q2, R1, R2, S1, S2;
  output AP, AQ, AR, AS;
  output BP, BQ, BR, BS;
  output CP, CQ, CR, CS;
  output DP, DQ, DR, DS;

  // Memory
  reg ScalAP, ScalAQ, ScalAR, ScalAS;
  reg ScalBP, ScalBQ, ScalBR, ScalBS;
  reg ScalCP, ScalCQ, ScalCR, ScalCS;
  reg ScalDP, ScalDQ, ScalDR, ScalDS;
  reg AddAP, AddAQ, AddAR, AddAS;
  reg AddBP, AddBQ, AddBR, AddBS;
  reg AddCP, AddCQ, AddCR, AddCS;
  reg AddDP, AddDQ, AddDR, AddDS;
  reg MulAP1, MulAQ1, MulAR1, MulAS1;
  reg MulBP1, MulBQ1, MulBR1, MulBS1;
  reg MulCP1, MulCQ1, MulCR1, MulCS1;
  reg MulDP1, MulDQ1, MulDR1, MulDS1;
  reg MulAP2, MulAQ2, MulAR2, MulAS2;
  reg MulBP2, MulBQ2, MulBR2, MulBS2;
  reg MulCP2, MulCQ2, MulCR2, MulCS2;
  reg MulDP2, MulDQ2, MulDR2, MulDS2;
  reg SumAP, SumAQ, SumAR, SumAS;
  reg SumBP, SumBQ, SumBR, SumBS;
  reg SumCP, SumCQ, SumCR, SumCS;
  reg SumDP, SumDQ, SumDR, SumDS;

  assign AP=ScalAP; assign AQ=ScalAQ; assign AR=ScalAR; assign AS=ScalAS;
  assign BP=ScalBP; assign BQ=ScalBQ; assign BR=ScalBR; assign BS=ScalBS;
  assign CP=ScalCP; assign CQ=ScalCQ; assign CR=ScalCR; assign CS=ScalCS;
  assign DP=ScalDP; assign DQ=ScalDQ; assign DR=ScalDR; assign DS=ScalDS;

  // Initialization
  initial
  begin
    // MulAP1=0; MulAQ1=0; MulAR1=0; MulAS1=0; MulBP1=0; MulBQ1=0; MulBR1=0; MulBS1=0; MulCP1=0; MulCQ1=0; MulCR1=0; MulCS1=0; MulDP1=0; MulDQ1=0; MulDR1=0; MulDS1=0;
    // MulAP2=0; MulAQ2=0; MulAR2=0; MulAS2=0; MulBP2=0; MulBQ2=0; MulBR2=0; MulBS2=0; MulCP2=0; MulCQ2=0; MulCR2=0; MulCS2=0; MulDP2=0; MulDQ2=0; MulDR2=0; MulDS2=0;
    // SumAP=0; SumAQ=0; SumAR=0; SumAS=0; SumBP=0; SumBQ=0; SumBR=0; SumBS=0; SumCP=0; SumCQ=0; SumCR=0; SumCS=0; SumDP=0; SumDQ=0; SumDR=0; SumDS=0;
    // AddAP=0; AddAQ=0; AddAR=0; AddAS=0; AddBP=0; AddBQ=0; AddBR=0; AddBS=0; AddCP=0; AddCQ=0; AddCR=0; AddCS=0; AddDP=0; AddDQ=0; AddDR=0; AddDS=0;
    //
    ScalAP=0; ScalAQ=0; ScalAR=0; ScalAS=0;
    ScalBP=0; ScalBQ=0; ScalBR=0; ScalBS=0;
    ScalCP=0; ScalCQ=0; ScalCR=0; ScalCS=0;
    ScalDP=0; ScalDQ=0; ScalDR=0; ScalDS=0;
  end

  // Main Computations
  always @(posedge Clk)
  begin
    // 1*1
    MulAP1<=A1*P1; MulAQ1<=A1*Q1; MulAR1<=A1*R1; MulAS1<=A1*S1;
    MulBP1<=B1*P1; MulBQ1<=B1*Q1; MulBR1<=B1*R1; MulBS1<=B1*S1;
    MulCP1<=C1*P1; MulCQ1<=C1*Q1; MulCR1<=C1*R1; MulCS1<=C1*S1;
    MulDP1<=D1*P1; MulDQ1<=D1*Q1; MulDR1<=D1*R1; MulDS1<=D1*S1;
    // 2*2
    MulAP2<=A2*P2; MulAQ2<=A2*Q2; MulAR2<=A2*R2; MulAS2<=A2*S2;
    MulBP2<=B2*P2; MulBQ2<=B2*Q2; MulBR2<=B2*R2; MulBS2<=B2*S2;
    MulCP2<=C2*P2; MulCQ2<=C2*Q2; MulCR2<=C2*R2; MulCS2<=C2*S2;
    MulDP2<=D2*P2; MulDQ2<=D2*Q2; MulDR2<=D2*R2; MulDS2<=D2*S2;
    // Sum
    SumAP<=MulAP1+MulAP2; SumAQ<=MulAQ1+MulAQ2; SumAR<=MulAR1+MulAR2; SumAS<=MulAS1+MulAS2;
    SumBP<=MulBP1+MulBP2; SumBQ<=MulBQ1+MulBQ2; SumBR<=MulBR1+MulBR2; SumBS<=MulBS1+MulBS2;
    SumCP<=MulCP1+MulCP2; SumCQ<=MulCQ1+MulCQ2; SumCR<=MulCR1+MulCR2; SumCS<=MulCS1+MulCS2;
    SumDP<=MulDP1+MulDP2; SumDQ<=MulDQ1+MulDQ2; SumDR<=MulDR1+MulDR2; SumDS<=MulDS1+MulDS2;
    // Scal: if I change A+B-C into two stage pipeline it does not improve the performance...
    ScalAP<=ScalAP+SumAP-AP; ScalAQ<=ScalAQ+SumAQ-AQ; ScalAR<=ScalAR+SumAR-AR; ScalAS<=ScalAS+SumAS-AS;
    ScalBP<=ScalBP+SumBP-BP; ScalBQ<=ScalBQ+SumBQ-BQ; ScalBR<=ScalBR+SumBR-BR; ScalBS<=ScalBS+SumBS-BS;
    ScalCP<=ScalCP+SumCP-CP; ScalCQ<=ScalCQ+SumCQ-CQ; ScalCR<=ScalCR+SumCR-CR; ScalCS<=ScalCS+SumCS-CS;
    ScalDP<=ScalDP+SumDP-DP; ScalDQ<=ScalDQ+SumDQ-DQ; ScalDR<=ScalDR+SumDR-DR; ScalDS<=ScalDS+SumDS-DS;
  end
endmodule

 

 

 

Option | Setting | Default Value
Device | EP3SL150F1152C2 |
Top-level entity name | my_t2_DE3 | my_t2_DE3
Family name | Stratix III | Stratix II
Optimization Technique | Speed | Balanced
Use Generated Physical Constraints File | Off |
Use smart compilation | Off | Off
Enable parallel Assembler and TimeQuest Timing Analyzer during compilation | On | On
Enable compact report table | Off | Off
Restructure Multiplexers | Auto | Auto
Create Debugging Nodes for IP Cores | Off | Off
Preserve fewer node names | On | On
Disable OpenCore Plus hardware evaluation | Off | Off
Verilog Version | Verilog_2001 | Verilog_2001
VHDL Version | VHDL_1993 | VHDL_1993
State Machine Processing | Auto | Auto
Safe State Machine | Off | Off
Extract Verilog State Machines | On | On
Extract VHDL State Machines | On | On
Ignore Verilog initial constructs | Off | Off
Iteration limit for constant Verilog loops | 5000 | 5000
Iteration limit for non-constant Verilog loops | 250 | 250
Add Pass-Through Logic to Inferred RAMs | On | On
Parallel Synthesis | Off | Off
DSP Block Balancing | Auto | Auto
NOT Gate Push-Back | On | On
Power-Up Don't Care | On | On
Remove Redundant Logic Cells | Off | Off
Remove Duplicate Registers | On | On
Ignore CARRY Buffers | Off | Off
Ignore CASCADE Buffers | Off | Off
Ignore GLOBAL Buffers | Off | Off
Ignore ROW GLOBAL Buffers | Off | Off
Ignore LCELL Buffers | Off | Off
Ignore SOFT Buffers | On | On
Limit AHDL Integers to 32 Bits | Off | Off
Carry Chain Length | 70 | 70
Auto Carry Chains | On | On
Auto Open-Drain Pins | On | On
Perform WYSIWYG Primitive Resynthesis | Off | Off
Auto ROM Replacement | On | On
Auto RAM Replacement | On | On
Auto DSP Block Replacement | On | On
Auto Shift Register Replacement | Auto | Auto
Auto Clock Enable Replacement | On | On
Strict RAM Replacement | Off | Off
Allow Synchronous Control Signals | On | On
Force Use of Synchronous Clear Signals | Off | Off
Auto RAM Block Balancing | On | On
Auto RAM to Logic Cell Conversion | Off | Off
Auto Resource Sharing | Off | Off
Allow Any RAM Size For Recognition | Off | Off
Allow Any ROM Size For Recognition | Off | Off
Allow Any Shift Register Size For Recognition | Off | Off
Use LogicLock Constraints during Resource Balancing | On | On
Ignore translate_off and synthesis_off directives | Off | Off
Timing-Driven Synthesis | Off | Off
Show Parameter Settings Tables in Synthesis Report | On | On
Ignore Maximum Fan-Out Assignments | Off | Off
Synchronization Register Chain Length | 2 | 2
PowerPlay Power Optimization | Normal compilation | Normal compilation
HDL message level | Level2 | Level2
Suppress Register Optimization Related Messages | Off | Off
Number of Removed Registers Reported in Synthesis Report | 5000 | 5000
Number of Inverted Registers Reported in Synthesis Report | 100 | 100
Clock MUX Protection | On | On
Auto Gated Clock Conversion | Off | Off
Block Design Naming | Auto | Auto
SDC constraint protection | Off | Off
Synthesis Effort | Auto | Auto
Shift Register Replacement - Allow Asynchronous Clear Signal | On | On
Analysis & Synthesis Message Level | Medium | Medium
Disable Register Merging Across Hierarchies | Auto | Auto
Resource Aware Inference For Block RAM | On | On

 

Please suggest what I can still improve in the settings and/or the code to get better performance!

 

Sincerely, 

 

Ilghiz
Altera_Forum
Honored Contributor II
560 Views

Hi,  

Dumb suggestion one: 

Try changing the Optimization Technique from Balanced to Speed. 

 

Dumb suggestion two:  

Add two more register levels: register the inputs and the outputs. 
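As a rough illustration of that second suggestion, here is one output of GenScal with an extra register level on the inputs and another on the output. This is a sketch under assumptions: the posted code has lost its bit widths, so the 14-bit inputs, the ACCW accumulator width, and the shift-based decay term are guesses, and the module name GenScalOne and the rA1/rA2/rP1/rP2 registers are hypothetical.

```verilog
// Sketch only: one GenScal output with registered inputs and output.
module GenScalOne #(
    parameter UpdateSpeed = 12,
    parameter ACCW        = 40   // accumulator width (assumed)
) (
    input  wire                   Clk,
    input  wire signed [13:0]     A1, A2, P1, P2,
    output reg  signed [ACCW-1:0] AP
);
    reg signed [13:0]     rA1, rA2, rP1, rP2;  // extra level: input registers
    reg signed [27:0]     MulAP1, MulAP2;      // registered products
    reg signed [28:0]     SumAP;               // A1*P1 + A2*P2
    reg signed [ACCW-1:0] ScalAP;              // leaky accumulator

    initial begin
        ScalAP = 0;
        AP     = 0;
    end

    always @(posedge Clk) begin
        // Extra level 1: capture the inputs before they reach the multipliers.
        rA1 <= A1; rA2 <= A2;
        rP1 <= P1; rP2 <= P2;

        MulAP1 <= rA1 * rP1;
        MulAP2 <= rA2 * rP2;
        SumAP  <= MulAP1 + MulAP2;

        // Decay written as a shift here; the posted expression is ambiguous.
        ScalAP <= ScalAP + SumAP - (ScalAP >>> UpdateSpeed);

        // Extra level 2: the module output is a register of its own.
        AP <= ScalAP;
    end
endmodule
```

The extra registers add two clocks of latency and keep routing from the pins or from neighbouring modules out of the multiplier and accumulator paths. Whether that raises fMax depends on where the critical path actually is, which is why the timing report matters.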

 

How are you obtaining that 330MHz? Are you synthesizing your entire design or just that module? 

Which is the critical path?
Altera_Forum
Honored Contributor II
560 Views

Hi, 

 

 

--- Quote Start ---  

Try changing the Optimization Technique from Balanced to Speed. 

--- Quote End ---  

I have several (not all) optimizations switched on for speed. 

 

If I synthesize one instance of this module in the complete project, I get FMax=330MHz. With 10 instances, I get only FMax=310MHz :(

 

 

--- Quote Start ---  

Add two more register levels: register the inputs and the outputs. 

--- Quote End ---  

Please help me with a short example of this; I did not get the idea!

 

 

--- Quote Start ---  

How are you obtaining that 330MHz? Are you synthesizing your entire design or just that module? 

Which is the critical path? 

--- Quote End ---  

Actually, I measure FMax for the entire design, but it is not too complicated: right now the inputs come from HSTC LVDS data, and the output is pipelined through the "artificial part" to GPIO.

 

I am a newbie in FPGA design; I just moved into this field after 20 years of experience in massively parallel numerical math. I tried to use "set_false_path", but it seems that I did not set it up properly, and I cannot figure out where my critical path is.

 

PS: I can publish the entire project. It is just 600 lines: 100 lines are already here, 300 lines come straight from Terasic, and the rest just bind different inputs and outputs to each other.

 

Sincerely, 

 

Ilghiz