Fmax is small, please, help me to improve it!

Altera_Forum
Honored Contributor II
1,703 Views

Hi,

In my small SystemVerilog project (about 1000 lines) I have one module that is critical to performance (attached below).

This module takes 4x16-bit words (In) per clock (Clk), pipelines them over N stages (Data), calculates scalar products over all possible combinations, and sums them using something similar to a FIR filter. I read the data out of the module on a second clock (ClkSW). I urgently need Clk to run at 400 MHz; ClkSW can run at 100 MHz.

I am experimenting on the Stratix III on a DE3 board (EP3SL150F1152C2), whose DSP blocks can reach 440 MHz on a*b+c*d operations. I made everything maximally pipelined, so each parallel line performs only one operation per stage, and I can see that Quartus uses DSP blocks for my multiplier pairs and implements them optimally as a*b+c*d operators.

Of course I switched on all available compiler optimizations for speed and switched off any reuse of previous synthesis results, to keep the experiment clean.

In my module, the parameter N sets the total amount of similar parallel work to be performed.

Here is a table of the results I am getting:

 N  SHR  Fmax 0C/85C  Logic  DSP  Total synth. time
18   14  337/311 MHz    22%  75%  29 min
18    2  331/308 MHz    19%  75%  28 min
12    2  366/340 MHz    13%  50%  17 min
 8    2  370/339 MHz     8%  33%  11 min
 6    2  413/384 MHz     6%  25%   8 min
 4    2  448/410 MHz     4%  17%   5 min
 4   14  382/355 MHz     5%  17%   6 min
 2   14  432/401 MHz     2%   8%   3 min

Indeed, when my DSP usage is small (N=2 or 4), I can get close to peak performance (Fmax for the multipliers should be about 440 MHz), but I cannot reach it when I use many multipliers, even though the module behaves exactly the same.

I have fought with this module for almost a month, trying to add intermediate registers, but it does not help: I cannot reach even 400 MHz (which would be enough for me) for N=16/18. For large projects I cannot run many attempts, since each recompile costs me half an hour.

Please suggest what I should try in order to reach Fmax=400 MHz for Clk with N=18. I urgently need it; otherwise I will have to demux my data and use at least an SL340, with its impressive $8000 price :(

Sincerely,

Ilghiz

Here is my module, you can try it with your Quartus using Stratix III and see my problem:

module test(Clk, In, ClkSW, SW, Scal);
  parameter N=18;   // can be 2, 4, 6, ..., but I need 16 or 18
  parameter SHR=14; // can be 2, 3, 4, 5, ..., but I need 12-20
  input Clk, ClkSW;
  input signed In;
  input SW;
  reg signed Scal;
  output Scal;

  // Memory ////////////////////////////
  reg signed D, Data;
  reg signed Mul;
  reg signed Sum, Sum2;
  reg signed ScalX;
  reg signed ScalY;
  reg InDataCounter;

  // Reading Data from Channels - the key place where I cannot achieve to clock it with 400MHz for N=16, or 18
  always @(posedge Clk) begin
    for(int i=0; i<2; i++) for(int j=0; j<4; j++) D<=Data;
    for(int j=0; j<4; j++) Data<=In;
    for(int i=0; i<N-1; i++) for(int j=0; j<4; j++) Data<=Data;
    InDataCounter<=~InDataCounter;
    for(int i=0; i<N; i+=2) for(int j=0; j<4; j++) for(int k=0; k<4; k++) begin
      Mul <=D*Data;
      Mul<=D*Data;
    end
    for(int i=0; i<N; i+=2) for(int j=0; j<16; j++) begin
      Sum<=Mul+Mul;
      Sum2<=Sum; // intermediate register that helps a lot
      ScalX<=ScalX-(ScalX>>>SHR);
      ScalX<=Sum2+ScalX;
      ScalY<=ScalX>>>(16+SHR);
      ScalY<=ScalX>>>(16+SHR);
    end
  end

  // Output, it is clocked with 100MHz and I hope that is not relevant to my performance problem
  always @(posedge ClkSW) begin
    for(int i=1; i<N; i++) begin
      case(SW)
        4'b0000: begin Scal<=ScalY; Scal<=ScalY; end
        4'b0001: begin Scal<=ScalY; Scal<=ScalY; end
        4'b0010: begin Scal<=ScalY; Scal<=ScalY; end
        4'b0011: begin Scal<=ScalY; Scal<=ScalY; end
        //
        4'b0100: begin Scal<=ScalY; Scal<=ScalY; end
        4'b0101: begin Scal<=ScalY; Scal<=ScalY; end
        4'b0110: begin Scal<=ScalY; Scal<=ScalY; end
        4'b0111: begin Scal<=ScalY; Scal<=ScalY; end
        //
        4'b1000: begin Scal<=ScalY; Scal<=ScalY; end
        4'b1001: begin Scal<=ScalY; Scal<=ScalY; end
        4'b1010: begin Scal<=ScalY; Scal<=ScalY; end
        4'b1011: begin Scal<=ScalY; Scal<=ScalY; end
        //
        4'b1100: begin Scal<=ScalY; Scal<=ScalY; end
        4'b1101: begin Scal<=ScalY; Scal<=ScalY; end
        4'b1110: begin Scal<=ScalY; Scal<=ScalY; end
        4'b1111: begin Scal<=ScalY; Scal<=ScalY; end
      endcase
    end
    case(SW)
      0: Scal<=ScalY;
      1: Scal<=ScalY;
      2: Scal<=ScalY;
      3: Scal<=ScalY;
      4: Scal<=ScalY;
      5: Scal<=ScalY;
      6: Scal<=ScalY;
      7: Scal<=ScalY;
      8: Scal<=ScalY;
      9: Scal<=ScalY;
      10: Scal<=ScalY;
      11: Scal<=ScalY;
      12: Scal<=ScalY;
      13: Scal<=ScalY;
      14: Scal<=ScalY;
      15: Scal<=ScalY;
    endcase
  end
endmodule

Altera_Forum

Your test.v shows the beauty and strength of real high level languages. But unfortunately this may also be a weakness when you target 'resource-fixed' architectures like FPGAs. 

If you inspect the inferred altmult_add files and enter into the (lowest) .tdf file you will see that almost all ports are unregistered. The top line is very long, but if you edit it (by inserting carriage returns) you can see all the assumptions taken. 

I recompiled (10.0 SP1 Web) for N=2 and SHR = 2, and failed timing by 28 ps only. 

If you select an inferred altmult_add in the navigation window and locate it in the Resource Property Editor, you can see that 'dataa[]' is unregistered but 'datab[]' is registered. In the TimeQuest failed path reports I can see that this unregistered 'dataa[]' accounts for 718 ps of interconnect delay.
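
For scale: at the 400 MHz target the clock period is 2.5 ns, so 718 ps of interconnect delay on an unregistered multiplier input uses up more than a quarter of it. One way to try to avoid that is to write the a*b+c*d stage with explicit registers on every operand, on each product, and on the sum, so that synthesis has registers available to place at the DSP block inputs and outputs. The following is a minimal sketch, not the code from this thread: the module name mult_add_reg, the parameter W and all signal names are assumptions made for illustration, and whether Quartus actually packs these registers into the DSP block depends on the tool version and settings, so the result should be checked in the Resource Property Editor.

```systemverilog
// Hypothetical module: names and widths are illustrative, not from the thread.
module mult_add_reg #(
    parameter int W = 16                      // operand width (the thread uses 16-bit words)
) (
    input  logic                clk,
    input  logic signed [W-1:0] a, b, c, d,
    output logic signed [2*W:0] result        // a*b + c*d needs 2*W+1 bits
);
    logic signed [W-1:0]   a_r, b_r, c_r, d_r;  // input registers
    logic signed [2*W-1:0] ab_r, cd_r;          // product registers

    always_ff @(posedge clk) begin
        // Stage 1: register every operand, so that no multiplier input
        // is driven directly through long routing from other logic.
        a_r <= a;
        b_r <= b;
        c_r <= c;
        d_r <= d;
        // Stage 2: register each product.
        ab_r <= a_r * b_r;
        cd_r <= c_r * d_r;
        // Stage 3: register the sum of the two products.
        result <= ab_r + cd_r;
    end
endmodule
```

The result appears three clocks after the operands, so any surrounding logic that relies on a fixed latency has to be delayed to match.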

Altera_Forum

Dear Josyb,

thank you for your kind answer.

Could you, or somebody else, please explain why unregistered memory causes such a delay, how to fix it, and why I get unregistered memory here?

I ask because of the following: after you pointed at the multipliers, I decided to add intermediate pipeline registers (D1 and Data1 in the attached code) and got some improvement in Fmax:

 N  SHR  Fmax 0C/85C  Logic   DSP  Total synth. time
24   14  363/336 MHz    33%  100%  56 min
12   14  376/346 MHz    16%   50%  19 min
 6   14  405/383 MHz     8%   25%   9 min
24    2  383/357 MHz    29%  100%  43 min
12    2  407/380 MHz    15%   50%  23 min
 6    2  443/413 MHz     7%   25%  12 min

However, I cannot figure out by myself when I should apply such tricks, or what other tricks are available for improving Fmax!

PS: in my design I am free to add more pipeline stages, but where? Please help me with a procedure for finding the right places. I can see something in the "Property Editor", but I cannot interpret it well enough to make the correct decision. Please help me!

Thank you in advance!

Sincerely,

Ilghiz

module test(Clk, In, ClkSW, SW, Scal);
  parameter N=6;    // can be 2, 4, 6, ..., but I need 18 and dreaming about 24
  parameter SHR=14; // can be 2, 3, 4, 5, ..., but I need 12-20
  input Clk, ClkSW;
  input signed In;
  input SW;
  reg signed Scal;
  output Scal;

  // Memory ////////////////////////////
  reg signed D, Data;
  reg signed D1, Data1; // new pipeline registers
  reg signed Mul;
  reg signed Sum, Sum2;
  reg signed ScalX;
  reg signed ScalY;
  reg InDataCounter;
  reg signed ScalY1, ScalY2; // new intermediate
  reg signed ScalY3, ScalY4; // registers for simple output

  // Reading Data from Channels - the key place where I cannot achieve to clock it with 400MHz for N=16 or 24
  always @(posedge Clk) begin
    for(int i=0; i<2; i++) for(int j=0; j<4; j++) D<=Data;
    for(int j=0; j<4; j++) Data<=In;
    for(int i=0; i<N-1; i++) for(int j=0; j<4; j++) Data<=Data;
    InDataCounter<=~InDataCounter;
    //
    for(int i=0; i<N; i++) for(int j=0; j<4; j++) Data1<=Data; // new pipeline registers that helps for small N
    for(int i=0; i<2; i++) for(int j=0; j<4; j++) D1<=D; // new pipeline registers that helps for small N
    //
    for(int i=0; i<N; i+=2) for(int j=0; j<4; j++) for(int k=0; k<4; k++) begin
      Mul <=D1*Data1;
      Mul<=D1*Data1;
    end
    for(int i=0; i<N; i+=2) for(int j=0; j<16; j++) begin
      Sum<=Mul+Mul;
      Sum2<=Sum;
      ScalX<=ScalX-(ScalX>>>SHR);
      ScalX<=Sum2+ScalX;
      ScalY<=ScalX>>>(16+SHR);
      ScalY<=ScalX>>>(16+SHR);
    end
  end

  // Output - different from previous one, just to save space...
  always @(posedge ClkSW) begin
    for(int i=0; i<N; i++) for(int j=0; j<16; j+=2) ScalY1<=ScalY+j];
    for(int i=0; i<N; i++) for(int j=0; j<8; j+=2) ScalY2<=ScalY1+j];
    for(int i=0; i<N; i++) for(int j=0; j<4; j+=2) ScalY3<=ScalY2+j];
    for(int i=0; i<N; i++) ScalY4<=ScalY3];
    Scal<=ScalY4];
  end
endmodule

Altera_Forum

Ilghiz,

Increasing Fmax further probably means messing up your nice code ...

To make full use of the registers inside the altmult_add blocks, which are currently inferred by synthesis from your source code, you have to use the MegaWizard to define that block yourself, in this case for speed, by enabling all pipeline registers inside the DSP block itself. That's the easy part; the hard work is instantiating this block in your code. Unfortunately I know very little Verilog, let alone SystemVerilog, so I can't help you much here.

Later on you can replace the additions with calls to lpm_add_sub (with pipelining), or define your own pipelined adder block. (I also noticed failing paths in the N=18, SHR=14 compilation due to long adder chains.)

Altera_Forum

Josyb,

thank you for your kind suggestion. I did make a small improvement with the MegaWizard and reached 400 MHz at 0C for N=24 and SHR=14; however, it is still very unstable:

 N  SHR  Fmax 0C/85C  Logic   DSP  Total synth. time
24   14  401/369 MHz    53%  100%  115 min
12   14  362/346 MHz    26%   50%   55 min
 6   14  451/418 MHz    13%   25%   20 min

So the behavior is very strange: sometimes it is fast, sometimes not, and the synthesis time is impressive, almost 2 hours on a modern quad-core i7. Because of this instability I will probably switch to the SL340 and demux my global clock; otherwise I will have to keep fighting the unstable results of this fitter.

I did manage to write nice code around the MegaWizard block, again in under 100 lines :), which I am posting below.

PS, and off-topic, to the Altera Quartus developers: if you are interested in improving the Quartus fitter using GPUs or massively parallel platforms, or in applying better mathematics in the fitter, do not hesitate to ask for our help.

Sincerely,

Ilghiz

--

Elegant Mathematics Ltd.

module TestOne(Clk, A1, A2, B1, B2, SW, Res);
  parameter SHR=6; // can be 2, 3, 4, 5, ..., but I need 12-20
  input Clk, SW;
  input signed A1, A2, B1, B2;
  output reg signed Res;
  reg signed P1, P2, Q1, Q2;
  reg signed Mul1, Mul2;
  reg signed Sum, Sum2;
  reg signed ScalX1, ScalX2, ScalX3, ScalX4;

  // you need to install altmult_add module and call it as "my_mmadd"
  my_mmadd my_mmadd_module(Clk, P1, Q1, P2, Q2, Sum);

  always @(posedge Clk) begin
    P1<=A1; P2<=A2; Q1<=B1; Q2<=B2;
    // Mul1<=P1*Q1; Mul2<=P2*Q2;
    // Sum<=Mul1+Mul2;
    Sum2<=Sum;
    ScalX2<=ScalX1+(ScalX1>>>SHR);
    ScalX4<=ScalX3+Sum2;
    ScalX1<=ScalX4;
    ScalX3<=ScalX2;
    Res<=(SW)?ScalX1:ScalX3;
  end
endmodule

module test(Clk, In, ClkSW, SW, Scal);
  parameter N=24; // can be 2, 4, 6, ..., but I need 18
  input Clk, ClkSW;
  input signed In;
  input SW;
  output reg signed Scal;

  // Memory
  reg signed D, Data;
  reg InDataCounter, SW0;
  wire signed ScalY;
  reg signed ScalY1, ScalY2;
  reg signed ScalY3, ScalY4;

  // Generating modules
  generate
    genvar i, j, k;
    for(i=0; i<N; i+=2) begin : aaa
      for(j=0; j<4; j++) begin : bbb
        for(k=0; k<4; k++) begin : ccc
          TestOne TestOne_Module(Clk, D, D, Data, Data, SW0, ScalY);
        end
      end
    end
  endgenerate

  // Reading Data
  always @(posedge Clk) begin
    for(int i=0; i<2; i++) for(int j=0; j<4; j++) D<=Data;
    for(int j=0; j<4; j++) Data<=In;
    for(int i=0; i<N-1; i++) for(int j=0; j<4; j++) Data<=Data;
    InDataCounter<=~InDataCounter;
    SW0<=SW^InDataCounter;
  end

  // Output
  always @(posedge ClkSW) begin
    for(int i=0; i<N/2; i++) for(int j=0; j<16; j+=2) ScalY1<=ScalY+j];
    for(int i=0; i<N/2; i++) for(int j=0; j<8; j+=2) ScalY2<=ScalY1+j];
    for(int i=0; i<N/2; i++) for(int j=0; j<4; j+=2) ScalY3<=ScalY2+j];
    for(int i=0; i<N/2; i++) ScalY4<=ScalY3];
    Scal<=ScalY4];
  end
endmodule

Altera_Forum

Ilghiz,

I think there is room for one further improvement: the lines ScalX2<=ScalX1+(ScalX1>>>SHR); ScalX4<=ScalX3+Sum2; result in two reasonably large adders, which probably cause the 'unstable' Fmax result. It would be a good idea to write a module that calculates the sum over two clocks in a split manner: on the first clock you add the lower halves of the operands while pipelining the upper halves, and on the second clock edge you add the upper halves together with the carry-out of the first operation.

After that you can probably save a few pipeline stages in your module, as you will end up with a few back-to-back registers with no logic between them.
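
The split addition described above can be sketched in SystemVerilog roughly as follows. This is only an illustration under stated assumptions: the module name split_adder, the width parameter W (defaulting to 56) and all signal names are made up for the example and do not come from the thread's code. The adder works modulo 2**W, so the same bits are also correct for two's-complement signed operands.

```systemverilog
// Hypothetical two-stage adder: names and default width are illustrative.
module split_adder #(
    parameter int W = 56                 // total operand width
) (
    input  logic         clk,
    input  logic [W-1:0] x, y,
    output logic [W-1:0] sum             // (x + y) modulo 2**W, two clocks later
);
    localparam int H = W / 2;            // width of the lower half

    logic [H:0]     lo_r;                // lower-half sum; bit H is its carry-out
    logic [W-H-1:0] x_hi_r, y_hi_r;      // upper halves delayed by one clock

    always_ff @(posedge clk) begin
        // Clock 1: add the lower halves and keep the carry-out,
        // while the upper halves are only delayed.
        lo_r   <= {1'b0, x[H-1:0]} + {1'b0, y[H-1:0]};
        x_hi_r <= x[W-1:H];
        y_hi_r <= y[W-1:H];
        // Clock 2: add the delayed upper halves plus the stored carry,
        // and append the already-computed lower half.
        sum    <= {x_hi_r + y_hi_r + lo_r[H], lo_r[H-1:0]};
    end
endmodule
```

Each carry chain is now about half as long, at the cost of one extra clock of latency. That extra latency matters when the adder sits inside a feedback loop such as an accumulator: the loop then has to be restructured, for example by interleaving independent streams, or its behavior changes.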

Altera_Forum

Josyb,

thank you for your kind suggestions. You were right: these wide operations next to the DSP multipliers cause the large instability. I tried to pipeline them as you described, but that gave only a small improvement. However, when I demuxed my result (Sum) and computed ScalX<=ScalX+Sum2-(ScalX>>>SHR) at half the frequency, everything was OK: I reached 406 MHz with N=24 (384 multipliers) and a very wide (56-bit) ScalX.

 

Thank you for your kind advice!

Sincerely,

Ilgis
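
As a rough check on why ScalX gets so wide, the update ScalX<=ScalX+Sum2-(ScalX>>>SHR) from the last post can be read as a leaky integrator. The short derivation below assumes 16-bit signed inputs (as stated in the first post), an accumulator that starts at zero, and uses the shorthand symbols y for ScalX, s for Sum2 and k for SHR, none of which appear in the thread. It is an estimate, not the author's own sizing.

```latex
% y_n : ScalX after n updates, s_n : Sum2, k : SHR (shorthand, not from the thread)
\begin{align*}
y_{n+1} &= y_n + s_n - \left\lfloor y_n / 2^{k} \right\rfloor
  && \text{(the update above; arithmetic right shift by $k$)} \\
\left\lfloor y / 2^{k} \right\rfloor = s
  &\;\Rightarrow\; y \approx 2^{k}\, s
  && \text{(constant input $s$: the DC gain is $2^{k}$)} \\
|s_n| \le 2 \cdot 2^{15} \cdot 2^{15} = 2^{31}
  &\;\Rightarrow\; |y_n| \le 2^{31+k}
  && \text{(sum of two signed $16 \times 16$ products)} \\
\text{signed bits needed} &\le 33 + k = 53
  && \text{(for $k = 20$)}
\end{align*}
```

Under these assumptions, even the largest SHR asked for in the thread fits in the 56-bit ScalX with a few bits to spare, and the gain of 2^SHR is what the output shift by 16+SHR in the earlier listings removes again.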