
Efficient Verilog coding style

Altera_Forum
Honored Contributor II
2,244 Views

When I changed from using megafunctions to plain Verilog, with wires for the combinatorial logic, the design was faster and used fewer resources (I think). Specifically, this was for an ALU, so it seems the ALUT implementation was more efficient. Are there any guidelines for efficient implementation? This is for Stratix III, which has a new LUT design.

I used a continuous assignment for each ALU function, and a case statement selects which wire drives the output.
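
A minimal sketch of the style described above, for readers who want something concrete. The module name, port names, width, and opcode encoding are all assumptions made up for illustration; this is not the code from the attached project.

```verilog
// Hypothetical ALU sketch: all names, the width, and the opcode
// encoding are illustrative assumptions, not the poster's source.
module alu_sketch #(
    parameter WIDTH = 16
) (
    input  wire [WIDTH-1:0] a,
    input  wire [WIDTH-1:0] b,
    input  wire [2:0]       op,
    output reg  [WIDTH-1:0] y
);
    // One continuous assignment (net declaration assignment) per function.
    wire [WIDTH-1:0] sum_w  = a + b;
    wire [WIDTH-1:0] diff_w = a - b;
    wire [WIDTH-1:0] and_w  = a & b;
    wire [WIDTH-1:0] or_w   = a | b;
    wire [WIDTH-1:0] xor_w  = a ^ b;
    wire             lt_w   = (a < b);   // unsigned compare

    // Purely combinational select. Every branch assigns y, and the
    // default arm covers unused opcodes, so no latch is inferred.
    always @(*) begin
        case (op)
            3'd0:    y = sum_w;
            3'd1:    y = diff_w;
            3'd2:    y = and_w;
            3'd3:    y = or_w;
            3'd4:    y = xor_w;
            3'd5:    y = {{(WIDTH-1){1'b0}}, lt_w};
            default: y = {WIDTH{1'b0}};
        endcase
    end
endmodule
```

Because the functions and the select live in one module as plain expressions, the synthesizer is free to optimize them together. Whether that actually packs into fewer ALUTs than a separate mux block depends on the tool version and settings, so it is worth checking the fitter report rather than assuming.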

Thank you
0 Kudos
11 Replies
Altera_Forum
Honored Contributor II
1,021 Views

Hi SimKnutt 

 

Since you used wires and comb logic, maybe you have unknowingly opted for an asynchronous design. That is much faster, but it is hard to guard against hazards and logic delays. If there are no registers, there can be no register timing violations and no limit on fmax from the registers' point of view.
0 Kudos
Altera_Forum
Honored Contributor II
1,021 Views

Thanks, kaz 

The rest of the design is synchronous; the ALU is just the data path between regs/RAMs. It seems like using a mux to select among add/sub, compare, and and/or/xor used more resources, whereas using wires put the functions in a single LUT per bit.
0 Kudos
Altera_Forum
Honored Contributor II
1,021 Views

It would be helpful to show your Verilog code for one such function, e.g. how you did the wiring instead of a mux for the add case.

0 Kudos
Altera_Forum
Honored Contributor II
1,021 Views

The project archive is attached. I hope it includes a copy of the Verilog source, because that is in a different directory. You can see the whole project as it stands. It compiles but is not ready for simulation.

Thanks again
0 Kudos
Altera_Forum
Honored Contributor II
1,021 Views

But this is a schematic-based project. What about the Verilog project?

0 Kudos
Altera_Forum
Honored Contributor II
1,021 Views

The ALU block was created from the .v file. You should see the .v file if you open the design file for the ALU block. I will attach it as a separate file later.

0 Kudos
Altera_Forum
Honored Contributor II
1,021 Views

The Verilog is attached.

0 Kudos
Altera_Forum
Honored Contributor II
1,021 Views

I am afraid I have to come back to my first conclusion. If you look at the number of registers used in your project, you will find there is only one single register.

 

Your design is therefore not RTL based and will fail functional simulation unless you know how to design asynchronously, which is so far not standardised.

 

By the way, a company called Achronix (I believe) was talking about new FPGAs based on asynchronous design (no clock) and hence very, very fast, yet the design methodology would use conventional RTL to keep it easy. Their tool would convert the user's RTL to their asynchronous architecture. However, that was 4 years ago, and I have not heard of their product coming to the market since. Maybe it was a failed venture.
0 Kudos
Altera_Forum
Honored Contributor II
1,021 Views

It would be helpful if you would focus on the question: "Are there coding style guidelines to get the most efficient use of Stratix III ALUTs?" The ALU is combinatorial logic, which means the signal flow through it is not clocked, just as in any design. NOTHING NEW OR DIFFERENT OR ASYNCHRONOUS. As best I can tell, combining the mux and data flow functions resulted in fewer ALUTs. By the way, the PLL in the design may give you a clue that there is clocking.

0 Kudos
Altera_Forum
Honored Contributor II
1,021 Views

Coding style certainly has plenty of effect on efficiency, though not per device type.

 

The LUT implements comb logic functions, but the RTL methodology recommends following it with a clocked register, and that is why the FPGA architecture is based on units of a LUT followed by a register. Comb paths that are too long can violate timing.

 

There is no strict rule that every LUT/reg pair must be used as such; you may skip some regs or borrow extra regs.

 

In the end, your design is supposed to transfer logic decisions such that one reg level hands them to the next logic and reg level.

 

This approach helps achieve timing as well. Data from every launch register must arrive, after the path delays, in time to meet the timing of the latch register.
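
The launch/latch relationship is commonly written as a setup constraint on each register-to-register path. The following is a sketch in standard static-timing terms; the symbol names are chosen here for illustration and do not come from the thread.

```latex
% Setup check for one register-to-register path (illustrative symbols):
%   T_clk   clock period
%   t_co    clock-to-output delay of the launch register
%   t_comb  delay of the combinational logic (LUTs plus routing) on the path
%   t_su    setup time of the latch register
%   t_skew  clock arrival time at the latch register minus arrival time
%           at the launch register
T_{\mathrm{clk}} \;\ge\; t_{co} + t_{comb} + t_{su} - t_{skew}
```

The longest such path in the design sets the minimum clock period, which is why a long comb path between two registers limits fmax.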

 

If you don't use registers, then you need to apply your own asynchronous techniques (I don't know anything about that).

 

The presence of a clock or PLL is only part of the story, and so is the presence of reg objects; what matters is how many regs are actually clocked, with the clock edge as the sampling point.
0 Kudos
Altera_Forum
Honored Contributor II
1,021 Views

Altera spent money developing Stratix III with a new "fracturable LUT". It can implement any function of 6 inputs. If they do not push that, then they wasted their money.

 

Long comb paths are generally due to long strings of if/else in the HDL, whereas the LUT is a simple 2-port memory that has the same access time no matter how complex the function. Edge-triggered regs are used because they are all that synthesis can handle.

So less function between regs means more time wasted on clock skew and setup/hold times. Those who think pipelining performance is simply a matter of clock speed are sadly mistaken. If the total number of clocks to do a function, times the clock period, is not less than it was before a stage was added, then there is no gain, and more power is used.
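
The condition in the paragraph above can be written out as a small worked model. This is a sketch under simplifying assumptions: the symbols are illustrative, the logic is assumed to split evenly across stages, and the per-stage overhead is assumed to be fixed.

```latex
% N     : number of pipeline stages
% T_N   : clock period achievable with N stages
% L_N   : time to complete one operation with N stages
L_N = N \, T_N

% Adding a stage shortens the time per operation only if
(N+1)\, T_{N+1} \;<\; N \, T_N

% Simple model: total logic delay t_logic split evenly across stages, plus a
% fixed per-stage overhead t_ov (clock-to-output, setup, clock skew):
T_N \approx \frac{t_{logic}}{N} + t_{ov}
\quad\Longrightarrow\quad
L_N \approx t_{logic} + N \, t_{ov}
```

Under this model every added stage contributes another t_ov to the time per operation, which is the "wasted time" on skew and setup/hold described above.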

 

You are so hung up on the asynchronous notion that I cannot believe it. Long before TTL and edge-triggered flip-flops, there were multiple clock pulses per machine cycle, and regs were simply latches. The embedded memory blocks are somehow not truly edge triggered, yet they are synthesized, so I am using the memories with a multi-clock cycle to essentially do a latched data flow. That is why you don't see dedicated regs. Use whatever is available from the technology.
0 Kudos