Stratix V ALM/LUT/register usage imbalance

Altera_Forum · 11-22-2011

Hi,

I am implementing signal processing algorithms on Stratix V and noticed that usage of registers is on the order of 1/4 compared to ALMs and about 1/1 compared to LUTs. This seems unsurprising, since there are 4 registers per ALM. I normally register the outputs of each pipeline stage of an adder tree, and also the inputs/outputs of multipliers (however, these do not count in the register statistics because they are inside the DSP block).

It would then seem natural to use more pipeline registers, for example to split very wide adders into MSB/LSB halves with a 2-stage pipeline, which cuts cell delay and leaves more of the clock period for routing. However, when I do this, not only does the register count in the utilization report increase, but ALM utilization does too.

The question is whether this happens only because there is plenty of space on the device, so the fitter does not bother to pack registers into as few ALMs as possible.

In other words, would it pay off in the long run to use this vast number of registers, so that routing is easier when the project gets bigger and the device fuller? Or would it be counterproductive, because the registers consume ALMs that would otherwise be spare?

Thanks,

Michal

Altera_Forum · 11-23-2011

Hi Michal,

I don't have a direct answer for your question. But I do have a suggestion.

When trying to figure out how to optimize a VHDL description into hardware, I make extensive use of the RTL viewer, and play with timing constraints.

I have vague recollections that some of the registers in the FPGA (though I might be recalling Xilinx parts here) do not have asynchronous reset ports, or reset ports at all. So if your RTL description includes a logic feature that does not exist on the registers on a DSP block, then the tool has to use the resources in the fabric. So, first, go and check the ports on the megafunction of a component that you may be trying to infer from HDL.

Next set fmax to something unreasonably high, and then see where the worst-case timing paths are. You should be able to coax the tool into using the right registers to cut paths.

Cheers,

Dave

Altera_Forum · 11-23-2011

If by wide adder, you mean adding two large values, then it's not going to help. A single add will use the carry-chain and the register after that, but adding a second register won't really make a difference and most likely won't get packed with it. If by wide adder you mean adding many values, then it could make a difference. I believe the ALM can do a ternary add natively, so that would make the most sense as to where to pipeline.
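To put a rough number on the ternary-add point: a minimal sketch, assuming one register stage after every adder level and that each adder takes either 2 or 3 operands. The function name `adder_tree_levels` is just for illustration, and this says nothing about how Quartus actually packs the logic.

```python
def adder_tree_levels(num_operands, fan_in):
    """Count adder levels needed to reduce num_operands values to one sum.

    fan_in is the number of operands each adder accepts
    (2 for a binary adder, 3 for a ternary adder).
    """
    levels = 0
    remaining = num_operands
    while remaining > 1:
        # Each level groups the remaining values into adders of size fan_in
        # (ceiling division: a partially filled group still needs a slot).
        remaining = -(-remaining // fan_in)
        levels += 1
    return levels


for n in (9, 16):
    print(n, "operands:",
          adder_tree_levels(n, 2), "binary levels vs",
          adder_tree_levels(n, 3), "ternary levels")
```

With a register after each level, fewer levels means fewer pipeline stages and less latency for the same number of inputs.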

Stratix V has twice as many registers per ALM as Stratix IV, but I believe more often than not you're not going to be able to use them. Their biggest advantage is for the LUTRAM, as Stratix IV had to use a lot of registers outside of the LAB to build a RAM, while Stratix V uses these extra registers, and I believe the only ones pushed outside are the read address.

Beyond that, there are quite a few restrictions. There are LAB-wide signals like clock enable, synchronous clear and sync load that are limited. There are only 8 inputs to an ALM, so if you're bringing much in for the logic you won't have enough left for these extra registers. There are only four outputs. I've played some with trying to get better utilization from these, and besides simple stuff like "4 registers with a clock enable", I usually don't get much above two registers per ALM. (And quite often even that is difficult.)

Altera_Forum · 11-23-2011

@dwh

I use resets only when necessary, and that is not the typical case for a datapath, where you can usually afford unknown data for a few cycles after initialization until the pipeline flushes it out.

I already tried an fmax higher than necessary (350 vs 200 MHz) and ran into timing problems with wide adders. There the cell delay was already 2 ns, leaving only about 1 ns for routing.
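The routing budget can be estimated with simple arithmetic. This is only a rough sketch: it assumes the budget is the clock period minus cell delay, and ignores setup time, clock skew and clock uncertainty, which a real timing report would include. The helper name `routing_budget_ns` is made up for the example.

```python
def routing_budget_ns(fmax_mhz, cell_delay_ns):
    """Clock period minus cell delay, in ns (setup time and skew ignored)."""
    period_ns = 1000.0 / fmax_mhz
    return period_ns - cell_delay_ns


# 2 ns of cell delay at the required and at the over-constrained clock rate.
for fmax_mhz in (200, 350):
    budget = routing_budget_ns(fmax_mhz, 2.0)
    print(f"{fmax_mhz} MHz: {budget:.2f} ns left for routing")
```

At 350 MHz the period is about 2.86 ns, so 2 ns of cell delay leaves a little under 1 ns for routing.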

@Rysc

By wide adder I mean a 2-input adder with wide inputs, e.g. >30 bits. The carry chain becomes very long. While Quartus/Stratix V does a good job and the delay of a single carry is quite short, the chain has to be split across multiple LABs, which can make it longer; moreover, there will be routing delay from e.g. a DSP block.

Indeed, you can increase fmax with 2 FFs, but the 2nd FF is not a pure pipeline register. In the first cycle you add the lower halves of the inputs and just delay the MSB halves. In the second cycle the carry from the LSB adder is fed in at the LSB of the MSB halves, and the MSB halves are added. The output is then formed by merging the results of the MSB and LSB adders.

That way, you split the long carry chain into two parts, which helps timing.

According to Altera it is slightly less than twice as fast as an equivalent unpipelined adder.

See stx_cookbook.pdf, chapter "Pipelined Adder Chains" (the forum prevents me from pasting a link).
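The MSB/LSB split described above can be modelled behaviourally, roughly like this. It is a cycle-by-cycle software sketch, not HDL and not the cookbook's code; the function name `split_add_pipeline`, the even half/half split and the dropped carry-out are assumptions made for the example.

```python
def split_add_pipeline(pairs, width=32):
    """Model a 2-stage split adder: one (a, b) pair enters per clock cycle.

    Returns the sums modulo 2**width (the final carry-out is dropped).
    """
    half = width // 2
    lo_mask = (1 << half) - 1
    hi_mask = (1 << (width - half)) - 1
    stage1 = None  # pipeline register between the two stages
    results = []
    for item in list(pairs) + [None]:  # one extra cycle flushes the pipeline
        # Stage 2: add the delayed MSB halves plus the registered carry,
        # then merge with the LSB sum computed one cycle earlier.
        if stage1 is not None:
            lo_sum, carry, a_hi, b_hi = stage1
            hi_sum = (a_hi + b_hi + carry) & hi_mask
            results.append((hi_sum << half) | lo_sum)
        # Stage 1: add the LSB halves only; the MSB halves are just delayed.
        if item is None:
            stage1 = None
        else:
            a, b = item
            lo = (a & lo_mask) + (b & lo_mask)
            stage1 = (lo & lo_mask, lo >> half,
                      (a >> half) & hi_mask, (b >> half) & hi_mask)
    return results


inputs = [(0xFFFF, 1), (0xFFFFFFFF, 1), (123456789, 987654321)]
print(split_add_pipeline(inputs))
```

Each stage only ever adds half-width operands, which is the point of the technique: the longest carry chain is half as long, at the cost of one extra cycle of latency.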

LUTRAM sounds well suited to using this abundance of registers. However, when I use only a read address register, without an output register, total ALM utilization is much lower than the other way round. What do you mean by the read address being pushed outside? Isn't it the case that, whether you use the read address register or the output (Q) register for the LUTRAM, it will use the very same 4 registers in the ALM?

I agree that it is hard to use more than 2 regs per ALM. Altera claims that "(...) Previously, an ALUT was wasted when a pipeline register was used to retime a design. Second, this change allows for collocation of the pipeline register with the logic it is intended for.(...)" (See wp-01172-design-optimization.pdf, Tip 2).

However, I don't see this happening. Either the design approach for Stratix V has to be somehow different than for older families, or Quartus is not yet intelligent enough to use these 4 registers efficiently.

Regards,

Michal

Altera_Forum · 11-23-2011

Pipelining in the middle will certainly improve performance with very little additional area (even if the registers aren't packed, it's one register stage added to a 30+ bit adder). The big hit is the extra latency, but if that is all right in your design then it should be good.

Note that most designs don't have a lot of lone registers just sitting around. What I find is that synthesis does a good job of taking small amounts of logic feeding a register and rolling them into the control signals, such as the clock enable, sync clear and sync load. This takes a register that would have had a LUT in front of it and makes the LUT disappear. This is good, but quite often this register can't be packed as tight anymore. So it looks like there are a bunch of lone registers that can't be packed, and from a perception standpoint, that is bad.

As for the memory, I only tested LUTRAMs with registered outputs, so I'm not sure if the read address can be packed into the output registers. Note that the LUTRAM is a pretty well defined block, so I'm not sure if the registers can be used for just any function.