Re: Maximum inputs per LAB

Altera_Forum · ‎06-20-2018

Can't find this documented anywhere. Is there a cap on the total number of signals that can be routed into a single LAB? Take Cyclone V, to be specific.

From the datasheets, I was under the impression that the only limit is on signals per ALM: 8 data inputs, not counting clock, carry, etc., which across 10 ALMs would allow me to route 80 incoming wires into each LAB. But I've been trying to work out the root cause of routing difficulties in my design, started manually assigning logic to locations, and got a curious error message:

"Info (170015): LAB legality constraint that was not satisfied: LAB requires more input signals requiring LAB lines than are available. Resources used: 65. Resources available: 46."

Altera_Forum · ‎06-20-2018

It's usually a bad idea to try to do manual location assignments, as you've discovered. Just let the Fitter do its thing. What are you trying to accomplish here?

Altera_Forum · ‎06-20-2018

I wouldn't be trying to do this if the Fitter were doing a good job.

I have a large but relatively straightforward design with a wide (512-bit) internal pipeline. The way the logic is structured, it seems like it should be possible to make the pipeline run left to right (put each stage of the pipeline into 128 ALMs in the same column), and use direct LAB to LAB links for interconnection.

Instead of doing that, the fitter gets really creative. Each stage ends up in a cloud-shaped formation, with clouds running top to bottom, or diagonally, or in a spiral. Direct links sit almost idle (4% usage), everything ends up on C and R interconnects, the routing is overloaded, and I end up with an Fmax of 140 MHz because some signal somewhere incurs a 5 ns interconnect delay just traveling from a source register to a destination register via a circuitous path.

I've been trying to encourage the fitter to do the right thing (or, alternately, to understand why it can't be done), but the fitter is proving extremely resistant to my coaxing.

Altera_Forum · ‎06-20-2018

A better solution is to first look at your timing constraints and timing reports. Can you post your .sdc file and the report on the path that is failing timing? Based on that, you can make choices for optimizing the design, either by changing synthesis and Fitter settings or by adjusting your HDL. Manual location assignments should be your absolute last resort.

Altera_Forum · ‎06-20-2018

Failing paths differ with every run. The one commonality is routing overload. The entire area where pipeline stages sit is bright red in the "routing hotspots" view. And it is difficult to optimize the design if you don't know the limitations of the hardware. If there is indeed an undocumented limit on the number of inputs per LAB, I could at least try to rework the algorithm to reflect that constraint.

Altera_Forum · ‎06-20-2018

Again, it would be most helpful to start with your timing constraints and to see the failing path report.

Altera_Forum · ‎06-20-2018

Right off the bat, I notice that you have over 0.5 ns of clock skew, which could potentially be high (I usually see it down at 100 ps or lower), and there is over 2 ns of delay on the launch clock path. Can you use a regional (quadrant) clock instead of one of the global clock resources? Regional clock resources have less skew and a smaller insertion delay.

What do you have as your optimization mode in the Compiler Settings? Make sure it is set to one of the performance modes.

The path is indeed using a lot of extra routing. Are you meeting hold timing in the fast timing model(s)? One thing you could try is turning off the "Optimize hold timing" option in the advanced Fitter settings. The Fitter may be trying to meet hold timing by adding extra routing in the path at the expense of setup timing, which could explain the routing congestion.

Can you post the code for the combinational logic between these registers?

Altera_Forum · ‎06-20-2018

Regional clocks are an interesting idea; I hadn't even considered their effect. I really don't want to put in clock-crossing logic at this point, but it's something to keep in mind for the future.

The optimization mode was "Power - High Effort" (but results are more or less the same in "Performance - High Effort"). I am meeting hold timing in all timing models. Going to rerun in "Performance" with the hold timing option turned off. Back to you in an hour.

The code for this particular path was effectively

out[57] <= in[56] ^ in[120] ^ in[312] ^ in[472];

Altera_Forum · ‎06-21-2018

Interesting. Turning off hold timing optimization somewhat eases the routing pressure. Setup failures in this particular pipeline go away (at least at 180 MHz), but instead I see a bunch of hold failures elsewhere in the design.

Noticed that the Chip Planner confirms the existence of a cap on inputs. The tooltip for the "Local Interconnect", the primary routing channel that feeds all ALMs, shows a maximum routing capacity of 46. In addition, there is a "Local Line" that connects ALM outputs to ALM inputs within the same LAB, with a capacity of 20.
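
To put numbers on the mismatch, here is a back-of-envelope check in Python. It is only a sketch: the variable names are mine, and it assumes every ALM data input would be a distinct signal needing its own LAB line, which is the worst case rather than anything the documentation states.

```python
# Back-of-envelope LAB input budget (illustrative names, not a Quartus API).
ALMS_PER_LAB = 10          # implied by 8 inputs/ALM -> 80 inputs/LAB
DATA_INPUTS_PER_ALM = 8    # my reading of the datasheet, excluding clock/carry

LAB_LINES_AVAILABLE = 46   # "Resources available" in message 170015 / tooltip
LAB_LINES_REQUESTED = 65   # "Resources used" in message 170015

# What I expected to be able to feed into one LAB.
naive_capacity = ALMS_PER_LAB * DATA_INPUTS_PER_ALM

# How far over the cap my manual placement was.
shortfall = LAB_LINES_REQUESTED - LAB_LINES_AVAILABLE

# Average number of distinct LAB-line inputs each ALM can get (assumption:
# no input signal is shared between ALMs in the LAB).
avg_inputs_per_alm = LAB_LINES_AVAILABLE / ALMS_PER_LAB

print(naive_capacity)       # 80
print(shortfall)            # 19
print(avg_inputs_per_alm)   # 4.6
```

So a fully populated LAB only works if the ALMs share inputs or take part of their inputs from elsewhere.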

Direct links between adjacent LABs are definitely there, but not visible as such in the Chip Planner. It is unclear how wide they are. If Figure 1-1 in the Device Handbook is at all to scale, they might be very narrow, less than 10 wires per LAB. Either way, direct links seem to count toward the cap of 46 total inputs, and the fitter does not seem at all interested in using them.

Altera_Forum · ‎06-22-2018

With everything packed to maximize horizontal dependencies (everything manually assigned locations and every LUT dependent on the output from a register in a LAB immediately to the left of it), the largest number of direct left-to-right links I see crossing a single LAB boundary is 14. Not sure if that's a hard limit, however. The fitter is still somewhat reluctant to use them: most of the time it won't even use all 14, and instead routes 10 or 11 directly and sends the rest via Rx/Cx interconnects.

Out of 4 registers in each ALM, 2 are only able to drive direct left-to-right links, and the other 2 are only able to drive right-to-left links. So, with 10 ALMs per LAB, the maximum width of any pipeline that can be implemented using only local routing is, at most, 20 bits per LAB.
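
For scale, here is what that bound means for my 512-bit pipeline, as a rough Python sketch. The helper name is made up, and the 14-link figure is just what I observed, not a documented limit.

```python
import math

ALMS_PER_LAB = 10
REGS_PER_ALM_PER_DIRECTION = 2   # 2 of the 4 ALM registers drive each direction
PIPELINE_WIDTH = 512             # bits per pipeline stage in my design
OBSERVED_DIRECT_LINKS = 14       # most I have seen cross one LAB boundary

def labs_needed(width_bits, bits_per_lab):
    """LABs per pipeline stage if each LAB carries bits_per_lab bits onward."""
    return math.ceil(width_bits / bits_per_lab)

# Upper bound from the register-to-direct-link wiring alone.
upper_bound_bits_per_lab = ALMS_PER_LAB * REGS_PER_ALM_PER_DIRECTION

print(upper_bound_bits_per_lab)                            # 20
print(labs_needed(PIPELINE_WIDTH, upper_bound_bits_per_lab))  # 26
print(labs_needed(PIPELINE_WIDTH, OBSERVED_DIRECT_LINKS))     # 37
```

Either way, that is two to three times the 128 ALMs (about 13 LABs) per stage I had originally hoped for.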

Altera_Forum · ‎06-24-2018

A corollary. The handbook states that Cyclone V ALMs in shared arithmetic mode can do 3-input adds, with 2 output bits per ALM. However, that would require 6 inputs per ALM and 60 inputs per LAB. Since there aren't enough feeds to supply all 60 inputs at once (the cap is 46), the carry chain needs to be spread over multiple LABs. If you're adding three 32-bit numbers, you might naively expect to need 2 LABs for that (one full LAB and 6 out of 10 ALMs of the LAB immediately below). In practice, it's going to put bits 0..9 into the bottom half of LAB#1 (say, X78_Y6), bits 10..19 into the top half of LAB#2 (X78_Y5), and bits 20..31 into LAB#3 (X78_Y4). (Confirmed with Chip Planner. Conveniently, the carry chain is allowed to span only half of the LAB.)

The remainder of those LABs won't be totally wasted, though - you still have 16 feed lines left in each of the top two LABs and 10 feed lines left in the bottom LAB, so it will be possible for the fitter to squeeze some extra logic in there.
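
The arithmetic for that placement, as a small Python check. This is a sketch under my own assumptions: each of the 6 adder inputs per ALM is a distinct signal that needs a LAB line, carry signals do not count against the cap, and the function name is invented. It only tallies the placement I saw in Chip Planner; it does not model how the fitter chooses it.

```python
LAB_INPUT_CAP = 46    # Local Interconnect capacity per LAB
INPUTS_PER_ALM = 6    # 3-input add, 2 result bits per ALM
BITS_PER_ALM = 2

# Observed placement of a 32-bit three-operand add: (LAB, first bit, last bit)
placement = [
    ("LAB#1 X78_Y6", 0, 9),
    ("LAB#2 X78_Y5", 10, 19),
    ("LAB#3 X78_Y4", 20, 31),
]

def lab_budget(first_bit, last_bit):
    """Return (ALMs used, LAB inputs used, LAB inputs left) for one segment."""
    bits = last_bit - first_bit + 1
    alms = bits // BITS_PER_ALM
    used = alms * INPUTS_PER_ALM
    return alms, used, LAB_INPUT_CAP - used

for name, first, last in placement:
    alms, used, left = lab_budget(first, last)
    print(f"{name}: {alms} ALMs, {used} inputs used, {left} left")
# LAB#1 X78_Y6: 5 ALMs, 30 inputs used, 16 left
# LAB#2 X78_Y5: 5 ALMs, 30 inputs used, 16 left
# LAB#3 X78_Y4: 6 ALMs, 36 inputs used, 10 left
```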