Need Opinion: Timing Closure with Incremental Compilation and Design Space Explorer

Altera_Forum · ‎02-01-2013

I was given a design with a long logic pipeline running at 300 MHz and a DDR3 memory controller at 166 MHz, in an Arria V GX with Quartus II 12.1 SJ Version.

There were initially 8000+ timing errors, but fortunately most were false paths and multicycle paths, and some were computational paths that could be further pipelined (re-clocked).

That got me down to fewer than 100 errors, but now I am dealing with very tightly packed DSP blocks and memory locations. I followed the "Tips for Incremental Compilation And LogicLock" document written by Ryan Scoville.

The document offers so many strategies and techniques that, by partitioning and using LogicLock, I got down to as few as 9 timing errors in the 100 ps to 300 ps range for the 300 MHz clock! (Later, when I got carried away with 4+ LogicLock regions, which the document does not recommend, timing closure went downhill.)

The issue I am having is that there seems to be no way to get rid of these last tiny timing violations and claim victory...
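For scale, here is a rough sketch of what violations of 100 ps to 300 ps mean against a 300 MHz clock. It assumes a simple single-clock, register-to-register path with a full-cycle requirement, so the achievable frequency is 1 / (period + violation). The function names are made up for illustration and are not part of any Quartus tool.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def violation_fraction(violation_ps, freq_mhz):
    """Size of a setup violation as a fraction of the clock period."""
    return (violation_ps / 1000.0) / period_ns(freq_mhz)


def achievable_mhz(violation_ps, freq_mhz):
    """Frequency at which the failing path would just meet timing.

    Assumption: one full clock period is available to the path, so the
    path needs (period + violation) to pass.
    """
    return 1000.0 / (period_ns(freq_mhz) + violation_ps / 1000.0)


clock_mhz = 300.0
print(f"period at {clock_mhz:.0f} MHz: {period_ns(clock_mhz):.3f} ns")
for violation_ps in (100, 300):
    pct = 100.0 * violation_fraction(violation_ps, clock_mhz)
    fmax = achievable_mhz(violation_ps, clock_mhz)
    print(f"{violation_ps} ps violation: {pct:.1f}% of period, "
          f"path would pass at about {fmax:.1f} MHz")
```

Under these assumptions the remaining violations are a few percent of the period, which is the kind of margin that placement changes from run to run can easily add or remove.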

So I ran Design Space Explorer overnight, hoping to get a "seed" or a suggested setting, and that actually led me the wrong way. I ended up with many timing errors of a few nanoseconds each. For example, the LCELL Insertion setting caused a very long recovery error (on the reset line, where the tool decided to go to a far corner to pick a buffer and then route it back to the logic).

Am I running into the limits of the silicon/device, if these numbers are the best it can do? The counterpoint is that the errors I am getting are related to high fan-out to adders/DSP blocks, while the data path looks optimal in Chip Planner, and the errors are inconsistent from run to run (i.e., the DSP error may be gone, but another error appears elsewhere, such as on a memory block input/output path). I have also tried a bigger device, which resulted in similar timing errors. Any opinions and advice are much appreciated!

Altera_Forum · ‎02-04-2013

I'm guessing you've got a lot of paths that are close to meeting timing, but the fitter can't get them all to converge. With higher clock rates like 300MHz, there aren't any paths that have too much slack, making the problem more difficult.

- Try to determine which paths show up the most often.

- See if they can be modified in any way to improve timing. Pipelining of course, but register duplication, etc. may be helpful. I find that when I'm really close, some straightforward duplication is often enough to gain a little more and hopefully get that last bit out. The following might be helpful:

http://www.alterawiki.com/wiki/register_duplication_for_timing_closure
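One way to act on the first suggestion is to tally which failing paths recur across several fitter runs (for example, different seeds). The sketch below assumes the failing paths of each run have already been collected by hand or by some export step into lists of (path name, slack in ns); the path names and slack values are invented for illustration, and nothing here calls a real Quartus or TimeQuest API.

```python
from collections import Counter


def recurring_failures(runs):
    """Rank failing paths by how many runs they fail in.

    `runs` is a list of runs; each run is a list of (path_name, slack_ns)
    tuples for paths with negative slack. Returns a list of
    (path_name, run_count, worst_slack_ns), most frequent first and,
    for equal counts, worst slack first.
    """
    counts = Counter()
    worst = {}
    for run in runs:
        seen = set()
        for name, slack in run:
            if name not in seen:  # count each path once per run
                counts[name] += 1
                seen.add(name)
            worst[name] = min(slack, worst.get(name, slack))
    return sorted(
        ((name, counts[name], worst[name]) for name in counts),
        key=lambda row: (-row[1], row[2]),
    )


# Hypothetical data from three runs; names and numbers are made up.
example_runs = [
    [("dsp_mult_a -> adder_stage3", -0.120), ("ram_out -> filt_in", -0.080)],
    [("dsp_mult_a -> adder_stage3", -0.095)],
    [("ram_out -> filt_in", -0.210), ("dsp_mult_a -> adder_stage3", -0.060)],
]

for name, n_runs, worst_slack in recurring_failures(example_runs):
    print(f"{name}: failed in {n_runs} run(s), worst slack {worst_slack:.3f} ns")
```

Paths that fail in nearly every run are the better candidates for an RTL change such as pipelining or duplication, while paths that fail only once are more likely placement noise.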

I'd also be curious to hear any anecdotes about what worked or didn't work from the Tips for Incremental Compilation and LogicLock. Good luck.

Altera_Forum · ‎02-05-2013

Thank you very much for the advice!

One additional factor is that the design has high utilization, between 79% and 84%, but moving to the next larger device did not fix the timing of the group of troubled signals I tend to get.

Most of the errors are right at the boundaries between the modules (4 out of 14) where I partitioned them. I've tried register retiming, register duplication, and the do-not-remove-duplicate-registers setting, and the results seem reasonable.

When I attempted to merge those modules, the problem became too complex, and I don't think I am making better decisions than the tools, since all modules are ultimately chained together. In the end, using no partitions can sometimes yield a better result, but then I have nothing under control and can only hope the push-button solution still works when more logic needs to be added!

Trying parent/child LogicLock regions seems to help for certain timing errors where the datapath and placement are critical. For example, Regions 1 and 7 in the screenshot specifically use DSP multipliers/adders; with them packed really closely, these paths fail by 60-100 ps at 300 MHz and come and go on the failing path list. According to the document you provided, they can be set to Post-Fit once there are no timing errors between parent and child. I ended up using Source File for all compiles because the compile time is about the same, and with Post-Fit the utilization is reported at 97% while we still see lots of empty space. It gave us an uncomfortable feeling about picking the size of this FPGA... but low 80% may be OK!

I will try a max fan-out setting on the registers feeding the long stages of adders and keep you posted. One other failing path I have the most trouble with is a block RAM output to a nearby location, for which the tool reports a large delay even though it is physically already placed at the closest input/output... Should these be fixed using the register duplication technique? Thanks again!

https://www.alteraforum.com/forum/attachment.php?attachmentid=6766

Altera_Forum · ‎02-12-2013

After spending a good week optimizing the location of the output pins (to avoid SSO issues) and locating the global resources (clock and reset), it turned out to give an overall improvement in timing!

I was able to quickly use the Assignment Editor to add a handful of nodes that fail timing. After defining very specific submodules and constraining them to a fan-out of only 2-10, this brought the design down to 11 errors. I hope this is good input for your document.
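As a back-of-the-envelope aid for choosing a fan-out limit, the sketch below estimates how many copies of a register a given limit implies and shows one naive way to split its loads among the copies. It is only arithmetic: the function names and the load names are hypothetical, and a real fitter or a hand-written duplication would group loads by physical placement rather than by list order.

```python
import math


def duplicates_needed(fanout, max_fanout):
    """Number of register copies needed so that no copy drives more
    than `max_fanout` loads (at least one copy always exists)."""
    if fanout < 0 or max_fanout < 1:
        raise ValueError("fanout must be >= 0 and max_fanout >= 1")
    return max(1, math.ceil(fanout / max_fanout))


def split_loads(loads, max_fanout):
    """Split a list of load names into groups of at most `max_fanout`.

    Grouping is by list order only; this is a simplification.
    """
    if max_fanout < 1:
        raise ValueError("max_fanout must be >= 1")
    return [loads[i:i + max_fanout] for i in range(0, len(loads), max_fanout)]


# Hypothetical example: one register feeding 48 adder inputs.
loads = [f"adder_in_{i}" for i in range(48)]
for limit in (2, 10):
    groups = split_loads(loads, limit)
    print(f"max fan-out {limit}: {duplicates_needed(len(loads), limit)} copies, "
          f"largest group drives {max(len(g) for g in groups)} loads")
```

A tighter limit costs more registers, which matters in a design that is already around 80% utilized, so it may be worth applying the limit only to the nodes that actually fail timing, as described above.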

As for incremental compilation, a couple of issues I encountered are:

1. The TOP block cannot be set to Empty. This appears to be a tool issue: if any of the modules rely on a PLL input/output, the PLL won't get instantiated and the partition will fail to route.

As a result, I set TOP to "Source File", which still provides the flexibility to partition the rest of the logic.

2. When setting the partitions to Post-Fit, the overall utilization rises from 80% to 97%. It definitely gave us an uneasy feeling about picking the device, while it's obvious from Chip Planner that the FPGA has lots of unused blocks remaining. So this may be another tool issue, and I ended up setting all of the partitions to Post-Synthesis instead, mainly to get an accurate benchmark across all the compiles. With LogicLock I was able to obtain consistent timing results, with sometimes a big step backward when moving the LogicLock regions around. All in all I am happy to dive into this subject and tweak the last bit of timing error out. Thanks for creating these intuitive documents!

Altera_Forum · ‎02-13-2013

For item 1), that should work. Synthesis is actually supposed to find PLLs in empty partitions and bring them out (assuming the connections are direct, which they should be). I've seen this work on other designs. That being said, setting the top-level to Empty isn't very well known and therefore doesn't get used a lot.

For 2), I wonder if it was counting open areas in the LABs of the post-fit logic. Were you using 12.1? The Logic Resource counting is different (although the Logic Utilization should essentially be the same, just the way it's shown and built up is different). I wonder if you could look at utilization in the hierarchy browser and Fit report and see what went up? (I'm guessing either the utilization for that partition went up, or the Estimated Recoverable ALMs dropped.) If a block of logic is set to post-fit, I do believe you're not going to get a lot of logic put in that region, even if you can see a lot of holes, because that's a significantly more difficult fitting problem.

If possible, filing SRs would help Altera debug.

Thanks for the info.