Re: Anyone with Experience compiling Stratix IV device w/ over 95% utilization.

Altera_Forum · ‎10-14-2010

Having trouble getting timing closure with a device that is over 95% full (530 GX). The design is also fast and consumes all the memory (figured I should get my money's worth). Of course, the 5% of remaining logic is larger than some of the largest devices of 5 years ago. Has anyone else had a large device that is this full?

We are currently focused on Incremental Compile, which we have been fighting with for 4 years (since 6.0). It looks like it may actually be working in 10.0sp1. Prior to this it was unusable and my team was stuck doing flat compiles.

Altera_Forum · ‎10-14-2010

--- Quote Start ---

Having trouble getting timing closure with a device that is over 95% full (530 GX). The design is also fast and consumes all the memory (figured I should get my money's worth). Of course, the 5% of remaining logic is larger than some of the largest devices of 5 years ago. Has anyone else had a large device that is this full?

We are currently focused on Incremental Compile, which we have been fighting with for 4 years (since 6.0). It looks like it may actually be working in 10.0sp1. Prior to this it was unusable and my team was stuck doing flat compiles.

--- Quote End ---

Hi,

How many failing paths do you have? Are they all located in one module or spread over the design? Do you use design partitions?

Kind regards

GPK

Altera_Forum · ‎10-14-2010

Have you tried seed sweeps? On any design I have that is on the edge of being full or of meeting timing, I have to perform seed sweeps with the DSE in order to get a compile that both fits and meets timing. Yes, it takes time, but it does work if your design is actually capable of working. Give it a try, although I'm sure your compile times are already long with that part and size, so it's probably not what you wanted to hear.
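
If you would rather script a sweep than drive it from the DSE GUI, something along these lines is a rough sketch only. The project name `my_design`, the seed list and the report file name are placeholders I made up, so adjust them to your project and output directory.

```tcl
# Minimal seed-sweep sketch. Run with: quartus_sh -t seed_sweep.tcl
# "my_design", the seed list and the report name are hypothetical placeholders.
load_package flow

project_open my_design
foreach seed {1 2 3 4 5} {
    # SEED sets the fitter's initial placement seed
    set_global_assignment -name SEED $seed
    export_assignments   ;# write the change to the .qsf before the fitter runs

    # A nearly full device may fail to fit on some seeds, so do not abort the sweep
    if {[catch {execute_module -tool fit} msg]} {
        puts "seed $seed: fitter failed ($msg)"
        continue
    }
    execute_module -tool sta

    # Keep each run's timing report; the next seed overwrites the original
    if {[file exists my_design.sta.rpt]} {
        file copy -force my_design.sta.rpt my_design.seed${seed}.sta.rpt
    }
}
project_close
```

This reruns only the fitter and timing analysis for each seed, which assumes synthesis has already been run once and the netlist has not changed.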

Altera_Forum · ‎10-14-2010

We have been doing seed sweeps. We are now running DSE on individual partitions. As you can imagine, this is a very long process with unknown results. We are locking things down as we get closure (over 60% is now locked down). I expect what we are going to find out is that the design no longer fits (it will be greater than 100% in total).

Thanks!

Altera_Forum · ‎10-14-2010

--- Quote Start ---

Hi,

How many failing paths do you have? Are they all located in one module or spread over the design? Do you use design partitions?

Kind regards

GPK

--- Quote End ---

Partitions have never worked in the past (there was always something that kept the full flow from working). In 10.0sp1, they seem to be working. Past problems have been with GXBs and memory IO.

Probably 2k timing violations. Severity is very seed dependent.

Altera_Forum · ‎10-15-2010

--- Quote Start ---

Partitions have never worked in the past (there was always something that kept the full flow from working). In 10.0sp1, they seem to be working. Past problems have been with GXBs and memory IO.

Probably 2k timing violations. Severity is very seed dependent.

--- Quote End ---

Hi,

Working with partitions also has drawbacks. Because the partition interfaces are preserved, the design can no longer be fully optimized. Have a look at the interfaces of your partitions. Are there ports that are not driven or are tied to a fixed value?

Are the timing violations between partitions or inside a partition? If they are between partitions, did you follow the recommendation to put registers on the inputs and outputs?

If your timing violations are inside a partition, I would set up a separate project for this part of the design. It is then easier for Quartus to find the best placement and routing for the timing-critical part of your design. After you have achieved timing closure, import the result into your main project. If the timing problems are located in a block inside the partition, define a partition for the block and follow the approach mentioned above. Also take a critical look at the design itself; sometimes it is much easier to spend some additional registers to relax the timing.

BTW: How far away are you from timing closure?

Kind regards

GPK

Altera_Forum · ‎10-19-2010

We have a few thousand nets still open. We see a big difference with seeds.

We haven't tried import/export since 8.0. The problem back then was that the tool could not deal with all the clocks. Multiple partitions were trying to use the same resources.

Altera_Forum · ‎10-20-2010

--- Quote Start ---

We have a few thousand nets still open. We see a big difference with seeds.

We haven't tried import/export since 8.0. The problem back then was that the tool could not deal with all the clocks. Multiple partitions were trying to use the same resources.

--- Quote End ---

Hi,

Are the failing paths spread over the design?

Kind regards

GPK

Altera_Forum · ‎10-20-2010

Yes, and they move around based on seed.

Altera_Forum · ‎10-22-2010

--- Quote Start ---

Yes, and they move around based on seed.

--- Quote End ---

Hi,

Sounds really difficult. Do you still have partitions in your design? Does 95% utilization refer to logic cells, memory, or something else? Can you post your design summary here?

Kind regards

GPK

Altera_Forum · ‎10-22-2010

My only suggestion (though I think this is how you have been working until now) is to set all partitions to empty except the one with the problem. That speeds up compile time a lot.

At this point, turn on every speed optimization you can (at least register retiming) and try to resolve every violation. I mean you must understand every problem you have.

Maybe something is constrained incorrectly and should be a false path or a multicycle path.

Also constrain all the slow pins/ports; otherwise Quartus does the best it can on every net that has no assignment.

When you have closed timing in this partition, preserve it with a post-fit netlist and move on to the next partition in the same way.

"Have fun"
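
For the false-path, multicycle and slow-port points above, a minimal SDC sketch might look like the following. All clock, register and port names here are invented for illustration, and whether a given path really is a false or multicycle path has to come from the design itself, not from the constraint file.

```tcl
# Hypothetical clock, register and port names; substitute your own.

# Two clocks with no fixed phase relationship: stop TimeQuest from analysing
# transfers between them (the CDC logic itself must make the crossing safe).
set_clock_groups -asynchronous -group {clk_a} -group {clk_b}

# Registers whose outputs are only sampled every second clock:
# allow two cycles for setup and move the hold check back to match.
set_multicycle_path -setup -end 2 -from [get_registers {cfg_regs|*}] -to [get_registers {datapath|*}]
set_multicycle_path -hold  -end 1 -from [get_registers {cfg_regs|*}] -to [get_registers {datapath|*}]

# Slow, timing-insensitive pins: tell the fitter not to spend effort on them.
set_false_path -to   [get_ports {status_led[*]}]
set_false_path -from [get_ports {async_reset_n}]
```

Each path cut this way is one the fitter no longer spends routing on, which matters most when the device is this full.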

Altera_Forum · ‎10-22-2010

--- Quote Start ---

Hi,

Sounds really difficult. Do you still have partitions in your design? Does 95% utilization refer to logic cells, memory, or something else? Can you post your design summary here?

Kind regards

GPK

--- Quote End ---

100% of memory and 95% of logic.

I have another question about how the tool allocates memory. The fitter seems to use 100% of the memory and then starts chewing up internal logic. Does anyone know of any documentation that describes this?

Altera_Forum · ‎10-22-2010

The fact that you've got a Stratix IV 530 at nearly 100% utilization means you're no novice, so forgive me if I state some things here that are obvious to you.

1 - The single change that can have the most impact on timing and resource utilization is modification of the source RTL itself. This is usually the first place to look when timing fails. Many times the RTL resulted in long stages of combinatorial logic the designer was unaware of. Now of course you may not be able to do this if a lot of the logic lies within purchased 3rd party cores to which you don't have the source code.

2 - Timing constraints - Are you sure you've constrained the timing such that Quartus isn't doing unnecessary work trying to meet timing on paths that really aren't important (crossing clock domains, for instance)? The fitter may take away precious routing and logic resources for paths that really don't deserve it, thus making it difficult for the entire design to fit.

3 - Have you tried logiclock regions yet? Physically constraining the fitter as to where certain modules may be placed allows you to provide heuristic human insight that can dramatically help the fitter out. You know the design. You know what connects to what. Locking adjacent logic into a region can be a big help.

4 - Have you tried physical placement constraints? Similar to logiclock you may need to manually restrict the placement of certain elements to specific LABs in the device. Often this can be done by scripting.

5 - You probably have the entire design set to be optimized for speed. Are there any modules in the design that don't need to be optimized for speed? Are you aware that you can set the optimization technique specifically for each module in the hierarchy? So you might optimize the entire design for speed and target specific modules to optimize for size, or vice versa.

6 - Have you tried tweaking the fitter/placement effort multiplier? Try a value of 4 and see what happens.

My personal opinion is that seed sweeps and DSE sweeps are the worst methods for tackling these issues. They consume an enormous amount of time and at the end of the day you're just trying to get lucky. And if you do get it to work a single change to the design may throw the whole thing off again. Find the problem, fix it, then move on to the next problem.

Jake

Altera_Forum · ‎10-28-2010

--- Quote Start ---

The fact that you've got a Stratix IV 530 at nearly 100% utilization means you're no novice, so forgive me if I state some things here that are obvious to you.

1 - The single change that can have the most impact on timing and resource utilization is modification of the source RTL itself. This is usually the first place to look when timing fails. Many times the RTL resulted in long stages of combinatorial logic the designer was unaware of. Now of course you may not be able to do this if a lot of the logic lies within purchased 3rd party cores to which you don't have the source code.

We had most of the design blocks running fine in 90 nm (Stratix II GX). The problem is IC delay.

2 - Timing constraints - Are you sure you've constrained the timing such that Quartus isn't doing unnecessary work trying to meet timing on paths that really aren't important (crossing clock domains, for instance)? The fitter may take away precious routing and logic resources for paths that really don't deserve it, thus making it difficult for the entire design to fit.

3 - Have you tried logiclock regions yet? Physically constraining the fitter as to where certain modules may be placed allows you to provide heuristic human insight that can dramatically help the fitter out. You know the design. You know what connects to what. Locking adjacent logic into a region can be a big help.

4 - Have you tried physical placement constraints? Similar to logiclock you may need to manually restrict the placement of certain elements to specific LABs in the device. Often this can be done by scripting.

5 - You probably have the entire design set to be optimized for speed. Are there any modules in the design that don't need to be optimized for speed? Are you aware that you can set the optimization technique specifically for each module in the hierarchy? So you might optimize the entire design for speed and target specific modules to optimize for size, or vice versa.

We are now doing this. Like I mentioned earlier, we are in Stratix IV, and 9.x was terrible for preserving placement via incremental compile (it could not remember previous fitter results). We are on 10.0 SP1 and things seem to be working like we expect.

6 - Have you tried tweaking the fitter/placement effort multiplier? Try a value of 4 and see what happens.

My personal opinion is that seed sweeps and DSE sweeps are the worst methods for tackling these issues. They consume an enormous amount of time and at the end of the day you're just trying to get lucky. And if you do get it to work a single change to the design may throw the whole thing off again. Find the problem, fix it, then move on to the next problem.

Jake

--- Quote End ---

1/ We had most of the design blocks running fine in 90 nm (Stratix II GX). The problem is mostly IC delay.

2/ Agreed. We are quadruple checking. Multi-corner analysis is eating up lots of interconnect. This is required for the types of memory. There are over 200 clocks in the design. These mostly come from Altera memory IP (we have 4 clock domains).

3/ I'm a long-time user of Altera tools. I can never get a straight answer out of Altera as to whether LogicLock pays any dividends. I guess every design is different, so it's hard to say for sure. The 530 seems to have crossed the line where there is no denying that floorplanning and design partitioning are the only answer.

4/ Working on it.

5/ We now have about 10 partitions and are optimizing each one for area and then trying to meet timing (gradually cranking up the speed setting if needed). This seems to be working. Once the entire design is locked and closed with respect to timing, we will run out of some type of resource.

6/ Fitter effort definitely helps with design partitions. For the flat compiles it is not much help.

The path that shows the most promise is incremental compile. As you can imagine, it is a slow process. I could write a book on this experience....

Thanks for taking the time for the detailed reply. I will give an update on the final solution (I think 10.0 will bring closure).

best regards

Altera_Forum · ‎10-30-2010

Just some brainstorming (I mostly use Cyclone, so maybe something does not really apply for Stratix):

- Try to reduce the amount of logic used (maybe at 90% it would fit like a charm), either by changing the RTL code and/or by changing compiler options (e.g. maximum register packing, optimize for area instead of optimize for speed)

- You mentioned 200 clocks. The clocks with failing paths, are they using global clocks or non-global routing-resources? If they use non-global, try to reduce the number of clock-domains so that only global resources are needed (this would also free routing-resources for normal routing).

To get an impression: What frequencies do you want to achieve, and what is your "typical" slack at the moment?

Thomas

Altera_Forum · ‎10-30-2010

--- Quote Start ---

1/ we had most of the design blocks running fine in 90 nm (stratix ii gx). the problem is mostly ic delay.

--- Quote End ---

I had the same problem in a Stratix II GX project (EP2SGX60F1152C3). Our GX logic uses a lot of M4Ks, about 2/3 of them, and apparently that makes the signals cross the device from left to right and back.

We reduced the logic by using a 192-bit data size instead of 216 bits. This reduced the utilization to 78% (coming from 100+), but then we had to increase the speed to 175 MHz. The failing paths had IC delays of 75+%. I cured the problem by adding pipeline registers between a few blocks. This allows the router to cross the device in two steps and gives it some headroom for other paths.

Altera_Forum · ‎11-02-2010

--- Quote Start ---

Just some brainstorming (I mostly use Cyclone, so maybe something does not really apply for Stratix):

- Try to reduce the amount of logic used (maybe at 90% it would fit like a charm), either by changing the RTL code and/or by changing compiler options (e.g. maximum register packing, optimize for area instead of optimize for speed)

- You mentioned 200 clocks. The clocks with failing paths, are they using global clocks or non-global routing-resources? If they use non-global, try to reduce the number of clock-domains so that only global resources are needed (this would also free routing-resources for normal routing).

To get an impression: What frequencies do you want to achieve, and what is your "typical" slack at the moment?

Thomas

--- Quote End ---

Thanks Thomas. The failing paths are on the heavily loaded global clock nets (155 and 311). I think one problem is that the device is running out of memory resources and the tool then starts chewing up LABs. It's not clear to me what the algorithm is for selecting memory to move to LABs. After reviewing some of the results, I think it may have made bad choices.

For now we are going down the Incremental Compile path and inserting pipeline stages where needed. This seems like it will work if we don't run out of space.

BTW, do you know of anyone who has tried the "Team Based Flow"? We are also looking at that. So far the auto-generated makefiles seem all screwed up, but we can fix that. I can build a partition that takes up 10 percent of the design in 20 minutes. I can do 10 jobs in parallel for my 10 partitions. Then, in theory, there should just be a short compile to route and connect things. It should be done in 2 hours instead of 24; at least that is how it is advertised....

Altera_Forum · ‎11-02-2010

--- Quote Start ---

I had the same problem in a Stratix II GX project (EP2SGX60F1152C3). Our GX logic uses a lot of M4Ks, about 2/3 of them, and apparently that makes the signals cross the device from left to right and back.

We reduced the logic by using a 192-bit data size instead of 216 bits. This reduced the utilization to 78% (coming from 100+), but then we had to increase the speed to 175 MHz. The failing paths had IC delays of 75+%. I cured the problem by adding pipeline registers between a few blocks. This allows the router to cross the device in two steps and gives it some headroom for other paths.

--- Quote End ---

We are looking into the memory. We are definitely running out of it and the tool starts moving memory cells into logic.

Altera_Forum · ‎11-03-2010

--- Quote Start ---

Thanks Thomas. The failing paths are on the heavily loaded global clock nets (155 and 311). I think one problem is that the device is running out of memory resources and the tool then starts chewing up LABs. It's not clear to me what the algorithm is for selecting memory to move to LABs. After reviewing some of the results, I think it may have made bad choices.

For now we are going down the Incremental Compile path and inserting pipeline stages where needed. This seems like it will work if we don't run out of space.

BTW, do you know of anyone who has tried the "Team Based Flow"? We are also looking at that. So far the auto-generated makefiles seem all screwed up, but we can fix that. I can build a partition that takes up 10 percent of the design in 20 minutes. I can do 10 jobs in parallel for my 10 partitions. Then, in theory, there should just be a short compile to route and connect things. It should be done in 2 hours instead of 24; at least that is how it is advertised....

--- Quote End ---

Hi,

Some thoughts about the team-based flow ...

In my view it would not be a good decision to use a "team based flow" as long as your device utilization is so high. Using partitions means that the design cannot be optimized as well as before, because the interfaces of the partitions are preserved and no optimization takes place across partitions. The next point is that it is recommended to use registered inputs and outputs at the partition boundaries, in order to prevent timing issues between partitions. This will eat up some resources again. For the partitions you have to define LogicLock regions. That means you have to assign resources to the partitions. You cannot fill a LogicLock region to 100%, so you will waste some resources (memory especially could be a problem).

To get the full advantage of the flow you have to use the preservation level "Placement & Routing". I assume you will run all your partitions separately and import them into your main project. By doing this you could run into some problems, especially with the clock control blocks. I'm not sure that Quartus is able to detect that two partitions use the same clock and that the clock control blocks (they are part of the placement and routing) could be merged.

kind regards

GPK

Altera_Forum · ‎11-03-2010

--- Quote Start ---

We are looking into the memory. We are definitely running out of it and the tool starts moving memory cells into logic.

--- Quote End ---

If it is dumping memory into LABs, are you sure you are using the M9Ks efficiently in the first place? To build something the size of an M9K out of MLABs you're going to eat up a lot of logic. Are you using many M9Ks at only 25% usage? Add attributes/synthesis directives to force small memories into MLABs. Sometimes rearranging a memory map can improve efficiency hugely, especially when using mixed widths in true dual-port mode. Simple dual-port mode supports up to 32/36 bits in mixed-width mode, whereas true dual-port supports only 16/18 bits.

Another thing to check: I've just found a bug in Q9.1 and Q10 where it was auto-generating altshift_taps for me and eating up loads of memory unnecessarily (like 30+ M9Ks when it should have placed a couple of registers!). If you're sure you don't need them, turn them off in the project with:

set_global_assignment -name AUTO_SHIFT_REGISTER_RECOGNITION OFF

or, in the GUI, by going to Assignments > Settings > Analysis & Synthesis Settings > More Settings and setting Auto Shift Register Replacement to “OFF” (the default is “AUTO”).
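
Back on the M9K versus MLAB point, a quick back-of-the-envelope calculation shows why a memory pushed out of block RAM is so expensive. This is only a sketch: it assumes 9216 bits per M9K (including parity) and 640 bits per MLAB, counts raw bits only, and ignores the width/depth shapes each block actually supports, so check the device handbook for your part.

```tcl
# Assumed block sizes; verify against the device handbook for your family.
set m9k_bits  9216   ;# one M9K block, parity bits included
set mlab_bits 640    ;# one MLAB used as memory (assumption)

# MLABs needed to hold one completely full M9K (ceiling division)
set mlabs_per_m9k [expr {($m9k_bits + $mlab_bits - 1) / $mlab_bits}]
puts "MLABs per full M9K: $mlabs_per_m9k"

# A small memory that fills only 25% of an M9K
set small_bits  [expr {$m9k_bits / 4}]
set mlabs_small [expr {($small_bits + $mlab_bits - 1) / $mlab_bits}]
puts "MLABs for a 25%-full M9K: $mlabs_small"
```

With these assumptions the script prints 15 and 4. Every MLAB used this way is a LAB that can no longer hold logic, so a full M9K spilled into MLABs costs far more than a quarter-full one moved there on purpose.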

Altera_Forum · ‎12-21-2010

And the magic answer is:

Run a bunch of random seeds (Takes days on server farm as a single run takes 24 hours).

Change fitter and router effort to 4.0 and 3.0 respectively

Change synthesis from Speed to Balanced (Dropped size down by a few percent)

Remove all design partitions (Design partitions were taking us farther and farther away from final solution). Once they were removed, size of design dropped from 95 to 90 percent full.
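
For anyone who wants to try the same settings, this is a sketch of how they might look in the .qsf. It is my reading of the list above, not the original poster's file: I am assuming "fitter effort" means the placement effort multiplier, the seed value is a placeholder, and older Quartus releases may use a family-specific name for the optimization technique, so compare against what the Settings dialog writes for your version.

```tcl
# One seed out of the sweep; 7 is a placeholder, not a value from this thread
set_global_assignment -name SEED 7

# Fitter (placement) and router effort, 4.0 and 3.0 respectively
set_global_assignment -name PLACEMENT_EFFORT_MULTIPLIER 4.0
set_global_assignment -name ROUTER_EFFORT_MULTIPLIER 3.0

# Synthesis changed from Speed to Balanced
# (the assignment name is an assumption for 10.0; it may be family-specific)
set_global_assignment -name OPTIMIZATION_TECHNIQUE BALANCED
```

Removing the design partitions is done in the Design Partitions window or by deleting the partition assignments, which is not shown here.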