I've got a decimator block written in Verilog. It's the standard structure: flops, then constant multiplication, then an accumulate tree.
However, Quartus is mapping the multiplies onto DSP blocks, and the design then fails timing. This looks like an early-stage decision, since DSP resources are consumed right at the start of Analysis & Synthesis. Is there a way to make Quartus synthesize those constant multiplies in logic rather than DSP blocks, short of rewriting the block with shifts and adds by hand? I'm on Quartus 12.1sp1 on Windows 7 64-bit. Thanks.
I don't quite get what the problem is. DSP blocks are the fastest way to do multiplies on the chip, so why would you not want to use them? Sometimes you can have problems routing into or out of a DSP block, but the solution for that is to add more pipeline registers around the multiplier, which lets the fitter shorten the distance between stages and place a register right next to the multiplier.
PS. I've only had problems with the above when setting the clock speed to >350MHz on a Stratix IV and V.
--- Quote Start --- I don't quite get what the problem is. DSP blocks are the fastest way to do multiplies on the chip, so why would you not want to use them? --- Quote End --- That's true if both inputs are variables. However, when one side is a constant, using the multiplier is probably the slowest option. For example, 30*X would be DRAMATICALLY faster done as 32*X - 2*X. A multiplier can't even get close.
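The identity behind the 30*X example can be sanity-checked in a few lines. This is a Python sketch rather than Verilog, purely to confirm the arithmetic; the function name and the 16-bit signed input range are assumptions for illustration and are not taken from the decimator.

```python
def times_30_shift_sub(x: int) -> int:
    """Compute 30*x as 32*x - 2*x: two shifts and one subtraction."""
    return (x << 5) - (x << 1)


# Compare against a plain multiply over an assumed 16-bit signed input range.
mismatches = [
    x for x in range(-(1 << 15), 1 << 15)
    if times_30_shift_sub(x) != 30 * x
]
print("mismatches:", len(mismatches))
```

In hardware the two shifts are just wiring, so the whole constant multiply costs one subtractor.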
True, but you'll have to work that out for yourself. AFAIK, the synthesizer cannot work out the 2^n decomposition itself. The approach also gets complicated as the constant values get larger. Your example is very simple and requires just a single adder. Once the constants get to values requiring tens or hundreds of adders, a DSP block might be easier.
--- Quote Start --- True, but you'll have to work that out for yourself. AFAIK, the synthesizer cannot work out the 2^n decomposition itself. The approach also gets complicated as the constant values get larger. Your example is very simple and requires just a single adder. Once the constants get to values requiring tens or hundreds of adders, a DSP block might be easier. --- Quote End --- You can get almost every constant out to 16 bits with right around 5 operations if you allow subtractions. I'm really surprised that the tools can't work out these constants. Even the low-end ASIC synthesis tools have been able to do this kind of thing for quite a few years. Sigh. I guess I have to create some code to solve the knapsack problem and generate Verilog. Again. Thanks for the advice.
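One possible shape for such a generator is sketched below in Python. It uses canonical signed-digit (non-adjacent form) recoding, which is a swapped-in standard technique and not necessarily what the poster had in mind; it yields a shift/add/subtract form but does not search for the cheapest shared-subexpression solution. The names `naf`, `op_count`, and `verilog_expr` are hypothetical, and the emitted expression assumes the caller has already sized `x` wide enough to hold the result.

```python
def naf(n: int) -> list[tuple[int, int]]:
    """Recode a positive constant into (shift, sign) terms, LSB first.

    The terms satisfy sum(sign * 2**shift) == n, and no two adjacent
    bit positions are both non-zero (non-adjacent form).
    """
    terms = []
    shift = 0
    while n != 0:
        if n & 1:
            digit = 2 - (n & 3)  # +1 when n % 4 == 1, -1 when n % 4 == 3
            terms.append((shift, digit))
            n -= digit
        n >>= 1
        shift += 1
    return terms


def op_count(n: int) -> int:
    """Number of adders/subtractors needed: one fewer than the term count."""
    return max(len(naf(n)) - 1, 0)


def verilog_expr(n: int, var: str = "x") -> str:
    """Emit a Verilog expression string computing n * var with shifts only."""
    terms = naf(n)
    if not terms:
        return "0"
    parts = []
    for shift, sign in reversed(terms):  # most significant term first
        op = "+" if sign > 0 else "-"
        parts.append(f"{op} ({var} << {shift})")
    # The leading term of a positive constant is always positive: drop its "+ ".
    return " ".join(parts)[2:]


print(verilog_expr(30))
```

A script like this can be run over a coefficient list, with `op_count` used to decide per coefficient whether the logic version is worth emitting.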
It might be worth raising a support request for this as I can see the benefit.
I presume you know that you can always force a multiplier to be implemented in logic cells, at the entity or signal level, with the multstyle synthesis attribute. Does it solve the timing problem?
The original post is about using shifts instead of multipliers. One input is constant and so may be convertible to a sum or difference of powers of 2. As far as I know, the tool does not do this sort of conversion.
--- Quote Start --- The original post is about using shifts instead of multipliers. One input is constant and so may be convertible to a sum or difference of powers of 2. As far as I know, the tool does not do this sort of conversion. --- Quote End --- Shift-and-add is how the synthesis tool implements a multiplier without DSP blocks, so multstyle = "logic" does use it. But there's no point in design synthesis where you'll see explicit adders or shift registers, because everything is translated to logic elements. The question, of course, is whether the synthesis tool implements the defined function effectively, but that problem isn't specific to constant multipliers. I'm under the impression that Quartus is quite good at implementing arithmetic. It has some obvious weaknesses in implementing non-arithmetic problems in the arithmetic mode of logic elements.
--- Quote Start --- Shift-and-add is how the synthesis tool implements a multiplier without DSP blocks, so multstyle = "logic" does use it. But there's no point in design synthesis where you'll see explicit adders or shift registers, because everything is translated to logic elements. The question, of course, is whether the synthesis tool implements the defined function effectively, but that problem isn't specific to constant multipliers. I'm under the impression that Quartus is quite good at implementing arithmetic. It has some obvious weaknesses in implementing non-arithmetic problems in the arithmetic mode of logic elements. --- Quote End --- Multipliers implemented in logic are full multipliers that support two variable inputs. I don't know exactly how they are implemented in FPGAs, but in the pre-FPGA era (Logic Design by Charles Roth) they were based on shift/add/control in an elaborate design. However, if one input is constant we don't need any of that, only shifts (plus perhaps an adder). So the two cases are distinct from a design perspective. The original poster is not obsessed with shifts but wants a simplified design. Personally I prefer DSP blocks. I might target one or so coefficients as powers of 2 to save a few multipliers, just in case.
As a point of reference, Stratix V now allows each DSP block to have 8 fixed coefficients that can be muxed from the coeff_sel lines, and if I'm reading correctly the coefficients can be up to 27 bits. This might actually be a way to avoid the RAM-based multipliers that I've seen used for higher-speed designs when the multiplier value is fixed.
--- Quote Start --- However, if one input is constant we don't need any of that, only shifts (plus perhaps an adder). So the two cases are distinct from a design perspective. The original poster is not obsessed with shifts but wants a simplified design. Personally I prefer DSP blocks. I might target one or so coefficients as powers of 2 to save a few multipliers, just in case. --- Quote End --- Timing is the issue, primarily. I'm shoving things around at 150MHz to 200MHz on an Arria V. Not impossibly fast, but one definitely has to be alert to what is actually happening in synthesis. The multipliers want to finish the carry-propagate add before giving the result. Unfortunately, I have an add accumulation tree right after the multiplier, so the carry-propagate is effectively useless *and* soaks up a big chunk of time. I'd rather dump the final carry-save state and let the accumulate tree absorb it.
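The carry-save idea in the post above can be illustrated with a small behavioral model. This is a Python sketch on non-negative integers, not synthesizable code, and the names `csa` and `reduce_carry_save` are hypothetical. It shows that a tree of 3:2 compressors keeps the running total as a (sum, carry) pair with no carry propagation, so only one carry-propagate add is needed at the very end.

```python
def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 compressor: returns (s, cy) such that s + cy == a + b + c.

    Every bit position is computed independently, so no carry ripples
    across the word.
    """
    s = a ^ b ^ c                                # per-bit sum
    cy = ((a & b) | (a & c) | (b & c)) << 1      # per-bit majority, weighted x2
    return s, cy


def reduce_carry_save(operands: list[int]) -> list[int]:
    """Compress any number of operands down to at most two, using csa only."""
    ops = list(operands)
    while len(ops) > 2:
        a, b, c = ops.pop(), ops.pop(), ops.pop()
        ops.extend(csa(a, b, c))                 # three values in, two out
    return ops


partial_products = [3, 5, 7, 9, 11]              # arbitrary example values
pair = reduce_carry_save(partial_products)
total = sum(pair)                                # the single carry-propagate add
print(pair, total)
```

A multiplier that resolves its result before the accumulate tree effectively pays for that final add twice, which is the cost being objected to.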
--- Quote Start --- Timing is the issue, primarily. I'm shoving things around at 150MHz to 200MHz on an Arria V. Not impossibly fast, but one definitely has to be alert to what is actually happening in synthesis. The multipliers want to finish the carry-propagate add before giving the result. Unfortunately, I have an add accumulation tree right after the multiplier, so the carry-propagate is effectively useless *and* soaks up a big chunk of time. I'd rather dump the final carry-save state and let the accumulate tree absorb it. --- Quote End --- Well, I regularly get timing problems on multipliers (Stratix IV @ 368MHz), and then I remember what to do: put a pipeline register after the multiplier result (in addition to the block's own registers). This makes a big difference. I was afraid at times that this register might be repacked into the block, but apparently that never happened. It seems that, otherwise, routing from these multiplier blocks to the fabric is too slow. If you get latency problems, then you might drop one of the block's internal pipeline registers, if applicable. On the other hand, I must confirm that with FPGAs we regularly feed constants (e.g. coefficients) into multipliers and we don't aim to design them as simple shift/add. DSP blocks are usually fast.
I haven't yet heard any results from enforcing logic implementation of the constant multipliers for the present problem.
--- Quote Start --- I haven't yet heard any results from enforcing logic implementation of the constant multipliers for the present problem. --- Quote End --- Sorry for the lag, but I've been fighting a couple of different issues. Enforcing the logic implementation removes the multiplier usage but loses significant speed. I would have to hand-code a compression tree to win back enough speed. I may do that at some point. If so, I will add to this post. However, unless I hit a speed wall, I probably won't. I'm finding that I am more than a bit underwhelmed by the speed performance of the Arria V's. I did not expect 250+MHz in Verilog to be this problematic in a 28nm technology chip.
I remember someone saying that it takes time to get values into and out of the DSP blocks - so to do a multiply in logic probably requires that you add another pipeline stage (or two) somewhere.
--- Quote Start --- I remember someone saying that it takes time to get values into and out of the DSP blocks - so to do a multiply in logic probably requires that you add another pipeline stage (or two) somewhere. --- Quote End --- Yes, I've seen the fitter have no problems with the DSP blocks themselves, but then decide to put the register before or after the DSP halfway across the chip to move it closer to the next/previous bit of logic. So adding redundant pipeline stages before and after the DSP gives the fitter a bit of extra leeway on timing. You can get the same problem with RAM blocks too.
--- Quote Start --- Yes, I've seen the fitter have no problems with the DSP blocks themselves, but then decide to put the register before or after the DSP halfway across the chip to move it closer to the next/previous bit of logic. So adding redundant pipeline stages before and after the DSP gives the fitter a bit of extra leeway on timing. You can get the same problem with RAM blocks too. --- Quote End --- I took this advice to heart and double-pipelined the entry and exit points (2 flops in a row on both input and output). The system still can't hit 320MHz. The path goes global clock->DSP block->flop from global clock, and it can't seem to meet 320MHz for setup on the multipliers (at least at slow corners--fast corners claim to pass). And we're not talking a small miss here. On a 3.125ns clock cycle it misses by almost a full nanosecond. I did check the datasheet for the chip (and checked that my device settings are correct), and it claims 370MHz is supposed to be the minimum on these. Is there some file where I can look at exactly what this thing is doing in the corners? That seems to be an *enormous* variation from fast to slow. 18x18 multiplier in 28nm and the system can't hold 320MHz while completely pipelined and with a constant on one input? Double pipelining on input and output and over half my delay is in interconnect? Something feels broken ...
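To put the figures from the post above side by side, here is a rough Python calculation. The slack of -1.0 ns is an assumed round number standing in for "almost a full nanosecond", and the helper names are hypothetical; the result is only an estimate of what clock rate the reported miss corresponds to.

```python
def period_ns(f_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / f_mhz


def fmax_mhz(period: float, slack: float) -> float:
    """Fastest clock the path supports, given setup slack at the current period.

    Negative slack means the path needs (period - slack) ns to complete.
    """
    return 1000.0 / (period - slack)


target_period = period_ns(320.0)      # the constraint from the post
assumed_slack = -1.0                  # ASSUMPTION: "almost a full nanosecond"
datasheet_period = period_ns(370.0)   # the datasheet figure quoted in the post

print(f"target period    : {target_period:.3f} ns")
print(f"implied fmax     : {fmax_mhz(target_period, assumed_slack):.1f} MHz")
print(f"datasheet period : {datasheet_period:.3f} ns")
```

Under that assumption the path needs roughly 4.1 ns, well over the period the quoted datasheet figure would imply, which is the gap the poster is questioning.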
How are these multipliers implemented? I've also had experience showing that inferred mult-add trees won't clock as fast as ones built with megafunctions. And to make it clock faster, LUTs had to be used in place of DSP blocks.
--- Quote Start --- How are these multipliers implemented? I've also had experience showing that inferred mult-add trees won't clock as fast as ones built with megafunctions. And to make it clock faster, LUTs had to be used in place of DSP blocks. --- Quote End --- No tree. Just a pure multiply. Flop->flop->multiply->flop->flop. TimeQuest shows the critical path's data arrival path as global clock->DSP block->flop (CLKCTRL_G2->DSP_X70_Y69_N0->FF_X71_Y69_N14), with the data required path being global clock->flop (CLKCTRL_G2->FF_X71_Y69_N14). I don't really see how I can get any cleaner than that ... I have a ticket open with Altera. I also note that the release notes for the latest Quartus mention that timing models have been changed for the Arria V series. I'll report back if something changes.