Programmable Devices
CPLDs, FPGAs, SoC FPGAs, Configuration, and Transceivers
21582 Discussions

FPGA Multiplication Resource Utilization

Altera_Forum
Honored Contributor II
3,019 Views

Hi, 

I need to do 18x7 multiplication and my FPGA doesnt have the built in DSP blocks, so I am using behavioral code in the source code to do the multiplication. It is taking around 190 LUT to do this multiplication  

Example: q_srrc_sum <= $signed(7'd63) * q_srrc; where q_srrc is an 18 bit vector 

I need to do 16 of these and it takes lot of the area. Is there a better way to optimize this , so that the resource utilization is less. 

Thank You
0 Kudos
11 Replies
Altera_Forum
Honored Contributor II
2,253 Views

190 LUT seems a lot for multiplying by the constant 63 (0x3f if I read it right). 

That multiply should be one subtraction. 

If your multiplies are all by constants, you might be able to optimise them. 

I'm also surprised the multiply takes many more than 18x7 (126) LUT - since it can be made of that many full adders (3 input bits) with a delay of 18+7 (23 or so) logic steps. 

 

Reducing the resource either means using multi cycle multipliers, or muxing the values into a single multiplier.
0 Kudos
Altera_Forum
Honored Contributor II
2,253 Views

 

--- Quote Start ---  

190 LUT seems a lot for multiplying by the constant 63 (0x3f if I read it right). 

That multiply should be one subtraction. 

If your multiplies are all by constants, you might be able to optimise them. 

I'm also surprised the multiply takes many more than 18x7 (126) LUT - since it can be made of that many full adders (3 input bits) with a delay of 18+7 (23 or so) logic steps. 

 

Reducing the resource either means using multi cycle multipliers, or muxing the values into a single multiplier. 

--- Quote End ---  

 

 

if I want multiply any number by 64 then I just add 6 leading zeros. 

if I want multiply by 63 instead of 64 I will try see if the minor error can be tolerated. If not I will sunbtract: 

y * 63 = y*(64-1) = y*64 -y
0 Kudos
Altera_Forum
Honored Contributor II
2,253 Views

Hi, 

Thank you for your response, can you please elaborate on this, I didnt follow what you meant by: 

if I want multiply any number by 64 then I just add 6 leading zeros. 

The synthesis tools are smart enough whether I use a Multiplier (*) in my source code or a shift operator correct and use the same resources??? 

 

Thank You.
0 Kudos
Altera_Forum
Honored Contributor II
2,253 Views

Unfortunately I cannot add latency( 1-2 clock latency is fine nothing more than that). Is there any other option? 

Thank you for your response.
0 Kudos
Altera_Forum
Honored Contributor II
2,253 Views

 

--- Quote Start ---  

Unfortunately I cannot add latency( 1-2 clock latency is fine nothing more than that). Is there any other option? 

Thank you for your response. 

--- Quote End ---  

 

 

If you just want to multiply any number (binary format) with any power of 2 constant e.g. 2^6(64) then all you need is insert 6 leading zeros e.g. 

1 = '1', 2 = "10", 4 = "100" and so on. It does not need multiplier and I wouldn't wait for a tool to do that. 

 

If your constant is close to 2^n then you can subtract or add as appropriate e.g. 

112 * 65 = 112*64 + 112 so add 6 leading zeros then add one more sample of input. 

 

regarding latency issue: you can insert leading zeros without any latency. for add/subtract you may need one clock or so depending on your design. 

 

For suitable bitwidths you may also think of precomputed Look-up-table.
0 Kudos
Altera_Forum
Honored Contributor II
2,253 Views

Thank You soo much for the information, can you please provide some information on precomputed LUT, I havent done this before, will that reduce the area? 

Thanks a lot.
0 Kudos
Altera_Forum
Honored Contributor II
2,253 Views

 

--- Quote Start ---  

Thank You soo much for the information, can you please provide some information on precomputed LUT, I havent done this before, will that reduce the area? 

Thanks a lot. 

--- Quote End ---  

 

 

In general, using precomputed values stored in LUT is a method that can be used for any formula. You either precompute entire result or break down your computation into chunks; partly computed in logic and partly precomputed. 

 

To get precomputed results you need to precompute them off circuit! first by hand(well a software tool), store them in ROM. In the logic you use suitable address to get the values from rom.  

 

For multiplication with a constant value you can precompute positive values only then use input's absolute value as address which means you use positive numbers directly but convert negative number by inverting then adding 1.  

 

The method best suits small memory. An 18 bit value will require 2^17 results each with 18 bits width or more, so it is massive unrealistic memory size though there might be some other reduction techniques.
0 Kudos
Altera_Forum
Honored Contributor II
2,253 Views

Since an n-bit by m-bit multiply is basically n m-bit wide conditional adds (or m n-bit wide ones) somewhere you have to perform those operations. You can arrange to feed the 'carry' bits from one addition into the next, reducing the latency (the carry bits are the limiting factor) from 'n * m' to 'n + m' full addres cells. 

If you have to do all your multiplies in the same clock, then you need separate logic for each. 

If you can arrange to do them in different clocks than you can use mux to feed in the required values into a single (or smaller numer of) multiplier block. 

You can also feed the bits of the operands into a smaller multiplier and add together the results (like doing a long multiply by hand). That is probably easiest if you just split one of the inputs into 2 equal pieces.
0 Kudos
Altera_Forum
Honored Contributor II
2,253 Views

If you have some spare block-ram you can do full multiplication with two table look-ups, two additions and one subtract. This was used extensively on the commodore 64 for demos and boils down to: 

 

F(a*b)=F(a+b)-F(a-b) where F is a table of squares divided by 4 ( i.e. x^2/4 ) 

 

Depending on your needs, pipelining should bring this down to a two-clock operation and minimal LE resources. An 8x8 multiplication would take 256 16-bit words of memory and 48 LE's ( 2x8 for lookups and 2x16 for pipeline registers ). 

 

-Mux
0 Kudos
Altera_Forum
Honored Contributor II
2,253 Views

The method Kaz mentioned earlier on is how I would do it. This is the one liner that does q_srrc_sum = (q_srrc * 64) - q_srrc which is the same thing as multiplying by 63. 

 

assign q_srrc_sum = (q_srrc << 6) - q_srrc; 

 

I expect that to be around 27 LEs and it is combinational so it shouldn't impact your latency unless you need to pipeline it.
0 Kudos
Altera_Forum
Honored Contributor II
2,253 Views

Agreed. For multiplying by a fixed number, shifts, adds and subtracts are the way to go. I once wrote a program that would give you the optimal combination in pseudo code. I'll see if I can dig it up somewhere. It basically found two shift values that were above and below the multiplier and then recursively deduced the least expensive path. 

 

-Mux
0 Kudos
Reply