Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
17042 Discussions

Dividing numbers by constants without dividers?

Altera_Forum
Honored Contributor II
8,871 Views

Hi! 

 

I would like to divide a number by a set of constants (3, 5 or 7) and I have read in an old post that using a divider might be a waste of resources. I have also read that for simple divisions like the ones I need, it would be possible to use multipliers instead. I guess it would be something like this: 

 

N = 14-bit numerator (the number I want to divide) 

C = constant where C = {3, 5, 7} 

 

Result = N*(1/C) 

 

But I do not know how to get that 1/C in Verilog... Could anyone please give me an easy example? I would like to use an LPM_MULT so I would also like to know the optimal latency for a 178.2 MHz clock on an EP4CE10F17C8L FPGA. So far I have used an LPM_MULT to get a 14-bit result form a 10-bit and a 4-bit inputs and the latency is 2, but maybe I do not know if that would change in this case. 

 

Thanks in advance!
0 Kudos
13 Replies
Altera_Forum
Honored Contributor II
5,210 Views

 

--- Quote Start ---  

Hi! 

 

I would like to divide a number by a set of constants (3, 5 or 7) and I have read in an old post that using a divider might be a waste of resources. I have also read that for simple divisions like the ones I need, it would be possible to use multipliers instead. I guess it would be something like this: 

 

N = 14-bit numerator (the number I want to divide) 

C = constant where C = {3, 5, 7} 

 

Result = N*(1/C) 

 

But I do not know how to get that 1/C in Verilog... Could anyone please give me an easy example? I would like to use an LPM_MULT so I would also like to know the optimal latency for a 178.2 MHz clock on an EP4CE10F17C8L FPGA. So far I have used an LPM_MULT to get a 14-bit result form a 10-bit and a 4-bit inputs and the latency is 2, but maybe I do not know if that would change in this case. 

 

Thanks in advance! 

--- Quote End ---  

 

 

The concept is this: 

 

suppose your input is x, coeff is y and you choose 14 bits resolution: 

c = c * 2^n/2^n; 

c = round(c*2^n)/2^n; 

 

the term round(c*2^n) you precompute it, call it y, then in design you multiply: 

x * y gives result, divide by 2^n i.e. discard n bits with/without rounding
0 Kudos
Altera_Forum
Honored Contributor II
5,210 Views

 

--- Quote Start ---  

The concept is this: 

 

suppose your input is x, coeff is y and you choose 14 bits resolution: 

c = c * 2^n/2^n; 

c = round(c*2^n)/2^n; 

 

the term round(c*2^n) you precompute it, call it y, then in design you multiply: 

x * y gives result, divide by 2^n i.e. discard n bits with/without rounding 

--- Quote End ---  

 

 

Thank you for your prompt reply. I am trying to understand that but I might be missing something... 

 

So if x = 150, c = 3 and n = 3 then y = 3*2^3 = 24 -> 150*24 = 3600 and if I shift that n positions then I get 450 when I should get something near 150/3 = 50. What did i get wrong? 

 

Edit: the result is nearly correct if I shift 2*n positions. Does it mean anything?
0 Kudos
Altera_Forum
Honored Contributor II
5,210 Views

 

--- Quote Start ---  

Thank you for your prompt reply. I am trying to understand that but I might be missing something... 

 

So if x = 150, c = 3 and n = 3 then y = 3*2^3 = 24 -> 150*24 = 3600 and if I shift that n positions then I get 450 when I should get something near 150/3 = 50. What did i get wrong? 

 

Edit: the result is nearly correct if I shift 2*n positions. Does it mean anything? 

--- Quote End ---  

 

 

if you want to divide by 3 then your target is 150/3 = 50 

in the design: 

convert 1/3 to 1/3 *2^n e.g. round(1/3 * 2^14) = 5461 (your n of 3 is too low) 

5461 * 150 = 819150 

 

discard n bits i.e. your result becomes 818150/2^14 = 49.996
0 Kudos
Altera_Forum
Honored Contributor II
5,210 Views

if you remove the question of bit resolution, it comes down to something as simple as this: 

 

OP = X/C 

 

which is the same as  

 

OP = X * 1/C 

 

1/C can be calculated prior to implementation, and you have a constant multiplier rather than constant divisor. 

 

Lets assume C = 5. 

 

obviously, 1/C is now less than 1, and cannot be represented in the same bits that X is represented in. Lets assume X is an 8 bit integer. 1/C requires extra bits on the RHS of the number. Lets assume X is 75 (b00101011) 

1/C = 0.2.  

 

lets assume we add another 8 bits for fractional bits. We can now generate our numbers by multiplying everything by 256. 

 

75 * 256 = b0010101100000000 (19200 you just shift by 8 bits) 

0.2 * 256 = b0000000000110011 (51.2, so we lost the 51) 

 

Now we multiply 51x19200 = b0000,0000,0000,1110,1111,0001,0000,0000 (979200) 

 

We now have 16 integer bits and 16 fraction bits (8.8 * 8.8 = 16.16) 

 

so we can drop the 16 fraction bits to get our answer, 0000000000001110 = 14 (15 if you add extra rounding logic - drop all but the 2^-1 bit and add 1). 

 

VHDL 2008 adds the fixed point libraries that mean keeping track of fraction and integer separation in your words is really simple (integer bits have positive index values and fraction are negative, to match the 2^n offset). Otherwise you need to manually keep track of the separation (by using imarginary multiplication offsets etc).
0 Kudos
Altera_Forum
Honored Contributor II
5,210 Views

 

--- Quote Start ---  

 

VHDL 2008 adds the fixed point libraries that mean keeping track of fraction and integer separation in your words is really simple (integer bits have positive index values and fraction are negative, to match the 2^n offset). Otherwise you need to manually keep track of the separation (by using imarginary multiplication offsets etc). 

--- Quote End ---  

 

 

Tricky, 

 

Software people mindset is geared to using fractional fixed point approach. FPGA engineers(old ones at least like myself) use fixed point as integer. 

The fractional approach as you know is imaginary in the mind only. Actual values are bits and I prefer to "imagine" them as integers (also imaginary). 

 

Manually keeping track of fraction is no problem at all and gives full visibility. Leaving that to tool such as fixed point library is an option but I find it very unnecessary and complicates type declaration when there is no problem to solve in the first place. 

 

DSPBuilder is done by software guys and they do force sometimes their mindset but options are there and it just confuses and adds more confusion using both schemes. It also leads to bugs.
0 Kudos
Altera_Forum
Honored Contributor II
5,210 Views

For low numbers of bits, you can also do it using a lookup table. 14bit is starting to get a bit large for this approach, but it could be done in 1 or 2 clock cycles using four M20K's as ROM. Your 14bit number is the address (14b=2^14=16k), and the 1 bit output from each of the four memories forms the result. If you need to do two different divisions, you can use the memory in dual-port mode and be able to do two simultaneous divisions with the same memory.

0 Kudos
Altera_Forum
Honored Contributor II
5,210 Views

 

--- Quote Start ---  

For low numbers of bits, you can also do it using a lookup table. 14bit is starting to get a bit large for this approach, but it could be done in 1 or 2 clock cycles using four M20K's as ROM. Your 14bit number is the address (14b=2^14=16k), and the 1 bit output from each of the four memories forms the result. If you need to do two different divisions, you can use the memory in dual-port mode and be able to do two simultaneous divisions with the same memory. 

--- Quote End ---  

 

 

yes. This is a general theme for any result. Any equation can be precomputed in software, stored then addressed.  

In some cases part of the result is stored to reduce memory supported by some computation (partial lookup)
0 Kudos
Altera_Forum
Honored Contributor II
5,210 Views

Thank you guys, your replies helped me a lot! I implemented the approach that kaz suggested using a lpm_mult and I wonder how many latency clock cycles I should choose. The clock frequency is 178.2 MHz, dataa is 15-bit long, datab is 10-bit long and the output is then 25-bit long, so I am using a built-in 18x18 DSP. I have been looking for some information about the latency but I still did not fin anything about how to choose the optimal latency. Depending on the latency I will need to delay some other signals accordingly, so I really need to know exactly what I am doing :P So how could I find out the number of cycles that are necessary to multiply two variables? Thank you again!

0 Kudos
Altera_Forum
Honored Contributor II
5,210 Views

You can use lpm_mult if you want, it gives you direct control over the primitive, but if it makes the code easier to read why not just stick a multiply directly in the code and let the synthesisor put the multiplier in? As long as you register the inputs and outputs it should infer the registers inside the DSP slot.  

 

If you have trouble with timing, the problem is more likely the routing in and out of the DSP. If this is the case, just add extra registers in the pipeline in your code. But At 178MHz, I dont think you'll need them.
0 Kudos
Altera_Forum
Honored Contributor II
5,210 Views

 

--- Quote Start ---  

If you have trouble with timing, the problem is more likely the routing in and out of the DSP. If this is the case, just add extra registers in the pipeline in your code. But At 178MHz, I dont think you'll need them. 

--- Quote End ---  

 

 

Thanks for your reply :) Does that mean that the 18x18 DSP would be able to get the result in just one cycle? Why would I ever need to pipeline then? I thought it would take at least 2-3 cycles to get the first result and then it would be possible to have a result every cycle. I probably misunderstood something!
0 Kudos
Altera_Forum
Honored Contributor II
5,210 Views

the actual multiply only takes a single clock (if you look at the handbook, the multiplier is actually an asynchronous component surrounded by registers in the DSP element).  

The pipelines are needed to make the routing shorter. The actual mutliply may be able to run at 400Mhz, but when you then try and take the result to some other logic somewhere in the chip, this is what can kill the FMax. Using more pipelining can make it easier for the fitter to shorten these paths between logic. 

 

Logic and registers are spread all over the chip, but the DSPs are only in specific parts of the chip in columns (to allow you to pass data from one DSP to another DSP with minimal routing). So it's usually getting data in/out of them that takes the time. Extra pipeline registers (ie registers with no logic between them) allows the fitter to place the registers right next to the DSP. Without them, it might be fighting between the DSP and some other complex logic elsewhere in the design.
0 Kudos
Altera_Forum
Honored Contributor II
5,210 Views

Setting the pipeline length also allows you to balance data paths lengths to match up with some other pipeline.

0 Kudos
Altera_Forum
Honored Contributor II
5,210 Views

Thanks a lot!! Your reply made everything crystal-clear!

0 Kudos
Reply