Hi, I want to know the difference between floating-point and fixed-point arithmetic. I searched on Google but didn't get a clear idea of when to use which type of arithmetic. Can anyone please help me understand it in simple language? Thank you in advance.
Hi, let me first recall integer arithmetic: you have a predefined number of bits, which defines the range in which you can do calculations; e.g. with an 8-bit signed integer you can have numbers in the range -128 to 127. In other words: if you want to do maths where the number 500 might occur, better use 16-bit numbers.

In fixed-point arithmetic you do the same, but you also define that a few of those bits represent a fractional part. Let's say you have an 8-bit number, of which 4 bits represent the integer part and the other 4 represent the digits behind the binary point. You could represent numbers like 3.25, which is 0011b for the integer part and 0100b for the fraction. Why 0100b? The fraction bits weigh 1/2, 1/4, 1/8 and 1/16, and 0.25 is no half (0), one quarter (1), and nothing else (00). Actually, you can synthesize this stuff, e.g. using the "fixed" data types (sfixed/ufixed from fixed_pkg) in VHDL 2008. So all you need to do is define the range of your numbers, calculate how many bits you need, encode that, done.

I think it's easier if I explain this in decimal rather than in binary: if I want to do calculations where I expect numbers to go down to 0.001 and up to 1000, I must have at least 7 decimal digits (seven as in "0000.000"). That's a little over 20 bits if you simply convert digit by digit (turns out exactly 20 bits are sufficient if you do the math in binary: 10 integer bits reach 1023, and 10 fraction bits give a step of 1/1024, just under 0.001). Plus one more bit for a sign, okay.

Sounds nice, right? Except when your mathematical problems are in a domain where both large and small numbers occur. In physics you often deal with ranges from 10^-15 up to 10^12 (I just made that up, but you know what I mean: there's nano-this and giga-that). Does that mean you need around 90 bits (27 decimal digits) just to be sure you can cover each of those cases? No. Because you never do maths with more than, let's say, 3 significant digits. What do I mean by that? I typically use numbers like 3.14 nano, or 3.14*10^-9. See what I did? I split that ugly number of 12 decimal digits (0.00000000314) into something I can represent with only 4 digits and a sign bit: 3.14 and -9. You can do the same in binary.
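To make the two encodings above concrete, here is a minimal Python sketch (an illustration only, not something from the original post). The helper names `to_q4_4` and `split_decimal` are made up for this example: the first packs a value into the 8-bit layout with 4 integer and 4 fraction bits, the second splits a number into a short decimal mantissa and a power of ten.

```python
import math

def to_q4_4(x):
    """Encode a non-negative value as 8-bit fixed point: 4 integer bits, 4 fraction bits."""
    raw = round(x * 16)  # scaling by 2**4 turns the 4 fraction bits into integer bits
    if not 0 <= raw <= 0xFF:
        raise ValueError("value does not fit into 4.4 unsigned fixed point")
    return raw

def split_decimal(x, digits=3):
    """Split x into (mantissa, exponent) with `digits` significant decimal digits."""
    if x == 0:
        return 0.0, 0
    exponent = math.floor(math.log10(abs(x)))
    mantissa = round(x / 10**exponent, digits - 1)
    if abs(mantissa) >= 10:  # rounding can push e.g. 9.999 up to 10.0
        mantissa /= 10
        exponent += 1
    return mantissa, exponent

print(format(to_q4_4(3.25), "08b"))   # 00110100 -> 0011 (integer 3), 0100 (one quarter)
print(split_decimal(0.00000000314))   # (3.14, -9)
```

The fixed-point helper always spends the same 4 bits on the fraction, while the split keeps only the significant digits and remembers separately where the decimal point belongs.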
This is called floating point, and it means that you use a fixed-point number (in my example it would represent the 3.14) and an integer number (-9 in my example). However, as you might imagine, working with floating point is a bit more tricky (it's a fixed-point number plus an integer, and you need to calculate powers!). In hardware terms: you'll need a lot more logic and registers to do calculations with floating point.

In my example above, I said I use four decimal digits; that's around 12 bits. Plus one for the sign bit, plus another sign bit for the power of 10. That's around 14 bits. In practice you normally use 32-bit numbers (as in the C type "float").

Oops, I forgot: in my example, how do you represent the number 8192? That's 8.192*10^3, right? But I said you only have 3 digits for the number, so okay, round that to 8.19. Now your "floating-point number" represents the value 8190. What can you do about that? Well, you could increase the number to, say, 5 digits (now we're around 18 binary digits).

Still not clear? Here it is in a nutshell:
Fixed point is really just integer arithmetic with an implied binary-point offset, i.e. a fixed scale factor (it's not just similar, it is completely identical). It uses two's complement for signed arithmetic. https://en.wikipedia.org/wiki/two%27s_complement

Floating point uses a completely different number format. It uses a sign bit, an exponent and a mantissa. https://en.wikipedia.org/wiki/floating_point

Fixed point has a fixed range with fixed precision, both set by the number of bits. Floating point also has a fixed range, but its precision floats with the magnitude of the value; the bit width is always fixed.

Compared with fixed point, floating point is computationally expensive: it uses a lot of resources and has high latency. Fixed point is cheap: few resources and low latency.
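A rough Python sketch of both points, as an illustration under stated assumptions: the 4.4 bit layout (`FRAC_BITS = 4`) and all helper names are chosen for this example, and the floating-point part assumes the 32-bit IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 mantissa bits) that `struct`'s `"f"` format uses.

```python
import struct

FRAC_BITS = 4             # assumed 4.4 layout: 4 integer bits, 4 fraction bits
SCALE = 1 << FRAC_BITS    # 2**4 = 16

def to_fixed(x):
    """Real value -> raw integer holding the fixed-point number."""
    return round(x * SCALE)

def to_real(raw):
    """Raw fixed-point integer -> real value."""
    return raw / SCALE

def fixed_add(a, b):
    """Addition is plain integer addition; the binary point stays in place."""
    return a + b

def fixed_mul(a, b):
    """Integer multiply doubles the fraction bits, so shift back by FRAC_BITS.

    The shift drops the extra low bits (rounds toward negative infinity).
    """
    return (a * b) >> FRAC_BITS

def float32_fields(x):
    """Split a single-precision value into (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

a, b = to_fixed(3.25), to_fixed(1.5)
print(to_real(fixed_add(a, b)))                # 4.75
print(to_real(fixed_mul(a, b)))                # 4.875
print(format(to_fixed(-1.25) & 0xFF, "08b"))   # 11101100, two's complement of -1.25
print(float32_fields(3.25))                    # (0, 128, 5242880)
```

The fixed-point operations are ordinary integer add, multiply and shift, which is why they map to so little hardware. The floating-point value has to be taken apart into three fields before any arithmetic can happen, and put back together (with normalization and rounding) afterwards.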
To me the main point is that floating point is needed in the few critical cases where small values must be represented fairly. The drawback of fixed point is its unfair representation of small values, which occupy only a few LSBs and waste many MSBs. For example, with 8 bits unsigned, the difference between 200 and 201 can be represented, i.e. a relative step of 1/200 is resolved. But we can't carry the same relative resolution over to lower values, such as 1/200 of 3 (0.015), because the smallest step is still 1.
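The same comparison can be put into numbers with a small Python sketch. It is only an illustration: `float32_step` is a helper written for this example, and it assumes IEEE 754 single precision as the floating-point format.

```python
import struct

def float32_step(x):
    """Gap between the positive float32 value x and the next larger float32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    cur = struct.unpack(">f", struct.pack(">I", bits))[0]
    nxt = struct.unpack(">f", struct.pack(">I", bits + 1))[0]
    return nxt - cur

for value in (200, 3):
    int_rel = 1 / value                            # 8-bit unsigned: smallest step is always 1
    flt_rel = float32_step(float(value)) / value   # float32: step shrinks with the value
    print(value, int_rel, flt_rel)
```

For the 8-bit integer the relative step grows from 0.5% at 200 to about 33% at 3. For float32 it stays below 2^-23 (about 1.2e-7) at both values, because the absolute step shrinks together with the number.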
Thank you everyone for your responses. It really helped.