Why do some numbers lose accuracy when stored as floating point numbers?
For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:
32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875
How can such an apparently simple number be "too big" to express in 64 bits of memory?
In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:
5179139571476070 * 2 -49
Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.
9.2 may be simply 92/10, but 10 cannot be expressed as 2n if n is limited to integer values.
Seeing the Data
First, a few functions to see the components that make a 32- and 64-bit float. Gloss over these if you only care about the output (example in Python):
def float_to_bin_parts(number, bits=64):
if bits == 32: # single precision
int_pack = 'I'
float_pack = 'f'
exponent_bits = 8
mantissa_bits = 23
exponent_bias = 127
elif bits == 64: # double precision. all python floats are this
int_pack = 'Q'
float_pack = 'd'
exponent_bits = 11
mantissa_bits = 52
exponent_bias = 1023
else:
raise ValueError, 'bits argument must be 32 or 64'
bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]
There's a lot of complexity behind that function, and it'd be quite the tangent to explain, but if you're interested, the important resource for our purposes is the struct module.
Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double-precision has a separate type double, which is often implemented as 64 bits.
When we call that function with our example, 9.2, here's what we get:
>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']
Interpreting the Data
You'll see I've split the return value into three components. These components are:
Sign
Exponent
Mantissa (also called Significand, or Fraction)
Sign
The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.
Exponent
The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2(# of bits) - 1 - 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).
Mantissa
The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:
6.0221413x1023
The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:
1.0010011001100110011001100110011001100110011001100110
This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.
When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 252 to move it 52 places to the right:
0.0010011001100110011001100110011001100110011001100110
In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal; for more detail, see: 675539944105574 / 4503599627370496.)
Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.
Recapping the Components
Sign (first component): 0 for positive, 1 for negative
Exponent (middle component): Subtract 2(# of bits) - 1 - 1 to get the true exponent
Mantissa (last component): Divide by 2(# of bits) and add 1 to get the true mantissa
Calculating the Number
Putting all three parts together, we're given this binary number:
1.0010011001100110011001100110011001100110011001100110 x 1011
Which we can then convert from binary to decimal:
1.1499999999999999 x 23 (inexact!)
And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:
9.1999999999999993
Representing as a Fraction
9.2
Now that we've built the number, it's possible to reconstruct it into a simple fraction:
1.0010011001100110011001100110011001100110011001100110 x 1011
Shift mantissa to a whole number:
10010011001100110011001100110011001100110011001100110 x 1011-110100
Convert to decimal:
5179139571476070 x 23-52
Subtract the exponent:
5179139571476070 x 2-49
Turn negative exponent into division:
5179139571476070 / 249
Multiply exponent:
5179139571476070 / 562949953421312
Which equals:
9.1999999999999993
9.5
>>> float_to_bin_parts(9.5)
['0', '10000000010', '0011000000000000000000000000000000000000000000000000']
Already you can see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces.
Assemble the binary scientific notation:
1.0011 x 1011
Shift the decimal point:
10011 x 1011-100
Subtract the exponent:
10011 x 10-1
Binary to decimal:
19 x 2-1
Negative exponent to division:
19 / 21
Multiply exponent:
19 / 2
Equals:
9.5
Further reading
The Floating-Point Guide: What Every Programmer Should Know About Floating-Point Arithmetic, or, Why don’t my numbers add up? (floating-point-gui.de)
What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg 1991)
IEEE Double-precision floating-point format (Wikipedia)
Floating Point Arithmetic: Issues and Limitations (docs.python.org)
Floating Point Binary
This isn't a full answer (mhlester already covered a lot of good ground I won't duplicate), but I would like to stress how much the representation of a number depends on the base you are working in.
Consider the fraction 2/3
In good-ol' base 10, we typically write it out as something like
0.666...
0.666
0.667
When we look at those representations, we tend to associate each of them with the fraction 2/3, even though only the first representation is mathematically equal to the fraction. The second and third representations/approximations have an error on the order of 0.001, which is actually much worse than the error between 9.2 and 9.1999999999999993. In fact, the second representation isn't even rounded correctly! Nevertheless, we don't have a problem with 0.666 as an approximation of the number 2/3, so we shouldn't really have a problem with how 9.2 is approximated in most programs. (Yes, in some programs it matters.)
Number bases
So here's where number bases are crucial. If we were trying to represent 2/3 in base 3, then
(2/3)10 = 0.23
In other words, we have an exact, finite representation for the same number by switching bases! The take-away is that even though you can convert any number to any base, all rational numbers have exact finite representations in some bases but not in others.
To drive this point home, let's look at 1/2. It might surprise you that even though this perfectly simple number has an exact representation in base 10 and 2, it requires a repeating representation in base 3.
(1/2)10 = 0.510 = 0.12 = 0.1111...3
Why are floating point numbers inaccurate?
Because often-times, they are approximating rationals that cannot be represented finitely in base 2 (the digits repeat), and in general they are approximating real (possibly irrational) numbers which may not be representable in finitely many digits in any base.
While all of the other answers are good there is still one thing missing:
It is impossible to represent irrational numbers (e.g. π, sqrt(2), log(3), etc.) precisely!
And that actually is why they are called irrational. No amount of bit storage in the world would be enough to hold even one of them. Only symbolic arithmetic is able to preserve their precision.
Although if you would limit your math needs to rational numbers only the problem of precision becomes manageable. You would need to store a pair of (possibly very big) integers a and b to hold the number represented by the fraction a/b. All your arithmetic would have to be done on fractions just like in highschool math (e.g. a/b * c/d = ac/bd).
But of course you would still run into the same kind of trouble when pi, sqrt, log, sin, etc. are involved.
TL;DR
For hardware accelerated arithmetic only a limited amount of rational numbers can be represented. Every not-representable number is approximated. Some numbers (i.e. irrational) can never be represented no matter the system.
There are infinitely many real numbers (so many that you can't enumerate them), and there are infinitely many rational numbers (it is possible to enumerate them).
The floating-point representation is a finite one (like anything in a computer) so unavoidably many many many numbers are impossible to represent. In particular, 64 bits only allow you to distinguish among only 18,446,744,073,709,551,616 different values (which is nothing compared to infinity). With the standard convention, 9.2 is not one of them. Those that can are of the form m.2^e for some integers m and e.
You might come up with a different numeration system, 10 based for instance, where 9.2 would have an exact representation. But other numbers, say 1/3, would still be impossible to represent.
Also note that double-precision floating-points numbers are extremely accurate. They can represent any number in a very wide range with as much as 15 exact digits. For daily life computations, 4 or 5 digits are more than enough. You will never really need those 15, unless you want to count every millisecond of your lifetime.
Why can we not represent 9.2 in binary floating point?
Floating point numbers are (simplifying slightly) a positional numbering system with a restricted number of digits and a movable radix point.
A fraction can only be expressed exactly using a finite number of digits in a positional numbering system if the prime factors of the denominator (when the fraction is expressed in it's lowest terms) are factors of the base.
The prime factors of 10 are 5 and 2, so in base 10 we can represent any fraction of the form a/(2b5c).
On the other hand the only prime factor of 2 is 2, so in base 2 we can only represent fractions of the form a/(2b)
Why do computers use this representation?
Because it's a simple format to work with and it is sufficiently accurate for most purposes. Basically the same reason scientists use "scientific notation" and round their results to a reasonable number of digits at each step.
It would certainly be possible to define a fraction format, with (for example) a 32-bit numerator and a 32-bit denominator. It would be able to represent numbers that IEEE double precision floating point could not, but equally there would be many numbers that can be represented in double precision floating point that could not be represented in such a fixed-size fraction format.
However the big problem is that such a format is a pain to do calculations on. For two reasons.
If you want to have exactly one representation of each number then after each calculation you need to reduce the fraction to it's lowest terms. That means that for every operation you basically need to do a greatest common divisor calculation.
If after your calculation you end up with an unrepresentable result because the numerator or denominator you need to find the closest representable result. This is non-trivil.
Some Languages do offer fraction types, but usually they do it in combination with arbitary precision, this avoids needing to worry about approximating fractions but it creates it's own problem, when a number passes through a large number of calculation steps the size of the denominator and hence the storage needed for the fraction can explode.
Some languages also offer decimal floating point types, these are mainly used in scenarios where it is imporant that the results the computer gets match pre-existing rounding rules that were written with humans in mind (chiefly financial calculations). These are slightly more difficult to work with than binary floating point, but the biggest problem is that most computers don't offer hardware support for them.
Related
Why does the following program print what it prints?
class Program
{
static void Main(string[] args)
{
float f1 = 0.09f*100f;
float f2 = 0.09f*99.999999f;
Console.WriteLine(f1 > f2);
}
}
Output is
false
Floating point only has so many digits of precision. If you're seeing f1 == f2, it is because any difference requires more precision than a 32-bit float can represent.
I recommend reading What Every Computer Scientist Should Read About Floating Point
The main thing is this isn't just .Net: it's a limitation of the underlying system most every language will use to represent a float in memory. The precision only goes so far.
You can also have some fun with relatively simple numbers, when you take into account that it's not even base ten. 0.1 (1/10th), for example, is a repeating decimal when represented in binary, just as 1/3rd is when represented in decimal.
In this particular case, it’s because .09 and .999999 cannot be represented with exact precision in binary (similarly, 1/3 cannot be represented with exact precision in decimal). For example, 0.111111111111111111101111 base 2 is 0.999998986721038818359375 base 10. Adding 1 to the previous binary value, 0.11111111111111111111 base 2 is 0.99999904632568359375 base 10. There isn’t a binary value for exactly 0.999999. Floating point precision is also limited by the space allocated for storing the exponent and the fractional part of the mantissa. Also, like integer types, floating point can overflow its range, although its range is larger than integer ranges.
Running this bit of C++ code in the Xcode debugger,
float myFloat = 0.1;
shows that myFloat gets the value 0.100000001. It is off by 0.000000001. Not a lot, but if the computation has several arithmetic operations, the imprecision can be compounded.
imho a very good explanation of floating point is in Chapter 14 of Introduction to Computer Organization with x86-64 Assembly Language & GNU/Linux by Bob Plantz of California State University at Sonoma (retired) http://bob.cs.sonoma.edu/getting_book.html. The following is based on that chapter.
Floating point is like scientific notation, where a value is stored as a mixed number greater than or equal to 1.0 and less than 2.0 (the mantissa), times another number to some power (the exponent). Floating point uses base 2 rather than base 10, but in the simple model Plantz gives, he uses base 10 for clarity’s sake. Imagine a system where two positions of storage are used for the mantissa, one position is used for the sign of the exponent* (0 representing + and 1 representing -), and one position is used for the exponent. Now add 0.93 and 0.91. The answer is 1.8, not 1.84.
9311 represents 0.93, or 9.3 times 10 to the -1.
9111 represents 0.91, or 9.1 times 10 to the -1.
The exact answer is 1.84, or 1.84 times 10 to the 0, which would be 18400 if we had 5 positions, but, having only four positions, the answer is 1800, or 1.8 times 10 to the zero, or 1.8. Of course, floating point data types can use more than four positions of storage, but the number of positions is still limited.
Not only is precision limited by space, but “an exact representation of fractional values in binary is limited to sums of inverse powers of two.” (Plantz, op. cit.).
0.11100110 (binary) = 0.89843750 (decimal)
0.11100111 (binary) = 0.90234375 (decimal)
There is no exact representation of 0.9 decimal in binary. Even carrying the fraction out more places doesn’t work, as you get into repeating 1100 forever on the right.
Beginning programmers often see floating point arithmetic as more
accurate than integer. It is true that even adding two very large
integers can cause overflow. Multiplication makes it even more likely
that the result will be very large and, thus, overflow. And when used
with two integers, the / operator in C/C++ causes the fractional part
to be lost. However, ... floating point representations have their own
set of inaccuracies. (Plantz, op. cit.)
*In floating point, both the sign of the number and the sign of the exponent are represented.
I have a problem understanding the precision of float type.
The msdn writes that precision from 6 to 9 digits. But I note that precision depends from on the size of the number:
float smallNumber = 1.0000001f;
Console.WriteLine(smallNumber); // 1.0000001
bigNumber = 100000001f;
Console.WriteLine(bigNumber); // 100000000
The smallNumber is more precise than big, I understand IEEE754, but I don't understand how MSDN calculate precision, and does it make sense?
Also, you can play with the representation of numbers in float format here. Please write 100000000 value in "You entered" input and click "+1" on the right. Then change the input's value to 1, and click "+1" again. You may see the difference in precision.
The MSDN documentation is nonsensical and wrong.
Bad concept. Binary-floating-point format does not have any precision in decimal digits because it has no decimal digits at all. It represents numbers with a sign, a fixed number of binary digits (bits), and an exponent for a power of two.
Wrong on the high end. The floating-point format represents many numbers exactly, with infinite precision. For example, “3” is represented exactly. You can write it in decimal arbitrarily far, 3.0000000000…, and all of the decimal digits will be correct. Another example is 1.40129846432481707092372958328991613128026194187651577175706828388979108268586060148663818836212158203125•10−45. This number has 105 significant digits in decimal, but the float format represents it exactly (it is 2−149).
Wrong on the low end. When “999999.97” is converted from decimal to float, the result is 1,000,000. So not even one decimal digit is correct.
Not a measure of accuracy. Because the float significand has 24 bits, the resolution of its lowest bit is about 223 times finer than the resolution of its highest bit. This is about 6.9 digits in the sense that log10223 is about 6.9. But that just tells us the resolution—the coarseness—of the representation. When we convert a number to the float format, we get a result that differs from the number by at most ½ of this resolution, because we round to the nearest representable value. So a conversion to float has a relative error of at most 1 part in 224, which corresponds to about 7.2 digits in the above sense. If we are using digits to measure resolution, then we say the resolution is about 7.2 digits, not that it is 6-9 digits.
Where do these numbers came from?
So, if “~6-9 digits” is not a correct concept, does not come from actual bounds on the digits, and does not measure accuracy, where does it come from? We cannot be sure, but 6 and 9 do appear in two descriptions of the float format.
6 is the largest number x for which this is guaranteed:
If any decimal numeral with at most x significant digits is within the normal exponent bounds of the float format and is converted to the nearest value represented in the format, then, when the result is converted to the nearest decimal numeral with at most x significant digits, the result of that conversion equals the original number.
So it is reasonable to say float can preserve at least six decimal digits. However, as we will see, there is no bound involving nine digits.
9 is the smallest number x that guarantees this:
If any finite float number is converted to the nearest decimal numeral with x digits, then, when the result is converted to the nearest value representable in float, the result of that conversion equals the original number.
As an analogy, if float is a container, then the largest “decimal container” guaranteed to fit inside it is six digits, and the smallest “decimal container” guaranteed to hold it is nine digits. 6 and 9 are akin to interior and exterior measurements of the float container.
Suppose you had a block 7.2 units long, and you were looking at its placement on a line of bricks each 1 unit long. If you put the start of the block at the start of a brick, it will extend 7.2 bricks. However, if somebody else chooses where it starts, they might start it in the middle of a brick. Then it would cover part of that brick, all of the next 6 bricks, and and part of the last brick (e.g., .5 + 6 + .7 = 7.2). So a 7.2-unit block is only guaranteed to cover 6 bricks. Conversely, 8 bricks can cover the 7.2-unit block if you choose where they are placed. But if somebody else chooses where they start, the first might cover just .1 units of the block. Then you need 7 more and another fraction, so 9 bricks are needed.
The reason this analogy holds is that powers of two and powers of 10 are irregularly spaced relative to each other. 210 (1024) is near 103 (1000). 10 is the exponent used in the float format for numbers from 1024 (inclusive) to 2048 (exclusive). So this interval from 1024 to 2048 is like a block that has been placed just after the 100-1000 ends and the 1000-10,000 block starts.
But note that this property involving 9 digits is the exterior measurement—it is not a capability that float can perform or a service that it can provide. It is something that float needs (if it is to be held in a decimal format), not something it provides. So it is not a bound on how many digits a float can store.
Further Reading
For better understanding of floating-point arithmetic, consider studying the IEEE-754 Standard for Floating-Point Arithmetic or a good textbook like Handbook of Floating-Point Arithmetic by Jean-Michel Muller et al.
Yes number of digits before rounding errors is a measure of precision but you can not asses precision from just 2 numbers because you might be just closer or further from the rounding threshold.
To better understand the situation then you need to see how floats are represented.
The IEEE754 32bit floats are stored as:
bool(1bit sign) * integer(24bit mantisa) << integer(8bit exponent)
Yes mantissa is 24 bit instead of 23 as it's MSB is implicitly set to 1.
As you can see there are only integers and bitshift. So if you are representing natural number up to 2^24 you are without rounding completely. Fro bigger numbers binary zero padding occurs from the right that causes the difference.
In case of digits after decimal points the zero padding occurs from the left. But there is another problem as in binary you can not store some decadic numbers exactly. For example:
0.3 dec = 0.100110011001100110011001100110011001100... bin
0.25 dec = 0.01 bin
As you can see the sequence of 0.3 dec in binary is infinite (like we can not write 1/3 in decadic) hence if crop it to only 24 bits you lose the rest and the number is not what you want anymore.
If you compare 0.3 and 0.125 the 0.125 is exact and 0.3 is not but 0.125 is much smaller than 0.3. So your measure is not correct unless explored more very close values that will cover the rounding steps and computing the max difference from such set. For example you could compare
1.0000001f
1.0000002f
1.0000003f
1.0000004f
1.0000005f
1.0000006f
1.0000007f
1.0000008f
1.0000009f
and remember the max difference of fabs(x-round(x)) and than do the same for
100000001
100000002
100000003
100000004
100000005
100000006
100000007
100000008
100000009
And then compare the two differences.
On top of all this you are missing one very important thing. And that is the errors while converting from text to binary and back which are usually even bigger. First of all try to print your numbers without rounding (for example force to print 20 decimal digits after decimal point).
Also the numbers are stored in binary base so in order to print them you need to convert to decadic base which involves multiplication and division by 10. The more bits are missing (zero pad) from the number the bigger the print errors are. To be as precise as you can a trick is used and that is to print the number in hex (no rounding errors) and then convert the hex string itself to decadic base on integer math. That is much more accurate then naive floating point prints. for more info see related QAs:
my best attempt to print 32 bit floats with least rounding errors (integer math only)
How do libraries/programming languages convert floats to strings
How do I convert a very long binary number to decimal?
Now to get back to number of "precise" digits represented by float. For integer part of number is that easy:
dec_digits = floor(log10(2^24)) = floor(7.22) = 7
However for digits after decimal point is this not as precise (for first few decadic digits) as there are a lot rounding going on. For more info see:
How do you print the EXACT value of a floating point number?
I think what they mean in their documentation is that depending on the number that the precision ranges from 6 to 9 decimal places. Go by the standard that is explained on the page you linked, sometimes Microsoft are a bit lazy when it comes to documentation, like the rest of us.
The problem with floating point is that it is inaccurate. If you put the number 1.05 into the site in your link you will notice that it cannot be accurately stored in floating point. It's actually stored as 1.0499999523162841796875. It's stored this way to do calculations faster. It's not great for money, e.g. what if your item is priced at $1.05 and you sell a billion of them.
The smallNumber is more precise than big
Incorrect compare. The other number has more significant digits.
1.0000001f is attempting N digits of decimal precision.
100000001f attempts N+1.
I have a problem understanding the precision of float type.
To best understand float precision, think binary. Use "%a" for printing with a C99 or later compiler.
float is stored base 2. The significand is a Dyadic rational, some integer/power-of-2.
float commonly has 24 bits of binary precision. (23-bit explicitly encoded, 1 implied)
Between [1.0 ... 2.0), there are 223 different float values.
Between [2.0 ... 4.0), there are 223 different float values.
Between [4.0 ... 8.0), there are 223 different float values.
...
The possible values of a float are not distributed uniformly among powers-of-10. The grouping of float values to power-of-10 (decimal precision) results in the wobbling 6 to 9 decimal digits of precision.
How to calculate float type precision?
To find the difference between subsequent float values, since C99, use nextafterf()
Illustrative code:
#include<math.h>
#include<stdio.h>
void foooo(float b) {
float a = nextafterf(b, 0);
float c = nextafterf(b, b * 2.0f);
printf("%-15a %.9e\n", a, a);
printf("%-15a %.9e\n", b, b);
printf("%-15a %.9e\n", c, c);
printf("Local decimal precision %.2f digits\n", 1.0 - log10((c - b) / b));
}
int main(void) {
foooo(1.0000001f);
foooo(100000001.0f);
return 0;
}
Output
0x1p+0 1.000000000e+00
0x1.000002p+0 1.000000119e+00
0x1.000004p+0 1.000000238e+00
Local decimal precision 7.92 digits
0x1.7d783ep+26 9.999999200e+07
0x1.7d784p+26 1.000000000e+08
0x1.7d7842p+26 1.000000080e+08
Local decimal precision 8.10 digits
I may very well have not the proper understanding of significant figures, but the book
C# 6.0 in a Nutshell by Joseph Albahari and Ben Albahari (O’Reilly).
Copyright 2016 Joseph Albahari and Ben Albahari, 978-1-491-92706-9.
provides the table below for comparing double and decimal:
Is it not counter-intuitive that, on the one hand, a double can hold a smaller quantity of significant figures, while on the other it can represent numbers way bigger than decimal, which can hold a higher quantity of significant figures ?
Imagine you were told you can store a value, but were given a limitation: You can only store 10 digits, 0-9 and a negative symbol. You can create the rules to decode the value, so you can store any value.
The first way you store things is simply as the value xxxxxxxxxx, meaning the number 123 is stored as 0000000123. Simple to store and read. This is how an int works.
Now you decide you want to store fractional numbers, so you change the rules a bit. Now you store xxxxxxyyyy, where x is the integer portion and y is the fractional portion. So, 123.98 would be stored as 0001239800. This is roughly how a Decimal value works. You can see the largest value I can store is 9999999999, which translates to 999999.9999. This means I have a hard upper limit on the size of the value, but the number of the significant digits is large at 10.
There is a way to store larger values, and that's to store the x and y components for the formula in xxxxxxyyyy. So, to store 123.98, you need to store 01239800-2, which I can calculate as . This means I can store much bigger numbers by changing 'y', but the number of significant digits is basically fixed at 6. This is basically how a double works.
The answer lies in the way that doubles are encoded. Rather than just being a direct binary representation of a number, they have 3 parts: sign, exponent, and fraction.
The sign is obvious, it controls + or -.
The fraction part is also obvious. It's binary fraction that represents a number in between 0 and 1.
The exponent is where the magic happens. It signifies a scaling factor.
The final float calculation comes out to (-1)^$sign * (1 + $fraction) * 2 ^$exponent
This allows much higher values than a straight decimal number because of the exponent. There's a lot of reading out there on why this works and how to do addition and multiplication with these encoded numbers. Google around for "IEEE floating point format" or whatever topic you need. Hope that helps!
The Range has nothing to do with the precision. Double has a binary representation (base 2). Not all numbers can be represented exactly as we humans know them in the decimal format. Not to mention and accumulated rounding errors of addition and division. A larger range means a greater MAX VALUE and a smaller MIN VALUE than decimal.
Decimal on the other side is (base 10). It has a smaller range (smaller MAX VALUE and larger MIN VALUE). This has nothing to do with precision, since it is not represented using floating binary point representation, it can represent numbers more precisely and though is recommended for human-made numbers and calculations.
Would a higher range of floats be more accurate to multiply / divide / add / subtract, than a lower range.
For example, would 567.56 / 345.54 be more accurate than .00097854 / .00021297 ?
The answer to your question is "no." Floating point numbers are (usually*) represented with a normalized mantissa and an exponent. Multiplication and division operate first on the normalized mantissa, then on the exponents.
Addition and subtraction are, of course, another story. Operations like your examples:
567.56 + 345.54 or .00097854 - .00021297
work fine. But operations with disparate orders of magnitude like
567.56 + .00097854 or 345.54 - .00021297
may lose some low-order precision.
The IEEE Floating point standards includes denormalized numbers. If you are an astrophysicist or runtime-library developer, you may need to understand them. See http://en.wikipedia.org/wiki/Denormal_number
For IEEE 754 binary floating-point numbers (the most common), floating-point values have the same number of bits in the significand throughout most of the exponent range. However, there is a portion of the range where the significand has effectively fewer bits. And the relative error caused by rounding does vary depending on where the significand lies within its range.
IEEE 754 floating-point numbers are represented by a sign (+1 or -1, encoded as 0 or 1), an exponent (for double-precision, -1022 to 1023, encoded as the exponent plus 1023, so 1 to 2046), and a significand (for double-precision, a fraction usually from 1 to just under 2, represented with 53 bits but encoded with 52 bits because the first bit is implicitly 1).
E.g., the number 6.5 is encoded with the bits 0 (sign +1), 10000000001 (exponent 2), and 1010000000000000000000000000000000000000000000000000 (binary fraction 1.1010, hex 1.a, decimal 1.3125). We can write this in hexadecimal floating-point as 0x1.ap2 (hex fraction 1.a multiplied by 2 to the power of decimal 2). Writing in hexadecimal floating-point enables humans to see the floating-point representation fairly easily.
For the exponent, the encoding values of 0 and 2047 are special. When the encoding is 0, the exponent is the same as when the encoding is 1 (-1022), but the implicit bit of the fraction is 0 instead of 1. When the encoding is 2047, the floating-point object represents infinity (if the significand bits are all zero) or a NaN (otherwise).
When the encoded exponent is 0 and the significand bits are all zero, the number represents zero (with +0 and -0 distinguished by the sign). If the significand bits are not all zero, the number is said to be denormalized. This is because most numbers are “normalized” by adjusting the exponent so that the fraction is between 1 (inclusive) and 2 (exclusive). For denormalized numbers, the fraction is less than 1; it starts with “0.” instead of “1.”.
When the result of a floating-point operation is a denormalized number, it effectively has fewer bits in the significand. Thus, as numbers drop below 0x1p-1022 (2-1022), the effective precision decreases.
When numbers are in the normal range (not underflowing to denormals and not overflowing to infinity), then there are no differences in the significands of numbers with different exponents, so:
(2a+2b)/2 has exactly the same result as a+b.
(2a-2b)/2 has exactly the same result as a-b.
(2ab)/2 has exactly the same result as ab.
Note, however, that the relative error can change. When a floating-point operation is performed, the exact mathematical result must be rounded to a representable value. This rounding can happen only in units representable by the significand. For a given exponent, the bits in the significand have a fixed value. So the last bit in the significand represents a certain value. That value is a greater portion of a significand near 1 than it is of a significand near 2.
For a double-precision result, the unit of least precision (ULP) is 1 part in 252 of the value of the greatest bit in the significand. When using round-to-nearest mode (the most common default), the greatest error is at most half of that, because, if the representable number in one direction is more than half an ULP away, the number in the other direction is less than half an ULP away. And the closer number is returned by a proper floating-point operation.
Thus, the maximum relative error in a result with a significand near 1 is slightly over 2-53, but the maximum relative error in a result with a significand near 2 is slightly under 2-54.
For the sake of completeness, I have to disagree a bit and say Yes, it may matter somehow...
Indeed, if you perform 56756.0 / 34554.0, then you'll get the nearest representable Float to the exact mathematical result, with a single floating point rounding "error".
This is because 56756.0 and 34554.0 are representable exactly in floating point (single or double precision IEEE 754), and because according to IEEE 754 standard, operations perform an exact rounding operation (in default mode to the nearest).
If you write 567.56 / 345.54, then both numbers are not represented exactly in floating point in radix 2, so the result of this operation is cumulating 3 floating point rounding "errors".
Let's compare the result in Squeak Smalltalk in double precision (Float), converted to exact arithmetic (Fraction with arbitrary integer length at numerator and denominator):
((56756.0 / 34554.0) asFraction - (56756 / 34554)) asFloat.
-> -7.932275867322412e-17
So far, so good, the magnitude of error is less than or equal to half an ulp, as promised by IEEE 754:
(56756 / 34554) asFloat ulp / 2
-> 1.1102230246251565e-16
With cumulated rounding errors, you may get a larger error (but never a smaller):
((567.56 / 345.54) asFraction - (56756 / 34554)) asFloat
-> -3.0136736359825544e-16
((0.00056756 / 0.00034554) asFraction - (56756 / 34554)) asFloat
-> 3.647664511768385e-16
Above example is hard to generalize, and I perfectly agree with other answers: generally, NO, you should only care of relative precision.
... Unless maybe if you want to implement some function with very strict tolerance about round off errors...
No. In the sense that there's the same number of significant digits available no matter what the order of magnitude (exponent part) of your number is.
Is there any practical difference between the .net decimal values 1m and 1.0000m?
The internal storage is different:
1m : 0x00000001 0x00000000 0x00000000 0x00000000
1.0000m : 0x000186a0 0x00000000 0x00000000 0x00050000
But, is there a situation where the knowledge of "significant digits" would be used by a method in the BCL?
I ask because I'm working on a means of compressing the space required for decimal values for disk storage or network transport and am toying with the idea of "normalizing" the value before I store it to improve it's compressability. But, I'd like to know if it is likely to cause issues down the line. I'm guessing that it should be fine, but only because I don't see any methods or properties that expose the precision of the value. Does anyone know otherwise?
The reason for the difference in encoding is because the Decimal data type stores the number as a whole number (96 bit integer), with a scale which is used to form the divisor to get the fractional number. The value is essentially
integer / 10^scale
Internally the Decimal type is represented as 4 Int32, see the documentation of Decimal.GetBits for more detail. In summary, GetBits returns an array of 4 Int32s, where each element represents the follow portion of the Decimal encoding
Element 0,1,2 - Represent the low, middle and high 32 bits on the 96 bit integer
Element 3 - Bits 0-15 Unused
Bits 16-23 exponent which is the power of 10 to divide the integer by
Bits 24-30 Unused
Bit 31 the sign where 0 is positive and 1 is negative
So in your example, very simply put when 1.0000m is encoded as a decimal the actual representation is 10000 / 10^4 while 1m is represented as 1 / 10^0 mathematically the same value just encoded differently.
If you use the native .NET operators for the decimal type and do not manipulate/compare the bit/bytes yourself you should be safe.
You will also notice that the string conversions will also take this binary representation into consideration and produce different strings so you need to be careful in that case if you ever rely on the string representation.
The decimal type tracks scale because it's important in arithmetic. If you do long multiplication, by hand, of two numbers — for instance, 3.14 * 5.00 — the result has 6 digits of precision and a scale of 4.
To do the multiplication, ignore the decimal points (for now) and treat the two numbers as integers.
3.14
* 5.00
------
0000 -- 0 * 314 (0 in the one's place)
00000 -- 0 * 314 (0 in the 10's place)
157000 -- 5 * 314 (5 in the 100's place)
------
157000
That gives you the unscaled results. Now, count the total number of digits to the right of the decimal point in the expression (that would be 4) and insert the decimal point 4 places to the left:
15.7000
That result, while equivalent in value to 15.7, is more precise than the value 15.7. The value 15.7000 has 6 digits of precision and a scale of 4; 15.7 has 3 digits of precision and a scale of 1.
If one is trying to do precision arithmetic, it is important to track the precision and scale of your values and results as it tells you something about the precision of your results (note that precision isnt' the same as accuracy: measure something with a ruler graduated in 1/10ths of an inch and the best you can say about the resulting measurement, no matter how many trailing zeros you put to the right of the decimal point is that it is accurate to, at best, a 1/10th of an inch. Another way of putting it would be to say that your measurement is accurate, at best, within +/- 5/100ths of the stated value.
The only reason I can think of is so invoking `ToString returns the exact textual representation in the source code.
Console.WriteLine(1m); // 1
Console.WriteLine(1.000m); // 1.000