Is -10 between 1.5e-45 and 3.4e+38?
If yes, explain to me why. I am no more that good in Maths. So, sorry for the weakness of these questions level.
According to C# documentation (of computer programming for that language) on Microsoft site, their variables float type is between 1.5 × 1e-45 and 3.4 × 1e+38?
But, 1e-3 equals 0.001 and 1e-6 equals 0.000001. That means the more we decrease a negative exponent, the less will be the resulting value but that one will still greater than zero.
Here comes the problem, I tried to use a float variable giving to it -10 as value expecting to get an error but to my surprise, -10 was accepted.
I am confused.
The documentation doesn't say so explicitly, but the two floating-point types are signed. So -10 can be represented by a floating-point type just as 10 can.
There's a few things to note.
The first is that float (and for that matter, double) is signed, so the values can be either positive or negative (or zero).
The other is that the range is a matter of precision. If you try to set a float to a value of less absolute value than than it can handle, like 1E-50 it will be set to zero rather than error; which is what you get when you round 1×10-50 to the precision it can cope with, while if you try to give it a value larger than the absolute value it can handle, like 1E50 then it will be set to ∞ (or -∞ for -1E50) because again that's as precisely as it can represent something that large.
That C# documentation is poorly written and has some errors or omissions:
The range it describes is only for the magnitude of positive finite numbers. The floating-point formats also include zero, negative numbers, and infinities. Thus, the true range is from −∞ to +∞.
It uses approximate decimal values to describe the range. The true values for float are:
The smallest representable positive number is exactly 2−149, which is 1.40129846432481707092372958328991613128026194187651577175706828388979108268586060148663818836212158203125e-45.
The largest representable finite number is exactly 2128−2104, which is 340282346638528859811704183484516925440.
Related
Why does the following program print what it prints?
class Program
{
static void Main(string[] args)
{
float f1 = 0.09f*100f;
float f2 = 0.09f*99.999999f;
Console.WriteLine(f1 > f2);
}
}
Output is
false
Floating point only has so many digits of precision. If you're seeing f1 == f2, it is because any difference requires more precision than a 32-bit float can represent.
I recommend reading What Every Computer Scientist Should Read About Floating Point
The main thing is this isn't just .Net: it's a limitation of the underlying system most every language will use to represent a float in memory. The precision only goes so far.
You can also have some fun with relatively simple numbers, when you take into account that it's not even base ten. 0.1 (1/10th), for example, is a repeating decimal when represented in binary, just as 1/3rd is when represented in decimal.
In this particular case, it’s because .09 and .999999 cannot be represented with exact precision in binary (similarly, 1/3 cannot be represented with exact precision in decimal). For example, 0.111111111111111111101111 base 2 is 0.999998986721038818359375 base 10. Adding 1 to the previous binary value, 0.11111111111111111111 base 2 is 0.99999904632568359375 base 10. There isn’t a binary value for exactly 0.999999. Floating point precision is also limited by the space allocated for storing the exponent and the fractional part of the mantissa. Also, like integer types, floating point can overflow its range, although its range is larger than integer ranges.
Running this bit of C++ code in the Xcode debugger,
float myFloat = 0.1;
shows that myFloat gets the value 0.100000001. It is off by 0.000000001. Not a lot, but if the computation has several arithmetic operations, the imprecision can be compounded.
imho a very good explanation of floating point is in Chapter 14 of Introduction to Computer Organization with x86-64 Assembly Language & GNU/Linux by Bob Plantz of California State University at Sonoma (retired) http://bob.cs.sonoma.edu/getting_book.html. The following is based on that chapter.
Floating point is like scientific notation, where a value is stored as a mixed number greater than or equal to 1.0 and less than 2.0 (the mantissa), times another number to some power (the exponent). Floating point uses base 2 rather than base 10, but in the simple model Plantz gives, he uses base 10 for clarity’s sake. Imagine a system where two positions of storage are used for the mantissa, one position is used for the sign of the exponent* (0 representing + and 1 representing -), and one position is used for the exponent. Now add 0.93 and 0.91. The answer is 1.8, not 1.84.
9311 represents 0.93, or 9.3 times 10 to the -1.
9111 represents 0.91, or 9.1 times 10 to the -1.
The exact answer is 1.84, or 1.84 times 10 to the 0, which would be 18400 if we had 5 positions, but, having only four positions, the answer is 1800, or 1.8 times 10 to the zero, or 1.8. Of course, floating point data types can use more than four positions of storage, but the number of positions is still limited.
Not only is precision limited by space, but “an exact representation of fractional values in binary is limited to sums of inverse powers of two.” (Plantz, op. cit.).
0.11100110 (binary) = 0.89843750 (decimal)
0.11100111 (binary) = 0.90234375 (decimal)
There is no exact representation of 0.9 decimal in binary. Even carrying the fraction out more places doesn’t work, as you get into repeating 1100 forever on the right.
Beginning programmers often see floating point arithmetic as more
accurate than integer. It is true that even adding two very large
integers can cause overflow. Multiplication makes it even more likely
that the result will be very large and, thus, overflow. And when used
with two integers, the / operator in C/C++ causes the fractional part
to be lost. However, ... floating point representations have their own
set of inaccuracies. (Plantz, op. cit.)
*In floating point, both the sign of the number and the sign of the exponent are represented.
I have a problem understanding the precision of float type.
The msdn writes that precision from 6 to 9 digits. But I note that precision depends from on the size of the number:
float smallNumber = 1.0000001f;
Console.WriteLine(smallNumber); // 1.0000001
bigNumber = 100000001f;
Console.WriteLine(bigNumber); // 100000000
The smallNumber is more precise than big, I understand IEEE754, but I don't understand how MSDN calculate precision, and does it make sense?
Also, you can play with the representation of numbers in float format here. Please write 100000000 value in "You entered" input and click "+1" on the right. Then change the input's value to 1, and click "+1" again. You may see the difference in precision.
The MSDN documentation is nonsensical and wrong.
Bad concept. Binary-floating-point format does not have any precision in decimal digits because it has no decimal digits at all. It represents numbers with a sign, a fixed number of binary digits (bits), and an exponent for a power of two.
Wrong on the high end. The floating-point format represents many numbers exactly, with infinite precision. For example, “3” is represented exactly. You can write it in decimal arbitrarily far, 3.0000000000…, and all of the decimal digits will be correct. Another example is 1.40129846432481707092372958328991613128026194187651577175706828388979108268586060148663818836212158203125•10−45. This number has 105 significant digits in decimal, but the float format represents it exactly (it is 2−149).
Wrong on the low end. When “999999.97” is converted from decimal to float, the result is 1,000,000. So not even one decimal digit is correct.
Not a measure of accuracy. Because the float significand has 24 bits, the resolution of its lowest bit is about 223 times finer than the resolution of its highest bit. This is about 6.9 digits in the sense that log10223 is about 6.9. But that just tells us the resolution—the coarseness—of the representation. When we convert a number to the float format, we get a result that differs from the number by at most ½ of this resolution, because we round to the nearest representable value. So a conversion to float has a relative error of at most 1 part in 224, which corresponds to about 7.2 digits in the above sense. If we are using digits to measure resolution, then we say the resolution is about 7.2 digits, not that it is 6-9 digits.
Where do these numbers came from?
So, if “~6-9 digits” is not a correct concept, does not come from actual bounds on the digits, and does not measure accuracy, where does it come from? We cannot be sure, but 6 and 9 do appear in two descriptions of the float format.
6 is the largest number x for which this is guaranteed:
If any decimal numeral with at most x significant digits is within the normal exponent bounds of the float format and is converted to the nearest value represented in the format, then, when the result is converted to the nearest decimal numeral with at most x significant digits, the result of that conversion equals the original number.
So it is reasonable to say float can preserve at least six decimal digits. However, as we will see, there is no bound involving nine digits.
9 is the smallest number x that guarantees this:
If any finite float number is converted to the nearest decimal numeral with x digits, then, when the result is converted to the nearest value representable in float, the result of that conversion equals the original number.
As an analogy, if float is a container, then the largest “decimal container” guaranteed to fit inside it is six digits, and the smallest “decimal container” guaranteed to hold it is nine digits. 6 and 9 are akin to interior and exterior measurements of the float container.
Suppose you had a block 7.2 units long, and you were looking at its placement on a line of bricks each 1 unit long. If you put the start of the block at the start of a brick, it will extend 7.2 bricks. However, if somebody else chooses where it starts, they might start it in the middle of a brick. Then it would cover part of that brick, all of the next 6 bricks, and and part of the last brick (e.g., .5 + 6 + .7 = 7.2). So a 7.2-unit block is only guaranteed to cover 6 bricks. Conversely, 8 bricks can cover the 7.2-unit block if you choose where they are placed. But if somebody else chooses where they start, the first might cover just .1 units of the block. Then you need 7 more and another fraction, so 9 bricks are needed.
The reason this analogy holds is that powers of two and powers of 10 are irregularly spaced relative to each other. 210 (1024) is near 103 (1000). 10 is the exponent used in the float format for numbers from 1024 (inclusive) to 2048 (exclusive). So this interval from 1024 to 2048 is like a block that has been placed just after the 100-1000 ends and the 1000-10,000 block starts.
But note that this property involving 9 digits is the exterior measurement—it is not a capability that float can perform or a service that it can provide. It is something that float needs (if it is to be held in a decimal format), not something it provides. So it is not a bound on how many digits a float can store.
Further Reading
For better understanding of floating-point arithmetic, consider studying the IEEE-754 Standard for Floating-Point Arithmetic or a good textbook like Handbook of Floating-Point Arithmetic by Jean-Michel Muller et al.
Yes number of digits before rounding errors is a measure of precision but you can not asses precision from just 2 numbers because you might be just closer or further from the rounding threshold.
To better understand the situation then you need to see how floats are represented.
The IEEE754 32bit floats are stored as:
bool(1bit sign) * integer(24bit mantisa) << integer(8bit exponent)
Yes mantissa is 24 bit instead of 23 as it's MSB is implicitly set to 1.
As you can see there are only integers and bitshift. So if you are representing natural number up to 2^24 you are without rounding completely. Fro bigger numbers binary zero padding occurs from the right that causes the difference.
In case of digits after decimal points the zero padding occurs from the left. But there is another problem as in binary you can not store some decadic numbers exactly. For example:
0.3 dec = 0.100110011001100110011001100110011001100... bin
0.25 dec = 0.01 bin
As you can see the sequence of 0.3 dec in binary is infinite (like we can not write 1/3 in decadic) hence if crop it to only 24 bits you lose the rest and the number is not what you want anymore.
If you compare 0.3 and 0.125 the 0.125 is exact and 0.3 is not but 0.125 is much smaller than 0.3. So your measure is not correct unless explored more very close values that will cover the rounding steps and computing the max difference from such set. For example you could compare
1.0000001f
1.0000002f
1.0000003f
1.0000004f
1.0000005f
1.0000006f
1.0000007f
1.0000008f
1.0000009f
and remember the max difference of fabs(x-round(x)) and than do the same for
100000001
100000002
100000003
100000004
100000005
100000006
100000007
100000008
100000009
And then compare the two differences.
On top of all this you are missing one very important thing. And that is the errors while converting from text to binary and back which are usually even bigger. First of all try to print your numbers without rounding (for example force to print 20 decimal digits after decimal point).
Also the numbers are stored in binary base so in order to print them you need to convert to decadic base which involves multiplication and division by 10. The more bits are missing (zero pad) from the number the bigger the print errors are. To be as precise as you can a trick is used and that is to print the number in hex (no rounding errors) and then convert the hex string itself to decadic base on integer math. That is much more accurate then naive floating point prints. for more info see related QAs:
my best attempt to print 32 bit floats with least rounding errors (integer math only)
How do libraries/programming languages convert floats to strings
How do I convert a very long binary number to decimal?
Now to get back to number of "precise" digits represented by float. For integer part of number is that easy:
dec_digits = floor(log10(2^24)) = floor(7.22) = 7
However for digits after decimal point is this not as precise (for first few decadic digits) as there are a lot rounding going on. For more info see:
How do you print the EXACT value of a floating point number?
I think what they mean in their documentation is that depending on the number that the precision ranges from 6 to 9 decimal places. Go by the standard that is explained on the page you linked, sometimes Microsoft are a bit lazy when it comes to documentation, like the rest of us.
The problem with floating point is that it is inaccurate. If you put the number 1.05 into the site in your link you will notice that it cannot be accurately stored in floating point. It's actually stored as 1.0499999523162841796875. It's stored this way to do calculations faster. It's not great for money, e.g. what if your item is priced at $1.05 and you sell a billion of them.
The smallNumber is more precise than big
Incorrect compare. The other number has more significant digits.
1.0000001f is attempting N digits of decimal precision.
100000001f attempts N+1.
I have a problem understanding the precision of float type.
To best understand float precision, think binary. Use "%a" for printing with a C99 or later compiler.
float is stored base 2. The significand is a Dyadic rational, some integer/power-of-2.
float commonly has 24 bits of binary precision. (23-bit explicitly encoded, 1 implied)
Between [1.0 ... 2.0), there are 223 different float values.
Between [2.0 ... 4.0), there are 223 different float values.
Between [4.0 ... 8.0), there are 223 different float values.
...
The possible values of a float are not distributed uniformly among powers-of-10. The grouping of float values to power-of-10 (decimal precision) results in the wobbling 6 to 9 decimal digits of precision.
How to calculate float type precision?
To find the difference between subsequent float values, since C99, use nextafterf()
Illustrative code:
#include<math.h>
#include<stdio.h>
void foooo(float b) {
float a = nextafterf(b, 0);
float c = nextafterf(b, b * 2.0f);
printf("%-15a %.9e\n", a, a);
printf("%-15a %.9e\n", b, b);
printf("%-15a %.9e\n", c, c);
printf("Local decimal precision %.2f digits\n", 1.0 - log10((c - b) / b));
}
int main(void) {
foooo(1.0000001f);
foooo(100000001.0f);
return 0;
}
Output
0x1p+0 1.000000000e+00
0x1.000002p+0 1.000000119e+00
0x1.000004p+0 1.000000238e+00
Local decimal precision 7.92 digits
0x1.7d783ep+26 9.999999200e+07
0x1.7d784p+26 1.000000000e+08
0x1.7d7842p+26 1.000000080e+08
Local decimal precision 8.10 digits
I understand that due to the way floats are stored, there are some values that cannot be represented exactly. (See Why Are Floating Point Numbers Inaccurate?)
That being said, in .NET / C#, what options or guarantees does one have in terms of precision? https://msdn.microsoft.com/en-us/library/b1e65aza.aspx?f=255&MSPPError=-2147217396 refers to "approximate" range. Why is an approximation necessary in this context?
Ultimately, I'm looking to store a value like:
x.xxxxxxxx
where the value is always positive. This doesn't seem to be a hugely precise number, yet I'd like to know for sure that all possible numeric combinations above are "durable."
Edit:
To clarify about the exact level of precision I want, I'd like to be able to have eight digits following the decimal point, and a single digit to the left of it.
It sounds like you're looking for decimal.
From the link:
Precision 28-29 significant digits
Approximate Range (pasting doesn't give the formatting of the exponent etc. so just check out the link.)
what options or guarantees does one have in terms of precision
base on the doc an Wikipedia float has a precision of 7 significant decimal digits precision
something like 1234567 or 1.234567 or 0.0000001234567
Why is an approximation necessary in this context
because it is not exact. IEEE 754 floating-point value is (2 − 2−23) × 2127 ≈ 3.402823 × 10^38
though you might be able to get a number like 3.4028239999 × 10^38 and the last 5 digits (99999) cannot be trusted
I looked at decimal in C# but I wasnt 100% sure what it did.
Is it lossy? in C# writing 1.0000000000001f+1.0000000000001f results in 2 when using float (double gets you 2.0000000000002 which is correct) is it possible to add two things with decimal and not get the correct answer?
How many decimal places can I use? I see the MaxValue is 79228162514264337593543950335 but if i subtract 1 how many decimal places can I use?
Are there quirks I should know of? In C# its 128bits, in other language how many bits is it and will it work the same way as C# decimal does? (when adding, dividing, multiplication)
What you're showing isn't decimal - it's float. They're very different types. f is the suffix for float, aka System.Single. m is the suffix for decimal, aka System.Decimal. It's not clear from your question whether you thought this was actually using decimal, or whether you were just using float to demonstrate your fears.
If you use 1.0000000000001m + 1.0000000000001m you'll get exactly the right value. Note that the double version wasn't able to express either of the individual values exactly, by the way.
I have articles on both kinds of floating point in .NET, and you should read them thoroughly, along other resources:
Binary floating point (float/double)
Decimal floating point (decimal)
All floating point types have their limits of course, but in particular you should not expect binary floating point to accurately represent decimal values such as 0.1. It still can't represent anything that isn't exactly representable in 28/29 decimal digits though - so if you divide 1 by 3, you won't get the exact answer of course.
You should also note that the range of decimal is considerably smaller than that of double. So while it can have 28-29 decimal digits of precision, you can't represent truly huge numbers (e.g. 10200) or miniscule numbers (e.g. 10-200).
Decimals in programming are (almost) never 100% accurate. Sometimes it's even better to multiply the decimal value with a very high number and then calculate, but that's only if you're for example sure that the value is always between 0 and 100(so it won't get out of range of the maxvalue)
Floting point is inherently imprecise. Some numbers can't be represented faithfully. Decimal is a large floating point with high precision. If you look on the page at msdn you can see there are "28-29 significant digits." The .net framework classes are language agnostic. they will work the same in every language that uses .net.
edit (in response to Jon Skeet): If you initialize the Decimal class with the numbers above, which are less than 28 digits each after the decimal point, the number will be stored faithfully as long as the binary representation is exact. Since it works in 64-bit format, I assume the 128-bit will handle it perfectly fine. Some numbers, such as 0.1, will never be exactly representable because they are a repeating sequence in binary.
Refreshing on floating points (also PDF), IEEE-754 and taking part in this discussion on floating point rounding when converting to strings, brought me to tinker: how can I get the maximum and minimum value for a given floating point number whose binary representations are equal.
Disclaimer: for this discussion, I like to stick to 32 bit and 64 bit floating point as described by IEEE-754. I'm not interested in extended floating point (80-bits) or quads (128 bits IEEE-754-2008) or any other standard (IEEE-854).
Background: Computers are bad at representing 0.1 in binary representation. In C#, a float represents this as 3DCCCCCD internally (C# uses round-to-nearest) and a double as 3FB999999999999A. The same bit patterns are used for decimal 0.100000005 (float) and 0.1000000000000000124 (double), but not for 0.1000000000000000144 (double).
For convenience, the following C# code gives these internal representations:
string GetHex(float f)
{
return BitConverter.ToUInt32(BitConverter.GetBytes(f), 0).ToString("X");
}
string GetHex(double d)
{
return BitConverter.ToUInt64(BitConverter.GetBytes(d), 0).ToString("X");
}
// float
Console.WriteLine(GetHex(0.1F));
// double
Console.WriteLine(GetHex(0.1));
In the case of 0.1, there is no lower decimal number that is represented with the same bit pattern, any 0.99...99 will yield a different bit representation (i.e., float for 0.999999937 yields 3F7FFFFF internally).
My question is simple: how can I find the lowest and highest decimal value for a given float (or double) that is internally stored in the same binary representation.
Why: (I know you'll ask) to find the error in rounding in .NET when it converts to a string and when it converts from a string, to find the internal exact value and to understand my own rounding errors better.
My guess is something like: take the mantissa, remove the rest, get its exact value, get one (mantissa-bit) higher, and calculate the mean: anything below that will yield the same bit pattern. My main problem is: how to get the fractional part as integer (bit manipulation it not my strongest asset). Jon Skeet's DoubleConverter class may be helpful.
One way to get at your question is to find the size of an ULP, or Unit in the Last Place, of your floating-point number. Simplifying a little bit, this is the distance between a given floating-point number and the next larger number. Again, simplifying a little bit, given a representable floating-point value x, any decimal string whose value is between (x - 1/2 ulp) and (x + 1/2 ulp) will be rounded to x when converted to a floating-point value.
The trick is that (x +/- 1/2 ulp) is not a representable floating-point number, so actually calculating its value requires that you use a wider floating-point type (if one is available) or an arbitrary width big decimal or similar type to do the computation.
How do you find the size of an ulp? One relatively easy way is roughly what you suggested, written here is C-ish pseudocode because I don't know C#:
float absX = absoluteValue(x);
uint32_t bitPattern = getRepresentationOfFloat(absx);
bitPattern++;
float nextFloatNumber = getFloatFromRepresentation(bitPattern);
float ulpOfX = (nextFloatNumber - absX);
This works because adding one to the bit pattern of x exactly corresponds to adding one ulp to the value of x. No floating-point rounding occurs in the subtraction because the values involved are so close (in particular, there is a theorem of ieee-754 floating-point arithmetic that if two numbers x and y satisfy y/2 <= x <= 2y, then x - y is computed exactly). The only caveats here are:
if x happens to be the largest finite floating point number, this won't work (it will return inf, which is clearly wrong).
if your platform does not correctly support gradual underflow (say an embedded device running in flush-to-zero mode), this won't work for very small values of x.
It sounds like you're not likely to be in either of those situations, so this should work just fine for your purposes.
Now that you know what an ulp of x is, you can find the interval of values that rounds to x. You can compute ulp(x)/2 exactly in floating-point, because floating-point division by 2 is exact (again, barring underflow). Then you need only compute the value of x +/- ulp(x)/2 suitable larger floating-point type (double will work if you're interested in float) or in a Big Decimal type, and you have your interval.
I made a few simplifying assumptions through this explanation. If you need this to really be spelled out exactly, leave a comment and I'll expand on the sections that are a bit fuzzy when I get the chance.
One other note the following statement in your question:
In the case of 0.1, there is no lower
decimal number that is represented
with the same bit pattern
is incorrect. You just happened to be looking at the wrong values (0.999999... instead of 0.099999... -- an easy typo to make).
Python 3.1 just implemented something like this: see the changelog (scroll down a bit), bug report.