Convert float to double loses precision but not via ToString - c#

I have the following code:
float f = 0.3f;
double d1 = System.Convert.ToDouble(f);
double d2 = System.Convert.ToDouble(f.ToString());
The results are equivalent to:
d1 = 0.30000001192092896;
d2 = 0.3;
I'm curious to find out why this is?

Its not a loss of precision .3 is not representable in floating point. When the system converts to the string it rounds; if you print out enough significant digits you will get something that makes more sense.
To see it more clearly
float f = 0.3f;
double d1 = System.Convert.ToDouble(f);
double d2 = System.Convert.ToDouble(f.ToString("G20"));
string s = string.Format("d1 : {0} ; d2 : {1} ", d1, d2);
output
"d1 : 0.300000011920929 ; d2 : 0.300000012 "

You're not losing precision; you're upcasting to a more precise representation (double, 64-bits long) from a less precise representation (float, 32-bits long). What you get in the more precise representation (past a certain point) is just garbage. If you were to cast it back to a float FROM a double, you would have the exact same precision as you did before.
What happens here is that you've got 32 bits allocated for your float. You then upcast to a double, adding another 32 bits for representing your number (for a total of 64). Those new bits are the least significant (the farthest to the right of your decimal point), and have no bearing on the actual value since they were indeterminate before. As a result, those new bits have whatever values they happened to have when you did your upcast. They're just as indeterminate as they were before -- garbage, in other words.
When you downcast from a double to a float, it'll lop off those least-significant bits, leaving you with 0.300000 (7 digits of precision).
The mechanism for converting from a string to a float is different; the compiler needs to analyze the semantic meaning of the character string '0.3f' and figure out how that relates to a floating point value. It can't be done with bit-shifting like the float/double conversion -- thus, the value that you expect.
For more info on how floating point numbers work, you may be interested in checking out this wikipedia article on the IEEE 754-1985 standard (which has some handy pictures and good explanation of the mechanics of things), and this wiki article on the updates to the standard in 2008.
edit:
First, as #phoog pointed out below, upcasting from a float to a double isn't as simple as adding another 32 bits to the space reserved to record the number. In reality, you'll get an addition 3 bits for the exponent (for a total of 11), and an additional 29 bits for the fraction (for a total of 52). Add in the sign bit and you've got your total of 64 bits for the double.
Additionally, suggesting that there are 'garbage bits' in those least significant locations a gross generalization, and probably not be correct for C#. A bit of explanation, and some testing below suggests to me that this is deterministic for C#/.NET, and probably the result of some specific mechanism in the conversion rather than reserving memory for additional precision.
Way back in the beforetimes, when your code would compile into a machine-language binary, compilers (C and C++ compilers, at least) would not add any CPU instructions to 'clear' or initialize the value in memory when you reserved space for a variable. So, unless the programmer explicitly initialized a variable to some value, the values of the bits that were reserved for that location would maintain whatever value they had before you reserved that memory.
In .NET land, your C# or other .NET language compiles into an intermediate language (CIL, Common Intermediate Language), which is then Just-In-Time compiled by the CLR to execute as native code. There may or may not be an variable initialization step added by either the C# compiler or the JIT compiler; I'm not sure.
Here's what I do know:
I tested this by casting the float to three different doubles. Each one of the results had the exact same value.
That value was exactly the same as #rerun's value above: double d1 = System.Convert.ToDouble(f); result: d1 : 0.300000011920929
I get the same result if I cast using double d2 = (double)f; Result: d2 : 0.300000011920929
With three of us getting the same values, it looks like the upcast value is deterministic (and not actually garbage bits), indicating that .NET is doing something the same way across all of our machines. It's still true to say that the additional digits are no more or less precise than they were before, because 0.3f isn't exactly equal to 0.3 -- it's equal to 0.3, up to seven digits of precision. We know nothing about the values of additional digits beyond those first seven.

I use decimal cast for correct result in this case and same other case
float ff = 99.95f;
double dd = (double)(decimal)ff;

Related

Why "float.TryParse("51778365".ToString(), out x)" return 51778364 in C#? [duplicate]

Why does the following program print what it prints?
class Program
{
static void Main(string[] args)
{
float f1 = 0.09f*100f;
float f2 = 0.09f*99.999999f;
Console.WriteLine(f1 > f2);
}
}
Output is
false
Floating point only has so many digits of precision. If you're seeing f1 == f2, it is because any difference requires more precision than a 32-bit float can represent.
I recommend reading What Every Computer Scientist Should Read About Floating Point
The main thing is this isn't just .Net: it's a limitation of the underlying system most every language will use to represent a float in memory. The precision only goes so far.
You can also have some fun with relatively simple numbers, when you take into account that it's not even base ten. 0.1 (1/10th), for example, is a repeating decimal when represented in binary, just as 1/3rd is when represented in decimal.
In this particular case, it’s because .09 and .999999 cannot be represented with exact precision in binary (similarly, 1/3 cannot be represented with exact precision in decimal). For example, 0.111111111111111111101111 base 2 is 0.999998986721038818359375 base 10. Adding 1 to the previous binary value, 0.11111111111111111111 base 2 is 0.99999904632568359375 base 10. There isn’t a binary value for exactly 0.999999. Floating point precision is also limited by the space allocated for storing the exponent and the fractional part of the mantissa. Also, like integer types, floating point can overflow its range, although its range is larger than integer ranges.
Running this bit of C++ code in the Xcode debugger,
float myFloat = 0.1;
shows that myFloat gets the value 0.100000001. It is off by 0.000000001. Not a lot, but if the computation has several arithmetic operations, the imprecision can be compounded.
imho a very good explanation of floating point is in Chapter 14 of Introduction to Computer Organization with x86-64 Assembly Language & GNU/Linux by Bob Plantz of California State University at Sonoma (retired) http://bob.cs.sonoma.edu/getting_book.html. The following is based on that chapter.
Floating point is like scientific notation, where a value is stored as a mixed number greater than or equal to 1.0 and less than 2.0 (the mantissa), times another number to some power (the exponent). Floating point uses base 2 rather than base 10, but in the simple model Plantz gives, he uses base 10 for clarity’s sake. Imagine a system where two positions of storage are used for the mantissa, one position is used for the sign of the exponent* (0 representing + and 1 representing -), and one position is used for the exponent. Now add 0.93 and 0.91. The answer is 1.8, not 1.84.
9311 represents 0.93, or 9.3 times 10 to the -1.
9111 represents 0.91, or 9.1 times 10 to the -1.
The exact answer is 1.84, or 1.84 times 10 to the 0, which would be 18400 if we had 5 positions, but, having only four positions, the answer is 1800, or 1.8 times 10 to the zero, or 1.8. Of course, floating point data types can use more than four positions of storage, but the number of positions is still limited.
Not only is precision limited by space, but “an exact representation of fractional values in binary is limited to sums of inverse powers of two.” (Plantz, op. cit.).
0.11100110 (binary) = 0.89843750 (decimal)
0.11100111 (binary) = 0.90234375 (decimal)
There is no exact representation of 0.9 decimal in binary. Even carrying the fraction out more places doesn’t work, as you get into repeating 1100 forever on the right.
Beginning programmers often see floating point arithmetic as more
accurate than integer. It is true that even adding two very large
integers can cause overflow. Multiplication makes it even more likely
that the result will be very large and, thus, overflow. And when used
with two integers, the / operator in C/C++ causes the fractional part
to be lost. However, ... floating point representations have their own
set of inaccuracies. (Plantz, op. cit.)
*In floating point, both the sign of the number and the sign of the exponent are represented.

Accumulating errors with decimal type?

I have an application where I accumulate decimal values (both adding and subtracting.) I use the decimal type rather than double in order to avoid accumulation errors. However, I've run into a case where the behavior is not quite what I'd expect.
I have x = a + b, where a = 487.5M and b = 433.33333333333333333333333335M.
Computing the addition, I get x = 920.8333333333333333333333334M.
I then have y = 967.8750000000000000000000001M.
I want to assert that y - x = y - a - b. However,
y - x = 47.0416666666666666666666667
y - a - b = 47.04166666666666666666666675
I thought this kind of error was exactly what the decimal type was intended to avoid, so what's happening here?
Here is code that reproduces the issue:
static void Main()
{
decimal a = 487.5M;
decimal b = 433.33333333333333333333333335M;
decimal x = a + b;
decimal y = 967.8750000000000000000000001M;
Console.WriteLine(y - x);
Console.WriteLine(y - a - b);
if (y - x != y - a - b)
Console.WriteLine("x - y != y - a - b");
Console.ReadKey();
}
There was some discussion in comments as to why these high precisions are necessary, so I thought I'd address in summary here. For display purposes, I certainly round the results of these operations, but I use decimal for all internal representations. Some of the computations take fractions along the way, which results in numbers that are beyond the precision of the decimal type.
I take care, however, to try and keep everything stable for accumulation. So, for instance, if I split up a quantity into three thirds, I take x/3, x/3 and then (x - x/3 - x/3). This is a system that is accounting for physical quantities that are often divided up like this, so I don't want to introduce biases by rounding too soon. For instance, if I rounded the above for x=1 to three decimals, I would wind up with 0.333, 0.333, 0.334 as the three portions of the operation.
There are real physical limitations to the precision of what the system can do, but the logical accounting of what it's trying to do should ideally stay as precise as it can. The main critical requirement is that the sum total quantity of the system should not change as a result of these various operations. In the above case, I'm finding that decimal can violate this assumption, so I want to understand better why this is happening and how I can fix it.
The C# type Decimal is not like the decimal types used in COBOL, which actually store the numbers one decimal digit per nibble, and uses mathematical methods similar to doing decimal math by hand. Rather, it is a floating point type that simply assumes quantities will not get so large, so it uses fewer bits for exponents, and uses the remaining the bits of 128 rather than 64 for double to allow for greatly increased accuracy.
But being a floating point representation, even very simply fractional values are not represented exactly: 0.1, for example, requires a binary repeating fraction and may not be stored as an exact value. (It is not, for a double; Decimal may handle that particular value differently, but this is true in general.)
Therefore comparisons still need to be made using typical floating point math procedures, in which values are compared, added, subtracted, etc., by accepting them only to a certain point. Since there are approximately 23 decimal places of accuracy, select 16 as your standard, for example, and ignore those at the end.
For a good reference, read What Every Computer Scientist Should Know About Floating Point Precision.
The Decimal type is a floating-point type which has more bits of precision than any of the other types that have been built into .NET from the beginning, and whose values are all concisely representable in base-10 format. It is, however, bulky and slow, and because it is a floating-point type it is no more able to satisfy axioms typical of "precise" types (e.g. for any X and Y, (X+Y-Y)==X should either return true or throw an overflow exception). I would guess that it was made a floating-point type rather than fixed-point because of indecision regarding the number of digits that should be to the right of the decimal. In practice, it might would have been faster, and just as useful, to have a 128-bit fixed-point format, but the Decimal type is what it is.
Incidentally, languages like PL/I work well with fixed-point types because they recognize that precision is a function of a storage location rather than a value. Unfortunately, .NET does not provide any nice means via which a variable could be defined as holding a Fixed(6,3) and automatically scale and shift a Fixed(5,2) which is stored into it. Having the precision be part of the value means that storing a value into a variable will change the number of digits that variables represents to the right of the decimal place.

Convert.ToDecimal(double x) - System.OverflowException

What is the maximum double value that can be represented\converted to a decimal?
How can this value be derived - example please.
Update
Given a maximum value for a double that can be converted to a decimal, I would expect to be able to round-trip the double to a decimal, and then back again. However, given a figure such as (2^52)-1 as in #Jirka's answer, this does not work. For example:
Test]
public void round_trip_double_to_decimal()
{
double maxDecimalAsDouble = (Math.Pow(2, 52) - 1);
decimal toDecimal = Convert.ToDecimal(maxDecimalAsDouble);
double toDouble = Convert.ToDouble(toDecimal);
//Fails.
Assert.That(toDouble, Is.EqualTo(maxDecimalAsDouble));
}
All integers between -9,007,199,254,740,992 and 9,007,199,254,740,991 can be exactly represented in a double. (Keep reading, though.)
The upper bound is derived as 2^53 - 1. The internal representation of it is something like (0x1.fffffffffffff * 2^52) if you pardon my hexadecimal syntax.
Outside of this range, many integers can be still exactly represented if they are a multiple of a power of two.
The highest integer whatsoever that can be accurately represented would therefore be 9,007,199,254,740,991 * (2 ^ 1023), which is even higher than Decimal.MaxValue but this is a pretty meaningless fact, given that the value does not bother to change, for example, when you subtract 1 in double arithmetic.
Based on the comments and further research, I am adding info on .NET and Mono implementations of C# that relativizes most conclusions you and I might want to make.
Math.Pow does not seem to guarantee any particular accuracy and it seems to deliver a bit or two fewer than what a double can represent. This is not too surprising with a floating point function. The Intel floating point hardware does not have an instruction for exponentiation and I expect that the computation involves logarithm and multiplication instructions, where intermediate results lose some precision. One would use BigInteger.Pow if integral accuracy was desired.
However, even (decimal)(double)9007199254740991M results in a round trip violation. This time it is, however, a known bug, a direct violation of Section 6.2.1 of the C# spec. Interestingly I see the same bug even in Mono 2.8. (The referenced source shows that this conversion bug can hit even with much lower values.)
Double literals are less rounded, but still a little: 9007199254740991D prints out as 9007199254740990D. This is an artifact of internal multiplication by 10 when parsing the string literal (before the upper and lower bound converge to the same double value based on the "first zero after the decimal point"). This again violates the C# spec, this time Section 9.4.4.3.
Unlike C, C# has no hexadecimal floating point literals, so we cannot avoid that multiplication by 10 by any other syntax, except perhaps by going through Decimal or BigInteger, if these only provided accurate conversion operators. I have not tested BigInteger.
The above could almost make you wonder whether C# does not invent its own unique floating point format with reduced precision. No, Section 11.1.6 references 64bit IEC 60559 representation. So the above are indeed bugs.
So, to conclude, you should be able to fit even 9007199254740991M in a double precisely, but it's quite a challenge to get the value in place!
The moral of the story is that the traditional belief that "Arithmetic should be barely more precise than the data and the desired result" is wrong, as this famous article demonstrates (page 36), albeit in the context of a different programming language.
Don't store integers in floating point variables unless you have to.
MSDN Double data type
Decimal vs double
The value of Decimal.MaxValue is positive 79,228,162,514,264,337,593,543,950,335.

Find min/max of a float/double that has the same internal representation

Refreshing on floating points (also PDF), IEEE-754 and taking part in this discussion on floating point rounding when converting to strings, brought me to tinker: how can I get the maximum and minimum value for a given floating point number whose binary representations are equal.
Disclaimer: for this discussion, I like to stick to 32 bit and 64 bit floating point as described by IEEE-754. I'm not interested in extended floating point (80-bits) or quads (128 bits IEEE-754-2008) or any other standard (IEEE-854).
Background: Computers are bad at representing 0.1 in binary representation. In C#, a float represents this as 3DCCCCCD internally (C# uses round-to-nearest) and a double as 3FB999999999999A. The same bit patterns are used for decimal 0.100000005 (float) and 0.1000000000000000124 (double), but not for 0.1000000000000000144 (double).
For convenience, the following C# code gives these internal representations:
string GetHex(float f)
{
return BitConverter.ToUInt32(BitConverter.GetBytes(f), 0).ToString("X");
}
string GetHex(double d)
{
return BitConverter.ToUInt64(BitConverter.GetBytes(d), 0).ToString("X");
}
// float
Console.WriteLine(GetHex(0.1F));
// double
Console.WriteLine(GetHex(0.1));
In the case of 0.1, there is no lower decimal number that is represented with the same bit pattern, any 0.99...99 will yield a different bit representation (i.e., float for 0.999999937 yields 3F7FFFFF internally).
My question is simple: how can I find the lowest and highest decimal value for a given float (or double) that is internally stored in the same binary representation.
Why: (I know you'll ask) to find the error in rounding in .NET when it converts to a string and when it converts from a string, to find the internal exact value and to understand my own rounding errors better.
My guess is something like: take the mantissa, remove the rest, get its exact value, get one (mantissa-bit) higher, and calculate the mean: anything below that will yield the same bit pattern. My main problem is: how to get the fractional part as integer (bit manipulation it not my strongest asset). Jon Skeet's DoubleConverter class may be helpful.
One way to get at your question is to find the size of an ULP, or Unit in the Last Place, of your floating-point number. Simplifying a little bit, this is the distance between a given floating-point number and the next larger number. Again, simplifying a little bit, given a representable floating-point value x, any decimal string whose value is between (x - 1/2 ulp) and (x + 1/2 ulp) will be rounded to x when converted to a floating-point value.
The trick is that (x +/- 1/2 ulp) is not a representable floating-point number, so actually calculating its value requires that you use a wider floating-point type (if one is available) or an arbitrary width big decimal or similar type to do the computation.
How do you find the size of an ulp? One relatively easy way is roughly what you suggested, written here is C-ish pseudocode because I don't know C#:
float absX = absoluteValue(x);
uint32_t bitPattern = getRepresentationOfFloat(absx);
bitPattern++;
float nextFloatNumber = getFloatFromRepresentation(bitPattern);
float ulpOfX = (nextFloatNumber - absX);
This works because adding one to the bit pattern of x exactly corresponds to adding one ulp to the value of x. No floating-point rounding occurs in the subtraction because the values involved are so close (in particular, there is a theorem of ieee-754 floating-point arithmetic that if two numbers x and y satisfy y/2 <= x <= 2y, then x - y is computed exactly). The only caveats here are:
if x happens to be the largest finite floating point number, this won't work (it will return inf, which is clearly wrong).
if your platform does not correctly support gradual underflow (say an embedded device running in flush-to-zero mode), this won't work for very small values of x.
It sounds like you're not likely to be in either of those situations, so this should work just fine for your purposes.
Now that you know what an ulp of x is, you can find the interval of values that rounds to x. You can compute ulp(x)/2 exactly in floating-point, because floating-point division by 2 is exact (again, barring underflow). Then you need only compute the value of x +/- ulp(x)/2 suitable larger floating-point type (double will work if you're interested in float) or in a Big Decimal type, and you have your interval.
I made a few simplifying assumptions through this explanation. If you need this to really be spelled out exactly, leave a comment and I'll expand on the sections that are a bit fuzzy when I get the chance.
One other note the following statement in your question:
In the case of 0.1, there is no lower
decimal number that is represented
with the same bit pattern
is incorrect. You just happened to be looking at the wrong values (0.999999... instead of 0.099999... -- an easy typo to make).
Python 3.1 just implemented something like this: see the changelog (scroll down a bit), bug report.

Weird outcome when subtracting doubles [duplicate]

This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
Why is floating point arithmetic in C# imprecise?
I have been dealing with some numbers and C#, and the following line of code results in a different number than one would expect:
double num = (3600.2 - 3600.0);
I expected num to be 0.2, however, it turned out to be 0.1999999999998181. Is there any reason why it is producing a close, but still different decimal?
This is because double is a floating point datatype.
If you want greater accuracy you could switch to using decimal instead.
The literal suffix for decimal is m, so to use decimal arithmetic (and produce a decimal result) you could write your code as
var num = (3600.2m - 3600.0m);
Note that there are disadvantages to using a decimal. It is a 128 bit datatype as opposed to 64 bit which is the size of a double. This makes it more expensive both in terms of memory and processing. It also has a much smaller range than double.
There is a reason.
The reason is, that the way the number is stored in memory, in case of the double data type, doesn't allow for an exact representation of the number 3600.2. It also doesn't allow for an exact representation of the number 0.2.
0.2 has an infinite representation in binary. If You want to store it in memory or processor registers, to perform some calculations, some number close to 0.2 with finite representation is stored instead. It may not be apparent if You run code like this.
double num = (0.2 - 0.0);
This is because in this case, all binary digits available for representing numbers in double data type are used to represent the fractional part of the number (there is only the fractional part) and the precision is higher. If You store the number 3600.2 in an object of type double, some digits are used to represent the integer part - 3600 and there is less digits representing fractional part. The precision is lower and fractional part that is in fact stored in memory differs from 0.2 enough, that it becomes apparent after conversion from double to string
Change your type to decimal:
decimal num = (3600.2m - 3600.0m);
You should also read this.
See Wikipedia
Can't explain it better. I can also suggest reading What Every Computer Scientist Should Know About Floating-Point Arithmetic. Or see related questions on StackOverflow.

Categories