This feels like the kind of code that only fails in-situ, but I will attempt to adapt it into a code snippet that represents what I'm seeing.
float f = myFloat * myConstInt; /* Where myFloat==13.45, and myConstInt==20 */
int i = (int)f;
int i2 = (int)(myFloat * myConstInt);
After stepping through the code, i==269, and i2==268. What's going on here to account for the difference?
Float math can be performed at higher precision than the declared type. But as soon as you store the result in float f, that extra precision is lost. With the second method you don't lose that precision until, of course, you cast the result down to int.
Edit: See this question Why differs floating-point precision in C# when separated by parantheses and when separated by statements? for a better explanation than I probably provided.
Because floating point variables are not infinitely accurate. Use a decimal if you need that kind of accuracy.
Different rounding modes may also play into this issue, but the accuracy problem is the one you're running into here, AFAIK.
Floating point has limited accuracy, and is based on binary rather than decimal. The decimal number 13.45 cannot be represented exactly in binary floating point, so it rounds down slightly. The multiplication by 20 magnifies that error. At this point you have 268.999... - not 269 - therefore the conversion to integer truncates to 268.
To get rounding to the nearest integer, you could try adding 0.5 before converting back to integer.
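For example, a minimal sketch reusing the variable names from the question (this assumes the value is non-negative; for negative values, Math.Round is the safer choice):
int rounded = (int)(myFloat * myConstInt + 0.5);              // 268.999... + 0.5 = 269.499..., truncates to 269
int rounded2 = (int)Math.Round((double)myFloat * myConstInt); // or round explicitly: 269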
For "perfect" arithmetic, you could try using a Decimal or Rational numeric type - I believe C# has libraries for both, but am not certain. These will be slower, however.
EDIT - I have found a "decimal" type so far, but not a rational - I may be wrong about that being available. Decimal floating point is inaccurate, just like binary, but it's the kind of inaccuracy we're used to, so it gives less surprising results.
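To illustrate with the numbers from the question (a sketch; the m suffix makes these decimal literals, which represent 13.45 exactly):
decimal exact = 13.45m * 20;   // exactly 269.00 -- no binary representation error
int i3 = (int)exact;           // 269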
Replace with
double f = myFloat * myConstInt;
and see if you get the same answer.
I'd like to offer a different explanation.
Here's the code, which I've annotated (I looked into memory to dissect the floats):
float myFloat = 13.45f; //In binary is 1101.01110011001100110011
int myConstInt = 20;
float f = myFloat * myConstInt; //In binary is exactly 100001101 (269 decimal)
int i = (int)f; // Turns float 269 into int 269 -- no surprises
int i2 = (int)(myFloat * myConstInt);//"Extra precision" causes round to 268
Let's look closer at the calculations:
f = 1101.01110011001100110011 * 10100 = 100001100.111111111111111 111
The part after the space is bits 25-27, which cause bit 24 to be rounded up, and hence the whole value to be rounded up to 269
int i2 = (int)(myFloat * myConstInt)
myFloat is extended to double precision for the calculation (0s are appended): 1101.0111001100110011001100000000000000000000000000000
myFloat * 20 = 100001100.11111111111111111100000000000000000000000000
Bits 54 and beyond are 0s, so no rounding is done when the product is stored as a double. The value stays just below 269 (268.99999618...), so the cast to int truncates it to 268.
(A similar explanation would work if extended precision is used.)
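You can reproduce both code paths by forcing the intermediate result to double explicitly (a small sketch; the printed values assume the usual IEEE-754 single/double behaviour):
float myFloat = 13.45f;
float stored = myFloat * 20;             // product rounded to float precision: exactly 269
double extended = (double)myFloat * 20;  // product kept at double precision: 268.99999618530273...
Console.WriteLine((int)stored);          // 269
Console.WriteLine((int)extended);        // 268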
UPDATE: I refined my answer and wrote a full-blown article called When Floats Don’t Behave Like Floats
Here's a part of code that I don't understand:
byte b1 = (byte)(64 / 0.8f); // b1 is 79
int b2 = (int)(64 / 0.8f); // b2 is 79
float fl = (64 / 0.8f); // fl is 80
Why are the first two calculations off by one? How should I perform this operation, so it's fast and correct?
EDIT: I would need the result in byte
EDIT: Not entirely correct, see: Why does a division result differ based on the cast type? (Followup)
Rounding issue: By converting to byte / int, you are clipping off the decimal places.
But 64 / 0.8 should not result in any decimal places? Wrong: due to the nature of floating point numbers, 0.8f cannot be represented exactly in memory; it is stored as something close to 0.8 (but not exactly 0.8). See Floating point inaccuracy examples or similar threads. Thus, the result of the calculation is not 80.0f, but 79.999..., where the fractional part is close to 1 but still not exactly 1.
You can verify this by typing the following into the Immediate Window in Visual Studio:
(64 / 0.8f)
80.0
(64 / 0.8f) - 80
-0.0000011920929
100 * 0.8f - 80
0.0000011920929
You can solve this by using rounding:
byte b1 = (byte)(64 / 0.8f + 0.5f);
int b2 = (int)(64 / 0.8f + 0.5f);
float fl = (64 / 0.8f);
I'm afraid fast and correct are at odds in cases like this.
Binary floating point arithmetic almost always creates small errors, due to the underlying representation in our CPU architectures. So in your initial expression you actually get a value a little bit smaller than the mathematically correct one. If you expect an integer as the result of a particular mathematic operation and you get something very close to it, you can use the Math.Round(Double, MidpointRounding) method to perform the correct rounding and compensate for small errors (and make sure you pick the MidpointRounding strategy you expect).
Simply casting the result to a type such as byte or int doesn't do rounding - it simply cuts off the fractional part (even 1.99999f will become 1 when you just cast it to these types).
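For example (a sketch; MidpointRounding.AwayFromZero is just one possible strategy):
byte b1 = (byte)Math.Round(64 / 0.8f, MidpointRounding.AwayFromZero); // 80
int b2 = (int)Math.Round(64 / 0.8f);                                  // 80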
Decimal floating point arithmetic is slower and more memory intensive, but doesn't cause these errors. To perform it, use decimal literals instead of float literals (e.g. 64 / 0.8m).
The rule of thumb is:
If you are dealing with exact quantities (typically man-made, like money), use decimal.
If you are dealing with inexact quantities (like fractional physical constants or irrational numbers like π), use double.
If you are dealing with inexact quantities (as above) and some accuracy can be further sacrificed for speed (like when working with graphics), use float.
To understand the problem, you need to understand the basics of floating point representation and operations.
0.8f cannot be represented exactly in memory as a binary floating-point number.
In mathematics, 64 / 0.8 equals 80.
In floating-point arithmetic, 64 / 0.8f equals approximately 80.
When you cast a float to an integer or a byte, only the integer part of the number is kept. In your case, the imprecise result of the floating point division is a little bit smaller than 80 hence the conversion to integer yields 79.
If you need an integer result, I would suggest rounding instead of casting.
One way to do it is to use Convert.ToInt32, which converts by rounding to the nearest integer:
Convert.ToInt32(64/0.8f);
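To contrast the two approaches (a quick sketch):
int viaCast    = (int)(64 / 0.8f);           // 79 -- the cast truncates
int viaConvert = Convert.ToInt32(64 / 0.8f); // 80 -- Convert.ToInt32 rounds to nearest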
Please see the following code in C#.
float a = 10.0f;
float b = 0.1f;
float c = a / b;
int indirect = (int)(c);
// Value of indirect is 100 always
int direct = (int)(a / b);
// Value of direct is 99 in 32 bit process (?)
// Value of direct is 100 in 64 bit process
Why do we get 99 in 32-bit processes?
I am using VS2013.
When you operate directly, it's permissible for operations to be performed at a higher precision, and for that higher precision to be continued for multiple operations.
From section 4.1.6 of the C# 5 specification:
Floating-point operations may be performed with higher precision than the result type of the operation. For example, some hardware architectures support an “extended” or “long double” floating-point type with greater range and precision than the double type, and implicitly perform all floating-point operations using this higher precision type. Only at excessive cost in performance can such hardware architectures be made to perform floating-point operations with less precision, and rather than require an implementation to forfeit both performance and precision, C# allows a higher precision type to be used for all floating-point operations. Other than delivering more precise results, this rarely has any measurable effects. However, in expressions of the form x * y / z, where the multiplication produces a result that is outside the double range, but the subsequent division brings the temporary result back into the double range, the fact that the expression is evaluated in a higher range format may cause a finite result to be produced instead of an infinity.
I'd expect that in some optimization scenarios, it would even be possible for the answer to be "wrong" with the extra local variable, if the JIT decides that it never really needs the value as a float. (I've seen cases where just adding logging changes the behaviour here...)
In this case, I believe that the division is effectively being performed using 64-bit arithmetic and then cast from double straight to int rather than going via float first.
Here's some code to demonstrate that, using a DoubleConverter class which allows you to find the exact decimal representation of a floating binary point number:
using System;

class Test
{
    static void Main()
    {
        float a = 10f;
        float b = 0.1f;
        float c = a / b;
        double d = (double) a / (double) b;
        float e = (float) d;
        Console.WriteLine(DoubleConverter.ToExactString(c));
        Console.WriteLine(DoubleConverter.ToExactString(d));
        Console.WriteLine(DoubleConverter.ToExactString(e));
        Console.WriteLine((int) c);
        Console.WriteLine((int) d);
        Console.WriteLine((int) e);
    }
}
Output:
100
99.999998509883909036943805404007434844970703125
100
100
99
100
Note that the operation may not just be performed in 64-bits - it may be performed at even higher precision, e.g. 80 bits.
This is just one of the joys of floating binary point arithmetic - and an example of why you need to be very careful about what you're doing.
Note that 0.1f is exactly 0.100000001490116119384765625 - so more than 0.1. Given that it's more than 0.1, I would expect 10/b to be a little less than 100 - if that "little less" is representable, then truncating the result is going to naturally lead to 99.
I tried to use the BigInteger.Pow method to calculate something like 10^12345.987654321, but this method only accepts an integer as the exponent, like this:
BigInteger.Pow(BigInteger x, int y)
So how can I use a double as the exponent with this method?
There's no arbitrary-precision floating-point support in .NET, so this cannot be done directly. There are some alternatives (such as looking for a 3rd-party library), or you can try something like the code below - if the base is small enough, as in your case.
using System;
using System.Numerics;

public class StackOverflow_11179289
{
    public static void Test()
    {
        int @base = 10;
        double exp = 12345.123;
        int intExp = (int)Math.Floor(exp);
        double fracExp = exp - intExp;
        BigInteger temp = BigInteger.Pow(@base, intExp);
        double temp2 = Math.Pow(@base, fracExp);
        int fractionBitsForDouble = 52;
        for (int i = 0; i < fractionBitsForDouble; i++)
        {
            temp = BigInteger.Divide(temp, 2);
            temp2 *= 2;
        }
        BigInteger result = BigInteger.Multiply(temp, (BigInteger)temp2);
        Console.WriteLine(result);
    }
}
The idea is to use big integer math to compute the power of the integer part of the exponent, then use double (64-bit floating point) math to compute the power of the fraction part. Then, using the fact that
a ^ (int + frac) = a ^ int * a ^ frac
we can combine the two values into a single big integer. But simply converting the double value to a BigInteger would lose a lot of its precision, so we first "shift" the precision onto the BigInteger (using the loop above, and the fact that the double type uses 52 bits for the fraction), and then multiply the two parts.
Notice that the result is an approximation, if you want a more precise number, you'll need a library that does arbitrary precision floating point math.
Update: If the base / exponent are small enough that the power would be in the range of double, we can simply do what Sebastian Piu suggested (new BigInteger(Math.Pow((double)@base, exp)))
I like carlosfigueira's answer, but of course the result of his method can only be correct on the first (most significant) 15-17 digits, because a System.Double is used as a multiplier eventually.
It is interesting to note that there does exist a method BigInteger.Log that performs the "inverse" operation. So if you want to calculate Pow(7, 123456.78) you could, in theory, search all BigInteger numbers x to find one number such that BigInteger.Log(x, 7) is equal to 123456.78 or closer to 123456.78 than any other x of type BigInteger.
Of course the logarithm function is increasing, so your search can use some kind of "binary search" (bisection search). Our answer lies between Pow(7, 123456) and Pow(7, 123457) which can both be calculated exactly.
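A hedged sketch of that bisection idea (the class and method names are mine, not from the original; it assumes an integer base >= 2 and a positive exponent, and it will be slow for exponents this large):
using System;
using System.Numerics;

class PowSearch
{
    public static BigInteger PowByLogSearch(int @base, double exponent)
    {
        // Bracket the answer between base^floor(exp) and base^(floor(exp) + 1), then bisect.
        BigInteger lo = BigInteger.Pow(@base, (int)Math.Floor(exponent));
        BigInteger hi = BigInteger.Pow(@base, (int)Math.Floor(exponent) + 1);
        while (hi - lo > 1)
        {
            BigInteger mid = (lo + hi) / 2;
            if (BigInteger.Log(mid, @base) <= exponent)
                lo = mid;   // mid's logarithm is still at or below the target
            else
                hi = mid;
        }
        return lo;          // largest x with BigInteger.Log(x, base) <= exponent
    }
}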
Skip the rest if you want
Now, how can we predict in advance if there are more than one integer whose logarithm is 123456.78, up to the precision of System.Double, or if there is in fact no integer whose logarithm hits that specific Double (the precise result of an ideal Pow function being an irrational number)? In our example, there will be very many integers giving the same Double 123456.78 because the factor m = Pow(7, epsilon) (where epsilon is the smallest positive number such that 123456.78 + epsilon has a representation as a Double different from the representation of 123456.78 itself) is big enough that there will be very many integers between the true answer and the true answer multiplied by m.
Remember from calculus that the derivative of the mathematical function x → Pow(7, x) is x → Log(7)*Pow(7, x), so the slope of the graph of the exponential function in question will be Log(7)*Pow(7, 123456.78). This number multiplied by the above epsilon is still much much greater than one, so there are many integers satisfying our need.
Actually, I think carlosfigueira's method will give a "correct" answer x in the sense that Log(x, 7) has the same representation as a Double as 123456.78 has. But has anyone tried it? :-)
I'll provide another answer that is hopefully more clear. The point is: Since the precision of System.Double is limited to approx. 15-17 decimal digits, the result of any Pow(BigInteger, Double) calculation will have an even more limited precision. Therefore, there's no hope of doing better than carlosfigueira's answer does.
Let me illustrate this with an example. Suppose we wanted to calculate
Pow(10, exponent)
where in this example I choose for exponent the double-precision number
const double exponent = 100.0 * Math.PI;
This is of course only an example. The value of exponent, in decimal, can be given as one of
314.159265358979
314.15926535897933
314.1592653589793258106510620564222335815429687500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...
The first of these numbers is what you normally see (15 digits). The second version is produced with exponent.ToString("R") and contains 17 digits. Note that the precision of Double is less than 17 digits. The third representation above is the theoretical "exact" value of exponent. Note that this differs, of course, from the mathematical number 100π near the 17th digit.
To figure out what Pow(10, exponent) ought to be, I simply did BigInteger.Log10(x) on a lot of numbers x to see how I could reproduce exponent. So the results presented here simply reflect the .NET Framework's implementation of BigInteger.Log10.
It turns out that any BigInteger x from
0x0C3F859904635FC0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
through
0x0C3F85990481FE7FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
makes Log10(x) equal to exponent to the precision of 15 digits. Similarly, any number from
0x0C3F8599047BDEC0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
through
0x0C3F8599047D667FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
satisfies Log10(x) == exponent to the precision of Double. Put in another way, any number from the latter range is equally "correct" as the result of Pow(10, exponent), simply because the precision of exponent is so limited.
(Interlude: The bunches of 0s and Fs reveal that .NET's implementation only considers the most significant bytes of x. They don't care to do better, precisely because the Double type has this limited precision.)
Now, the only reason to introduce third-party software, would be if you insist that exponent is to be interpreted as the third of the decimal numbers given above. (It's really a miracle that the Double type allowed you to specify exactly the number you wanted, huh?) In that case, the result of Pow(10, exponent) would be an irrational (but algebraic) number with a tail of never-repeating decimals. It couldn't fit in an integer without rounding/truncating. PS! If we take the exponent to be the real number 100π, the result, mathematically, would be different: some transcendental number, I suspect.
I have the following code:
float f = 0.3f;
double d1 = System.Convert.ToDouble(f);
double d2 = System.Convert.ToDouble(f.ToString());
The results are equivalent to:
d1 = 0.30000001192092896;
d2 = 0.3;
I'm curious to find out why this is?
It's not a loss of precision: 0.3 is not representable in floating point. When the system converts the float to a string it rounds; if you print out enough significant digits you will get something that makes more sense.
To see it more clearly
float f = 0.3f;
double d1 = System.Convert.ToDouble(f);
double d2 = System.Convert.ToDouble(f.ToString("G20"));
string s = string.Format("d1 : {0} ; d2 : {1} ", d1, d2);
output
"d1 : 0.300000011920929 ; d2 : 0.300000012 "
You're not losing precision; you're upcasting to a more precise representation (double, 64-bits long) from a less precise representation (float, 32-bits long). What you get in the more precise representation (past a certain point) is just garbage. If you were to cast it back to a float FROM a double, you would have the exact same precision as you did before.
What happens here is that you've got 32 bits allocated for your float. You then upcast to a double, adding another 32 bits for representing your number (for a total of 64). Those new bits are the least significant (the farthest to the right of your decimal point), and have no bearing on the actual value since they were indeterminate before. As a result, those new bits have whatever values they happened to have when you did your upcast. They're just as indeterminate as they were before -- garbage, in other words.
When you downcast from a double to a float, it'll lop off those least-significant bits, leaving you with 0.300000 (7 digits of precision).
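A quick sketch of that round trip:
float f = 0.3f;
double d = f;                 // widens to 0.30000001192092896
float back = (float)d;        // narrows back to the original float value
Console.WriteLine(f == back); // True -- nothing was lost or gained along the way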
The mechanism for converting from a string to a float is different; the compiler needs to analyze the semantic meaning of the character string '0.3f' and figure out how that relates to a floating point value. It can't be done with bit-shifting like the float/double conversion -- thus, the value that you expect.
For more info on how floating point numbers work, you may be interested in checking out this wikipedia article on the IEEE 754-1985 standard (which has some handy pictures and good explanation of the mechanics of things), and this wiki article on the updates to the standard in 2008.
edit:
First, as @phoog pointed out below, upcasting from a float to a double isn't as simple as adding another 32 bits to the space reserved to record the number. In reality, you'll get an additional 3 bits for the exponent (for a total of 11), and an additional 29 bits for the fraction (for a total of 52). Add in the sign bit and you've got your total of 64 bits for the double.
Additionally, suggesting that there are 'garbage bits' in those least significant locations is a gross generalization, and probably not correct for C#. A bit of explanation and some testing below suggest to me that this is deterministic for C#/.NET, and probably the result of some specific mechanism in the conversion rather than reserving memory for additional precision.
Way back in the beforetimes, when your code would compile into a machine-language binary, compilers (C and C++ compilers, at least) would not add any CPU instructions to 'clear' or initialize the value in memory when you reserved space for a variable. So, unless the programmer explicitly initialized a variable to some value, the values of the bits that were reserved for that location would maintain whatever value they had before you reserved that memory.
In .NET land, your C# or other .NET language compiles into an intermediate language (CIL, Common Intermediate Language), which is then Just-In-Time compiled by the CLR to execute as native code. There may or may not be a variable initialization step added by either the C# compiler or the JIT compiler; I'm not sure.
Here's what I do know:
I tested this by casting the float to three different doubles. Each one of the results had the exact same value.
That value was exactly the same as @rerun's value above: double d1 = System.Convert.ToDouble(f); result: d1 : 0.300000011920929
I get the same result if I cast using double d2 = (double)f; Result: d2 : 0.300000011920929
With three of us getting the same values, it looks like the upcast value is deterministic (and not actually garbage bits), indicating that .NET is doing something the same way across all of our machines. It's still true to say that the additional digits are no more or less precise than they were before, because 0.3f isn't exactly equal to 0.3 -- it's equal to 0.3, up to seven digits of precision. We know nothing about the values of additional digits beyond those first seven.
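The checks described above amount to something like this sketch:
float f = 0.3f;
double viaConvert = Convert.ToDouble(f);
double viaCast = (double)f;
double viaImplicit = f;
// All three widenings produce the identical double (0.300000011920929...),
// which is why the result is deterministic rather than garbage.
Console.WriteLine(viaConvert == viaCast && viaCast == viaImplicit); // True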
I use a decimal cast to get the correct result in this case and in similar cases:
float ff = 99.95f;
double dd = (double)(decimal)ff;
Refreshing on floating points (also PDF), IEEE-754 and taking part in this discussion on floating point rounding when converting to strings, brought me to tinker: how can I get the maximum and minimum value for a given floating point number whose binary representations are equal.
Disclaimer: for this discussion, I like to stick to 32 bit and 64 bit floating point as described by IEEE-754. I'm not interested in extended floating point (80-bits) or quads (128 bits IEEE-754-2008) or any other standard (IEEE-854).
Background: Computers are bad at representing 0.1 in binary representation. In C#, a float represents this as 3DCCCCCD internally (C# uses round-to-nearest) and a double as 3FB999999999999A. The same bit patterns are used for decimal 0.100000005 (float) and 0.1000000000000000124 (double), but not for 0.1000000000000000144 (double).
For convenience, the following C# code gives these internal representations:
string GetHex(float f)
{
return BitConverter.ToUInt32(BitConverter.GetBytes(f), 0).ToString("X");
}
string GetHex(double d)
{
return BitConverter.ToUInt64(BitConverter.GetBytes(d), 0).ToString("X");
}
// float
Console.WriteLine(GetHex(0.1F));
// double
Console.WriteLine(GetHex(0.1));
In the case of 0.1, there is no lower decimal number that is represented with the same bit pattern, any 0.99...99 will yield a different bit representation (i.e., float for 0.999999937 yields 3F7FFFFF internally).
My question is simple: how can I find the lowest and highest decimal value for a given float (or double) that is internally stored in the same binary representation.
Why: (I know you'll ask) to find the error in rounding in .NET when it converts to a string and when it converts from a string, to find the internal exact value and to understand my own rounding errors better.
My guess is something like: take the mantissa, remove the rest, get its exact value, get one (mantissa-bit) higher, and calculate the mean: anything below that will yield the same bit pattern. My main problem is: how to get the fractional part as integer (bit manipulation is not my strongest asset). Jon Skeet's DoubleConverter class may be helpful.
One way to get at your question is to find the size of an ULP, or Unit in the Last Place, of your floating-point number. Simplifying a little bit, this is the distance between a given floating-point number and the next larger number. Again, simplifying a little bit, given a representable floating-point value x, any decimal string whose value is between (x - 1/2 ulp) and (x + 1/2 ulp) will be rounded to x when converted to a floating-point value.
The trick is that (x +/- 1/2 ulp) is not a representable floating-point number, so actually calculating its value requires that you use a wider floating-point type (if one is available) or an arbitrary width big decimal or similar type to do the computation.
How do you find the size of an ulp? One relatively easy way is roughly what you suggested, written here in C-ish pseudocode because I don't know C#:
float absX = absoluteValue(x);
uint32_t bitPattern = getRepresentationOfFloat(absX);
bitPattern++;
float nextFloatNumber = getFloatFromRepresentation(bitPattern);
float ulpOfX = (nextFloatNumber - absX);
This works because adding one to the bit pattern of x exactly corresponds to adding one ulp to the value of x. No floating-point rounding occurs in the subtraction because the values involved are so close (in particular, there is a theorem of IEEE-754 floating-point arithmetic that if two numbers x and y satisfy y/2 <= x <= 2y, then x - y is computed exactly). The only caveats here are:
if x happens to be the largest finite floating point number, this won't work (it will return inf, which is clearly wrong).
if your platform does not correctly support gradual underflow (say an embedded device running in flush-to-zero mode), this won't work for very small values of x.
It sounds like you're not likely to be in either of those situations, so this should work just fine for your purposes.
Now that you know what an ulp of x is, you can find the interval of values that rounds to x. You can compute ulp(x)/2 exactly in floating-point, because floating-point division by 2 is exact (again, barring underflow). Then you need only compute the value of x +/- ulp(x)/2 in a suitably larger floating-point type (double will work if you're interested in float) or in a big-decimal type, and you have your interval.
I made a few simplifying assumptions through this explanation. If you need this to really be spelled out exactly, leave a comment and I'll expand on the sections that are a bit fuzzy when I get the chance.
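Translated into C#, the pseudocode above might look roughly like this (a sketch using BitConverter; it inherits the same caveats about the largest finite float and flush-to-zero mode):
using System;

class UlpInterval
{
    // Distance from |x| to the next larger representable float.
    static float UlpOf(float x)
    {
        float absX = Math.Abs(x);
        uint bits = BitConverter.ToUInt32(BitConverter.GetBytes(absX), 0);
        float next = BitConverter.ToSingle(BitConverter.GetBytes(bits + 1), 0);
        return next - absX;
    }

    static void Main()
    {
        float x = 0.1f;
        double halfUlp = UlpOf(x) / 2.0;        // exact: widening and dividing by 2 lose nothing here
        Console.WriteLine((double)x - halfUlp); // lower edge of the interval that rounds to x
        Console.WriteLine((double)x + halfUlp); // upper edge of the interval that rounds to x
    }
}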
One other note: the following statement in your question:
In the case of 0.1, there is no lower decimal number that is represented with the same bit pattern
is incorrect. You just happened to be looking at the wrong values (0.999999... instead of 0.099999... -- an easy typo to make).
Python 3.1 just implemented something like this: see the changelog (scroll down a bit), bug report.