Why does a float variable stop incrementing at 16777216 in C#? - c#

float a = 0;
while (true)
{
    a++;
    if (a > 16777216)
        break; // Will never break... a stops at 16777216
}
Can anyone explain to me why a float value stops incrementing at 16777216 in this code?
Edit:
Or even simpler:
float a = 16777217; // a becomes 16777216

Short roundup of IEEE-754 floating point numbers (32-bit) off the top of my head:
1 bit sign (0 means positive number, 1 means negative number)
8 bit exponent (stored with a bias of 127, not important here)
23 bits "mantissa"
With exceptions for the exponent values 0 and 255, you can calculate the value as: (sign ? -1 : +1) * 2^exponent * (1.0 + mantissa)
The mantissa bits represent binary digits after the binary point, e.g. 1001 0000 0000 0000 0000 000 = 2^-1 + 2^-4 = .5 + .0625 = .5625. The digit in front of the binary point is not stored but implicitly assumed to be 1 (if the exponent field is 0, a leading 0 is assumed instead, but that's not important here), so this mantissa example represents the value 1.5625, and for an exponent of 30, for instance, the whole number would be 1.5625 * 2^30.
Now to your example:
16777216 is exactly 2^24, and would be represented as a 32-bit float like so:
sign = 0 (positive number)
exponent = 24 (stored as 24+127=151=10010111)
mantissa = .0
As a 32-bit floating-point representation: 0 10010111 00000000000000000000000
Therefore: Value = (+1) * 2^24 * (1.0 + .0) = 2^24 = 16777216
Now let's look at the number 16777217, or exactly 2^24 + 1:
sign and exponent are the same
mantissa would have to be exactly 2^-24 so that (+1) * 2^24 * (1.0 + 2^-24) = 2^24 + 1 = 16777217
And here's the problem. The mantissa cannot have the value 2^-24 because it only has 23 bits, so the number 16777217 simply cannot be represented with the accuracy of 32-bit floating-point numbers!
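To make this concrete, here is a small C# sketch (my own illustration, assuming .NET Core / .NET 5+ where BitConverter.SingleToInt32Bits is available) that prints the raw bit patterns:
float f1 = 16777216f; // exactly 2^24
float f2 = 16777217f; // the compiler must round this literal to the nearest float
Console.WriteLine(BitConverter.SingleToInt32Bits(f1).ToString("X8")); // 4B800000
Console.WriteLine(BitConverter.SingleToInt32Bits(f2).ToString("X8")); // 4B800000 -- same bits
Console.WriteLine(f1 == f2); // True: 16777217 rounds back down to 16777216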

16777217 cannot be represented exactly with a float. The next highest number that a float can represent exactly is 16777218.
So when you try to increment the float value 16777216 by one, the result 16777217 cannot be represented as a float, and the operation rounds back to 16777216.

When you look at that value in its binary representation, you'll see that it's a one followed by many zeroes, namely 1 0000 0000 0000 0000 0000 0000, or exactly 2^24. At 16777216 the number has just grown by one binary digit.
As it's a floating point number with a fixed number of significant digits, this means the position of the last digit that is still stored (i.e. the limit of its precision) shifts to the left as well.
What you're seeing is that the last stored digit is now worth more than one, so adding one no longer makes any difference.
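A quick C# sketch of that shift (my own illustration, again assuming BitConverter.SingleToInt32Bits/Int32BitsToSingle are available, i.e. .NET Core / .NET 5+): the gap between adjacent floats doubles at 2^24, from 1 to 2:
float f = 16777216f; // 2^24
int bits = BitConverter.SingleToInt32Bits(f);
float next = BitConverter.Int32BitsToSingle(bits + 1); // the very next representable float
Console.WriteLine(next - f); // 2 -- the last stored digit is now worth 2, not 1
float below = BitConverter.Int32BitsToSingle(bits - 1); // the float just below 2^24
Console.WriteLine(f - below); // 1 -- below 2^24 the spacing is still 1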

Imagine this in decimal form. Suppose you had the number:
1.000000 * 10^6
or 1,000,000. If all you had were six digits of accuracy, adding 0.5 to this number would yield
1.0000005 * 10^6
which needs one more digit than we can keep, so the result has to be rounded. The IEEE default rounding mode is round-to-nearest, with ties broken by "round to even". In this instance, every time you increment this value, the halfway result rounds back down in the floating point unit to 16,777,216, or 2^24. Singles in IEEE 754 are represented as:
+/- exponent (1.) fraction
where the "1." is implied and the fraction is another 23 bits, all zeros, in this case. The extra binary 1 will spill into the guard digit, carry down to the rounding step, and be deleted each time, no matter how many times you increment it. The ulp or unit in the last place will always be zero. The last successful increment is from:
+2^23 * (+1.) 11111111111111111111111 -> +2^24 * (1.) 00000000000000000000000
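A short C# sketch of that behaviour (my own illustration), checking equality instead of printed output so it is independent of the runtime's formatting:
float a = 16777215f; // 2^24 - 1, still exactly representable
Console.WriteLine(a + 1f == 16777216f); // True -- the last increment that works
Console.WriteLine(16777216f + 1f == 16777216f); // True -- the +1 is a halfway case, ties round to even (2^24)
Console.WriteLine(16777216f + 2f == 16777218f); // True -- a full ulp (2) does move the value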

Related

Convert Long to Float Changing the number c#

I have a long variable long x = 231021578;
and when I convert it to float like float y = x;
the value of y will be 231021584.
I want to know why this happens. Float is stored in 4 bytes and its range is from ±1.5x10^−45 to ±3.4x10^38, and this value is within that range.
I want to know the concept behind this change. I searched a lot but I didn't find anything.
A 4-byte object can encode, at most, 2^32 different values.
Do not expect a float, with its typically 24 bit precision, to encode exactly every integer value from 0 to ±3.4x10^38.
231021578 (0xDC5 1C0A), with its 27 leading significant bits*1, is not one of those floats. The closest float is 231021584 (0xDC5 1C10). That has 24 or fewer significant bits*2.
*1
Hex: 0xDC5 1C0A
Bin: 1101 1100 0101 0001 1100 0000 1010
     ^-------------------------------^ 27 significant leading bits.
*2
Hex: 0xDC5 1C10
Bin: 1101 1100 0101 0001 1100 0001 0000
     ^---------------------------^ 24 significant leading bits.
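For illustration, a quick C# check (my own sketch) that the conversion really lands on 231021584:
long x = 231021578;
float y = x; // rounded to the nearest representable float
Console.WriteLine(y == 231021584f); // True -- 231021584 has only 24 significant bits
Console.WriteLine(231021584 - 231021578); // 6 -- closer than the float below it, 231021568, which is 10 away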
Floats are an approximation and cannot represent ALL rational values with only 4 bytes; long is 8 bytes, so you should expect to lose some information when you convert the value to a type with less precision. On top of that, float is stored differently: its 32 bits are split into a sign, an exponent and a significand, whereas an integral type like long uses all of its bits to store the value directly.
You will get better results with double or decimal.
As a general rule, I use decimal for discrete values that need to keep their exact value to a specific number of decimal places 100% of the time, for instance monetary values on invoices and transactions. Many other measurements are acceptable to store and process using double.
The key take-away is that double is better for an unspecified number of decimal places, whereas decimal is suited to implementations that have a fixed number of decimal places. Both of these concepts can lead to rounding errors at different points in your logic: decimal forces you to deal with rounding deliberately up front, while double allows you to defer management of rounding until you need to display the value.
long x = 231021578;
float y = x;
double z = x;
decimal m = x;
Console.WriteLine("long: {0}", x);
Console.WriteLine("float: {0}", y);
Console.WriteLine("double: {0}", z);
Console.WriteLine("decimal: {0}", m);
Results:
long: 231021578
float: 2.310216E+08
double: 231021578
decimal: 231021578
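To illustrate the rounding difference mentioned above, a small sketch (my own example): summing 0.1 ten times drifts in binary floating point but stays exact in decimal:
double dsum = 0;
decimal msum = 0;
for (int i = 0; i < 10; i++)
{
    dsum += 0.1;  // 0.1 has no exact base-2 representation, so tiny errors accumulate
    msum += 0.1m; // 0.1m is stored exactly as the integer 1 with a scale of 1
}
Console.WriteLine(dsum == 1.0);  // False
Console.WriteLine(msum == 1.0m); // True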
DotNetPerls - C# float Numbers
Greg Low - SQL: Newbie Mistake #1: Using float instead of decimal
C# Decimal
Floating-point numeric types (C# reference)
It's out of scope for this post, but there was a healthy discussion related to this 10 years ago: Why is the data stored in a Float datatype considered to be an approximate value?

What's the meaning of precision digits in float / double / decimal? [duplicate]

This question already has answers here:
How to calculate float type precision and does it make sense?
(4 answers)
Closed 2 years ago.
On the C# reference for floating-point numeric types one can read that
float has a precision of 6 to 9 digits
double has a precision of 15 to 17 digits
decimal has a precision of 28 to 29 digits
What does precision mean in this context and especially, how can the precision be a range? As the number of bits for the exponent and mantissa are fixed, how can the precision be variable (in the described range)? Can someone please give an example for e.g. a float with a precision of 6 and one with a precision of 9?
float and double
(I'll explain float, that is the IEEE-754 single-precision floating-point format; double, the IEEE-754 double-precision format, works the same way but with bigger numbers.)
In general you can imagine a float to be:
mantissa₂ * (2 ^ exponent₂)
where mantissa₂ means mantissa in base two, and exponent₂ means exponent in base two
The mantissa₂ is 23 bits, the exponent₂ 8 bits. There is an extra bit for the sign, and the exponent₂ has a special format with a special range that we will see further below.
There is another trick: floating points are normally saved in "normalized" form:
1₂.mantissa₂ * (2 ^ exponent₂)
so the first digit is always 1₂, and so there is a 1₂ plus 23 binary digits for the mantissa₂, so a total of 24 digits for the complete mantissa₂.
Now, with 24 bits you can represent every integer between 0 and 16,777,215, that is 7 full decimal digits plus an 8th that is "partial" (you can't have 26,777,216 for example). In fact log₁₀ 2^24 = 7.22471989594
The exponent "moves" a floating decimal point, so that you can have, for example
1111 1111 1111 1111 1111 111.1₂ (23 ones before the binary point and 1 after it: 24 significant digits in total)
or
1111 1111 1111 1111 1111 11.11₂ (22 ones before the binary point and 2 after it)
or
1111 1111 1111 1111 1111 1111 0₂ (the same 24 ones followed by a zero)
or
1111 1111 1111 1111 1111 1111 00₂ (the same 24 ones followed by two zeros)
and so on.
The exponent₂ has special values: 0 is used for denormalized numbers (and zero), and 255 for NaN and Infinity (255 means all the bits of the exponent are 1); the remaining stored values 1 to 254 encode, after subtracting the bias of 127, exponents from -126 to +127.
A negative exponent moves the binary point to the left by that many places, a positive exponent moves it to the right in the same way.
If the exponent is 0, the number is "denormalized". They are ugly floating point numbers that have special handling and are slower for this reason. When the number is "denormalized" then there is no implicit 1₂ at the beginning of the number, so you only have 23 bits of mantissa, that is 6 dot something digits of precision (log₁₀ 2^23 = 6.9236899)
Can't explain how the 9 digits of precision come out.
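As a small C# illustration of both ends of that range (my own sketch; output shown for an English/invariant locale): 6 decimal digits always survive a text → float → text round trip, while showing the stored value exactly enough to recover it can take up to 9:
float f = 123.456f; // 6 significant decimal digits in the source text
Console.WriteLine(f.ToString("G6")); // 123.456 -- 6 digits come back unchanged
float g = 0.1f; // 0.1 has no exact binary representation
Console.WriteLine(g.ToString("G9")); // 0.100000001 -- up to 9 digits are needed to pin down the stored value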
decimal
With decimal it is easy: the format is:
mantissa₂ / (10 ^ exponent₂)
where mantissa₂ is 96 bits, exponent₂ is 5 bits (a little less, the range is [0;28]), plus there is a sign bit, and many unused bits. The exact format is written in the reference source. In decimals there is no implicit initial 1₂, so it is pure 96 bits, and log₁₀ 2^96 = 28.8988795837, so 28 or 29 digits.
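A tiny C# check of that digit count (my own sketch): the largest decimal is 2^96 - 1, which has 29 digits:
Console.WriteLine(decimal.MaxValue); // 79228162514264337593543950335, i.e. 2^96 - 1
Console.WriteLine(decimal.MaxValue.ToString().Length); // 29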
What's the meaning of precision digits in float / double / decimal?
The decimal digits needed to round trip between text and the type.
text-FP-text: take a number as decimal text, convert it to the floating point type, then convert it back to text with the same number of digits. The largest digit count for which you always get the same text back, over the entire exponent range of the FP type, is the lower number, e.g. 6 for float. As long as the text version has only 6 significant digits, float can encode a value close enough.
FP-text-FP: take an FP value, convert it to decimal text, then convert it back to FP. The number of significant decimal digits the text needs so that the original FP value is always recovered exactly is the higher number, e.g. 9 for float. As long as a text version reports 9+ significant digits, the original FP value can be recovered exactly.
float has 24 bits of binary precision. To translate that into decimal, the above context is important. The minimum non-zero double takes about 330+ decimal digits to print out exactly, yet that is rarely thought of as the precision of that number.
Can someone please give an example for e.g. a float with a precision of 6 and one with a precision of 9?
6 decimal digits always work in the text-FP-text direction. In the other direction, "9999979e3" and "9999978e3" (7 significant digits each) both convert to the same float, displayed as 9.999978e+09, so 7 text digits are already more than a float can always distinguish, while 9 significant text digits are enough to round-trip any float exactly.
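For illustration, a small C# sketch of both directions (my own example; parsing with the invariant culture so the strings are read the same everywhere):
var inv = System.Globalization.CultureInfo.InvariantCulture;
// Two different 7-digit texts collapse onto the same float ...
float f1 = float.Parse("9999979e3", inv);
float f2 = float.Parse("9999978e3", inv);
Console.WriteLine(f1 == f2); // True -- 7 text digits are too many for float
// ... but a 9-significant-digit text always recovers the exact float value.
string nine = f1.ToString("G9", inv);
Console.WriteLine(float.Parse(nine, inv) == f1); // True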

Explicit conversion from Single to Decimal results in different bit representation

If I convert single s into decimal d, I've noticed its bit representation differs from that of the decimal created directly.
For example:
Single s = 0.01f;
Decimal d = 0.01m;
int[] bitsSingle = Decimal.GetBits((decimal)s);
int[] bitsDecimal = Decimal.GetBits(d);
Returns (middle elements removed for brevity):
bitsSingle:
[0] = 10
[3] = 196608
bitsDecimal:
[0] = 1
[3] = 131072
Both of these are decimal numbers, and both appear to accurately represent 0.01:
Looking at the spec sheds no light except perhaps:
§4.1.7 Contrary to the float and double data types, decimal fractional
numbers such as 0.1 can be represented exactly in the decimal
representation.
Suggesting that this is somehow affected by single not being able to accurately represent 0.01 before the conversion, therefore:
Why is this not accurate by the time the conversion is done?
Why do we seem to have two ways to represent 0.01 in the same datatype?
TL;DR
Both decimals precisely represent 0.01. It's just that the decimal format allows multiple bitwise-different values that represent the exact same number.
Explanation
It isn't about single not being able to represent 0.01 precisely. As per the documentation of GetBits:
The binary representation of a Decimal number consists of a 1-bit
sign, a 96-bit integer number, and a scaling factor used to divide the
integer number and specify what portion of it is a decimal fraction.
The scaling factor is implicitly the number 10, raised to an exponent
ranging from 0 to 28.
The return value is a four-element array of 32-bit signed integers.
The first, second, and third elements of the returned array contain
the low, middle, and high 32 bits of the 96-bit integer number.
The fourth element of the returned array contains the scale factor and
sign. It consists of the following parts:
Bits 0 to 15, the lower word, are unused and must be zero.
Bits 16 to 23 must contain an exponent between 0 and 28, which
indicates the power of 10 to divide the integer number.
Bits 24 to 30 are unused and must be zero.
Bit 31 contains the sign: 0 mean positive, and 1 means negative.
Note that the bit representation differentiates between negative and
positive zero. These values are treated as being equal in all
operations.
The fourth integer of each decimal in your example is 0x00030000 for bitsSingle and 0x00020000 for bitsDecimal. In binary this maps to:
bitsSingle  00000000 00000011 00000000 00000000
            |\-----/ \------/ \---------------/
            |   |        |            |
  sign <----+ unused exponent       unused
            |   |        |            |
            |/-----\ /------\ /---------------\
bitsDecimal 00000000 00000010 00000000 00000000
NOTE: exponent represents multiplication by negative power of 10
Therefore, in the first case the 96-bit integer is divided by an additional factor of 10 compared to the second -- bits 16 to 23 give the value 3 instead of 2. But that is offset by the 96-bit integer itself, which in the first case is also 10 times greater than in the second (obvious from the values of the first elements).
The difference in observed values can therefore be attributed simply to the fact that the conversion from single uses subtly different logic to derive the internal representation compared to the "straight" constructor.
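To see the same effect without a float involved, here is a small C# sketch (my own example) building the two bit patterns directly with the decimal(lo, mid, hi, isNegative, scale) constructor:
decimal a = new decimal(1, 0, 0, false, 2);  // 96-bit integer 1,  scale 2 -> 0.01
decimal b = new decimal(10, 0, 0, false, 3); // 96-bit integer 10, scale 3 -> 0.010
Console.WriteLine(a == b); // True -- the same number
Console.WriteLine(string.Join(", ", decimal.GetBits(a))); // 1, 0, 0, 131072  (scale 2 in bits 16-23)
Console.WriteLine(string.Join(", ", decimal.GetBits(b))); // 10, 0, 0, 196608 (scale 3 in bits 16-23)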

Why do float and int have such different maximum values even though they're the same number of bits? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
what the difference between the float and integer data type when the size is same in java?
As you probably know, both of these types are 32 bits. int can hold only integer numbers, whereas float also supports floating point numbers (as the type names suggest).
How is it possible then that the max value of int is 2^31, and the max value of float is 3.4*10^38, while both of them are 32 bits?
I think that int's max value should be higher than float's, because it doesn't reserve bits for the fractional part and accepts only integer numbers. I'd be glad for an explanation.
Your intuition quite rightly tells you that there can be no more information content in one than the other, because they both have 32 bits. But that doesn't mean we can't use those bits to represent different values.
Suppose I invent two new datatypes, uint4 and foo4. uint4 uses 4 bits to represent an integer, in the standard binary representation, so we have
bits value
0000 0
0001 1
0010 2
...
1111 15
But foo4 uses 4 bits to represent these values:
bits value
0000 0
0001 42
0010 -97
0011 1
...
1110 pi
1111 e
Now foo4 has a much wider range of values than uint4, despite having the same number of bits! How? Because there are some uint4 values that can't be represented by foo4, so those 'slots' in the bit mapping are available for other values.
It is the same for int and float - they can both store values from a set of 2^32 values, just different sets of 2^32 values.
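A small C# sketch of that idea (my own illustration): the same 32-bit pattern means completely different things depending on which mapping you use, and the float mapping trades exactness for range:
int bits = 1065353216; // some 32-bit pattern (0x3F800000)
byte[] raw = BitConverter.GetBytes(bits);
Console.WriteLine(BitConverter.ToSingle(raw, 0)); // 1 -- the float mapping assigns this pattern the value 1.0
Console.WriteLine(int.MaxValue); // 2147483647 -- every integer up to here is exact
Console.WriteLine(float.MaxValue); // about 3.4028235E+38 -- far larger, but large integers become approximate
Console.WriteLine((float)int.MaxValue == 2147483648f); // True -- 2147483647 itself is not representable and rounds up to 2^31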
A float can store a much larger value, but it will not be precise, even in the digits before the decimal point.
Consider the following example:
float a = 123456789012345678901234567890f; //30 digits
Console.WriteLine(a); // 1.234568E+29
Notice that only about the first 7 significant digits are kept.
An integer on the other hand will always precisely store any number within its range of values.
For the sake of comparison, let's look at a double precision floating point number:
double a = 123456789012345678901234567890d; //30 digits
Console.WriteLine(a); // 1.23456789012346E+29
Notice that roughly twice as many significant digits are preserved.
These types are based on the IEEE 754 floating point specification; that is why this is possible. Please read that documentation. It is not just about how many bits there are.
The hint is in the "floating" part of "floating point". What you say basically assumes fixed point. A floating point number does not "reserve space" for the digits after the decimal point - it has a limited number of digits (23 binary) and remembers what power of two to multiply it by.

Do floating points have more precision if calculated with a range of high values rather than low values?

Would a higher range of floats be more accurate to multiply / divide / add / subtract than a lower range?
For example, would 567.56 / 345.54 be more accurate than .00097854 / .00021297 ?
The answer to your question is "no." Floating point numbers are (usually*) represented with a normalized mantissa and an exponent. Multiplication and division operate first on the normalized mantissa, then on the exponents.
Addition and subtraction are, of course, another story. Operations like your examples:
567.56 + 345.54 or .00097854 - .00021297
work fine. But operations with disparate orders of magnitude like
567.56 + .00097854 or 345.54 - .00021297
may lose some low-order precision.
The IEEE floating point standard includes denormalized numbers. If you are an astrophysicist or runtime-library developer, you may need to understand them. See http://en.wikipedia.org/wiki/Denormal_number
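For illustration, a small C# sketch of the "disparate orders of magnitude" case (my own example; the exact values don't matter, only that low-order bits of the small operand are lost):
float big = 567.56f;
float small = 0.00097854f;
float sum = big + small; // small must be aligned to big's exponent, dropping its low bits
Console.WriteLine(sum - big == small); // typically False -- part of small was rounded away
Console.WriteLine(small + small == 2 * small); // True -- operands of similar magnitude lose nothing here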
For IEEE 754 binary floating-point numbers (the most common), floating-point values have the same number of bits in the significand throughout most of the exponent range. However, there is a portion of the range where the significand has effectively fewer bits. And the relative error caused by rounding does vary depending on where the significand lies within its range.
IEEE 754 floating-point numbers are represented by a sign (+1 or -1, encoded as 0 or 1), an exponent (for double-precision, -1022 to 1023, encoded as the exponent plus 1023, so 1 to 2046), and a significand (for double-precision, a fraction usually from 1 to just under 2, represented with 53 bits but encoded with 52 bits because the first bit is implicitly 1).
E.g., the number 6.5 is encoded with the bits 0 (sign +1), 10000000001 (exponent 2), and 1010000000000000000000000000000000000000000000000000 (binary fraction 1.1010, hex 1.a, decimal 1.625). We can write this in hexadecimal floating-point as 0x1.ap2 (hex fraction 1.a multiplied by 2 to the power of decimal 2). Writing in hexadecimal floating-point enables humans to see the floating-point representation fairly easily.
For the exponent, the encoding values of 0 and 2047 are special. When the encoding is 0, the exponent is the same as when the encoding is 1 (-1022), but the implicit bit of the fraction is 0 instead of 1. When the encoding is 2047, the floating-point object represents infinity (if the significand bits are all zero) or a NaN (otherwise).
When the encoded exponent is 0 and the significand bits are all zero, the number represents zero (with +0 and -0 distinguished by the sign). If the significand bits are not all zero, the number is said to be denormalized. This is because most numbers are “normalized” by adjusting the exponent so that the fraction is between 1 (inclusive) and 2 (exclusive). For denormalized numbers, the fraction is less than 1; it starts with “0.” instead of “1.”.
When the result of a floating-point operation is a denormalized number, it effectively has fewer bits in the significand. Thus, as numbers drop below 0x1p-1022 (2^-1022), the effective precision decreases.
When numbers are in the normal range (not underflowing to denormals and not overflowing to infinity), then there are no differences in the significands of numbers with different exponents, so:
(2a+2b)/2 has exactly the same result as a+b.
(2a-2b)/2 has exactly the same result as a-b.
(2ab)/2 has exactly the same result as ab.
Note, however, that the relative error can change. When a floating-point operation is performed, the exact mathematical result must be rounded to a representable value. This rounding can happen only in units representable by the significand. For a given exponent, the bits in the significand have a fixed value. So the last bit in the significand represents a certain value. That value is a greater portion of a significand near 1 than it is of a significand near 2.
For a double-precision result, the unit of least precision (ULP) is 1 part in 2^52 of the value of the greatest bit in the significand. When using round-to-nearest mode (the most common default), the greatest error is at most half of that, because, if the representable number in one direction is more than half an ULP away, the number in the other direction is less than half an ULP away. And the closer number is returned by a proper floating-point operation.
Thus, the maximum relative error in a result with a significand near 1 is slightly over 2^-53, but the maximum relative error in a result with a significand near 2 is slightly under 2^-54.
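A quick C# check of those identities (my own sketch; it assumes the values stay in the normal range, with no underflow or overflow):
double a = 0.1, b = 0.2; // arbitrary values that are not exactly representable
Console.WriteLine((2 * a + 2 * b) / 2 == a + b); // True -- scaling by a power of two is exact
Console.WriteLine((2 * a - 2 * b) / 2 == a - b); // True
Console.WriteLine((2 * a) * b / 2 == a * b); // True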
For the sake of completeness, I have to disagree a bit and say Yes, it may matter somehow...
Indeed, if you perform 56756.0 / 34554.0, then you'll get the nearest representable Float to the exact mathematical result, with a single floating point rounding "error".
This is because 56756.0 and 34554.0 are representable exactly in floating point (single or double precision IEEE 754), and because according to IEEE 754 standard, operations perform an exact rounding operation (in default mode to the nearest).
If you write 567.56 / 345.54, then neither number is represented exactly in binary floating point, so the result of this operation accumulates 3 floating point rounding "errors" (one for each literal conversion, plus one for the division).
Let's compare the result in Squeak Smalltalk in double precision (Float), converted to exact arithmetic (Fraction, with arbitrary-precision integers for numerator and denominator):
((56756.0 / 34554.0) asFraction - (56756 / 34554)) asFloat.
-> -7.932275867322412e-17
So far, so good, the magnitude of error is less than or equal to half an ulp, as promised by IEEE 754:
(56756 / 34554) asFloat ulp / 2
-> 1.1102230246251565e-16
With cumulated rounding errors, you may get a larger error (but never a smaller):
((567.56 / 345.54) asFraction - (56756 / 34554)) asFloat
-> -3.0136736359825544e-16
((0.00056756 / 0.00034554) asFraction - (56756 / 34554)) asFloat
-> 3.647664511768385e-16
The above example is hard to generalize, and I fully agree with the other answers: generally, NO, you should only care about relative precision.
...Unless perhaps you want to implement some function with a very strict tolerance for round-off errors...
No. In the sense that there's the same number of significant digits available no matter what the order of magnitude (exponent part) of your number is.
