After a bit of light reading, this article piqued my interest:
I'd have thought that yes, the two statements are equivalent, given MSDN's statement:
Represents the smallest positive Double value that is greater than zero. This field is constant.
Curious to see what people think.
EDIT: Found a computer with VS on and ran this Test. Turns out that yes, as expected, they're equivalent.
[Test]
public void EpsilonTest()
{
    Compare(0d);
    Compare(double.Epsilon);
    Compare(double.Epsilon * 0.5);
    Compare(double.NaN);
    Compare(double.PositiveInfinity);
    Compare(double.NegativeInfinity);
    Compare(double.MaxValue);
    Compare(double.MinValue);
}

public void Compare(double x)
{
    Assert.AreEqual(Math.Abs(x) == 0d, Math.Abs(x) < double.Epsilon);
}
IL code seems to cast some light on this.
Epsilon is simply a double number with the fraction part being 1, sign 0, exponent 0.
Zero is a double number with the fraction part being 0, sign 0, exponent 0.
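According to http://en.wikipedia.org/wiki/IEEE_754-1985, floating point numbers with the same sign and exponent are compared ordinally by their fraction bits. With sign 0 and exponent 0, that means (fraction < 1) is the same as (fraction == 0), so (x < Epsilon) is the same as (x == 0).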
Now, is it possible to get a zero that isn't fraction = 0, exponent = 0 (we don't care about sign, there's a Math.Abs in place)?
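The bit layout described above can be checked directly. This is only a small sketch: the class and method names are made up for illustration, and it assumes a platform where subnormal values are supported.

```csharp
using System;

public static class EpsilonBits
{
    // Raw IEEE 754 bit pattern of a double, viewed as a 64-bit integer.
    public static long Bits(double value)
    {
        return BitConverter.DoubleToInt64Bits(value);
    }

    // True when the only difference from +0 is the sign bit,
    // i.e. the exponent and fraction fields are both zero.
    public static bool IsZeroPattern(double value)
    {
        return (Bits(value) & long.MaxValue) == 0;
    }
}
```

Bits(0d) has every bit clear and Bits(double.Epsilon) has only the lowest fraction bit set, so no bit pattern fits between them. IsZeroPattern shows that the only other zero is the one with the sign bit set, which Math.Abs removes.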
Yes, as far as I can tell they should be equivalent. This is because no difference can have a magnitude less than epsilon and also be nonzero.
My only thought was concerning values such as double.NaN, I tested that and PositiveInfinity, etc. and the results were the same. By the way, comparing double.NaN to a number returns false.
I'm not sure what you mean by "equivalent" here, as that's a pretty vague term.
If you mean, will .NET consider any value less than double.Epsilon to be equal to 0d, then yes, as the article you linked to clearly demonstrates. You can show this pretty easily:
var d1 = 0d;
var d2 = double.Epsilon * 0.5;
Console.WriteLine("{0:r} = {1:r}: {2}", d1, d2, d1.Equals(d2));
// Prints: 0 = 0: True
In that sense, if you somehow produce a value of x that is less than double.Epsilon, it will already be stored in memory as zero, so Abs(x) will just be Abs(0), which is == 0d.
But this is a limitation of the binary representation as used by .NET to hold floating point numbers: it simply can't represent a non-zero number smaller than double.Epsilon so it rounds.
That doesn't mean the two statements are "equivalent", because that's entirely context-dependent. Clearly, 4.94065645841247E-324 * 0.5 is not zero, it is 2.470328229206235e-324. If you are doing calculations that require that level of precision, then no, they are not equivalent -- and you're also out of luck trying to do them in C#.
In most cases, the value of double.Epsilon is entirely too small to be of any use, meaning that Abs(x) should be treated as == 0d for values much larger than double.Epsilon, but C# relies on you to figure that out; it will happily do the calculations down to that precision, if asked.
Unfortunately, the statement "Math.Abs(x) < double.Epsilon is equivalent to Math.Abs(x) == 0d" is not true at all for ARM systems.
MSDN on Double.Epsilon contradicts itself by stating that
On ARM systems, the value of the Epsilon constant is too small to be detected, so it equates to zero.
That means that on ARM systems, there are no non-negative double values less than Double.Epsilon, so the expression Math.Abs(x) < double.Epsilon is just another way to say false.
Assume I have an array of bytes which are truly random (e.g. captured from an entropy source).
byte[] myTrulyRandomBytes = MyEntropyHardwareEngine.GetBytes(8);
Now, I want to get a random double precision floating point value, but between the values of 0 and positive 1 (like the Random.NextDouble() function performs).
Simply passing an array of 8 random bytes into BitConverter.ToDouble() can yield strange results, but most importantly, the results will almost never be less than 1.
I am fine with bit-manipulation, but the formatting of floating point numbers has always been mysterious to me. I tried many combinations of bits to apply randomness to and always ended up finding the numbers were either just over 1, always VERY close to 0, or very large.
Can someone explain which bits should be made random in a double in order to make it random within the range 0 and 1?
Though working answers have been given, I'll give another one, that looks worse but isn't:
long asLong = BitConverter.ToInt64(myTrulyRandomBytes, 0);
double number = (double)(asLong & long.MaxValue) / long.MaxValue;
The issue with casting from an ulong to double is that it's not directly supported by hardware, so it compiles to this:
vxorps xmm0,xmm0,xmm0
vcvtsi2sd xmm0,xmm0,rcx ; interpret ulong as long and convert it to double
test rcx,rcx ; add fixup if it was "negative"
jge 000000000000001D
vaddsd xmm0,xmm0,mmword ptr [00000060h]
vdivsd xmm0,xmm0,mmword ptr [00000068h]
Whereas with my suggestion it will compile more nicely:
vxorps xmm0,xmm0,xmm0
vcvtsi2sd xmm0,xmm0,rcx
vdivsd xmm0,xmm0,mmword ptr [00000060h]
Both tested with the x64 JIT in .NET 4, but this applies in general, there just isn't a nice way to convert an ulong to a double.
Don't worry about the bit of entropy being lost: there are only 2^62 doubles between 0.0 and 1.0 in the first place, and most of the smaller doubles cannot be chosen so the number of possible results is even less.
Note that this as well as the presented ulong examples can result in exactly 1.0 and distribute the values with slightly differing gaps between adjacent results because they don't divide by a power of two. You can change them to exclude 1.0 and get a slightly more uniform spacing (but see the first plot below, there is a bunch of different gaps, but this way it is very regular) like this:
long asLong = BitConverter.ToInt64(myTrulyRandomBytes, 0);
double number = (double)(asLong & long.MaxValue) / ((double)long.MaxValue + 1);
As a really nice bonus, you can now change the division to a multiplication (powers of two usually have inverses)
long asLong = BitConverter.ToInt64(myTrulyRandomBytes, 0);
double number = (double)(asLong & long.MaxValue) * 1.08420217248550443400745280086994171142578125E-19;
Same idea for ulong, if you really want to use that.
Since you also seemed interested specifically in how to do it with double-bits trickery, I can show that too.
Because of the whole significand/exponent deal, it can't really be done in a super direct way (just reinterpreting the bits and that's it), mainly because choosing the exponent uniformly spells trouble (with a uniform exponent, the numbers are necessarily clumped preferentially near 0 since most exponents are there).
But if the exponent is fixed, it's easy to make a double that's uniform in that region. That cannot be 0 to 1 because that spans a lot of exponents, but it can be 1 to 2 and then we can subtract 1.
So first mask away the bits that won't be part of the significand:
x &= (1L << 52) - 1;
Put in the exponent (1.0 - 2.0 range, excluding 2)
x |= 0x3ff0000000000000;
Reinterpret and adjust for the offset of 1:
return BitConverter.Int64BitsToDouble(x) - 1;
Should be pretty fast, too. An unfortunate side effect is that this time it really does cost a bit of entropy, because there are only 52 but there could have been 53. This way always leaves the least significant bit zero (the implicit bit steals a bit).
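Put together, the three steps above might look like the following. The class and method names are invented for this sketch; the masks and the exponent constant are the ones from the steps above.

```csharp
using System;

public static class UnitDoubleFromBits
{
    // Maps 64 random bits to a double in [0, 1) by fixing the exponent
    // so that the value lies in [1, 2) and then subtracting 1.
    public static double FromRandomBits(long x)
    {
        x &= (1L << 52) - 1;        // keep only the 52 significand bits
        x |= 0x3ff0000000000000;    // exponent field for the range [1.0, 2.0)
        return BitConverter.Int64BitsToDouble(x) - 1;
    }
}
```

With the byte array from the question, a call could look like UnitDoubleFromBits.FromRandomBits(BitConverter.ToInt64(myTrulyRandomBytes, 0)).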
There were some concerns about the distributions, which I will address now.
The approach of choosing a random (u)long and dividing it by the maximum value clearly has a uniformly chosen (u)long, and what happens after that is actually interesting. The result can justifiably be called a uniform distribution, but if you look at it as a discrete distribution (which it actually is) it looks (qualitatively) like this: (all examples for minifloats)
Ignore the "thicker" lines and wider gaps, that's just the histogram being funny. These plots used division by a power of two, so there is no spacing problem in reality, it's only plotted strangely.
Top is what happens when you use too many bits, as happens when dividing a complete (u)long by its max value. This gives the lower floats a better resolution, but lots of different (u)longs get mapped onto the same float in the higher regions. That's not necessarily a bad thing, if you "zoom out" the density is the same everywhere.
The bottom is what happens when the resolution is limited to the worst case (0.5 to 1.0 region) everywhere, which you can do by limiting the number of bits first and then doing the "scale the integer" deal. My second suggestion with the bit hacks does not achieve this, it's limited to half that resolution.
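One common way to limit the bits first and then scale is to keep the top 53 bits of a ulong and multiply by 2^-53. This is a sketch under the assumption that 53 bits (the significand width including the implicit bit) is the intended limit; the names are hypothetical.

```csharp
using System;

public static class UnitDoubleFrom53Bits
{
    // Every result is an exact multiple of 2^-53, so adjacent results
    // are evenly spaced over the whole range [0, 1).
    public static double FromRandomBits(ulong bits)
    {
        ulong top53 = bits >> 11;              // discard the low 11 bits
        return top53 * (1.0 / (1UL << 53));    // scale by exactly 2^-53
    }
}
```

Because the integer is below 2^53, the conversion to double is exact, and multiplying by a power of two does not round either.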
For what it's worth, NextDouble in System.Random scales a non-negative int into the 0.0 .. 1.0 range. The resolution of that is obviously a lot lower than it could be. It also uses an int that cannot be int.MaxValue and therefore scales by approximately 1/(2^31-1) (cannot be represented by a double, so slightly rounded), so there are actually 33 slightly different gaps between adjacent possible results, though the majority of the gaps is the same distance.
Since int.MaxValue is small compared to what can be brute-forced these days, you can easily generate all possible results of NextDouble and examine them, for example I ran this:
const double scale = 4.6566128752458E-10;
double prev = 0;
Dictionary<long, int> hist = new Dictionary<long, int>();
for (int i = 0; i < int.MaxValue; i++)
{
    long bits = BitConverter.DoubleToInt64Bits(i * scale - prev);
    if (!hist.ContainsKey(bits))
        hist[bits] = 1;
    else
        hist[bits]++;
    prev = i * scale;
    if ((i & 0xFFFFFF) == 0)
        Console.WriteLine("{0:0.00}%", 100.0 * i / int.MaxValue);
}
This is easier than you think; it's all about scaling (also true when going from a 0-1 range to some other range).
Basically, if you know that you have 64 truly random bits (8 bytes) then just do this:
double zeroToOneDouble = (double)(BitConverter.ToUInt64(bytes, 0) / (decimal)ulong.MaxValue);
The trouble with this kind of algorithm comes when your "random" bits aren't actually uniformly random. That's when you need a specialized algorithm, such as a Mersenne Twister.
I don't know whether it's the best solution for this, but it should do the job:
ulong asLong = BitConverter.ToUInt64(myTrulyRandomBytes, 0);
double number = (double)asLong / ulong.MaxValue;
All I'm doing is converting the byte array to a ulong which is then divided by its max value, so that the result is between 0 and 1.
To make sure the double represented by these bits is within the range from 0 to 1, you can apply the following mask:
long longValue = BitConverter.ToInt64(myTrulyRandomBytes, 0);
longValue &= 0x3fefffffffffffff;
The resulting value, reinterpreted as a double, is guaranteed to lie in the range [0, 1).
Remark. The 0x3fefffffffffffff value is very close to 1 and will be printed as 1, but it is really a bit less than 1.
If you want to make the generated values greater, you could set a number of the higher bits of the exponent to 1. For instance:
longValue |= 0x03c00000000000000;
Summarizing: example on dotnetfiddle.
If you care about the quality of the random numbers generated, be very suspicious of the answers that have appeared so far.
Those answers that use Int64BitsToDouble directly will definitely have problems with NaNs and infinities. For example, 0x7ff0000000000001, a perfectly good random bit pattern, converts to NaN (and so do thousands of others).
Those that try to convert to a ulong and then scale, or convert to a double after ensuring that various bit-pattern constraints are met, won't have NaN problems, but they are very likely to have distributional problems. Representable floating point numbers are not distributed uniformly over (0, 1), so any scheme that randomly picks among all representable values will not produce values with the required uniformity.
To be safe, just use ToInt32 and use that int as a seed for Random. (To be extra safe, reject 0.) This won't be as fast as the other schemes, but it will be much safer. A lot of research and effort has gone into making RNGs good in ways that are not immediately obvious.
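A minimal sketch of that suggestion follows. The class and method names are made up, and throwing on a zero seed is just one possible way to "reject 0"; a caller could equally fetch fresh bytes instead.

```csharp
using System;

public static class SeededRandom
{
    // Uses the first four entropy bytes as the seed for System.Random
    // and returns its first NextDouble() value, which lies in [0, 1).
    public static double NextUnitDouble(byte[] entropy)
    {
        int seed = BitConverter.ToInt32(entropy, 0);
        if (seed == 0)
            throw new ArgumentException("Seed of 0 rejected; supply different bytes.", "entropy");
        return new Random(seed).NextDouble();
    }
}
```

Note that only 32 of the available bits end up influencing the result, which is the price of this approach.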
Simple piece of code to print the bits out for you.
for (double i = 0; i < 1.0; i+=0.05)
{
var doubleToInt64Bits = BitConverter.DoubleToInt64Bits(i);
Console.WriteLine("{0}:\t{1}", i, Convert.ToString(doubleToInt64Bits, 2));
}
0.05: 11111110101001100110011001100110011001100110011001100110011010
0.1: 11111110111001100110011001100110011001100110011001100110011010
0.15: 11111111000011001100110011001100110011001100110011001100110100
0.2: 11111111001001100110011001100110011001100110011001100110011010
0.25: 11111111010000000000000000000000000000000000000000000000000000
0.3: 11111111010011001100110011001100110011001100110011001100110011
0.35: 11111111010110011001100110011001100110011001100110011001100110
0.4: 11111111011001100110011001100110011001100110011001100110011001
0.45: 11111111011100110011001100110011001100110011001100110011001100
0.5: 11111111011111111111111111111111111111111111111111111111111111
0.55: 11111111100001100110011001100110011001100110011001100110011001
0.6: 11111111100011001100110011001100110011001100110011001100110011
0.65: 11111111100100110011001100110011001100110011001100110011001101
0.7: 11111111100110011001100110011001100110011001100110011001100111
0.75: 11111111101000000000000000000000000000000000000000000000000001
0.8: 11111111101001100110011001100110011001100110011001100110011011
0.85: 11111111101011001100110011001100110011001100110011001100110101
0.9: 11111111101100110011001100110011001100110011001100110011001111
0.95: 11111111101110011001100110011001100110011001100110011001101001
I tried to use the BigInteger.Pow method to calculate something like 10^12345.987654321, but this method only accepts an integer as the exponent, like this:
BigInteger.Pow(BigInteger x, int y)
So how can I use a double as the exponent in the above method?
There's no arbitrary precision floating point number support in C#, so this cannot be done directly. There are some alternatives (such as looking for a 3rd party library), or you can try something like the code below - if the base is small enough, like in your case.
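using System;
using System.Numerics;

public class StackOverflow_11179289
{
    public static void Test()
    {
        int @base = 10;
        double exp = 12345.123;
        // Split the exponent into its integer and fractional parts.
        int intExp = (int)Math.Floor(exp);
        double fracExp = exp - intExp;
        BigInteger temp = BigInteger.Pow(@base, intExp);  // exact
        double temp2 = Math.Pow(@base, fracExp);          // approximate
        // Shift 52 bits of magnitude from the big integer to the double so
        // that the cast to BigInteger below keeps the double's precision.
        int fractionBitsForDouble = 52;
        for (int i = 0; i < fractionBitsForDouble; i++)
        {
            temp = BigInteger.Divide(temp, 2);
            temp2 *= 2;
        }
        BigInteger result = BigInteger.Multiply(temp, (BigInteger)temp2);
        Console.WriteLine(result);
    }
}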
The idea is to use big integer math to compute the power of the integer part of the exponent, then use double (64-bit floating point) math to compute the power of the fraction part. Then, using the fact that
a ^ (int + frac) = a ^ int * a ^ frac
we can combine the two values into a single big integer. But simply converting the double value to a BigInteger would lose a lot of its precision, so we first "shift" the precision onto the BigInteger (using the loop above, and the fact that the double type uses 52 bits for the precision), then multiply the results.
Notice that the result is an approximation. If you want a more precise number, you'll need a library that does arbitrary precision floating point math.
Update: If the base / exponent are small enough that the power would be in the range of double, we can simply do what Sebastian Piu suggested (new BigInteger(Math.Pow((double)@base, exp)))
I like carlosfigueira's answer, but of course the result of his method can only be correct on the first (most significant) 15-17 digits, because a System.Double is used as a multiplier eventually.
It is interesting to note that there does exist a method BigInteger.Log that performs the "inverse" operation. So if you want to calculate Pow(7, 123456.78) you could, in theory, search all BigInteger numbers x to find one number such that BigInteger.Log(x, 7) is equal to 123456.78 or closer to 123456.78 than any other x of type BigInteger.
Of course the logarithm function is increasing, so your search can use some kind of "binary search" (bisection search). Our answer lies between Pow(7, 123456) and Pow(7, 123457) which can both be calculated exactly.
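The bisection idea can be sketched as below. The names are hypothetical, and the code assumes a base of at least 2 and a non-negative exponent. Each step calls BigInteger.Log, so for exponents as large as 123456.78 this will be slow.

```csharp
using System;
using System.Numerics;

public static class BigPow
{
    // Approximates baseValue^exponent by bisecting between the two
    // neighbouring integer powers, which can be computed exactly.
    public static BigInteger PowByBisection(int baseValue, double exponent)
    {
        int whole = (int)Math.Floor(exponent);
        BigInteger lo = BigInteger.Pow(baseValue, whole);      // exact lower bound
        BigInteger hi = BigInteger.Pow(baseValue, whole + 1);  // exact upper bound
        while (hi - lo > 1)
        {
            BigInteger mid = (lo + hi) / 2;
            // The logarithm is increasing, so this tells us which half to keep.
            if (BigInteger.Log(mid, baseValue) < exponent)
                lo = mid;
            else
                hi = mid;
        }
        return lo;
    }
}
```

The result is only as good as BigInteger.Log itself, so the low-order digits carry no more information than they do with the other approach.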
Skip the rest if you want
Now, how can we predict in advance if there is more than one integer whose logarithm is 123456.78, up to the precision of System.Double, or if there is in fact no integer whose logarithm hits that specific Double (the precise result of an ideal Pow function being an irrational number)? In our example, there will be very many integers giving the same Double 123456.78 because the factor m = Pow(7, epsilon) (where epsilon is the smallest positive number such that 123456.78 + epsilon has a representation as a Double different from the representation of 123456.78 itself) is big enough that there will be very many integers between the true answer and the true answer multiplied by m.
Remember from calculus that the derivative of the mathematical function x → Pow(7, x) is x → Log(7)*Pow(7, x), so the slope of the graph of the exponential function in question will be Log(7)*Pow(7, 123456.78). This number multiplied by the above epsilon is still much much greater than one, so there are many integers satisfying our need.
Actually, I think carlosfigueira's method will give a "correct" answer x in the sense that Log(x, 7) has the same representation as a Double as 123456.78 has. But has anyone tried it? :-)
I'll provide another answer that is hopefully more clear. The point is: Since the precision of System.Double is limited to approx. 15-17 decimal digits, the result of any Pow(BigInteger, Double) calculation will have an even more limited precision. Therefore, there's no hope of doing better than carlosfigueira's answer does.
Let me illustrate this with an example. Suppose we wanted to calculate
Pow(10, exponent)
where in this example I choose for exponent the double-precision number
const double exponent = 100.0 * Math.PI;
This is of course only an example. The value of exponent, in decimal, can be given as one of
314.159265358979
314.15926535897933
314.1592653589793258106510620564222335815429687500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...
The first of these numbers is what you normally see (15 digits). The second version is produced with exponent.ToString("R") and contains 17 digits. Note that the precision of Double is less than 17 digits. The third representation above is the theoretical "exact" value of exponent. Note that this differs, of course, from the mathematical number 100π near the 17th digit.
To figure out what Pow(10, exponent) ought to be, I simply did BigInteger.Log10(x) on a lot of numbers x to see how I could reproduce exponent. So the results presented here simply reflect the .NET Framework's implementation of BigInteger.Log10.
It turns out that any BigInteger x from
0x0C3F859904635FC0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
through
0x0C3F85990481FE7FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
makes Log10(x) equal to exponent to the precision of 15 digits. Similarly, any number from
0x0C3F8599047BDEC0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
through
0x0C3F8599047D667FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
satisfies Log10(x) == exponent to the precision of Double. Put in another way, any number from the latter range is equally "correct" as the result of Pow(10, exponent), simply because the precision of exponent is so limited.
(Interlude: The bunches of 0s and Fs reveal that .NET's implementation only considers the most significant bytes of x. They don't care to do better, precisely because the Double type has this limited precision.)
Now, the only reason to introduce third-party software, would be if you insist that exponent is to be interpreted as the third of the decimal numbers given above. (It's really a miracle that the Double type allowed you to specify exactly the number you wanted, huh?) In that case, the result of Pow(10, exponent) would be an irrational (but algebraic) number with a tail of never-repeating decimals. It couldn't fit in an integer without rounding/truncating. PS! If we take the exponent to be the real number 100π, the result, mathematically, would be different: some transcendental number, I suspect.
I have a double value that equals 1.212E+25
When I throw it out to text I do myVar.ToString("0000000000000000000000")
The problem is even if I do myVar++ 3 or 4 times the value seems to stay the same.
Why is that?
That is because the precision of a double is not sufficient. It simply can't hold that many significant digits.
It will not fit into a long, but probably into a Decimal.
But... do you really need this level of precision?
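If that level of precision really is needed, the Decimal suggestion could be tried along these lines. This is only an illustration with invented names; it reuses the format string from the question.

```csharp
using System;

public static class DecimalIncrement
{
    // Increments a decimal the given number of times and formats it
    // with the format string used in the question.
    public static string IncrementAndFormat(decimal value, int times)
    {
        for (int i = 0; i < times; i++)
            value++;
        return value.ToString("0000000000000000000000");
    }
}
```

A call such as DecimalIncrement.IncrementAndFormat(1.212E+25m, 3) keeps every increment, because a value of this size needs 26 digits and decimal has room for them.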
To expand on the other answer: the smallest increase you can make to a double is one Unit in the Last Place, or ULP. Because double is a floating point type, the size of an ULP changes with the magnitude of the value; at 1.212E+25 it is 2^31, or about 2E+9.
As you can see, compared to 2E+9, incrementing by 1 really might as well be adding nothing, which is exactly what double will do. It wouldn't matter if you tried it 10^25 times; the value still won't increase unless you try to increase it by at least 1 ULP.
If incrementing by an ULP is useful, you can do this by casting the bits to long and back. Here is a quick extension method to do that:
public static double UlpChange(this double val, int ulp)
{
    if (!double.IsInfinity(val) && !double.IsNaN(val))
    {
        //should probably do something if we are at max or min values
        //but its not clear what
        long bits = BitConverter.DoubleToInt64Bits(val);
        return BitConverter.Int64BitsToDouble(bits + ulp);
    }
    return val;
}
double (Double) holds about 16 digits of precision and long (Int64) about 18 digits.
Neither of these appears to have sufficient precision for your needs.
However, decimal (Decimal) holds 28-29 significant digits. Although this appears to be enough for your needs, I'd recommend caution in case your requirement grows even larger. In that case you may need a third-party numeric library.
Related StackOverflow entries are:
How can I represent a very large integer in .NET?
Big integers in C#
You may want to read the Floating-Point Guide to understand how doubles work.
Basically, a double only has about 16 decimal digits of precision. At a magnitude of 10^25, an increase of 1.0 is below the threshold of precision and gets lost. Due to the binary representation, this may not be obvious.
The smallest increment that'll work is 2^30+1 which will actually increment the double by 2^31. You can test this kind of thing easily enough with LINQPad:
double inc = 1.0;
double num = 1.212e25;
while(num+inc == num) inc*=2;
inc.Dump(); //2147483648 == 2^31
(num+inc == num).Dump(); //false due to loop invariant
(num+(inc/2.0) == num).Dump();//true due to loop invariant
(num+(inc/2.0+1.0) == num).Dump();//false - 2^30+1 suffices to change the number
(num+(inc/2.0+1.0) == num + inc).Dump();//true - 2^30+1 and 2^31 are equiv. increments
((num+(inc/2.0+1.0)) - num == inc ).Dump();//true - the effective increment is 2^31
Since a double is essentially a binary number with limited precision, the smallest possible increment will itself always be a power of two. This increment can be determined directly from the bit pattern of the double, but it's probably clearer to do so with a loop as above, since that's portable across float, double and other floating point representations (which don't exist in .NET).
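For completeness, reading the increment off the bit pattern might look like this. It is a sketch with a made-up name, and it assumes the value is positive, finite and below double.MaxValue.

```csharp
using System;

public static class UlpHelper
{
    // Distance from a positive finite double to the next larger double.
    // Adding 1 to the raw bits steps to the adjacent representable value.
    public static double UlpOf(double value)
    {
        long bits = BitConverter.DoubleToInt64Bits(value);
        double next = BitConverter.Int64BitsToDouble(bits + 1);
        return next - value;
    }
}
```

For 1.212e25 this gives 2147483648, the same 2^31 that the loop above finds.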
One error I stumble upon every few months is this one:
double x = 19.08;
double y = 2.01;
double result = 21.09;
if (x + y == result)
{
    MessageBox.Show("x equals y");
}
else
{
    MessageBox.Show("that shouldn't happen!"); // <-- this code fires
}
You would expect the code to display "x equals y" but that's not the case.
The short explanation is that the decimal places, when represented as binary digits, do not fit into a double.
Example:
2.625 would look like:
10.101
because
1-------0-------1---------0----------1
1 * 2 + 0 * 1 + 1 * 0.5 + 0 * 0.25 + 1 * 0.125 = 2.625
And some values (like the result of 19.08 plus 2.01) cannot be represented with the bits of a double.
One solution is to use a constant:
double x = 19.08;
double y = 2.01;
double result = 21.09;
double EPSILON = 10E-10;
// Math.Abs is needed so that a large negative difference does not pass the test.
if (Math.Abs(x + y - result) < EPSILON)
{
    MessageBox.Show("x equals y"); // <-- this code fires
}
else
{
    MessageBox.Show("that shouldn't happen!");
}
If I use decimal instead of double in the first example, the result is "x equals y".
But I'm asking myself whether this is because the decimal type is not vulnerable to this behaviour, or whether it only works in this case because the values "fit" into 128 bits.
Maybe someone has a better solution than using a constant?
Btw. this is not a dotNet/C# problem, it happens in most programming languages I think.
Decimal will be accurate so long as you stay within values which are naturally decimals in an appropriate range. So if you just add and subtract, for example, without doing anything which would skew the range of digits required too much (adding a very very big number to a very very small number) you will end up with easily comparable results. Multiplication is likely to be okay too, but I suspect it's easier to get inaccuracies with it.
As soon as you start dividing, that's where the problems can come - particularly if you start dividing by numbers which include prime factors other than 2 or 5.
Bottom line: it's safe in certain situations, but you really need to have a good handle on exactly what operations you'll be performing.
Note that it's not the 128-bitness of decimal which is helping you here - it's the representation of numbers as floating decimal point values rather than floating binary point values. See my articles on .NET binary floating point and decimal floating point for more information.
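The points above can be tried out with a few one-liners. This is an illustrative sketch with invented names: the first method repeats the question's sum in decimal, the second keeps the double version for contrast, and the third shows a division by 3 that decimal cannot represent exactly.

```csharp
using System;

public static class DecimalVersusDouble
{
    // The question's comparison in base-10 floating point.
    public static bool DecimalSumMatches()
    {
        decimal x = 19.08m;
        decimal y = 2.01m;
        return x + y == 21.09m;
    }

    // The same comparison in binary floating point, as in the question.
    public static bool DoubleSumMatches()
    {
        double x = 19.08;
        double y = 2.01;
        return x + y == 21.09;
    }

    // Dividing by 3 introduces a prime factor other than 2 or 5,
    // so the quotient is rounded and the round trip is not exact.
    public static bool DecimalDivisionRoundTrips()
    {
        decimal third = 1m / 3m;
        return third * 3m == 1m;
    }
}
```

The decimal sum compares equal because 19.08, 2.01 and 21.09 are all exactly representable in base 10, while the division case shows that decimal is not immune to rounding in general.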
System.Decimal is just a floating point number with a different base so, in theory, it is still vulnerable to the sort of error you point out. I think you just happened on a case where rounding doesn't happen. More information here.
Yes, the .NET System.Double structure is subject to the problem you describe.
from http://msdn.microsoft.com/en-us/library/system.double.epsilon.aspx:
Two apparently equivalent floating-point numbers might not compare equal because of differences in their least significant digits. For example, the C# expression, (double)1/3 == (double)0.33333, does not compare equal because the division operation on the left side has maximum precision while the constant on the right side is precise only to the specified digits. If you create a custom algorithm that determines whether two floating-point numbers can be considered equal, you must use a value that is greater than the Epsilon constant to establish the acceptable absolute margin of difference for the two values to be considered equal. (Typically, that margin of difference is many times greater than Epsilon.)
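A custom comparison of the kind MSDN describes could be sketched as follows. The method name and both default tolerances are assumptions chosen for illustration; suitable values depend entirely on the calculation at hand.

```csharp
using System;

public static class FloatCompare
{
    // Treats a and b as equal when they differ by no more than an absolute
    // margin, or by no more than a margin relative to the larger magnitude.
    public static bool NearlyEqual(double a, double b,
                                   double absTol = 1e-9, double relTol = 1e-12)
    {
        if (a == b)
            return true;    // exact match, including equal infinities
        if (double.IsInfinity(a) || double.IsInfinity(b))
            return false;   // an infinity is never close to anything else
        double diff = Math.Abs(a - b);   // NaN here makes both tests below fail
        return diff <= absTol
            || diff <= relTol * Math.Max(Math.Abs(a), Math.Abs(b));
    }
}
```

With these defaults, 19.08 + 2.01 and 21.09 are considered equal, while 1.0 / 3 and 0.33333 are not.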
In short: how can I execute a+b such that any loss-of-precision due to truncation is away from zero rather than toward zero?
The Long Story
I'm computing the sum of a long series of floating point values for the purpose of computing the sample mean and variance of the set. Since Var(X) = E(X^2) - E(X)^2, it suffices to maintain a running count of all numbers, the sum of all numbers so far, and the sum of the squares of all numbers so far.
So far so good.
However, it's absolutely required that E(X^2) >= E(X)^2, which due to floating point accuracy isn't always the case. In pseudo-code, the problem is this:
int count;
double sum, sumOfSquares;
...
double value = <current-value>;
double sqrVal = value*value;
count++;
sum += value; //slightly rounded down since value is truncated to fit into sum
sumOfSquares += sqrVal; //rounded down MORE since the order-of-magnitude
//difference between sqrVal and sumOfSquares is twice that between value and sum;
For variable sequences, this isn't a big issue - you just end up slightly under-estimating the variance. However, for constant or almost-constant sets with a non-zero mean, it can mean that E(X^2) < E(X)^2, resulting in a negative computed variance, which violates expectations of consuming code.
Now, I know about Kahan Summation, which isn't an attractive solution. Firstly, it makes the code susceptible to optimization vagaries (depending on optimization flags, code may or may not exhibit this problem), and secondly, the problem isn't really due to the precision - which is good enough - it's because addition introduces systematic error towards zero. If I could execute the line
sumOfSquares += sqrVal;
in such a way as to ensure that sqrVal is rounded up, not down, into the precision of sumOfSquares, I'd have a numerically reasonable solution. But how can I achieve that?
Edit: Finished question - why does pressing enter in the drop-down-list in the tag field submit the question anyhow?
There's another single-pass algorithm which rearranges the calculation a bit. In pseudocode:
n = 0
mean = 0
M2 = 0
for x in data:
    n = n + 1
    delta = x - mean
    mean = mean + delta/n
    M2 = M2 + delta*(x - mean)  # This expression uses the new value of mean
variance_n = M2/n  # Sample variance
variance = M2/(n - 1)  # Unbiased estimate of population variance
(Source: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance )
This seems better behaved with respect to the issues you pointed out with the usual algorithm.
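Since the question is about C#, here is a possible translation of the pseudocode above. The class and method names are made up; VarianceN corresponds to variance_n (M2/n) in the pseudocode.

```csharp
using System;
using System.Collections.Generic;

public static class RunningVariance
{
    // Single-pass variance following the pseudocode above: keeps a running
    // mean and a running sum of squared deviations (m2) instead of raw sums.
    public static double VarianceN(IEnumerable<double> data)
    {
        long n = 0;
        double mean = 0, m2 = 0;
        foreach (double x in data)
        {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);   // uses the updated mean
        }
        return n > 0 ? m2 / n : double.NaN;
    }
}
```

For a constant input every delta after the first element is zero, so m2 stays at zero and the computed variance does not go negative in that case.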
IEEE 754 provides four rounding modes (toward -inf, toward +inf, toward 0, to nearest). Toward +inf is what you seem to want. There is no standard control in C90 or C++. C99 added the header <fenv.h>, which is also present as an extension in some C90 and C++ implementations. To respect the C99 standard, you'd have to write something like:
#include <assert.h>
#include <fenv.h>
#pragma STDC FENV_ACCESS ON
int old_round_mode = fegetround();
int set_round_ok = fesetround(FE_UPWARD);
assert(set_round_ok == 0);
...
/* restore the previous rounding mode; reuse the variable declared above */
set_round_ok = fesetround(old_round_mode);
assert(set_round_ok == 0);
It is well known that the algorithm you use is numerically unstable and has precision problems. It is better for precision to do two passes on the data.
If you don't worry about the precision, but just about a negative variance, why don't you simply do V(x) = Max(0, E(X^2) - E(X)^2)