How to detect outliers in an array of doubles? - c#

I'm trying to determine whether or not there are outliers in a list of doubles. Basically, an outlier is anything below 10 percent or above 90 percent of the range between the limits. I have the following code working, but it doesn't work properly with negative numbers and I'm not seeing what is wrong. Is there a better way to approach this, or is there something glaringly wrong in the code or math?
public bool DataHasOutliers(IEnumerable<double> results, Limits limits)
{
foreach (double result in results)
{
//detect if any result values are in the low or high regions of the acceptable limits
double deltaAbsolute = (limits.High - limits.Low) < 0 ? (limits.High - limits.Low) * -1 : limits.High - limits.Low;
double absoluteResult = result < 0 ? result * -1 : result;
double lowLimitAbsolute = limits.Low < 0 ? limits.Low * -1 : limits.Low;
double upperThreshold = 0.9 * deltaAbsolute + limits.Low;
double lowerThreshold = 0.1 * deltaAbsolute + limits.Low;
if (absoluteResult >= upperThreshold)
{
"".Dump("Upper threshold violated");
return true;
}
if (absoluteResult <= lowerThreshold)
{
"".Dump("Lower threshold violated");
return true;
}
}
return false;
}
public class Limits
{
public double High { get; set; }
public double Low { get; set; }
public string Error { get; set; }
}

If the limits are [-10, 0] and a result is -5, the code as posted compares the absolute value 5 against thresholds of -9 and -1, so the upper check fires even though -5 sits in the middle of the range.
I suggest keeping the sign of the boundaries, moving each one inward by one tenth of the range, and then checking the original (signed) result value against this reduced range:
double deltaAbsolute = Math.Abs(limits.High - limits.Low);
double lowerThreshold = limits.Low + 0.1 * deltaAbsolute;
double upperThreshold = limits.High - 0.1 * deltaAbsolute;
if (result >= upperThreshold)
{
"".Dump("Upper threshold violated");
return true;
}
if (result <= lowerThreshold)
{
"".Dump("Lower threshold violated");
return true;
}

Why are you recalculating the limits inside the loop? They don't change between iterations.
If High >= Low then you don't need any of the absolute-value handling.
public bool DataHasOutliers(IEnumerable<double> results, Limits limits)
{
if(limits.High < limits.Low)
{
throw new ArgumentOutOfRangeException();
}
double delta = limits.High - limits.Low;
double upperThreshold = 0.9 * delta + limits.Low;
double lowerThreshold = 0.1 * delta + limits.Low;
foreach (double result in results)
{
//detect if any result values are in the low or high regions of the acceptable limits
if (result >= upperThreshold)
{
"".Dump("Upper threshold violated");
return true;
}
if (result <= lowerThreshold)
{
"".Dump("Lower threshold violated");
return true;
}
}
return false;
}
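As a quick check of the [-10, 0] case, here is a minimal sketch of the same threshold logic written with LINQ. The OutlierCheck class and the margin parameter are names made up for this illustration; they are not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OutlierCheck
{
    // Returns true if any value lies in the bottom or top `margin` fraction
    // of the [low, high] range (or outside it). `margin` is an assumed
    // parameter; 0.1 reproduces the 10 % / 90 % rule from the question.
    public static bool HasOutliers(IEnumerable<double> results, double low, double high, double margin = 0.1)
    {
        if (high < low)
            throw new ArgumentOutOfRangeException(nameof(high), "high must not be less than low");

        double delta = high - low;                      // non-negative, so no Math.Abs needed
        double lowerThreshold = low + margin * delta;
        double upperThreshold = high - margin * delta;

        // Compare the signed value directly; taking its absolute value is
        // what breaks the check for negative ranges.
        return results.Any(r => r <= lowerThreshold || r >= upperThreshold);
    }
}
```

With limits [-10, 0], a result of -5 is accepted, while -9.5 and -0.5 are flagged.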

Linear regression in a list with linq

I have a list of 'steps' that form a series of ramps. Each step has a start value, an end value and a duration. Here is an example plot:
It is guaranteed that the start value of a subsequent step is equal to the end value of the previous one. It's a monotonic function.
Now I need to get the value at a given time. I already have a working implementation using good old foreach, but I wonder if there is some clever way to do it with LINQ. Perhaps someone has an idea for replacing the GetValueAt function?
class Program
{
class Step
{
public double From { get; set; }
public double To { get; set; }
public int Duration { get; set; }
}
static void Main(string[] args)
{
var steps = new List<Step>
{
new Step { From = 0, To = 10, Duration = 20},
new Step { From = 10, To = 12, Duration = 10},
};
const double doubleTolerance = 0.001;
// test turning points
Debug.Assert(Math.Abs(GetValueAt(steps, 0) - 0) < doubleTolerance);
Debug.Assert(Math.Abs(GetValueAt(steps, 20) - 10) < doubleTolerance);
Debug.Assert(Math.Abs(GetValueAt(steps, 30) - 12) < doubleTolerance);
// test linear interpolation
Debug.Assert(Math.Abs(GetValueAt(steps, 10) - 5) < doubleTolerance);
Debug.Assert(Math.Abs(GetValueAt(steps, 25) - 11) < doubleTolerance);
}
static double GetValueAt(IList<Step> steps, int seconds)
{
// guard statements if seconds is within steps omitted here
var runningTime = steps.First().Duration;
var runningSeconds = seconds;
foreach (var step in steps)
{
if (seconds <= runningTime)
{
var x1 = 0; // stepStartTime
var x2 = step.Duration; // stepEndTime
var y1 = step.From; // stepStartValue
var y2 = step.To; // stepEndValue
var x = runningSeconds;
// linear interpolation
return y1 + (y2 - y1) / (x2 - x1) * (x - x1);
}
runningTime += step.Duration;
runningSeconds -= step.Duration;
}
return double.NaN;
}
}
You could try Aggregate:
static double GetValueAt(IList<Step> steps, int seconds)
{
var (value, remaining) = steps.Aggregate(
(Value: 0d, RemainingSeconds: seconds),
(secs, step) =>
{
if (secs.RemainingSeconds > step.Duration)
{
return (step.To, secs.RemainingSeconds - step.Duration);
}
else
{
return (secs.Value + ((step.To - step.From) / step.Duration) * secs.RemainingSeconds, 0);
}
});
return remaining > 0 ? double.NaN : value;
}
Let's ignore LINQ for a moment...
For a small number of steps, your foreach approach is quite effective. Also, if you can arrange for ordered sequential access instead of random access, you can optimize how you find the step needed to calculate the value: think of an iterator that only moves forward when the requested point is not on the current step.
If the number of steps becomes larger and you need to access the values in random order, you might want to introduce a balanced tree structure for finding the right step element.
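For the random-access case, a balanced tree is one option; a sorted array of cumulative end times searched with Array.BinarySearch gives the same O(log n) lookup with less machinery. The following is only a sketch: the Ramp class and its members are hypothetical names, and it assumes every duration is positive.

```csharp
using System;
using System.Collections.Generic;

public sealed class Ramp
{
    private readonly double[] from;
    private readonly double[] to;
    private readonly int[] endTimes;   // cumulative end time of each step

    // Each step is (From, To, Duration); durations are assumed to be > 0.
    public Ramp(IReadOnlyList<(double From, double To, int Duration)> steps)
    {
        from = new double[steps.Count];
        to = new double[steps.Count];
        endTimes = new int[steps.Count];
        int elapsed = 0;
        for (int i = 0; i < steps.Count; i++)
        {
            elapsed += steps[i].Duration;
            from[i] = steps[i].From;
            to[i] = steps[i].To;
            endTimes[i] = elapsed;
        }
    }

    public double GetValueAt(int seconds)
    {
        if (endTimes.Length == 0 || seconds < 0 || seconds > endTimes[endTimes.Length - 1])
            return double.NaN;                          // outside the ramp series

        // BinarySearch returns the index of an exact match, or the bitwise
        // complement of the index of the first larger element.
        int index = Array.BinarySearch(endTimes, seconds);
        if (index < 0) index = ~index;

        int startTime = index == 0 ? 0 : endTimes[index - 1];
        int duration = endTimes[index] - startTime;

        // Linear interpolation within the step that contains `seconds`.
        return from[index] + (to[index] - from[index]) * (seconds - startTime) / duration;
    }
}
```

The constructor does the O(n) accumulation once, so repeated lookups in any order only pay for the binary search.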

Why does this sin(x) function in C# return NaN instead of a number?

I have this function written in C# to calculate sin(x). But when I try x = 3.14, the printed result of sin x is NaN (not a number),
but when debugging, it is very near to 0.001592653.
The value is neither too big nor too small. So how could NaN appear here?
static double pow(double x, int mu)
{
if (mu == 0)
return 1;
if (mu == 1)
return x;
return x * pow(x, mu - 1);
}
static double fact(int n)
{
if (n == 1 || n == 0)
return 1;
return n * fact(n - 1);
}
static double sin(double x)
{
var s = x;
for (int i = 1; i < 1000; i++)
{
s += pow(-1, i) * pow(x, 2 * i + 1) / fact(2 * i + 1);
}
return s;
}
public static void Main(String[] param)
{
try
{
while (true)
{
Console.WriteLine("Enter x value: ");
double x = double.Parse(Console.ReadLine());
var sinX = sin(x);
Console.WriteLine("Sin of {0} is {1}: " , x , sinX);
Console.ReadLine();
}
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
}
It fails because both pow(x, 2 * i + 1) and fact(2 * i + 1) eventually return Infinity, and Infinity / Infinity is NaN.
In my case, it's when x = 4, i = 256.
Note that pow(x, 2 * i + 1) = 4 ^ (2 * 256 + 1) = 4 ^ 513 ≈ 7.19 × 10^308, a stupidly large number which is just over the max value of a double, approximately 1.79769313486232 × 10^308.
Also note that fact(2 * i + 1) = 513!, an even more ridiculously large number, which is more than 10^1000 times larger than the estimated number of atoms in the observable universe.
You might be interested in just using Math.Sin(x).
When x == 3.14 and i == 314 then you get Infinity:
?pow(-1, 314)
1.0
?pow(x, 2 * 314 + 1)
Infinity
? fact(2 * 314 + 1)
Infinity
The problem here is an understanding of the floating-point representation of 'real' numbers.
A double allows a large range of values but has a precision of only 15 to 17 significant decimal digits.
In this example we are calculating a value between -1 and 1.
We calculate the value of the sin function using its series expansion, which is basically a sum of terms. In that expansion the terms eventually become smaller and smaller as we go along.
When the terms have reached a magnitude of less than 1e-17, adding them to what is already there will not make any difference. This is because a double has only a 52-bit mantissa (53 significant bits), which is used up by the time we get to a term smaller than 1e-17.
So instead of doing a constant 1000 loops you should do something like this:
static double sin(double x)
{
var s = x;
for (int i = 1; i < 1000; i++)
{
var term = pow(x, 2 * i + 1) / fact(2 * i + 1);
if (Math.Abs(term) < 1e-17) // compare the magnitude; term is negative when x is negative
break;
s += pow(-1, i) * term;
}
return s;
}
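A variation on the same idea removes the overflow at its source: derive each term from the previous one instead of calling pow and fact, so no huge intermediate value is ever formed. This is a sketch rather than a replacement for Math.Sin; the Series class name and the argument reduction via Math.IEEERemainder are my own additions.

```csharp
using System;

public static class Series
{
    // Taylor series for sin(x). Each term is obtained from the previous one:
    //   term_i = -term_{i-1} * x^2 / ((2i) * (2i + 1)),
    // so neither x^n nor n! is computed on its own and nothing overflows.
    public static double Sin(double x)
    {
        // Reduce the argument to roughly [-pi, pi] so the series converges quickly.
        x = Math.IEEERemainder(x, 2 * Math.PI);

        double term = x;   // first term: x^1 / 1!
        double sum = x;
        for (int i = 1; i < 1000; i++)
        {
            term *= -x * x / ((2.0 * i) * (2.0 * i + 1));
            if (Math.Abs(term) < 1e-17)
                break;     // further terms are too small to change the sum
            sum += term;
        }
        return sum;
    }
}
```

For x = 3.14 the loop stops after a few dozen iterations and the result agrees with Math.Sin to well within the tolerance used for the other checks on this page.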

Ideas wanted for analyzing near-realtime data over specific intervals with memory/cpu efficiency

I have some environmental sensors and I want to detect sudden changes in temperature, and slow trends over time... however I'd like to do most of the math based on what's in memory with parameters that may look like this: (subject to change)
(note: Items in parens are computed in realtime as data is added)
5 minute (derivative, max, min, avg) + 36 datapoints for most current 3 hours
10 minute (derivative, max, min, avg) + 0 datapoints, calc is based on 5min sample
30 minute (derivative, max, min, avg) + 0 datapoints, calc is based on 5 min sample
Hourly (derivative, max, min, avg) + 24 datapoints for most current 1 day
Daily (derivative, max,min,avg) + 32 datapoints for most current month
Monthly (derivative, max,min,avg) + 12 datapoints for past year
Each datapoint is a two byte float. So each sensor will consume up to 104 floats (36 + 24 + 32 + 12), plus the 24 calculated variables. I'd like to support as many sensors as the .NET embedded device will permit.
Since I'm using an embedded device for this project, my memory is constrained and so is my IO and CPU power.
How would you go about implementing this in .NET? So far, I've created a couple of structs and called it a "TrackableFloat" where the insertion of a value causes the old one to drop off the array and a recalculation is done.
The only thing that makes this more complicated is that if a sensor does not report back data, that datapoint needs to be excluded/ignored from all subsequent realtime calculations.
When all is said and done, if any of the values (derivative, max, min, avg) reaches a predefined setting, then a .NET event fires.
I think someone out there will think this is an interesting problem, and would love to hear how they would approach implementing it.
Would you use a Class or a Struct?
How would you trigger the calculations? (Events most likely)
How would the alerts be triggered?
How would you store the data, in tiers?
Is there a library that already does something like this? (maybe that should be my first question )
How would you efficiently calculate the derivative?
Here is my first crack at this, and it doesn't completely hit the spec, but is very efficient. Would be interested in hearing your thoughts.
enum UnitToTrackEnum
{
Minute,
FiveMinute,
TenMinute,
FifteenMinute,
Hour,
Day,
Week,
Month,
unknown
}
class TrackableFloat
{
object Gate = new object();
UnitToTrackEnum trackingMode = UnitToTrackEnum.unknown;
int ValidFloats = 0;
float[] FloatsToTrack;
public TrackableFloat(int HistoryLength, UnitToTrackEnum unitToTrack)
{
if (unitToTrack == UnitToTrackEnum.unknown)
throw new InvalidOperationException("You must not have an unknown measure of time to track.");
FloatsToTrack = new float[HistoryLength];
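// mark every slot as "no data yet"; a foreach variable cannot be used as an index
for (int i = 0; i < FloatsToTrack.Length; i++)
{
FloatsToTrack[i] = float.MaxValue;
}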
trackingMode = unitToTrack;
Min = float.MaxValue;
Max = float.MinValue;
Sum = 0;
}
public void Add(DateTime dt, float value)
{
int RoundedDTUnit = 0;
switch (trackingMode)
{
case UnitToTrackEnum.Minute:
{
RoundedDTUnit = dt.Minute;
break;
}
case UnitToTrackEnum.FiveMinute:
{
RoundedDTUnit = System.Math.Abs(dt.Minute / 5);
break;
}
case UnitToTrackEnum.TenMinute:
{
RoundedDTUnit = System.Math.Abs(dt.Minute / 10);
break;
}
case UnitToTrackEnum.FifteenMinute:
{
RoundedDTUnit = System.Math.Abs(dt.Minute / 15);
break;
}
case UnitToTrackEnum.Hour:
{
RoundedDTUnit = dt.Hour;
break;
}
case UnitToTrackEnum.Day:
{
RoundedDTUnit = dt.Day;
break;
}
case UnitToTrackEnum.Week:
{
//RoundedDTUnit = System.Math.Abs( );
break;
}
case UnitToTrackEnum.Month:
{
RoundedDTUnit = dt.Month;
break;
}
case UnitToTrackEnum.unknown:
{
throw new InvalidOperationException("You must not have an unknown measure of time to track.");
}
default:
break;
}
bool DoRefreshMaxMin = false;
if (RoundedDTUnit < FloatsToTrack.Length) // index must fall inside the array
{
if (value == float.MaxValue || value == float.MinValue)
{
// If invalid data...
lock (Gate)
{
// Get rid of old data...
var OldValue = FloatsToTrack[RoundedDTUnit];
if (OldValue != float.MaxValue && OldValue != float.MinValue) // old slot held real data
{
Sum -= OldValue;
ValidFloats--;
if (OldValue == Max || OldValue == Min)
DoRefreshMaxMin = true;
}
// Save new data
FloatsToTrack[RoundedDTUnit] = value;
}
}
else
{
lock (Gate)
{
// Get rid of old data...
var OldValue = FloatsToTrack[RoundedDTUnit];
if (OldValue != float.MaxValue && OldValue != float.MinValue) // old slot held real data
{
Sum -= OldValue;
ValidFloats--;
}
// Save new data
FloatsToTrack[RoundedDTUnit] = value;
Sum += value;
ValidFloats++;
if (value < Min)
Min = value;
if (value > Max)
Max = value;
if (OldValue == Max || OldValue == Min)
DoRefreshMaxMin = true;
}
}
// Function is placed here to avoid a deadlock
if (DoRefreshMaxMin == true)
RefreshMaxMin();
}
else
{
throw new IndexOutOfRangeException("Index " + RoundedDTUnit + " is out of range for tracking mode: " + trackingMode.ToString());
}
}
public float Sum { get; set; }
public float Average
{
get
{
if (ValidFloats > 0)
return Sum / ValidFloats;
else
return float.MaxValue;
}
}
public float Min { get; set; }
public float Max { get; set; }
public float Derivative { get; set; }
public void RefreshCounters()
{
lock (Gate)
{
float sum = 0;
ValidFloats = 0;
Min = float.MaxValue;
Max = float.MinValue;
foreach (var i in FloatsToTrack)
{
if (i != float.MaxValue && i != float.MinValue) // skip sentinel (missing) values
{
if (Min == float.MaxValue)
{
Min = i;
Max = i;
}
sum += i;
ValidFloats++;
if (i < Min)
Min = i;
if (i > Max)
Max = i;
}
}
Sum = sum;
}
}
public void RefreshMaxMin()
{
if (ValidFloats > 0)
{
Min = float.MaxValue;
Max = float.MinValue;
lock (Gate)
{
foreach (var i in FloatsToTrack)
{
if (i != float.MaxValue && i != float.MinValue) // skip sentinel (missing) values
{
if (i < Min)
Min = i;
if (i > Max)
Max = i;
}
}
}
}
}
}
You should consider looking at a CEP library like Nesper.
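If a full CEP library is too heavy for the device, one tier of the scheme in the question can be sketched as a fixed-size ring buffer with a running sum. This is an illustration under assumptions, not a finished design: RollingWindow is a hypothetical name, float.NaN marks a missing reading (the code above uses float.MaxValue instead), and min/max are rescanned on demand rather than cached.

```csharp
using System;

// Fixed-size ring buffer that keeps a running sum and scans for min/max on
// demand. float.NaN marks a missing reading (an assumption of this sketch).
public sealed class RollingWindow
{
    private readonly float[] slots;
    private int next;          // index of the slot that will be overwritten next
    private int validCount;
    private double sum;        // double accumulator to limit rounding drift

    public RollingWindow(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        slots = new float[capacity];
        for (int i = 0; i < slots.Length; i++) slots[i] = float.NaN;
    }

    // Pass float.NaN when the sensor did not report.
    public void Add(float value)
    {
        float old = slots[next];
        if (!float.IsNaN(old)) { sum -= old; validCount--; }
        if (!float.IsNaN(value)) { sum += value; validCount++; }
        slots[next] = value;
        next = (next + 1) % slots.Length;
    }

    public int Count => validCount;

    public float Average => validCount > 0 ? (float)(sum / validCount) : float.NaN;

    public float Min
    {
        get
        {
            float min = float.NaN;
            foreach (float v in slots)
                if (!float.IsNaN(v) && (float.IsNaN(min) || v < min)) min = v;
            return min;
        }
    }

    public float Max
    {
        get
        {
            float max = float.NaN;
            foreach (float v in slots)
                if (!float.IsNaN(v) && (float.IsNaN(max) || v > max)) max = v;
            return max;
        }
    }
}
```

Slots are overwritten in arrival order, so no timestamp-to-index mapping is needed, and a missed reading simply occupies a slot without contributing to the sum or count. Coarser tiers (hourly, daily) could be fed from the average of the finer one.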

Formatting double to latitude/longitude human readable format

If the formula for converting latitude or longitude to double is
((Degree) + (Minute) / 60 + (Second) / 3600) * ((South || West) ? -1 : 1)
then what's the formula for parsing degrees, minutes, seconds from a double?
It'd make sense to have two separate methods for parsing latitude and longitude, but I'm not sure how to parse the degrees, minutes, seconds from the double.
ParseLatitude(double value)
{
//value is South if negative, else is North.
}
ParseLongitude(double value)
{
//value is West if negative, else is East.
}
Example coordinates:
latitude: 43.81234123
longitude: -119.8374747
The final code to convert back and forth, thanks again to Peter and James for the answer. (I had to convert value to Decimal because this is being used in Silverlight, where Math.Truncate(double) is not available.)
public class Coordinate
{
public double Degrees { get; set; }
public double Minutes { get; set; }
public double Seconds { get; set; }
public CoordinatesPosition Position { get; set; }
public Coordinate() { }
public Coordinate(double value, CoordinatesPosition position)
{
//sanity
if (value < 0 && position == CoordinatesPosition.N)
position = CoordinatesPosition.S;
//sanity
if (value < 0 && position == CoordinatesPosition.E)
position = CoordinatesPosition.W;
//sanity
if (value > 0 && position == CoordinatesPosition.S)
position = CoordinatesPosition.N;
//sanity
if (value > 0 && position == CoordinatesPosition.W)
position = CoordinatesPosition.E;
var decimalValue = Convert.ToDecimal(value);
decimalValue = Math.Abs(decimalValue);
var degrees = Decimal.Truncate(decimalValue);
decimalValue = (decimalValue - degrees) * 60;
var minutes = Decimal.Truncate(decimalValue);
var seconds = (decimalValue - minutes) * 60;
Degrees = Convert.ToDouble(degrees);
Minutes = Convert.ToDouble(minutes);
Seconds = Convert.ToDouble(seconds);
Position = position;
}
public Coordinate(double degrees, double minutes, double seconds, CoordinatesPosition position)
{
Degrees = degrees;
Minutes = minutes;
Seconds = seconds;
Position = position;
}
public double ToDouble()
{
var result = (Degrees) + (Minutes) / 60 + (Seconds) / 3600;
return Position == CoordinatesPosition.W || Position == CoordinatesPosition.S ? -result : result;
}
public override string ToString()
{
return Degrees + "° " + Minutes + "' " + Seconds + "'' " + Position;
}
}
public enum CoordinatesPosition
{
N, E, S, W
}
Unit Test (nUnit)
[TestFixture]
public class CoordinateTests
{
[Test]
public void ShouldConvertDoubleToCoordinateAndBackToDouble()
{
const double baseLatitude = 43.81234123;
const double baseLongitude = -119.8374747;
var latCoordN = new Coordinate(baseLatitude, CoordinatesPosition.N);
var latCoordS = new Coordinate(baseLatitude, CoordinatesPosition.S);
var lonCoordE = new Coordinate(baseLongitude, CoordinatesPosition.E);
var lonCoordW = new Coordinate(baseLongitude, CoordinatesPosition.W);
var convertedLatitudeS = latCoordS.ToDouble();
var convertedLatitudeN = latCoordN.ToDouble();
var convertedLongitudeW = lonCoordW.ToDouble();
var convertedLongitudeE = lonCoordE.ToDouble();
Assert.AreEqual(convertedLatitudeS, convertedLatitudeN);
Assert.AreEqual(baseLatitude, convertedLatitudeN);
Assert.AreEqual(convertedLongitudeE, convertedLongitudeW);
Assert.AreEqual(baseLongitude, convertedLongitudeE);
}
}
ParseLatitude(double Value)
{
var direction = Value < 0 ? Direction.South : Direction.North;
Value = Math.Abs(Value);
var degrees = Math.Truncate(Value);
Value = (Value - degrees) * 60; //not Value = (Value - degrees) / 60;
var minutes = Math.Truncate(Value);
var seconds = (Value - minutes) * 60; //not Value = (Value - degrees) / 60;
//...
}
ParseLongitude(double Value)
{
var direction = Value < 0 ? Direction.West : Direction.East;
Value = Math.Abs(Value);
var degrees = Math.Truncate(Value);
Value = (Value - degrees) * 60; //not Value = (Value - degrees) / 60;
var minutes = Math.Truncate(Value);
var seconds = (Value - minutes) * 60; //not Value = (Value - degrees) / 60;
//...
}
EDIT
I came back to this because of a recent upvote. Here's a DRY-er version, with the Value parameter renamed to reflect the most common coding convention, in which parameters start with lower-case letters:
ParseLatitude(double value)
{
var direction = value < 0 ? Direction.South : Direction.North;
return ParseLatitudeOrLongitude(value, direction);
}
ParseLongitude(double value)
{
var direction = value < 0 ? Direction.West : Direction.East;
return ParseLatitudeOrLongitude(value, direction);
}
//This must be a private method because it requires the caller to ensure
//that the direction parameter is correct.
ParseLatitudeOrLongitude(double value, Direction direction)
{
value = Math.Abs(value);
var degrees = Math.Truncate(value);
value = (value - degrees) * 60; //not Value = (Value - degrees) / 60;
var minutes = Math.Truncate(value);
var seconds = (value - minutes) * 60; //not Value = (Value - degrees) / 60;
//...
}
#include <math.h>
void ParseLatitude(double Value, bool &north, double &deg, double &min, double &sec)
{
if ( Value < 0 )
{
ParseLatitude( -Value, north, deg, min, sec );
north = false;
}
else
{
north = true;
deg = floor(Value);
Value = 60*(Value - deg);
min = floor(Value);
Value = 60*(Value - min);
sec = Value;
}
}
// ParseLongitude is similar
I've written a class in C# which does a lot of this. Perhaps it is useful, otherwise you can check out the implementation:
http://code.google.com/p/exif-utils/source/browse/trunk/ExifUtils/ExifUtils/GpsCoordinate.cs
In addition to parsing out the degree, minutes, seconds (which is just radix 60 arithmetic), you may also want to deal with sign of the doubles being converted to "North/South" for latitude and "East/West" for longitude.
It's pretty standard to identify positive degrees latitude with the Northern Hemisphere and negative degrees latitude with the Southern Hemisphere. It's also common here in the Western Hemisphere to take positive degrees longitude to mean degrees West of the Greenwich Meridian and conversely negative degrees longitude to mean degrees East of that Meridian. However, the preferred convention is the opposite: degrees East of the Greenwich Meridian are positive and degrees West are negative. You may want to consult with your client or analyze the application design to determine which choice applies to this conversion.
Note also that the discontinuity of longitude at ±180 is a cause for care in converting coordinates that may result from calculations. If the conversion is not intended to handle wrap-around at the 180° meridian, then it's likely an exception should be thrown for such inputs. Of course the design decision should be documented either way.
Certainly latitudes outside the ±90° range are errors on input.
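The range checks described above might be sketched in C# as follows. GeoRange and both method names are hypothetical, and wrapping longitude instead of throwing is just one of the two design choices mentioned; pick whichever your application documents.

```csharp
using System;

public static class GeoRange
{
    // Throws for latitudes outside [-90, 90]; returns the value unchanged otherwise.
    public static double ValidateLatitude(double latitude)
    {
        if (double.IsNaN(latitude) || latitude < -90.0 || latitude > 90.0)
            throw new ArgumentOutOfRangeException(nameof(latitude), "Latitude must be within [-90, 90].");
        return latitude;
    }

    // Wraps any finite longitude into [-180, 180), e.g. 190 -> -170.
    public static double NormalizeLongitude(double longitude)
    {
        if (double.IsNaN(longitude) || double.IsInfinity(longitude))
            throw new ArgumentOutOfRangeException(nameof(longitude), "Longitude must be finite.");

        double wrapped = (longitude + 180.0) % 360.0;   // C# % keeps the sign of the dividend
        if (wrapped < 0) wrapped += 360.0;
        return wrapped - 180.0;
    }
}
```

Calling these at the top of ParseLatitude and ParseLongitude keeps the degrees/minutes/seconds arithmetic itself free of range concerns.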
Added: Given the above differences in parsing latitude and longitude, issues that would best be handled in the distinct ParseLatitude & ParseLongitude routines, we could use a common utility to do the conversion from double to degrees/minutes/seconds.
I'm not sure what the target language should be here, so I wrote something in plain vanilla C:
#include <math.h>
void double2DegMinSec(double angle, int *Sign, int *Deg, int *Min, double *Sec)
{ /* extract radix 60 Degrees/Minutes/Seconds from "angle" */
*Sign = 1; /* write through the pointer, not to the pointer itself */
if (angle < 0.0) /* reduce to case of nonnegative angle */
{
*Sign = -*Sign;
angle = -angle;
}
*Deg = floor(angle);
angle -= *Deg;
angle *= 60.0;
*Min = floor(angle);
angle -= *Min;
angle *= 60.0;
*Sec = angle;
return;
}
Likely ParseLatitude and ParseLongitude should manage the conversion of angle's sign to the appropriate geographic designation, but I've included an argument Sign that will allow that sign checking to be done after conversion (though it would be fine if the conversion were only ever called with nonnegative angles).
I made the function double2DegMinSec have a return type of void. Results are thus to be returned through its formal arguments of type pointer to int and pointer to double (in the case of seconds Sec, which might have fractional part).
Calling the conversion in C might be done like this:
double longitude = -119.8374747;
int Sign, Degrees, Minutes;
double Seconds;
double2DegMinSec(longitude, &Sign, &Degrees, &Minutes, &Seconds);
In C++ we would make the calling syntax a bit neater by using call-by-reference instead of pointers.
If you only multiply, you'll get conversion errors without noticing them; I noticed them when plotting the points on a map. You'll need to take into account the slope and other variables, like this:
public static void GeoToMercator(double xIn, double yIn, out double xOut, out double yOut)
{
double xArg = xIn / 100000, yArg = yIn / 100000;
xArg = 6371000.0 * Math.PI / 180 * xArg;
yArg = 6371000.0 * Math.Log(Math.Tan(Math.PI / 4 + Math.PI / 180 * yArg * 0.5));
xOut = xArg / 10000;
yOut = yArg / 10000;
}
I'm guessing you're using Mercator values as double representation. To convert the Mercator value back into the correct longitude/latitude values, just use the reverse:
public static void MercatorToGeo(double xIn, double yIn, out double xOut, out double yOut)
{
double xArg = xIn, yArg = yIn;
xArg = 180 / Math.PI * xArg / 6371000.0;
yArg = 180 / Math.PI * (Math.Atan(Math.Exp(yArg / 6371000.0)) - Math.PI / 4) / 0.5;
xOut = xArg * 10;
yOut = yArg * 10;
}
This did the trick for me.

How do I determine the standard deviation (stddev) of a set of values?

I need to know if a number compared to a set of numbers is outside of 1 stddev from the mean, etc..
While the sum of squares algorithm works fine most of the time, it can cause big trouble if you are dealing with very large numbers. You basically may end up with a negative variance...
Plus, never, ever compute a^2 as pow(a,2); a * a is almost certainly faster.
By far the best way of computing a standard deviation is Welford's method. My C# is very rusty, but it could look something like:
public static double StandardDeviation(List<double> valueList)
{
double M = 0.0;
double S = 0.0;
int k = 1;
foreach (double value in valueList)
{
double tmpM = M;
M += (value - tmpM) / k;
S += (value - tmpM) * (value - M);
k++;
}
return Math.Sqrt(S / (k-2));
}
If you have the whole population (as opposed to a sample population), then use return Math.Sqrt(S / (k-1));.
EDIT: I've updated the code according to Jason's remarks...
EDIT: I've also updated the code according to Alex's remarks...
A solution 10 times faster than Jaime's, but be aware that, as Jaime pointed out:
"While the sum of squares algorithm works fine most of the time, it can cause big trouble if you are dealing with very large numbers. You basically may end up with a negative variance"
If you think you are dealing with very large numbers or a very large quantity of numbers, calculate using both methods; if the results agree, you can use "my" method for your case.
public static double StandardDeviation(double[] data)
{
double stdDev = 0;
double sumAll = 0;
double sumAllQ = 0;
//Sum of x and sum of x²
for (int i = 0; i < data.Length; i++)
{
double x = data[i];
sumAll += x;
sumAllQ += x * x;
}
//Mean (not used here)
//double mean = 0;
//mean = sumAll / (double)data.Length;
//Standard deviation
stdDev = System.Math.Sqrt(
(sumAllQ -
(sumAll * sumAll) / data.Length) *
(1.0d / (data.Length - 1))
);
return stdDev;
}
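To make the "calculate using both methods" advice concrete, here is a small side-by-side sketch. VarianceComparison and its method names are hypothetical; both methods return the sample standard deviation.

```csharp
using System;

public static class VarianceComparison
{
    // Sample standard deviation via running sums of x and x^2.
    public static double SumOfSquares(double[] data)
    {
        double sum = 0, sumSq = 0;
        foreach (double x in data) { sum += x; sumSq += x * x; }
        double variance = (sumSq - sum * sum / data.Length) / (data.Length - 1);
        return Math.Sqrt(variance);   // NaN if rounding made the variance negative
    }

    // Sample standard deviation via Welford's update.
    public static double Welford(double[] data)
    {
        double mean = 0, s = 0;
        int k = 0;
        foreach (double x in data)
        {
            k++;
            double delta = x - mean;
            mean += delta / k;
            s += delta * (x - mean);
        }
        return Math.Sqrt(s / (k - 1));
    }
}
```

On small, well-scaled data such as { 2, 4, 4, 4, 5, 5, 7, 9 } the two agree. On data with a large offset, for example { 1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16 }, the squares are around 1e18 and no longer fit exactly in a double, so the sum-of-squares result can drift away from the true value of sqrt(30), while Welford's update stays accurate.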
The accepted answer by Jaime is great, except you need to divide by k-2 in the last line (you need to divide by "number_of_elements-1").
Better yet, start k at 0:
public static double StandardDeviation(List<double> valueList)
{
double M = 0.0;
double S = 0.0;
int k = 0;
foreach (double value in valueList)
{
k++;
double tmpM = M;
M += (value - tmpM) / k;
S += (value - tmpM) * (value - M);
}
return Math.Sqrt(S / (k-1));
}
The Math.NET library provides this for you out of the box.
PM> Install-Package MathNet.Numerics
using MathNet.Numerics.Statistics;
var populationStdDev = new List<double> { 1d, 2d, 3d, 4d, 5d }.PopulationStandardDeviation();
var sampleStdDev = new List<double> { 2d, 3d, 4d }.StandardDeviation();
See PopulationStandardDeviation for more information.
Code snippet:
public static double StandardDeviation(List<double> valueList)
{
if (valueList.Count < 2) return 0.0;
double sumOfSquares = 0.0;
double average = valueList.Average(); // LINQ, requires .NET 3.5 or later
foreach (double value in valueList)
{
sumOfSquares += Math.Pow((value - average), 2);
}
return Math.Sqrt(sumOfSquares / (valueList.Count - 1));
}
You can avoid making two passes over the data by accumulating the mean and mean-square
cnt = 0
mean = 0
meansqr = 0
loop over array
cnt++
mean += value
meansqr += value*value
mean /= cnt
meansqr /= cnt
and forming
sigma = sqrt(meansqr - mean^2)
A factor of cnt/(cnt-1) is often appropriate as well.
BTW, the first pass over the data in Demi's and McWafflestix's answers is hidden in the calls to Average. That kind of thing is certainly trivial on a small list, but if the list exceeds the size of the cache, or even the working set, this gets to be a big deal.
I found that Rob's helpful answer didn't quite match what I was seeing in Excel. To match Excel, I passed the average for valueList in to the StandardDeviation calculation.
Here is my two cents... Clearly you could calculate the moving average (ma) from valueList inside the function, but I happened to have it already before needing the standard deviation.
public double StandardDeviation(List<double> valueList, double ma)
{
double xMinusMovAvg = 0.0;
double Sigma = 0.0;
int k = valueList.Count;
foreach (double value in valueList){
xMinusMovAvg = value - ma;
Sigma = Sigma + (xMinusMovAvg * xMinusMovAvg);
}
return Math.Sqrt(Sigma / (k - 1));
}
With Extension methods.
using System;
using System.Collections.Generic;
namespace SampleApp
{
internal class Program
{
private static void Main()
{
List<double> data = new List<double> {1, 2, 3, 4, 5, 6};
double mean = data.Mean();
double variance = data.Variance();
double sd = data.StandardDeviation();
Console.WriteLine("Mean: {0}, Variance: {1}, SD: {2}", mean, variance, sd);
Console.WriteLine("Press any key to continue...");
Console.ReadKey();
}
}
public static class MyListExtensions
{
public static double Mean(this List<double> values)
{
return values.Count == 0 ? 0 : values.Mean(0, values.Count);
}
public static double Mean(this List<double> values, int start, int end)
{
double s = 0;
for (int i = start; i < end; i++)
{
s += values[i];
}
return s / (end - start);
}
public static double Variance(this List<double> values)
{
return values.Variance(values.Mean(), 0, values.Count);
}
public static double Variance(this List<double> values, double mean)
{
return values.Variance(mean, 0, values.Count);
}
public static double Variance(this List<double> values, double mean, int start, int end)
{
double variance = 0;
for (int i = start; i < end; i++)
{
variance += Math.Pow((values[i] - mean), 2);
}
int n = end - start;
if (start > 0) n -= 1;
return variance / (n);
}
public static double StandardDeviation(this List<double> values)
{
return values.Count == 0 ? 0 : values.StandardDeviation(0, values.Count);
}
public static double StandardDeviation(this List<double> values, int start, int end)
{
double mean = values.Mean(start, end);
double variance = values.Variance(mean, start, end);
return Math.Sqrt(variance);
}
}
}
/// <summary>
/// Calculates standard deviation, same as MATLAB std(X,0) function
/// <seealso cref="http://www.mathworks.co.uk/help/techdoc/ref/std.html"/>
/// </summary>
/// <param name="values">enumerable data</param>
/// <returns>Standard deviation</returns>
public static double GetStandardDeviation(this IEnumerable<double> values)
{
//validation
if (values == null)
throw new ArgumentNullException("values");
int length = 0;
double sum = 0.0, sum2 = 0.0;
//single pass; calling ElementAt(i) in a loop is O(n^2) on a general IEnumerable
foreach (double item in values)
{
sum += item;
sum2 += item * item;
length++;
}
//avoids division by zero
if (length < 2)
return 0;
return Math.Sqrt((sum2 - sum * sum / length) / (length - 1));
}
The trouble with all the other answers is that they assume you have your data in a big array. If your data is coming in on the fly, this would be a better approach. This class works regardless of how or if you store your data. It also gives you the choice of Welford's method (spelled "Waldorf" in the identifiers below) or the sum-of-squares method. Both methods work using a single pass.
public final class StatMeasure {
private StatMeasure() {}
public interface Stats1D {
/** Add a value to the population */
void addValue(double value);
/** Get the mean of all the added values */
double getMean();
/** Get the standard deviation from a sample of the population. */
double getStDevSample();
/** Gets the standard deviation for the entire population. */
double getStDevPopulation();
}
private static class WaldorfPopulation implements Stats1D {
private double mean = 0.0;
private double sSum = 0.0;
private int count = 0;
@Override
public void addValue(double value) {
double tmpMean = mean;
double delta = value - tmpMean;
mean += delta / ++count;
sSum += delta * (value - mean);
}
@Override
public double getMean() { return mean; }
@Override
public double getStDevSample() { return Math.sqrt(sSum / (count - 1)); }
@Override
public double getStDevPopulation() { return Math.sqrt(sSum / (count)); }
}
private static class StandardPopulation implements Stats1D {
private double sum = 0.0;
private double sumOfSquares = 0.0;
private int count = 0;
@Override
public void addValue(double value) {
sum += value;
sumOfSquares += value * value;
count++;
}
@Override
public double getMean() { return sum / count; }
@Override
public double getStDevSample() {
return Math.sqrt((sumOfSquares - ((sum * sum) / count)) / (count - 1));
}
@Override
public double getStDevPopulation() {
return Math.sqrt((sumOfSquares - ((sum * sum) / count)) / count);
}
}
/**
* Returns a way to measure a population of data using Welford's method.
* This method is better if your population or values are so large that
* the sum of x-squared may overflow. It's also probably faster if you
* need to recalculate the mean and standard deviation continuously,
* for example, if you are continually updating a graphic of the data as
* it flows in.
*
* @return A Stats1D object that uses Welford's method.
*/
public static Stats1D getWaldorfStats() { return new WaldorfPopulation(); }
/**
* Return a way to measure the population of data using the sum-of-squares
* method. This is probably faster than Welford's method, but runs the
* risk of data overflow.
*
* @return A Stats1D object that uses the sum-of-squares method
*/
public static Stats1D getSumOfSquaresStats() { return new StandardPopulation(); }
}
We may be able to use the statistics module in Python. It has stdev() and pstdev() functions to calculate the standard deviation of a sample and of a population, respectively.
details here: https://www.geeksforgeeks.org/python-statistics-stdev/
import statistics as st
print(st.pstdev(dataframe['column name']))
This is Population standard deviation
private double calculateStdDev(List<double> values)
{
double average = values.Average();
return Math.Sqrt((values.Select(val => (val - average) * (val - average)).Sum()) / values.Count);
}
For Sample standard deviation, just change [values.Count] to [values.Count -1] in above code.
Make sure you don't have only 1 data point in your set.
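Tying this back to the original question (is a number more than one standard deviation from the mean?), a minimal sketch could look like the following. Deviation.IsOutside and its k parameter are hypothetical names; it uses the sample standard deviation and refuses to run on fewer than two points.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Deviation
{
    // True if `candidate` lies more than `k` sample standard deviations from
    // the mean of `values`. Throws when fewer than two values are supplied,
    // because the sample standard deviation is undefined in that case.
    public static bool IsOutside(IReadOnlyCollection<double> values, double candidate, double k = 1.0)
    {
        if (values == null) throw new ArgumentNullException(nameof(values));
        if (values.Count < 2) throw new ArgumentException("At least two values are required.", nameof(values));

        double mean = values.Average();
        double sumSq = values.Sum(v => (v - mean) * (v - mean));
        double stdDev = Math.Sqrt(sumSq / (values.Count - 1));

        return Math.Abs(candidate - mean) > k * stdDev;
    }
}
```

For the set { 2, 4, 4, 4, 5, 5, 7, 9 } (mean 5, sample standard deviation about 2.14), 8 is outside one standard deviation and 6 is not.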
