I need to generate bins for the purposes of calculating a histogram. Language is C#. Basically I need to take in an array of decimal numbers and generate a histogram plot out of those.
Haven't been able to find a decent library to do this outright, so now I'm just looking for either a library or an algorithm to help me with binning the data.
So...
Are there any C# libraries out there that will take in an array of decimal data and output a binned histogram?
Is there a generic algorithm for building the bins to be used in generating a histogram?
Here is a simple bucket function I use. Sadly, .NET generics don't support a numeric type constraint (at least prior to .NET 7's generic math interfaces), so you will have to implement a different version of the following function for decimal, int, double, etc.
public static List<int> Bucketize(this IEnumerable<decimal> source, int totalBuckets)
{
    var min = source.Min();
    var max = source.Max();
    // Pre-fill with zeros: indexing into an empty List<int> would throw.
    var buckets = new List<int>(new int[totalBuckets]);
    var bucketSize = (max - min) / totalBuckets;
    foreach (var value in source)
    {
        int bucketIndex = 0;
        if (bucketSize > 0) // 0 rather than 0.0: a double literal doesn't compare against decimal
        {
            bucketIndex = (int)((value - min) / bucketSize);
            if (bucketIndex == totalBuckets)
            {
                bucketIndex--; // put the max value into the last bucket instead of one past it
            }
        }
        buckets[bucketIndex]++;
    }
    return buckets;
}
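For example, a hypothetical call (assuming the extension class is in scope):

decimal[] data = { 0.5m, 1.2m, 2.2m, 3.4m, 4.8m, 5.1m };
List<int> histogram = data.Bucketize(totalBuckets: 4);
// bucket width is (5.1 - 0.5) / 4 = 1.15; histogram[i] holds the count of values in bucket i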
I got odd results using @JakePearson's accepted answer. It comes down to an edge case.
Here is the code I used to test his method. I changed the extension method ever so slightly, returning an int[] and accepting double instead of decimal.
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
Random rand = new Random(1325165);
int maxValue = 100;
int numberOfBuckets = 100;
List<double> values = new List<double>();
for (int i = 0; i < 10000000; i++)
{
double value = rand.NextDouble() * (maxValue+1);
values.Add(value);
}
int[] bins = values.Bucketize(numberOfBuckets);
PointPairList points = new PointPairList();
for (int i = 0; i < numberOfBuckets; i++)
{
points.Add(i, bins[i]);
}
zedGraphControl1.GraphPane.AddBar("Random Points", points,Color.Black);
zedGraphControl1.GraphPane.YAxis.Title.Text = "Count";
zedGraphControl1.GraphPane.XAxis.Title.Text = "Value";
zedGraphControl1.AxisChange();
zedGraphControl1.Refresh();
}
}
public static class Extension
{
public static int[] Bucketize(this IEnumerable<double> source, int totalBuckets)
{
var min = source.Min();
var max = source.Max();
var buckets = new int[totalBuckets];
var bucketSize = (max - min) / totalBuckets;
foreach (var value in source)
{
int bucketIndex = 0;
if (bucketSize > 0.0)
{
bucketIndex = (int)((value - min) / bucketSize);
if (bucketIndex == totalBuckets)
{
bucketIndex--;
}
}
buckets[bucketIndex]++;
}
return buckets;
}
}
Everything works well when using 10,000,000 random double values between 0 and 100 (exclusive). Each bucket has roughly the same number of values, which makes sense given that Random returns a uniform distribution.
But when I changed the value generation line from
double value = rand.NextDouble() * (maxValue+1);
to
double value = rand.Next(0, maxValue + 1);
I got the following result, which double-counts the last bucket.
It appears that when a value is the same as one of the bucket boundaries, the code as written puts the value in the wrong bucket. This artifact doesn't show up with random double values because a random double is rarely exactly equal to a boundary, so the miscounting isn't obvious.
The way I corrected this is to define what side of the bucket boundary is inclusive vs. exclusive.
Think of
0 < x <= 1, 1 < x <= 2, ..., 99 < x <= 100
vs.
0 <= x < 1, 1 <= x < 2, ..., 99 <= x < 100
You cannot have both boundaries inclusive, as the method wouldn't know which bucket to put it in if you have a value that is exactly equal to a boundary.
public enum BucketizeDirectionEnum
{
LowerBoundInclusive,
UpperBoundInclusive
}
public static int[] Bucketize(this IList<double> source, int totalBuckets, BucketizeDirectionEnum inclusivity = BucketizeDirectionEnum.UpperBoundInclusive)
{
var min = source.Min();
var max = source.Max();
var buckets = new int[totalBuckets];
var bucketSize = (max - min) / totalBuckets;
if (inclusivity == BucketizeDirectionEnum.LowerBoundInclusive)
{
foreach (var value in source)
{
int bucketIndex = (int)((value - min) / bucketSize);
if (bucketIndex == totalBuckets)
continue;
buckets[bucketIndex]++;
}
}
else
{
foreach (var value in source)
{
int bucketIndex = (int)Math.Ceiling((value - min) / bucketSize) - 1;
if (bucketIndex < 0)
continue;
buckets[bucketIndex]++;
}
}
return buckets;
}
The only remaining issue is that if the input dataset has many values equal to min or max, the binning method will exclude them and the resulting graph will misrepresent the dataset. A clamping variant is sketched below.
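One possible workaround (my sketch, not part of the original answer): clamp the computed index into [0, totalBuckets - 1] instead of skipping, so boundary values are still counted in the edge buckets.

public static int[] BucketizeClamped(this IList<double> source, int totalBuckets)
{
    var min = source.Min();
    var max = source.Max();
    var buckets = new int[totalBuckets];
    var bucketSize = (max - min) / totalBuckets;
    foreach (var value in source)
    {
        int bucketIndex = bucketSize > 0 ? (int)((value - min) / bucketSize) : 0;
        // Clamp rather than skip, so values equal to min or max are not dropped.
        bucketIndex = Math.Min(Math.Max(bucketIndex, 0), totalBuckets - 1);
        buckets[bucketIndex]++;
    }
    return buckets;
}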
I have encountered this weird behaviour, which is probably best described by a small example:
Random R = new Random();
for (int i = 0; i < 10_000; i++)
{
double d = R.NextDouble() * uint.MaxValue;
}
Now, the last digit of d before the decimal mark is always even, i.e. int r = (int) (d % 10) is always 0, 2, 4, 6, or 8. There are odd digits on either side, though.
I suspected that multiplying by uint.MaxValue (2^32 - 1) could force some rounding error in the last digits, but since a double has 53 bits of significand precision, it should beat uint's 32 bits with about 20 bits to spare after the separator. The behaviour also occurs if I explicitly store uint.MaxValue as a double before the multiplication and use that instead.
Can someone shed any light on this?
This is a deficiency in the .Net Random class.
If you inspect the source code you will see the following comment in the implementation of the private method GetSampleForLargeRange():
// The distribution of double value returned by Sample
// is not distributed well enough for a large range.
// If we use Sample for a range [Int32.MinValue..Int32.MaxValue)
// We will end up getting even numbers only.
This is used in the implementation of Next():
public virtual int Next(int minValue, int maxValue) {
if (minValue>maxValue) {
throw new ArgumentOutOfRangeException("minValue",Environment.GetResourceString("Argument_MinMaxValue", "minValue", "maxValue"));
}
Contract.EndContractBlock();
long range = (long)maxValue-minValue;
if( range <= (long)Int32.MaxValue) {
return ((int)(Sample() * range) + minValue);
}
else {
return (int)((long)(GetSampleForLargeRange() * range) + minValue);
}
}
But it is NOT used for the values returned from NextDouble() (which just returns the value returned from Sample()).
So the answer is that NextDouble() takes only about 2^31 distinct values, and multiplying by a range around 2^32 exposes that granularity.
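To see why the integer part always comes out even, assume (per the .NET reference source) that Sample() returns InternalSample() / Int32.MaxValue, where InternalSample() is an integer m in [0, 2^31 - 1). Then

$$ d = \frac{m}{2^{31}-1}\,(2^{32}-1) = m\left(2 + \frac{1}{2^{31}-1}\right) = 2m + \frac{m}{2^{31}-1} $$

and since 0 <= m / (2^31 - 1) < 1, the integer part of d is exactly 2m: always even.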
You can use RNGCryptoServiceProvider (superseded by RandomNumberGenerator in modern .NET) to generate better random numbers, but it's a bit of a fiddle to create the double. From this answer:
static void Main()
{
var R = new RNGCryptoServiceProvider();
var bytes = new Byte[8];
for (int i = 0; i < 10_000; i++)
{
R.GetBytes(bytes);
var ul = BitConverter.ToUInt64(bytes, 0) / (1 << 11);
var d = ul / (double)(1UL << 53);
d *= uint.MaxValue;
Console.WriteLine(d);
}
}
Following a pseudo-code description of the Bays-Durham shuffle, I'm trying to shuffle a random sequence of numbers using the built-in Random in C#.
The pseudo code I'm following is:
An array of random numbers (V) is generated
A random number (Y) is generated - the seed should be the last number in the array
Generate a random index (y) using the following formula: (K * Y) / M
Where:
K - size of the array
Y - the random number generated
M - the modulus used in the random number generator used to populate the array
The element at position y in the array, i.e. V[y], is returned
The element at position y in the array is then set to Y itself, i.e. V[y] = Y
int[] seq = GetRand(size, min, max);
int nextRand = cRNG.Next(seq[seq.Length-1]);
//int index = (seq.Length * nextRand) / [Add Modulus];
return seq;
The first couple of steps I could follow. Now, I need to return a shuffled array, so the pseudo code needs to be modified a little.
A few pointers on the above code:
cRNG -> the instance name of Random
GetRand(...) -> has a for loop using cRNG
Now what I'm not understanding is the following:
1) Given that I'm using the built-in Random, how can I get the modulus used to populate the array, since it's simply a for loop using cRNG.Next(min, max)?
2) I don't fully understand the last step.
Any help would be quite appreciated!
[Edit]
Following Samuel Vidal's solution, would this work? (Assuming I have to do it all in a single method.)
public static int[] Homework(int size, int min, int max)
{
    var seq = GetRand(size, min, max);
    var nextRand = cRNG.Next(min, max);
    var index = (int)((seq.Length * (double)(nextRand - min)) / (max - min));
    nextRand = seq[index];
    seq[index] = cRNG.Next(min, max);
    return seq; // nextRand is what should be returned instead.
    // Made it return the array because it should return the newly shuffled array, hence the for loop
}
I think what you want is the following:
var index = (int) ((sequence.Length * (double)(randNum - min)) / (max - min));
But I suggest that you fill the array with unaltered random numbers from the source, and then map the output of your random number generator to the interval [min, max) in an overload Next(int min, int max) that consumes the output of your Bays-Durham shuffle.
public class BaysDurham
{
private readonly int[] t;
private int y; // auxiliary variable
// Knuth TAOCP 2 p. 34 Algorithm B
public BaysDurham(int k)
{
t = new int[k];
for (int i = 0; i < k; i++)
{
t[i] = rand.Next();
}
y = rand.Next();
}
public int Next()
{
var i = (int)(((long) t.Length * y) / int.MaxValue);
y = t[i];
t[i] = rand.Next();
return y;
}
public int Next(int min, int max)
{
long m = max - min;
return min + (int) ((Next() * m) / int.MaxValue);
}
private readonly Random rand = new Random();
}
Generally, it is a great software design principle to have separation of concerns.
Test code:
[Test]
public void Test()
{
const int k = 30;
const int n = 1000 * 1000;
const int min = 10;
const int max = 27;
var rng = new BaysDurham(k);
var count = new int[max];
for (int i = 0; i < n; i++)
{
var index = rng.Next(min, max);
count[index] ++;
}
Console.WriteLine($"Expected : {1.0 / (max - min):F4}, Actual :");
for (int i = min; i < max; i++)
{
Console.WriteLine($"{i} : {count[i] / (double) (n):F4}");
}
}
Result:
Expected : 0.0588, Actual :
10 : 0.0584
11 : 0.0588
12 : 0.0588
13 : 0.0587
14 : 0.0587
15 : 0.0589
16 : 0.0590
17 : 0.0592
18 : 0.0586
19 : 0.0590
20 : 0.0586
21 : 0.0588
22 : 0.0586
23 : 0.0587
24 : 0.0590
25 : 0.0594
26 : 0.0590
I would like to find distinct random numbers within a range that sum up to a given number.
Note: I found similar questions on Stack Overflow; however, they do not address exactly this problem (i.e. they do not consider a negative lowerLimit for the range).
If I wanted the sum of my random numbers to be 1, I would just generate the required random numbers, compute their sum, and divide each of them by that sum (a sketch of that trick is below). Here I need something a bit different: my random numbers must add up to something other than 1 and must still lie within a given range.
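For reference, a minimal sketch of that divide-by-the-sum trick (my illustration, using uniform draws in [0, 1)):

var rand = new Random();
var numbers = Enumerable.Range(0, 30).Select(_ => rand.NextDouble()).ToList();
double total = numbers.Sum();
var normalized = numbers.Select(x => x / total).ToList(); // sums to 1, but the range is no longer controlled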
Example: I need 30 distinct random numbers (non-integers) between -50 and 50 whose sum is 300. I wrote the code below; however, it does not work when n is much larger than the range (upperLimit - lowerLimit), and the function can return numbers outside the range [lowerLimit, upperLimit]. Any help improving the current solution?
static void Main(string[] args)
{
var listWeights = GetRandomNumbersWithConstraints(30, 50, -50, 300);
}
private static List<double> GetRandomNumbersWithConstraints(int n, int upperLimit, int lowerLimit, int sum)
{
if (upperLimit <= lowerLimit || n < 1)
throw new ArgumentOutOfRangeException();
Random rand = new Random(Guid.NewGuid().GetHashCode());
List<double> weight = new List<double>();
for (int k = 0; k < n; k++)
{
//multiply by rand.NextDouble() to avoid duplicates
double temp = (double)rand.Next(lowerLimit, upperLimit) * rand.NextDouble();
if (weight.Contains(temp))
k--;
else
weight.Add(temp);
}
//divide each element by the sum
weight = weight.ConvertAll<double>(x => x / weight.Sum()); //here the sum of my weight will be 1
return weight.ConvertAll<double>(x => x * sum);
}
EDIT - to clarify
Running the current code generates the following 30 numbers, which add up to 300. However, those numbers are not all within -50 and 50:
-4.425315699
67.70219958
82.08592061
46.54014109
71.20352208
-9.554070146
37.65032717
-75.77280868
24.68786878
30.89874589
142.0796933
-1.964407284
9.831226893
-15.21652248
6.479463312
49.61283063
118.1853036
-28.35462683
49.82661159
-65.82706541
-29.6865969
-54.5134262
-56.04708803
-84.63783048
-3.18402453
-13.97935982
-44.54265204
112.774348
-2.911427266
-58.94098071
OK, here is how it could be done.
We will use the Dirichlet distribution, which is a distribution of random numbers x_i in the range [0...1] such that
x_1 + x_2 + ... + x_n = 1
So after a linear rescaling, the sum condition is satisfied automatically. The Dirichlet distribution is parametrized by α_i, but we assume all random numbers come from the same marginal distribution, so there is a single parameter α for each and every index.
For reasonably large α, the mean of the sampled random numbers is 1/n and the variance is ~1/(n·α), so larger α leads to values closer to the mean.
OK, now back to the rescaling:
v_i = A + B·x_i
We have to find A and B. As @HansKesting rightly noted, with only two free parameters we can satisfy only two constraints, but you have three. So we strictly satisfy the lower-bound constraint and the sum-value constraint, but occasionally violate the upper-bound constraint. In such a case we just throw the whole sample away and draw another one.
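Concretely (my worked check, with mean = sum/n as in the code below): since the x_i sum to 1 and are non-negative,

$$ \sum_i v_i = nA + B\sum_i x_i = nA + B = \text{sum} \quad\Rightarrow\quad A = \text{lo},\qquad B = \text{sum} - n\,\text{lo} = n(\text{mean} - \text{lo}) $$

so each v_i = lo + n(mean - lo)·x_i >= lo always holds, and only the upper bound can be violated, which is exactly the check the code performs.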
Again, we have a knob to turn: larger α means values stay closer to the mean and are less likely to hit the upper bound. With α = 1 I rarely get a good sample, but with α = 10 I get close to 40% good samples, and with α = 16 close to 80%.
Dirichlet sampling is done via the Gamma distribution, using code from Math.NET Numerics.
Code, tested with .NET Core 2.1
using System;
using MathNet.Numerics.Distributions;
using MathNet.Numerics.Random;
class Program
{
static void SampleDirichlet(double alpha, double[] rn)
{
if (rn == null)
throw new ArgumentException("SampleDirichlet:: Results placeholder is null");
if (alpha <= 0.0)
throw new ArgumentException($"SampleDirichlet:: alpha {alpha} is non-positive");
int n = rn.Length;
if (n == 0)
throw new ArgumentException("SampleDirichlet:: Results placeholder is of zero size");
var gamma = new Gamma(alpha, 1.0);
double sum = 0.0;
for(int k = 0; k != n; ++k) {
double v = gamma.Sample();
sum += v;
rn[k] = v;
}
if (sum <= 0.0)
throw new ApplicationException($"SampleDirichlet:: sum {sum} is non-positive");
// normalize
sum = 1.0 / sum;
for(int k = 0; k != n; ++k) {
rn[k] *= sum;
}
}
static bool SampleBoundedDirichlet(double alpha, double sum, double lo, double hi, double[] rn)
{
if (rn == null)
throw new ArgumentException("SampleDirichlet:: Results placeholder is null");
if (alpha <= 0.0)
throw new ArgumentException($"SampleDirichlet:: alpha {alpha} is non-positive");
if (lo >= hi)
throw new ArgumentException($"SampleDirichlet:: low {lo} is larger than high {hi}");
int n = rn.Length;
if (n == 0)
throw new ArgumentException("SampleDirichlet:: Results placeholder is of zero size");
double mean = sum / (double)n;
if (mean < lo || mean > hi)
throw new ArgumentException($"SampleDirichlet:: mean value {mean} is not within [{lo}...{hi}] range");
SampleDirichlet(alpha, rn);
bool rc = true;
for(int k = 0; k != n; ++k) {
double v = lo + (mean - lo)*(double)n * rn[k];
if (v > hi)
rc = false;
rn[k] = v;
}
return rc;
}
static void Main(string[] args)
{
double[] rn = new double [30];
double lo = -50.0;
double hi = 50.0;
double alpha = 10.0;
double sum = 300.0;
for(int k = 0; k != 1_000; ++k) {
var q = SampleBoundedDirichlet(alpha, sum, lo, hi, rn);
Console.WriteLine($"Rng(BD), v = {q}");
double s = 0.0;
foreach(var r in rn) {
Console.WriteLine($"Rng(BD), r = {r}");
s += r;
}
Console.WriteLine($"Rng(BD), summa = {s}");
}
}
}
UPDATE
Usually, when people ask such a question, there is an implicit assumption/requirement: all random numbers shall be distributed in the same way. It means that if I plot the marginal probability density function (PDF) for the item at index 0 of the sampled array, I should get the same distribution as for the last item in the array. People usually sample random arrays to pass down to other routines to do some interesting stuff. If the marginal PDF for item 0 differs from the marginal PDF for the last item, then simply reversing the array will produce wildly different results in the code which uses such random values.
Here I plotted the distributions of the random numbers for item 0 and the last item (#29) under the original conditions ([-50...50], sum = 300), using my sampling routine. They look similar, don't they?
OK, and here is a picture from your sampling routine, same original conditions ([-50...50], sum = 300), same number of samples:
UPDATE II
The user is supposed to check the return value of the sampling routine and accept and use the sampled array if (and only if) the return value is true. This is an acceptance/rejection method. As an illustration, below is the code used to histogram the samples:
int[] hh = new int[100]; // histogram allocated
var s = 1.0; // step size
int k = 0; // good samples counter
for( ;; ) {
var q = SampleBoundedDirichlet(alpha, sum, lo, hi, rn);
if (q) // good sample, accept it
{
var v = rn[0]; // any index, 0 or 29 or ....
var i = (int)((v - lo) / s);
i = System.Math.Max(i, 0);
i = System.Math.Min(i, hh.Length-1);
hh[i] += 1;
++k;
if (k == 100000) // required number of good samples reached
break;
}
}
for(k = 0; k != hh.Length; ++k)
{
var x = lo + (double)k * s + 0.5*s;
var v = hh[k];
Console.WriteLine($"{x} {v}");
}
Here you go. It'll probably run for centuries before actually returning the list, but it'll comply :)
public List<double> TheThing(int qty, double lowest, double highest, double sumto)
{
if (highest * qty < sumto)
{
throw new Exception("Impossibru!");
// heresy
highest = sumto / 1 + (qty * 2);
lowest = -highest;
}
double rangesize = (highest - lowest);
Random r = new Random();
List<double> ret = new List<double>();
while (ret.Sum() != sumto)
{
if (ret.Count > 0)
ret.RemoveAt(0);
while (ret.Count < qty)
ret.Add((r.NextDouble() * rangesize) + lowest);
}
return ret;
}
I came up with this solution, which is fast. I am sure it could be improved, but for the moment it does the job.
n = the number of random numbers that I need to find
Constraints
the n random numbers must add up to finalSum
the n random numbers must be within lowerLimit and upperLimit
The idea is to remove from the initial list of random numbers (which sums to finalSum) the numbers outside the range [lowerLimit, upperLimit].
Then count the numbers left in the list (call it nValid) and their sum (sumOfValid).
Now, iteratively search for (n - nValid) random numbers within the range [lowerLimit, upperLimit] whose sum is (finalSum - sumOfValid).
I tested it with several combinations of the input variables (including a negative sum) and the results look good.
static void Main(string[] args)
{
int n = 100;
int max = 5000;
int min = -500000;
double finalSum = -1000;
for (int i = 0; i < 5000; i++)
{
var listWeights = GetRandomNumbersWithConstraints(n, max, min, finalSum);
Console.WriteLine("=============");
Console.WriteLine("sum = " + listWeights.Sum());
Console.WriteLine("max = " + listWeights.Max());
Console.WriteLine("min = " + listWeights.Min());
Console.WriteLine("count = " + listWeights.Count());
}
}
private static List<double> GetRandomNumbersWithConstraints(int n, int upperLimit, int lowerLimit, double finalSum, int precision = 6)
{
if (upperLimit <= lowerLimit || n < 1) //todo improve here
throw new ArgumentOutOfRangeException();
Random rand = new Random(Guid.NewGuid().GetHashCode());
List<double> randomNumbers = new List<double>();
int adj = (int)Math.Pow(10, precision);
bool flag = true;
List<double> weights = new List<double>();
while (flag)
{
foreach (var d in randomNumbers.Where(x => x <= upperLimit && x >= lowerLimit).ToList())
{
if (!weights.Contains(d)) //only distinct
weights.Add(d);
}
if (weights.Count() == n && weights.Max() <= upperLimit && weights.Min() >= lowerLimit && Math.Round(weights.Sum(), precision) == finalSum)
return weights;
/* worst case - if the largest sum of the missing elements (ie we still need to find 3 elements,
* then the largest sum is 3*upperlimit) is smaller than (finalSum - sumOfValid)
*/
if (((n - weights.Count()) * upperLimit < (finalSum - weights.Sum())) ||
((n - weights.Count()) * lowerLimit > (finalSum - weights.Sum())))
{
weights = weights.Where(x => x != weights.Max()).ToList();
weights = weights.Where(x => x != weights.Min()).ToList();
}
int nValid = weights.Count();
double sumOfValid = weights.Sum();
int numberToSearch = n - nValid;
double sum = finalSum - sumOfValid;
double j = finalSum - weights.Sum();
if (numberToSearch == 1 && (j <= upperLimit && j >= lowerLimit)) // && : the final element must lie inside the range
{
weights.Add(finalSum - weights.Sum());
}
else
{
randomNumbers.Clear();
int min = lowerLimit;
int max = upperLimit;
for (int k = 0; k < numberToSearch; k++)
{
randomNumbers.Add((double)rand.Next(min * adj, max * adj) / adj);
}
if (sum != 0 && randomNumbers.Sum() != 0)
randomNumbers = randomNumbers.ConvertAll<double>(x => x * sum / randomNumbers.Sum());
}
}
return randomNumbers;
}
I have a list of double values, and I want to round a variable's value to the nearest number in that list.
Example:
The list contents are: {12, 15, 23, 94, 35, 48}
The variable's value is 17, so it will be rounded to 15.
If the variable's value is less than the smallest number, it is rounded to that number; if its value is larger than the largest number, it is rounded to that one.
The list contents are always changing according to an external factor, so I cannot hardcode the values I want to round up or down to.
How can I do it in C#?
Here's a method using LINQ:
var list = new[] { 12, 15, 23, 94, 35, 48 };
var input = 17;
var diffList = from number in list
select new {
number,
difference = Math.Abs(number - input)
};
var result = (from diffItem in diffList
orderby diffItem.difference
select diffItem).First().number;
EDIT: Renamed some of the variables so the code is less confusing...
EDIT:
The list variable is an implicitly typed array of int. The first LINQ statement (diffList) builds an anonymous type that holds each original number from the list (number) together with its difference from your current value (input).
The second LINQ statement (result) orders that anonymous-type collection by the difference, which is your "rounding" requirement. It takes the first item, since it has the smallest difference, and then selects only the original .number from the anonymous type.
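On .NET 6 or later, MinBy reduces the whole thing to a single statement (same logic; ties resolve to the first occurrence):

var result = list.MinBy(number => Math.Abs(number - input));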
Assuming the array is sorted, you could perform a binary search in the array, narrowing it down to which two numbers the given number lies between.
Then once you have these two numbers, you simply round to the nearest of the two.
static int RoundToArray(int value, int[] array) {
int min = 0;
if (array[min] >= value) return array[min];
int max = array.Length - 1;
if (array[max] <= value) return array[max];
while (max - min > 1) {
int mid = (max + min) / 2;
if (array[mid] == value) {
return array[mid];
} else if (array[mid] < value) {
min = mid;
} else {
max = mid;
}
}
if (array[max] - value <= value - array[min]) {
return array[max];
} else {
return array[min];
}
}
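Example usage (the array must be sorted ascending, as the method assumes):

int[] sorted = { 12, 15, 23, 35, 48, 94 };
int rounded = RoundToArray(17, sorted); // returns 15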
Using linq:
int value = 17;
var values = new float[] { 12, 15, 23, 35, 48, 94 }; // assumed sorted ascending
if(value < values.First()) return values.First();
if(value > values.Last()) return values.Last();
float below = values.Where(v => v <= value).Max();
float above = values.Where(v => v >= value).Min();
if(value - below < above - value)
return below;
else
return above;
As long as the number of possible values is quite small, this should work. If you have thousands of possible values, another solution should be used that takes advantage of the values being sorted (which the First()/Last() checks already assume); see the sketch below.
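For that case, here is a sketch using Array.BinarySearch (the method name is mine):

static float RoundToSorted(float[] sorted, float value)
{
    int idx = Array.BinarySearch(sorted, value);
    if (idx >= 0) return sorted[idx];                 // exact match
    idx = ~idx;                                       // index of the first element greater than value
    if (idx == 0) return sorted[0];                   // below the smallest element
    if (idx == sorted.Length) return sorted[idx - 1]; // above the largest element
    float below = sorted[idx - 1], above = sorted[idx];
    return (value - below < above - value) ? below : above;
}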
You could loop through the array of numbers and set a roundedNum variable to each element whose distance from your number is smaller than the lowest delta seen so far. Some things are best described in code.
int roundedNum = myNum;
int delta = int.MaxValue; // start above any possible difference (safer than assuming a sorted array)
for(int i=0; i<myArray.Length; ++i) {
    if(Math.Abs(myNum - myArray[i]) < delta) {
        delta = Math.Abs(myNum - myArray[i]);
        roundedNum = myArray[i];
    }
}
That should do the trick quite nicely.
Do something like this (value being the number to round):
double distance = double.PositiveInfinity;
float roundedValue = float.NaN;
foreach (float f in list)
{
    double d = Math.Abs(value - f);
    if (d < distance)
    {
        distance = d;
        roundedValue = f;
    }
}
Simple rounding down in LINQ:
public decimal? RoundDownToList(decimal input, decimal[] list)
{
    var result = (from number in list
                  where number <= input
                  orderby number descending
                  select (decimal?)number).FirstOrDefault(); // null (rather than 0) when nothing rounds down
    return result;
}
I have a time series in the form of a SortedList<DateTime, double>. I would like to calculate a moving average of this series. I can do this using simple for loops. I was wondering if there is a better way to do this using LINQ.
My version:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
var mySeries = new SortedList<DateTime, double>();
mySeries.Add(new DateTime(2011, 01, 1), 10);
mySeries.Add(new DateTime(2011, 01, 2), 25);
mySeries.Add(new DateTime(2011, 01, 3), 30);
mySeries.Add(new DateTime(2011, 01, 4), 45);
mySeries.Add(new DateTime(2011, 01, 5), 50);
mySeries.Add(new DateTime(2011, 01, 6), 65);
var calcs = new calculations();
var avg = calcs.MovingAverage(mySeries, 3);
foreach (var item in avg)
{
Console.WriteLine("{0} {1}", item.Key, item.Value);
}
}
}
class calculations
{
public SortedList<DateTime, double> MovingAverage(SortedList<DateTime, double> series, int period)
{
var result = new SortedList<DateTime, double>();
for (int i = 0; i < series.Count(); i++)
{
if (i >= period - 1)
{
double total = 0;
for (int x = i; x > (i - period); x--)
total += series.Values[x];
double average = total / period;
result.Add(series.Keys[i], average);
}
}
return result;
}
}
}
In order to achieve asymptotic O(n) performance (as the hand-coded solution does), you could use the Aggregate function, as in:
series.Skip(period-1).Aggregate(
new {
Result = new SortedList<DateTime, double>(),
Working = new List<double>(series.Take(period-1).Select(item => item.Value))
},
(list, item)=>{
list.Working.Add(item.Value);
list.Result.Add(item.Key, list.Working.Average());
list.Working.RemoveAt(0);
return list;
}
).Result;
The accumulated value (implemented as an anonymous type) contains two fields: Result contains the result list built up so far; Working contains the last period-1 elements. The aggregate function adds the current value to the Working list, computes the current average and adds it to the result, and then removes the first (i.e. oldest) value from the working list.
The "seed" (i.e. the starting value for the accumulation) is built by putting the first period-1 elements into Working and initializing Result to an empty list.
Consequently, the aggregation starts with element period (by skipping period-1 elements at the beginning).
In functional programming this is a typical usage pattern for the aggregate (or fold) function, by the way.
Two remarks:
The solution is not "functionally" clean in that the same list objects (Working and Result) are reused in every step. I'm not sure whether that might cause problems if some future compiler tries to parallelize the Aggregate function automatically (on the other hand, I'm also not sure whether that's possible at all...). A purely functional solution would "create" new lists at every step.
Also note that C# lacks powerful list expressions. In some hypothetical Python-C#-mixed pseudocode one could write the aggregation function like
(list, item)=>
new {
Result = list.Result + [(item.Key, (list.Working+[item.Value]).Average())],
Working=list.Working[1::]+[item.Value]
}
which would be a bit more elegant in my humble opinion :)
For the most efficient way possible to compute a Moving Average with LINQ, you shouldn't use LINQ!
Instead I propose creating a helper class which computes a moving average in the most efficient way possible (using a circular buffer and causal moving average filter), then an extension method to make it accessible to LINQ.
First up, the moving average
public class MovingAverage
{
private readonly int _length;
private int _circIndex = -1;
private bool _filled;
private double _current = double.NaN;
private readonly double _oneOverLength;
private readonly double[] _circularBuffer;
private double _total;
public MovingAverage(int length)
{
_length = length;
_oneOverLength = 1.0 / length;
_circularBuffer = new double[length];
}
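// Note: Update overwrites the slot most recently written by Push,
// so it assumes Push has been called at least once (_circIndex starts at -1).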
public MovingAverage Update(double value)
{
double lostValue = _circularBuffer[_circIndex];
_circularBuffer[_circIndex] = value;
// Maintain totals for Push function
_total += value;
_total -= lostValue;
// If not yet filled, just return. Current value should be double.NaN
if (!_filled)
{
_current = double.NaN;
return this;
}
// Compute the average
double average = 0.0;
for (int i = 0; i < _circularBuffer.Length; i++)
{
average += _circularBuffer[i];
}
_current = average * _oneOverLength;
return this;
}
public MovingAverage Push(double value)
{
// Apply the circular buffer
if (++_circIndex == _length)
{
_circIndex = 0;
}
double lostValue = _circularBuffer[_circIndex];
_circularBuffer[_circIndex] = value;
// Compute the average
_total += value;
_total -= lostValue;
// If not yet filled, just return. Current value should be double.NaN
if (!_filled && _circIndex != _length - 1)
{
_current = double.NaN;
return this;
}
else
{
// Set a flag to indicate this is the first time the buffer has been filled
_filled = true;
}
_current = _total * _oneOverLength;
return this;
}
public int Length { get { return _length; } }
public double Current { get { return _current; } }
}
This class provides a very fast and lightweight implementation of a moving-average filter. It creates a circular buffer of length N and computes one add, one subtract and one multiply per data point appended, as opposed to the N multiply-adds per point of the brute-force implementation.
Next, to LINQ-ify it!
internal static class MovingAverageExtensions
{
public static IEnumerable<double> MovingAverage<T>(this IEnumerable<T> inputStream, Func<T, double> selector, int period)
{
var ma = new MovingAverage(period);
foreach (var item in inputStream)
{
ma.Push(selector(item));
yield return ma.Current;
}
}
public static IEnumerable<double> MovingAverage(this IEnumerable<double> inputStream, int period)
{
var ma = new MovingAverage(period);
foreach (var item in inputStream)
{
ma.Push(item);
yield return ma.Current;
}
}
}
The above extension methods wrap the MovingAverage class and allow insertion into an IEnumerable stream.
Now to use it!
int period = 50;
// Simply filtering a list of doubles
IEnumerable<double> inputDoubles;
IEnumerable<double> outputDoubles = inputDoubles.MovingAverage(period);
// Or, use a selector to filter T into a list of doubles
IEnumerable<Point> inputPoints; // assuming you have initialised this
IEnumerable<double> smoothedYValues = inputPoints.MovingAverage(pt => pt.Y, period);
You already have an answer showing how you could use LINQ, but frankly I wouldn't use LINQ here, as it will most likely perform poorly compared to your current solution, and your existing code is already clear.
However instead of calculating the total of the previous period elements on every step, you can keep a running total and adjust it on each iteration. That is, change this:
total = 0;
for (int x = i; x > (i - period); x--)
total += series.Values[x];
to this:
if (i >= period) {
total -= series.Values[i - period];
}
total += series.Values[i];
This means your code will take the same amount of time to execute regardless of the size of period. A full sketch of the method follows.
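Put together, the whole method might look like this (my sketch, keeping your SortedList-based signature):

public SortedList<DateTime, double> MovingAverage(SortedList<DateTime, double> series, int period)
{
    var result = new SortedList<DateTime, double>();
    double total = 0;
    for (int i = 0; i < series.Count; i++)
    {
        if (i >= period)
            total -= series.Values[i - period]; // drop the value leaving the window
        total += series.Values[i];              // add the newest value
        if (i >= period - 1)
            result.Add(series.Keys[i], total / period);
    }
    return result;
}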
This block
double total = 0;
for (int x = i; x > (i - period); x--)
total += series.Values[x];
double average = total / period;
can be rewritten as:
double average = series.Values.Skip(i - period + 1).Take(period).Sum() / period;
Your method may look like:
series.Skip(period - 1)
      .Select((item, index) =>
          new
          {
              item.Key,
              Average = series.Values.Skip(index).Take(period).Sum() / period
          });
As you can see, LINQ is very expressive. I recommend starting with a tutorial like Introducing LINQ and 101 LINQ Samples.
To do this in a more functional way, you'd need a Scan method which exists in Rx but not in LINQ.
Let's look at how it would look if we had a Scan method:
var delta = 3;
var series = new [] {1.1, 2.5, 3.8, 4.8, 5.9, 6.1, 7.6};
var seed = series.Take(delta).Average();
var smas = series
.Skip(delta)
.Zip(series, Tuple.Create)
.Scan(seed, (sma, values)=>sma - (values.Item2/delta) + (values.Item1/delta));
smas = Enumerable.Repeat(0.0, delta-1).Concat(new[]{seed}).Concat(smas);
And here's the scan method, taken and adjusted from here:
public static IEnumerable<TAccumulate> Scan<TSource, TAccumulate>(
this IEnumerable<TSource> source,
TAccumulate seed,
Func<TAccumulate, TSource, TAccumulate> accumulator
)
{
if (source == null) throw new ArgumentNullException("source");
if (seed == null) throw new ArgumentNullException("seed");
if (accumulator == null) throw new ArgumentNullException("accumulator");
using (var i = source.GetEnumerator())
{
if (!i.MoveNext())
{
throw new InvalidOperationException("Sequence contains no elements");
}
var acc = accumulator(seed, i.Current);
while (i.MoveNext())
{
yield return acc;
acc = accumulator(acc, i.Current);
}
yield return acc;
}
}
This should have better performance than the brute force method since we are using a running total to calculate the SMA.
What's going on here?
To start, we need to calculate the first period, which we call seed here. Then every subsequent value is calculated from the accumulated seed value. To do that we need the oldest value (the one at t - delta) and the newest value, so we zip the series with itself, once shifted by delta and once from the beginning.
At the end we do some cleanup, prepending zeroes for the length of the first period and adding the initial seed value.
Another option is to use MoreLINQ's Windowed method (renamed Window in recent versions), which simplifies the code significantly:
var averaged = mySeries.Windowed(period).Select(window => window.Average(keyValuePair => keyValuePair.Value));
I use this code to calculate SMA:
private void calculateSimpleMA(decimal[] values, out decimal[] buffer)
{
int period = values.Count(); // gets Period (assuming Period=Values-Array-Size)
buffer = new decimal[period]; // initializes buffer array
var sma = SMA(period); // gets SMA function
for (int i = 0; i < period; i++)
buffer[i] = sma(values[i]); // fills buffer with SMA calculation
}
static Func<decimal, decimal> SMA(int p)
{
Queue<decimal> s = new Queue<decimal>(p);
return (x) =>
{
if (s.Count >= p)
{
s.Dequeue();
}
s.Enqueue(x);
return s.Average();
};
}
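Hypothetical usage (note that the method takes the period to be the array length, per the comment in the code, so the result is a cumulative average):

decimal[] values = { 1.1m, 2.5m, 3.8m, 4.8m };
calculateSimpleMA(values, out decimal[] buffer);
// buffer[i] is the average of values[0..i]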
Here is an extension method:
public static IEnumerable<double> MovingAverage(this IEnumerable<double> source, int period)
{
if (source is null)
{
throw new ArgumentNullException(nameof(source));
}
if (period < 1)
{
throw new ArgumentOutOfRangeException(nameof(period));
}
return Core();
IEnumerable<double> Core()
{
var sum = 0.0;
var buffer = new double[period];
var n = 0;
foreach (var x in source)
{
n++;
sum += x;
var index = n % period;
if (n >= period)
{
sum -= buffer[index];
yield return sum / period;
}
buffer[index] = x;
}
}
}
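A usage sketch:

double[] data = { 1, 2, 3, 4, 5, 6 };
foreach (var avg in data.MovingAverage(3))
{
    Console.WriteLine(avg); // prints 2, 3, 4, 5
}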