Calculate Percentile using LINQ - c#

All,
Having reviewed StackOverflow and the wider internet, I am still struggling to efficiently calculate Percentiles using LINQ.
A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations falls. The example below attempts to convert a list of values to an array in which each (unique) value is represented with its associated percentile.
The min() and max() of the list are necessarily the 0% and 100% of the returned array's percentiles.
Using LINQPad, the code below generates the required output, a VP[]:
This can be interpreted as:
- At 0% the minimum value is 1
- At 100% the maximum value is 3
- At 50% between the minimum and maximum the value is 2
void Main()
{
    var list = new List<double> { 1, 2, 3 };
    double denominator = list.Count - 1;
    var answer = list
        .Select(x => new VP
        {
            Value = x,
            Percentile = list.Count(y => x > y) / denominator
        })
        //.GroupBy(grp => grp.Value) --> commented out until attempted duplicate solution
        .ToArray();
    answer.Dump();
}

public struct VP
{
    public double Value;
    public double Percentile;
}
However, this returns an incorrect VP[] when the "list" contains duplicate entries (e.g. 1, 2, 2, 3):
My attempts to group by unique values in the list (by including ".GroupBy(grp => grp.Value)") have failed to yield the desired result (Value = 2, Percentile = 0.666):
All suggestions are welcome. Including whether this is an efficient approach given the repeated iteration with "list.Count(y => x > y)".
As always, thanks
Shannon

I'm not sure I understand the requirements of this question. When I ran the accepted answer's code I got this result:
But if I change the input to this:
var dataSet = new List<double> { 1, 1, 1, 1, 2, 3, 3, 3, 2 };
...I then get this result:
With the line "The min() and max() of the list are necessarily the 0% and 100% of the returned array percentiles." it seems to me the OP is asking for the values to be from 0 to 1, but the updated result goes beyond 1.
It also seems wrong to me that the first value should be 0% as I'm not sure what that means in context to the data.
After reading the linked Wikipedia page it seems that the OP is actually trying to do the reverse calculation: computing the percentile from a value, rather than the value at a given percentile. In fact the article says that the percentile for 0 is undefined. That makes sense because a percentile of 0 would be the empty set of values - and what is the maximum value of an empty set?
The OP seems to be computing the percentile from the values. So, in that sense, and knowing that 0 is undefined, it seems that the most appropriate value to compute is the percentage of values that are equal to or below each distinct value in the set.
Now, if I use Microsoft's Reactive Framework team's Interactive Extensions (NuGet "Ix-Main"), I can run this code:
var dataSet = new List<double> { 1, 1, 1, 1, 2, 3, 3, 3, 2 };
var result =
    dataSet
        .GroupBy(x => x)
        .Scan(
            new VP()
            {
                Value = double.MinValue,
                Proportion = 0.0
            },
            (a, x) => new VP()
            {
                Value = x.Key,
                Proportion = a.Proportion + (double)x.Count() / dataSet.Count
            });
I get this result:
This tells me that approximately 44% of the values are 1; that approximately 67% of the values are 1 or 2; and 100% of the values are either 1, 2, or 3.
This seems to me to be the most logical computation for the requirements.
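If the Ix-Main dependency isn't desirable, the same running proportion can be sketched with plain LINQ to Objects and a captured accumulator (a sketch only; it assumes the VP struct from the question, declared here so the snippet stands alone):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public struct VP
{
    public double Value;
    public double Proportion;
}

public static class Program
{
    public static void Main()
    {
        var dataSet = new List<double> { 1, 1, 1, 1, 2, 3, 3, 3, 2 };
        double running = 0.0;
        var result = dataSet
            .GroupBy(x => x)
            .OrderBy(g => g.Key)
            .Select(g =>
            {
                // Accumulate each group's share of the total, as Scan does.
                running += (double)g.Count() / dataSet.Count;
                return new VP { Value = g.Key, Proportion = running };
            })
            .ToArray();

        foreach (var vp in result)
            Console.WriteLine($"{vp.Value}: {vp.Proportion:0.00}");
        // 1: 0.44
        // 2: 0.67
        // 3: 1.00
    }
}
```

Ordering by key makes the cumulative proportions monotone regardless of input order, which Scan alone does not guarantee. Note the sketch materializes with ToArray() so the captured accumulator runs exactly once.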

This is how I did it. I changed a few of the variable names to make the context clearer.
var dataSet = new List<double> { 1, 2, 3, 2 };
double denominator = dataSet.Count - 1;
var uniqueValues = dataSet.Distinct();
var vp = dataSet.Select(value => new VP
{
    Value = value,
    Proportion = dataSet.Count(datum => value > datum) / denominator
});
var answer = uniqueValues.Select(u => new VP
{
    Value = u,
    Proportion = vp.Where(v => v.Value == u).Select(x => x.Proportion).Sum()
});

void Main()
{
    var list = new List<double> { 1, 2, 3 };
    double denominator = list.Count - 1;
    var answer = list
        .OrderBy(x => x)
        .Select(x => new VP
        {
            Value = x,
            Proportion = list.IndexOf(x) / denominator
        })
        .ToArray();
    answer.Dump();
}

public struct VP
{
    public double Value;
    public double Proportion;
}

Related

Should I use the Sum method and Count/Length to find the element of an array that is closest to the average of all elements?

If I have arr = [1,3,4,-7,9,11], the average value is (1+3+4-7+9+11)/6 = 3.5; elements 3 and 4 are equally distant from 3.5, but the smaller of them is 3, so 3 is the result.
You need to find out what the average is first. That involves a loop, either implemented explicitly or invoked implicitly. So, let's assume that you already know the average value, because your question is about how to obtain values related to the average. Let's implement a comparison function:
protected double isBetter(double a, double b, double avg) {
    double absA = Math.Abs(a - avg);
    double absB = Math.Abs(b - avg);
    if (absA < absB) return a;
    else if (absA > absB) return b;
    return (a < b) ? a : b; // tie: prefer the smaller value
}
And now you can iterate your array, always comparing the current value with the best so far via isBetter; if it's better, it becomes the new best. Whatever number ends up being the best is the result.
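The described iteration might look like this minimal sketch (isBetter is inlined as a static local function returning double so the snippet stands alone):

```csharp
using System;
using System.Linq;

double[] arr = { 1, 3, 4, -7, 9, 11 };
double avg = arr.Average(); // 3.5

// Returns whichever of a or b is closer to avg; ties go to the smaller value.
static double IsBetter(double a, double b, double avg)
{
    double absA = Math.Abs(a - avg);
    double absB = Math.Abs(b - avg);
    if (absA < absB) return a;
    if (absA > absB) return b;
    return Math.Min(a, b);
}

double best = arr[0];
foreach (double v in arr.Skip(1))
    best = IsBetter(best, v, avg);

Console.WriteLine(best); // 3
```

3 and 4 are equally distant from 3.5, and the tie-break picks the smaller, matching the expected result from the question.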
Assuming you have worked out the average (avg below), you can get the diff for each item, order by the diff, and take the first item; this will give you the closest item in the array:
var nearestDiff = arr.Select(x => new { Value=x, Diff = Math.Abs(avg-x)})
.OrderBy(x => x.Diff)
.First().Value;
Live example: https://dotnetfiddle.net/iKvmhp
If instead you must get the item lower than the average
var lowerDiff = arr.Where(x => x<avg)
.OrderByDescending(x =>x)
.First();
You'll need using System.Linq for either of the above to work
Using GroupBy is a good way to do it
var arr = new int[] { 1, 4, 3, -7, 9, 11 };
var avg = arr.Average();
var result = arr.GroupBy(x=>Math.Abs(avg-x)).OrderBy(g=>g.Key).First().OrderBy(x=>x).First();
Original Array
[1,4,3,-7,9,11]
After grouping, key is abs distance from average, items are grouped according to that
[2.5, [1]]
[0.5, [4, 3]]
[5.5, [9]]
[7.5, [11]]
[10.5, [-7]]
Order by group keys
[0.5, [4, 3]]
[2.5, [1]]
[5.5, [9]]
[7.5, [11]]
[10.5, [-7]]
Take first group
[4, 3]
Order group items
[3, 4]
Take first item
3
Note: I changed the array to [1,4,3,-7,9,11], reversing the order of 3 and 4; in the original order they happened to already be sorted as the output requires, and the swap is necessary to prove the last step.

How can I calculate the mode using an array? I'm only a beginner in programming. It's for our finals project [duplicate]

This question already has answers here:
Find character with most occurrences in string?
(12 answers)
Closed 7 years ago.
I want to find the mode in an array. I know that I have to do nested loops to check each value and see how often each element in the array appears, then count the number of times the next element appears. The code below doesn't work; can anyone help me, please?
for (int i = 0; i < x.length; i++)
{
    x[i]++;
    int high = 0;
    for (int i = 0; i < x.length; i++)
    {
        if (x[i] > high)
            high = x[i];
    }
}
Using nested loops is not a good way to solve this problem. It will have a run time of O(n^2) - much worse than the optimal O(n).
You can do it with LINQ by grouping identical values and then finding the group with the largest count:
int mode = x.GroupBy(v => v)
.OrderByDescending(g => g.Count())
.First()
.Key;
This is both simpler and faster. But note that (unlike LINQ to SQL) LINQ to Objects currently doesn't optimize the OrderByDescending when only the first result is needed. It fully sorts the entire result set, which is an O(n log n) operation.
You might want this O(n) algorithm instead. It first iterates once through the groups to find the maximum count, and then once more to find the first corresponding key for that count:
var groups = x.GroupBy(v => v);
int maxCount = groups.Max(g => g.Count());
int mode = groups.First(g => g.Count() == maxCount).Key;
You could also use the MaxBy extension method from MoreLINQ to further improve the solution so that it only requires iterating through all elements once.
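A sketch of that single-extra-pass shape, using the MaxBy that was later built into .NET 6+ (MoreLINQ's MaxBy is similar, though older MoreLINQ versions differ slightly in return type):

```csharp
using System;
using System.Linq;

int[] x = { 1, 2, 1, 2, 4, 3, 2 };

// One pass over the groups to find the largest, instead of a full sort.
int mode = x.GroupBy(v => v)
            .MaxBy(g => g.Count())
            .Key;

Console.WriteLine(mode); // 2
```

On ties, the built-in MaxBy returns the first maximal element encountered, so the result is deterministic for a given input order.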
A non LINQ solution:
int[] x = new int[] { 1, 2, 1, 2, 4, 3, 2 };
Dictionary<int, int> counts = new Dictionary<int, int>();
foreach (int a in x) {
    if (counts.ContainsKey(a))
        counts[a] = counts[a] + 1;
    else
        counts[a] = 1;
}
int result = int.MinValue;
int max = int.MinValue;
foreach (int key in counts.Keys) {
    if (counts[key] > max) {
        max = counts[key];
        result = key;
    }
}
Console.WriteLine("The mode is: " + result);
As a beginner this might not make too much sense yet, but it's worth providing a LINQ-based solution.
x
.GroupBy(i => i) //place all identical values into groups
.OrderByDescending(g => g.Count()) //order groups by the size of the group desc
.Select(g => g.Key) //key of the group is representative of items in the group
.First() //first in the list is the most frequent (modal) value
Say the x array has the items below:
int[] x = { 1, 2, 6, 2, 3, 8, 2, 2, 3, 4, 5, 6, 4, 4, 4, 5, 39, 4, 5 };
a. Getting the highest value:
int high = x.OrderByDescending(n => n).First();
b. Getting the mode:
int mode = x.GroupBy(i => i) //Grouping same items
.OrderByDescending(g => g.Count()) //now getting frequency of a value
.Select(g => g.Key) //selecting key of the group
.FirstOrDefault(); //Finally, taking the most frequent value

Min-Max DataPoint Normalization

I have a list of DataPoint such as
List<DataPoint> newpoints = new List<DataPoint>();
where DataPoint is a class consisting of nine double features from A to I, and
newpoints.Count = 100000 points (i.e. each point consists of nine double features from A to I).
I need to normalize the List<DataPoint> newpoints using the min-max normalization method, with a scale range between 0 and 1.
I have implemented the following steps so far:
Each DataPoint feature is assigned to a one-dimensional array. For example, the code for feature A:
for (int i = 0; i < newpoints.Count; i++)
{ array_A[i] = newpoints[i].A; } and so on for all nine double features
I have applied the min-max normalization method. For example, the code for feature A:
normilized_featureA = (((array_A[i] - array_A.Min()) * (1 - 0)) /
(array_A.Max() - array_A.Min())) + 0;
The method works, but it takes too long (3 minutes and 45 seconds).
How can I apply min-max normalization using LINQ in C# to reduce the time to a few seconds?
I found this question on Stack Overflow, How to normalize a list of int values, but my problem is:
double valueMax = list.Max(); // I need the Max point of feature A for all 100000
double valueMin = list.Min(); // I need the Min point of feature A for all 100000
and so on for all the other features.
Your help will be highly appreciated.
As an alternative to modelling your 9 features as double properties on a class "DataPoint", you could also model a datapoint of 9 doubles as an array, with the benefit being that you can do all 9 calculations in one pass, again, using LINQ:
var newpoints = new List<double[]>
{
new []{1.23, 2.34, 3.45, 4.56, 5.67, 6.78, 7.89, 8.90, 9.12},
new []{2.34, 3.45, 4.56, 5.67, 6.78, 7.89, 8.90, 9.12, 12.23},
new []{3.45, 4.56, 5.67, 6.78, 7.89, 8.90, 9.12, 12.23, 13.34},
new []{4.56, 5.67, 6.78, 7.89, 8.90, 9.12, 12.23, 13.34, 15.32}
};
var featureStats = newpoints
// We make the assumption that all 9 data points are present on each row.
.First()
// 2 Anon Projections - first to determine min / max as a function of column
.Select((np, idx) => new
{
Idx = idx,
Max = newpoints.Max(x => x[idx]),
Min = newpoints.Min(x => x[idx])
})
// Second to add in the dynamic Range
.Select(x => new {
x.Idx,
x.Max,
x.Min,
Range = x.Max - x.Min
})
// Back to array for O(1) lookups.
.ToArray();
// Do the normalization for the columns, for each row.
var normalizedFeatures = newpoints
.Select(np => np.Select(
(i, idx) => (i - featureStats[idx].Min) / featureStats[idx].Range));
foreach(var datapoint in normalizedFeatures)
{
Console.WriteLine(string.Join(",", datapoint.Select(x => x.ToString("0.00"))));
}
Result:
0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
0.33,0.33,0.33,0.33,0.34,0.47,0.23,0.05,0.50
0.67,0.67,0.67,0.67,0.69,0.91,0.28,0.75,0.68
1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00
Stop recalculating the maximum/minimum over and over again; they don't change.
double maxInFeatureA = array_A.Max();
double minInFeatureA = array_A.Min();
// somewhere in the loop:
normilized_featureA = (((array_A[i] - minInFeatureA) * (1 - 0)) /
                       (maxInFeatureA - minInFeatureA)) + 0;
Max() and Min() are really expensive on an array when called inside a foreach/for loop over many elements.
I suggest you to take this code: Array data normalization
and use it as
var normalizedPoints = newPoints.Select(x => x.A)
.NormalizeData(1, 1)
.ToList();
double min = newpoints.Min(p => p.A);
double max = newpoints.Max(p => p.A);
double normalizer = 1 / (max - min);
var normalizedFeatureA = newpoints.Select(p => (p.A - min) * normalizer);
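Extending the precompute-min/max idea to every feature at once, both extremes for all columns can be gathered in a single pass over the list, avoiding per-element Min()/Max() calls entirely. A sketch using the array-of-doubles model from the first answer (three features instead of nine, to keep it short; the data values are made up):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var newpoints = new List<double[]>
{
    new[] { 1.0, 10.0, 100.0 },
    new[] { 2.0, 30.0, 400.0 },
    new[] { 3.0, 20.0, 300.0 },
};

int nFeatures = newpoints[0].Length;
var min = Enumerable.Repeat(double.MaxValue, nFeatures).ToArray();
var max = Enumerable.Repeat(double.MinValue, nFeatures).ToArray();

// One pass to collect min/max per feature.
foreach (var p in newpoints)
    for (int j = 0; j < nFeatures; j++)
    {
        if (p[j] < min[j]) min[j] = p[j];
        if (p[j] > max[j]) max[j] = p[j];
    }

// Second pass to scale every feature into [0, 1].
var normalized = newpoints
    .Select(p => p.Select((v, j) => (v - min[j]) / (max[j] - min[j])).ToArray())
    .ToList();

Console.WriteLine(string.Join(",", normalized[1].Select(v => v.ToString("0.00"))));
// 0.50,1.00,1.00
```

Two passes over 100,000 points is linear work, so this should run in well under a second rather than minutes.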

How to find the item in an array that is closest to a condition

I have an array of doubles: Double[] array = new Double[5];
For example, if the array contains data like this:
{0.5 , 1.5 , 1.1 , 0.6 , 2}
How do I find the number that is closest to 1? The output should be 1.1, because it's the one that is closest to 1 in this case.
var result = source.OrderBy(x => Math.Abs(1 - x)).First();
Requires using System.Linq; at the top of the file. It's an O(n log n) solution.
Update
If you're really concerned about performance and want an O(n) solution, you can use the MinBy() extension method from the MoreLINQ library.
Or you could use the Aggregate() method:
var result = source.Aggregate(
new { val = 0d, abs = double.MaxValue },
(a, i) => Math.Abs(1 - i) > a.abs ? a : new { val = i, abs = Math.Abs(1 - i) },
a => a.val);
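For reference, the MinBy shape mentioned above now also exists as a built-in in .NET 6+ (MoreLINQ's MinBy is equivalent for older frameworks); a sketch:

```csharp
using System;
using System.Linq;

double[] source = { 0.5, 1.5, 1.1, 0.6, 2 };

// Single pass: keep the element whose distance to 1 is smallest.
double result = source.MinBy(x => Math.Abs(1 - x));

Console.WriteLine(result); // 1.1
```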
You can achieve this in a simple way using LINQ:
var closestTo1 = array.OrderBy(x => Math.Abs(x - 1)).First();
Something like this should be easy to understand by any programmer and has O(n) complexity (non-LINQ):
double minValue = array[0];
double minDifference = Math.Abs(array[0] - 1);
foreach (double val in array)
{
    double dif = Math.Abs(val - 1);
    if (dif < minDifference)
    {
        minDifference = dif;
        minValue = val;
    }
}
After this code executes, minValue will have your required value.
Code summary:
It sets the minimum value to the first element of the array, and the minimum difference to the absolute value of the first element minus 1.
The loop then linearly searches the array. If an element's difference is less than the minimum difference so far, it records a new minimum difference and minimum value.

C#: Loop to find minima of function

I currently have this function:
public double Max(double[] x, double[] y)
{
    // Get min and max of x array as integers
    int xMin = Convert.ToInt32(x.Min());
    int xMax = Convert.ToInt32(x.Max());
    // Generate a list of x values for input to Lagrange
    double i = 2;
    double xOld = Lagrange(xMin, x, y);
    double xNew = xMax;
    do
    {
        xOld = xNew;
        xNew = Lagrange(i, x, y);
        i = i + 0.01;
    } while (xOld > xNew);
    return i;
}
This will find the minimum value on a curve with decreasing slope...however, given this curve, I need to find three minima.
How can I find the three minima and output them as an array or individual variables? This curve is just an example--it could be inverted--regardless, I need to find multiple variables. So once the first min is found it needs to know how to get over the point of inflection and find the next... :/
The Lagrange function can be found here. For all practical purposes, the Lagrange function will give me f(x) when I input x; visually, it means the curve supplied by Wolfram Alpha.
The math side of this conundrum can be found here.
Possible solution?
Generate an array of input, say x[1,1.1,1.2,1.3,1.4...], get an array back from the Lagrange function. Then find the three lowest values of this function? Then get the keys corresponding to the values? How would I do this?
It's been a while since I've taken a numerical methods class, so bear with me. In short, there are a number of ways to search for the root(s) of a function, and depending on what your function is (continuous? differentiable?), you need to choose one that is appropriate.
For your problem, I'd probably start by trying to use Newton's Method to find the roots of the second degree Lagrange polynomial for your function. I haven't tested out this library, but there is a C# based numerical methods package on CodePlex that implements Newton's Method that is open source. If you wanted to dig through the code you could.
The majority of root finding methods have cousins in the broader CS topic of 'search'. If you want a really quick and dirty approach, or you have a very large search space, consider something like Simulated Annealing. Finding all of your minima isn't guaranteed but it's fast and easy to code.
Assuming you're just trying to "brute force" calculate this to a certain level of precision, you need your algorithm to basically find any value where both neighbors are greater than the current value in your loop.
To simplify this, let's just say you have an array of numbers, and you want to find the indices of the three local minima. Here's a simple algorithm to do it:
public void Test()
{
    var ys = new[] { 1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5, 4, 3, 4, 5, 4 };
    var indices = GetMinIndices(ys);
}

public List<int> GetMinIndices(int[] ys)
{
    var minIndices = new List<int>();
    for (var index = 1; index < ys.Length; index++)
    {
        var currentY = ys[index];
        var previousY = ys[index - 1];
        if (index < ys.Length - 1)
        {
            var nextY = ys[index + 1];
            if (previousY > currentY && nextY > currentY) // both neighbors are greater
                minIndices.Add(index); // add the index to the list
        }
        else // we're at the last index
        {
            if (previousY > currentY) // previous neighbor is greater
                minIndices.Add(index);
        }
    }
    return minIndices;
}
So, basically, you pass in your array of function results (ys) that you calculated for an array of inputs (xs) (not shown). What you get back from this function is the minimum indices. So, in this example, you get back 8, 14, and 17.
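To recover the x locations of the minima rather than their indices, the indices can be mapped back onto the sampling grid. A minimal self-contained sketch, assuming (hypothetically) that the ys were sampled at x = 1 + 0.01·i:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int[] ys = { 1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5, 4, 3, 4, 5, 4 };

// Hypothetical sampling grid: index i corresponds to x = 1 + 0.01 * i.
double[] xs = Enumerable.Range(0, ys.Length).Select(i => 1.0 + 0.01 * i).ToArray();

// Same local-minimum test as above, condensed: both neighbors greater,
// with the last element treated as a minimum if its predecessor is greater.
var minIndices = new List<int>();
for (int i = 1; i < ys.Length; i++)
{
    bool leftGreater = ys[i - 1] > ys[i];
    bool rightGreater = i == ys.Length - 1 || ys[i + 1] > ys[i];
    if (leftGreater && rightGreater)
        minIndices.Add(i);
}

Console.WriteLine(string.Join(", ", minIndices));                                          // 8, 14, 17
Console.WriteLine(string.Join(", ", minIndices.Select(i => xs[i].ToString("0.00"))));      // 1.08, 1.14, 1.17
```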
