calculate average function of several functions - c#

I have several ordered Lists of X/Y pairs and I want to calculate an ordered List of X/Y pairs representing the average of these lists.
All these Lists (including the "average list") will then be drawn onto a chart (see example picture below).
I have several problems:
The different lists don't have the same number of values
The X and Y values can increase and decrease and increase (and so on) (see example picture below)
I need to implement this in C#, although I guess that's not really important for the algorithm itself.
Sorry that I can't explain my problem in a more formal or mathematical way.
EDIT: I replaced the term "function" with "List of X/Y Pairs" which is less confusing.

I would use the method Justin proposes, with one adjustment. He suggests using a mapping table with fractional indices, whereas I would suggest integer indices. This might sound a little mathematical, but it's no shame to have to read the following twice (I'd have to, too). Suppose that for the point at index i in a list of pairs A you have searched for the closest point in another list B, and that closest point is at index j. To find the point in B closest to A[i+1] you should only consider points in B with an index equal to or larger than j. It will probably be j + 1, but could be j, or j + 2, j + 3 etc., but never below j. Even if the point closest to A[i+1] has an index smaller than j, you still shouldn't use that point to interpolate with, since that would result in an unexpected average and graph. I'll take a moment now to create some sample code for you. I hope you see that this optimisation makes sense.
EDIT: While implementing this, I realised that j is not only bounded from below (by the method described above), but also bounded from above. When you try the distance from A[i+1] to B[j], B[j+1], B[j+2] etc., you can stop comparing once the distance from A[i+1] to B[j+...] stops decreasing. There's no point in searching further in B. The same reasoning applies as when j was bounded from below: even if some point elsewhere in B were closer, that's probably not the point you want to interpolate with. Doing so would result in an unexpected graph, probably less smooth than you'd expect. And an extra bonus of this second bound is the improved performance. I've created the following code:
IEnumerable<Tuple<double, double>> Average(List<Tuple<double, double>> A, List<Tuple<double, double>> B)
{
    if (A == null || B == null || A.Any(p => p == null) || B.Any(p => p == null)) throw new ArgumentException();
    Func<double, double> square = d => d * d; //squares its argument
    Func<int, int, double> euclidianDistance = (a, b) => Math.Sqrt(square(A[a].Item1 - B[b].Item1) + square(A[a].Item2 - B[b].Item2)); //computes the distance from A[first argument] to B[second argument]
    int previousIndexInB = 0;
    for (int i = 0; i < A.Count; i++)
    {
        double distance = euclidianDistance(i, previousIndexInB); //distance between A[i] and B[j - 1], initially
        for (int j = previousIndexInB + 1; j < B.Count; j++)
        {
            var distance2 = euclidianDistance(i, j); //distance between A[i] and B[j]
            if (distance2 < distance) //if it's closer than the previously checked point, keep searching. Otherwise stop the search and return an interpolated point.
            {
                distance = distance2;
                previousIndexInB = j;
            }
            else
            {
                break; //don't place the yield return statement here, because that could go wrong at the end of B.
            }
        }
        yield return LinearInterpolation(A[i], B[previousIndexInB]);
    }
}
Tuple<double, double> LinearInterpolation(Tuple<double, double> a, Tuple<double, double> b)
{
    return new Tuple<double, double>((a.Item1 + b.Item1) / 2, (a.Item2 + b.Item2) / 2);
}
For your information, the function Average returns the same number of interpolated points as list A contains, which is probably fine, but you should think about this for your specific application. I've added some comments to clarify some details, and I've described all aspects of this code in the text above. I hope it's clear; otherwise feel free to ask questions.
SECOND EDIT: I misread and thought you had only two lists of points. I have created a generalised version of the function above that accepts multiple lists. It still uses only the principles explained above.
IEnumerable<Tuple<double, double>> Average(List<List<Tuple<double, double>>> data)
{
    if (data == null || data.Count < 2 || data.Any(list => list == null || list.Any(p => p == null))) throw new ArgumentException();
    Func<double, double> square = d => d * d;
    Func<Tuple<double, double>, Tuple<double, double>, double> euclidianDistance = (a, b) => Math.Sqrt(square(a.Item1 - b.Item1) + square(a.Item2 - b.Item2));
    var firstList = data[0];
    int[] previousIndices = new int[data.Count]; //per list (other than the first), the index of the point closest to the previous point firstList[i - 1], or zero while i == 0.
                                                 //Declared outside the loop so the lower bound on the search carries over between iterations.
    for (int i = 0; i < firstList.Count; i++)
    {
        var closests = new Tuple<double, double>[data.Count]; //an array of points used for caching, of which the average will be yielded.
        closests[0] = firstList[i];
        for (int listIndex = 1; listIndex < data.Count; listIndex++)
        {
            var list = data[listIndex];
            double distance = euclidianDistance(firstList[i], list[previousIndices[listIndex]]);
            for (int j = previousIndices[listIndex] + 1; j < list.Count; j++)
            {
                var distance2 = euclidianDistance(firstList[i], list[j]);
                if (distance2 < distance) //if it's closer than the previously checked point, keep searching. Otherwise stop the search and return an interpolated point.
                {
                    distance = distance2;
                    previousIndices[listIndex] = j;
                }
                else
                {
                    break;
                }
            }
            closests[listIndex] = list[previousIndices[listIndex]];
        }
        yield return new Tuple<double, double>(closests.Select(p => p.Item1).Average(), closests.Select(p => p.Item2).Average());
    }
}
Actually, doing the specific case for two lists separately might have been a good thing: it is easily explained and offers a stepping stone towards the generalised version. Furthermore, the square root could be taken out, since it doesn't change the order of the distances, just their magnitudes.
THIRD EDIT: In the comments it became clear there might be a bug. I think there are none, aside from the mentioned small bug, which shouldn't make any difference except at the end of the graphs. As proof that it actually works, this is the result (the dotted line is the average):

I'll use a metaphor of your functions being cars racing down a curvy racetrack, where you want to extract the center-line of the track given the cars' positions. Each car's position can be described as a function of time:
p1(t) = (x1(t), y1(t))
p2(t) = (x2(t), y2(t))
p3(t) = (x3(t), y3(t))
The crucial problem is that the cars are racing at different speeds, which means that p1(10) could be twice as far down the race track as p2(10). If you took a naive average of these two points, and there was a sharp curve in the track between the cars, the average may be far from the track.
If you could just transform your functions to no longer be a function of time, but a function of the distance along the track, then you would be able to do what you want.
One way you could do this would be to choose the slowest car (i.e., the one with the greatest number of samples). Then, for each sample of the slowest car's position, look at all of the other cars' paths, find the two closest points, and choose the point on the interpolated line which is closest to the slowest car's position. Then average these points together. Once you do this for all of the slow car's samples, you have an average path.
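To make that "point on the interpolated line" step concrete, here is a minimal sketch; the method name and the Tuple<double, double> point representation are my own choices, not from the answer. It projects the slow car's sample onto the segment between the two closest samples of another car's path and clamps the result to that segment.
// Sketch only: closest point on the segment a-b to a reference point p.
static Tuple<double, double> ClosestPointOnSegment(
    Tuple<double, double> p,  // the slow car's sample
    Tuple<double, double> a,  // first of the two nearest samples on another car's path
    Tuple<double, double> b)  // second of the two nearest samples
{
    double dx = b.Item1 - a.Item1, dy = b.Item2 - a.Item2;
    double lengthSquared = dx * dx + dy * dy;
    if (lengthSquared == 0) return a; // degenerate segment: both samples coincide
    // Project p onto the line through a and b, then clamp to the segment.
    double t = ((p.Item1 - a.Item1) * dx + (p.Item2 - a.Item2) * dy) / lengthSquared;
    t = Math.Max(0, Math.Min(1, t));
    return Tuple.Create(a.Item1 + t * dx, a.Item2 + t * dy);
}
Averaging these projected points over all cars, for each of the slow car's samples, gives the average path described above.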
I'm assuming that all of the cars start and end in roughly the same places; if any of the cars just race a small portion of the track, you will need to add some more logic to detect that.
A possible improvement (for both performance and accuracy), is to keep track of the most recent sample you are using for each car and the speed of each car (the relative sampling rate). For your slowest car, it would be a simple map: 1 => 1, 2 => 2, 3 => 3, ... For the other cars, though, it could be more like: 1 => 0.3, 2 => 0.7, 3 => 1.6 (fractional values are due to interpolation). The speed would be the inverse of the change in sample number (e.g., the slow car would have speed 1, and the other car would have speed 1/(1.6-0.7)=1.11). You could then ensure that you don't accidentally backtrack on any of the cars. You could also improve the calculation speed because you don't have to search through the whole set of all points on each path; instead, you can assume that the next sample will be somewhere close to the current sample plus 1/speed.

As these are not y=f(x) functions, are they perhaps something like (x,y)=f(t)?
If so, you could interpolate along t, and calculate avg(x) and avg(y) for each t.
EDIT This of course assumes that t can be made available to your code - so that you have an ordered list of T/X/Y triples.
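A minimal sketch of that idea, assuming each list is available as ordered (t, x, y) triples; the names and the value-tuple shape are mine, and the interpolation is linear in t.
// Sketch: linearly interpolate one ordered list of (t, x, y) triples at time t,
// then average the interpolated positions of all lists at that t.
static (double X, double Y) Sample(List<(double T, double X, double Y)> curve, double t)
{
    int i = curve.FindIndex(p => p.T >= t);
    if (i == 0) { var f = curve[0]; return (f.X, f.Y); }               // before the first sample
    if (i < 0)  { var l = curve[curve.Count - 1]; return (l.X, l.Y); } // after the last sample
    var a = curve[i - 1];
    var b = curve[i];
    double w = (t - a.T) / (b.T - a.T);
    return (a.X + w * (b.X - a.X), a.Y + w * (b.Y - a.Y));
}

static (double X, double Y) AverageAt(List<List<(double T, double X, double Y)>> curves, double t)
{
    var samples = curves.Select(c => Sample(c, t)).ToList();
    return (samples.Average(s => s.X), samples.Average(s => s.Y));
}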

There are several ways this can be done. One is to combine all of your data into one single set of points, and do a best-fit curve through the combined set.

You have e.g. 2 "functions" with
fc1 = { {1, 0.3} {2, 0.5} {3, 0.1} }
fc2 = { {1, 0.1} {2, 0.8} {3, 0.4} }
You want the arithmetic mean (slang: "average") of the two functions. To do this you just calculate the pointwise arithmetic mean:
fc3 = { {1, (0.3+0.1)/2} ... }
Optimization:
If you have large numbers of points you should first convert your "ordered List of X/Y Pairs" into a Matrix OR at least store the points column-wise like so:
{0.3, 0.1}, {0.5, 0.8}, {0.1, 0.4}
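A minimal sketch of that pointwise mean, assuming both lists share the same x values in the same order (as fc1 and fc2 above do); the method name is mine.
// Sketch: pointwise arithmetic mean of two "functions" given as ordered (x, y) pairs
// that share the same x values. Throws if the x values don't line up.
static List<Tuple<double, double>> PointwiseMean(
    List<Tuple<double, double>> fc1, List<Tuple<double, double>> fc2)
{
    return fc1.Zip(fc2, (a, b) =>
    {
        if (a.Item1 != b.Item1) throw new ArgumentException("x values must match");
        return Tuple.Create(a.Item1, (a.Item2 + b.Item2) / 2);
    }).ToList();
}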


Distribute quantities into buckets - Not evenly

I've been searching around for a solution to this, but I think because of how I'm thinking about it, my search phrases might be a bit loaded in favor of topics that aren't completely relevant.
I have a number, say 950,000. This represents an inventory of [widgets] within an entire system. I have about 200 "buckets" that should each receive a portion of this inventory such that there are no widgets left over.
What I would like to happen is for each bucket to receive different amounts. I don't have any solid code to show right now, but here's some pseudo code to illustrate what I've been thinking:
//List<BucketObject> _buckets is a collection of "buckets", each of which has a "Quantity" property for holding these numbers.
int _widgetCnt = 950000;
int _bucketCnt = _buckets.Count; //LINQ
//To start, each bucket receives (_widgetCnt / _bucketCnt) or 4750.
for (int _b = 0; _b < _bucketCnt - 1; _b++)
{
    int _rndAmt = _rnd.Next(1, _buckets[_b].Quantity / 2); //Take SOME from this bucket...
    int _rndBucket = _rnd.Next(0, _bucketCnt - 1); //Get a random bucket index from the List<BucketObject> collection.
    _buckets[_rndBucket].Quantity += _rndAmt;
    _buckets[_b].Quantity -= _rndAmt;
}
Is this a statistically/mathematically proper way to handle this, or is there a distribution formula out there that handles this? The kicker is that while this pseudo code would run 200 times (so each bucket has a chance to alter its quantities) it would have to run X number of times depending on the TYPE of widget (which currently stands at just 11 flavors, but is expected to expand significantly in the future).
{EDIT}
This system is for a commodity trading game. Quantities at the 200 shops must differ because the inventory will determine the price at that station. The distro can't be even because that would make all prices the same. Over time, prices will naturally get out of balance, but the inventory must start out off-balance. And all inventories have to be at least similar in scope (ie, no one shop can have 1 item, and another have 900,000)
Sure, there is a solution. You could use the Dirichlet distribution for such a task. A property of the distribution is that
Σᵢ xᵢ = 1
So the solution would be to sample 200 (equal to the number of buckets) random values from the Dirichlet distribution, then multiply each value by 950,000 (or whatever the total inventory is), and that would give you the number of items per bucket. If you want non-uniform sampling, you could tweak the alpha in the Dirichlet sampling.
Items per bucket will have to be rounded up/down, of course, but that is pretty trivial.
I have Dirichlet sampling in C# somewhere, if you struggle to implement it - tell me and I would dig it out
UPDATE
I found some code, .NET Core 2, below is the excerpt. I used it to sample Dirichlet random numbers with the same alphas; making them all different is trivial.
//
// Dirichlet sampling, using Gamma sampling from Math .NET
//
using MathNet.Numerics.Distributions;
using MathNet.Numerics.Random;

static void SampleDirichlet(double alpha, double[] rn)
{
    if (rn == null)
        throw new ArgumentException("SampleDirichlet:: Results placeholder is null");
    if (alpha <= 0.0)
        throw new ArgumentException($"SampleDirichlet:: alpha {alpha} is non-positive");
    int n = rn.Length;
    if (n == 0)
        throw new ArgumentException("SampleDirichlet:: Results placeholder is of zero size");

    var gamma = new Gamma(alpha, 1.0);
    double sum = 0.0;
    for (int k = 0; k != n; ++k)
    {
        double v = gamma.Sample();
        sum += v;
        rn[k] = v;
    }
    if (sum <= 0.0)
        throw new ApplicationException($"SampleDirichlet:: sum {sum} is non-positive");

    // normalize
    sum = 1.0 / sum;
    for (int k = 0; k != n; ++k)
    {
        rn[k] *= sum;
    }
}
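A hedged sketch of the remaining step described above: scale the Dirichlet sample by the total inventory, round, and push any rounding remainder into one bucket so the total still matches. The helper name is mine.
// Sketch: turn a Dirichlet sample into integer bucket quantities that sum to the total.
static int[] DistributeInventory(double[] dirichletSample, int totalWidgets)
{
    var quantities = dirichletSample
        .Select(share => (int)Math.Round(share * totalWidgets))
        .ToArray();
    // Rounding can leave a small surplus or deficit; absorb it in the last bucket.
    quantities[quantities.Length - 1] += totalWidgets - quantities.Sum();
    return quantities;
}

// Usage with the SampleDirichlet method above:
// var shares = new double[200];
// SampleDirichlet(1.0, shares);
// var perBucket = DistributeInventory(shares, 950000);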

How to best implement K-nearest neighbours in C# for large number of dimensions?

I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples each, and 25 dimensions.
There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation :
// testSamples and trainSamples consist of about 20k vectors each with 25 dimensions
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int> trainClasses, int K)
{
    Console.WriteLine("Performing KNN with K = " + K);
    var testResults = new int[testSamples.Count()];
    var testNumber = testSamples.Count();
    var trainNumber = trainSamples.Count();
    // Declaring these here so that I don't have to 'new' them over and over again in the main loop,
    // just to save some overhead
    var distances = new double[trainNumber][];
    for (var i = 0; i < trainNumber; i++)
    {
        distances[i] = new double[2]; // Will store both distance and index in here
    }

    // Performing KNN ...
    for (var tst = 0; tst < testNumber; tst++)
    {
        // For every test sample, calculate distance from every training sample
        Parallel.For(0, trainNumber, trn =>
        {
            var dist = GetDistance(testSamples[tst], trainSamples[trn]);
            // Storing distance as well as index
            distances[trn][0] = dist;
            distances[trn][1] = trn;
        });

        // Sort distances and take top K (?What happens in case of multiple points at the same distance?)
        var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);

        // Do a 'majority vote' to classify test sample
        var yea = 0.0;
        var nay = 0.0;
        foreach (var voter in votingDistances)
        {
            if (trainClasses[(int)voter[1]] == 1)
                yea++;
            else
                nay++;
        }
        if (yea > nay)
            testResults[tst] = 1;
        else
            testResults[tst] = 0;
    }
    return testResults;
}
// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
    var distance = 0.0;
    // assume sample1 and sample2 are valid i.e. same length
    for (var i = 0; i < sample1.Count; i++)
    {
        var temp = sample1[i] - sample2[i];
        distance += temp * temp;
    }
    return distance;
}
This takes quite a bit of time to execute. On my system it takes about 80 seconds to complete. How can I optimize this, while ensuring that it will also scale to a larger number of data samples? As you can see, I've tried using PLINQ and parallel for loops, which did help (without these, it was taking about 120 seconds). What else can I do?
I've read about KD-trees being efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.
I also found this stackoverflow discussion about this, but it seems like this is 3 years old, and I was hoping that someone would know about better solutions to this problem by now.
I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.
Edited to add - I cannot reduce the number of dimensions using PCA or something. For this particular model, 25 dimensions are required.
Whenever you are attempting to improve the performance of code, the first step is to analyze the current performance to see exactly where it is spending its time. A good profiler is crucial for this. In my previous job I was able to use the dotTrace profiler to good effect; Visual Studio also has a built-in profiler. A good profiler will tell you exactly where your code is spending time, method-by-method or even line-by-line.
That being said, a few things come to mind in reading your implementation:
You are parallelizing some inner loops. Could you parallelize the outer loop instead? There is a small but nonzero cost associated with a delegate call (see here or here) which may be hitting you in the "Parallel.For" callback.
Similarly there is a small performance penalty for indexing through an array using its IList interface. You might consider declaring the array arguments to "GetDistance()" explicitly.
How large is K as compared to the size of the training array? You are completely sorting the "distances" array and taking the top K, but if K is much smaller than the array size it might make sense to use a partial sort / selection algorithm, for instance by keeping a SortedSet of the K best candidates and removing the largest element whenever the set size exceeds K; a rough sketch follows below.
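A rough sketch of that selection step, assuming a flat array of distances like the one built above; the (distance, index) tuple key is my own choice so that equal distances don't collide in the set.
// Sketch: keep only the K nearest training samples while scanning the distances,
// instead of sorting the whole array.
static IEnumerable<int> KNearestIndices(double[] distances, int k)
{
    var best = new SortedSet<(double Distance, int Index)>();
    for (int i = 0; i < distances.Length; i++)
    {
        best.Add((distances[i], i));
        if (best.Count > k)
            best.Remove(best.Max); // drop the farthest of the current candidates
    }
    return best.Select(entry => entry.Index);
}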

Alternatives to nested Select in Linq

Working on a clustering project, I stumbled upon this, and I'm trying to figure out if there's a better solution than the one I've come up with.
PROBLEM: Given a List<Point> Points of points in R^n (you can think of every Point as a double array of dimension n), a double minDistance and a distance Func<Point, Point, double> dist, write a LINQ expression that returns, for each point, the set of other points in the list that are closer to it than minDistance according to dist.
My solution is the following:
var lst = Points.Select(
x => Points.Where(z => dist(x, z) < minDistance)
.ToList() )
.ToList();
So, after noticing that
Using LINQ is probably not the best idea, because you get to calculate every distance twice
The problem doesn't have much practical use
My code, even if bad looking, works
I have the following questions:
Is it possible to translate my code in query expression? and if so, how?
Is there a better way to solve this in dot notation?
The problem definition, that you want "for each point, the set of other points", makes it impossible to solve without the inner query; you could only disguise it in a clever manner. If you could change your data storage policy and not stick to LINQ then, in general, there are many approaches to the Nearest Neighbour Search problem. You could for example hold the points sorted according to their values on one axis, which can speed up the queries for neighbours by eliminating some candidates early, without a full distance calculation. Here is a paper with this approach: Flexible Metric Nearest Neighbor Classification.
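Not LINQ any more, but a rough sketch of that axis-based pruning, assuming Point is the double[] described in the problem and that dist is the ordinary Euclidean distance (so a gap along the first coordinate alone can already rule a pair out). All names are mine, and the result follows the sorted order rather than the original list order.
// Sketch: sort points by their first coordinate; for each point, only points whose
// first coordinate differs by less than minDistance can possibly be within minDistance,
// so the inner scan can stop early.
static List<List<double[]>> Neighbours(
    List<double[]> points, double minDistance, Func<double[], double[], double> dist)
{
    var sorted = points.OrderBy(p => p[0]).ToList();
    var result = new List<List<double[]>>();
    for (int i = 0; i < sorted.Count; i++)
    {
        var close = new List<double[]>();
        // scan outwards in both directions until the gap on the first axis alone rules points out
        for (int j = i - 1; j >= 0 && sorted[i][0] - sorted[j][0] < minDistance; j--)
            if (dist(sorted[i], sorted[j]) < minDistance) close.Add(sorted[j]);
        for (int j = i + 1; j < sorted.Count && sorted[j][0] - sorted[i][0] < minDistance; j++)
            if (dist(sorted[i], sorted[j]) < minDistance) close.Add(sorted[j]);
        result.Add(close);
    }
    return result;
}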
Because Points is a List you can take advantage of the fact that you can access each item by its index. So you can avoid comparing each pair twice with something like this:
var lst =
    from i in Enumerable.Range(0, Points.Count)
    from j in Enumerable.Range(i + 1, Points.Count - i - 1)
    where dist(Points[i], Points[j]) < minDistance
    select new
    {
        x = Points[i], y = Points[j]
    };
This will return a set composed of all pairs of points within minDistance of each other, but that's not exactly the result you wanted. If you want to turn it into some kind of Lookup so you can see which points are close to a given point, you can do this:
var lst =
    (from i in Enumerable.Range(0, Points.Count)
     from j in Enumerable.Range(i + 1, Points.Count - i - 1)
     where dist(Points[i], Points[j]) < minDistance
     select new { x = Points[i], y = Points[j] })
    .SelectMany(pair => new[] { pair, new { x = pair.y, y = pair.x } })
    .ToLookup(pair => pair.x, pair => pair.y);
I think you could add a bool property to your Point class to mark that it has been browsed, to prevent calling dist twice, something like this:
public class Point {
    //....
    public bool IsBrowsed { get; set; }
}

var lst = Points.Select(
    x => {
        var list = Points.Where(z => !z.IsBrowsed && dist(x, z) < minDistance).ToList();
        x.IsBrowsed = true;
        return list;
    })
    .ToList();

Looking for a way to optimize this algorithm for parsing a very large string

The following class parses through a very large string (an entire novel of text) and breaks it into consecutive 4-character strings that are stored as a Tuple. Then each tuple can be assigned a probability based on a calculation. I am using this as part of a monte carlo/ genetic algorithm to train the program to recognize a language based on syntax only (just the character transitions).
I am wondering if there is a faster way of doing this. It takes about 400ms to look up the probability of any given 4-character tuple. The relevant method _Probability() is at the end of the class.
This is a computationally intensive problem related to another post of mine: Algorithm for computing the plausibility of a function / Monte Carlo Method
Ultimately I'd like to store these values in a 4d-matrix. But given that there are 26 letters in the alphabet that would be a HUGE task. (26x26x26x26). If I take only the first 15000 characters of the novel then performance improves a ton, but my data isn't as useful.
Here is the method that parses the text 'source':
private List<Tuple<char, char, char, char>> _Parse(string src)
{
    var _map = new List<Tuple<char, char, char, char>>();
    for (int i = 0; i < src.Length - 3; i++)
    {
        int j = i + 1;
        int k = i + 2;
        int l = i + 3;
        _map.Add
            (new Tuple<char, char, char, char>(src[i], src[j], src[k], src[l]));
    }
    return _map;
}
And here is the _Probability method:
private double _Probability(char x0, char x1, char x2, char x3)
{
    var subset_x0 = map.Where(x => x.Item1 == x0);
    var subset_x0_x1_following = subset_x0.Where(x => x.Item2 == x1);
    var subset_x0_x2_following = subset_x0_x1_following.Where(x => x.Item3 == x2);
    var subset_x0_x3_following = subset_x0_x2_following.Where(x => x.Item4 == x3);

    int count_of_x0 = subset_x0.Count();
    int count_of_x1_following = subset_x0_x1_following.Count();
    int count_of_x2_following = subset_x0_x2_following.Count();
    int count_of_x3_following = subset_x0_x3_following.Count();

    decimal p1;
    decimal p2;
    decimal p3;
    if (count_of_x0 <= 0 || count_of_x1_following <= 0 || count_of_x2_following <= 0 || count_of_x3_following <= 0)
    {
        p1 = e;
        p2 = e;
        p3 = e;
    }
    else
    {
        p1 = (decimal)count_of_x1_following / (decimal)count_of_x0;
        p2 = (decimal)count_of_x2_following / (decimal)count_of_x1_following;
        p3 = (decimal)count_of_x3_following / (decimal)count_of_x2_following;
        p1 = (p1 * 100) + e;
        p2 = (p2 * 100) + e;
        p3 = (p3 * 100) + e;
    }
    //more calculations omitted
    return _final;
}
}
EDIT - I'm providing more details to clear things up,
1) Strictly speaking I've only worked with English so far, but it's true that different alphabets will have to be considered. Currently I only want the program to recognize English, similar to what's described in this paper: http://www-stat.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf
2) I am calculating the probabilities of n-tuples of characters where n <= 4. For instance if I am calculating the total probability of the string "that", I would break it down into these independent tuples and calculate the probability of each individually first:
[t][h]
[t][h][a]
[t][h][a][t]
[t][h] is given the most weight, then [t][h][a], then [t][h][a][t]. Since I am not just looking at the 4-character tuple as a single unit, I wouldn't be able to just divide the instances of [t][h][a][t] in the text by the total no. of 4-tuples in the text.
The value assigned to each 4-tuple can't overfit to the text, because by chance many real English words may never appear in the text and they shouldn't get disproportionately low scores. Emphasising first-order character transitions (2-tuples) ameliorates this issue. Moving to the 3-tuple and then the 4-tuple just refines the calculation.
I came up with a Dictionary that simply tallies the count of how often each tuple occurs in the text (similar to what Vilx suggested), rather than repeating identical tuples, which is a waste of memory. That got me from about ~400ms per lookup to about ~40ms per lookup, which is a pretty great improvement. I still have to look into some of the other suggestions, however.
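That tally can be as simple as the following rough sketch (the method name _Tally is mine, not from the post):
// Sketch: count how often each 4-character tuple occurs, instead of storing
// every occurrence separately.
private Dictionary<Tuple<char, char, char, char>, int> _Tally(string src)
{
    var counts = new Dictionary<Tuple<char, char, char, char>, int>();
    for (int i = 0; i < src.Length - 3; i++)
    {
        var key = Tuple.Create(src[i], src[i + 1], src[i + 2], src[i + 3]);
        int current;
        counts.TryGetValue(key, out current); // current stays 0 if the key is new
        counts[key] = current + 1;
    }
    return counts;
}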
In your probability method you are iterating the map 8 times. Each of your Where clauses iterates the entire list, and so does each Count. Adding a .ToList() at the end would (potentially) speed things up. That said, I think your main problem is that the structure you've chosen to store the data in is not suited for the purpose of the probability method. You could create a one-pass version where the structure you store your data in calculates the tentative distribution on insert. That way, when you're done with the insert (which shouldn't be slowed down too much) you're done, or you could, as the code below does, do a cheap calculation of the probability when you need it.
As an aside, you might want to take punctuation and whitespace into account. The first letter/word of a sentence and the first letter of a word give clear indications of what language a given text is written in; by taking punctuation characters and whitespace as part of your distribution you include those characteristics of the sample data. We did that some years back. Doing that, we showed that using just three characters was almost as accurate as using more (we tested up to 7): we had no failures with three on our test data, and "almost as accurate" is an assumption, given that there must be some weird text where the lack of information would yield an incorrect result. But the speed of three letters made that the best choice.
EDIT
Here's an example of how I think I would do it in C#
class TextParser{
    private Node Parse(string src){
        var top = new Node(null);
        for (int i = 0; i < src.Length - 3; i++){
            var first = src[i];
            var second = src[i+1];
            var third = src[i+2];
            var fourth = src[i+3];
            var firstLevelNode = top.AddChild(first);
            var secondLevelNode = firstLevelNode.AddChild(second);
            var thirdLevelNode = secondLevelNode.AddChild(third);
            thirdLevelNode.AddChild(fourth);
        }
        return top;
    }
}
public class Node{
    private readonly Node _parent;
    private readonly Dictionary<char,Node> _children
        = new Dictionary<char, Node>();
    private int _count;

    public Node(Node parent){
        _parent = parent;
    }

    public Node AddChild(char value){
        if (!_children.ContainsKey(value))
        {
            _children.Add(value, new Node(this));
        }
        var levelNode = _children[value];
        levelNode._count++;
        return levelNode;
    }

    public decimal Probability(string substring){
        var node = this;
        foreach (var c in substring){
            if(!node.Contains(c))
                return 0m;
            node = node[c];
        }
        return ((decimal) node._count)/node._parent._children.Count;
    }

    public Node this[char value]{
        get { return _children[value]; }
    }

    private bool Contains(char c){
        return _children.ContainsKey(c);
    }
}
the usage would then be:
var top = Parse(src);
top.Probability("test");
I would suggest changing the data structure to make that faster...
I think a Dictionary<char,Dictionary<char,Dictionary<char,Dictionary<char,double>>>> would be much more efficient, since you would be accessing each "level" (Item1...Item4) when calculating... and you would cache the result in the innermost Dictionary so next time you don't have to calculate at all.
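A hedged sketch of what that caching could look like, wrapping the existing _Probability method; the field and method names are mine, and whether memoisation pays off depends on how often the same 4-tuples are requested.
// Sketch: memoise _Probability results in nested dictionaries keyed by the four
// characters, so each distinct 4-tuple is only computed once.
private readonly Dictionary<char, Dictionary<char, Dictionary<char, Dictionary<char, double>>>> _cache
    = new Dictionary<char, Dictionary<char, Dictionary<char, Dictionary<char, double>>>>();

private double _CachedProbability(char x0, char x1, char x2, char x3)
{
    Dictionary<char, Dictionary<char, Dictionary<char, double>>> level1;
    if (!_cache.TryGetValue(x0, out level1))
        _cache[x0] = level1 = new Dictionary<char, Dictionary<char, Dictionary<char, double>>>();
    Dictionary<char, Dictionary<char, double>> level2;
    if (!level1.TryGetValue(x1, out level2))
        level1[x1] = level2 = new Dictionary<char, Dictionary<char, double>>();
    Dictionary<char, double> level3;
    if (!level2.TryGetValue(x2, out level3))
        level2[x2] = level3 = new Dictionary<char, double>();
    double p;
    if (!level3.TryGetValue(x3, out p))
        level3[x3] = p = _Probability(x0, x1, x2, x3); // compute once, then reuse
    return p;
}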
Ok, I don't have time to work out details, but this really calls for
neural classifier nets (Just take any off the shelf, even the Controllable Regex Mutilator would do the job with way more scalability) -- heuristics over brute force
you could use tries (Patricia tries, a.k.a. radix trees) to make a space-optimised version of your data structure that can be sparse (the Dictionary of Dictionaries of Dictionaries of Dictionaries... looks like an approximation of this to me)
There's not much you can do with the parse function as it stands. However, the tuples appear to be four consecutive characters from a large body of text. Why not just replace the tuple with an int and then use the int to index the large body of text when you need the character values. Your tuple based method is effectively consuming four times the memory the original text would use, and since memory is usually the bottleneck to performance, it's best to use as little as possible.
You then try to find the number of matches in the body of text against a set of characters. I wonder how a straightforward linear search over the original body of text would compare with the linq statements you're using? The .Where will be doing memory allocation (which is a slow operation) and the linq statement will have parsing overhead (but the compiler might do something clever here). Having a good understanding of the search space will make it easier to find an optimal algorithm.
But then, as has been mentioned in the comments, using a 26^4 matrix would be the most efficient. Parse the input text once and create the matrix as you parse. You'd probably want a set of dictionaries:
SortedDictionary <int,int> count_of_single_letters; // key = single character
SortedDictionary <int,int> count_of_double_letters; // key = char1 + char2 * 32
SortedDictionary <int,int> count_of_triple_letters; // key = char1 + char2 * 32 + char3 * 32 * 32
SortedDictionary <int,int> count_of_quad_letters; // key = char1 + char2 * 32 + char3 * 32 * 32 + char4 * 32 * 32 * 32
Finally, a note on data types. You're using the decimal type. This is not an efficient type, as there is no direct mapping to a CPU-native type and there is overhead in processing the data. Use a double instead; I think the precision will be sufficient. The most precise way would be to store each probability as two integers, the numerator and denominator, and then do the division as late as possible.
The best approach here is to use sparse storage and pruning, after every 10,000 characters for example. The best storage structure in this case is a prefix tree; it will allow fast probability calculation, updating and sparse storage. You can find more theory in this javadoc: http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/NGramProcessLM.html

how to search a list for items with a distance lower than F to P without searching the entire list?

I have to search a list of structs for every item whose XZPos is closer to a Vector2 (or PointF) P than some distance F. The list is ordered by XZPos' x and y. It'll look something like this:
Item 1 (XZPos: 0,0)
Item 2 (XZPos: 0,1)
Item 3 (XZPos: 0,2)
...
Item 12 (XZPos: 1,0)
Item 13 (XZPos: 1,1)
Item 14 (XZPos: 1,2)
...2.249.984 elements later...
Now I have a point P (4,4) and I want a list of every item in the above list that is closer to P than 5,66f. My algorithm searches every item in the list like this:
List<Node> res = new List<Node>();
for (int i = 0; i < Map.Length; i++)
{
    Vector2 xzpos = new Vector2(Map[i].X, Map[i].Z); //Map is the list containing 2.250.000 elements
    //neighbourlength = 5,66f in our example
    if ((xzpos - pos).Length() <= neighbourlength && xzpos != pos) //looking for every item except the item which is P itself, therefore checking whether "xzpos != pos"
    {
        res.Add(new Node() { XZPos = xzpos, /*Other fields*/ });
    }
}
return res.ToArray();
My problem is that it takes way too long to complete, and now I'm looking for a way to find the fields I'm looking for without searching the entire list. 22 seconds for a search is TOO LONG. If someone could help me get it to 1 or 2 seconds that would be very nice.
Thanks for your help, Alex
Your list is sorted, and you can use that to shrink the problem space. Instead of searching the whole list, search the subset of the list that spans x values within 5,66f, since anything that is farther than the maximum on one coordinate will be farther than you want no matter what the other coordinate is. Then if you store the start positions of each value in the list (i.e. in your example, "0" elements start at 1, and "1" elements start at 12), you can quickly get to the part of the list you care about. So instead of iterating through items 0 to 2 million, you could instead be iterating through items 175839 to 226835.
Example:
The List
1: (0,1)
2: (0,2)
3: (1,0)
4: (1,1)
5: (2,1)
6: (2,3)
7: (3,1)
8: (3,5)
9: (4,2)
10: (4,5)
11: (5,1)
12: (5,2)
13: (6,1)
14: (6,2)
15: (6,3)
16: (7,1)
17: (7,2)
Start Locations
(0,1)
(1,3)
(2,5)
(3,7)
(4,9)
(5,11)
(6,13)
(7,16)
If I have a point (3,5) and I want to search the list for points within 2 of it, I only need to iterate through the points where x is between 1 and 5. So I look at my start locations, and see that 1 starts at position 3 in the list, and 5 ends at position (13 - 1). So instead of iterating from 1 to 17, I only need to iterate from 3 to 12. If the range of values in your data is large but the distance to check is short, this will reduce the number of entries you need to iterate across greatly.
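A rough sketch of that start-position idea applied to the question's Map; it assumes Map is sorted by X and that X values are integers as in the example. startIndex and the band bounds are my own names; Map, pos, neighbourlength, Vector2 and Node follow the question.
// Sketch: record where each integer X value starts in the X-sorted Map, so a radius
// query only scans the band of X values that can possibly be within reach.
var startIndex = new SortedDictionary<int, int>();
for (int i = 0; i < Map.Length; i++)
{
    int x = (int)Map[i].X;
    if (!startIndex.ContainsKey(x))
        startIndex[x] = i; // first entry with this X value
}

int minX = (int)Math.Floor(pos.X - neighbourlength);
int maxX = (int)Math.Ceiling(pos.X + neighbourlength);

// First index inside the X band, and the first index past it.
int from = startIndex.Where(kv => kv.Key >= minX).Select(kv => kv.Value).DefaultIfEmpty(Map.Length).First();
int to = startIndex.Where(kv => kv.Key > maxX).Select(kv => kv.Value).DefaultIfEmpty(Map.Length).First();

var res = new List<Node>();
for (int i = from; i < to; i++)
{
    var xzpos = new Vector2(Map[i].X, Map[i].Z);
    if ((xzpos - pos).Length() <= neighbourlength && xzpos != pos)
        res.Add(new Node() { XZPos = xzpos });
}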
There are a few assumptions to the solution below. These assumptions are not stated in your question, so the solution might need adjustment. If the assumptions hold, this solution is extremely fast, but it requires you to keep the data in a different structure. However, if you can change the structure, then this will find each valid point in (almost) constant time. That is, the time required to find the N valid points in a set containing M points in total depends only on N.
The assumptions are:
Only positive integers are used for values of X and Y
No duplicate points in the list
private IEnumerable<KeyValuePair<int, int>> GetPointsByDistance(int xOrigin, int yOrigin, double maxDistance){
    var maxDist = (int) Math.Floor(maxDistance);
    //Find the lowest possible value for X
    var minX = Math.Max(0, xOrigin - maxDist);
    //Find the highest possible value for X
    var maxX = Math.Min(MaxX, xOrigin + maxDist);
    for (var x = minX; x <= maxX; ++x){
        //Get the possible values for Y with the current X
        var ys = points[x];
        if (ys.Length > 0){
            //Calculate the max delta for Y, given the delta in X
            var xDist = x - xOrigin;
            var maxYDist = (int)Math.Floor(Math.Sqrt(maxDistance*maxDistance - xDist*xDist));
            //Find the lowest possible Y for the current X
            var minY = Math.Max(0, yOrigin - maxYDist);
            //Find the highest possible Y for the current X
            var maxY = Math.Min(ys.Length - 1, yOrigin + maxYDist);
            for (var y = minY; y <= maxY; ++y){
                //The value in the array will be true if a point with the combination (x,y) exists
                if (ys[y]){
                    yield return new KeyValuePair<int, int>(x, y);
                }
            }
        }
    }
}
The internal representation of the points in this code is bool[][]. The value will be true if the indices of the two arrays form a point included in the set. E.g. points[0][1] would be true if the set was the six points you've included in the post, and points[0][3] would be false.
If the assumptions are not met the same idea can still be used but will need tweaking. If you post what the assumptions should be, I'll update the answer to the question.
An assumption that does not have to be met is that the points are close to sequential. If they are not, this will be rather greedy when it comes to memory. Changes can be made if the points are not (close to) sequential; again, post the invariants if my assumptions are wrong and I'll update.
I have no idea what your original code is doing. I see you've defined x and y and never use them, and you refer to pos but it's not defined. So instead I'm going to take the verbal description of the problem you're trying to solve
Now I have a point P (4,4) and I want a list of structs in the above list of every item closer to P than 5,66f.
and solve it. This is easy:
var nearbyNeighbors = Map.Where(
point => (point.X - 4) * (point.X - 4) +
(point.Z - 4) * (point.Z - 4) <
5.66f * 5.66f
);
foreach(var nearbyNeighbor in nearbyNeighbors) {
// do something with nearbyNeighbor
}
Here, I am assuming that you are using the standard Euclidean norm.
To utilize the fact that the collection is lexicographically sorted:
Binary search until |point.X - 4| >= 5.66. This splits the list into two contiguous sublists. Throw out the sublist with |point.X - 4| >= 5.66. From the remaining sublist, binary search until |point.Z - 4| >= 5.66. This splits the remaining sublist into two contiguous sublists. Throw out the sublist with |point.Z - 4| >= 5.66. Then search linearly through the remaining sublist.
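A minimal sketch of the lower-bound binary search used for the first step; the generic helper is my own and assumes the list is sorted by the key being searched.
// Sketch: return the first index whose key is >= value in a list sorted by that key.
static int LowerBound<T>(IList<T> items, Func<T, float> key, float value)
{
    int lo = 0, hi = items.Count;
    while (lo < hi)
    {
        int mid = (lo + hi) / 2;
        if (key(items[mid]) < value) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

// Usage against the question's Map, sorted by X:
// var first = LowerBound(Map, item => item.X, 4f - 5.66f); // start of the candidate X band
// var last  = LowerBound(Map, item => item.X, 4f + 5.66f); // first index past the band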
This problem is related to this:
http://en.wikipedia.org/wiki/Nearest_neighbor_search
Take a look for sub-linear solutions.
Also take a look at this question
which data structure is appropriate to query "all points within distance d from point p"
