Problems measuring coordinate distances with C# LINQ - c#

I'm currently developing the server side for an application.
In the app I have a lot of interest points in a big area (over 1000 points),
and I want to find the points nearest to the user's device.
I've tried to use:
.GetDistanceTo(GeoCoordinate);
from the library:
System.Device.Location;
example query:
from point in db.Points
where (new GeoCoordinate(point.lat, point.lng)).GetDistanceTo(new GeoCoordinate(coordinates[0], coordinates[1])) < 1000
select point
but GetDistanceTo isn't supported inside the LINQ query (it can't be translated by the provider), and if I run it over a List<> or an array instead it takes too long...
How can I do this better and faster?
Thanks

I assume that you are performing this computation often; otherwise, iterating over a thousand points should not take a large amount of time - certainly under a second.
Consider caching the points in memory as GeoCoordinates, since I am guessing the bulk of the time is spent allocating memory and instantiating the objects rather than computing the distance. From the cached list of GeoCoordinates, you could then compute distances against a GeoCoordinate that is already instantiated.
Ex:
On application load, store all the points in memory, possibly on a background thread.
List<GeoCoordinate> points = (from point in db.Points select new GeoCoordinate(point.lat, point.lng)).ToList();
Then take the point you are searching around and query the cached points:
var gcSearch = new GeoCoordinate(coordinates[0], coordinates[1]);
var searchDistance = 1000;
var results = from pSearch in points
where pSearch.GetDistanceTo(gcSearch) < searchDistance
select pSearch;
If that still isn't fast enough, consider caching the last searched point and returning a known list if the new search is within the same bounds.
// in class definition
static GeoCoordinate lastSearchedPoint = null;
static List<GeoCoordinate> lastSearchedResults = null;
const int searchFudgeDistance = 100;
//in search method
var gcSearch = new GeoCoordinate(coordinates[0], coordinates[1]);
if (lastSearchedPoint != null && gcSearch.GetDistanceTo(lastSearchedPoint) < searchFudgeDistance)
return lastSearchedResults;
lastSearchedPoint = gcSearch;
var searchDistance = 1000;
var results = from pSearch in points
where pSearch.GetDistanceTo(gcSearch) < searchDistance
select pSearch;
//store the results for future searches
lastSearchedResults = results.ToList();
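If pulling the points from the database is itself the slow part, another option (a rough sketch of my own, not part of the answer above) is to push a cheap latitude/longitude bounding-box filter into the database query, so only rows near the user are loaded and the exact GetDistanceTo check runs on a much smaller set. The degree offsets below use the rough 111 km-per-degree approximation.
double radiusMeters = 1000;
double latDelta = radiusMeters / 111000.0;                                   // ~111 km per degree of latitude
double lngDelta = radiusMeters / (111000.0 * Math.Cos(coordinates[0] * Math.PI / 180.0));
// Coarse filter that the query provider can translate to SQL.
var candidates = (from point in db.Points
                  where point.lat >= coordinates[0] - latDelta
                     && point.lat <= coordinates[0] + latDelta
                     && point.lng >= coordinates[1] - lngDelta
                     && point.lng <= coordinates[1] + lngDelta
                  select point).ToList();
// Exact distance check in memory on the (much smaller) candidate list.
var gcUser = new GeoCoordinate(coordinates[0], coordinates[1]);
var nearby = candidates
    .Where(p => new GeoCoordinate(p.lat, p.lng).GetDistanceTo(gcUser) < radiusMeters)
    .ToList();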


Get sample set from large dataset

I have an in-memory dataset from which I'm trying to get an evenly distributed sample using LINQ. From what I've seen, there isn't anything that does this out of the box, so I'm trying to come up with some kind of composition or extension method that will perform the sampling.
What I'm hoping for is something that I can use like this:
var sample = dataset.Sample(100);
var smallSample = smallDataset.Sample(100);
Assert.IsTrue(dataset.Count() > 100);
Assert.IsTrue(sample.Count() == 100);
Assert.IsTrue(smallDataset.Count() < 100);
Assert.IsTrue(smallSample.Count() == smallDataset.Count());
The composition I started with, but only works some of the time is this:
var sample = dataset
.Select((v,i) => new Tuple<string, int>(v,i))
.Where(t => t.Item2 / (double)(dataset.Count() / SampleSize) % 1 != 0)
.Select(t => t.Item1);
This works when the dataset size and the sample size share a common divisor and the sample size is greater than 50% of the dataset size. Or something like that.
Any help would be excellent!
Update: So I have the following non-LINQ logic that works, but I'm trying to figure out if this can be "LINQ'd" somehow.
var sample = new List<T>();
double sampleRatio = dataset.Count() / sampleSize;
for (var i = 0; i < dataset.Count(); i++)
{
if ((sample.Count() * sampleRatio) <= i)
sample.Add(dataset.Skip(i).FirstOrDefault());
}
I can't find a satisfactory LINQ solution, mainly because LINQ statements, while iterating, are not aware of the length of the sequence they work on -- which is OK: it totally fits LINQ's deferred-execution and streaming approach. Of course it's possible to store the length in a variable and use it in a Where statement, but that's not in line with LINQ's functional (stateless) paradigm, so I always try to avoid that.
The Aggregate statement can be stateless and length-aware, but I tend to find solutions using Aggregate rather contrived and hard to read. It's nothing but a covert stateful loop; for and foreach take some more lines, but are far easier to follow.
I can offer you an extension method that does what you want:
public static IEnumerable<T> TakeProrated<T>(this IEnumerable<T> sequence, int numberOfItems)
{
var local = sequence.ToList();
var n = Math.Min(local.Count, numberOfItems);
var dist = (decimal)local.Count / n;
for (int i = 0; i < n; i++)
{
var index = (int)(Math.Ceiling(i * dist));
yield return local[index];
}
}
The idea is that the required distance between items is first calculated. Then the requested number of items is returned, each time roughly skipping this distance, sometimes more, sometimes less, but evenly distributed. Using Math.Ceiling or Math.Floor is arbitrary, they either introduce a bias toward items higher in the sequence, or lower.
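For completeness, usage mirrors the Sample calls from the question, just with the TakeProrated name:
var sample = dataset.TakeProrated(100);           // 100 evenly spread items
var smallSample = smallDataset.TakeProrated(100); // returns every item when the dataset holds fewer than 100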
I think I understand what you're looking for. From what I understand, you're looking to return only a certain quantity of entities in a dataset. As my comment to your original post asks, have you tried using the Take operator? What you're looking for is something like this.
// .Skip is optional, but you can combine it with .Take.
// Just ensure that instead of .FirstOrDefault(), you use Take(quantity)
var sample = dataSet.Skip(amt).Take(dataSet.Count() / desiredSampleSize);

Where is the likely performance bug here? [closed]

Many of the test cases are timing out. I've made sure I'm using lazy evaluation everywhere, linear (or better) routines, etc. I'm shocked that this is still not meeting the performance benchmarks.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
class Mine
{
public int Distance { get; set; } // from river
public int Gold { get; set; } // in tons
}
class Solution
{
static void Main(String[] args)
{
// helper function for reading lines
Func<string, int[]> LineToIntArray = (line) => Array.ConvertAll(line.Split(' '), Int32.Parse);
int[] line1 = LineToIntArray(Console.ReadLine());
int N = line1[0], // # of mines
K = line1[1]; // # of pickup locations
// Populate mine info
List<Mine> mines = new List<Mine>();
for(int i = 0; i < N; ++i)
{
int[] line = LineToIntArray(Console.ReadLine());
mines.Add(new Mine() { Distance = line[0], Gold = line[1] });
}
// helper function for cost of a move
Func<Mine, Mine, int> MoveCost = (mine1, mine2) =>
Math.Abs(mine1.Distance - mine2.Distance) * mine1.Gold;
int sum = 0; // running total of move costs
// all move combinations on the current list of mines,
// given by the indices of the mines
var indices = Enumerable.Range(0, N);
var moves = from i1 in indices
from i2 in indices
where i1 != i2
select new int[] { i1, i2 };
while(N != K) // while number of mines hasn't been consolidated to K
{
// get move with the least cost
var cheapest = moves.Aggregate(
(prev, cur) => MoveCost(mines[prev[0]],mines[prev[1]])
< MoveCost(mines[cur[0]], mines[cur[1]])
? prev : cur
);
int i = cheapest[0], // index of source mine of cheapest move
j = cheapest[1]; // index of destination mine of cheapest move
// add cost to total
sum += MoveCost(mines[i], mines[j]);
// move gold from source to destination
mines[j].Gold += mines[i].Gold;
// remove from moves any that had the i-th mine as a destination or source
moves = from move in moves
where move[0] == i || move[1] == i
select move;
// update size number of mines after consolidation
--N;
}
Console.WriteLine(sum);
}
}
Lazy evaluation will not make bad algorithms perform better. It will just delay when those performance problems will affect you. What lazy evaluation can help with is space complexity, i.e. reducing the amount of memory you need to execute your algorithm. Since the data is generated lazily, you will not (necessarily) have all the data in the memory at the same time.
However, relying on lazy evaluation to fix your space (or time) complexity problems can easily shoot you in the foot. Look at the following example code:
var moves = Enumerable.Range(0, 5).Select(x => {
Console.WriteLine("Generating");
return x;
});
var aggregate = moves.Aggregate((p, c) => {
Console.WriteLine("Aggregating");
return p + c;
});
var newMoves = moves.Where(x => {
Console.WriteLine("Filtering");
return x % 2 == 0;
});
newMoves.ToList();
As you can see, both the aggregate and the newMoves rely on the lazily evaluated moves enumerable. Since the original count of moves is 5, we will see 4 “Aggregating” lines in the output, and 5 “Filtering” lines. But how often do you expect “Generating” to appear in the console?
The answer is 10. This is because moves is a generator and is being evaluated lazily. When multiple places request it, an iterator will be created for each, which ultimately means that the generator will execute multiple times (to generate independent results).
This is not necessarily a problem, but in your case, it very quickly becomes one. Assume that we continue the above example code with another round of aggregating. That second aggregate will consume newMoves, which in turn will consume the original moves. So to aggregate, we will re-run the original moves generator and the newMoves generator. And if we were to add another level of filtering, the next round of aggregating would run three interlocked generators, again rerunning the original moves generator.
Since your original moves generator creates an enumerable of quadratic size, and has an actual time complexity of O(n²), this is an actual problem. With each iteration, you add another round of filtering which will be linear to the size of the moves enumerable, and you actually consume the enumerable completely for the aggregation. So you end up with O(n^2 + n^3 + n^4 + n^5 + …) which will eventually be the sum of n^j for j starting at 2 up to N-K. That is a very bad time complexity, and all just because you were trying to save memory by evaluating the moves lazily.
So the first step to make this better is to avoid lazy evaluation. You are constantly iterating moves and filtering it, so you should have it in memory. Yes, that means that you have an array of quadratic size, but you won’t actually need more than that. This also limits the time complexity you have. Yes, you still need to filter the list in linear time (so O(n²) since the size is n²) and you do that inside a loop, so you will end up with cubic time (O(n³)) but that would already be your upper limit (iterating the list a constant amount of times within the loop will only increase the time complexity by a constant, and that doesn’t matter).
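As a minimal sketch of that first step (reusing the question's variable names, with a List in the spirit of the answer), materialize the moves once and re-materialize after each filtering pass, so filters never stack up into nested generators:
List<int[]> moveList = (from i1 in indices
                        from i2 in indices
                        where i1 != i2
                        select new int[] { i1, i2 }).ToList();
// ... inside the consolidation loop, rebuild the list instead of chaining another lazy Where
// (keeping only moves that do not involve the consolidated mine, as the original comment intends):
moveList = moveList.Where(move => move[0] != i && move[1] != i).ToList();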
Once you have done that, you should consider your original problem, think about what you are actually doing. I believe you could probably reduce the computational complexity further if you use the information you have better, and use data structures (e.g. hash sets, or some graph where the move cost is already stored within) that aid you in the filtering and aggregation. I can’t give you exact ideas since I don’t know your original problem, but I’m sure there is something you can do.
Finally, if you have performance problems, always remember to profile your code. The profiler will tell you which parts of your code are the most expensive, so you can get a clear idea of what you should try to optimize and what you don't need to focus on (since optimizing already fast parts will not help you get any faster).

How to best implement K-nearest neighbours in C# for large number of dimensions?

I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples each, and 25 dimensions.
There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation :
// testSamples and trainSamples consists of about 20k vectors each with 25 dimensions
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int> trainClasses, int K)
{
Console.WriteLine("Performing KNN with K = "+K);
var testResults = new int[testSamples.Count()];
var testNumber = testSamples.Count();
var trainNumber = trainSamples.Count();
// Declaring these here so that I don't have to 'new' them over and over again in the main loop,
// just to save some overhead
var distances = new double[trainNumber][];
for (var i = 0; i < trainNumber; i++)
{
distances[i] = new double[2]; // Will store both distance and index in here
}
// Performing KNN ...
for (var tst = 0; tst < testNumber; tst++)
{
// For every test sample, calculate distance from every training sample
Parallel.For(0, trainNumber, trn =>
{
var dist = GetDistance(testSamples[tst], trainSamples[trn]);
// Storing distance as well as index
distances[trn][0] = dist;
distances[trn][1] = trn;
});
// Sort distances and take top K (?What happens in case of multiple points at the same distance?)
var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);
// Do a 'majority vote' to classify test sample
var yea = 0.0;
var nay = 0.0;
foreach (var voter in votingDistances)
{
if (trainClasses[(int)voter[1]] == 1)
yea++;
else
nay++;
}
if (yea > nay)
testResults[tst] = 1;
else
testResults[tst] = 0;
}
return testResults;
}
// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
var distance = 0.0;
// assume sample1 and sample2 are valid i.e. same length
for (var i = 0; i < sample1.Count; i++)
{
var temp = sample1[i] - sample2[i];
distance += temp * temp;
}
return distance;
}
This takes quite a bit of time to execute. On my system it takes about 80 seconds to complete. How can I optimize this, while ensuring that it will also scale to a larger number of data samples? As you can see, I've tried using PLINQ and parallel for loops, which did help (without these, it was taking about 120 seconds). What else can I do?
I've read about KD-trees being efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.
I also found this stackoverflow discussion about this, but it seems like this is 3 years old, and I was hoping that someone would know about better solutions to this problem by now.
I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.
Edited to add - I cannot reduce the number of dimensions using PCA or something. For this particular model, 25 dimensions are required.
Whenever you are attempting to improve the performance of code, the first step is to analyze the current performance to see exactly where it is spending its time. A good profiler is crucial for this. In my previous job I was able to use the dotTrace profiler to good effect; Visual Studio also has a built-in profiler. A good profiler will tell you exactly where your code is spending time, method-by-method or even line-by-line.
That being said, a few things come to mind in reading your implementation:
You are parallelizing some inner loops. Could you parallelize the outer loop instead? There is a small but nonzero cost associated with a delegate call (see here or here) which may be hitting you in the Parallel.For callback.
Similarly there is a small performance penalty for indexing through an array using its IList interface. You might consider declaring the array arguments to "GetDistance()" explicitly.
How large is K as compared to the size of the training array? You are completely sorting the "distances" array and taking the top K, but if K is much smaller than the array size it might make sense to use a partial sort / selection algorithm, for instance by using a SortedSet and replacing the smallest element when the set size exceeds K.
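To make the last two points concrete, here is a rough sketch (my own, reusing the question's variable names) of replacing the full sort with a bounded selection: keep only the K best (distance, index) pairs in a SortedSet while scanning the training set, and parallelize the outer loop over test samples instead of this inner one.
// Inside the loop over test samples (tst), instead of filling and sorting 'distances':
var best = new SortedSet<(double Dist, int Index)>(
    Comparer<(double Dist, int Index)>.Create((a, b) =>
        a.Dist != b.Dist ? a.Dist.CompareTo(b.Dist) : a.Index.CompareTo(b.Index)));
for (var trn = 0; trn < trainNumber; trn++)
{
    var dist = GetDistance(testSamples[tst], trainSamples[trn]);
    if (best.Count < K)
    {
        best.Add((dist, trn));               // still collecting the first K neighbours
    }
    else if (dist < best.Max.Dist)
    {
        best.Remove(best.Max);               // drop the current worst of the K
        best.Add((dist, trn));
    }
}
// 'best' now holds the K nearest training samples for test sample 'tst';
// vote over trainClasses[index] exactly as before.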

Segmenting GPS path data

My Problem
I have a data stream coming from a program that connects to a GPS device and an inclinometer (they are actually both stand alone devices, not a cellphone) and logs the data while the user drives around in a car. The essential data that I receive are:
Latitude/Longitude - from GPS, with a resolution of about +-5 feet,
Vehicle land-speed - from GPS, in knots, which I convert to MPH
Sequential record index - from the database, it's an auto-incrementing integer and nothing ever gets deleted,
some other stuff that isn't pertinent to my current problem.
This data gets stored in a database and read back from the database into an array. From start to finish, the recording order is properly maintained, so even though the timestamp that is recorded from the GPS device is only to 1 second precision and we sample at 5hz, the absolute value of the time is of no interest and the insertion order suffices.
In order to aid in analyzing the data, a user performs a very basic data input task of selecting the "start" and "end" of curves on the road from the collected path data. I get a map image from Google and I draw the curve data on top of it. The user zooms into a curve of interest, based on their own knowledge of the area, and clicks two points on the map. Google is actually very nice and reports where the user clicked in Latitude/Longitude rather than me having to try to backtrack it from pixel values, so the issue of where the user clicked in relation to the data is covered.
The zooming in on the curve clips the data: I only retrieve data that falls in the Lat/Lng window defined by the zoom level. Most of the time, I'm dealing with fewer than 300 data points, when a single driving session could result in over 100k data points.
I need to find the subsegment of the curve data that falls between those two click points.
What I've Tried
Originally, I took the two points that are closest to each click point and the curve was anything that fell between them. That worked until we started letting the drivers make multiple passes over the road. Typically, a driver will make 2 back-and-forth runs over an interesting piece of road, giving us 4 total passes. If you take the two closest points to the two click points, then you might end up with the first point corresponding to a datum on one pass, and the second point corresponding to a datum on a completely different pass. The points in the sequence between these two points would then extend far beyond the curve. And, even if you got lucky and all the data points found were both on the same pass, that would only give you one of the passes, and we need to collect all passes.
For a while, I had a solution that worked much better. I calculated two new sequences representing the distance from each data point to each of the click points, then the approximate second derivative of that distance, looking for the inflection points of the distance from the click point over the data points. I reasoned that the inflection point meant that the points previous to the inflection were getting closer to the click point and the points after the inflection were getting further away from the click point. Doing this iteratively over the data points, I could group the curves as I came to them.
Perhaps some code is in order (this is C#, but don't worry about replying in kind, I'm capable of reading most languages):
static List<List<LatLngPoint>> GroupCurveSegments(List<LatLngPoint> dataPoints, LatLngPoint start, LatLngPoint end)
{
var withDistances = dataPoints.Select(p => new
{
ToStart = p.Distance(start),
ToEnd = p.Distance(end),
DataPoint = p
}).ToArray();
var set = new List<List<LatLngPoint>>();
var currentSegment = new List<LatLngPoint>();
for (int i = 0; i < withDistances.Length - 2; ++i)
{
var a = withDistances[i];
var b = withDistances[i + 1];
var c = withDistances[i + 2];
// the edge of the map can clip the data, so the continuity of
// the data is not exactly mapped to the continuity of the array.
var ab = b.DataPoint.RecordID - a.DataPoint.RecordID;
var bc = c.DataPoint.RecordID - b.DataPoint.RecordID;
var inflectStart = Math.Sign(a.ToStart - b.ToStart) * Math.Sign(b.ToStart - c.ToStart);
var inflectEnd = Math.Sign(a.ToEnd - b.ToEnd) * Math.Sign(b.ToEnd - c.ToEnd);
// if we haven't started a segment yet and we aren't obviously between segments
if ((currentSegment.Count == 0 && (inflectStart == -1 || inflectEnd == -1)
// if we have started a segment but we haven't changed directions away from it
|| currentSegment.Count > 0 && (inflectStart == 1 && inflectEnd == 1))
// and we're continuous on the data collection path
&& ab == 1
&& bc == 1)
{
// extend the segment
currentSegment.Add(b.DataPoint);
}
else if (
// if we have a segment collected
currentSegment.Count > 0
// and we changed directions away from one of the points
&& (inflectStart == -1
|| inflectEnd == -1
// or we lost data continuity
|| ab > 1
|| bc > 1))
{
// clip the segment and start a new one
set.Add(currentSegment);
currentSegment = new List<LatLngPoint>();
}
}
return set;
}
This worked great until we started advising the drivers to drive around 15MPH through turns (supposedly, it helps reduce sensor error. I'm personally not entirely convinced what we're seeing at higher speed is error, but I'm probably not going to win that argument). A car traveling at 15MPH is traveling at 22fps. Sampling this data at 5hz means that each data point is about four and a half feet apart. However, our GPS unit's precision is only about 5 feet. So, just the jitter of the GPS data itself could cause an inflection point in the data at such low speeds and high sample rates (technically, at this sample rate, you'd have to go at least 35MPH to avoid this problem, but it seems to work okay at 25MPH in practice).
Also, we're probably bumping up sampling rate to 10 - 15 Hz pretty soon. You'd need to drive at about 45MPH to avoid my inflection problem, which isn't safe on most of the curves of interest. My current procedure ends up splitting the data into dozens of subsegments, over road sections that I know had only 4 passes. One section that only had 300 data points came out to 35 subsegments. The rendering of the indication of the start and end of each pass (a small icon) indicated quite clearly that each real pass was getting chopped up into several pieces.
Where I'm Thinking of Going
Find the minimum distance of all points to both the start and end click points
Find all points that are within +10 feet of that distance.
Group each set of points by data continuity, i.e. each group should be continuous in the database, because more than one point on a particular pass could fall within the distance radius.
Take the data mid-point of each of those groups for each click point as the representative start and end for each pass.
Pair up points in the two sets per click point by those that would minimize the record index distance between each "start" and "end".
Halp?!
But I had tried this once before and it didn't work very well. Step #2 can return an unreasonably large number of points if the user doesn't click particularly close to where they intend. It can return too few points if the user clicks very precisely where they intend. I'm not sure just how computationally intensive step #3 will be. And step #5 will fail if the driver were to drive over a particularly long curve and immediately turn around just after the start and end to perform the subsequent passes. We might be able to train the drivers not to do this, but I don't like taking chances on such things. So I could use some help figuring out how to clip and group this path that doubles back over itself into subsegments for passes over the curve.
Okay, so here is what I ended up doing, and it seems to work well for now. I like that it is a little simpler to follow than before. I decided that Step #4 from my question was not necessary. The exact point used as the start and end isn't critical, so I just take the first point that is within the desired radius of the first click point and the last point within the desired radius of the second point and take everything in the middle.
protected static List<List<T>> GroupCurveSegments<T>(List<T> dbpoints, LatLngPoint start, LatLngPoint end) where T : BBIDataPoint
{
var withDistances = dbpoints.Select(p => new
{
ToStart = p.Distance(start),
ToEnd = p.Distance(end),
DataPoint = p
}).ToArray();
var minToStart = withDistances.Min(p => p.ToStart) + 10;
var minToEnd = withDistances.Min(p => p.ToEnd) + 10;
bool startFound = false,
endFound = false,
oldStartFound = false,
oldEndFound = false;
var set = new List<List<T>>();
var cur = new List<T>();
foreach(var a in withDistances)
{
// save the previous values, because they
// impact the future values.
oldStartFound = startFound;
oldEndFound = endFound;
startFound =
!oldStartFound && a.ToStart <= minToStart
|| oldStartFound && !oldEndFound
|| oldStartFound && oldEndFound
&& (a.ToStart <= minToStart || a.ToEnd <= minToEnd);
endFound =
!oldEndFound && a.ToEnd <= minToEnd
|| !oldStartFound && oldEndFound
|| oldStartFound && oldEndFound
&& (a.ToStart <= minToStart || a.ToEnd <= minToEnd);
if (startFound || endFound)
{
cur.Add(a.DataPoint);
}
else if (cur.Count > 0)
{
set.Add(cur);
cur = new List<T>();
}
}
// if a data stream ended near the end of the curve,
// then the loop will not have saved the final pass.
if (cur.Count > 0)
{
    set.Add(cur);
}
return set;
}
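A hypothetical call site (the variable names here are invented for illustration) looks something like this: the clipped window points and the two map-click coordinates go in, and one list per pass comes out.
List<List<BBIDataPoint>> passes = GroupCurveSegments(clippedPoints, clickStart, clickEnd);
Console.WriteLine("Found " + passes.Count + " passes through the selected curve.");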

How to avoid OrderBy - memory usage problems

Let's assume we have a large list of points List<Point> pointList (already stored in memory) where each Point contains X, Y, and Z coordinate.
Now I would like to select, for example, the N% of points with the biggest Z-values out of all the points stored in pointList. Right now I'm doing it like this:
N = 0.05; // selecting only 5% of points
double cutoffValue = pointList
.OrderBy(p=> p.Z) // First bottleneck - creates sorted copy of all data
.ElementAt((int)(pointList.Count * (1 - N))).Z;
List<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue).ToList();
But I have two memory usage bottlenecks here: first during OrderBy (more important) and second during selecting the points (this is less important, because we usually want to select only a small number of points).
Is there any way of replacing OrderBy (or maybe other way of finding this cutoff point) with something that uses less memory?
The problem is quite important, because LINQ copies the whole dataset, and for the big files I'm processing it sometimes hits a few hundred MB.
Write a method that iterates through the list once and maintains a set of the M largest elements. Each step will only require O(log M) work to maintain the set, and you can have O(M) memory and O(N log M) running time.
public static IEnumerable<TSource> TakeLargest<TSource, TKey>
(this IEnumerable<TSource> items, Func<TSource, TKey> selector, int count)
{
var set = new SortedDictionary<TKey, List<TSource>>();
var resultCount = 0;
var first = default(KeyValuePair<TKey, List<TSource>>);
foreach (var item in items)
{
// If the key is already smaller than the smallest
// item in the set, we can ignore this item
var key = selector(item);
if (first.Value == null ||
resultCount < count ||
Comparer<TKey>.Default.Compare(key, first.Key) >= 0)
{
// Add next item to set
if (!set.ContainsKey(key))
{
set[key] = new List<TSource>();
}
set[key].Add(item);
if (first.Value == null)
{
first = set.First();
}
// Remove smallest item from set
resultCount++;
if (resultCount - first.Value.Count >= count)
{
set.Remove(first.Key);
resultCount -= first.Value.Count;
first = set.First();
}
}
}
return set.Values.SelectMany(values => values);
}
That will include more than count elements if there are ties, as your implementation does now.
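A small usage illustration against the question's variables (pointList and N as defined there):
// Keep roughly the top N% of points by Z; ties are included, as noted above.
var selectedPoints = pointList
    .TakeLargest(p => p.Z, (int)(pointList.Count * N))
    .ToList();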
You could sort the list in place, using List<T>.Sort, which uses the Quicksort algorithm. But of course, your original list would be sorted, which is perhaps not what you want...
pointList.Sort((a, b) => b.Z.CompareTo(a.Z));
var selectedPoints = pointList.Take((int)(pointList.Count * N)).ToList();
If you don't mind the original list being sorted, this is probably the best balance between memory usage and speed
You can use Indexed LINQ to put index on the data which you are processing. This can result in noticeable improvements in some cases.
If you combine the two there is a chance a little less work will be done:
List<Point> selectedPoints = pointList
.OrderByDescending(p => p.Z) // First bottleneck - creates sorted copy of all data
.Take((int)(pointList.Count * N))
.ToList();
But basically this kind of ranking requires sorting, your biggest cost.
A few more ideas:
if you use a class Point (instead of a struct Point) there will be much less copying.
you could write a custom sort that only bothers to move the top 5% up. Something like (don't laugh) BubbleSort.
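For that second idea, a rough sketch (my own illustration, not tested against your data) is a partial selection sort that only places the top m elements at the front and leaves the rest unordered - O(m·n) work and no copy of the list:
static void PartialSortDescendingByZ(List<Point> points, int m)
{
    for (int i = 0; i < m && i < points.Count; i++)
    {
        int maxIndex = i;
        for (int j = i + 1; j < points.Count; j++)
        {
            if (points[j].Z > points[maxIndex].Z)
                maxIndex = j;
        }
        // swap the largest remaining Z into position i
        var tmp = points[i];
        points[i] = points[maxIndex];
        points[maxIndex] = tmp;
    }
}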
If your list is in memory already, I would sort it in place instead of making a copy - unless you need it un-sorted again, that is, in which case you'll have to weigh keeping two copies in memory vs. loading it again from storage:
pointList.Sort((x,y) => y.Z.CompareTo(x.Z)); //this should sort it in desc. order
Also, not sure how much it will help, but it looks like you're going through your list twice - once to find the cutoff value, and once again to select them. I assume you're doing that because you want to let all ties through, even if it means selecting more than 5% of the points. However, since they're already sorted, you can use that to your advantage and stop when you're finished.
double cutoffValue = pointList[(int)(pointList.Count * N) - 1].Z;
List<Point> selectedPoints = pointList.TakeWhile(p => p.Z >= cutoffValue)
.ToList();
Unless your list is extremely large, it's much more likely to me that cpu time is your performance bottleneck. Yes, your OrderBy() might use a lot of memory, but it's generally memory that for the most part is otherwise sitting idle. The cpu time really is the bigger concern.
To improve cpu time, the most obvious thing here is to not use a list. Use an IEnumerable instead. You do this by simply not calling .ToList() at the end of your where query. This will allow the framework to combine everything into one iteration of the list that runs only as needed. It will also improve your memory use because it avoids loading the entire query into memory at once, and instead defers it to only load one item at a time as needed. Also, use .Take() rather than .ElementAt(). It's a lot more efficient.
double N = 0.05; // selecting only 5% of points
int count = (int)(N * pointList.Count);
var selectedPoints = pointList.OrderByDescending(p => p.Z).Take(count);
That out of the way, there are three cases where memory use might actually be a problem:
Your collection really is so large as to fill up memory. For a simple Point structure on a modern system we're talking millions of items. This is really unlikely. On the off chance you have a system this large, your solution is to use a relational database, which can keep this items on disk relatively efficiently.
You have a moderate size collection, but there are external performance constraints, such as needing to share system resources with many other processes as you might find in an asp.net web site. In this case, the answer is either to 1) again put the points in a relational database or 2) offload the work to the client machines.
Your collection is just large enough to end up on the Large Object Heap, and the internal buffer used by the OrderBy() call is also placed on the LOH. Now what happens is that the garbage collector will not properly compact memory after your OrderBy() call, and over time you get a lot of memory that is not used but still reserved by your program. In this case, the solution is, unfortunately, to break your collection up into multiple groups that are each individually small enough not to trigger use of the LOH.
Update:
Reading through your question again, I see you're reading very large files. In that case, the best performance can be obtained by writing your own code to parse the files. If the count of items is stored near the top of the file you can do much better, and even if it isn't, you can estimate the number of records based on the size of the file (guess a little high to be sure, and then truncate any extras after finishing) and build your final collection as you read. This will greatly improve CPU performance and memory use.
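As a rough sketch of that idea (the file path, the space-separated "x y z" format, the average-line-size guess, and settable X/Y/Z properties on Point are all assumptions of mine, not from the question): estimate the record count from the file size so the List<Point> is allocated once at roughly the right capacity, then parse line by line.
const int assumedBytesPerLine = 40;                       // rough guess for an "x y z" text line
var estimatedCount = (int)(new FileInfo(path).Length / assumedBytesPerLine) + 16;
var points = new List<Point>(estimatedCount);             // pre-sized, so it never has to grow
foreach (var line in File.ReadLines(path))
{
    var parts = line.Split(' ');
    points.Add(new Point
    {
        X = double.Parse(parts[0]),
        Y = double.Parse(parts[1]),
        Z = double.Parse(parts[2])
    });
}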
I'd do it by implementing "half" a quicksort.
Consider your original set of points, P, where you are looking for the "top" N items by Z coordinate.
Choose a pivot x in P.
Partition P into L = {y in P | y < x} and U = {y in P | x <= y}.
If N = |U| then you're done.
If N < |U| then recurse with P := U.
Otherwise you need to add some items to U: recurse with N := N - |U|, P := L to add the remaining items.
If you choose your pivot wisely (e.g., median of, say, five random samples) then this will run in O(n log n) time.
Hmmmm, thinking some more, you may be able to avoid creating new sets altogether, since essentially you're just looking for an O(n log n) way of finding the Nth greatest item from the original set. Yes, I think this would work, so here's suggestion number 2:
Make a traversal of P, finding the least and greatest items, A and Z, respectively.
Let M be the mean of A and Z (remember, we're only considering Z coordinates here).
Count how many items there are in the range [M, Z], call this Q.
If Q < N then the Nth greatest item in P is somewhere in [A, M). Try M := (A + M)/2.
If N < Q then the Nth greatest item in P is somewhere in [M, Z]. Try M := (M + Z)/2.
Repeat until we find an M such that Q = N.
Now traverse P, removing all items greater than or equal to M.
That's definitely O(n log n) and creates no extra data structures (except for the result).
Howzat?
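For concreteness, here is a minimal sketch (my own, assuming the question's Point type with a Z property) of the "half a quicksort" idea, done in place on pointList so no L and U copies are needed; with random pivots the expected work is roughly linear, and afterwards the n largest-Z points sit in the tail of the list:
static void SelectTopByZ(List<Point> points, int n)
{
    if (n <= 0 || n >= points.Count) return;       // nothing to do, or everything qualifies
    var rng = new Random();
    int lo = 0, hi = points.Count - 1;
    int target = points.Count - n;                 // final index of the cutoff element
    while (lo < hi)
    {
        // Lomuto partition around a randomly chosen pivot.
        int pivotIndex = rng.Next(lo, hi + 1);
        Swap(points, pivotIndex, hi);
        double pivot = points[hi].Z;
        int store = lo;
        for (int k = lo; k < hi; k++)
        {
            if (points[k].Z < pivot)
            {
                Swap(points, store, k);
                store++;
            }
        }
        Swap(points, store, hi);                   // pivot lands at its final position
        if (store == target) return;               // positions [target..] now hold the top n
        if (store < target) lo = store + 1;        // cutoff lies further right
        else hi = store - 1;                       // cutoff lies further left
    }
}
static void Swap(List<Point> points, int i, int j)
{
    var tmp = points[i];
    points[i] = points[j];
    points[j] = tmp;
}
The selection is then simply pointList.Skip(pointList.Count - n), with n = (int)(pointList.Count * N).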
You might use something like this:
pointList.Sort(); // Use your own comparer here if needed
// Skip OrderBy because the list is sorted (and not copied)
double cutoffValue = pointList.ElementAt((int)(pointList.Count * (1 - N))).Z;
// Skip ToList to avoid another copy of the list
IEnumerable<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue);
If you want a small percentage of points ordered by some criterion, you'll be better served using a Priority queue data structure; create a size-limited queue(with the size set to however many elements you want), and then just scan through the list inserting every element. After the scan, you can pull out your results in sorted order.
This has the benefit of being O(n log p) instead of O(n log n) where p is the number of points you want, and the extra storage cost is also dependent on your output size instead of the whole list.
int resultSize = (int)(pointList.Count * N);
FixedSizedPriorityQueue<Point> q =
new FixedSizedPriorityQueue<Point>(resultSize, p => p.Z);
q.AddEach(pointList);
List<Point> selectedPoints = q.ToList();
Now all you have to do is implement a FixedSizedPriorityQueue that adds elements one at a time and discards the element with the smallest Z when it is full, so that only the largest Z values are kept.
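A rough sketch (my own, with the API shape assumed from the snippet above) of such a queue, built on a SortedSet keyed by the selected value with an insertion counter so ties are not collapsed; when the set overflows, the smallest key is discarded:
class FixedSizedPriorityQueue<T>
{
    private readonly SortedSet<(double Key, long Seq, T Item)> _set;
    private readonly int _capacity;
    private readonly Func<T, double> _keySelector;
    private long _seq;

    public FixedSizedPriorityQueue(int capacity, Func<T, double> keySelector)
    {
        _capacity = capacity;
        _keySelector = keySelector;
        _set = new SortedSet<(double Key, long Seq, T Item)>(
            Comparer<(double Key, long Seq, T Item)>.Create((a, b) =>
                a.Key != b.Key ? a.Key.CompareTo(b.Key) : a.Seq.CompareTo(b.Seq)));
    }

    public void Add(T item)
    {
        _set.Add((_keySelector(item), _seq++, item));
        if (_set.Count > _capacity)
            _set.Remove(_set.Min);                 // drop the smallest key once over capacity
    }

    public void AddEach(IEnumerable<T> items)
    {
        foreach (var item in items) Add(item);
    }

    public List<T> ToList()
    {
        // ascending by key; reverse if you want the largest Z first
        return _set.Select(e => e.Item).ToList();
    }
}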
You wrote that you are working with a DataSet. If so, you can use a DataView to sort your data once and use it for all future access to the rows.
I just tried it with 50,000 rows, accessing 30% of them 100 times. My performance results are:
Sort With Linq: 5.3 seconds
Use DataViews: 0.01 seconds
Give it a try.
using System;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Linq;
using Microsoft.VisualStudio.TestTools.UnitTesting;
[TestClass]
public class UnitTest1 {
class MyTable : TypedTableBase<MyRow> {
public MyTable() {
Columns.Add("Col1", typeof(int));
Columns.Add("Col2", typeof(int));
}
protected override DataRow NewRowFromBuilder(DataRowBuilder builder) {
return new MyRow(builder);
}
}
class MyRow : DataRow {
public MyRow(DataRowBuilder builder) : base(builder) {
}
public int Col1 { get { return (int)this["Col1"]; } }
public int Col2 { get { return (int)this["Col2"]; } }
}
DataView _viewCol1Asc;
DataView _viewCol2Desc;
MyTable _table;
int _countToTake;
[TestMethod]
public void MyTestMethod() {
_table = new MyTable();
int count = 50000;
for (int i = 0; i < count; i++) {
_table.Rows.Add(i, i);
}
_countToTake = _table.Rows.Count / 30;
Console.WriteLine("SortWithLinq");
RunTest(SortWithLinq);
Console.WriteLine("Use DataViews");
RunTest(UseSortedDataViews);
}
private void RunTest(Action method) {
int iterations = 100;
Stopwatch watch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++) {
method();
}
watch.Stop();
Console.WriteLine(" {0}", watch.Elapsed);
}
private void UseSortedDataViews() {
if (_viewCol1Asc == null) {
_viewCol1Asc = new DataView(_table, null, "Col1 ASC", DataViewRowState.Unchanged);
_viewCol2Desc = new DataView(_table, null, "Col2 DESC", DataViewRowState.Unchanged);
}
var rows = _viewCol1Asc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
IterateRows(rows);
rows = _viewCol2Desc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
IterateRows(rows);
}
private void SortWithLinq() {
var rows = _table.OrderBy(row => row.Col1).Take(_countToTake);
IterateRows(rows);
rows = _table.OrderByDescending(row => row.Col2).Take(_countToTake);
IterateRows(rows);
}
private void IterateRows(IEnumerable<MyRow> rows) {
foreach (var row in rows)
if (row == null)
throw new Exception("????");
}
}
