Get sample set from large dataset - c#

I have an in memory dataset that I'm trying to get an evenly distributed sample using LINQ. From what I've seen, there isn't anything that does this out of the box, so I'm trying to come up with some kind of composition or extension that will perform the sampling.
What I'm hoping for is something that I can use like this:
var sample = dataset.Sample(100);
var smallSample = smallDataset.Sample(100);
Assert.IsTrue(dataset.Count() > 100);
Assert.IsTrue(sample.Count() == 100);
Assert.IsTrue(smallDataset.Count() < 100);
Assert.IsTrue(smallSample .Count() == smallDataset.Count());
The composition I started with, but only works some of the time is this:
var sample = dataset
.Select((v,i) => new Tuple<string, int>(v,i))
.Where(t => t.Item2 / (double)(dataset.Count() / SampleSize) % 1 != 0)
.Select(t => t.Item1);
This works when the dataset and the sample size share a common devisor and the sample size is greater than 50% of the dataset size. Or something like that.
Any help would be excellent!
Update: So I have the following non-LINQ logic that works, but I'm trying to figure out if this can be "LINQ'd" somehow.
var sample = new List<T>();
double sampleRatio = dataset.Count() / sampleSize;
for (var i = 0; i < dataset.Count(); i++)
{
if ((sample.Count() * sampleRatio) <= i)
sample.Add(dataset.Skip(i).FirstOrDefault();
}

I can't find a satisfactory LINQ solution, mainly because iterating LINQ statements are not aware of the length of the sequence they work on -- which is OK: it totally fits LINQ's deferred-execution and streaming approach. Of course it's possible to store the length in a variable and use this in a Where statement, but that's not in line with LINQ's functional (stateless) paradigm, so I always try to avoid that.
The Aggregate statement can be stateless and length-aware, but I tend to find solutions using Aggregate rather contrived and hard to read. It's nothing but a covert stateful loop; for and foreach take some more lines, but are far easier to follow.
I can offer you an extension method that does what you want:
public static IEnumerable<T> TakeProrated<T>(this IEnumerable<T> sequence, int numberOfItems)
{
var local = sequence.ToList();
var n = Math.Min(local.Count, numberOfItems);
var dist = (decimal)local.Count / n;
for (int i = 0; i < n; i++)
{
var index = (int)(Math.Ceiling(i * dist));
yield return local[index];
}
}
The idea is that the required distance between items is first calculated. Then the requested number of items is returned, each time roughly skipping this distance, sometimes more, sometimes less, but evenly distributed. Using Math.Ceiling or Math.Floor is arbitrary, they either introduce a bias toward items higher in the sequence, or lower.

I think I understand what you're looking for. From what I understand, you're looking to return only a certain quantity of entities in a dataset. As my comment to your original post asks, have you tried using the Take operator? What you're looking for is something like this.
// .Skip is optional, but you can use it with it.
// Just ensure that instead of .FirstOrDefault(), you use Take(quantity)
var sample = dataSet.Skip(amt).Take(dataSet.Count() / desiredSampleSize);

Related

How to speed up string operations and avoid slow loops

I am writing a code which makes a lot of combinations (Combinations might not be the right word here, sequences of string in the order they are actually present in the string) that already exist in a string. The loop starts adding combinations to a List<string> but unfortunately, my loop takes a lot of time when dealing with any file over 200 bytes. I want to be able to work with hundreds of MBs here.
Let me explain what I actually want in the simplest of ways.
Lets say I have a string that is "Afnan is awesome" (-> main string), what I would want is a list of string which encompasses different substring sequences of the main string. For example-> A,f,n,a,n, ,i,s, ,a,w,e,s,o,m,e. Now this is just the first iteration of the loop. With each iteration, my substring length increases, yielding these results for the second iteration -> Af,fn,na,n , i,is,s , a,aw,we,es,so,om,me. The third iteration would look like this: Afn,fna,nan,an ,n i, is,is ,s a, aw, awe, wes, eso, som, ome. This will keep going on until my substring length reaches half the length of my main string.
My code is as follows:
string data = File.ReadAllText("MyFilePath");
//Creating my dictionary
List<string> dictionary = new List<string>();
int stringLengthIncrementer = 1;
for (int v = 0; v < (data.Length / 2); v++)
{
for (int x = 0; x < data.Length; x++)
{
if ((x + stringLengthIncrementer) > data.Length) break; //So index does not go out of bounds
if (dictionary.Contains(data.Substring(x, stringLengthIncrementer)) == false) //So no repetition takes place
{
dictionary.Add(data.Substring(x, stringLengthIncrementer)); //To add the substring to my List<string> -> dictionary
}
}
stringLengthIncrementer++; //To increase substring length with each iteration
}
I use data.Length / 2 because I only need combinations at most half the length of the entire string. Note that I search the entire string for combinations, not half of it.
To further simplify what I am trying to do -> Suppose I have an input string =
"abcd"
the output would be =
a, b, c, d, ab, bc, cd, This rest will be cut out as it is longer than half the length of my primary string -> //abc, bcd, abcd
I was hoping if some regex method may help me achieve this. Anything that doesn't consist of loops. Anything that is exponentially faster than this? Some simple code with less complexity which is more efficient?
Update
When I used Hashset instead of List<string> for my dictionary, I did not experience any change of performance and also got an OutOfMemoryException:
You can use linq to simplify the code and very easily parallelize it, but it's not going to be orders of magnitude faster, as you would need to run it on files of 100s of MBs (that's very likely impossible).
var data = File.ReadAllText("MyFilePath");
var result = Enumerable.Range(1, data.Length / 2)
.AsParallel()
.Select(len => new HashSet<string>(
Enumerable.Range(0, data.Length - len + 1) //Adding the +1 here made it work perfectly
.Select(x => data.Substring(x, len))))
.SelectMany(t=>t)
.ToList();
General improvements, that you can do in your code to improve the performance (I don't consider if there're other more optimal solutions).
calculate data.Substring(x, stringLengthIncrementer) only once
as you do search, use SortedList, it will be faster.
initialize the List (or SortedList, or whatever) with calculated number of items. Like new List(CalucatedCapacity).
or you can try to write an algorithm that produces combinations without checking for duplicates.
You may be able to use HashSet combined with MoreLINQ's Batch feature (available on NuGet) to simplify the code a little.
public static void Main()
{
string data = File.ReadAllText("MyFilePath");
//string data = "Afnan is awesome";
var dictionary = new HashSet<string>();
for (var stringLengthIncrementer = 1; stringLengthIncrementer <= (data.Length / 2); stringLengthIncrementer++)
{
foreach (var skipper in Enumerable.Range(0, stringLengthIncrementer))
{
var batched = data.Skip(skipper).Batch(stringLengthIncrementer);
foreach (var batch in batched)
{
dictionary.Add(new string(batch.ToArray()));
}
}
}
Console.WriteLine(dictionary);
dictionary.ForEach(z => Console.WriteLine(z));
Console.ReadLine();
}
For this input:
"Afnan is awesome askdjkhaksjhd askjdhaksjsdhkajd asjsdhkajshdkjahsd asksdhkajshdkjashd aksjdhkajsshd98987ad asdhkajsshd98xcx98asdjaksjsd askjdakjshcc98z98asdsad"
performance is roughly 10x faster than your current code.

How to best implement K-nearest neighbours in C# for large number of dimensions?

I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples each, and 25 dimensions.
There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation :
// testSamples and trainSamples consists of about 20k vectors each with 25 dimensions
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int[]> trainClasses, int K)
{
Console.WriteLine("Performing KNN with K = "+K);
var testResults = new int[testSamples.Count()];
var testNumber = testSamples.Count();
var trainNumber = trainSamples.Count();
// Declaring these here so that I don't have to 'new' them over and over again in the main loop,
// just to save some overhead
var distances = new double[trainNumber][];
for (var i = 0; i < trainNumber; i++)
{
distances[i] = new double[2]; // Will store both distance and index in here
}
// Performing KNN ...
for (var tst = 0; tst < testNumber; tst++)
{
// For every test sample, calculate distance from every training sample
Parallel.For(0, trainNumber, trn =>
{
var dist = GetDistance(testSamples[tst], trainSamples[trn]);
// Storing distance as well as index
distances[trn][0] = dist;
distances[trn][1] = trn;
});
// Sort distances and take top K (?What happens in case of multiple points at the same distance?)
var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);
// Do a 'majority vote' to classify test sample
var yea = 0.0;
var nay = 0.0;
foreach (var voter in votingDistances)
{
if (trainClasses[(int)voter[1]] == 1)
yea++;
else
nay++;
}
if (yea > nay)
testResults[tst] = 1;
else
testResults[tst] = 0;
}
return testResults;
}
// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
var distance = 0.0;
// assume sample1 and sample2 are valid i.e. same length
for (var i = 0; i < sample1.Count; i++)
{
var temp = sample1[i] - sample2[i];
distance += temp * temp;
}
return distance;
}
This takes quite a bit of time to execute. On my system it takes about 80 seconds to complete. How can I optimize this, while ensuring that it would also scale to larger number of data samples? As you can see, I've tried using PLINQ and parallel for loops, which did help (without these, it was taking about 120 seconds). What else can I do?
I've read about KD-trees being efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.
I also found this stackoverflow discussion about this, but it seems like this is 3 years old, and I was hoping that someone would know about better solutions to this problem by now.
I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.
Edited to add - I cannot reduce the number of dimensions using PCA or something. For this particular model, 25 dimensions are required.
Whenever you are attempting to improve the performance of code, the first step is to analyze the current performance to see exactly where it is spending its time. A good profiler is crucial for this. In my previous job I was able to use the dotTrace profiler to good effect; Visual Studio also has a built-in profiler. A good profiler will tell you exactly where you code is spending time method-by-method or even line-by-line.
That being said, a few things come to mind in reading your implementation:
You are parallelizing some inner loops. Could you parallelize the outer loop instead? There is a small but nonzero cost associated to a delegate call (see here or here) which may be hitting you in the "Parallel.For" callback.
Similarly there is a small performance penalty for indexing through an array using its IList interface. You might consider declaring the array arguments to "GetDistance()" explicitly.
How large is K as compared to the size of the training array? You are completely sorting the "distances" array and taking the top K, but if K is much smaller than the array size it might make sense to use a partial sort / selection algorithm, for instance by using a SortedSet and replacing the smallest element when the set size exceeds K.

C# fastest intersection of 2 sets of sorted numbers

I'm calculating intersection of 2 sets of sorted numbers in a time-critical part of my application. This calculation is the biggest bottleneck of the whole application so I need to speed it up.
I've tried a bunch of simple options and am currently using this:
foreach (var index in firstSet)
{
if (secondSet.BinarySearch(index) < 0)
continue;
//do stuff
}
Both firstSet and secondSet are of type List.
I've also tried using LINQ:
var intersection = firstSet.Where(t => secondSet.BinarySearch(t) >= 0).ToList();
and then looping through intersection.
But as both of these sets are sorted I feel there's a better way to do it. Note that I can't remove items from sets to make them smaller. Both sets usually consist of about 50 items each.
Please help me guys as I don't have a lot of time to get this thing done. Thanks.
NOTE: I'm doing this about 5.3 million times. So every microsecond counts.
If you have two sets which are both sorted, you can implement a faster intersection than anything provided out of the box with LINQ.
Basically, keep two IEnumerator<T> cursors open, one for each set. At any point, advance whichever has the smaller value. If they match at any point, advance them both, and so on until you reach the end of either iterator.
The nice thing about this is that you only need to iterate over each set once, and you can do it in O(1) memory.
Here's a sample implementation - untested, but it does compile :) It assumes that both of the incoming sequences are duplicate-free and sorted, both according to the comparer provided (pass in Comparer<T>.Default):
(There's more text at the end of the answer!)
static IEnumerable<T> IntersectSorted<T>(this IEnumerable<T> sequence1,
IEnumerable<T> sequence2,
IComparer<T> comparer)
{
using (var cursor1 = sequence1.GetEnumerator())
using (var cursor2 = sequence2.GetEnumerator())
{
if (!cursor1.MoveNext() || !cursor2.MoveNext())
{
yield break;
}
var value1 = cursor1.Current;
var value2 = cursor2.Current;
while (true)
{
int comparison = comparer.Compare(value1, value2);
if (comparison < 0)
{
if (!cursor1.MoveNext())
{
yield break;
}
value1 = cursor1.Current;
}
else if (comparison > 0)
{
if (!cursor2.MoveNext())
{
yield break;
}
value2 = cursor2.Current;
}
else
{
yield return value1;
if (!cursor1.MoveNext() || !cursor2.MoveNext())
{
yield break;
}
value1 = cursor1.Current;
value2 = cursor2.Current;
}
}
}
}
EDIT: As noted in comments, in some cases you may have one input which is much larger than the other, in which case you could potentially save a lot of time using a binary search for each element from the smaller set within the larger set. This requires random access to the larger set, however (it's just a prerequisite of binary search). You can even make it slightly better than a naive binary search by using the match from the previous result to give a lower bound to the binary search. So suppose you were looking for values 1000, 2000 and 3000 in a set with every integer from 0 to 19,999. In the first iteration, you'd need to look across the whole set - your starting lower/upper indexes would be 0 and 19,999 respectively. After you'd found a match at index 1000, however, the next step (where you're looking for 2000) can start with a lower index of 2000. As you progress, the range in which you need to search gradually narrows. Whether or not this is worth the extra implementation cost or not is a different matter, however.
Since both lists are sorted, you can arrive at the solution by iterating over them at most once (you may also get to skip part of one list, depending on the actual values they contain).
This solution keeps a "pointer" to the part of list we have not yet examined, and compares the first not-examined number of each list between them. If one is smaller than the other, the pointer to the list it belongs to is incremented to point to the next number. If they are equal, the number is added to the intersection result and both pointers are incremented.
var firstCount = firstSet.Count;
var secondCount = secondSet.Count;
int firstIndex = 0, secondIndex = 0;
var intersection = new List<int>();
while (firstIndex < firstCount && secondIndex < secondCount)
{
var comp = firstSet[firstIndex].CompareTo(secondSet[secondIndex]);
if (comp < 0) {
++firstIndex;
}
else if (comp > 0) {
++secondIndex;
}
else {
intersection.Add(firstSet[firstIndex]);
++firstIndex;
++secondIndex;
}
}
The above is a textbook C-style approach of solving this particular problem, and given the simplicity of the code I would be surprised to see a faster solution.
You're using a rather inefficient Linq method for this sort of task, you should opt for Intersect as a starting point.
var intersection = firstSet.Intersect(secondSet);
Try this. If you measure it for performance and still find it unwieldy, cry for further help (or perhaps follow Jon Skeet's approach).
I was using Jon's approach but needed to execute this intersect hundreds of thousands of times for a bulk operation on very large sets and needed more performance. The case I was running in to was heavily imbalanced sizes of the lists (eg 5 and 80,000) and wanted to avoid iterating the entire large list.
I found that detecting the imbalance and changing to an alternate algorithm gave me huge benifits over specific data sets:
public static IEnumerable<T> IntersectSorted<T>(this List<T> sequence1,
List<T> sequence2,
IComparer<T> comparer)
{
List<T> smallList = null;
List<T> largeList = null;
if (sequence1.Count() < Math.Log(sequence2.Count(), 2))
{
smallList = sequence1;
largeList = sequence2;
}
else if (sequence2.Count() < Math.Log(sequence1.Count(), 2))
{
smallList = sequence2;
largeList = sequence1;
}
if (smallList != null)
{
foreach (var item in smallList)
{
if (largeList.BinarySearch(item, comparer) >= 0)
{
yield return item;
}
}
}
else
{
//Use Jon's method
}
}
I am still unsure about the point at which you break even, need to do some more testing
try
firstSet.InterSect (secondSet).ToList ()
or
firstSet.Join(secondSet, o => o, id => id, (o, id) => o)

Divide List in Equal Parts

I have a big List that may have some 50,000 or more items and i have to do operation against each item.takes some X time now if i use conventional method and do operation in sequential manner it is definitely take X * 50,000 on average.
I planned to optimize and save some time and decided to use Background Worker as there is no dependency among them.Plan was to divide the List in 4 parts and use each in separate Background Worker.
I want to ASk
1.is this method DUMB?
2.Is there any other Better Method?
3.Suggest a nice and clean method to divide List in 4 equal Parts?
Thanks
If you can use .Net 4.0, then use the Task Parallel library and have a look at
Parallel.ForEach()
Parallel ForEach How-to.
Everything is basically the same as a traditional for loop, but you work with parallelism implicitly.
You can also really split it to groups.
I didn't see a built-in sequence method for it, so here's the low level way. point out any blunders please. I am learning.
static List<T[]> groups<T>(IList<T> original, uint n)
{
Debug.Assert(n > 0);
var listlist = new List<T[]>();
var list = new List<T>();
for (int i = 0; i < original.Count(); i++)
{
var item = original[i];
list.Add(item);
if ((i+1) % n == 0 || i == original.Count() - 1)
{
listlist.Add(list.ToArray());
list.Clear();
}
}
return listlist;
}
Another version, based on linq.
public static List<T[]> groups<T>(IList<T> original, uint n)
{
var almost_grouped = original.Select((row, i) => new { Item = row, GroupIndex = i / n });
var groups = almost_grouped.GroupBy(a => a.GroupIndex, a => a.Item);
var grouped = groups.Select(a => a.ToArray()).ToList();
return grouped;
}
This is a good method for optimizing similar, independent, operations on a large collection. However, you should look at the Parallel.For method in .NET 4.0. It does all the heavy lifting for you:
http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.for.aspx

How to avoid OrderBy - memory usage problems

Let's assume we have a large list of points List<Point> pointList (already stored in memory) where each Point contains X, Y, and Z coordinate.
Now, I would like to select for example N% of points with biggest Z-values of all points stored in pointList. Right now I'm doing it like that:
N = 0.05; // selecting only 5% of points
double cutoffValue = pointList
.OrderBy(p=> p.Z) // First bottleneck - creates sorted copy of all data
.ElementAt((int) pointList.Count * (1 - N)).Z;
List<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue).ToList();
But I have here two memory usage bottlenecks: first during OrderBy (more important) and second during selecting the points (this is less important, because we usually want to select only small amount of points).
Is there any way of replacing OrderBy (or maybe other way of finding this cutoff point) with something that uses less memory?
The problem is quite important, because LINQ copies the whole dataset and for big files I'm processing it sometimes hits few hundreds of MBs.
Write a method that iterates through the list once and maintains a set of the M largest elements. Each step will only require O(log M) work to maintain the set, and you can have O(M) memory and O(N log M) running time.
public static IEnumerable<TSource> TakeLargest<TSource, TKey>
(this IEnumerable<TSource> items, Func<TSource, TKey> selector, int count)
{
var set = new SortedDictionary<TKey, List<TSource>>();
var resultCount = 0;
var first = default(KeyValuePair<TKey, List<TSource>>);
foreach (var item in items)
{
// If the key is already smaller than the smallest
// item in the set, we can ignore this item
var key = selector(item);
if (first.Value == null ||
resultCount < count ||
Comparer<TKey>.Default.Compare(key, first.Key) >= 0)
{
// Add next item to set
if (!set.ContainsKey(key))
{
set[key] = new List<TSource>();
}
set[key].Add(item);
if (first.Value == null)
{
first = set.First();
}
// Remove smallest item from set
resultCount++;
if (resultCount - first.Value.Count >= count)
{
set.Remove(first.Key);
resultCount -= first.Value.Count;
first = set.First();
}
}
}
return set.Values.SelectMany(values => values);
}
That will include more than count elements if there are ties, as your implementation does now.
You could sort the list in place, using List<T>.Sort, which uses the Quicksort algorithm. But of course, your original list would be sorted, which is perhaps not what you want...
pointList.Sort((a, b) => b.Z.CompareTo(a.Z));
var selectedPoints = pointList.Take((int)(pointList.Count * N)).ToList();
If you don't mind the original list being sorted, this is probably the best balance between memory usage and speed
You can use Indexed LINQ to put index on the data which you are processing. This can result in noticeable improvements in some cases.
If you combine the two there is a chance a little less work will be done:
List<Point> selectedPoints = pointList
.OrderByDescending(p=> p.Z) // First bottleneck - creates sorted copy of all data
.Take((int) pointList.Count * N);
But basically this kind of ranking requires sorting, your biggest cost.
A few more ideas:
if you use a class Point (instead of a struct Point) there will be much less copying.
you could write a custom sort that only bothers to move the top 5% up. Something like (don't laugh) BubbleSort.
If your list is in memory already, I would sort it in place instead of making a copy - unless you need it un-sorted again, that is, in which case you'll have to weigh having two copies in memory vs loading it again from storage):
pointList.Sort((x,y) => y.Z.CompareTo(x.Z)); //this should sort it in desc. order
Also, not sure how much it will help, but it looks like you're going through your list twice - once to find the cutoff value, and once again to select them. I assume you're doing that because you want to let all ties through, even if it means selecting more than 5% of the points. However, since they're already sorted, you can use that to your advantage and stop when you're finished.
double cutoffValue = pointlist[(int) pointList.Length * (1 - N)].Z;
List<point> selectedPoints = pointlist.TakeWhile(p => p.Z >= cutoffValue)
.ToList();
Unless your list is extremely large, it's much more likely to me that cpu time is your performance bottleneck. Yes, your OrderBy() might use a lot of memory, but it's generally memory that for the most part is otherwise sitting idle. The cpu time really is the bigger concern.
To improve cpu time, the most obvious thing here is to not use a list. Use an IEnumerable instead. You do this by simply not calling .ToList() at the end of your where query. This will allow the framework to combine everything into one iteration of the list that runs only as needed. It will also improve your memory use because it avoids loading the entire query into memory at once, and instead defers it to only load one item at a time as needed. Also, use .Take() rather than .ElementAt(). It's a lot more efficient.
double N = 0.05; // selecting only 5% of points
int count = (1-N) * pointList.Count;
var selectedPoints = pointList.OrderBy(p=>p.Z).Take(count);
That out of the way, there are three cases where memory use might actually be a problem:
Your collection really is so large as to fill up memory. For a simple Point structure on a modern system we're talking millions of items. This is really unlikely. On the off chance you have a system this large, your solution is to use a relational database, which can keep this items on disk relatively efficiently.
You have a moderate size collection, but there are external performance constraints, such as needing to share system resources with many other processes as you might find in an asp.net web site. In this case, the answer is either to 1) again put the points in a relational database or 2) offload the work to the client machines.
Your collection is just large enough to end up on the Large Object Heap, and the HashSet used in the OrderBy() call is also placed on the LOH. Now what happens is that the garbage collector will not properly compact memory after your OrderBy() call, and over time you get a lot of memory that is not used but still reserved by your program. In this case, the solution is, unfortunately, to break your collection up into multiple groups that are each individually small enough not to trigger use of the LOH.
Update:
Reading through your question again, I see you're reading very large files. In that case, the best performance can be obtained by writing your own code to parse the files. If the count of items is stored near the top of the file you can do much better, or even if you can estimate the number of records based on the size of the file (guess a little high to be sure, and then truncate any extras after finishing), you can then build your final collection as your read. This will greatly improve cpu performance and memory use.
I'd do it by implementing "half" a quicksort.
Consider your original set of points, P, where you are looking for the "top" N items by Z coordinate.
Choose a pivot x in P.
Partition P into L = {y in P | y < x} and U = {y in P | x <= y}.
If N = |U| then you're done.
If N < |U| then recurse with P := U.
Otherwise you need to add some items to U: recurse with N := N - |U|, P := L to add the remaining items.
If you choose your pivot wisely (e.g., median of, say, five random samples) then this will run in O(n log n) time.
Hmmmm, thinking some more, you may be able to avoid creating new sets altogether, since essentially you're just looking for an O(n log n) way of finding the Nth greatest item from the original set. Yes, I think this would work, so here's suggestion number 2:
Make a traversal of P, finding the least and greatest items, A and Z, respectively.
Let M be the mean of A and Z (remember, we're only considering Z coordinates here).
Count how many items there are in the range [M, Z], call this Q.
If Q < N then the Nth greatest item in P is somewhere in [A, M). Try M := (A + M)/2.
If N < Q then the Nth greatest item in P is somewhere in [M, Z]. Try M := (M + Z)/2.
Repeat until we find an M such that Q = N.
Now traverse P, removing all items greater than or equal to M.
That's definitely O(n log n) and creates no extra data structures (except for the result).
Howzat?
You might use something like this:
pointList.Sort(); // Use you own compare here if needed
// Skip OrderBy because the list is sorted (and not copied)
double cutoffValue = pointList.ElementAt((int) pointList.Length * (1 - N)).Z;
// Skip ToList to avoid another copy of the list
IEnumerable<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue);
If you want a small percentage of points ordered by some criterion, you'll be better served using a Priority queue data structure; create a size-limited queue(with the size set to however many elements you want), and then just scan through the list inserting every element. After the scan, you can pull out your results in sorted order.
This has the benefit of being O(n log p) instead of O(n log n) where p is the number of points you want, and the extra storage cost is also dependent on your output size instead of the whole list.
int resultSize = pointList.Count * (1-N);
FixedSizedPriorityQueue<Point> q =
new FixedSizedPriorityQueue<Point>(resultSize, p => p.Z);
q.AddEach(pointList);
List<Point> selectedPoints = q.ToList();
Now all you have to do is implement a FixedSizedPriorityQueue that adds elements one at a time and discards the largest element when it is full.
You wrote, you are working with a DataSet. If so, you can use DataView to sort your data once and use them for all future accessing the rows.
Just tried with 50,000 rows and 100 times accessing 30% of them. My performance results are:
Sort With Linq: 5.3 seconds
Use DataViews: 0.01 seconds
Give it a try.
[TestClass]
public class UnitTest1 {
class MyTable : TypedTableBase<MyRow> {
public MyTable() {
Columns.Add("Col1", typeof(int));
Columns.Add("Col2", typeof(int));
}
protected override DataRow NewRowFromBuilder(DataRowBuilder builder) {
return new MyRow(builder);
}
}
class MyRow : DataRow {
public MyRow(DataRowBuilder builder) : base(builder) {
}
public int Col1 { get { return (int)this["Col1"]; } }
public int Col2 { get { return (int)this["Col2"]; } }
}
DataView _viewCol1Asc;
DataView _viewCol2Desc;
MyTable _table;
int _countToTake;
[TestMethod]
public void MyTestMethod() {
_table = new MyTable();
int count = 50000;
for (int i = 0; i < count; i++) {
_table.Rows.Add(i, i);
}
_countToTake = _table.Rows.Count / 30;
Console.WriteLine("SortWithLinq");
RunTest(SortWithLinq);
Console.WriteLine("Use DataViews");
RunTest(UseSoredDataViews);
}
private void RunTest(Action method) {
int iterations = 100;
Stopwatch watch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++) {
method();
}
watch.Stop();
Console.WriteLine(" {0}", watch.Elapsed);
}
private void UseSoredDataViews() {
if (_viewCol1Asc == null) {
_viewCol1Asc = new DataView(_table, null, "Col1 ASC", DataViewRowState.Unchanged);
_viewCol2Desc = new DataView(_table, null, "Col2 DESC", DataViewRowState.Unchanged);
}
var rows = _viewCol1Asc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
IterateRows(rows);
rows = _viewCol2Desc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
IterateRows(rows);
}
private void SortWithLinq() {
var rows = _table.OrderBy(row => row.Col1).Take(_countToTake);
IterateRows(rows);
rows = _table.OrderByDescending(row => row.Col2).Take(_countToTake);
IterateRows(rows);
}
private void IterateRows(IEnumerable<MyRow> rows) {
foreach (var row in rows)
if (row == null)
throw new Exception("????");
}
}

Categories