The problem is that I have two lists of strings. One list is an approximation of the other list, and I need some way of measuring the accuracy of the approximation.
As a makeshift way of scoring the approximation, I have bucketed each list (the approximation and the answer) into 3 partitions (high, medium, low) after sorting based on a numeric value that corresponds to each string. I then check each element in the approximation to see whether the string exists in the same partition of the correct list.
I sum the number of correctly classified strings and divide it by the total number of strings. I understand that this is a very crude way of measuring the accuracy of the estimate, and I was hoping that better alternatives were available. This is a very small component of a larger piece of work, and I was hoping not to have to reinvent the wheel.
EDIT:
I think I wasn't clear enough. I don't need the two lists to be exactly equal; I need some sort of measure that shows the lists are similar. For example, the High-Medium-Low (H-M-L) approach we have taken shows that the estimated list is sufficiently similar. The downside of this approach is that if the estimated list has an item at the bottom of the "High" bracket, and in the actual list the item is at the top of the "Medium" set, the scoring algorithm fails to deliver.
It could potentially be that in addition to the H-M-L approach, the bottom 20% of each partition is compared to the top 20% of the next partition or something along those lines.
Thanks all for your help!!
So, we're taking a sequence of items and grouping it into partitions with three categories, high, medium, and low. Let's first make an object to represent those three partitions:
public class Partitions<T>
{
public IEnumerable<T> High { get; set; }
public IEnumerable<T> Medium { get; set; }
public IEnumerable<T> Low { get; set; }
}
Next, to score an estimate we want to take two of these objects: one for the actual partitions and one for the estimate. For each priority level we want to see how many of the items are in both collections; this is an intersection. We sum up the counts of the intersections of each set.
Then just divide that count by the total:
public static double EstimateAccuracy<T>(Partitions<T> actual
, Partitions<T> estimate)
{
int correctlyCategorized =
actual.High.Intersect(estimate.High).Count() +
actual.Medium.Intersect(estimate.Medium).Count() +
actual.Low.Intersect(estimate.Low).Count();
double total = actual.High.Count()+
actual.Medium.Count()+
actual.Low.Count();
return correctlyCategorized / total;
}
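For example (the bucket contents here are made up purely for illustration, and both methods are assumed to live in the same static class):
var actual = new Partitions<string>
{
    High = new[] { "a", "b" },
    Medium = new[] { "c", "d" },
    Low = new[] { "e", "f" }
};
var estimate = new Partitions<string>
{
    High = new[] { "a", "c" },
    Medium = new[] { "b", "d" },
    Low = new[] { "e", "f" }
};
// 4 of the 6 strings land in the same bucket, so this prints roughly 0.67
Console.WriteLine(EstimateAccuracy(actual, estimate));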
Of course, if we generalize this to not 3 priorities, but rather a sequence of sequences, in which each sequence corresponds to some bucket (i.e. there are N buckets, not just 3) the code actually gets easier:
public static double EstimateAccuracy<T>(
IEnumerable<IEnumerable<T>> actual
, IEnumerable<IEnumerable<T>> estimate)
{
var query = actual.Zip(estimate, (a, b) => new
{
valid = a.Intersect(b).Count(),
total = a.Count()
}).ToList();
return query.Sum(pair => pair.valid) /
(double)query.Sum(pair => pair.total);
}
Nice question. Well, I think you could use the following method to compare your lists:
public double DetermineAccuracyPercentage(int numberOfEqualElements, int yourListsLength)
{
return ((double)numberOfEqualElements / (double)yourListsLength) * 100.0;
}
The number returned indicates how much equality exists between your two lists.
If numberOfEqualElements equals yourLists.Length (Count), then they are absolutely equal.
The accuracy of the approximation = numberOfEqualElements / yourLists.Length
1 = completely equal, 0 = absolutely different, and the values between 0 and 1 indicate the level of equality; in my sample it is expressed as a percentage.
If you compare these two lists, you will get 75% equality, which is the same as 3 of 4 equal elements (3/4).
IList<string> list1 = new List<string>();
IList<string> list2 = new List<string>();
list1.Add("Dog");
list1.Add("Cat");
list1.Add("Fish");
list1.Add("Bird");
list2.Add("Dog");
list2.Add("Cat");
list2.Add("Fish");
list2.Add("Frog");
int resultOfComparing = list1.Intersect(list2).Count();
double accuracyPercentage = DetermineAccuracyPercentage(resultOfComparing, list1.Count);
I hope it helps.
I would take both List<String>s and combine them element by element into an IEnumerable<Boolean>:
public IEnumerable<Boolean> Combine<Ta, Tb>(List<Ta> seqA, List<Tb> seqB)
{
if (seqA.Count != seqB.Count)
throw new ArgumentException("Lists must be the same size...");
for (int i = 0; i < seqA.Count; i++)
yield return seqA[i].Equals(seqB[i]);
}
And then use Aggregate() to verify which strings match and keep a running total:
var result = Combine(a, b).Aggregate(0, (acc, t) => t ? acc + 1 : acc) / (double)a.Count;
Let's say I have a List of items which looks like this:
Number Amount
1 10
2 12
5 5
6 9
9 4
10 3
11 1
I need the method to take in any number, even a decimal, and use that number to group the list into ranges based on that number. So let's say my number was 1; the following output would be...
Ranges Total
1-2 22
5-6 14
9-11 8
Because it basically grouped the numbers that are 1 away from each other into ranges. What's the most efficient way I can convert my list to look like the output?
There are a couple of approaches to this. Either you can partition the data and then sum on the partitions, or you can roll the whole thing into a single method.
Since partitioning is based on the gaps between the Number values you won't be able to work on unordered lists. Building the partition list on the fly isn't going to work if the list isn't ordered, so make sure you sort the list on the partition field before you start.
Partitioning
Once the list is ordered (or if it was pre-ordered) you can partition. I use this kind of extension method fairly often for breaking up ordered sequences into useful blocks, like when I need to grab sequences of entries from a log file.
public static partial class Ext
{
public static IEnumerable<T[]> PartitionStream<T>(this IEnumerable<T> source, Func<T, T, bool> partitioner)
{
var partition = new List<T>();
T prev = default;
foreach (var next in source)
{
if (partition.Count > 0 && !partitioner(prev, next))
{
yield return partition.ToArray();
partition.Clear();
}
partition.Add(prev = next);
}
if (partition.Count > 0)
yield return partition.ToArray();
}
}
The partitioner parameter compares two objects and returns true if they belong in the same partition. The extension method just collects all the members of the partition together and returns them as an array once it finds something for the next partition.
From there you can just do simple summing on the partition arrays:
var source = new (int n, int v)[] { (1,10),(2,12),(5,5),(6,9),(9,4),(10,3),(11,1) };
var maxDifference = 2;
var aggregate =
from part in source.PartitionStream((l, r) => (r.n - l.n) <= maxDifference)
let low = part.Min(g => g.n)
let high = part.Max(g => g.n)
select new { Ranges = $"{low}-{high}", Total = part.Sum(g => g.v) };
This gives the same output as your example.
Stream Aggregation
The second option is both simpler and more efficient since it does barely any memory allocations. The downside - if you can call it that - is that it's a lot less generic.
Rather than partitioning and aggregating over the partitions, this just walks through the list and aggregates as it goes, spitting out a result whenever the partition boundary is reached:
IEnumerable<(string Ranges, int Total)> GroupSum(IEnumerable<(int n, int v)> source, int maxDistance)
{
int low = int.MaxValue;
int high = 0;
int total = 0;
foreach (var (n, v) in source)
{
// check partition boundary
if (n < low || (n - high) > maxDistance)
{
if (n > low)
yield return ($"{low}-{high}", total);
low = high = n;
total = v;
}
else
{
high = n;
total += v;
}
}
if (total > 0)
yield return ($"{low}-{high}", total);
}
(Using ValueTuple so I don't have to declare types.)
Output is the same here, but with a lot less going on in the background to slow it down. No allocated arrays, etc.
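As a quick sanity check, calling GroupSum with the sample data from the question (and a maximum distance of 1) reproduces the expected output:
var source = new (int n, int v)[] { (1,10), (2,12), (5,5), (6,9), (9,4), (10,3), (11,1) };
foreach (var (ranges, total) in GroupSum(source, 1))
    Console.WriteLine($"{ranges}  {total}");
// 1-2  22
// 5-6  14
// 9-11  8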
We have a number of payments (Transaction) that come into our business each day. Each Transaction has an ID and an Amount. We have the requirement to match a number of these transactions to a specific amount. Example:
Transaction Amount
1 100
2 200
3 300
4 400
5 500
If we wanted to find the transactions that add up to 600 you would have a number of sets (1,2,3),(2,4),(1,5).
I found an algorithm that I have adapted, which works as defined below. For 30 transactions it takes 15 ms. But the number of transactions averages around 740 and has a maximum close to 6000. Is there a more efficient way to perform this search?
sum_up(TransactionList, remittanceValue, ref MatchedLists);
private static void sum_up(List<Transaction> transactions, decimal target, ref List<List<Transaction>> matchedLists)
{
sum_up_recursive(transactions, target, new List<Transaction>(), ref matchedLists);
}
private static void sum_up_recursive(List<Transaction> transactions, decimal target, List<Transaction> partial, ref List<List<Transaction>> matchedLists)
{
decimal s = 0;
foreach (Transaction x in partial) s += x.Amount;
if (s == target)
{
matchedLists.Add(partial);
}
if (s > target)
return;
for (int i = 0; i < transactions.Count; i++)
{
List<Transaction> remaining = new List<Transaction>();
Transaction n = new Transaction(0, transactions[i].ID, transactions[i].Amount);
for (int j = i + 1; j < transactions.Count; j++) remaining.Add(transactions[j]);
List<Transaction> partial_rec = new List<Transaction>(partial);
partial_rec.Add(new Transaction(n.MatchNumber, n.ID, n.Amount));
sum_up_recursive(remaining, target, partial_rec, ref matchedLists);
}
}
With Transaction defined as:
class Transaction
{
public int ID;
public decimal Amount;
public int MatchNumber;
public Transaction(int matchNumber, int id, decimal amount)
{
ID = id;
Amount = amount;
MatchNumber = matchNumber;
}
}
As already mentioned, your problem can be solved by a pseudo-polynomial algorithm in O(n*G), with n the number of items and G your targeted sum.
The first part of the question is: can the targeted sum G be achieved at all? The following pseudo/Python code solves it (I have no C# on my machine):
def subsum(values, target):
    reached = [False] * (target + 1)  # initialize as: no sums reached at all
    reached[0] = True                 # with 0 elements we can only achieve the sum 0
    for val in values:
        for s in reversed(xrange(target + 1)):  # for target, target-1, ..., 0
            if reached[s] and s + val <= target:  # if sub-sum s is reachable, adding the current value builds a new reachable sum
                reached[s + val] = True
    return reached[target]
What is the idea? Let's consider values [1,2,3,6] and target sum 7:
We start with an empty set - the possible sum is obviously 0.
Now we look at the first element 1 and have two options: take it or not. That leaves us with possible sums {0,1}.
Now looking at the next element 2: this leads to possible sets {0,1} (not taking) + {2,3} (taking).
Until now there is not much difference to your approach, but now for element 3 we have possible sets a. for not taking {0,1,2,3} and b. for taking {3,4,5,6}, resulting in {0,1,2,3,4,5,6} as possible sums. The difference to your approach is that there are two ways to get to 3, and your recursion will be started twice from there (which is not needed). Calculating basically the same stuff over and over again is the problem of your approach and why the proposed algorithm is better.
As the last step we consider 6 and get {0,1,2,3,4,5,6,7} as possible sums.
But you also need the subset which leads to the targeted sum; for that we just remember which element was taken to achieve each reachable sub-sum. This version returns a subset which results in the target sum, or None otherwise:
def subsum(values, target):
    reached = [False] * (target + 1)
    val_ids = [-1] * (target + 1)
    reached[0] = True  # with 0 elements we can only achieve the sum 0
    for (val_id, val) in enumerate(values):
        for s in reversed(xrange(target + 1)):  # for target, target-1, ..., 0
            if reached[s] and s + val <= target and not reached[s + val]:
                # record only the first way each sum is reached, so an element
                # can never be reused during the reconstruction below
                reached[s + val] = True
                val_ids[s + val] = val_id
    # reconstruct the subset for target:
    if not reached[target]:
        return None  # means not possible
    else:
        result = []
        current = target
        while current != 0:  # search backwards, jumping from predecessor to predecessor
            val_id = val_ids[current]
            result.append(val_id)
            current -= values[val_id]
        return result
As another approach you could use memoization to speed up your current solution, remembering for the state (sub-sum, number of elements not yet considered) whether the target sum can still be achieved. But I would say the standard dynamic programming is a less error-prone possibility here.
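Since the question itself is in C#, here is a rough translation of the same dynamic programming idea. It assumes the decimal amounts have been scaled to non-negative integers (e.g. whole cents) and returns the indices of one matching subset, or null if the target cannot be reached:
static List<int> SubsetSum(int[] values, int target)
{
    var reached = new bool[target + 1];
    var valIds = new int[target + 1];
    for (int i = 0; i <= target; i++) valIds[i] = -1;
    reached[0] = true;                                // the empty set sums to 0

    for (int id = 0; id < values.Length; id++)
        for (int s = target; s >= 0; s--)             // backwards so each value is used at most once
            if (reached[s] && s + values[id] <= target && !reached[s + values[id]])
            {
                reached[s + values[id]] = true;
                valIds[s + values[id]] = id;          // remember which value first reached this sum
            }

    if (!reached[target]) return null;                // target is not achievable

    var result = new List<int>();
    for (int current = target; current != 0; current -= values[valIds[current]])
        result.Add(valIds[current]);                  // walk back from predecessor to predecessor
    return result;
}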
Yes.
I can't provide full code at the moment, but instead of iterating each list of transactions twice until finding matches (O squared), try this concept:
set up a hashtable with the existing transaction amounts as entries, as well as the sum of each pair of transactions, assuming each value is made up of at most two transactions (weekend credit card processing).
for each total, reference into the hashtable - the sets of transactions in that slot are the list of matching transactions.
Instead of O^2, you can get it down to 4*O, which would make a noticeable difference in speed.
Good luck!
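A minimal sketch of that idea, assuming (as the answer does) that each matched amount is made up of at most two transactions; TransactionList, remittanceValue and Transaction are the names from the question:
// Key: an amount reachable with one or two transactions; value: the sets that produce it.
var lookup = new Dictionary<decimal, List<List<Transaction>>>();
void Register(decimal sum, List<Transaction> set)
{
    if (!lookup.TryGetValue(sum, out var sets))
        lookup[sum] = sets = new List<List<Transaction>>();
    sets.Add(set);
}

for (int i = 0; i < TransactionList.Count; i++)
{
    Register(TransactionList[i].Amount, new List<Transaction> { TransactionList[i] });
    for (int j = i + 1; j < TransactionList.Count; j++)
        Register(TransactionList[i].Amount + TransactionList[j].Amount,
                 new List<Transaction> { TransactionList[i], TransactionList[j] });
}

// Each remittance amount is now a single dictionary lookup.
var matches = lookup.TryGetValue(remittanceValue, out var found)
    ? found
    : new List<List<Transaction>>();
Note that building all the pair sums is itself quadratic in time and memory, so this mainly pays off when the same table is reused for many remittance amounts.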
Dynamic programming can solve this problem efficiently:
Assume you have n transactions and the maximum amount is m.
We can solve it with a complexity of O(n*m);
see the Knapsack problem.
For this problem we can define dp[i][sum] as the number of subsets of the first i transactions that add up to sum.
The recurrence:
for i = 1 to n:
dp[i][sum] = dp[i - 1][sum] + dp[i - 1][sum - amount_i]
dp[n][target] is the count you need, and you need to add some bookkeeping tricks to recover what all the subsets are.
You have a couple of practical assumptions here that would make brute force with smartish branch pruning feasible:
items are unique, hence you wouldn't be getting combinatorial blow up of valid subsets (i.e. (1,1,1,1,1,1,1,1,1,1,1,1,1) adding up to 3)
if the number of resulting feasible sets is still huge, you would run out of memory collecting them before running into total runtime issues.
ordering the input ascending would allow for an easy early-stop check - if your remaining sum is smaller than the current element, then none of the yet-unexamined items could possibly be in a result (as the current and subsequent items would only get bigger)
keeping running sums would speed up each step, as you wouldn't be recalculating it over and over again
Here's a bit of code:
public static List<T[]> SubsetSums<T>(T[] items, int target, Func<T, int> amountGetter)
{
Stack<T> unusedItems = new Stack<T>(items.OrderByDescending(amountGetter));
Stack<T> usedItems = new Stack<T>();
List<T[]> results = new List<T[]>();
SubsetSumsRec(unusedItems, usedItems, target, results, amountGetter);
return results;
}
public static void SubsetSumsRec<T>(Stack<T> unusedItems, Stack<T> usedItems, int targetSum, List<T[]> results, Func<T,int> amountGetter)
{
if (targetSum == 0)
results.Add(usedItems.ToArray());
if (targetSum < 0 || unusedItems.Count == 0)
return;
var item = unusedItems.Pop();
int currentAmount = amountGetter(item);
if (targetSum >= currentAmount)
{
// case 1: use current element
usedItems.Push(item);
SubsetSumsRec(unusedItems, usedItems, targetSum - currentAmount, results, amountGetter);
usedItems.Pop();
// case 2: skip current element
SubsetSumsRec(unusedItems, usedItems, targetSum, results, amountGetter);
}
unusedItems.Push(item);
}
I've run it against 100k input that yields around 1k results in under 25 millis, so it should be able to handle your 740 case with ease.
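A hypothetical call against the question's sample data, assuming the amounts are whole numbers so they survive the int-based amountGetter:
var transactions = new[]
{
    new Transaction(0, 1, 100m), new Transaction(0, 2, 200m), new Transaction(0, 3, 300m),
    new Transaction(0, 4, 400m), new Transaction(0, 5, 500m)
};
// Finds the sets (1,2,3), (2,4) and (1,5) for a target of 600.
List<Transaction[]> matches = SubsetSums(transactions, 600, t => (int)t.Amount);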
For example, say I wanted to create a program that gives a person a discount based on how many items they buy. If they purchase 0-5 items they do not get a discount; if they purchase 5-10 items they get a 5% discount; if they purchase 10-20 items they get a 10% discount; and so forth. How can I use an array to sort this out instead of many "if" statements?
How about starting with a structure to store the bounds and discount:
public struct DiscountSpec
{
public int MinItems{get;set;}
public int MaxItems{get;set;}
public double Discount{get;set;}
}
Put them in an array:
DiscountSpec[] discounts = new DiscountSpec[]
{
new DiscountSpec(){MinItems=0,MaxItems=5,Discount=0},
new DiscountSpec(){MinItems=5,MaxItems=10,Discount=0.05},
new DiscountSpec(){MinItems=10,MaxItems=20,Discount=0.10},
};
And then the magic
int numItemsPurchased=7;
var discount = discounts.Where(
d => d.MinItems<numItemsPurchased && d.MaxItems>=numItemsPurchased)
.Select(d => d.Discount)
.FirstOrDefault();
Now, discount will contain either 0 (no discount) or 0.05 (5% discount) or 0.1 (10% discount). This can be extended with other discount brackets if need be.
Live example: http://rextester.com/YDOWS85239
You can maintain an array representing the ranges you have been given. I am talking about storing something like this:
say the array's name is array; then array[0] = 5, i.e. the maximum value of the first interval, array[1] = 10, the maximum value of the second interval, and so on. Since this array will only contain a small number of values, there is no performance issue with a linear search.
Now, if numberOfOrderedItems is less than array[i] you can break out of the loop and decide which discount to give, as sketched below.
If you are maintaining a huge number of discount intervals, go for binary search instead of linear search.
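A small sketch of that idea, using the brackets from the question (the 15% rate for anything above the last bound is made up):
// Upper bound of each interval and the discount that applies up to that bound.
int[] upperBounds = { 5, 10, 20 };
double[] discounts = { 0.0, 0.05, 0.10 };

double GetDiscount(int numberOfOrderedItems)
{
    for (int i = 0; i < upperBounds.Length; i++)
        if (numberOfOrderedItems <= upperBounds[i])
            return discounts[i];          // the first bound we fit under decides the rate
    return 0.15;                          // hypothetical rate above the last bound
}

// Array.BinarySearch(upperBounds, numberOfOrderedItems) could replace the loop
// if there were a huge number of intervals.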
Say I have a sorted list of 1000 or so unique decimals, arranged by value.
List<decimal> decList
How can I get a random x number of decimals from a list of unique decimals that total up to y?
private List<decimal> getWinningValues(int xNumberToGet, decimal yTotalValue)
{
}
Is there any way to avoid a long processing time on this? My idea so far is to take xNumberToGet random numbers from the pool. Something like (cool way to get random selection from a list)
foreach (decimal d in decList.OrderBy(x => randomInstance.Next()).Take(xNumberToGet))
{
}
Then I might check the total of those, and if the total is less, I might shift the numbers up (to the next available number) slowly. If the total is more, I might shift the numbers down. I'm still not sure how to implement this, or if there is a better design readily available. Any help would be much appreciated.
Ok, start with a little extension I got from this answer,
public static IEnumerable<IEnumerable<T>> Combinations<T>(
this IEnumerable<T> source,
int k)
{
if (k == 0)
{
return new[] { Enumerable.Empty<T>() };
}
return source.SelectMany((e, i) =>
source.Skip(i + 1).Combinations(k - 1)
.Select(c => (new[] { e }).Concat(c)));
}
this gives you a pretty efficient method to yield all the combinations with k members, without repetition, from a given IEnumerable. You could make good use of this in your implementation.
Bear in mind, if the IEnumerable and k are sufficiently large this could take some time, i.e. much longer than you have. So, I've modified your function to take a CancellationToken.
private static IEnumerable<decimal> GetWinningValues(
IEnumerable<decimal> allValues,
int numberToGet,
decimal targetValue,
CancellationToken canceller)
{
IList<decimal> currentBest = null;
var currentBestGap = decimal.MaxValue;
var locker = new object();
allValues.Combinations(numberToGet)
.AsParallel()
.WithCancellation(canceller)
.TakeWhile(c => currentBestGap != decimal.Zero)
.ForAll(c =>
{
var gap = Math.Abs(c.Sum() - targetValue);
if (gap < currentBestGap)
{
lock (locker)
{
currentBestGap = gap;
currentBest = c.ToList();
}
}
});
return currentBest;
}
I have an idea that you could sort the initial list and quit iterating the combinations at a certain point, when the sum must exceed the target. After some consideration, it's not trivial to identify that point, and the cost of checking may exceed the benefit. This benefit would have to be balanced against some function of the target value and the mean of the set.
I still think further optimization is possible but I also think that this work has already been done and I'd just need to look it up in the right place.
There are k such subsets of decList (k might be 0).
Assuming that you want to select each one with uniform probability 1/k, I think you basically need to do the following:
iterate over all the matching subsets
select one
Step 1 is potentially a big task, you can look into the various ways of solving the "subset sum problem" for a fixed subset size, and adapt them to generate each solution in turn.
Step 2 can be done either by making a list of all the solutions and choosing one, or (if that might take too much memory) by using the clever streaming random-selection algorithm (reservoir sampling; see the sketch at the end of this answer).
If your data is likely to have lots of such subsets, then generating them all might be incredibly slow. In that case you might try to identify groups of them at a time. You'd have to know the size of the group without visiting its members one by one, then you can choose which group to use weighted by its size, then you've reduced the problem to selecting one of that group at random.
If you don't need to select with uniform probability then the problem might become easier. At the best case, if you don't care about the distribution at all then you can return the first subset-sum solution you find -- whether you'd call that "at random" is another matter...
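A minimal sketch of that streaming selection (reservoir sampling of a single element), assuming the matching subsets from step 1 arrive as an IEnumerable<List<decimal>>:
// Picks one element uniformly at random from a stream without materializing it.
static List<decimal> PickUniformly(IEnumerable<List<decimal>> solutions, Random rng)
{
    List<decimal> chosen = null;
    int seen = 0;
    foreach (var solution in solutions)
    {
        seen++;
        if (rng.Next(seen) == 0)      // keep the newcomer with probability 1/seen
            chosen = solution;
    }
    return chosen;                    // null if there were no solutions at all
}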
Let's assume we have a large list of points, List<Point> pointList (already stored in memory), where each Point contains X, Y, and Z coordinates.
Now, I would like to select, for example, the N% of points with the biggest Z-values out of all points stored in pointList. Right now I'm doing it like this:
N = 0.05; // selecting only 5% of points
double cutoffValue = pointList
.OrderBy(p=> p.Z) // First bottleneck - creates sorted copy of all data
.ElementAt((int)(pointList.Count * (1 - N))).Z;
List<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue).ToList();
But I have two memory-usage bottlenecks here: the first during OrderBy (more important) and the second when selecting the points (less important, because we usually want to select only a small number of points).
Is there any way of replacing OrderBy (or maybe another way of finding this cutoff point) with something that uses less memory?
The problem is quite important, because LINQ copies the whole dataset, and for the big files I'm processing it sometimes hits a few hundred MB.
Write a method that iterates through the list once and maintains a set of the M largest elements. Each step will only require O(log M) work to maintain the set, and you can have O(M) memory and O(N log M) running time.
public static IEnumerable<TSource> TakeLargest<TSource, TKey>
(this IEnumerable<TSource> items, Func<TSource, TKey> selector, int count)
{
var set = new SortedDictionary<TKey, List<TSource>>();
var resultCount = 0;
var first = default(KeyValuePair<TKey, List<TSource>>);
foreach (var item in items)
{
// If the key is already smaller than the smallest
// item in the set, we can ignore this item
var key = selector(item);
if (first.Value == null ||
resultCount < count ||
Comparer<TKey>.Default.Compare(key, first.Key) >= 0)
{
// Add next item to set
if (!set.ContainsKey(key))
{
set[key] = new List<TSource>();
}
set[key].Add(item);
// Keep "first" pointing at the entry with the smallest key
first = set.First();
// Remove smallest item from set
resultCount++;
if (resultCount - first.Value.Count >= count)
{
set.Remove(first.Key);
resultCount -= first.Value.Count;
first = set.First();
}
}
}
return set.Values.SelectMany(values => values);
}
That will include more than count elements if there are ties, as your implementation does now.
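Usage against the question's variables would then look something like this (N and pointList as defined in the question):
List<Point> selectedPoints = pointList
    .TakeLargest(p => p.Z, (int)(pointList.Count * N))
    .ToList();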
You could sort the list in place, using List<T>.Sort, which uses the Quicksort algorithm. But of course, your original list would be sorted, which is perhaps not what you want...
pointList.Sort((a, b) => b.Z.CompareTo(a.Z));
var selectedPoints = pointList.Take((int)(pointList.Count * N)).ToList();
If you don't mind the original list being sorted, this is probably the best balance between memory usage and speed
You can use Indexed LINQ to put an index on the data you are processing. This can result in noticeable improvements in some cases.
If you combine the two there is a chance a little less work will be done:
var selectedPoints = pointList
.OrderByDescending(p => p.Z) // First bottleneck - creates sorted copy of all data
.Take((int)(pointList.Count * N));
But basically this kind of ranking requires sorting, your biggest cost.
A few more ideas:
if you use a class Point (instead of a struct Point) there will be much less copying.
you could write a custom sort that only bothers to move the top 5% up. Something like (don't laugh) BubbleSort.
If your list is in memory already, I would sort it in place instead of making a copy - unless you need it un-sorted again, that is, in which case you'll have to weigh having two copies in memory against loading it again from storage:
pointList.Sort((x,y) => y.Z.CompareTo(x.Z)); //this should sort it in desc. order
Also, not sure how much it will help, but it looks like you're going through your list twice - once to find the cutoff value, and once again to select them. I assume you're doing that because you want to let all ties through, even if it means selecting more than 5% of the points. However, since they're already sorted, you can use that to your advantage and stop when you're finished.
double cutoffValue = pointList[(int)(pointList.Count * N)].Z;
List<Point> selectedPoints = pointList.TakeWhile(p => p.Z >= cutoffValue)
.ToList();
Unless your list is extremely large, it's much more likely to me that cpu time is your performance bottleneck. Yes, your OrderBy() might use a lot of memory, but it's generally memory that for the most part is otherwise sitting idle. The cpu time really is the bigger concern.
To improve cpu time, the most obvious thing here is to not use a list. Use an IEnumerable instead. You do this by simply not calling .ToList() at the end of your where query. This will allow the framework to combine everything into one iteration of the list that runs only as needed. It will also improve your memory use because it avoids loading the entire query into memory at once, and instead defers it to only load one item at a time as needed. Also, use .Take() rather than .ElementAt(). It's a lot more efficient.
double N = 0.05; // selecting only 5% of points
int count = (int)(N * pointList.Count);
var selectedPoints = pointList.OrderByDescending(p => p.Z).Take(count);
That out of the way, there are three cases where memory use might actually be a problem:
Your collection really is so large as to fill up memory. For a simple Point structure on a modern system we're talking millions of items. This is really unlikely. On the off chance you have a system this large, your solution is to use a relational database, which can keep these items on disk relatively efficiently.
You have a moderate size collection, but there are external performance constraints, such as needing to share system resources with many other processes as you might find in an asp.net web site. In this case, the answer is either to 1) again put the points in a relational database or 2) offload the work to the client machines.
Your collection is just large enough to end up on the Large Object Heap, and the buffer allocated internally by the OrderBy() call is also placed on the LOH. Now what happens is that the garbage collector will not properly compact memory after your OrderBy() call, and over time you get a lot of memory that is not used but is still reserved by your program. In this case, the solution is, unfortunately, to break your collection up into multiple groups that are each individually small enough not to trigger use of the LOH.
Update:
Reading through your question again, I see you're reading very large files. In that case, the best performance can be obtained by writing your own code to parse the files. If the count of items is stored near the top of the file you can do much better, or even if you can estimate the number of records based on the size of the file (guess a little high to be sure, and then truncate any extras after finishing), you can then build your final collection as you read. This will greatly improve CPU performance and memory use.
I'd do it by implementing "half" a quicksort.
Consider your original set of points, P, where you are looking for the "top" N items by Z coordinate.
Choose a pivot x in P.
Partition P into L = {y in P | y < x} and U = {y in P | x <= y}.
If N = |U| then you're done.
If N < |U| then recurse with P := U.
Otherwise you need to add some items to U: recurse with N := N - |U|, P := L to add the remaining items.
If you choose your pivot wisely (e.g., median of, say, five random samples) then this will run in O(n log n) time.
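A compact sketch of that recursion over the Z values, written with simple middle-element pivoting rather than a median of samples, so treat it as illustrative only (Point and pointList are the types from the question, with Z assumed to be a double):
// Returns the n points with the largest Z, in no particular order.
static List<Point> TopNByZ(List<Point> points, int n)
{
    var p = new List<Point>(points);              // work on a copy of the references
    var result = new List<Point>();
    while (n > 0 && p.Count > 0)
    {
        double pivot = p[p.Count / 2].Z;          // crude pivot choice
        var greater = p.Where(x => x.Z > pivot).ToList();
        var equal = p.Where(x => x.Z == pivot).ToList();
        var less = p.Where(x => x.Z < pivot).ToList();
        if (n >= greater.Count + equal.Count)
        {
            result.AddRange(greater);             // everything at or above the pivot is selected
            result.AddRange(equal);
            n -= greater.Count + equal.Count;
            p = less;                             // still need more items from below the pivot
        }
        else if (n >= greater.Count)
        {
            result.AddRange(greater);
            result.AddRange(equal.Take(n - greater.Count));  // ties: take just enough
            n = 0;
        }
        else
        {
            p = greater;                          // the answer lies strictly above the pivot
        }
    }
    return result;
}
// e.g. var selectedPoints = TopNByZ(pointList, (int)(pointList.Count * N));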
Hmmmm, thinking some more, you may be able to avoid creating new sets altogether, since essentially you're just looking for an O(n log n) way of finding the Nth greatest item from the original set. Yes, I think this would work, so here's suggestion number 2:
Make a traversal of P, finding the least and greatest items, A and Z, respectively.
Let M be the mean of A and Z (remember, we're only considering Z coordinates here).
Count how many items there are in the range [M, Z], call this Q.
If Q < N then the Nth greatest item in P is somewhere in [A, M). Try M := (A + M)/2.
If N < Q then the Nth greatest item in P is somewhere in [M, Z]. Try M := (M + Z)/2.
Repeat until we find an M such that Q = N.
Now traverse P, pulling out all the items greater than or equal to M; those are your result.
That's definitely O(n log n) and creates no extra data structures (except for the result).
Howzat?
You might use something like this:
pointList.Sort(); // Use your own comparer here if needed
// Skip OrderBy because the list is sorted (and not copied)
double cutoffValue = pointList.ElementAt((int)(pointList.Count * (1 - N))).Z;
// Skip ToList to avoid another copy of the list
IEnumerable<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue);
If you want a small percentage of points ordered by some criterion, you'll be better served using a Priority queue data structure; create a size-limited queue(with the size set to however many elements you want), and then just scan through the list inserting every element. After the scan, you can pull out your results in sorted order.
This has the benefit of being O(n log p) instead of O(n log n) where p is the number of points you want, and the extra storage cost is also dependent on your output size instead of the whole list.
int resultSize = (int)(pointList.Count * N);
FixedSizedPriorityQueue<Point> q =
new FixedSizedPriorityQueue<Point>(resultSize, p => p.Z);
q.AddEach(pointList);
List<Point> selectedPoints = q.ToList();
Now all you have to do is implement a FixedSizedPriorityQueue that adds elements one at a time and discards the smallest element when it is full, so that only the largest resultSize elements survive.
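A minimal sketch of such a queue, built on a SortedSet with a sequence number to keep tied keys distinct. The class name, the (size, key selector) constructor, AddEach and ToList are taken from the snippet above; everything else, including Z being a double, is an assumption:
class FixedSizedPriorityQueue<T>
{
    private readonly int _capacity;
    private readonly Func<T, double> _keySelector;
    private readonly SortedSet<(double Key, long Seq, T Item)> _set =
        new SortedSet<(double Key, long Seq, T Item)>(
            Comparer<(double Key, long Seq, T Item)>.Create(
                (a, b) => a.Key != b.Key ? a.Key.CompareTo(b.Key) : a.Seq.CompareTo(b.Seq)));
    private long _seq;

    public FixedSizedPriorityQueue(int capacity, Func<T, double> keySelector)
    {
        _capacity = capacity;
        _keySelector = keySelector;
    }

    public void Add(T item)
    {
        _set.Add((_keySelector(item), _seq++, item));  // the sequence number makes equal keys distinct
        if (_set.Count > _capacity)
            _set.Remove(_set.Min);                     // evict the smallest key once over capacity
    }

    public void AddEach(IEnumerable<T> items) { foreach (var item in items) Add(item); }

    public List<T> ToList() => _set.Select(e => e.Item).ToList();  // ascending by key
}
Add and the eviction are both O(log p), which is where the O(n log p) total mentioned above comes from.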
You wrote that you are working with a DataSet. If so, you can use a DataView to sort your data once and reuse it for all subsequent access to the rows.
I just tried it with 50,000 rows, accessing a fraction of them (Rows.Count / 30) 100 times. My performance results are:
Sort With Linq: 5.3 seconds
Use DataViews: 0.01 seconds
Give it a try.
using System;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Linq;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class UnitTest1 {
class MyTable : TypedTableBase<MyRow> {
public MyTable() {
Columns.Add("Col1", typeof(int));
Columns.Add("Col2", typeof(int));
}
protected override DataRow NewRowFromBuilder(DataRowBuilder builder) {
return new MyRow(builder);
}
}
class MyRow : DataRow {
public MyRow(DataRowBuilder builder) : base(builder) {
}
public int Col1 { get { return (int)this["Col1"]; } }
public int Col2 { get { return (int)this["Col2"]; } }
}
DataView _viewCol1Asc;
DataView _viewCol2Desc;
MyTable _table;
int _countToTake;
[TestMethod]
public void MyTestMethod() {
_table = new MyTable();
int count = 50000;
for (int i = 0; i < count; i++) {
_table.Rows.Add(i, i);
}
_countToTake = _table.Rows.Count / 30;
Console.WriteLine("SortWithLinq");
RunTest(SortWithLinq);
Console.WriteLine("Use DataViews");
RunTest(UseSortedDataViews);
}
private void RunTest(Action method) {
int iterations = 100;
Stopwatch watch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++) {
method();
}
watch.Stop();
Console.WriteLine(" {0}", watch.Elapsed);
}
private void UseSortedDataViews() {
if (_viewCol1Asc == null) {
_viewCol1Asc = new DataView(_table, null, "Col1 ASC", DataViewRowState.Unchanged);
_viewCol2Desc = new DataView(_table, null, "Col2 DESC", DataViewRowState.Unchanged);
}
var rows = _viewCol1Asc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
IterateRows(rows);
rows = _viewCol2Desc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
IterateRows(rows);
}
private void SortWithLinq() {
var rows = _table.OrderBy(row => row.Col1).Take(_countToTake);
IterateRows(rows);
rows = _table.OrderByDescending(row => row.Col2).Take(_countToTake);
IterateRows(rows);
}
private void IterateRows(IEnumerable<MyRow> rows) {
foreach (var row in rows)
if (row == null)
throw new Exception("????");
}
}