Subset Sum algorithm efficiency - c#

We have a number of payments (Transaction) that come into our business each day. Each Transaction has an ID and an Amount. We have the requirement to match a number of these transactions to a specific amount. Example:
Transaction Amount
1 100
2 200
3 300
4 400
5 500
If we wanted to find the transactions that add up to 600 you would have a number of sets (1,2,3),(2,4),(1,5).
I found an algorithm that I have adapted, that works as defined below. For 30 transactions it takes 15ms. But the number of transactions average around 740 and have a maximum close to 6000. Is the a more efficient way to perform this search?
sum_up(TransactionList, remittanceValue, ref MatchedLists);
private static void sum_up(List<Transaction> transactions, decimal target, ref List<List<Transaction>> matchedLists)
{
sum_up_recursive(transactions, target, new List<Transaction>(), ref matchedLists);
}
private static void sum_up_recursive(List<Transaction> transactions, decimal target, List<Transaction> partial, ref List<List<Transaction>> matchedLists)
{
decimal s = 0;
foreach (Transaction x in partial) s += x.Amount;
if (s == target)
{
matchedLists.Add(partial);
}
if (s > target)
return;
for (int i = 0; i < transactions.Count; i++)
{
List<Transaction> remaining = new List<Transaction>();
Transaction n = new Transaction(0, transactions[i].ID, transactions[i].Amount);
for (int j = i + 1; j < transactions.Count; j++) remaining.Add(transactions[j]);
List<Transaction> partial_rec = new List<Transaction>(partial);
partial_rec.Add(new Transaction(n.MatchNumber, n.ID, n.Amount));
sum_up_recursive(remaining, target, partial_rec, ref matchedLists);
}
}
With Transaction defined as:
class Transaction
{
public int ID;
public decimal Amount;
public int MatchNumber;
public Transaction(int matchNumber, int id, decimal amount)
{
ID = id;
Amount = amount;
MatchNumber = matchNumber;
}
}

As already mentioned your problem can be solved by pseudo polynomial algorithm in O(n*G) with n - number of items and G - your targeted sum.
The first part question: is it possible to achieve the targeted sum G. The following pseudo/python code solves it (have no C# on my machine):
def subsum(values, target):
reached=[False]*(target+1) # initialize as no sums reached at all
reached[0]=True # with 0 elements we can only achieve the sum=0
for val in values:
for s in reversed(xrange(target+1)): #for target, target-1,...,0
if reached[s] and s+val<=target: # if subsum=s can be reached, that we can add the current value to this sum and build an new sum
reached[s+val]=True
return reached[target]
What is the idea? Let's consider values [1,2,3,6] and target sum 7:
We start with an empty set - the possible sum is obviously 0.
Now we look at the first element 1 and have to options to take or not to take. That leaves as with possible sums {0,1}.
Now looking at the next element 2: leads to possible sets {0,1} (not taking)+{2,3} (taking).
Until now not much difference to your approach, but now for element 3 we have possible sets a. for not taking {0,1,2,3} and b. for taking {3,4,5,6} resulting in {0,1,2,3,4,5,6} as possible sums. The difference to your approach is that there are two way to get to 3 and your recursion will be started twice from that (which is not needed). Calculating basically the same staff over and over again is the problem of your approach and why the proposed algorithm is better.
As last step we consider 6 and get {0,1,2,3,4,5,6,7} as possible sums.
But you also need the subset which leads to the targeted sum, for this we just remember which element was taken to achieve the current sub sum. This version returns a subset which results in the target sum or None otherwise:
def subsum(values, target):
reached=[False]*(target+1)
val_ids=[-1]*(target+1)
reached[0]=True # with 0 elements we can only achieve the sum=0
for (val_id,val) in enumerate(values):
for s in reversed(xrange(target+1)): #for target, target-1,...,0
if reached[s] and s+val<=target:
reached[s+val]=True
val_ids[s+val]=val_id
#reconstruct the subset for target:
if not reached[target]:
return None # means not possible
else:
result=[]
current=target
while current!=0:# search backwards jumping from predecessor to predecessor
val_id=val_ids[current]
result.append(val_id)
current-=values[val_id]
return result
As an another approach you could use memoization to speed up your current solution remembering for the state (subsum, number_of_elements_not considered) whether it is possible to achieve the target sum. But I would say the standard dynamic programming is a less error prone possibility here.

Yes.
I can't provide full code at the moment, but instead of iterating each list of transactions twice until finding matches (O squared), try this concept:
setup a hashtable with the existing transaction amounts as entries, as well as the summation of each set of two transactions assuming each value is made of a max of two transactions (weekend credit card processing).
for each total, reference into the hashtable - the sets of transactions in that slot are the list of matching transactions.
Instead of O^2, you can get it down to 4*O, which would make a noticeable difference in speed.
Good luck!

Dynamic programming can solve this problem efficiently:
Assume you have n transactions and the max amount of transactions is m.
we can solve it just in the complexity of O(nm).
learn it at Knapsack problem.
for this problem we can define for pre i transactions the numbers of subset, add up to sum: dp[i][sum].
the equation:
for i 1 to n:
dp[i][sum] = dp[i - 1][sum - amount_i]
the dp[n][sum] is the numbers of you need, and you need to add some tricks to get what are all the subsets.
Blockquote

You have a couple of practical assumptions here that would make brute force with smartish branch pruning feasible:
items are unique, hence you wouldn't be getting combinatorial blow up of valid subsets (i.e. (1,1,1,1,1,1,1,1,1,1,1,1,1) adding up to 3)
if the number of resulting feasible sets is still huge, you would run out of memory collecting them before running into total runtime issues.
ordering input ascending would allow for an easy early stop check - if your remaining sum is smaller then the current element, then none of the yet unexamined items could possibly be in a result (as current and subsequent items would only get bigger)
keeping running sums would speed up each step, as you wouldn't be recalculating it over and over again
Here's a bit of code:
public static List<T[]> SubsetSums<T>(T[] items, int target, Func<T, int> amountGetter)
{
Stack<T> unusedItems = new Stack<T>(items.OrderByDescending(amountGetter));
Stack<T> usedItems = new Stack<T>();
List<T[]> results = new List<T[]>();
SubsetSumsRec(unusedItems, usedItems, target, results, amountGetter);
return results;
}
public static void SubsetSumsRec<T>(Stack<T> unusedItems, Stack<T> usedItems, int targetSum, List<T[]> results, Func<T,int> amountGetter)
{
if (targetSum == 0)
results.Add(usedItems.ToArray());
if (targetSum < 0 || unusedItems.Count == 0)
return;
var item = unusedItems.Pop();
int currentAmount = amountGetter(item);
if (targetSum >= currentAmount)
{
// case 1: use current element
usedItems.Push(item);
SubsetSumsRec(unusedItems, usedItems, targetSum - currentAmount, results, amountGetter);
usedItems.Pop();
// case 2: skip current element
SubsetSumsRec(unusedItems, usedItems, targetSum, results, amountGetter);
}
unusedItems.Push(item);
}
I've run it against 100k input that yields around 1k results in under 25 millis, so it should be able to handle your 740 case with ease.

Related

How can I create a random number generator that can not generate the last value it generated?

I want to create a random number generator that can not return the last value it generated.
So far I have created a simple generator using
Random.Range(0, 4),
however this could return a value two times in a row, which is unwanted for my project.
How can I improve on this code?
What you should do is create an integer representing the last random number returned by the number generator. Then you can run the generator in a while loop such that it never returns the same number it returned in the previous run.
int lastRandomNumber = -1;
int GiveRandomNumber()
{
int randomNumber = lastRandomNumber;
while(randomNumber==lastRandomNumber)
randomNumber = Random.Range(0, 4);
lastRandomNumber = randomNumber;
return randomNumber;
}
There are several methods which can generate uniformly distributed values without repeating two values in a row. Disclaimer: I’m using pseudo-code below because don’t program C# — I’m interested in this question based on my 40+ year career using random numbers in the context of computer simulation modeling.
The first one is to use acceptance/rejection techniques, as described by obieFM. The following assumes that previousValue is accessible in the function and that the range arguments are [inclusive, exclusive). To generate from a set containing k distinct values:
previousValue = Random.Range(0, k)
function nonRepeatedRandom {
newValue = Random.Range(0, k)
while newValue == previousValue {
newValue = Random.Range(0, k)
}
previousValue = newValue
return newValue
}
The probability of terminating the loop on any given iteration is (k-1)/k. If the calls to Random.Range() produce independent values then the number of iterations has a Geometric distribution, and the average number of iterations is k/(k-1). The worst-case average is when k is 2, and will take an average of 2 iterations to generate each value. As k increases, the average number of iterations quickly converges towards 1. While there is no theoretical upper bound for the number of iterations, the probability of requiring more than n iterations is (1/k)n — you have to have duplicated the previousValue, with probability 1/k each time, n times in a row. For example, if k is 4 then the probability of needing more than 50 iterations to generate a value is 2-100, a ridiculously small number — your odds of getting hit by lightning and then later killed by a meteor are about 18 orders of magnitude higher than this.
But fear not, my friends, if you have a morbid fear of probabilistic algorithms there’s a simple deterministic alternative available:
previousValue = Random.Range(0, k)
function nonRepeatedRandom {
newValue = Random.Range(0, k-1)
if newValue >= previousValue {
newValue = newValue + 1
}
previousValue = newValue
return newValue
}
For each value after the first one, you want to generate from a set of k-1 eligible values. We accept values below previousValue as-is, and promote values in the range [previousValue, k-2] by 1. The result spans the range [0, k-1], inclusive, excluding previousValue. It's trivial to extend this to work for non-zero lower bounds on the range.
If the values to be generated without sequential duplicates are non-contiguous numeric values or categorical values, enumerate them in an array and use the deterministic algorithm given above to generate the index of the next value to be returned.
You could also use shuffling of array elements or swapping the most recent value to the end of the array, but I think the mechanisms already provided are more efficient and effective for sequential numerical ranges so I won't go into further detail on these alternatives.
To code something more elegant and avoid an unwanted loop, you'll need to make an array where you'll remove the last number.
If you want to simply remove one number which is the last random one, you can either reset (if the list is small enough like 0-4) or re-add the number in your list if it's very big.
In other words, if you want to re-create a list each time, it should look something like this:
var rand = new Random();
int[] randoms = new int[5];
int lastRandom = 2; // For exemple/start
int iter = 0;
for (int i = 0; i < randoms.Length; i++)
{
if (iter == lastRandom) iter++;
randoms[i] = iter++;
}
int newRandom = randoms[rand.Next(randoms.Length)];
Console.WriteLine(newRandom);
It may look overkill, but it's the most flexible and best way to do it. As like you said, if we simply re-roll the random, it can roll on it over and over which can make some lag (on large scale).

How to best implement K-nearest neighbours in C# for large number of dimensions?

I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples each, and 25 dimensions.
There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation :
// testSamples and trainSamples consists of about 20k vectors each with 25 dimensions
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int[]> trainClasses, int K)
{
Console.WriteLine("Performing KNN with K = "+K);
var testResults = new int[testSamples.Count()];
var testNumber = testSamples.Count();
var trainNumber = trainSamples.Count();
// Declaring these here so that I don't have to 'new' them over and over again in the main loop,
// just to save some overhead
var distances = new double[trainNumber][];
for (var i = 0; i < trainNumber; i++)
{
distances[i] = new double[2]; // Will store both distance and index in here
}
// Performing KNN ...
for (var tst = 0; tst < testNumber; tst++)
{
// For every test sample, calculate distance from every training sample
Parallel.For(0, trainNumber, trn =>
{
var dist = GetDistance(testSamples[tst], trainSamples[trn]);
// Storing distance as well as index
distances[trn][0] = dist;
distances[trn][1] = trn;
});
// Sort distances and take top K (?What happens in case of multiple points at the same distance?)
var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);
// Do a 'majority vote' to classify test sample
var yea = 0.0;
var nay = 0.0;
foreach (var voter in votingDistances)
{
if (trainClasses[(int)voter[1]] == 1)
yea++;
else
nay++;
}
if (yea > nay)
testResults[tst] = 1;
else
testResults[tst] = 0;
}
return testResults;
}
// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
var distance = 0.0;
// assume sample1 and sample2 are valid i.e. same length
for (var i = 0; i < sample1.Count; i++)
{
var temp = sample1[i] - sample2[i];
distance += temp * temp;
}
return distance;
}
This takes quite a bit of time to execute. On my system it takes about 80 seconds to complete. How can I optimize this, while ensuring that it would also scale to larger number of data samples? As you can see, I've tried using PLINQ and parallel for loops, which did help (without these, it was taking about 120 seconds). What else can I do?
I've read about KD-trees being efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.
I also found this stackoverflow discussion about this, but it seems like this is 3 years old, and I was hoping that someone would know about better solutions to this problem by now.
I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.
Edited to add - I cannot reduce the number of dimensions using PCA or something. For this particular model, 25 dimensions are required.
Whenever you are attempting to improve the performance of code, the first step is to analyze the current performance to see exactly where it is spending its time. A good profiler is crucial for this. In my previous job I was able to use the dotTrace profiler to good effect; Visual Studio also has a built-in profiler. A good profiler will tell you exactly where you code is spending time method-by-method or even line-by-line.
That being said, a few things come to mind in reading your implementation:
You are parallelizing some inner loops. Could you parallelize the outer loop instead? There is a small but nonzero cost associated to a delegate call (see here or here) which may be hitting you in the "Parallel.For" callback.
Similarly there is a small performance penalty for indexing through an array using its IList interface. You might consider declaring the array arguments to "GetDistance()" explicitly.
How large is K as compared to the size of the training array? You are completely sorting the "distances" array and taking the top K, but if K is much smaller than the array size it might make sense to use a partial sort / selection algorithm, for instance by using a SortedSet and replacing the smallest element when the set size exceeds K.

Compare Two Ordered Lists in C#

The problem is that I have two lists of strings. One list is an approximation of the other list, and I need some way of measuring the accuracy of the approximation.
As a makeshift way of scoring the approximation, I have bucketed each list(the approximation and the answer) into 3 partitions (high, medium low) after sorting based on a numeric value that corresponds to the string. I then compare all of the elements in the approximation to see if the string exists in the same partition of the correct list.
I sum the number of correctly classified strings and divide it by the total number of strings. I understand that this is a very crude way of measuring the accuracy of the estimate, and was hoping that better alternatives were available. This is a very small component of a larger piece of work, and was hoping to not have to reinvent the wheel.
EDIT:
I think I wasn't clear enough. I don't need the two lists to be exactly equal, I need some sort of measure that shows the lists are similar. For example, The High-Medium-Low (H-M-L) approach we have taken shows that the estimated list is sufficiently similar. The downside of this approach is that if the estimated list has an item at the bottom of the "High" bracket, and in the actual list, the item is at the top of the medium set, the score algorithm fails to deliver.
It could potentially be that in addition to the H-M-L approach, the bottom 20% of each partition is compared to the top 20% of the next partition or something along those lines.
Thanks all for your help!!
So, we're taking a sequence of items and grouping it into partitions with three categories, high, medium, and low. Let's first make an object to represent those three partitions:
public class Partitions<T>
{
public IEnumerable<T> High { get; set; }
public IEnumerable<T> Medium { get; set; }
public IEnumerable<T> Low { get; set; }
}
Next to make an estimate we want to take two of these objects, one for the actual and one for the estimate. For each priority level we want to see how many of the items are in both collections; this is an "Intersection"; we want to sum up the counts of the intersection of each set.
Then just divide that count by the total:
public static double EstimateAccuracy<T>(Partitions<T> actual
, Partitions<T> estimate)
{
int correctlyCategorized =
actual.High.Intersect(estimate.High).Count() +
actual.Medium.Intersect(estimate.Medium).Count() +
actual.Low.Intersect(estimate.Low).Count();
double total = actual.High.Count()+
actual.Medium.Count()+
actual.Low.Count();
return correctlyCategorized / total;
}
Of course, if we generalize this to not 3 priorities, but rather a sequence of sequences, in which each sequence corresponds to some bucket (i.e. there are N buckets, not just 3) the code actually gets easier:
public static double EstimateAccuracy<T>(
IEnumerable<IEnumerable<T>> actual
, IEnumerable<IEnumerable<T>> estimate)
{
var query = actual.Zip(estimate, (a, b) => new
{
valid = a.Intersect(b).Count(),
total = a.Count()
}).ToList();
return query.Sum(pair => pair.valid) /
(double)query.Sum(pair => pair.total);
}
Nice question. Well, I think you could use the following method to compare your lists:
public double DetermineAccuracyPercentage(int numberOfEqualElements, int yourListsLength)
{
return ((double)numberOfEqualElements / (double)yourListsLength) * 100.0;
}
The number returned should determine how much equality exists between your two lists.
If numberOfEqualElements = yourLists.Length (Count) so they are absolutely equal.
The accuracy of the approximation = (numberOfEqualElements / yourLists.Length)
1 = completely equal , 0 = absolutely different, and the values between 0 and 1 determine the level of equality. In my sample a percentage.
If you compare these 2 lists, you will retrieve a 75% of equality, the same that 3 of 4 equal elements (3/4).
IList<string> list1 = new List<string>();
IList<string> list2 = new List<string>();
list1.Add("Dog");
list1.Add("Cat");
list1.Add("Fish");
list1.Add("Bird");
list2.Add("Dog");
list2.Add("Cat");
list2.Add("Fish");
list2.Add("Frog");
int resultOfComparing = list1.Intersect(list2).Count();
double accuracyPercentage = DetermineAccuracyPercentage(resultOfComparing, list1.Count);
I hope it helps.
I would take both List<String>s and combine each element into a IEnumerable<Boolean>:
public IEnumerable<Boolean> Combine<Ta, Tb>(List<Ta> seqA, List<Tb> seqB)
{
if (seqA.Count != seqB.Count)
throw new ArgumentException("Lists must be the same size...");
for (int i = 0; i < seqA.Count; i++)
yield return seqA[i].Equals(seqB[i]));
}
And then use Aggregate() to verify which strings match and keep a running total:
var result = Combine(a, b).Aggregate(0, (acc, t)=> t ? acc + 1 : acc) / a.Count;

How can I get a random x number of decimals from a list of unique decimals that total up to y?

Say I have a sorted list of 1000 or so unique decimals, arranged by value.
List<decimal> decList
How can I get a random x number of decimals from a list of unique decimals that total up to y?
private List<decimal> getWinningValues(int xNumberToGet, decimal yTotalValue)
{
}
Is there any way to avoid a long processing time on this? My idea so far is to take xNumberToGet random numbers from the pool. Something like (cool way to get random selection from a list)
foreach (decimal d in decList.OrderBy(x => randomInstance.Next())Take(xNumberToGet))
{
}
Then I might check the total of those, and if total is less, i might shift the numbers up (to the next available number) slowly. If the total is more, I might shift the numbers down. I'm still now sure how to implement or if there is a better design readily available. Any help would be much appreciated.
Ok, start with a little extension I got from this answer,
public static IEnumerable<IEnumerable<T>> Combinations<T>(
this IEnumerable<T> source,
int k)
{
if (k == 0)
{
return new[] { Enumerable.Empty<T>() };
}
return source.SelectMany((e, i) =>
source.Skip(i + 1).Combinations(k - 1)
.Select(c => (new[] { e }).Concat(c)));
}
this gives you a pretty efficient method to yield all the combinations with k members, without repetition, from a given IEnumerable. You could make good use of this in your implementation.
Bear in mind, if the IEnumerable and k are sufficiently large this could take some time, i.e. much longer than you have. So, I've modified your function to take a CancellationToken.
private static IEnumerable<decimal> GetWinningValues(
IEnumerable<decimal> allValues,
int numberToGet,
decimal targetValue,
CancellationToken canceller)
{
IList<decimal> currentBest = null;
var currentBestGap = decimal.MaxValue;
var locker = new object();
allValues.Combinations(numberToGet)
.AsParallel()
.WithCancellation(canceller)
.TakeWhile(c => currentBestGap != decimal.Zero)
.ForAll(c =>
{
var gap = Math.Abs(c.Sum() - targetValue);
if (gap < currentBestGap)
{
lock (locker)
{
currentBestGap = gap;
currentBest = c.ToList();
}
}
}
return currentBest;
}
I've an idea that you could sort the initial list and quit iterating the combinations at a certain point, when the sum must exceed the target. After some consideration, its not trivial to identify that point and, the cost of checking may exceed the benefit. This benefit would have to be balanced agaist some function of the target value and mean of the set.
I still think further optimization is possible but I also think that this work has already been done and I'd just need to look it up in the right place.
There are k such subsets of decList (k might be 0).
Assuming that you want to select each one with uniform probability 1/k, I think you basically need to do the following:
iterate over all the matching subsets
select one
Step 1 is potentially a big task, you can look into the various ways of solving the "subset sum problem" for a fixed subset size, and adapt them to generate each solution in turn.
Step 2 can be done either by making a list of all the solutions and choosing one or (if that might take too much memory) by using the clever streaming random selection algorithm.
If your data is likely to have lots of such subsets, then generating them all might be incredibly slow. In that case you might try to identify groups of them at a time. You'd have to know the size of the group without visiting its members one by one, then you can choose which group to use weighted by its size, then you've reduced the problem to selecting one of that group at random.
If you don't need to select with uniform probability then the problem might become easier. At the best case, if you don't care about the distribution at all then you can return the first subset-sum solution you find -- whether you'd call that "at random" is another matter...

How to avoid OrderBy - memory usage problems

Let's assume we have a large list of points List<Point> pointList (already stored in memory) where each Point contains X, Y, and Z coordinate.
Now, I would like to select for example N% of points with biggest Z-values of all points stored in pointList. Right now I'm doing it like that:
N = 0.05; // selecting only 5% of points
double cutoffValue = pointList
.OrderBy(p=> p.Z) // First bottleneck - creates sorted copy of all data
.ElementAt((int) pointList.Count * (1 - N)).Z;
List<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue).ToList();
But I have here two memory usage bottlenecks: first during OrderBy (more important) and second during selecting the points (this is less important, because we usually want to select only small amount of points).
Is there any way of replacing OrderBy (or maybe other way of finding this cutoff point) with something that uses less memory?
The problem is quite important, because LINQ copies the whole dataset and for big files I'm processing it sometimes hits few hundreds of MBs.
Write a method that iterates through the list once and maintains a set of the M largest elements. Each step will only require O(log M) work to maintain the set, and you can have O(M) memory and O(N log M) running time.
public static IEnumerable<TSource> TakeLargest<TSource, TKey>
(this IEnumerable<TSource> items, Func<TSource, TKey> selector, int count)
{
var set = new SortedDictionary<TKey, List<TSource>>();
var resultCount = 0;
var first = default(KeyValuePair<TKey, List<TSource>>);
foreach (var item in items)
{
// If the key is already smaller than the smallest
// item in the set, we can ignore this item
var key = selector(item);
if (first.Value == null ||
resultCount < count ||
Comparer<TKey>.Default.Compare(key, first.Key) >= 0)
{
// Add next item to set
if (!set.ContainsKey(key))
{
set[key] = new List<TSource>();
}
set[key].Add(item);
if (first.Value == null)
{
first = set.First();
}
// Remove smallest item from set
resultCount++;
if (resultCount - first.Value.Count >= count)
{
set.Remove(first.Key);
resultCount -= first.Value.Count;
first = set.First();
}
}
}
return set.Values.SelectMany(values => values);
}
That will include more than count elements if there are ties, as your implementation does now.
You could sort the list in place, using List<T>.Sort, which uses the Quicksort algorithm. But of course, your original list would be sorted, which is perhaps not what you want...
pointList.Sort((a, b) => b.Z.CompareTo(a.Z));
var selectedPoints = pointList.Take((int)(pointList.Count * N)).ToList();
If you don't mind the original list being sorted, this is probably the best balance between memory usage and speed
You can use Indexed LINQ to put index on the data which you are processing. This can result in noticeable improvements in some cases.
If you combine the two there is a chance a little less work will be done:
List<Point> selectedPoints = pointList
.OrderByDescending(p=> p.Z) // First bottleneck - creates sorted copy of all data
.Take((int) pointList.Count * N);
But basically this kind of ranking requires sorting, your biggest cost.
A few more ideas:
if you use a class Point (instead of a struct Point) there will be much less copying.
you could write a custom sort that only bothers to move the top 5% up. Something like (don't laugh) BubbleSort.
If your list is in memory already, I would sort it in place instead of making a copy - unless you need it un-sorted again, that is, in which case you'll have to weigh having two copies in memory vs loading it again from storage):
pointList.Sort((x,y) => y.Z.CompareTo(x.Z)); //this should sort it in desc. order
Also, not sure how much it will help, but it looks like you're going through your list twice - once to find the cutoff value, and once again to select them. I assume you're doing that because you want to let all ties through, even if it means selecting more than 5% of the points. However, since they're already sorted, you can use that to your advantage and stop when you're finished.
double cutoffValue = pointlist[(int) pointList.Length * (1 - N)].Z;
List<point> selectedPoints = pointlist.TakeWhile(p => p.Z >= cutoffValue)
.ToList();
Unless your list is extremely large, it's much more likely to me that cpu time is your performance bottleneck. Yes, your OrderBy() might use a lot of memory, but it's generally memory that for the most part is otherwise sitting idle. The cpu time really is the bigger concern.
To improve cpu time, the most obvious thing here is to not use a list. Use an IEnumerable instead. You do this by simply not calling .ToList() at the end of your where query. This will allow the framework to combine everything into one iteration of the list that runs only as needed. It will also improve your memory use because it avoids loading the entire query into memory at once, and instead defers it to only load one item at a time as needed. Also, use .Take() rather than .ElementAt(). It's a lot more efficient.
double N = 0.05; // selecting only 5% of points
int count = (1-N) * pointList.Count;
var selectedPoints = pointList.OrderBy(p=>p.Z).Take(count);
That out of the way, there are three cases where memory use might actually be a problem:
Your collection really is so large as to fill up memory. For a simple Point structure on a modern system we're talking millions of items. This is really unlikely. On the off chance you have a system this large, your solution is to use a relational database, which can keep this items on disk relatively efficiently.
You have a moderate size collection, but there are external performance constraints, such as needing to share system resources with many other processes as you might find in an asp.net web site. In this case, the answer is either to 1) again put the points in a relational database or 2) offload the work to the client machines.
Your collection is just large enough to end up on the Large Object Heap, and the HashSet used in the OrderBy() call is also placed on the LOH. Now what happens is that the garbage collector will not properly compact memory after your OrderBy() call, and over time you get a lot of memory that is not used but still reserved by your program. In this case, the solution is, unfortunately, to break your collection up into multiple groups that are each individually small enough not to trigger use of the LOH.
Update:
Reading through your question again, I see you're reading very large files. In that case, the best performance can be obtained by writing your own code to parse the files. If the count of items is stored near the top of the file you can do much better, or even if you can estimate the number of records based on the size of the file (guess a little high to be sure, and then truncate any extras after finishing), you can then build your final collection as your read. This will greatly improve cpu performance and memory use.
I'd do it by implementing "half" a quicksort.
Consider your original set of points, P, where you are looking for the "top" N items by Z coordinate.
Choose a pivot x in P.
Partition P into L = {y in P | y < x} and U = {y in P | x <= y}.
If N = |U| then you're done.
If N < |U| then recurse with P := U.
Otherwise you need to add some items to U: recurse with N := N - |U|, P := L to add the remaining items.
If you choose your pivot wisely (e.g., median of, say, five random samples) then this will run in O(n log n) time.
Hmmmm, thinking some more, you may be able to avoid creating new sets altogether, since essentially you're just looking for an O(n log n) way of finding the Nth greatest item from the original set. Yes, I think this would work, so here's suggestion number 2:
Make a traversal of P, finding the least and greatest items, A and Z, respectively.
Let M be the mean of A and Z (remember, we're only considering Z coordinates here).
Count how many items there are in the range [M, Z], call this Q.
If Q < N then the Nth greatest item in P is somewhere in [A, M). Try M := (A + M)/2.
If N < Q then the Nth greatest item in P is somewhere in [M, Z]. Try M := (M + Z)/2.
Repeat until we find an M such that Q = N.
Now traverse P, removing all items greater than or equal to M.
That's definitely O(n log n) and creates no extra data structures (except for the result).
Howzat?
You might use something like this:
pointList.Sort(); // Use you own compare here if needed
// Skip OrderBy because the list is sorted (and not copied)
double cutoffValue = pointList.ElementAt((int) pointList.Length * (1 - N)).Z;
// Skip ToList to avoid another copy of the list
IEnumerable<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue);
If you want a small percentage of points ordered by some criterion, you'll be better served using a Priority queue data structure; create a size-limited queue(with the size set to however many elements you want), and then just scan through the list inserting every element. After the scan, you can pull out your results in sorted order.
This has the benefit of being O(n log p) instead of O(n log n) where p is the number of points you want, and the extra storage cost is also dependent on your output size instead of the whole list.
int resultSize = pointList.Count * (1-N);
FixedSizedPriorityQueue<Point> q =
new FixedSizedPriorityQueue<Point>(resultSize, p => p.Z);
q.AddEach(pointList);
List<Point> selectedPoints = q.ToList();
Now all you have to do is implement a FixedSizedPriorityQueue that adds elements one at a time and discards the largest element when it is full.
You wrote, you are working with a DataSet. If so, you can use DataView to sort your data once and use them for all future accessing the rows.
Just tried with 50,000 rows and 100 times accessing 30% of them. My performance results are:
Sort With Linq: 5.3 seconds
Use DataViews: 0.01 seconds
Give it a try.
[TestClass]
public class UnitTest1 {
class MyTable : TypedTableBase<MyRow> {
public MyTable() {
Columns.Add("Col1", typeof(int));
Columns.Add("Col2", typeof(int));
}
protected override DataRow NewRowFromBuilder(DataRowBuilder builder) {
return new MyRow(builder);
}
}
class MyRow : DataRow {
public MyRow(DataRowBuilder builder) : base(builder) {
}
public int Col1 { get { return (int)this["Col1"]; } }
public int Col2 { get { return (int)this["Col2"]; } }
}
DataView _viewCol1Asc;
DataView _viewCol2Desc;
MyTable _table;
int _countToTake;
[TestMethod]
public void MyTestMethod() {
_table = new MyTable();
int count = 50000;
for (int i = 0; i < count; i++) {
_table.Rows.Add(i, i);
}
_countToTake = _table.Rows.Count / 30;
Console.WriteLine("SortWithLinq");
RunTest(SortWithLinq);
Console.WriteLine("Use DataViews");
RunTest(UseSoredDataViews);
}
private void RunTest(Action method) {
int iterations = 100;
Stopwatch watch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++) {
method();
}
watch.Stop();
Console.WriteLine(" {0}", watch.Elapsed);
}
private void UseSoredDataViews() {
if (_viewCol1Asc == null) {
_viewCol1Asc = new DataView(_table, null, "Col1 ASC", DataViewRowState.Unchanged);
_viewCol2Desc = new DataView(_table, null, "Col2 DESC", DataViewRowState.Unchanged);
}
var rows = _viewCol1Asc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
IterateRows(rows);
rows = _viewCol2Desc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
IterateRows(rows);
}
private void SortWithLinq() {
var rows = _table.OrderBy(row => row.Col1).Take(_countToTake);
IterateRows(rows);
rows = _table.OrderByDescending(row => row.Col2).Take(_countToTake);
IterateRows(rows);
}
private void IterateRows(IEnumerable<MyRow> rows) {
foreach (var row in rows)
if (row == null)
throw new Exception("????");
}
}

Categories