Take pages and combine to list of pages - c#

I have a list, let's say it contains 1000 items. I want to end up with a list of 10 times 100 items with something like:
myList.Select(x => x.y).Take(100) (until list is empty)
So I want Take(100) to run ten times, since the list contains 1000 items, and end up with list containing 10 lists which each contains 100 items.

You need to Skip the number of records you have already taken. Keep track of this number and use it when you query:
var alreadyTaken = 0;
while (alreadyTaken < 1000) {
var pagedList = myList.Select(x => x.y).Skip(alreadyTaken).Take(100);
...
alreadyTaken += 100;
}

This can be achieved with a simple paging extension method.
public static List<T> GetPage<T>(this List<T> dataSource, int pageIndex, int pageSize = 100)
{
return dataSource.Skip(pageIndex * pageSize)
.Take(pageSize)
.ToList();
}
Of course, you can extend it to accept and/or return any kind of IEnumerable<T>.

As already posted, you can use a loop that Skips some elements and Takes some elements. This creates a new query on every iteration. But a problem arises if you also want to go through each of those queries, because this will be very inefficient. Let's assume you have just 50 entries and you want to go through your list ten elements at a time. You will have 5 iterations doing
.Skip(0).Take(10)
.Skip(10).Take(10)
.Skip(20).Take(10)
.Skip(30).Take(10)
.Skip(40).Take(10)
Two problems arise here.
Skipping elements can still lead to computation. In your first query you just calculate the needed 10 elements, but in your second iteration you calculate 20 elements and throw 10 away, and so on. If you sum all 5 iterations together you have computed 10 + 20 + 30 + 40 + 50 = 150 elements even though you only had 50 elements. This results in O(n^2) performance.
Not every IEnumerable behaves like this. Some, such as a database-backed query, can optimize a Skip, for example by using an OFFSET clause (MySQL) in the SQL query. But that still doesn't solve the problem. The main problem is that you create 5 different queries and execute all 5 of them. Those five queries will now take most of the time, because even a simple query to a database is a lot slower than skipping some in-memory elements or doing some computations.
Because of all these problems it makes sense not to use a loop with multiple .Skip(x).Take(y) calls if you also want to evaluate every query on every iteration. Instead your algorithm should go through your IEnumerable only once, executing the query once, and on the first iteration return the first 10 elements. The next iteration returns the next 10 elements and so on, until it runs out of elements.
The following Extension Method does exactly this.
public static IEnumerable<IReadOnlyList<T>> Combine<T>(this IEnumerable<T> source, int amount) {
var combined = new List<T>();
var counter = 0;
foreach ( var entry in source ) {
combined.Add(entry);
if ( ++counter >= amount ) {
yield return combined;
combined = new List<T>();
counter = 0;
}
}
if ( combined.Count > 0 )
yield return combined;
}
With this you can just do
someEnumerable.Combine(100)
and you get a new IEnumerable<IReadOnlyList<T>> that goes through your enumeration just once slicing everything into chunks with a maximum of 100 elements.
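For comparison, .NET 6 and later ship a built-in Enumerable.Chunk that does the same kind of single-pass slicing. The following is only a minimal sketch, assuming .NET 6 or later is available; PagingDemo and ToPages are hypothetical names used for illustration and are not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PagingDemo
{
    // Splits `source` into consecutive pages of at most `pageSize` items.
    // Enumerable.Chunk (assumed available, .NET 6+) walks the source once.
    public static List<int[]> ToPages(IEnumerable<int> source, int pageSize)
    {
        return source.Chunk(pageSize).ToList();
    }
}
```

With 1000 items and a page size of 100 this yields 10 arrays of 100 items each; a source whose length is not a multiple of the page size ends with a shorter last page.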
Just to show how much difference the performance could be:
var numberCount = 100000;
var combineCount = 100;
var nums = Enumerable.Range(1, numberCount);
var count = 0;
// Benchmark with Combine() Extension
var swCombine = Stopwatch.StartNew();
var sumCombine = 0L;
var pages = nums.Combine(combineCount);
foreach ( var page in pages ) {
sumCombine += page.Sum();
count++;
}
swCombine.Stop();
Console.WriteLine("Count: {0} Sum: {1} Time Combine: {2}", count, sumCombine, swCombine.Elapsed);
// Doing it with .Skip(x).Take(y)
var swTakes = Stopwatch.StartNew();
count = 0;
var sumTaken = 0L;
var alreadyTaken = 0;
while ( alreadyTaken < numberCount ) {
sumTaken += nums.Skip(alreadyTaken).Take(combineCount).Sum();
alreadyTaken += combineCount;
count++;
}
swTakes.Stop();
Console.WriteLine("Count: {0} Sum: {1} Time Takes: {2}", count, sumTaken, swTakes.Elapsed);
Using the Combine() extension method runs in 3 milliseconds on my computer (i5 @ 4 GHz), while the Skip/Take loop already needs 178 milliseconds.
If you have a lot more elements or the slices are smaller, it gets even worse. For example, if combineCount is set to 10 instead of 100, the runtimes change to 4 milliseconds and 1800 milliseconds (1.8 seconds).
Now you could say that you don't have that many elements or that your slices never get that small. But remember, in this example I just generated a sequence of numbers that has nearly zero computation time. The whole overhead from 3 milliseconds to 178 milliseconds is caused only by the re-evaluation and skipping of values. If you have more complex work going on behind the scenes, the skipping creates most of the overhead. And if an IEnumerable can implement Skip efficiently, like a database as explained above, the example gets even worse, because most of the overhead will be the execution of the queries themselves.
And the number of queries can go up really fast. With 100,000 elements and a chunk size of 100 you will already execute 1,000 queries. The Combine extension provided above, on the other hand, always executes your query once and never suffers from any of the problems described above.
All of that doesn't mean that Skip and Take should be avoided. They have their place. But if you really plan to go through every element you should avoid using Skip and Take to get your slicing done.
If the only thing you want is to slice everything into pages of 100 elements and fetch just the third page, for example, you should simply calculate how many elements you need to Skip.
var pageCount = 100;
var pageNumberToGet = 3;
var thirdPage = yourEnumerable.Skip(pageCount * (pageNumberToGet - 1)).Take(pageCount);
This way you get elements 201 to 300 in a single query. An IEnumerable backed by a database can also optimize that, so you end up with a single query. So, if you only want a specific range of elements from your IEnumerable, then you should use Skip and Take as shown above instead of the Combine extension method that I provided.

Related

Why is Queue consuming so much memory?

Basically I was doing a code kata on codewars site to kinda of 'warm up' before starting to code, and noticed a problem that I don't know if its because of my code, or just regular thing.
public static string WhoIsNext(string[] names, long n)
{
Queue<string> fifo = new Queue<string>(names);
for(int i = 0; i < n - 1; i++)
{
var name = fifo.Dequeue();
fifo.Enqueue(name);
fifo.Enqueue(name);
}
return fifo.Peek();
}
And it is called like this:
// Test 1
string[] names = { "Sheldon", "Leonard", "Penny", "Rajesh", "Howard" };
long n = 1;
var nth = CodeKata.WhoIsNext(names, n); // n = 1 should return Sheldon.
// test 2
string[] names = { "Sheldon", "Leonard", "Penny", "Rajesh", "Howard" };
long n = 52;
var nth = CodeKata.WhoIsNext(names, n); // n = 52 Should return Penny.
// test 3
string[] names = { "Sheldon", "Leonard", "Penny", "Rajesh", "Howard" };
long n = 7230702951;
var nth = CodeKata.WhoIsNext(names, n); // n = 7230702951 should return Leonard.
In this code, when I set n to 7230702951 (a really high number...), it throws an out-of-memory exception. Is the number that high, or is the queue just not optimized for such numbers?
I ask because I tried using a List and its memory usage stayed under 500 MB (the plateau was around 327 MB, by the way) while running for about 2 or 3 minutes, whereas the queue threw the exception in a matter of seconds and went over 2 GB in that time alone.
Can someone explain why this is happening? I'm just curious.
edit 1
I forgot to add the List code:
public static string WhoIsNext(string[] names, long n)
{
List<string> test = new List<string>(names);
for(int i = 0; i < n - 1; i++)
{
var name = test[0];
test.RemoveAt(0);
test.Add(name);
test.Add(name);
}
return test[0];
}
edit 2
For those saying that the code doubles the names and is inefficient: I already know that. The code isn't meant to be useful, it's just a kata. (I updated the link now!)
My question is why Queue is so much more inefficient than List with high counts.
Part of the reason is that the queue code is way faster than the List code, because queues are optimised for deletes due to the fact that they are a circular buffer. Lists aren't - the list copies the array contents every time you remove that first element.
Change the input value to 72307000, for example. On my machine, the queue finishes that in less than a second. The list is still chugging away minutes (and at this rate, hours) later. After 4 minutes, i is at 752408 (it has done about 1% of the work).
Thus, I am not sure the queue is less memory efficient. It is just so fast that you run into the memory issue sooner. The list almost certainly has the same issue (the way that List and Queue do array size doubling is very similar) - it will just likely take days to run into it.
To a certain extent, you could predict this even without running your code. A queue with 7230702951 entries in it (running 64-bit) will take a minimum of 8 bytes per entry. So 57845623608 bytes. Which is larger than 50GB. Clearly your machine is going to struggle to fit that in RAM (plus .NET won't let you have an array that large)...
Additionally, your code has a subtle bug. The loop can't ever end (if n is greater than int.MaxValue). Your loop variable is an int but the parameter is a long. Your int will overflow (from int.MaxValue to int.MinValue with i++). So the loop will never exit, for large values of n (meaning the queue will grow forever). You likely should change the type of i to long.
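The wrap-around described above can be checked in isolation. This is only a small sketch to illustrate the point; OverflowDemo, NextInt and LoopNeverEnds are hypothetical helper names, not part of the kata code.

```csharp
using System;

public static class OverflowDemo
{
    // Returns the value an int counter holds after one more increment.
    // `unchecked` makes the default wrap-around behaviour explicit.
    public static int NextInt(int i)
    {
        unchecked { i++; }
        return i;
    }

    // True when `for (int i = 0; i < n - 1; i++)` can never terminate:
    // an int counter never exceeds int.MaxValue, so the condition stays true.
    public static bool LoopNeverEnds(long n)
    {
        return n - 1 > int.MaxValue;
    }
}
```

For n = 7230702951 the second helper returns true, which is why declaring the loop variable as long is the safer choice (it does not remove the memory problem, only the endless loop).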

How to speed up string operations and avoid slow loops

I am writing a code which makes a lot of combinations (Combinations might not be the right word here, sequences of string in the order they are actually present in the string) that already exist in a string. The loop starts adding combinations to a List<string> but unfortunately, my loop takes a lot of time when dealing with any file over 200 bytes. I want to be able to work with hundreds of MBs here.
Let me explain what I actually want in the simplest of ways.
Lets say I have a string that is "Afnan is awesome" (-> main string), what I would want is a list of string which encompasses different substring sequences of the main string. For example-> A,f,n,a,n, ,i,s, ,a,w,e,s,o,m,e. Now this is just the first iteration of the loop. With each iteration, my substring length increases, yielding these results for the second iteration -> Af,fn,na,n , i,is,s , a,aw,we,es,so,om,me. The third iteration would look like this: Afn,fna,nan,an ,n i, is,is ,s a, aw, awe, wes, eso, som, ome. This will keep going on until my substring length reaches half the length of my main string.
My code is as follows:
string data = File.ReadAllText("MyFilePath");
//Creating my dictionary
List<string> dictionary = new List<string>();
int stringLengthIncrementer = 1;
for (int v = 0; v < (data.Length / 2); v++)
{
for (int x = 0; x < data.Length; x++)
{
if ((x + stringLengthIncrementer) > data.Length) break; //So index does not go out of bounds
if (dictionary.Contains(data.Substring(x, stringLengthIncrementer)) == false) //So no repetition takes place
{
dictionary.Add(data.Substring(x, stringLengthIncrementer)); //To add the substring to my List<string> -> dictionary
}
}
stringLengthIncrementer++; //To increase substring length with each iteration
}
I use data.Length / 2 because I only need combinations at most half the length of the entire string. Note that I search the entire string for combinations, not half of it.
To further simplify what I am trying to do -> Suppose I have an input string =
"abcd"
the output would be =
a, b, c, d, ab, bc, cd, This rest will be cut out as it is longer than half the length of my primary string -> //abc, bcd, abcd
I was hoping if some regex method may help me achieve this. Anything that doesn't consist of loops. Anything that is exponentially faster than this? Some simple code with less complexity which is more efficient?
Update
When I used Hashset instead of List<string> for my dictionary, I did not experience any change of performance and also got an OutOfMemoryException:
You can use linq to simplify the code and very easily parallelize it, but it's not going to be orders of magnitude faster, as you would need to run it on files of 100s of MBs (that's very likely impossible).
var data = File.ReadAllText("MyFilePath");
var result = Enumerable.Range(1, data.Length / 2)
.AsParallel()
.Select(len => new HashSet<string>(
Enumerable.Range(0, data.Length - len + 1) //Adding the +1 here made it work perfectly
.Select(x => data.Substring(x, len))))
.SelectMany(t=>t)
.ToList();
General improvements you can make to your code to improve performance (I am not considering whether there are other, more optimal solutions):
calculate data.Substring(x, stringLengthIncrementer) only once
since you search the collection, use a SortedList; it will be faster.
initialize the List (or SortedList, or whatever) with the calculated number of items, like new List<string>(calculatedCapacity).
or you can try to write an algorithm that produces the combinations without checking for duplicates.
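As a rough sketch of the first and last points above: each substring is computed once, and a HashSet<string> is left to discard duplicates instead of calling Contains on a list. SubstringDictionary and Build are hypothetical names, and the number of distinct substrings still grows very quickly with input size, so this does not make inputs of hundreds of MBs feasible.

```csharp
using System;
using System.Collections.Generic;

public static class SubstringDictionary
{
    // Collects every distinct substring whose length is between 1 and
    // data.Length / 2, scanning all start positions for each length.
    public static HashSet<string> Build(string data)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        for (int len = 1; len <= data.Length / 2; len++)
        {
            for (int start = 0; start + len <= data.Length; start++)
            {
                // Substring is computed once; Add ignores duplicates.
                seen.Add(data.Substring(start, len));
            }
        }
        return seen;
    }
}
```

For the input "abcd" from the question this produces a, b, c, d, ab, bc, cd.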
You may be able to use HashSet combined with MoreLINQ's Batch feature (available on NuGet) to simplify the code a little.
public static void Main()
{
string data = File.ReadAllText("MyFilePath");
//string data = "Afnan is awesome";
var dictionary = new HashSet<string>();
for (var stringLengthIncrementer = 1; stringLengthIncrementer <= (data.Length / 2); stringLengthIncrementer++)
{
foreach (var skipper in Enumerable.Range(0, stringLengthIncrementer))
{
var batched = data.Skip(skipper).Batch(stringLengthIncrementer);
foreach (var batch in batched)
{
dictionary.Add(new string(batch.ToArray()));
}
}
}
Console.WriteLine(dictionary);
dictionary.ForEach(z => Console.WriteLine(z));
Console.ReadLine();
}
For this input:
"Afnan is awesome askdjkhaksjhd askjdhaksjsdhkajd asjsdhkajshdkjahsd asksdhkajshdkjashd aksjdhkajsshd98987ad asdhkajsshd98xcx98asdjaksjsd askjdakjshcc98z98asdsad"
performance is roughly 10x faster than your current code.

Subset Sum algorithm efficiency

We have a number of payments (Transaction) that come into our business each day. Each Transaction has an ID and an Amount. We have the requirement to match a number of these transactions to a specific amount. Example:
Transaction Amount
1 100
2 200
3 300
4 400
5 500
If we wanted to find the transactions that add up to 600 you would have a number of sets (1,2,3),(2,4),(1,5).
I found an algorithm that I have adapted, which works as defined below. For 30 transactions it takes 15ms. But the number of transactions averages around 740 and has a maximum close to 6000. Is there a more efficient way to perform this search?
sum_up(TransactionList, remittanceValue, ref MatchedLists);
private static void sum_up(List<Transaction> transactions, decimal target, ref List<List<Transaction>> matchedLists)
{
sum_up_recursive(transactions, target, new List<Transaction>(), ref matchedLists);
}
private static void sum_up_recursive(List<Transaction> transactions, decimal target, List<Transaction> partial, ref List<List<Transaction>> matchedLists)
{
decimal s = 0;
foreach (Transaction x in partial) s += x.Amount;
if (s == target)
{
matchedLists.Add(partial);
}
if (s > target)
return;
for (int i = 0; i < transactions.Count; i++)
{
List<Transaction> remaining = new List<Transaction>();
Transaction n = new Transaction(0, transactions[i].ID, transactions[i].Amount);
for (int j = i + 1; j < transactions.Count; j++) remaining.Add(transactions[j]);
List<Transaction> partial_rec = new List<Transaction>(partial);
partial_rec.Add(new Transaction(n.MatchNumber, n.ID, n.Amount));
sum_up_recursive(remaining, target, partial_rec, ref matchedLists);
}
}
With Transaction defined as:
class Transaction
{
public int ID;
public decimal Amount;
public int MatchNumber;
public Transaction(int matchNumber, int id, decimal amount)
{
ID = id;
Amount = amount;
MatchNumber = matchNumber;
}
}
As already mentioned your problem can be solved by pseudo polynomial algorithm in O(n*G) with n - number of items and G - your targeted sum.
The first part of the question: is it possible to achieve the targeted sum G? The following pseudo/Python code solves it (I have no C# on my machine):
def subsum(values, target):
    reached = [False] * (target + 1)  # initialize as no sums reached at all
    reached[0] = True  # with 0 elements we can only achieve the sum 0
    for val in values:
        for s in reversed(range(target + 1)):  # for target, target-1, ..., 0
            # if sum s can be reached, then we can add the current value to it and build a new sum
            if reached[s] and s + val <= target:
                reached[s + val] = True
    return reached[target]
What is the idea? Let's consider values [1,2,3,6] and target sum 7:
We start with an empty set - the possible sum is obviously 0.
Now we look at the first element 1 and have two options: to take it or not to take it. That leaves us with possible sums {0,1}.
Now looking at the next element 2: leads to possible sets {0,1} (not taking)+{2,3} (taking).
Until now there is not much difference from your approach, but now for element 3 we have possible sets a. for not taking {0,1,2,3} and b. for taking {3,4,5,6}, resulting in {0,1,2,3,4,5,6} as possible sums. The difference from your approach is that there are two ways to get to 3, and your recursion will be started twice from there (which is not needed). Calculating basically the same stuff over and over again is the problem with your approach and why the proposed algorithm is better.
As last step we consider 6 and get {0,1,2,3,4,5,6,7} as possible sums.
But you also need the subset which leads to the targeted sum, for this we just remember which element was taken to achieve the current sub sum. This version returns a subset which results in the target sum or None otherwise:
def subsum(values, target):
    reached = [False] * (target + 1)
    val_ids = [-1] * (target + 1)
    reached[0] = True  # with 0 elements we can only achieve the sum 0
    for (val_id, val) in enumerate(values):
        for s in reversed(range(target + 1)):  # for target, target-1, ..., 0
            # record only the first element that reaches a sum, so that the
            # chain of predecessors never uses the same element twice
            if reached[s] and s + val <= target and not reached[s + val]:
                reached[s + val] = True
                val_ids[s + val] = val_id
    # reconstruct the subset for target:
    if not reached[target]:
        return None  # means not possible
    else:
        result = []
        current = target
        while current != 0:  # search backwards jumping from predecessor to predecessor
            val_id = val_ids[current]
            result.append(val_id)
            current -= values[val_id]
        return result
As another approach, you could use memoization to speed up your current solution, remembering for each state (subsum, number_of_elements_not_considered) whether it is possible to achieve the target sum. But I would say standard dynamic programming is the less error-prone option here.
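Since the question is about C#, here is a hedged port of the reconstruction idea above. It is a sketch under two assumptions that are not in the question: amounts are positive integers (for example, decimal amounts converted to cents), and one matching subset is enough. SubsetSumDp and FindSubset are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class SubsetSumDp
{
    // Returns indices into `amounts` whose values add up to `target`,
    // or null when no subset matches. Assumes positive integer amounts.
    public static List<int> FindSubset(int[] amounts, int target)
    {
        var reached = new bool[target + 1];
        var takenBy = new int[target + 1]; // index of the element that first reached each sum
        reached[0] = true;                 // the empty subset reaches 0

        for (int id = 0; id < amounts.Length; id++)
        {
            int val = amounts[id];
            // Descending order ensures each element is used at most once.
            for (int s = target - val; s >= 0; s--)
            {
                if (reached[s] && !reached[s + val])
                {
                    reached[s + val] = true;
                    takenBy[s + val] = id;
                }
            }
        }

        if (!reached[target]) return null;

        // Walk back from target, jumping from predecessor to predecessor.
        var result = new List<int>();
        for (int current = target; current != 0; current -= amounts[takenBy[current]])
        {
            result.Add(takenBy[current]);
        }
        return result;
    }
}
```

Memory and time are proportional to the number of transactions times the target, so with amounts in cents the target should be kept within a range that fits comfortably in an array.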
Yes.
I can't provide full code at the moment, but instead of iterating each list of transactions twice until finding matches (O squared), try this concept:
setup a hashtable with the existing transaction amounts as entries, as well as the summation of each set of two transactions assuming each value is made of a max of two transactions (weekend credit card processing).
for each total, reference into the hashtable - the sets of transactions in that slot are the list of matching transactions.
Instead of O^2, you can get it down to 4*O, which would make a noticeable difference in speed.
Good luck!
Dynamic programming can solve this problem efficiently:
Assume you have n transactions and the target sum is m.
We can solve it in O(nm) time.
Read up on the Knapsack problem to learn the technique.
For this problem we can define dp[i][sum] as the number of subsets of the first i transactions that add up to sum.
The recurrence:
dp[0][0] = 1
for i from 1 to n:
dp[i][sum] = dp[i - 1][sum] + dp[i - 1][sum - amount_i] (the second term only when sum >= amount_i)
dp[n][sum] is the number you need, and you need to add some tricks to recover the subsets themselves.
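A minimal C# sketch of this counting idea, assuming positive integer amounts (for example, cents). It keeps a single row that is updated in place, which is a common space-saving variant of the two-dimensional table; SubsetCount and CountSubsets are hypothetical names.

```csharp
using System;

public static class SubsetCount
{
    // Counts the subsets of `amounts` that add up to `target`.
    // Assumes positive integer amounts.
    public static long CountSubsets(int[] amounts, int target)
    {
        // ways[s] plays the role of dp[i][s]; one row is reused for every i.
        var ways = new long[target + 1];
        ways[0] = 1; // the empty subset
        foreach (int amount in amounts)
        {
            // Descend so each transaction is counted at most once per subset.
            for (int s = target; s >= amount; s--)
            {
                ways[s] += ways[s - amount];
            }
        }
        return ways[target];
    }
}
```

For the five transactions in the question and a target of 600 this counts the three sets (1,2,3), (2,4) and (1,5).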
You have a couple of practical assumptions here that would make brute force with smartish branch pruning feasible:
items are unique, hence you wouldn't be getting combinatorial blow up of valid subsets (i.e. (1,1,1,1,1,1,1,1,1,1,1,1,1) adding up to 3)
if the number of resulting feasible sets is still huge, you would run out of memory collecting them before running into total runtime issues.
ordering input ascending would allow for an easy early stop check - if your remaining sum is smaller than the current element, then none of the yet unexamined items could possibly be in a result (as current and subsequent items would only get bigger)
keeping running sums would speed up each step, as you wouldn't be recalculating it over and over again
Here's a bit of code:
public static List<T[]> SubsetSums<T>(T[] items, int target, Func<T, int> amountGetter)
{
Stack<T> unusedItems = new Stack<T>(items.OrderByDescending(amountGetter));
Stack<T> usedItems = new Stack<T>();
List<T[]> results = new List<T[]>();
SubsetSumsRec(unusedItems, usedItems, target, results, amountGetter);
return results;
}
public static void SubsetSumsRec<T>(Stack<T> unusedItems, Stack<T> usedItems, int targetSum, List<T[]> results, Func<T,int> amountGetter)
{
if (targetSum == 0)
results.Add(usedItems.ToArray());
if (targetSum < 0 || unusedItems.Count == 0)
return;
var item = unusedItems.Pop();
int currentAmount = amountGetter(item);
if (targetSum >= currentAmount)
{
// case 1: use current element
usedItems.Push(item);
SubsetSumsRec(unusedItems, usedItems, targetSum - currentAmount, results, amountGetter);
usedItems.Pop();
// case 2: skip current element
SubsetSumsRec(unusedItems, usedItems, targetSum, results, amountGetter);
}
unusedItems.Push(item);
}
I've run it against 100k input that yields around 1k results in under 25 millis, so it should be able to handle your 740 case with ease.

Adding to List<T> becomes very slow over time

I'm parsing an HTML table that has about 1000 rows. I'm adding a ~10 char string from one <td> in each row to a List<string> object. It's very quick for the first 200 or so iterations but then becomes slower and slower over time.
This is the code I'm using:
List<string> myList = new List<string>();
int maxRows = numRows;
for (int i = 1; i < maxRows; i++)
{
TableRow newTable = myTable.TableRows[i];
string coll = string.Format("{0},{1},{2},{3},{4}",newTable.TableCells[0].Text,newTable.TableCells[1].Text,newTable.TableCells[2].Text,newTable.TableCells[3].Text,newTable.TableCells[4].Text);
myList.Add(coll);
label1.Text = i.ToString();
}
Should I use an array instead?
Edit: I threw the above code in a new method that gets run on a new Thread and then updated my label control with this code:
label1.Invoke((MethodInvoker)delegate
{
label1.Text = i.ToString();
});
Program runs at a consistent speed and doesn't block the UI.
If you roughly know the number of items in your collection, it is better to use an array.
Reason: every time you add an element to a full List, it allocates a new block of memory with double the current capacity, copies everything there, and then keeps appending entries until that block is full, at which point another allocate-and-copy cycle happens.
This is how it works, AFAIK: an empty list gets room for 4 elements on the first Add.
When you add the 5th element it allocates room for 8 and copies the existing 4 there, then continues up to 8, and repeats this process. So it is slower, but offers the flexibility of not having to determine the length beforehand. This might be the reason you're seeing the drag.
Thanks @Dyppl
var list = new List<int>(1000); This is an elegant option too; as @Dyppl suggested, it is the best of both worlds.
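The effect of presizing can be observed through the Capacity property. The sketch below counts how often the backing storage changes size while items are added; CapacityDemo and CountResizes are hypothetical names, and the exact growth steps are an implementation detail that may differ between runtimes.

```csharp
using System;
using System.Collections.Generic;

public static class CapacityDemo
{
    // Adds `itemsToAdd` integers to `list` and returns how many times
    // list.Capacity changed, i.e. how often the storage was reallocated.
    public static int CountResizes(List<int> list, int itemsToAdd)
    {
        int resizes = 0;
        int lastCapacity = list.Capacity;
        for (int i = 0; i < itemsToAdd; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity)
            {
                resizes++;
                lastCapacity = list.Capacity;
            }
        }
        return resizes;
    }
}
```

A list created with new List<int>(1000) should report zero resizes for 1000 items, while a default-constructed list reports a handful of them.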
I tested adding strings to a list, and benchmarked it with a LIST_SIZE of 1000000 (one million) items and a LIST_SIZE of 100000 (one hundred thousands) items. This way we can compare how it scales.
I ran each test 5 times and averaged the running times.
var l = new List<string>();
for (var i = 0; i < LIST_SIZE; ++i) {
l.Add("i = " + i.ToString());
}
LIST_SIZE of 1000000 takes 1519 ms
LIST_SIZE of 100000 takes 96 ms
var l = new List<string>(LIST_SIZE);
for (var i = 0; i < LIST_SIZE; ++i) {
l.Add("i = " + i.ToString());
}
LIST_SIZE of 1000000 takes 1386 ms
LIST_SIZE of 100000 takes 65 ms
var l = new string[LIST_SIZE];
for (var i = 0; i < LIST_SIZE; ++i) {
l[i] = "i = " + i.ToString();
}
LIST_SIZE of 1000000 takes 1510 ms
LIST_SIZE of 100000 takes 66 ms
So, we can notice 2 things:
adding each item really does take more time as the list gets larger
the difference shouldn't be noticeable in a 1000 items list
I would conclude then that the bottleneck is in one of the other methods you call.
Initialize the List with the capacity you expect it to consume:
List<string> myList = new List<string>(maxRows);
Sidenote: If you generate 'very' large lists, the internally growing storage arrays add up over time to twice the storage you really need. But if you already slow down at 1000 entries, I suggest investigating the true reason with a profiler. Might the strings be growing too large?

How to avoid OrderBy - memory usage problems

Let's assume we have a large list of points List<Point> pointList (already stored in memory) where each Point contains X, Y, and Z coordinate.
Now, I would like to select for example N% of points with biggest Z-values of all points stored in pointList. Right now I'm doing it like that:
double N = 0.05; // selecting only 5% of points
double cutoffValue = pointList
.OrderBy(p=> p.Z) // First bottleneck - creates sorted copy of all data
.ElementAt((int)(pointList.Count * (1 - N))).Z;
List<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue).ToList();
But I have here two memory usage bottlenecks: first during OrderBy (more important) and second during selecting the points (this is less important, because we usually want to select only small amount of points).
Is there any way of replacing OrderBy (or maybe other way of finding this cutoff point) with something that uses less memory?
The problem is quite important, because LINQ copies the whole dataset and for big files I'm processing it sometimes hits few hundreds of MBs.
Write a method that iterates through the list once and maintains a set of the M largest elements. Each step will only require O(log M) work to maintain the set, and you can have O(M) memory and O(N log M) running time.
public static IEnumerable<TSource> TakeLargest<TSource, TKey>
(this IEnumerable<TSource> items, Func<TSource, TKey> selector, int count)
{
var set = new SortedDictionary<TKey, List<TSource>>();
var resultCount = 0;
var first = default(KeyValuePair<TKey, List<TSource>>);
foreach (var item in items)
{
// If the key is already smaller than the smallest
// item in the set, we can ignore this item
var key = selector(item);
if (first.Value == null ||
resultCount < count ||
Comparer<TKey>.Default.Compare(key, first.Key) >= 0)
{
// Add next item to set
if (!set.ContainsKey(key))
{
set[key] = new List<TSource>();
}
set[key].Add(item);
// Refresh the smallest entry: the new key may be below the old minimum
first = set.First();
// Remove smallest item from set
resultCount++;
if (resultCount - first.Value.Count >= count)
{
set.Remove(first.Key);
resultCount -= first.Value.Count;
first = set.First();
}
}
}
return set.Values.SelectMany(values => values);
}
That will include more than count elements if there are ties, as your implementation does now.
You could sort the list in place, using List<T>.Sort, which uses the Quicksort algorithm. But of course, your original list would be sorted, which is perhaps not what you want...
pointList.Sort((a, b) => b.Z.CompareTo(a.Z));
var selectedPoints = pointList.Take((int)(pointList.Count * N)).ToList();
If you don't mind the original list being sorted, this is probably the best balance between memory usage and speed
You can use Indexed LINQ to put index on the data which you are processing. This can result in noticeable improvements in some cases.
If you combine the two there is a chance a little less work will be done:
List<Point> selectedPoints = pointList
.OrderByDescending(p=> p.Z) // First bottleneck - creates sorted copy of all data
.Take((int)(pointList.Count * N)).ToList();
But basically this kind of ranking requires sorting, your biggest cost.
A few more ideas:
if you use a class Point (instead of a struct Point) there will be much less copying.
you could write a custom sort that only bothers to move the top 5% up. Something like (don't laugh) BubbleSort.
If your list is in memory already, I would sort it in place instead of making a copy - unless you need it un-sorted again, that is, in which case you'll have to weigh having two copies in memory vs loading it again from storage):
pointList.Sort((x,y) => y.Z.CompareTo(x.Z)); //this should sort it in desc. order
Also, not sure how much it will help, but it looks like you're going through your list twice - once to find the cutoff value, and once again to select them. I assume you're doing that because you want to let all ties through, even if it means selecting more than 5% of the points. However, since they're already sorted, you can use that to your advantage and stop when you're finished.
double cutoffValue = pointList[(int)(pointList.Count * N)].Z; // list is sorted descending, so the cutoff sits N of the way in
List<Point> selectedPoints = pointList.TakeWhile(p => p.Z >= cutoffValue)
.ToList();
Unless your list is extremely large, it's much more likely to me that cpu time is your performance bottleneck. Yes, your OrderBy() might use a lot of memory, but it's generally memory that for the most part is otherwise sitting idle. The cpu time really is the bigger concern.
To improve cpu time, the most obvious thing here is to not use a list. Use an IEnumerable instead. You do this by simply not calling .ToList() at the end of your where query. This will allow the framework to combine everything into one iteration of the list that runs only as needed. It will also improve your memory use because it avoids loading the entire query into memory at once, and instead defers it to only load one item at a time as needed. Also, use .Take() rather than .ElementAt(). It's a lot more efficient.
double N = 0.05; // selecting only 5% of points
int count = (int)(N * pointList.Count);
var selectedPoints = pointList.OrderByDescending(p=>p.Z).Take(count);
That out of the way, there are three cases where memory use might actually be a problem:
Your collection really is so large as to fill up memory. For a simple Point structure on a modern system we're talking millions of items. This is really unlikely. On the off chance you have a system this large, your solution is to use a relational database, which can keep these items on disk relatively efficiently.
You have a moderate size collection, but there are external performance constraints, such as needing to share system resources with many other processes as you might find in an asp.net web site. In this case, the answer is either to 1) again put the points in a relational database or 2) offload the work to the client machines.
Your collection is just large enough to end up on the Large Object Heap, and the buffer used in the OrderBy() call is also placed on the LOH. Now what happens is that the garbage collector will not properly compact memory after your OrderBy() call, and over time you get a lot of memory that is not used but still reserved by your program. In this case, the solution is, unfortunately, to break your collection up into multiple groups that are each individually small enough not to trigger use of the LOH.
Update:
Reading through your question again, I see you're reading very large files. In that case, the best performance can be obtained by writing your own code to parse the files. If the count of items is stored near the top of the file you can do much better, or even if you can estimate the number of records based on the size of the file (guess a little high to be sure, and then truncate any extras after finishing), you can then build your final collection as you read. This will greatly improve cpu performance and memory use.
I'd do it by implementing "half" a quicksort.
Consider your original set of points, P, where you are looking for the "top" N items by Z coordinate.
Choose a pivot x in P.
Partition P into L = {y in P | y < x} and U = {y in P | x <= y}.
If N = |U| then you're done.
If N < |U| then recurse with P := U.
Otherwise you need to add some items to U: recurse with N := N - |U|, P := L to add the remaining items.
If you choose your pivot wisely (e.g., median of, say, five random samples) then this will run in expected O(n) time, comfortably within the O(n log n) of a full sort.
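The steps above can be sketched roughly as follows. This is not a literal transcription: it works in place instead of building the sets L and U, it uses a single random pivot rather than a median of five samples, and it partitions three ways so that repeated Z values cannot cause endless recursion. The names HalfQuicksort and TopK are made up, and the key is assumed to be a double such as p.Z.

```csharp
using System;
using System.Collections.Generic;

public static class HalfQuicksort
{
    private static readonly Random Rng = new Random();

    // Returns the k items with the largest keys, in no particular order.
    // Note: this reorders the list in place.
    public static List<T> TopK<T>(IList<T> items, int k, Func<T, double> key)
    {
        int n = items.Count;
        if (k <= 0) return new List<T>();
        if (k >= n) return new List<T>(items);

        int target = n - k;          // final index of the smallest item we keep
        int lo = 0, hi = n - 1;
        while (lo < hi)
        {
            double pivot = key(items[Rng.Next(lo, hi + 1)]);
            // Three-way partition of [lo, hi]: < pivot | == pivot | > pivot
            int lt = lo, i = lo, gt = hi;
            while (i <= gt)
            {
                double current = key(items[i]);
                if (current < pivot) Swap(items, lt++, i++);
                else if (current > pivot) Swap(items, i, gt--);
                else i++;
            }
            if (target < lt) hi = lt - 1;        // cutoff is among the smaller items
            else if (target > gt) lo = gt + 1;   // cutoff is among the larger items
            else break;                          // cutoff equals the pivot: done
        }

        var result = new List<T>(k);
        for (int j = target; j < n; j++) result.Add(items[j]);
        return result;
    }

    private static void Swap<T>(IList<T> items, int a, int b)
    {
        T tmp = items[a];
        items[a] = items[b];
        items[b] = tmp;
    }
}
```

For the points in the question this would be called as HalfQuicksort.TopK(pointList, count, p => p.Z).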
Hmmmm, thinking some more, you may be able to avoid creating new sets altogether, since essentially you're just looking for an O(n log n) way of finding the Nth greatest item from the original set. Yes, I think this would work, so here's suggestion number 2:
Make a traversal of P, finding the least and greatest items, A and Z, respectively.
Let M be the mean of A and Z (remember, we're only considering Z coordinates here).
Count how many items there are in the range [M, Z], call this Q.
If Q < N then the Nth greatest item in P is somewhere in [A, M). Try M := (A + M)/2.
If N < Q then the Nth greatest item in P is somewhere in [M, Z]. Try M := (M + Z)/2.
Repeat until we find an M such that Q = N.
Now traverse P, pulling out all items greater than or equal to M; those are your result.
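Each pass is O(n), and for reasonably spread-out values you need roughly log n passes, so that's about O(n log n), and it creates no extra data structures (except for the result). If several points share the same Z there may be no M with exactly Q = N, so cap the number of passes.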
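Suggestion number 2 might look something like this. It is only a sketch: it works on a plain list of Z values, the names ValueBisection and FindCutoff are invented, and the limit of 64 passes is an assumption added so that ties in Z cannot make it loop forever. When no exact cutoff exists it falls back to a threshold that selects at least N items.

```csharp
using System;
using System.Collections.Generic;

public static class ValueBisection
{
    // Returns a threshold M such that exactly n values are >= M when such an M
    // is found within maxPasses; otherwise a threshold selecting at least n values.
    public static double FindCutoff(IReadOnlyList<double> z, int n, int maxPasses = 64)
    {
        if (z.Count == 0) throw new ArgumentException("no values", nameof(z));
        if (n <= 0) throw new ArgumentOutOfRangeException(nameof(n));

        // One traversal to find the least and greatest values (A and Z in the text).
        double lo = z[0], hi = z[0];
        foreach (double v in z)
        {
            if (v < lo) lo = v;
            if (v > hi) hi = v;
        }
        if (n >= z.Count) return lo;   // everything is selected

        for (int pass = 0; pass < maxPasses; pass++)
        {
            double m = lo + (hi - lo) / 2;
            int q = 0;                 // how many values fall in [m, hi]
            foreach (double v in z)
            {
                if (v >= m) q++;
            }
            if (q == n) return m;
            if (q < n) hi = m;         // too few selected: move the threshold down
            else lo = m;               // too many selected: move the threshold up
        }
        return lo;                     // fallback: count(v >= lo) is at least n
    }
}
```

The final selection is then a single pass such as pointList.Where(p => p.Z >= cutoff).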
Howzat?
You might use something like this:
pointList.Sort((a, b) => a.Z.CompareTo(b.Z)); // sorts in place by Z; use your own comparison here if needed
// Skip OrderBy because the list is already sorted (and not copied)
double cutoffValue = pointList[(int)(pointList.Count * (1 - N))].Z;
// Skip ToList to avoid another copy of the list
IEnumerable<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue);
If you want a small percentage of points ordered by some criterion, you'll be better served using a priority queue data structure: create a size-limited queue (with the size set to however many elements you want), and then just scan through the list inserting every element. After the scan, you can pull out your results in sorted order.
This has the benefit of being O(n log p) instead of O(n log n) where p is the number of points you want, and the extra storage cost is also dependent on your output size instead of the whole list.
int resultSize = (int)(pointList.Count * (1 - N)); // explicit cast: the product is a double
FixedSizedPriorityQueue<Point> q =
new FixedSizedPriorityQueue<Point>(resultSize, p => p.Z);
q.AddEach(pointList);
List<Point> selectedPoints = q.ToList();
Now all you have to do is implement a FixedSizedPriorityQueue that adds elements one at a time and discards the largest element when it is full.
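FixedSizedPriorityQueue is not a framework type, so here is one possible sketch of it. It assumes .NET 6 or later, where System.Collections.Generic.PriorityQueue<TElement, TPriority> is available, and it assumes the key is a double such as p.Z. The method names simply mirror the usage shown above.

```csharp
using System;
using System.Collections.Generic;

public sealed class FixedSizedPriorityQueue<T>
{
    private readonly int _capacity;
    private readonly Func<T, double> _key;
    // PriorityQueue dequeues the lowest priority first, so the key is negated
    // to keep the item with the largest key at the front, ready to be discarded.
    private readonly PriorityQueue<T, double> _heap = new PriorityQueue<T, double>();

    public FixedSizedPriorityQueue(int capacity, Func<T, double> key)
    {
        _capacity = capacity;
        _key = key;
    }

    public void Add(T item)
    {
        if (_capacity <= 0) return;
        if (_heap.Count < _capacity)
            _heap.Enqueue(item, -_key(item));
        else
            _heap.EnqueueDequeue(item, -_key(item)); // adds, then drops the largest key
    }

    public void AddEach(IEnumerable<T> items)
    {
        foreach (var item in items) Add(item);
    }

    // Drains the queue and returns the kept items in ascending key order.
    public List<T> ToList()
    {
        var result = new List<T>(_heap.Count);
        while (_heap.Count > 0) result.Add(_heap.Dequeue()); // largest key first
        result.Reverse();
        return result;
    }
}
```

Each Add costs O(log p) for a queue of size p, which is where the O(n log p) total comes from. Note that ToList empties the queue in this sketch.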
You wrote that you are working with a DataSet. If so, you can use a DataView to sort your data once and reuse it for all future access to the rows.
I just tried it with 50,000 rows, accessing 30% of them 100 times. My performance results are:
Sort With Linq: 5.3 seconds
Use DataViews: 0.01 seconds
Give it a try.
using System;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Linq;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class UnitTest1 {
    // Typed table with two int columns
    class MyTable : TypedTableBase<MyRow> {
        public MyTable() {
            Columns.Add("Col1", typeof(int));
            Columns.Add("Col2", typeof(int));
        }
        protected override DataRow NewRowFromBuilder(DataRowBuilder builder) {
            return new MyRow(builder);
        }
    }
    class MyRow : DataRow {
        public MyRow(DataRowBuilder builder) : base(builder) {
        }
        public int Col1 { get { return (int)this["Col1"]; } }
        public int Col2 { get { return (int)this["Col2"]; } }
    }
    DataView _viewCol1Asc;
    DataView _viewCol2Desc;
    MyTable _table;
    int _countToTake;
    [TestMethod]
    public void MyTestMethod() {
        _table = new MyTable();
        int count = 50000;
        for (int i = 0; i < count; i++) {
            _table.Rows.Add(i, i);
        }
        // Mark the added rows as Unchanged; otherwise the views below, which
        // filter on DataViewRowState.Unchanged, would be empty.
        _table.AcceptChanges();
        // Note: Count / 30 is about 3.3% of the rows; use Count * 30 / 100 for 30%.
        _countToTake = _table.Rows.Count / 30;
        Console.WriteLine("SortWithLinq");
        RunTest(SortWithLinq);
        Console.WriteLine("Use DataViews");
        RunTest(UseSortedDataViews);
    }
    // Runs the given method 100 times and prints the total elapsed time
    private void RunTest(Action method) {
        int iterations = 100;
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) {
            method();
        }
        watch.Stop();
        Console.WriteLine(" {0}", watch.Elapsed);
    }
    private void UseSortedDataViews() {
        // The sorted views are built once and reused on every later call
        if (_viewCol1Asc == null) {
            _viewCol1Asc = new DataView(_table, null, "Col1 ASC", DataViewRowState.Unchanged);
            _viewCol2Desc = new DataView(_table, null, "Col2 DESC", DataViewRowState.Unchanged);
        }
        var rows = _viewCol1Asc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
        IterateRows(rows);
        rows = _viewCol2Desc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
        IterateRows(rows);
    }
    private void SortWithLinq() {
        // Re-sorts the whole table on every call
        var rows = _table.OrderBy(row => row.Col1).Take(_countToTake);
        IterateRows(rows);
        rows = _table.OrderByDescending(row => row.Col2).Take(_countToTake);
        IterateRows(rows);
    }
    // Forces enumeration of the rows
    private void IterateRows(IEnumerable<MyRow> rows) {
        foreach (var row in rows)
            if (row == null)
                throw new Exception("????");
    }
}
