Say I have a sorted list of 1000 or so unique decimals, arranged by value.
List<decimal> decList
How can I get a random x number of decimals from a list of unique decimals that total up to y?
private List<decimal> getWinningValues(int xNumberToGet, decimal yTotalValue)
{
}
Is there any way to avoid a long processing time on this? My idea so far is to take xNumberToGet random numbers from the pool. Something like this (cool way to get random selection from a list):
foreach (decimal d in decList.OrderBy(x => randomInstance.Next()).Take(xNumberToGet))
{
}
Then I might check the total of those; if the total is less, I might shift the numbers up (to the next available number) slowly, and if the total is more, I might shift the numbers down. I'm still not sure how to implement this, or whether there is a better design readily available. Any help would be much appreciated.
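For what it's worth, here is a rough, untested sketch of that shifting idea, working on indices into the sorted decList (the pass limit is arbitrary, and the heuristic may fail to hit the target exactly):
private List<decimal> getWinningValues(int xNumberToGet, decimal yTotalValue)
{
    // start from a random selection of indices into the sorted list
    var chosen = new SortedSet<int>(
        Enumerable.Range(0, decList.Count)
                  .OrderBy(i => randomInstance.Next())
                  .Take(xNumberToGet));

    for (int pass = 0; pass < 10000; pass++)   // arbitrary cap so the loop always ends
    {
        decimal total = chosen.Sum(i => decList[i]);
        if (total == yTotalValue)
            return chosen.Select(i => decList[i]).ToList();

        bool moved = false;
        if (total < yTotalValue)
        {
            // shift one pick up to the next free (larger) value
            foreach (int i in chosen.Reverse().ToList())
            {
                if (i + 1 < decList.Count && !chosen.Contains(i + 1))
                {
                    chosen.Remove(i);
                    chosen.Add(i + 1);
                    moved = true;
                    break;
                }
            }
        }
        else
        {
            // shift one pick down to the next free (smaller) value
            foreach (int i in chosen.ToList())
            {
                if (i - 1 >= 0 && !chosen.Contains(i - 1))
                {
                    chosen.Remove(i);
                    chosen.Add(i - 1);
                    moved = true;
                    break;
                }
            }
        }
        if (!moved)
            return null;   // stuck at the ends of the list
    }
    return null;           // gave up without hitting the target exactly
}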
Ok, start with a little extension I got from this answer,
public static IEnumerable<IEnumerable<T>> Combinations<T>(
this IEnumerable<T> source,
int k)
{
if (k == 0)
{
return new[] { Enumerable.Empty<T>() };
}
return source.SelectMany((e, i) =>
source.Skip(i + 1).Combinations(k - 1)
.Select(c => (new[] { e }).Concat(c)));
}
This gives you a pretty efficient method to yield all the combinations of k members, without repetition, from a given IEnumerable. You could make good use of this in your implementation.
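For example, a tiny illustration with made-up values:
// all 2-element combinations of { 1, 2, 3 }: {1, 2}, {1, 3}, {2, 3}
var sample = new List<decimal> { 1m, 2m, 3m };
foreach (var combo in sample.Combinations(2))
{
    Console.WriteLine(string.Join(", ", combo));
}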
Bear in mind, if the IEnumerable and k are sufficiently large this could take some time, i.e. much longer than you have. So, I've modified your function to take a CancellationToken.
private static IEnumerable<decimal> GetWinningValues(
IEnumerable<decimal> allValues,
int numberToGet,
decimal targetValue,
CancellationToken canceller)
{
IList<decimal> currentBest = null;
var currentBestGap = decimal.MaxValue;
var locker = new object();
allValues.Combinations(numberToGet)
.AsParallel()
.WithCancellation(canceller)
.TakeWhile(c => currentBestGap != decimal.Zero)
.ForAll(c =>
{
var gap = Math.Abs(c.Sum() - targetValue);
if (gap < currentBestGap)
{
    lock (locker)
    {
        // re-check inside the lock so a slower thread cannot overwrite a better result
        if (gap < currentBestGap)
        {
            currentBestGap = gap;
            currentBest = c.ToList();
        }
    }
}
});
return currentBest;
}
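As an illustration of how a caller might bound the search, here is a hedged usage sketch (it assumes the timeout constructor of CancellationTokenSource from .NET 4.5; note that, as written, cancellation surfaces as an exception rather than as a best-so-far result):
// give the parallel search at most two seconds
using (var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2)))
{
    try
    {
        var winners = GetWinningValues(decList, xNumberToGet, yTotalValue, cts.Token);
        // use winners ...
    }
    catch (OperationCanceledException)
    {
        // the deadline passed before the search finished
    }
}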
I've an idea that you could sort the initial list and quit iterating the combinations at a certain point, when the sum must exceed the target. After some consideration, it's not trivial to identify that point and the cost of checking may exceed the benefit. This benefit would have to be balanced against some function of the target value and the mean of the set.
I still think further optimization is possible but I also think that this work has already been done and I'd just need to look it up in the right place.
There are some number k of such subsets of decList (k might be 0).
Assuming that you want to select each one with uniform probability 1/k, I think you basically need to do the following:
1. Iterate over all the matching subsets.
2. Select one.
Step 1 is potentially a big task, you can look into the various ways of solving the "subset sum problem" for a fixed subset size, and adapt them to generate each solution in turn.
Step 2 can be done either by making a list of all the solutions and choosing one, or (if that might take too much memory) by using the clever streaming random selection algorithm, i.e. reservoir sampling; a sketch follows.
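A minimal sketch of that streaming selection (reservoir sampling with a reservoir of size one), assuming the matching subsets can be produced lazily as an IEnumerable:
// keeps the i-th candidate with probability 1/i, so every candidate ends up
// selected with probability 1/count without materializing the whole list
static T PickUniformly<T>(IEnumerable<T> candidates, Random rng)
{
    T chosen = default(T);
    int seen = 0;
    foreach (var candidate in candidates)
    {
        seen++;
        if (rng.Next(seen) == 0)   // true with probability 1/seen
            chosen = candidate;
    }
    if (seen == 0)
        throw new InvalidOperationException("No candidates to choose from.");
    return chosen;
}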
If your data is likely to have lots of such subsets, then generating them all might be incredibly slow. In that case you might try to identify groups of them at a time. You'd have to know the size of the group without visiting its members one by one, then you can choose which group to use weighted by its size, then you've reduced the problem to selecting one of that group at random.
If you don't need to select with uniform probability then the problem might become easier. At the best case, if you don't care about the distribution at all then you can return the first subset-sum solution you find -- whether you'd call that "at random" is another matter...
The method is:
List<Book> books = new List<Book>();
public IEnumerable<Book> Shoot()
{
foreach(var b in books)
{
bool find = true;
foreach(var otherb in books)
{
if(otherb != b && otherb.Author == b.Author)
{
find = false;
}
}
if(find)
{
yield return b;
}
}
}
Normally, the time complexity would be O(books.Count^2), but there is an if(find) statement in the outer loop and I'm not sure whether it changes the number of iterations.
So my questions are:
What is the time complexity of this method?
How did you calculate it?
I'm waiting online for your answer.
Thank you in advance.
You would go through each book in the outer loop (n times) and, for each outer book, go through each otherb in the inner loop (n times), so the time complexity would be O(n^2).
The yield return would not change the complexity of the algorithm; it creates an iterator, but if you traverse the whole result from the calling function you still go through all the iterations of your algorithm.
What is the yield keyword used for in C#?
To optimize the algorithm, as btilly mentioned, you could do two passes over the collection: on the first pass, store the number of books per author in a hash table, and on the second pass, check the author's count in the hash table (assuming constant time for the lookup) and yield the book only if that count is 1:
public IEnumerable<Book> Shoot()
{
var authors = new Dictionary<string, int>();
foreach(var b in books)
{
if(authors.ContainsKey(b.Author))
authors[b.Author] ++;
else
authors.Add(b.Author, 1);
}
foreach(var b in books)
{
if(authors[b.Author] == 1)
yield return b;
}
}
This way you have a linear time complexity of O(n); note that you need O(n) extra space in this case.
Your worst case performance per yield is O(n * n). Your best case is O(n). If you assume that authors are randomly sorted, and a fixed portion only write one book, then the average case between yields is O(n) because the probability of getting to m iterations of the outer loop decreases exponentially as m increases. (Insert standard geometric series argument here.)
Generally (but not always!) people are most interested in the average case.
Incidentally the standard way of handling this problem would be to create a dictionary up front with all of the authors, and the count of how many books they wrote. That takes time O(n). And then your yields after that would just search through the keys of that dictionary looking for the next one with only 1 entry. The average time of subsequent yields would be O(1), the worst case O(n), and the amortized average time across all yields (assuming that a fixed proportion only wrote one book) will be O(1) per yield. As opposed to the current implementation which is O(n) per yield.
I have a rather specific question about how to exclude items of one list from another. Common approaches such as Except() won't do and here is why:
If the duplicate within a list has an "even" index, I need to remove THIS element and the NEXT element AFTER it.
If the duplicate within a list has an "odd" index, I need to remove THIS element AND one element BEFORE it.
There might be many appearances of the same duplicate within a list, i.e. one might be at an "even" index, another at an "odd" index.
I'm not asking for a solution, since I've created one myself. However, after running this method many times, ANTS Performance Profiler shows that it takes 75% of the whole execution time (30 seconds out of 40). The question is: is there a faster way to perform the same operation? I've tried to optimize my current code but it still lacks performance. Here it is:
private void removedoubles(List<int> exclude, List<int> listcopy)
{
for (int j = 0; j < exclude.Count(); j++)
{
for (int i = 0; i < listcopy.Count(); i++)
{
if (listcopy[i] == exclude[j])
{
if (i % 2 == 0) // even
{
//listcopy.RemoveRange(i, i + 1);
listcopy.RemoveAt(i);
listcopy.RemoveAt(i);
i = i - 1;
}
else //odd
{
//listcopy.RemoveRange(i - 1, i);
listcopy.RemoveAt(i - 1);
listcopy.RemoveAt(i - 1);
i = i - 2;
}
}
}
}
}
where:
exclude - list that contains Duplicates only. This list might contain up to 30 elements.
listcopy - list that should be checked for duplicates. If duplicate from "exclude" is found -> perform removing operation. This list might contain up to 2000 elements.
I think that LINQ might be of some help, but I don't understand its syntax well.
A faster way (O(n)) would be to do the following:
go through the exclude list and make it into a HashSet (O(n))
in the checks, test whether the element in question is in the set (again O(n) for the whole pass), since a presence test on a HashSet is O(1).
Maybe you can even change your algorithm so that the exclude collection is a HashSet from the very beginning; that way you can omit step 1 and gain even more speed.
(Your current way is O(n^2).)
Edit:
Another idea is the following: you are perhaps creating a copy of some list and making this method modify it? (A guess based on the parameter name.) Then you can change it to this: pass the original list to the method, and make the method allocate a new list and return it (your method signature would then be something like private List<int> getWithoutDoubles(HashSet<int> exclude, List<int> original)).
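Assuming the even/odd rule means each (even, odd) index pair always survives or is removed as a unit, a minimal sketch of that signature might look like this:
private List<int> getWithoutDoubles(HashSet<int> exclude, List<int> original)
{
    // assumes original has an even number of elements (complete pairs)
    var result = new List<int>(original.Count);
    for (int i = 0; i + 1 < original.Count; i += 2)
    {
        // a duplicate at the even index removes itself and the next element,
        // a duplicate at the odd index removes itself and the previous one,
        // so either way the whole pair is dropped
        if (!exclude.Contains(original[i]) && !exclude.Contains(original[i + 1]))
        {
            result.Add(original[i]);
            result.Add(original[i + 1]);
        }
    }
    return result;
}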
Edit:
It could be even faster if you reorganized the input data in the following way: since the items are always removed in pairs (an even index plus the following odd index), you should pair them up in advance, so that your list of ints becomes a list of pairs of ints. Your method might then be something like this:
private List<Tuple<int, int>> getWithoutDoubles(
HashSet<int> exclude, List<Tuple<int, int>> original)
{
return original.Where(xy => (!exclude.Contains(xy.Item1) &&
!exclude.Contains(xy.Item2)))
.ToList();
}
(you remove the pairs where either the first or the second item is in the exclude collection). Instead of Tuple, perhaps you can pack the items into your custom type.
Here is yet another way to get the results.
var a = new List<int> {1, 2, 3, 4, 5};
var b = new List<int> {1, 2, 3};
var c = (from i in a let found = b.Any(j => j == i) where !found select i).ToList();
c will contain 4,5
Reverse your loops so they start at Count - 1 and go down to 0; then you don't have to adjust i in one of the cases, and Count is only evaluated once per collection. A sketch:
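Here is what that might look like; like the original code, it assumes the pair partner of a matched element always exists:
for (int j = exclude.Count - 1; j >= 0; j--)
{
    for (int i = listcopy.Count - 1; i >= 0; i--)
    {
        if (listcopy[i] != exclude[j])
            continue;

        // the matching element and its pair partner are removed together
        int start = (i % 2 == 0) ? i : i - 1;
        listcopy.RemoveAt(start + 1);
        listcopy.RemoveAt(start);
        i = start;   // only the odd case actually moves i; the next i-- continues below the removed pair
    }
}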
Can you convert the List to LinkedList and have a try? The List.RemoveAt() is more expensive than LinkedList.Remove().
I got a simple List of ints.
List<int> myInts = new List<int>();
myInts.Add(0);
myInts.Add(1);
myInts.Add(4);
myInts.Add(6);
myInts.Add(24);
My goal is to get the first unused (available) value from the List.
(the first positive value that's not already present in the collection)
In this case, the answer would be 2.
Here's my current code :
int GetFirstFreeInt()
{
for (int i = 0; i < int.MaxValue; ++i)
{
if(!myInts.Contains(i))
return i;
}
throw new InvalidOperationException("All integers are already used.");
}
Is there a better way? Maybe using LINQ? How would you do this?
Of course here I used ints for simplicity but my question could apply to any type.
You basically want the first element from the sequence 0..int.MaxValue that is not contained in myInts:
int firstAvailable = Enumerable.Range(0, int.MaxValue)
.Except(myInts)
.FirstOrDefault();
Edit in response to comment:
There is no performance penalty here for iterating up to int.MaxValue. What LINQ is going to do internally is create a hash table for myInts and then begin iterating over the sequence created by Enumerable.Range(); once the first item not contained in the hash table is found, that integer is yielded by the Except() method and returned by FirstOrDefault(), after which the iteration stops. This means the overall effort is O(n) for creating the hash table and then worst case O(n) for iterating over the sequence, where n is the number of integers in myInts.
For more on Except() see, e.g., Jon Skeet's EduLinq series: Reimplementing LINQ to Objects: Part 17 - Except
Well, if the list is sorted from smallest to largest and contains only distinct non-negative values, you could simply check the i-th element: if (myInts[i] != i) return i;. That gives essentially the same result, but without iterating through the list for each and every Contains check (the Contains method iterates through the list, turning the algorithm into O(n^2) rather than O(n)).
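A sketch of that check, assuming myInts is sorted ascending and contains only distinct non-negative values:
int GetFirstFreeInt()
{
    for (int i = 0; i < myInts.Count; i++)
    {
        if (myInts[i] != i)
            return i;          // gap found: i is the first missing value
    }
    return myInts.Count;       // no gaps, so the next integer is free
}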
Let's assume we have a large list of points List<Point> pointList (already stored in memory) where each Point contains X, Y, and Z coordinate.
Now, I would like to select for example N% of points with biggest Z-values of all points stored in pointList. Right now I'm doing it like that:
double N = 0.05; // selecting only 5% of points
double cutoffValue = pointList
    .OrderBy(p => p.Z) // First bottleneck - creates sorted copy of all data
    .ElementAt((int)(pointList.Count * (1 - N))).Z;
List<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue).ToList();
But I have two memory usage bottlenecks here: the first during OrderBy (more important) and the second while selecting the points (less important, because we usually want to select only a small number of points).
Is there any way of replacing OrderBy (or maybe other way of finding this cutoff point) with something that uses less memory?
The problem is quite important, because LINQ copies the whole dataset, and for the big files I'm processing it sometimes hits a few hundred MB.
Write a method that iterates through the list once and maintains a set of the M largest elements. Each step will only require O(log M) work to maintain the set, and you can have O(M) memory and O(N log M) running time.
public static IEnumerable<TSource> TakeLargest<TSource, TKey>
(this IEnumerable<TSource> items, Func<TSource, TKey> selector, int count)
{
var set = new SortedDictionary<TKey, List<TSource>>();
var resultCount = 0;
var first = default(KeyValuePair<TKey, List<TSource>>);
foreach (var item in items)
{
// If the key is already smaller than the smallest
// item in the set, we can ignore this item
var key = selector(item);
if (first.Value == null ||
resultCount < count ||
Comparer<TKey>.Default.Compare(key, first.Key) >= 0)
{
// Add next item to set
if (!set.ContainsKey(key))
{
set[key] = new List<TSource>();
}
set[key].Add(item);
if (first.Value == null)
{
first = set.First();
}
// Remove smallest item from set
resultCount++;
if (resultCount - first.Value.Count >= count)
{
set.Remove(first.Key);
resultCount -= first.Value.Count;
first = set.First();
}
}
}
return set.Values.SelectMany(values => values);
}
That will include more than count elements if there are ties, as your implementation does now.
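Usage might look something like this (purely illustrative, reusing the names from the question):
int countToTake = (int)(pointList.Count * N);   // e.g. the top 5%
List<Point> selectedPoints = pointList
    .TakeLargest(p => p.Z, countToTake)
    .ToList();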
You could sort the list in place, using List<T>.Sort, which uses the Quicksort algorithm. But of course, your original list would be sorted, which is perhaps not what you want...
pointList.Sort((a, b) => b.Z.CompareTo(a.Z));
var selectedPoints = pointList.Take((int)(pointList.Count * N)).ToList();
If you don't mind the original list being sorted, this is probably the best balance between memory usage and speed.
You can use Indexed LINQ to put index on the data which you are processing. This can result in noticeable improvements in some cases.
If you combine the two there is a chance a little less work will be done:
List<Point> selectedPoints = pointList
    .OrderByDescending(p => p.Z) // First bottleneck - creates sorted copy of all data
    .Take((int)(pointList.Count * N))
    .ToList();
But basically this kind of ranking requires sorting, your biggest cost.
A few more ideas:
if you use a class Point (instead of a struct Point) there will be much less copying.
you could write a custom sort that only bothers to move the top 5% up. Something like (don't laugh) BubbleSort.
If your list is in memory already, I would sort it in place instead of making a copy (unless you need it un-sorted again, that is, in which case you'll have to weigh having two copies in memory against loading it again from storage):
pointList.Sort((x,y) => y.Z.CompareTo(x.Z)); //this should sort it in desc. order
Also, not sure how much it will help, but it looks like you're going through your list twice: once to find the cutoff value, and once again to select the points. I assume you're doing that because you want to let all ties through, even if it means selecting more than 5% of the points. However, since they're already sorted, you can use that to your advantage and stop when you're done.
double cutoffValue = pointList[(int)(pointList.Count * N)].Z;
List<Point> selectedPoints = pointList.TakeWhile(p => p.Z >= cutoffValue)
.ToList();
Unless your list is extremely large, it's much more likely to me that cpu time is your performance bottleneck. Yes, your OrderBy() might use a lot of memory, but it's generally memory that for the most part is otherwise sitting idle. The cpu time really is the bigger concern.
To improve cpu time, the most obvious thing here is to not use a list. Use an IEnumerable instead. You do this by simply not calling .ToList() at the end of your where query. This will allow the framework to combine everything into one iteration of the list that runs only as needed. It will also improve your memory use because it avoids loading the entire query into memory at once, and instead defers it to only load one item at a time as needed. Also, use .Take() rather than .ElementAt(). It's a lot more efficient.
double N = 0.05; // selecting only 5% of points
int count = (int)(N * pointList.Count);
var selectedPoints = pointList.OrderByDescending(p => p.Z).Take(count);
That out of the way, there are three cases where memory use might actually be a problem:
Your collection really is so large as to fill up memory. For a simple Point structure on a modern system we're talking millions of items. This is really unlikely. On the off chance you have a system this large, your solution is to use a relational database, which can keep this items on disk relatively efficiently.
You have a moderate size collection, but there are external performance constraints, such as needing to share system resources with many other processes as you might find in an asp.net web site. In this case, the answer is either to 1) again put the points in a relational database or 2) offload the work to the client machines.
Your collection is just large enough to end up on the Large Object Heap, and the internal buffer allocated by the OrderBy() call is also placed on the LOH. Now what happens is that the garbage collector will not compact that memory after your OrderBy() call, and over time you get a lot of memory that is not used but still reserved by your program. In this case, the solution is, unfortunately, to break your collection up into multiple groups that are each individually small enough not to trigger use of the LOH.
Update:
Reading through your question again, I see you're reading very large files. In that case, the best performance can be obtained by writing your own code to parse the files. If the count of items is stored near the top of the file you can do much better, or even if you can estimate the number of records based on the size of the file (guess a little high to be sure, and then truncate any extras after finishing), you can then build your final collection as you read. This will greatly improve CPU performance and memory use.
I'd do it by implementing "half" a quicksort.
Consider your original set of points, P, where you are looking for the "top" N items by Z coordinate.
Choose a pivot x in P.
Partition P into L = {y in P | y < x} and U = {y in P | x <= y}.
If N = |U| then you're done.
If N < |U| then recurse with P := U.
Otherwise you need to add some items to U: recurse with N := N - |U|, P := L to add the remaining items.
If you choose your pivot wisely (e.g., the median of, say, five random samples) then this will run in O(n) expected time; it's essentially a quickselect. A rough sketch follows.
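Here is one such sketch, assuming Point.Z is a double and that we want the topCount points with the largest Z; it is an illustration of the partitioning idea rather than a drop-in replacement:
static List<Point> TopByZ(List<Point> points, int topCount)
{
    var rng = new Random();
    var work = new List<Point>(points);        // work on a copy; the input stays untouched
    int boundary = work.Count - topCount;      // items at index >= boundary are the winners
    int lo = 0, hi = work.Count - 1;
    while (lo < hi)
    {
        // partition around a random pivot: smaller Z to the left, larger to the right
        double pivot = work[rng.Next(lo, hi + 1)].Z;
        int i = lo, j = hi;
        while (i <= j)
        {
            while (work[i].Z < pivot) i++;
            while (work[j].Z > pivot) j--;
            if (i <= j)
            {
                var tmp = work[i]; work[i] = work[j]; work[j] = tmp;
                i++; j--;
            }
        }
        // keep narrowing only the side that contains the boundary index
        if (boundary <= j) hi = j;
        else if (boundary >= i) lo = i;
        else break;                            // boundary falls in the all-equal middle band
    }
    return work.GetRange(boundary, topCount);
}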
Hmmmm, thinking some more, you may be able to avoid creating new sets altogether, since essentially you're just looking for an O(n log n) way of finding the Nth greatest item from the original set. Yes, I think this would work, so here's suggestion number 2:
Make a traversal of P, finding the least and greatest items, A and Z, respectively.
Let M be the mean of A and Z (remember, we're only considering Z coordinates here).
Count how many items there are in the range [M, Z], call this Q.
If Q < N then the Nth greatest item in P is somewhere in [A, M). Try M := (A + M)/2.
If N < Q then the Nth greatest item in P is somewhere in [M, Z]. Try M := (M + Z)/2.
Repeat until we find an M such that Q = N.
Now traverse P, selecting all items greater than or equal to M.
Each pass is O(n), and the number of passes depends on how tightly the Z values cluster around the cutoff; it creates no extra data structures (except for the result).
Howzat?
You might use something like this:
pointList.Sort(); // Use your own comparer here if needed
// Skip OrderBy because the list is sorted (and not copied)
double cutoffValue = pointList.ElementAt((int)(pointList.Count * (1 - N))).Z;
// Skip ToList to avoid another copy of the list
IEnumerable<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue);
If you want a small percentage of points ordered by some criterion, you'll be better served using a Priority queue data structure; create a size-limited queue(with the size set to however many elements you want), and then just scan through the list inserting every element. After the scan, you can pull out your results in sorted order.
This has the benefit of being O(n log p) instead of O(n log n) where p is the number of points you want, and the extra storage cost is also dependent on your output size instead of the whole list.
int resultSize = (int)(pointList.Count * N);
FixedSizedPriorityQueue<Point> q =
new FixedSizedPriorityQueue<Point>(resultSize, p => p.Z);
q.AddEach(pointList);
List<Point> selectedPoints = q.ToList();
Now all you have to do is implement a FixedSizedPriorityQueue that adds elements one at a time and discards the smallest element when it is full, so that only the largest ones are kept.
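As one hedged sketch of such a queue, here is a version built on PriorityQueue<TElement, TPriority> (available in .NET 6 and later); it keeps the size items with the largest keys by evicting the smallest whenever capacity is exceeded, and it assumes the key (here Z) is a double:
class FixedSizedPriorityQueue<T>
{
    // min-priority queue: the element with the smallest key is always at the front
    private readonly PriorityQueue<T, double> _queue = new PriorityQueue<T, double>();
    private readonly int _size;
    private readonly Func<T, double> _keySelector;

    public FixedSizedPriorityQueue(int size, Func<T, double> keySelector)
    {
        _size = size;
        _keySelector = keySelector;
    }

    public void Add(T item)
    {
        _queue.Enqueue(item, _keySelector(item));
        if (_queue.Count > _size)
            _queue.Dequeue();                  // discard the current smallest key
    }

    public void AddEach(IEnumerable<T> items)
    {
        foreach (var item in items)
            Add(item);
    }

    public List<T> ToList()
    {
        // drains the queue; results come out from smallest to largest key
        var result = new List<T>(_queue.Count);
        while (_queue.Count > 0)
            result.Add(_queue.Dequeue());
        return result;
    }
}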
You wrote that you are working with a DataSet. If so, you can use a DataView to sort your data once and use it for all future access to the rows.
Just tried with 50,000 rows and 100 times accessing 30% of them. My performance results are:
Sort With Linq: 5.3 seconds
Use DataViews: 0.01 seconds
Give it a try.
[TestClass]
public class UnitTest1 {
class MyTable : TypedTableBase<MyRow> {
public MyTable() {
Columns.Add("Col1", typeof(int));
Columns.Add("Col2", typeof(int));
}
protected override DataRow NewRowFromBuilder(DataRowBuilder builder) {
return new MyRow(builder);
}
}
class MyRow : DataRow {
public MyRow(DataRowBuilder builder) : base(builder) {
}
public int Col1 { get { return (int)this["Col1"]; } }
public int Col2 { get { return (int)this["Col2"]; } }
}
DataView _viewCol1Asc;
DataView _viewCol2Desc;
MyTable _table;
int _countToTake;
[TestMethod]
public void MyTestMethod() {
_table = new MyTable();
int count = 50000;
for (int i = 0; i < count; i++) {
_table.Rows.Add(i, i);
}
_countToTake = _table.Rows.Count / 30;
Console.WriteLine("SortWithLinq");
RunTest(SortWithLinq);
Console.WriteLine("Use DataViews");
RunTest(UseSortedDataViews);
}
private void RunTest(Action method) {
int iterations = 100;
Stopwatch watch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++) {
method();
}
watch.Stop();
Console.WriteLine(" {0}", watch.Elapsed);
}
private void UseSortedDataViews() {
if (_viewCol1Asc == null) {
_viewCol1Asc = new DataView(_table, null, "Col1 ASC", DataViewRowState.Unchanged);
_viewCol2Desc = new DataView(_table, null, "Col2 DESC", DataViewRowState.Unchanged);
}
var rows = _viewCol1Asc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
IterateRows(rows);
rows = _viewCol2Desc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
IterateRows(rows);
}
private void SortWithLinq() {
var rows = _table.OrderBy(row => row.Col1).Take(_countToTake);
IterateRows(rows);
rows = _table.OrderByDescending(row => row.Col2).Take(_countToTake);
IterateRows(rows);
}
private void IterateRows(IEnumerable<MyRow> rows) {
foreach (var row in rows)
if (row == null)
throw new Exception("????");
}
}