Let's assume we have a large list of points List<Point> pointList (already stored in memory) where each Point contains X, Y, and Z coordinates.
Now I would like to select, for example, the N% of points with the biggest Z-values out of all points stored in pointList. Right now I'm doing it like this:
double N = 0.05; // selecting only 5% of points
double cutoffValue = pointList
    .OrderBy(p => p.Z) // First bottleneck - creates a sorted copy of all the data
    .ElementAt((int)(pointList.Count * (1 - N))).Z;
List<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue).ToList();
But I have two memory-usage bottlenecks here: the first during OrderBy (more important) and the second while selecting the points (less important, because we usually want to select only a small number of points).
Is there any way of replacing OrderBy (or some other way of finding this cutoff value) with something that uses less memory?
The problem is quite important, because LINQ copies the whole dataset, and for the big files I'm processing it sometimes hits a few hundred MB.
Write a method that iterates through the list once and maintains a set of the M largest elements. Each step will only require O(log M) work to maintain the set, and you can have O(M) memory and O(N log M) running time.
public static IEnumerable<TSource> TakeLargest<TSource, TKey>(
    this IEnumerable<TSource> items, Func<TSource, TKey> selector, int count)
{
    var set = new SortedDictionary<TKey, List<TSource>>();
    var resultCount = 0;
    var first = default(KeyValuePair<TKey, List<TSource>>);
    foreach (var item in items)
    {
        // If the key is already smaller than the smallest
        // item in the set, we can ignore this item
        var key = selector(item);
        if (first.Value == null ||
            resultCount < count ||
            Comparer<TKey>.Default.Compare(key, first.Key) >= 0)
        {
            // Add next item to set
            if (!set.ContainsKey(key))
            {
                set[key] = new List<TSource>();
            }
            set[key].Add(item);
            if (first.Value == null)
            {
                first = set.First();
            }
            // Remove smallest item from set
            resultCount++;
            if (resultCount - first.Value.Count >= count)
            {
                set.Remove(first.Key);
                resultCount -= first.Value.Count;
                first = set.First();
            }
        }
    }
    return set.Values.SelectMany(values => values);
}
That will include more than count elements if there are ties, as your implementation does now.
You could sort the list in place, using List<T>.Sort, which uses the Quicksort algorithm. But of course, your original list would be sorted, which is perhaps not what you want...
pointList.Sort((a, b) => b.Z.CompareTo(a.Z));
var selectedPoints = pointList.Take((int)(pointList.Count * N)).ToList();
If you don't mind the original list being sorted, this is probably the best balance between memory usage and speed.
You can use Indexed LINQ to put an index on the data you are processing. This can result in noticeable improvements in some cases.
If you combine the two, there is a chance a little less work will be done:
List<Point> selectedPoints = pointList
    .OrderByDescending(p => p.Z) // First bottleneck - still creates a sorted copy of all the data
    .Take((int)(pointList.Count * N))
    .ToList();
But basically this kind of ranking requires sorting, which is your biggest cost.
A few more ideas:
if you use a class Point (instead of a struct Point), only references get copied around, so there will be much less copying.
you could write a custom sort that only bothers to move the top 5% up - something like (don't laugh) bubble sort; see the sketch below.
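For what it's worth, here is a minimal sketch of that partial-sort idea: k bubble-style passes, each pushing the largest remaining Z value toward the end of the list. The method name is mine; it is only worth considering when k is tiny relative to the list, since the cost is O(n·k).
// Hypothetical partial bubble-up: after k passes, the last k elements of
// points hold the k largest Z values (ordered ascending at the tail).
static void BubbleUpLargest(List<Point> points, int k)
{
    int n = points.Count;
    for (int pass = 0; pass < k; pass++)
    {
        for (int i = 0; i < n - 1 - pass; i++)
        {
            if (points[i].Z > points[i + 1].Z)
            {
                // swap adjacent elements so the larger Z moves right
                var tmp = points[i];
                points[i] = points[i + 1];
                points[i + 1] = tmp;
            }
        }
    }
}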
If your list is already in memory, I would sort it in place instead of making a copy - unless you need it unsorted again, that is, in which case you'll have to weigh having two copies in memory against loading it again from storage:
pointList.Sort((x, y) => y.Z.CompareTo(x.Z)); // this sorts it in descending order
Also, not sure how much it will help, but it looks like you're going through your list twice - once to find the cutoff value, and once again to select the points. I assume you're doing that because you want to let all ties through, even if it means selecting more than 5% of the points. However, since the list is already sorted, you can use that to your advantage and stop when you're finished.
double cutoffValue = pointList[(int)(pointList.Count * N)].Z; // index from the top, since the list is now descending
List<Point> selectedPoints = pointList.TakeWhile(p => p.Z >= cutoffValue)
                                      .ToList();
Unless your list is extremely large, it's much more likely to me that cpu time is your performance bottleneck. Yes, your OrderBy() might use a lot of memory, but it's generally memory that for the most part is otherwise sitting idle. The cpu time really is the bigger concern.
To improve cpu time, the most obvious thing here is to not use a list. Use an IEnumerable instead. You do this by simply not calling .ToList() at the end of your where query. This will allow the framework to combine everything into one iteration of the list that runs only as needed. It will also improve your memory use because it avoids loading the entire query into memory at once, and instead defers it to only load one item at a time as needed. Also, use .Take() rather than .ElementAt(). It's a lot more efficient.
double N = 0.05; // selecting only 5% of points
int count = (int)(N * pointList.Count);
var selectedPoints = pointList.OrderByDescending(p => p.Z).Take(count);
That out of the way, there are three cases where memory use might actually be a problem:
Your collection really is so large as to fill up memory. For a simple Point structure on a modern system we're talking millions of items. This is really unlikely. On the off chance you have a dataset this large, your solution is to use a relational database, which can keep these items on disk relatively efficiently.
You have a moderate size collection, but there are external performance constraints, such as needing to share system resources with many other processes as you might find in an asp.net web site. In this case, the answer is either to 1) again put the points in a relational database or 2) offload the work to the client machines.
Your collection is just large enough to end up on the Large Object Heap, and the buffer that OrderBy() allocates internally is also placed on the LOH. Now what happens is that the garbage collector will not compact that memory after your OrderBy() call, and over time you accumulate a lot of memory that is not used but is still reserved by your program. In this case, the solution is, unfortunately, to break your collection up into multiple groups that are each individually small enough not to trigger use of the LOH.
Update:
Reading through your question again, I see you're reading very large files. In that case, the best performance can be obtained by writing your own code to parse the files. If the count of items is stored near the top of the file you can do much better; even if you can only estimate the number of records from the size of the file (guess a little high to be sure, and truncate any extras after finishing), you can build your final collection as you read. This will greatly improve CPU performance and memory use.
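As a rough sketch of that idea - assuming a plain text file with one whitespace-separated "X Y Z" triple per line (the actual file format isn't given in the question) and a Point class with settable double X, Y, Z properties:
// Hypothetical loader: estimates the record count from the file size,
// preallocates the list, and streams the file one line at a time.
static List<Point> LoadPoints(string path)
{
    const int estimatedBytesPerLine = 30;               // a low guess overestimates the count, which is the safe direction
    long fileSize = new System.IO.FileInfo(path).Length;
    var points = new List<Point>((int)(fileSize / estimatedBytesPerLine) + 16);

    foreach (var line in System.IO.File.ReadLines(path)) // streams lines, never loads the whole file
    {
        var parts = line.Split(' ');
        points.Add(new Point
        {
            X = double.Parse(parts[0]),
            Y = double.Parse(parts[1]),
            Z = double.Parse(parts[2])
        });
    }
    return points;
}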
I'd do it by implementing "half" a quicksort.
Consider your original set of points, P, where you are looking for the "top" N items by Z coordinate.
Choose a pivot x in P.
Partition P into L = {y in P | y < x} and U = {y in P | x <= y}.
If N = |U| then you're done.
If N < |U| then recurse with P := U.
Otherwise you need to add some items to U: recurse with N := N - |U|, P := L to add the remaining items.
If you choose your pivot wisely (e.g., the median of, say, five random samples) then this runs in expected linear time, O(n); a sketch follows below.
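For illustration, here is a minimal sketch of that partition-and-recurse idea written iteratively (names are mine). It copies only the Z values into a double[] - O(n) extra doubles, far less than a sorted copy of the Point objects - and finds the k-th smallest value in expected linear time:
// Hypothetical quickselect: rearranges zs in place so that zs[k] ends up
// holding the k-th smallest value, and returns that value.
static double QuickSelect(double[] zs, int k)
{
    var rng = new Random();
    int lo = 0, hi = zs.Length - 1;
    while (true)
    {
        if (lo == hi) return zs[lo];

        // Random pivot, Lomuto partition: items < pivot end up left of 'store'.
        int pivotIndex = lo + rng.Next(hi - lo + 1);
        double pivot = zs[pivotIndex];
        (zs[pivotIndex], zs[hi]) = (zs[hi], zs[pivotIndex]);
        int store = lo;
        for (int i = lo; i < hi; i++)
        {
            if (zs[i] < pivot)
            {
                (zs[store], zs[i]) = (zs[i], zs[store]);
                store++;
            }
        }
        (zs[store], zs[hi]) = (zs[hi], zs[store]);

        if (k == store) return zs[store];
        if (k < store) hi = store - 1;
        else lo = store + 1;
    }
}

// e.g. cutoff Z for the top 5%:
// double cutoff = QuickSelect(pointList.Select(p => p.Z).ToArray(), (int)(pointList.Count * (1 - N)));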
Hmmmm, thinking some more, you may be able to avoid creating new sets altogether, since essentially you're just looking for an O(n log n) way of finding the Nth greatest item from the original set. Yes, I think this would work, so here's suggestion number 2:
Make a traversal of P, finding the least and greatest items, A and Z, respectively.
Let lo = A and hi = Z, and let M be the midpoint of lo and hi (remember, we're only considering Z coordinates here).
Count how many items have a value greater than or equal to M; call this Q.
If Q < N then the Nth greatest item in P is somewhere below M: set hi := M and try M := (lo + hi)/2.
If N < Q then the Nth greatest item in P is at or above M: set lo := M and try M := (lo + hi)/2.
Repeat until we find an M such that Q = N (or until lo and hi are too close to split further, which can happen when there are ties).
Now traverse P, keeping all items greater than or equal to M.
Each pass over P is O(n) and the number of passes is bounded by the bisection, so this is roughly O(n log n) and creates no extra data structures (except for the result).
Howzat?
You might use something like this:
pointList.Sort(); // Use your own comparer here if needed
// Skip OrderBy because the list is sorted (and not copied)
double cutoffValue = pointList[(int)(pointList.Count * (1 - N))].Z;
// Skip ToList to avoid another copy of the list
IEnumerable<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue);
If you want a small percentage of points ordered by some criterion, you'll be better served using a priority queue data structure; create a size-limited queue (with the size set to however many elements you want), and then just scan through the list inserting every element. After the scan, you can pull out your results in sorted order.
This has the benefit of being O(n log p) instead of O(n log n) where p is the number of points you want, and the extra storage cost is also dependent on your output size instead of the whole list.
int resultSize = (int)(pointList.Count * N);
FixedSizedPriorityQueue<Point> q =
    new FixedSizedPriorityQueue<Point>(resultSize, p => p.Z);
q.AddEach(pointList);
List<Point> selectedPoints = q.ToList();
Now all you have to do is implement a FixedSizedPriorityQueue that adds elements one at a time and discards the smallest element when it is full; a sketch follows below.
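Here is a minimal sketch of such a queue built on .NET 6's PriorityQueue<TElement, TPriority> (a min-heap); the class name and the Func<T, double> key selector match the snippet above, and on older frameworks you would substitute a hand-rolled binary heap:
// Hypothetical size-limited priority queue: retains the 'capacity' items with
// the largest keys by evicting the current minimum once the heap is full.
class FixedSizedPriorityQueue<T>
{
    private readonly int _capacity;
    private readonly Func<T, double> _keySelector;
    private readonly PriorityQueue<T, double> _heap = new(); // min-heap keyed by the selector

    public FixedSizedPriorityQueue(int capacity, Func<T, double> keySelector)
    {
        _capacity = capacity;
        _keySelector = keySelector;
    }

    public void Add(T item)
    {
        double key = _keySelector(item);
        if (_heap.Count < _capacity)
            _heap.Enqueue(item, key);
        else
            _heap.EnqueueDequeue(item, key); // drops whichever element now has the smallest key
    }

    public void AddEach(IEnumerable<T> items)
    {
        foreach (var item in items) Add(item);
    }

    public List<T> ToList()
    {
        // UnorderedItems exposes (Element, Priority) pairs; order them descending by key.
        return _heap.UnorderedItems
                    .OrderByDescending(e => e.Priority)
                    .Select(e => e.Element)
                    .ToList();
    }
}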
You wrote that you are working with a DataSet. If so, you can use a DataView to sort your data once and use it for all future row access.
Just tried it with 50,000 rows, accessing a slice of them (Rows.Count / 30) 100 times. My performance results are:
Sort With Linq: 5.3 seconds
Use DataViews: 0.01 seconds
Give it a try.
[TestClass]
public class UnitTest1 {
class MyTable : TypedTableBase<MyRow> {
public MyTable() {
Columns.Add("Col1", typeof(int));
Columns.Add("Col2", typeof(int));
}
protected override DataRow NewRowFromBuilder(DataRowBuilder builder) {
return new MyRow(builder);
}
}
class MyRow : DataRow {
public MyRow(DataRowBuilder builder) : base(builder) {
}
public int Col1 { get { return (int)this["Col1"]; } }
public int Col2 { get { return (int)this["Col2"]; } }
}
DataView _viewCol1Asc;
DataView _viewCol2Desc;
MyTable _table;
int _countToTake;
[TestMethod]
public void MyTestMethod() {
_table = new MyTable();
int count = 50000;
for (int i = 0; i < count; i++) {
_table.Rows.Add(i, i);
}
_countToTake = _table.Rows.Count / 30;
Console.WriteLine("SortWithLinq");
RunTest(SortWithLinq);
Console.WriteLine("Use DataViews");
RunTest(UseSortedDataViews);
}
private void RunTest(Action method) {
int iterations = 100;
Stopwatch watch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++) {
method();
}
watch.Stop();
Console.WriteLine(" {0}", watch.Elapsed);
}
private void UseSortedDataViews() {
if (_viewCol1Asc == null) {
_viewCol1Asc = new DataView(_table, null, "Col1 ASC", DataViewRowState.Unchanged);
_viewCol2Desc = new DataView(_table, null, "Col2 DESC", DataViewRowState.Unchanged);
}
var rows = _viewCol1Asc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
IterateRows(rows);
rows = _viewCol2Desc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
IterateRows(rows);
}
private void SortWithLinq() {
var rows = _table.OrderBy(row => row.Col1).Take(_countToTake);
IterateRows(rows);
rows = _table.OrderByDescending(row => row.Col2).Take(_countToTake);
IterateRows(rows);
}
private void IterateRows(IEnumerable<MyRow> rows) {
foreach (var row in rows)
if (row == null)
throw new Exception("????");
}
}
I am perplexed by this issue. I believe I'm just missing an easy problem right in front of my face, but I'm at the point where I need a second opinion to point out anything obvious that I'm missing. I minimized and simplified my code so it only shows a small part of what it does; the full code is just many different calculations added on to what I have below.
for (int h = 2; h < 200; h++)
{
    var List1 = CalculateSomething(testValues, h);
    masterLists = await AddToRsquaredList("Calculation1", h, actualValuesList, List1, masterLists.Item1, masterLists.Item2);

    var List2 = CalculateSomething(testValues, h);
    masterLists = await AddToRsquaredList("Calculation2", h, actualValuesList, List2, masterLists.Item1, masterLists.Item2);

    var List3 = CalculateSomething(testValues, h);
    masterLists = await AddToRsquaredList("Calculation3", h, actualValuesList, List3, masterLists.Item1, masterLists.Item2);
}
public static async Task<(List<RSquaredValues3>, List<ValueClass>)> AddToRsquaredList(string valueName, int days,
    IEnumerable<double> estimatedValuesList, IEnumerable<double> actualValuesList,
    List<RSquaredValues3> rSquaredList, List<ValueClass> valueClassList)
{
    try
    {
        RSquaredValues3 rSquaredValue = new RSquaredValues3
        {
            ValueName = valueName,
            Days = days,
            RSquared = GoodnessOfFit.CoefficientOfDetermination(estimatedValuesList, actualValuesList),
            StdError = GoodnessOfFit.PopulationStandardError(estimatedValuesList, actualValuesList)
        };

        int comboSize = 15;
        double max = 0;
        var query = await rSquaredList.OrderBy(i => i.StdError - i.RSquared).DistinctBy(i => i.ValueName).Take(comboSize).ToListAsync().ConfigureAwait(false);
        if (query.Count > 0)
        {
            max = query.Last().StdError - query.Last().RSquared;
        }
        else
        {
            max = 10000000;
        }

        if ((rSquaredValue.StdError - rSquaredValue.RSquared < max || query.Count < comboSize) && rSquaredList.Contains(rSquaredValue) == false)
        {
            rSquaredList.Add(rSquaredValue);
            valueClassList.Add(new ValueClass { ValueName = rSquaredValue.ValueName, ValueList = estimatedValuesList, Days = days });
        }
    }
    catch (Exception ex)
    {
        ThrowExceptionInfo(ex);
    }
    return (rSquaredList, valueClassList);
}
There is clearly a significance to StdError - RSquared, so change RSquaredValues3 to expose that value as a property (calculate it once, on construction, since the inputs do not change) rather than recalculating it in multiple places during the processing loop.
The value in this new property is also what the list is being sorted by. Rather than sorting the list over and over again, consider keeping the items in the list in that order in the first place: each time an item gets added, insert it at the right position in the list. This is called an insertion sort. (I have assumed that SortedList<TKey,TValue> is inappropriate due to duplicate 'key's.) A sketch of such an insertion follows below.
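A minimal sketch of that sorted insertion, assuming RSquaredValues3 exposes the precomputed difference as a Fitness property (a hypothetical name, not from the original code):
// Hypothetical helper: keeps rSquaredList permanently ordered by Fitness,
// so no OrderBy is needed inside the loop.
static void InsertSorted(List<RSquaredValues3> list, RSquaredValues3 item)
{
    int index = list.BinarySearch(item,
        Comparer<RSquaredValues3>.Create((a, b) => a.Fitness.CompareTo(b.Fitness)));
    if (index < 0)
        index = ~index;      // BinarySearch returns the complement of the insertion point
    list.Insert(index, item);
}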
Similar improvements can be made to avoid the need for DistinctBy(i => i.ValueName). If you are only interested in distinct value names, then consider avoiding inserting the item if it is not providing an improvement.
Your List needs to grow during your processing - under the hood, the list doubles its capacity every time it runs out of room, so the number of growths is O(log n). You can specify a suggested capacity at construction; if you specify the expected size up front, the list will not need to reallocate during your processing.
The await of the ToListAsync is not adding any advantage to this code, as far as I can see.
The check for rSquaredList.Contains(rSquaredValue) == false looks redundant, since this is a reference comparison against a newly instantiated item which cannot already be in the list. So you can remove it to make the code run faster.
With all that use of Task and await, you are not actually gaining anything at the moment: a single thread handles everything and awaits each call sequentially, so it is all overhead. I am not sure whether you can parallelize this workload, but the main loop from 2 to 200 looks like a prime candidate for a Parallel.For() loop instead. You should also look into using a System.Collections.Concurrent.ConcurrentBag<T> for your master lists if you introduce parallelism, to avoid concurrency issues. A rough sketch follows below.
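A rough sketch of that shape, reusing the names from the question (whether CalculateSomething and the GoodnessOfFit calls are thread-safe would need to be verified first; filtering down to the best comboSize entries by StdError - RSquared is then done once, after the loop):
// Hypothetical parallel version of the outer loop: each iteration writes into
// thread-safe bags instead of the shared List<T>s.
var rSquaredBag = new System.Collections.Concurrent.ConcurrentBag<RSquaredValues3>();
var valueClassBag = new System.Collections.Concurrent.ConcurrentBag<ValueClass>();

System.Threading.Tasks.Parallel.For(2, 200, h =>
{
    var estimated = CalculateSomething(testValues, h);
    rSquaredBag.Add(new RSquaredValues3
    {
        ValueName = "Calculation1",
        Days = h,
        RSquared = GoodnessOfFit.CoefficientOfDetermination(estimated, actualValuesList),
        StdError = GoodnessOfFit.PopulationStandardError(estimated, actualValuesList)
    });
    valueClassBag.Add(new ValueClass { ValueName = "Calculation1", ValueList = estimated, Days = h });
    // ...repeat for Calculation2 and Calculation3...
});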
Many of the test cases are timing out. I've made sure I'm using lazy evaluation everywhere, linear (or better) routines, etc. I'm shocked that this is still not meeting the performance benchmarks.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
class Mine
{
public int Distance { get; set; } // from river
public int Gold { get; set; } // in tons
}
class Solution
{
static void Main(String[] args)
{
// helper function for reading lines
Func<string, int[]> LineToIntArray = (line) => Array.ConvertAll(line.Split(' '), Int32.Parse);
int[] line1 = LineToIntArray(Console.ReadLine());
int N = line1[0], // # of mines
K = line1[1]; // # of pickup locations
// Populate mine info
List<Mine> mines = new List<Mine>();
for(int i = 0; i < N; ++i)
{
int[] line = LineToIntArray(Console.ReadLine());
mines.Add(new Mine() { Distance = line[0], Gold = line[1] });
}
// helper function for cost of a move
Func<Mine, Mine, int> MoveCost = (mine1, mine2) =>
Math.Abs(mine1.Distance - mine2.Distance) * mine1.Gold;
int sum = 0; // running total of move costs
// all move combinations on the current list of mines,
// given by the indicies of the mines
var indices = Enumerable.Range(0, N);
var moves = from i1 in indices
from i2 in indices
where i1 != i2
select new int[] { i1, i2 };
while(N != K) // while number of mines hasn't been consolidated to K
{
// get move with the least cost
var cheapest = moves.Aggregate(
(prev, cur) => MoveCost(mines[prev[0]],mines[prev[1]])
< MoveCost(mines[cur[0]], mines[cur[1]])
? prev : cur
);
int i = cheapest[0], // index of source mine of cheapest move
j = cheapest[1]; // index of destination mine of cheapest move
// add cost to total
sum += MoveCost(mines[i], mines[j]);
// move gold from source to destination
mines[j].Gold += mines[i].Gold;
// remove from moves any that had the i-th mine as a destination or source
moves = from move in moves
where move[0] != i && move[1] != i
select move;
// update size number of mines after consolidation
--N;
}
Console.WriteLine(sum);
}
}
Lazy evaluation will not make bad algorithms perform better. It will just delay when those performance problems will affect you. What lazy evaluation can help with is space complexity, i.e. reducing the amount of memory you need to execute your algorithm. Since the data is generated lazily, you will not (necessarily) have all the data in the memory at the same time.
However, relying on lazy evaluation to fix your space (or time) complexity problems can easily shoot you in the foot. Look the following example code:
var moves = Enumerable.Range(0, 5).Select(x => {
Console.WriteLine("Generating");
return x;
});
var aggregate = moves.Aggregate((p, c) => {
Console.WriteLine("Aggregating");
return p + c;
});
var newMoves = moves.Where(x => {
Console.WriteLine("Filtering");
return x % 2 == 0;
});
newMoves.ToList();
As you can see, both the aggregate and the newMoves rely on the lazily evaluated moves enumerable. Since the original count of moves is 5, we will see 4 “Aggregating” lines in the output, and 5 “Filtering” lines. But how often do you expect “Generating” to appear in the console?
The answer is 10. This is because moves is a generator and is being evaluated lazily. When multiple places request it, an iterator will be created for each, which ultimately means that the generator will execute multiple times (to generate independent results).
This is not necessarily a problem, but in your case, it very quickly becomes one. Assume that we continue above example code with another round of aggregating. That second aggregate will consume newMoves which in turns will consume the original moves. So to aggregate, we will re-run the original moves generator, and the newMoves generator. And if we were to add another level of filtering, the next round of aggregating would run three interlocked generators, again rerunning the original moves generator.
Since your original moves generator creates an enumerable of quadratic size, and has an actual time complexity of O(n²), this is an actual problem. With each iteration, you add another round of filtering which will be linear to the size of the moves enumerable, and you actually consume the enumerable completely for the aggregation. So you end up with O(n^2 + n^3 + n^4 + n^5 + …) which will eventually be the sum of n^j for j starting at 2 up to N-K. That is a very bad time complexity, and all just because you were trying to save memory by evaluating the moves lazily.
So the first step to make this better is to avoid lazy evaluation for moves. You are constantly iterating it and filtering it, so you should keep it in memory. Yes, that means you have a list of quadratic size, but you won't actually need more than that. This also limits your time complexity: you still need to filter the list in linear time (O(n²), since its size is n²) and you do that inside a loop, so you end up with cubic time (O(n³)), but that is already your upper limit (iterating the list a constant number of times within the loop only increases the time complexity by a constant factor, which doesn't matter). A sketch of that change follows below.
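Concretely, that change could look roughly like this (a sketch against the question's variables, not a drop-in patch):
// Materialize the candidate moves once, then keep the filtered list in memory
// each iteration instead of stacking up lazy Where() wrappers.
List<int[]> moves = (from i1 in Enumerable.Range(0, N)
                     from i2 in Enumerable.Range(0, N)
                     where i1 != i2
                     select new[] { i1, i2 }).ToList();

while (N != K)
{
    var cheapest = moves.Aggregate(
        (prev, cur) => MoveCost(mines[prev[0]], mines[prev[1]])
                       < MoveCost(mines[cur[0]], mines[cur[1]]) ? prev : cur);
    int i = cheapest[0], j = cheapest[1];

    sum += MoveCost(mines[i], mines[j]);
    mines[j].Gold += mines[i].Gold;

    // Rebuild the in-memory list, dropping moves that involve the consolidated mine.
    moves = moves.Where(move => move[0] != i && move[1] != i).ToList();
    --N;
}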
Once you have done that, you should consider your original problem, think about what you are actually doing. I believe you could probably reduce the computational complexity further if you use the information you have better, and use data structures (e.g. hash sets, or some graph where the move cost is already stored within) that aid you in the filtering and aggregation. I can’t give you exact ideas since I don’t know your original problem, but I’m sure there is something you can do.
Finally, if you have performance problems, always remember to profile your code. The profiler will tell you which parts of your code are the most expensive, so you get a clear idea of what you should try to optimize and what you don't need to focus on (optimizing already fast parts will not make the program any faster).
Say I have a sorted list of 1000 or so unique decimals, arranged by value.
List<decimal> decList
How can I get a random x number of decimals from a list of unique decimals that total up to y?
private List<decimal> getWinningValues(int xNumberToGet, decimal yTotalValue)
{
}
Is there any way to avoid a long processing time on this? My idea so far is to take xNumberToGet random numbers from the pool. Something like this (a cool way to get a random selection from a list):
foreach (decimal d in decList.OrderBy(x => randomInstance.Next()).Take(xNumberToGet))
{
}
Then I might check the total of those, and if the total is less, I might shift the numbers up (to the next available number) slowly. If the total is more, I might shift the numbers down. I'm still not sure how to implement this, or if there is a better design readily available. Any help would be much appreciated.
Ok, start with a little extension I got from this answer,
public static IEnumerable<IEnumerable<T>> Combinations<T>(
this IEnumerable<T> source,
int k)
{
if (k == 0)
{
return new[] { Enumerable.Empty<T>() };
}
return source.SelectMany((e, i) =>
source.Skip(i + 1).Combinations(k - 1)
.Select(c => (new[] { e }).Concat(c)));
}
this gives you a pretty efficient method to yield all the combinations with k members, without repetition, from a given IEnumerable. You could make good use of this in your implementation.
Bear in mind that if the IEnumerable and k are sufficiently large, this could take some time, i.e. much longer than you have. So I've modified your function to take a CancellationToken.
private static IEnumerable<decimal> GetWinningValues(
IEnumerable<decimal> allValues,
int numberToGet,
decimal targetValue,
CancellationToken canceller)
{
IList<decimal> currentBest = null;
var currentBestGap = decimal.MaxValue;
var locker = new object();
allValues.Combinations(numberToGet)
.AsParallel()
.WithCancellation(canceller)
.TakeWhile(c => currentBestGap != decimal.Zero)
.ForAll(c =>
{
var gap = Math.Abs(c.Sum() - targetValue);
if (gap < currentBestGap)
{
lock (locker)
{
currentBestGap = gap;
currentBest = c.ToList();
}
}
});
return currentBest;
}
I've an idea that you could sort the initial list and quit iterating the combinations at a certain point, when the sum must exceed the target. After some consideration, it's not trivial to identify that point, and the cost of checking may exceed the benefit. The benefit would have to be balanced against some function of the target value and the mean of the set.
I still think further optimization is possible but I also think that this work has already been done and I'd just need to look it up in the right place.
There are k such subsets of decList (k might be 0).
Assuming that you want to select each one with uniform probability 1/k, I think you basically need to do the following:
iterate over all the matching subsets
select one
Step 1 is potentially a big task, you can look into the various ways of solving the "subset sum problem" for a fixed subset size, and adapt them to generate each solution in turn.
Step 2 can be done either by making a list of all the solutions and choosing one, or (if that might take too much memory) by using a clever streaming random-selection algorithm (reservoir sampling); a sketch follows below.
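For reference, a minimal sketch of that streaming selection (single-item reservoir sampling). EnumerateMatchingSubsets stands in for whatever step-1 generator yields the candidate subsets one at a time; it is not shown here:
// Picks one element uniformly at random from a sequence of unknown length,
// holding only the current choice in memory.
static List<decimal> ChooseUniformly(IEnumerable<List<decimal>> solutions, Random rng)
{
    List<decimal> chosen = null;
    int seen = 0;
    foreach (var solution in solutions)
    {
        seen++;
        // Replace the current choice with probability 1/seen; after the loop,
        // every solution has had an equal 1/k chance of surviving.
        if (rng.Next(seen) == 0)
            chosen = solution;
    }
    return chosen; // null if there were no matching subsets
}

// e.g. var winner = ChooseUniformly(EnumerateMatchingSubsets(decList, xNumberToGet, yTotalValue), new Random());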
If your data is likely to have lots of such subsets, then generating them all might be incredibly slow. In that case you might try to identify whole groups of them at a time: if you can determine the size of a group without visiting its members one by one, you can choose which group to use weighted by its size, and then you've reduced the problem to selecting one member of that group at random.
If you don't need to select with uniform probability then the problem might become easier. At the best case, if you don't care about the distribution at all then you can return the first subset-sum solution you find -- whether you'd call that "at random" is another matter...
I'm calculating intersection of 2 sets of sorted numbers in a time-critical part of my application. This calculation is the biggest bottleneck of the whole application so I need to speed it up.
I've tried a bunch of simple options and am currently using this:
foreach (var index in firstSet)
{
if (secondSet.BinarySearch(index) < 0)
continue;
//do stuff
}
Both firstSet and secondSet are of type List.
I've also tried using LINQ:
var intersection = firstSet.Where(t => secondSet.BinarySearch(t) >= 0).ToList();
and then looping through intersection.
But as both of these sets are sorted I feel there's a better way to do it. Note that I can't remove items from sets to make them smaller. Both sets usually consist of about 50 items each.
Please help me guys as I don't have a lot of time to get this thing done. Thanks.
NOTE: I'm doing this about 5.3 million times. So every microsecond counts.
If you have two sets which are both sorted, you can implement a faster intersection than anything provided out of the box with LINQ.
Basically, keep two IEnumerator<T> cursors open, one for each set. At any point, advance whichever has the smaller value. If they match at any point, advance them both, and so on until you reach the end of either iterator.
The nice thing about this is that you only need to iterate over each set once, and you can do it in O(1) memory.
Here's a sample implementation - untested, but it does compile :) It assumes that both of the incoming sequences are duplicate-free and sorted according to the comparer provided (pass in Comparer<T>.Default for the natural ordering):
(There's more text at the end of the answer!)
static IEnumerable<T> IntersectSorted<T>(this IEnumerable<T> sequence1,
IEnumerable<T> sequence2,
IComparer<T> comparer)
{
using (var cursor1 = sequence1.GetEnumerator())
using (var cursor2 = sequence2.GetEnumerator())
{
if (!cursor1.MoveNext() || !cursor2.MoveNext())
{
yield break;
}
var value1 = cursor1.Current;
var value2 = cursor2.Current;
while (true)
{
int comparison = comparer.Compare(value1, value2);
if (comparison < 0)
{
if (!cursor1.MoveNext())
{
yield break;
}
value1 = cursor1.Current;
}
else if (comparison > 0)
{
if (!cursor2.MoveNext())
{
yield break;
}
value2 = cursor2.Current;
}
else
{
yield return value1;
if (!cursor1.MoveNext() || !cursor2.MoveNext())
{
yield break;
}
value1 = cursor1.Current;
value2 = cursor2.Current;
}
}
}
}
EDIT: As noted in comments, in some cases you may have one input which is much larger than the other, in which case you could potentially save a lot of time by using a binary search for each element from the smaller set within the larger set. This requires random access to the larger set, however (it's just a prerequisite of binary search). You can even make it slightly better than a naive binary search by using the match from the previous result to give a lower bound to the binary search. So suppose you were looking for values 1000, 2000 and 3000 in a set with every integer from 0 to 19,999. In the first iteration, you'd need to look across the whole set - your starting lower/upper indexes would be 0 and 19,999 respectively. After you'd found a match at index 1000, however, the next step (where you're looking for 2000) can start with a lower index of 1000. As you progress, the range in which you need to search gradually narrows. Whether this is worth the extra implementation cost is a different matter, however; a sketch follows below.
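A minimal sketch of that narrowing binary search, assuming both inputs are sorted List<int> as in the question (the method name is mine):
// For each element of the (sorted) smaller list, binary-search the larger list,
// using the previous match position to shrink the search window.
static List<int> IntersectSmallIntoLarge(List<int> small, List<int> large)
{
    var result = new List<int>();
    int lowerBound = 0;
    foreach (var value in small)
    {
        int index = large.BinarySearch(lowerBound, large.Count - lowerBound, value, null);
        if (index >= 0)
        {
            result.Add(value);
            lowerBound = index + 1; // later values of 'small' must sit after this match
        }
        else
        {
            lowerBound = ~index;    // insertion point; nothing before it can match later values
        }
    }
    return result;
}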
Since both lists are sorted, you can arrive at the solution by iterating over them at most once (you may also get to skip part of one list, depending on the actual values they contain).
This solution keeps a "pointer" to the part of list we have not yet examined, and compares the first not-examined number of each list between them. If one is smaller than the other, the pointer to the list it belongs to is incremented to point to the next number. If they are equal, the number is added to the intersection result and both pointers are incremented.
var firstCount = firstSet.Count;
var secondCount = secondSet.Count;
int firstIndex = 0, secondIndex = 0;
var intersection = new List<int>();
while (firstIndex < firstCount && secondIndex < secondCount)
{
var comp = firstSet[firstIndex].CompareTo(secondSet[secondIndex]);
if (comp < 0) {
++firstIndex;
}
else if (comp > 0) {
++secondIndex;
}
else {
intersection.Add(firstSet[firstIndex]);
++firstIndex;
++secondIndex;
}
}
The above is a textbook C-style approach of solving this particular problem, and given the simplicity of the code I would be surprised to see a faster solution.
You're using a rather inefficient LINQ method for this sort of task; you should opt for Intersect as a starting point.
var intersection = firstSet.Intersect(secondSet);
Try this. If you measure it for performance and still find it unwieldy, cry for further help (or perhaps follow Jon Skeet's approach).
I was using Jon's approach but needed to execute this intersect hundreds of thousands of times for a bulk operation on very large sets, and needed more performance. The case I was running into was heavily imbalanced sizes of the lists (e.g. 5 and 80,000) and I wanted to avoid iterating the entire large list.
I found that detecting the imbalance and switching to an alternate algorithm gave me huge benefits over specific data sets:
public static IEnumerable<T> IntersectSorted<T>(this List<T> sequence1,
List<T> sequence2,
IComparer<T> comparer)
{
List<T> smallList = null;
List<T> largeList = null;
if (sequence1.Count() < Math.Log(sequence2.Count(), 2))
{
smallList = sequence1;
largeList = sequence2;
}
else if (sequence2.Count() < Math.Log(sequence1.Count(), 2))
{
smallList = sequence2;
largeList = sequence1;
}
if (smallList != null)
{
foreach (var item in smallList)
{
if (largeList.BinarySearch(item, comparer) >= 0)
{
yield return item;
}
}
}
else
{
//Use Jon's method
}
}
I am still unsure about the point at which you break even; I need to do some more testing.
try
firstSet.Intersect(secondSet).ToList()
or
firstSet.Join(secondSet, o => o, id => id, (o, id) => o)
When is it better to use a List vs a LinkedList?
In most cases, List<T> is more useful. LinkedList<T> will have less cost when adding/removing items in the middle of the list, whereas List<T> can only cheaply add/remove at the end of the list.
LinkedList<T> is only at its most efficient if you are accessing sequential data (either forwards or backwards) - random access is relatively expensive since it must walk the chain each time (hence why it doesn't have an indexer). However, because a List<T> is essentially just an array (with a wrapper), random access is fine.
List<T> also offers a lot of support methods - Find, ToArray, etc; however, these are also available for LinkedList<T> with .NET 3.5/C# 3.0 via extension methods - so that is less of a factor.
Thinking of a linked list as a list can be a bit misleading. It's more like a chain. In fact, in .NET, LinkedList<T> does not even implement IList<T>. There is no real concept of index in a linked list, even though it may seem there is. Certainly none of the methods provided on the class accept indexes.
Linked lists may be singly linked, or doubly linked. This refers to whether each element in the chain has a link only to the next one (singly linked) or to both the prior/next elements (doubly linked). LinkedList<T> is doubly linked.
Internally, List<T> is backed by an array. This provides a very compact representation in memory. Conversely, LinkedList<T> involves additional memory to store the bidirectional links between successive elements. So the memory footprint of a LinkedList<T> will generally be larger than for List<T> (with the caveat that List<T> can have unused internal array elements to improve performance during append operations.)
They have different performance characteristics too:
Append
  LinkedList<T>.AddLast(item) - constant time
  List<T>.Add(item) - amortized constant time, linear worst case
Prepend
  LinkedList<T>.AddFirst(item) - constant time
  List<T>.Insert(0, item) - linear time
Insertion
  LinkedList<T>.AddBefore(node, item) - constant time
  LinkedList<T>.AddAfter(node, item) - constant time
  List<T>.Insert(index, item) - linear time
Removal
  LinkedList<T>.Remove(item) - linear time
  LinkedList<T>.Remove(node) - constant time
  List<T>.Remove(item) - linear time
  List<T>.RemoveAt(index) - linear time
Count
  LinkedList<T>.Count - constant time
  List<T>.Count - constant time
Contains
  LinkedList<T>.Contains(item) - linear time
  List<T>.Contains(item) - linear time
Clear
  LinkedList<T>.Clear() - linear time
  List<T>.Clear() - linear time
As you can see, they're mostly equivalent. In practice, the API of LinkedList<T> is more cumbersome to use, and details of its internal needs spill out into your code.
However, if you need to do many insertions/removals from within a list, LinkedList<T> offers constant time (given a node reference), while List<T> takes linear time, as the items after the insertion/removal point must be shuffled around.
Linked lists provide very fast insertion or deletion of a list member. Each member in a linked list contains a pointer to the next member in the list so to insert a member at position i:
update the pointer in member i-1 to point to the new member
set the pointer in the new member to point to member i
The disadvantage of a linked list is that random access is not possible. Accessing a member requires traversing the list until the desired member is found.
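As a small illustration of that node rewiring in C#, assuming you already hold (or have just found) the node at position i-1:
// Inserting after a known node is O(1): only the neighbouring links are rewired.
var list = new LinkedList<int>(new[] { 1, 2, 4, 5 });
LinkedListNode<int> node = list.Find(2); // the member at position i-1 (Find itself walks the list, O(n))
list.AddAfter(node, 3);                  // list is now 1, 2, 3, 4, 5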
Edit
Please read the comments to this answer. People claim I did not do proper tests. I agree this should not be an accepted answer. As I was learning I did some tests and felt like sharing them.
Original answer...
I found interesting results:
// Temporary class to show the example
class Temp
{
public decimal A, B, C, D;
public Temp(decimal a, decimal b, decimal c, decimal d)
{
A = a; B = b; C = c; D = d;
}
}
Linked list (3.9 seconds)
LinkedList<Temp> list = new LinkedList<Temp>();
for (var i = 0; i < 12345678; i++)
{
var a = new Temp(i, i, i, i);
list.AddLast(a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
List (2.4 seconds)
List<Temp> list = new List<Temp>(); // 2.4 seconds
for (var i = 0; i < 12345678; i++)
{
var a = new Temp(i, i, i, i);
list.Add(a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
Even if you only access data, it is essentially much slower!! I would say never use a LinkedList.
Here is another comparison performing a lot of inserts (we plan on inserting an item at the middle of the list)
Linked List (51 seconds)
LinkedList<Temp> list = new LinkedList<Temp>();
for (var i = 0; i < 123456; i++)
{
var a = new Temp(i, i, i, i);
list.AddLast(a);
var curNode = list.First;
for (var k = 0; k < i/2; k++) // In order to insert a node at the middle of the list we need to find it
curNode = curNode.Next;
list.AddAfter(curNode, a); // Insert it after
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
List (7.26 seconds)
List<Temp> list = new List<Temp>();
for (var i = 0; i < 123456; i++)
{
var a = new Temp(i, i, i, i);
list.Insert(i / 2, a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
Linked list, holding a reference to the location where to insert (0.04 seconds)
list.AddLast(new Temp(1,1,1,1));
var referenceNode = list.First;
for (var i = 0; i < 123456; i++)
{
var a = new Temp(i, i, i, i);
list.AddLast(a);
list.AddBefore(referenceNode, a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
So use a linked list only if you plan on inserting several items and you also have a reference to the location where you plan to insert them. Having to insert a lot of items does not by itself make it faster, because searching for the location where you would like to insert takes time.
My previous answer was not accurate enough.
Truthfully, it was horrible :D
But now I can post a much more useful and correct answer.
I did some additional tests. You can find their source at the following link and re-check them in your own environment: https://github.com/ukushu/DataStructuresTestsAndOther.git
Short results:
Array should be used:
As often as possible. It's fast and takes the smallest amount of RAM for the same amount of information.
If you know the exact count of cells needed
If the data saved in the array is < 85,000 bytes (roughly 85,000 / 4 ≈ 21,000 elements for integer data)
If high random-access speed is needed
List should be used:
If you need to add cells to the end of the list (often)
If you need to add cells to the beginning/middle of the list (NOT OFTEN)
If the data saved in the array is < 85,000 bytes (roughly 21,000 elements for integer data)
If high random-access speed is needed
LinkedList should be used:
If you need to add cells to the beginning/middle/end of the list (often)
If you need only sequential access (forward/backward)
If you need to save LARGE items, but the item count is low
Better not to use it for a large number of items, as it uses additional memory for the links.
More details:
Interesting to know:
LinkedList<T> is not a List internally in .NET. It does not even implement IList<T>, which is why indexes and index-related methods are absent.
LinkedList<T> is a node-pointer based collection. In .NET it is a doubly linked implementation, meaning each element links to its previous and next elements. The data is fragmented -- different list nodes can be located in different places in RAM. More memory is also used for LinkedList<T> than for List<T> or an array.
List<T> in .NET is the counterpart of Java's ArrayList<T>: it is an array wrapper. So it is allocated in memory as one contiguous block of data. If the allocated data size exceeds 85,000 bytes, it is moved to the Large Object Heap. Depending on the size, this can lead to heap fragmentation (a mild form of memory leak). But at the same time, if the size is < 85,000 bytes, this provides a very compact, fast-access representation in memory.
A single contiguous block is preferred for random-access performance and memory consumption, but for collections that need to change size regularly, a structure such as an array generally needs to be copied to a new location, whereas a linked list only needs to manage the memory for the newly inserted/deleted nodes.
The difference between List and LinkedList lies in their underlying implementation. List is an array-based collection (ArrayList). LinkedList is a node-pointer based collection (LinkedListNode). At the API level, both of them are pretty much the same, since both implement the same set of interfaces such as ICollection, IEnumerable, etc.
The key difference comes when performance matters. For example, if you are implementing a list that takes heavy "INSERT" traffic, LinkedList outperforms List: LinkedList can do an insert in O(1) time, whereas List may need to expand the size of the underlying array. For more information/detail you might want to read up on the algorithmic difference between linked list and array data structures: http://en.wikipedia.org/wiki/Linked_list and Array
Hope this helps.
The primary advantage of linked lists over arrays is that the links provide us with the capability to rearrange the items efficiently.
Sedgewick, p. 91
A common circumstance to use LinkedList is like this:
Suppose you want to remove many particular strings from a large list of strings, say 100,000 of them. The strings to remove can be looked up in a HashSet dic, and the list of strings is believed to contain between 30,000 and 60,000 such strings to remove.
Then what's the best type of list for storing the 100,000 strings? The answer is LinkedList. If they are stored in an ArrayList, then iterating over it and removing matched strings would take up to billions of operations, while it takes only around 100,000 operations using an iterator and the remove() method.
LinkedList<String> strings = readStrings();
HashSet<String> dic = readDic();
Iterator<String> iterator = strings.iterator();
while (iterator.hasNext()){
String string = iterator.next();
if (dic.contains(string))
iterator.remove();
}
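The snippet above is written with Java idioms (iterator.remove()); a rough C# equivalent of the same single-pass removal, assuming a HashSet<string> dic of strings to delete, could look like this:
// LinkedList<T> version: remove matching nodes in one pass, O(1) per removal.
var node = strings.First;                 // strings is a LinkedList<string>
while (node != null)
{
    var next = node.Next;                 // capture before possibly removing the node
    if (dic.Contains(node.Value))
        strings.Remove(node);             // constant time given the node reference
    node = next;
}

// Note: List<string>.RemoveAll(s => dic.Contains(s)) also does this in a single
// O(n) pass with one compaction, so a plain List<T> is competitive here as well.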
When you need built-in indexed access, sorting (and binary searching after that), and the ToArray() method, you should use List.
Essentially, a List<> in .NET is a wrapper over an array. A LinkedList<> is a linked list. So the question comes down to, what is the difference between an array and a linked list, and when should an array be used instead of a linked list. Probably the two most important factors in your decision of which to use would come down to:
Linked lists have much better insertion/removal performance, so long as the insertions/removals are not on the last element in the collection. This is because an array must shift all remaining elements that come after the insertion/removal point. If the insertion/removal is at the tail end of the list however, this shift is not needed (although the array may need to be resized, if its capacity is exceeded).
Arrays have much better accessing capabilities. Arrays can be indexed into directly (in constant time). Linked lists must be traversed (linear time).
This is adapted from Tono Nam's accepted answer, correcting a few wrong measurements in it.
The test:
static void Main()
{
LinkedListPerformance.AddFirst_List(); // 12028 ms
LinkedListPerformance.AddFirst_LinkedList(); // 33 ms
LinkedListPerformance.AddLast_List(); // 33 ms
LinkedListPerformance.AddLast_LinkedList(); // 32 ms
LinkedListPerformance.Enumerate_List(); // 1.08 ms
LinkedListPerformance.Enumerate_LinkedList(); // 3.4 ms
//I tried below as fun exercise - not very meaningful, see code
//sort of equivalent to insertion when having the reference to middle node
LinkedListPerformance.AddMiddle_List(); // 5724 ms
LinkedListPerformance.AddMiddle_LinkedList1(); // 36 ms
LinkedListPerformance.AddMiddle_LinkedList2(); // 32 ms
LinkedListPerformance.AddMiddle_LinkedList3(); // 454 ms
Environment.Exit(-1);
}
And the code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
namespace stackoverflow
{
static class LinkedListPerformance
{
class Temp
{
public decimal A, B, C, D;
public Temp(decimal a, decimal b, decimal c, decimal d)
{
A = a; B = b; C = c; D = d;
}
}
static readonly int start = 0;
static readonly int end = 123456;
static readonly IEnumerable<Temp> query = Enumerable.Range(start, end - start).Select(temp);
static Temp temp(int i)
{
return new Temp(i, i, i, i);
}
static void StopAndPrint(this Stopwatch watch)
{
watch.Stop();
Console.WriteLine(watch.Elapsed.TotalMilliseconds);
}
public static void AddFirst_List()
{
var list = new List<Temp>();
var watch = Stopwatch.StartNew();
for (var i = start; i < end; i++)
list.Insert(0, temp(i));
watch.StopAndPrint();
}
public static void AddFirst_LinkedList()
{
var list = new LinkedList<Temp>();
var watch = Stopwatch.StartNew();
for (int i = start; i < end; i++)
list.AddFirst(temp(i));
watch.StopAndPrint();
}
public static void AddLast_List()
{
var list = new List<Temp>();
var watch = Stopwatch.StartNew();
for (var i = start; i < end; i++)
list.Add(temp(i));
watch.StopAndPrint();
}
public static void AddLast_LinkedList()
{
var list = new LinkedList<Temp>();
var watch = Stopwatch.StartNew();
for (int i = start; i < end; i++)
list.AddLast(temp(i));
watch.StopAndPrint();
}
public static void Enumerate_List()
{
var list = new List<Temp>(query);
var watch = Stopwatch.StartNew();
foreach (var item in list)
{
}
watch.StopAndPrint();
}
public static void Enumerate_LinkedList()
{
var list = new LinkedList<Temp>(query);
var watch = Stopwatch.StartNew();
foreach (var item in list)
{
}
watch.StopAndPrint();
}
//for the fun of it, I tried to time inserting to the middle of
//linked list - this is by no means a realistic scenario! or may be
//these make sense if you assume you have the reference to middle node
//insertion to the middle of list
public static void AddMiddle_List()
{
var list = new List<Temp>();
var watch = Stopwatch.StartNew();
for (var i = start; i < end; i++)
list.Insert(list.Count / 2, temp(i));
watch.StopAndPrint();
}
//insertion in linked list in such a fashion that
//it has the same effect as inserting into the middle of list
public static void AddMiddle_LinkedList1()
{
var list = new LinkedList<Temp>();
var watch = Stopwatch.StartNew();
LinkedListNode<Temp> evenNode = null, oddNode = null;
for (int i = start; i < end; i++)
{
if (list.Count == 0)
oddNode = evenNode = list.AddLast(temp(i));
else
if (list.Count % 2 == 1)
oddNode = list.AddBefore(evenNode, temp(i));
else
evenNode = list.AddAfter(oddNode, temp(i));
}
watch.StopAndPrint();
}
//another hacky way
public static void AddMiddle_LinkedList2()
{
var list = new LinkedList<Temp>();
var watch = Stopwatch.StartNew();
for (var i = start + 1; i < end; i += 2)
list.AddLast(temp(i));
for (int i = end - 2; i >= 0; i -= 2)
list.AddLast(temp(i));
watch.StopAndPrint();
}
//OP's original more sensible approach, but I tried to filter out
//the intermediate iteration cost in finding the middle node.
public static void AddMiddle_LinkedList3()
{
var list = new LinkedList<Temp>();
var watch = Stopwatch.StartNew();
for (var i = start; i < end; i++)
{
if (list.Count == 0)
list.AddLast(temp(i));
else
{
watch.Stop();
var curNode = list.First;
for (var j = 0; j < list.Count / 2; j++)
curNode = curNode.Next;
watch.Start();
list.AddBefore(curNode, temp(i));
}
}
watch.StopAndPrint();
}
}
}
You can see the results are in accordance with the theoretical performance others have documented here. Quite clear - LinkedList<T> gains big time in the case of insertions. I haven't tested removal from the middle of the list, but the result should be the same. Of course List<T> has other areas where it performs way better, like O(1) random access.
Use LinkedList<> when
You don't know how many objects are coming through the flood gate. For example, Token Stream.
When you ONLY want to delete/insert at the ends.
For everything else, it is better to use List<>.
I agree with most of the points made above. And I also agree that List looks like the more obvious choice in most cases.
But I just want to add that there are many instances where LinkedList is the better choice for efficiency.
Suppose you are traversing the elements and you want to perform lots of insertions/deletions as you go; LinkedList does that in linear O(n) total time, whereas List does it in quadratic O(n²) total time.
Suppose you want to access bigger objects again and again; LinkedList can be more useful there.
Deque (double-ended queue) and queue structures are better implemented using LinkedList.
Increasing the size of a LinkedList is much easier and cheaper once you are dealing with many, bigger objects.
Hope someone finds these comments useful.
In .NET, List is backed by an array. Therefore using a normal List is quite a bit faster than LinkedList in benchmarks like the ones above. That is why people above see the results they see.
Why should you use the List?
I would say it depends. A List<T> starts with a small capacity (4 elements once the first item is added, if you don't specify one). The moment you exceed this limit, it copies the contents to a new array of double the size, leaving the old one in the hands of the garbage collector. In this case, it creates a new array with 8 elements. Imagine having a list with 1 million elements and adding one more: it essentially allocates a whole new array with double the size you need. The new array would have roughly 2 million capacity even though you only needed 1 million and 1, leaving the old array behind in Gen 2 (or on the Large Object Heap for big lists) for the garbage collector, and so on. So it can actually end up being a huge bottleneck. You should be careful about that.
I asked a similar question related to the performance of the LinkedList collection, and discovered that Stephen Cleary's C# implementation of Deque was a solution. Unlike the Queue collection, Deque allows moving items on and off both the front and the back. It is similar to a linked list, but with improved performance.