When should I use a List vs a LinkedList - c#
When is it better to use a List vs a LinkedList?
In most cases, List<T> is more useful. LinkedList<T> will have less cost when adding/removing items in the middle of the list, whereas List<T> can only cheaply add/remove at the end of the list.
LinkedList<T> is only at its most efficient if you are accessing sequential data (either forwards or backwards) - random access is relatively expensive since it must walk the chain each time (which is why it doesn't have an indexer). However, because a List<T> is essentially just an array (with a wrapper), random access is fine.
List<T> also offers a lot of support methods - Find, ToArray, etc; however, these are also available for LinkedList<T> with .NET 3.5/C# 3.0 via extension methods - so that is less of a factor.
Thinking of a linked list as a list can be a bit misleading. It's more like a chain. In fact, in .NET, LinkedList<T> does not even implement IList<T>. There is no real concept of index in a linked list, even though it may seem there is. Certainly none of the methods provided on the class accept indexes.
Linked lists may be singly linked, or doubly linked. This refers to whether each element in the chain has a link only to the next one (singly linked) or to both the prior/next elements (doubly linked). LinkedList<T> is doubly linked.
Internally, List<T> is backed by an array. This provides a very compact representation in memory. Conversely, LinkedList<T> involves additional memory to store the bidirectional links between successive elements. So the memory footprint of a LinkedList<T> will generally be larger than for List<T> (with the caveat that List<T> can have unused internal array elements to improve performance during append operations.)
They have different performance characteristics too:
Append
LinkedList<T>.AddLast(item) constant time
List<T>.Add(item) amortized constant time, linear worst case
Prepend
LinkedList<T>.AddFirst(item) constant time
List<T>.Insert(0, item) linear time
Insertion
LinkedList<T>.AddBefore(node, item) constant time
LinkedList<T>.AddAfter(node, item) constant time
List<T>.Insert(index, item) linear time
Removal
LinkedList<T>.Remove(item) linear time
LinkedList<T>.Remove(node) constant time
List<T>.Remove(item) linear time
List<T>.RemoveAt(index) linear time
Count
LinkedList<T>.Count constant time
List<T>.Count constant time
Contains
LinkedList<T>.Contains(item) linear time
List<T>.Contains(item) linear time
Clear
LinkedList<T>.Clear() linear time
List<T>.Clear() linear time
As you can see, they're mostly equivalent. In practice, the API of LinkedList<T> is more cumbersome to use, and details of its internal needs spill out into your code.
However, if you need to do many insertions/removals from within a list, it offers constant time. List<T> offers linear time, as extra items in the list must be shuffled around after the insertion/removal.
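For illustration, here is a small sketch (with made-up values) of the node-based API next to List<T>.Insert; once you already hold a LinkedListNode<T>, inserting around it is O(1), whereas List<T>.Insert has to shift the tail of the array:

var linked = new LinkedList<int>(new[] { 1, 2, 4 });
LinkedListNode<int> node = linked.Find(4);   // O(n) to locate the node
linked.AddBefore(node, 3);                   // O(1) once you have the node

var list = new List<int> { 1, 2, 4 };
list.Insert(2, 3);                           // O(n): shifts every later element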
Linked lists provide very fast insertion or deletion of a list member. Each member in a linked list contains a pointer to the next member in the list so to insert a member at position i:
update the pointer in member i-1 to point to the new member
set the pointer in the new member to point to member i
The disadvantage to a linked list is that random access is not possible. Accessing a member requires traversing the list until the desired member is found.
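As a rough illustration of those two pointer updates, here is a minimal hand-rolled singly linked node (not the BCL LinkedList<T>, which manages its nodes for you):

// A toy singly linked node type, for illustration only.
class Node<T>
{
    public T Value;
    public Node<T> Next;
    public Node(T value) { Value = value; }
}

// Insert newNode at position i, given 'previous' = member i-1:
static void InsertAfter<T>(Node<T> previous, Node<T> newNode)
{
    newNode.Next = previous.Next;   // set the new member's pointer to member i
    previous.Next = newNode;        // update member i-1 to point to the new member
}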
Edit
Please read the comments to this answer. People claim I did not do proper tests. I agree this should not be an accepted answer. As I was learning, I did some tests and felt like sharing them.
Original answer...
I found interesting results:
// Temporary class to show the example
class Temp
{
public decimal A, B, C, D;
public Temp(decimal a, decimal b, decimal c, decimal d)
{
A = a; B = b; C = c; D = d;
}
}
Linked list (3.9 seconds)
LinkedList<Temp> list = new LinkedList<Temp>();
for (var i = 0; i < 12345678; i++)
{
var a = new Temp(i, i, i, i);
list.AddLast(a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
List (2.4 seconds)
List<Temp> list = new List<Temp>(); // 2.4 seconds
for (var i = 0; i < 12345678; i++)
{
var a = new Temp(i, i, i, i);
list.Add(a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
Even if you only access the data, it is essentially much slower!! I would say never use a LinkedList.
Here is another comparison performing a lot of inserts (we plan on inserting an item at the middle of the list)
Linked List (51 seconds)
LinkedList<Temp> list = new LinkedList<Temp>();
for (var i = 0; i < 123456; i++)
{
var a = new Temp(i, i, i, i);
list.AddLast(a);
var curNode = list.First;
for (var k = 0; k < i/2; k++) // In order to insert a node at the middle of the list we need to find it
curNode = curNode.Next;
list.AddAfter(curNode, a); // Insert it after
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
List (7.26 seconds)
List<Temp> list = new List<Temp>();
for (var i = 0; i < 123456; i++)
{
var a = new Temp(i, i, i, i);
list.Insert(i / 2, a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
Linked list with a reference to the location where to insert (0.04 seconds)
list.AddLast(new Temp(1,1,1,1));
var referenceNode = list.First;
for (var i = 0; i < 123456; i++)
{
var a = new Temp(i, i, i, i);
list.AddLast(a);
list.AddBefore(referenceNode, a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
So use a linked list only if you plan on inserting several items and you also have a reference to the place where you plan to insert them. Having to insert a lot of items does not by itself make it faster, because searching for the location where you would like to insert takes time.
My previous answer was not accurate enough. Truly, it was horrible :D But now I can post a much more useful and correct answer.
I did some additional tests. You can find their source at the following link and re-check them in your own environment: https://github.com/ukushu/DataStructuresTestsAndOther.git
Short results:
Use an array:
As often as possible. It's fast and takes the smallest amount of RAM for the same amount of information.
If you know the exact count of cells needed.
If the data saved in the array is < 85,000 bytes (about 21,250 elements for int data, since an int is 4 bytes).
If you need high random-access speed.
Use a List:
If you need to add cells to the end of the list (often).
If you need to add cells to the beginning/middle of the list (NOT often).
If the data saved in the array is < 85,000 bytes (about 21,250 elements for int data, since an int is 4 bytes).
If you need high random-access speed.
Use a LinkedList:
If you need to add cells at the beginning/middle/end of the list (often).
If you need only sequential access (forward/backward).
If you need to save LARGE items, but the item count is low.
Better not to use it for a large number of items, as it uses additional memory for the links.
More details:
Interesting to know:
Internally, LinkedList<T> in .NET is not a List at all. It does not even implement IList<T>, which is why indexes and index-related methods are absent.
LinkedList<T> is a node-pointer based collection. In .NET it is a doubly linked implementation, meaning the prior/next elements each hold a link to the current element. The data is fragmented: different list nodes can be located in different places in RAM. Also, more memory is used for a LinkedList<T> than for a List<T> or an array.
List<T> in .NET is the equivalent of Java's ArrayList<T>. This means it is an array wrapper, so it is allocated in memory as one contiguous block of data. If the allocated data size exceeds 85,000 bytes, it will be moved to the Large Object Heap. Depending on the size, this can lead to heap fragmentation (a mild form of memory leak). But at the same time, if the size is < 85,000 bytes, this provides a very compact and fast-access representation in memory.
A single contiguous block is preferred for random-access performance and memory consumption, but for collections that need to change size regularly, a structure such as an array generally needs to be copied to a new location, whereas a linked list only needs to manage the memory for the newly inserted/deleted nodes.
The difference between List and LinkedList lies in their underlying implementation. List is an array-based collection (ArrayList). LinkedList is a node-pointer based collection (LinkedListNode). At the API level, both of them are pretty much the same, since both implement the same set of interfaces such as ICollection and IEnumerable.
The key difference comes when performance matters. For example, if you are implementing a list that has heavy "INSERT" operations, LinkedList outperforms List, since LinkedList can do it in O(1) time, whereas List may need to expand the size of the underlying array. For more information/detail you might want to read up on the algorithmic difference between linked list and array data structures: http://en.wikipedia.org/wiki/Linked_list and Array.
Hope this helps.
The primary advantage of linked lists over arrays is that the links provide us with the capability to rearrange the items efficiently.
Sedgewick, p. 91
A common circumstance in which to use LinkedList is this:
Suppose you want to remove many particular strings from a large list of strings, say 100,000 of them. The strings to remove can be looked up in a HashSet dic, and the list is believed to contain between 30,000 and 60,000 such strings to remove.
Then what's the best type of list for storing the 100,000 strings? The answer is LinkedList. If they are stored in an ArrayList, iterating over it and removing matched strings would take up to billions of operations, while it takes just around 100,000 operations using an iterator and the remove() method.
LinkedList<String> strings = readStrings();
HashSet<String> dic = readDic();
Iterator<String> iterator = strings.iterator();
while (iterator.hasNext()){
String string = iterator.next();
if (dic.contains(string))
iterator.remove();
}
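The snippet above uses Java's iterator API. A rough C# equivalent with LinkedList<T> (ReadStrings/ReadDic are assumed helpers) walks the nodes and unlinks matches, which is O(1) per removal once you have the node:

LinkedList<string> strings = ReadStrings(); // assumed helper
HashSet<string> dic = ReadDic();            // assumed helper

var node = strings.First;
while (node != null)
{
    var next = node.Next;          // remember the successor before removing
    if (dic.Contains(node.Value))
        strings.Remove(node);      // O(1) removal given the node
    node = next;
}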
When you need built-in indexed access, sorting (and, after that, binary searching), and the ToArray() method, you should use List.
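For example (values made up), these List<T> conveniences have no direct LinkedList<T> counterpart:

var scores = new List<int> { 42, 7, 19 };
scores.Sort();                         // in-place sort
int idx = scores.BinarySearch(19);     // only meaningful after sorting
int[] asArray = scores.ToArray();
int first = scores[0];                 // built-in indexed access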
Essentially, a List<> in .NET is a wrapper over an array. A LinkedList<> is a linked list. So the question comes down to, what is the difference between an array and a linked list, and when should an array be used instead of a linked list. Probably the two most important factors in your decision of which to use would come down to:
Linked lists have much better insertion/removal performance, so long as the insertions/removals are not on the last element in the collection. This is because an array must shift all remaining elements that come after the insertion/removal point. If the insertion/removal is at the tail end of the list however, this shift is not needed (although the array may need to be resized, if its capacity is exceeded).
Arrays have much better accessing capabilities. Arrays can be indexed into directly (in constant time). Linked lists must be traversed (linear time).
This is adapted from Tono Nam's accepted answer correcting a few wrong measurements in it.
The test:
static void Main()
{
LinkedListPerformance.AddFirst_List(); // 12028 ms
LinkedListPerformance.AddFirst_LinkedList(); // 33 ms
LinkedListPerformance.AddLast_List(); // 33 ms
LinkedListPerformance.AddLast_LinkedList(); // 32 ms
LinkedListPerformance.Enumerate_List(); // 1.08 ms
LinkedListPerformance.Enumerate_LinkedList(); // 3.4 ms
//I tried below as fun exercise - not very meaningful, see code
//sort of equivalent to insertion when having the reference to middle node
LinkedListPerformance.AddMiddle_List(); // 5724 ms
LinkedListPerformance.AddMiddle_LinkedList1(); // 36 ms
LinkedListPerformance.AddMiddle_LinkedList2(); // 32 ms
LinkedListPerformance.AddMiddle_LinkedList3(); // 454 ms
Environment.Exit(-1);
}
And the code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
namespace stackoverflow
{
static class LinkedListPerformance
{
class Temp
{
public decimal A, B, C, D;
public Temp(decimal a, decimal b, decimal c, decimal d)
{
A = a; B = b; C = c; D = d;
}
}
static readonly int start = 0;
static readonly int end = 123456;
static readonly IEnumerable<Temp> query = Enumerable.Range(start, end - start).Select(temp);
static Temp temp(int i)
{
return new Temp(i, i, i, i);
}
static void StopAndPrint(this Stopwatch watch)
{
watch.Stop();
Console.WriteLine(watch.Elapsed.TotalMilliseconds);
}
public static void AddFirst_List()
{
var list = new List<Temp>();
var watch = Stopwatch.StartNew();
for (var i = start; i < end; i++)
list.Insert(0, temp(i));
watch.StopAndPrint();
}
public static void AddFirst_LinkedList()
{
var list = new LinkedList<Temp>();
var watch = Stopwatch.StartNew();
for (int i = start; i < end; i++)
list.AddFirst(temp(i));
watch.StopAndPrint();
}
public static void AddLast_List()
{
var list = new List<Temp>();
var watch = Stopwatch.StartNew();
for (var i = start; i < end; i++)
list.Add(temp(i));
watch.StopAndPrint();
}
public static void AddLast_LinkedList()
{
var list = new LinkedList<Temp>();
var watch = Stopwatch.StartNew();
for (int i = start; i < end; i++)
list.AddLast(temp(i));
watch.StopAndPrint();
}
public static void Enumerate_List()
{
var list = new List<Temp>(query);
var watch = Stopwatch.StartNew();
foreach (var item in list)
{
}
watch.StopAndPrint();
}
public static void Enumerate_LinkedList()
{
var list = new LinkedList<Temp>(query);
var watch = Stopwatch.StartNew();
foreach (var item in list)
{
}
watch.StopAndPrint();
}
//for the fun of it, I tried to time inserting to the middle of
//linked list - this is by no means a realistic scenario! or may be
//these make sense if you assume you have the reference to middle node
//insertion to the middle of list
public static void AddMiddle_List()
{
var list = new List<Temp>();
var watch = Stopwatch.StartNew();
for (var i = start; i < end; i++)
list.Insert(list.Count / 2, temp(i));
watch.StopAndPrint();
}
//insertion in linked list in such a fashion that
//it has the same effect as inserting into the middle of list
public static void AddMiddle_LinkedList1()
{
var list = new LinkedList<Temp>();
var watch = Stopwatch.StartNew();
LinkedListNode<Temp> evenNode = null, oddNode = null;
for (int i = start; i < end; i++)
{
if (list.Count == 0)
oddNode = evenNode = list.AddLast(temp(i));
else
if (list.Count % 2 == 1)
oddNode = list.AddBefore(evenNode, temp(i));
else
evenNode = list.AddAfter(oddNode, temp(i));
}
watch.StopAndPrint();
}
//another hacky way
public static void AddMiddle_LinkedList2()
{
var list = new LinkedList<Temp>();
var watch = Stopwatch.StartNew();
for (var i = start + 1; i < end; i += 2)
list.AddLast(temp(i));
for (int i = end - 2; i >= 0; i -= 2)
list.AddLast(temp(i));
watch.StopAndPrint();
}
//OP's original more sensible approach, but I tried to filter out
//the intermediate iteration cost in finding the middle node.
public static void AddMiddle_LinkedList3()
{
var list = new LinkedList<Temp>();
var watch = Stopwatch.StartNew();
for (var i = start; i < end; i++)
{
if (list.Count == 0)
list.AddLast(temp(i));
else
{
watch.Stop();
var curNode = list.First;
for (var j = 0; j < list.Count / 2; j++)
curNode = curNode.Next;
watch.Start();
list.AddBefore(curNode, temp(i));
}
}
watch.StopAndPrint();
}
}
}
You can see the results are in accordance with the theoretical performance others have documented here. Quite clear: LinkedList<T> gains big time in the case of insertions. I haven't tested removal from the middle of the list, but the result should be the same. Of course, List<T> has other areas where it performs way better, like O(1) random access.
Use LinkedList<> when
You don't know how many objects are coming through the flood gate. For example, Token Stream.
You ONLY want to delete/insert at the ends.
For everything else, it is better to use List<>.
I do agree with most of the points made above. And I also agree that List looks like the more obvious choice in most cases.
But I just want to add that there are many instances where LinkedList is a far better choice than List for better efficiency.
Suppose you are traversing the elements and you want to perform a lot of insertions/deletions; LinkedList does it in linear O(n) time, whereas List does it in quadratic O(n^2) time.
Suppose you want to access bigger objects again and again; LinkedList becomes much more useful.
Deques and queues are better implemented using LinkedList.
Increasing the size of a LinkedList is much easier and cheaper once you are dealing with many, bigger objects.
Hope someone finds these comments useful.
In .NET, Lists are represented as arrays. Therefore, using a normal List is generally much faster in comparison to a LinkedList. That is why people above see the results they see.
Why should you use the List?
I would say it depends. List allocates capacity for 4 elements if you don't specify one. The moment you exceed this limit, it copies everything to a new array, leaving the old one in the hands of the garbage collector, and doubles the size; in this case, it creates a new array with 8 elements. Imagine having a list with 1 million elements and adding 1 more. It will essentially create a whole new array with double the size you need: the new array would have a capacity of 2 million even though you only needed 1 million and 1. Essentially you leave stuff behind in Gen2 for the garbage collector, and so on. So it can actually end up being a huge bottleneck. You should be careful about that.
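A quick way to see the growth behaviour described above (capacity values assume the default growth policy of current .NET runtimes, 0 -> 4 -> 8 -> ...):

var list = new List<int>();
Console.WriteLine(list.Capacity);   // 0
list.Add(1);
Console.WriteLine(list.Capacity);   // 4 (first allocation)
for (int i = 0; i < 4; i++) list.Add(i);
Console.WriteLine(list.Capacity);   // 8 (doubled once 4 was exceeded)

// If the final size is known, reserve it up front to avoid the copies:
var big = new List<int>(1000001);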
I asked a similar question related to the performance of the LinkedList collection, and discovered Steven Cleary's C# implementation of a Deque was a solution. Unlike the Queue collection, Deque allows moving items on/off the front and back. It is similar to a linked list, but with improved performance.
Related
How do you do this in C# without using List?
I am new to C#. The following code was a solution I came up with to solve a challenge. I am unsure how to do this without using List, since my understanding is that you can't push to an array in C# because they are of fixed size. Is my understanding of what I said so far correct? Is there a way to do this that doesn't involve creating a new array every time I need to add to an array? If there is no other way, how would I create a new array when the size of the array is unknown before my loop begins?

Return a sorted array of all non-negative numbers less than the given n which are divisible both by 3 and 4. For n = 30, the output should be threeAndFour(n) = [0, 12, 24].

int[] threeAndFour(int n)
{
    List<int> l = new List<int>() { 0 };
    for (int i = 12; i < n; ++i)
        if (i % 12 == 0)
            l.Add(i);
    return l.ToArray();
}

EDIT: I have since refactored this code to be:

int[] threeAndFour(int n)
{
    List<int> l = new List<int>() { 0 };
    for (int i = 12; i < n; i += 12)
        l.Add(i);
    return l.ToArray();
}
A. Lists are OK

If you want to use a for loop to find the numbers, then List is the appropriate data structure for collecting the numbers as you discover them.

B. Use more maths

static int[] threeAndFour(int n)
{
    var a = new int[(n / 12) + 1];
    for (int i = 12; i < n; i += 12)
        a[i / 12] = i;
    return a;
}

C. Generator pattern with IEnumerable<int>

I know that this doesn't return an array, but it does avoid a list.

static IEnumerable<int> threeAndFour(int n)
{
    yield return 0;
    for (int i = 12; i < n; i += 12)
        yield return i;
}

D. Twist and turn to avoid a list

The code could loop twice: first to figure out the size of the array, and then to fill it.

int[] threeAndFour(int n)
{
    // Version: A list is really undesirable, arrays are great.
    int size = 1;
    for (int i = 12; i < n; i += 12)
        size++;

    var a = new int[size];
    a[0] = 0;
    int counter = 1;
    for (int i = 12; i < n; i += 12)
        a[counter++] = i;
    return a;
}
if (i % 12 == 0)

So you have figured out that the numbers which are divisible by both 3 and 4 are precisely those numbers that are divisible by 12. Can you figure out how many such numbers there are below a given n? Can you do so without counting them? If so, there is no need for a dynamically growing container; you can just initialize the container to the correct size. Once you have your array, just keep track of the next index to fill.
You could use LINQ and the Enumerable.Range method for this purpose. For example:

int[] threeAndFour(int n)
{
    return Enumerable.Range(0, n).Where(x => x % 12 == 0).ToArray();
}

Enumerable.Range generates a sequence of integral numbers within a specified range, which is then filtered on the condition (x % 12 == 0) to retrieve the desired result.
Since you know this goes in steps of 12 and you know how many there are before you start, you can do:

Enumerable.Range(0, n / 12 + 1).Select(x => x * 12).ToArray();
I am unsure how to do this without using List since my understanding is that you can't push to an array in C# since they are of fixed size.

It is correct that arrays can not grow. List was invented as a wrapper around an array that automagically grows whenever needed. Note that you can give List an integer via the constructor, which tells it the minimum size it should expect; it will allocate at least that much the first time, which can limit growth-related overhead. (And dictionaries are just a variation of the list mechanics, with hash-table key search speed.)

There is only one other collection I know of that can grow. However, it is rarely mentioned outside of theory and some very specific cases: linked lists. The linked list has unbeatable growth performance and the lowest risk of running into OutOfMemory exceptions due to fragmentation. Unfortunately, its random access times are the worst as a result. Unless you can process those collections exclusively sequentially from the start (or sometimes the end), their performance will be abysmal. Only stacks and queues are likely to use them. There is, however, still an implementation you could use in .NET: https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.linkedlist-1

Your code holds some potential too:

for (int i = 12; i < n; ++i)
    if (i % 12 == 0)
        l.Add(i);

It would be way more effective to count up by 12 every iteration; you are only interested in every 12th number, after all. You may have to change the loop, but I think a do...while would do. Also, the array/minimum List size is easily predicted: just divide n by 12 and add 1. But I assume that is mostly mock-up code and it is not actually that deterministic.
List generally works pretty well; as I understand your question, you have challenged yourself to solve the problem without using the List class.

An array (or List) uses a contiguous block of memory to store elements. Arrays are of fixed size. List will dynamically expand to accept new elements but still keeps everything in a single block of memory.

You can use a linked list (https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.linkedlist-1?view=netframework-4.8) to produce a simulation of an array. A linked list allocates additional memory for each element (node) that is used to point to the next (and possibly the previous) one. This allows you to add elements without large block allocations, but you pay a space cost (increased memory use) for each element added. The other problem with linked lists is that you can't quickly access random elements: to get to element 5, you have to go through elements 0 through 4. There's a reason arrays and array-like structures are favored for many tasks, but it's always interesting to try to do common things in a different way.
Why property "ElapsedTicks" of List not equal to "ElapsedTicks" of Array?
For example, I have the following code using Stopwatch:

var list = new List<int>();
var array = new ArrayList();
Stopwatch listStopwatch = new Stopwatch(), arrayStopwatch = new Stopwatch();

listStopwatch.Start();
for (int i = 0; i <= 10000; i++)
{
    list.Add(10);
}
listStopwatch.Stop();

arrayStopwatch.Start();
for (int i = 0; i <= 10000; i++)
{
    list.Add(10);
}
arrayStopwatch.Stop();

Console.WriteLine(listStopwatch.ElapsedTicks > arrayStopwatch.ElapsedTicks);

Why are these values not equal?
Different code is expected to produce different timings.

If the second loop adds to the array (as the question implies): the most obvious difference is boxing in ArrayList; each int is stored as a boxed value (created on the heap instead of inline, as it is for List<int>).

If the second loop adds to the list (as the sample actually shows): growing the list requires re-allocating and copying all elements, which may be slower for the second set of elements if that particular range happens to hit more re-allocations (since each copy operation has to copy a lot more elements). Note that on average (as hinted by Adam Houldsworth) re-allocations cost the same (as they happen far less often as the array grows), but one can find a set of numbers where there are extra re-allocations in one of the cases, making one number consistently different from the other. One would need a much higher number of items for the difference to be consistent.
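To make the boxing point concrete, here is a minimal illustration (not from the original question):

var arrayList = new ArrayList();
arrayList.Add(10);        // the int is boxed: one heap object per element

var list = new List<int>();
list.Add(10);             // stored inline in the underlying int[], no boxing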
Is it worth creating a new variable to use an array instead of a list?
I want to have the best performance, and I know that an array is faster than a list, but with an array I need to create a variable for a counter and may even need to use .Count or .Length to find the size, so I thought maybe it is better to just use a list? Below are the examples.

Example 1:

foreach (var item in items)
    ItemCollection.Add(item);

Example 2:

int i = 0;
foreach (var item in items)
{
    ItemCollection[i] = item;
    i++;
}

Example 3:

for (int i = 0; i < items.Count; i++)
    ItemCollection[i] = item;
Example one is your best option, as it appears you are trying to dynamically change the size of your array/list. Example two is just silly, and example 3 would become tricky when you wish to extend the array (see my first point).

A point to note in your third example: in your for loop you have

for (int i = 0; i < items.Count; i++)

This will re-evaluate items.Count every iteration, so you could micro-optimize by moving it out of the for loop:

var length = items.Count;
for (int i = 0; i < length; i++)
Performance of a list is nearly identical to that of an array. If you know the exact number of items that you are planning to add, you can eliminate the potential memory overhead as well by creating a list with the exact number of elements, to avoid re-allocations on Add:

// Reserve the required number of spots in the list
var ItemCollection = new List<ItemType>(items.Count);
foreach (var item in items)
    // Add is not going to cause reallocation,
    // because we reserved enough space ahead of time
    ItemCollection.Add(item);

In most instances, this turns out to be a premature micro-optimization.
Well, you can use foreach on arrays:

int[] bob = new int[] { 0, 1, 2, 3 };
foreach (int i in bob)
{
    Console.WriteLine(i);
}

Anyway, in most cases the difference should be pretty negligible. You also have to realize that foreach doesn't magically iterate through the list; it calls GetEnumerator and then uses that to loop, which also uses some RAM (actually more than just creating int i). I generally use arrays when I know the length is fixed and will remain pretty small; otherwise using Lists is just a lot easier. Also, don't optimize until you know you need to; you're pretty much wasting your time otherwise.
Why is removing by index from an IList performing so much worse than removing by item from an ISet?
Edit: I will add some benchmark results. Up to about 1,000 - 5,000 items in the list, IList and RemoveAt beats ISet and Remove, but that's not something to worry about since the differences are marginal. The real fun begins when the collection size extends to 10,000 and more, so I'm posting only those data.

I was answering a question here last night and faced a bizarre situation. First a set of simple methods:

static Random rnd = new Random();

public static int GetRandomIndex<T>(this ICollection<T> source)
{
    return rnd.Next(source.Count);
}

public static T GetRandom<T>(this IList<T> source)
{
    return source[source.GetRandomIndex()];
}

Let's say I'm removing N number of items from a collection randomly. I would write this function:

public static void RemoveRandomly1<T>(this ISet<T> source, int countToRemove)
{
    int countToRemain = source.Count - countToRemove;
    var inList = source.ToList();
    int i = 0;
    while (source.Count > countToRemain)
    {
        source.Remove(inList.GetRandom());
        i++;
    }
}

or

public static void RemoveRandomly2<T>(this IList<T> source, int countToRemove)
{
    int countToRemain = source.Count - countToRemove;
    int j = 0;
    while (source.Count > countToRemain)
    {
        source.RemoveAt(source.GetRandomIndex());
        j++;
    }
}

As you can see, the first function is written for an ISet and the second for a normal IList. In the first function I'm removing by item from the ISet, and in the second by index from the IList, both of which I believe are O(1). Why is the second function performing so much worse than the first, especially when the lists get bigger?

Odds (my take):

1) In the first function the ISet is converted to an IList (to get the random item from the IList), whereas there is no such thing performed in the second function. Advantage IList.

2) In the first function a call to GetRandom is made, whereas in the second a call to GetRandomIndex is made; that's one step less again. Though trivial, advantage IList.

3) In the first function, the random item is taken from a separate list, so the obtained item might already have been removed from the ISet. This leads to more iterations of the while loop in the first function. In the second function, the random index is taken from the source that is being iterated on, hence there are never repeated iterations. I have tested this and verified that i > j always. Advantage IList.

I thought the reason for this behaviour was that a List needs constant resizing when items are added or removed. But apparently not, according to some other testing. I ran:

public static void Remove1(this ISet<int> set)
{
    int count = set.Count;
    for (int i = 0; i < count; i++)
    {
        set.Remove(i + 1);
    }
}

public static void Remove2(this IList<int> lst)
{
    for (int i = lst.Count - 1; i >= 0; i--)
    {
        lst.RemoveAt(i);
    }
}

and found that the second function runs faster.

Test bed:

var f = Enumerable.Range(1, 100000);
var s = new HashSet<int>(f);
var l = new List<int>(f);

Benchmark(() =>
{
    //some examples...
    s.RemoveRandomly1(2500);
    l.RemoveRandomly2(2500);

    s.Remove1();
    l.Remove2();
}, 1);

public static void Benchmark(Action method, int iterations = 10000)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < iterations; i++)
        method();
    sw.Stop();
    MsgBox.ShowDialog(sw.Elapsed.TotalMilliseconds.ToString());
}

Just trying to understand what's going on with the two structures. Thanks.
Result:

var f = Enumerable.Range(1, 10000);
s.RemoveRandomly1(7500); => 5 ms
l.RemoveRandomly2(7500); => 20 ms

var f = Enumerable.Range(1, 100000);
s.RemoveRandomly1(7500); => 7 ms
l.RemoveRandomly2(7500); => 275 ms

var f = Enumerable.Range(1, 1000000);
s.RemoveRandomly1(75000); => 50 ms
l.RemoveRandomly2(75000); => 925000 ms

For most typical needs a list would do, though!
First off, IList and ISet aren't implementations of anything. I can write an IList or an ISet implementation that will run very differently, so the concrete implementations are what is important (List and HashSet in your case).

Accessing a List item by index is O(1), but removing by RemoveAt is O(n). Removing from the end of a List is fast because it doesn't have to copy anything; it just decrements the internal counter that stores how many items it has, so the unused slots at the end of the underlying array appear as if they aren't there. Once you hit the maximum capacity of the underlying array, it creates a new array of double the size and copies the elements over.

Randomly removing from a list means that it will have to copy all the array entries that come after the index so that they slide down one spot, which is inherently pretty slow, particularly as the size of the list gets bigger. If you have a List with 1 million entries and you remove something at index 500,000, it has to copy the second half of the array down a spot.
How to avoid OrderBy - memory usage problems
Let's assume we have a large list of points, List<Point> pointList (already stored in memory), where each Point contains X, Y, and Z coordinates. Now, I would like to select, for example, the N% of points with the biggest Z-values of all points stored in pointList. Right now I'm doing it like this:

N = 0.05; // selecting only 5% of points
double cutoffValue = pointList
    .OrderBy(p => p.Z) // First bottleneck - creates sorted copy of all data
    .ElementAt((int)(pointList.Count * (1 - N))).Z;
List<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue).ToList();

But I have two memory usage bottlenecks here: first during OrderBy (more important) and second during selecting the points (this is less important, because we usually want to select only a small number of points).

Is there any way of replacing OrderBy (or maybe another way of finding this cutoff point) with something that uses less memory?

The problem is quite important, because LINQ copies the whole dataset, and for the big files I'm processing it sometimes hits a few hundred MBs.
Write a method that iterates through the list once and maintains a set of the M largest elements. Each step will only require O(log M) work to maintain the set, and you can have O(M) memory and O(N log M) running time.

public static IEnumerable<TSource> TakeLargest<TSource, TKey>(
    this IEnumerable<TSource> items, Func<TSource, TKey> selector, int count)
{
    var set = new SortedDictionary<TKey, List<TSource>>();
    var resultCount = 0;
    var first = default(KeyValuePair<TKey, List<TSource>>);
    foreach (var item in items)
    {
        // If the key is already smaller than the smallest
        // item in the set, we can ignore this item
        var key = selector(item);
        if (first.Value == null ||
            resultCount < count ||
            Comparer<TKey>.Default.Compare(key, first.Key) >= 0)
        {
            // Add next item to set
            if (!set.ContainsKey(key))
            {
                set[key] = new List<TSource>();
            }
            set[key].Add(item);
            if (first.Value == null)
            {
                first = set.First();
            }

            // Remove smallest item from set
            resultCount++;
            if (resultCount - first.Value.Count >= count)
            {
                set.Remove(first.Key);
                resultCount -= first.Value.Count;
                first = set.First();
            }
        }
    }
    return set.Values.SelectMany(values => values);
}

That will include more than count elements if there are ties, as your implementation does now.
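For the question's scenario, usage of this extension method might look like this (assuming N = 0.05 as in the question):

int countToKeep = (int)(pointList.Count * N);
List<Point> selectedPoints = pointList.TakeLargest(p => p.Z, countToKeep).ToList();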
You could sort the list in place, using List<T>.Sort, which uses the Quicksort algorithm. But of course, your original list would be sorted, which is perhaps not what you want...

pointList.Sort((a, b) => b.Z.CompareTo(a.Z));
var selectedPoints = pointList.Take((int)(pointList.Count * N)).ToList();

If you don't mind the original list being sorted, this is probably the best balance between memory usage and speed.
You can use Indexed LINQ to put index on the data which you are processing. This can result in noticeable improvements in some cases.
If you combine the two, there is a chance a little less work will be done:

List<Point> selectedPoints = pointList
    .OrderByDescending(p => p.Z) // First bottleneck - creates sorted copy of all data
    .Take((int)(pointList.Count * N))
    .ToList();

But basically this kind of ranking requires sorting, your biggest cost.

A few more ideas:

if you use a class Point (instead of a struct Point) there will be much less copying.
you could write a custom sort that only bothers to move the top 5% up. Something like (don't laugh) BubbleSort.
If your list is in memory already, I would sort it in place instead of making a copy - unless you need it un-sorted again, that is, in which case you'll have to weigh having two copies in memory vs loading it again from storage:

pointList.Sort((x, y) => y.Z.CompareTo(x.Z)); // this should sort it in desc. order

Also, not sure how much it will help, but it looks like you're going through your list twice - once to find the cutoff value, and once again to select the points. I assume you're doing that because you want to let all ties through, even if it means selecting more than 5% of the points. However, since they're already sorted, you can use that to your advantage and stop when you're finished:

double cutoffValue = pointList[(int)(pointList.Count * N)].Z;
List<Point> selectedPoints = pointList.TakeWhile(p => p.Z >= cutoffValue)
                                      .ToList();
Unless your list is extremely large, it's much more likely to me that CPU time is your performance bottleneck. Yes, your OrderBy() might use a lot of memory, but it's generally memory that for the most part is otherwise sitting idle. The CPU time really is the bigger concern.

To improve CPU time, the most obvious thing here is to not use a list. Use an IEnumerable instead. You do this by simply not calling .ToList() at the end of your Where query. This will allow the framework to combine everything into one iteration of the list that runs only as needed. It will also improve your memory use because it avoids loading the entire query into memory at once, and instead defers it to only load one item at a time as needed. Also, use .Take() rather than .ElementAt(). It's a lot more efficient:

double N = 0.05; // selecting only 5% of points
int count = (int)(N * pointList.Count);
var selectedPoints = pointList.OrderByDescending(p => p.Z).Take(count);

That out of the way, there are three cases where memory use might actually be a problem:

Your collection really is so large as to fill up memory. For a simple Point structure on a modern system we're talking millions of items. This is really unlikely. On the off chance you have a system this large, your solution is to use a relational database, which can keep these items on disk relatively efficiently.

You have a moderate size collection, but there are external performance constraints, such as needing to share system resources with many other processes, as you might find in an asp.net web site. In this case, the answer is either to 1) again put the points in a relational database or 2) offload the work to the client machines.

Your collection is just large enough to end up on the Large Object Heap, and the buffer used by the OrderBy() call is also placed on the LOH. Now what happens is that the garbage collector will not properly compact memory after your OrderBy() call, and over time you get a lot of memory that is not used but still reserved by your program. In this case, the solution is, unfortunately, to break your collection up into multiple groups that are each individually small enough not to trigger use of the LOH.

Update: Reading through your question again, I see you're reading very large files. In that case, the best performance can be obtained by writing your own code to parse the files. If the count of items is stored near the top of the file you can do much better, or even if you can estimate the number of records based on the size of the file (guess a little high to be sure, and then truncate any extras after finishing), you can then build your final collection as you read. This will greatly improve CPU performance and memory use.
I'd do it by implementing "half" a quicksort.

Consider your original set of points, P, where you are looking for the "top" N items by Z coordinate.

Choose a pivot x in P.
Partition P into L = {y in P | y < x} and U = {y in P | x <= y}.
If N = |U| then you're done.
If N < |U| then recurse with P := U.
Otherwise you need to add some items to U: recurse with N := N - |U|, P := L to add the remaining items.

If you choose your pivot wisely (e.g., median of, say, five random samples) then this will run in O(n log n) time.

Hmmmm, thinking some more, you may be able to avoid creating new sets altogether, since essentially you're just looking for an O(n log n) way of finding the Nth greatest item from the original set. Yes, I think this would work, so here's suggestion number 2:

Make a traversal of P, finding the least and greatest items, A and Z, respectively.
Let M be the mean of A and Z (remember, we're only considering Z coordinates here).
Count how many items there are in the range [M, Z]; call this Q.
If Q < N then the Nth greatest item in P is somewhere in [A, M). Try M := (A + M)/2.
If N < Q then the Nth greatest item in P is somewhere in [M, Z]. Try M := (M + Z)/2.
Repeat until we find an M such that Q = N.
Now traverse P, removing all items greater than or equal to M.

That's definitely O(n log n) and creates no extra data structures (except for the result). Howzat?
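A minimal sketch of the "half quicksort" (quickselect) idea above, assuming a Point type with a Z member; it partially reorders pointList in place and never builds a sorted copy:

// needs: using System; using System.Collections.Generic;
static double NthLargestZ(List<Point> points, int n)
{
    int lo = 0, hi = points.Count - 1;
    int target = n - 1;                       // index of the Nth largest in descending order
    var rnd = new Random();
    while (true)
    {
        // Pick a random pivot value and partition: larger Z values move to the left.
        double pivot = points[rnd.Next(lo, hi + 1)].Z;
        int i = lo, j = hi;
        while (i <= j)
        {
            while (points[i].Z > pivot) i++;
            while (points[j].Z < pivot) j--;
            if (i <= j)
            {
                (points[i], points[j]) = (points[j], points[i]);
                i++; j--;
            }
        }
        if (target <= j) hi = j;              // Nth largest lies in the left part
        else if (target >= i) lo = i;         // Nth largest lies in the right part
        else return points[target].Z;         // it equals the pivot value
    }
}

// Usage for the question (N = 0.05):
//   double cutoff = NthLargestZ(pointList, Math.Max(1, (int)(pointList.Count * N)));
//   var selectedPoints = pointList.Where(p => p.Z >= cutoff).ToList();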
You might use something like this:

pointList.Sort(); // Use your own comparer here if needed

// Skip OrderBy because the list is sorted (and not copied)
double cutoffValue = pointList.ElementAt((int)(pointList.Count * (1 - N))).Z;

// Skip ToList to avoid another copy of the list
IEnumerable<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue);
If you want a small percentage of points ordered by some criterion, you'll be better served using a Priority queue data structure; create a size-limited queue(with the size set to however many elements you want), and then just scan through the list inserting every element. After the scan, you can pull out your results in sorted order. This has the benefit of being O(n log p) instead of O(n log n) where p is the number of points you want, and the extra storage cost is also dependent on your output size instead of the whole list.
int resultSize = (int)(pointList.Count * N);
FixedSizedPriorityQueue<Point> q = new FixedSizedPriorityQueue<Point>(resultSize, p => p.Z);
q.AddEach(pointList);
List<Point> selectedPoints = q.ToList();

Now all you have to do is implement a FixedSizedPriorityQueue that adds elements one at a time and discards the smallest element when it is full.
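FixedSizedPriorityQueue is not a built-in type. A minimal sketch of the idea, assuming .NET 6+ where System.Collections.Generic.PriorityQueue<TElement, TPriority> is available, keeps a min-heap of size limit and evicts the smallest priority whenever it overflows:

// needs: using System; using System.Collections.Generic;
class FixedSizedPriorityQueue<T>
{
    private readonly PriorityQueue<T, double> heap = new PriorityQueue<T, double>();
    private readonly int limit;
    private readonly Func<T, double> priority;

    public FixedSizedPriorityQueue(int limit, Func<T, double> priority)
    {
        this.limit = limit;
        this.priority = priority;
    }

    public void AddEach(IEnumerable<T> items)
    {
        foreach (var item in items)
        {
            heap.Enqueue(item, priority(item));
            if (heap.Count > limit)
                heap.Dequeue();          // drop the element with the smallest priority
        }
    }

    public List<T> ToList()
    {
        var result = new List<T>(heap.Count);
        while (heap.Count > 0)
            result.Add(heap.Dequeue()); // comes out in ascending priority order
        result.Reverse();               // largest first
        return result;
    }
}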
You wrote that you are working with a DataSet. If so, you can use a DataView to sort your data once and use it for all future accesses of the rows. I just tried it with 50,000 rows and 100 times accessing 30% of them. My performance results are:

Sort with LINQ: 5.3 seconds
Use DataViews: 0.01 seconds

Give it a try.

[TestClass]
public class UnitTest1
{
    class MyTable : TypedTableBase<MyRow>
    {
        public MyTable()
        {
            Columns.Add("Col1", typeof(int));
            Columns.Add("Col2", typeof(int));
        }

        protected override DataRow NewRowFromBuilder(DataRowBuilder builder)
        {
            return new MyRow(builder);
        }
    }

    class MyRow : DataRow
    {
        public MyRow(DataRowBuilder builder) : base(builder) { }

        public int Col1 { get { return (int)this["Col1"]; } }
        public int Col2 { get { return (int)this["Col2"]; } }
    }

    DataView _viewCol1Asc;
    DataView _viewCol2Desc;
    MyTable _table;
    int _countToTake;

    [TestMethod]
    public void MyTestMethod()
    {
        _table = new MyTable();

        int count = 50000;
        for (int i = 0; i < count; i++)
        {
            _table.Rows.Add(i, i);
        }
        _countToTake = _table.Rows.Count / 30;

        Console.WriteLine("SortWithLinq");
        RunTest(SortWithLinq);
        Console.WriteLine("Use DataViews");
        RunTest(UseSortedDataViews);
    }

    private void RunTest(Action method)
    {
        int iterations = 100;
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            method();
        }
        watch.Stop();
        Console.WriteLine("    {0}", watch.Elapsed);
    }

    private void UseSortedDataViews()
    {
        if (_viewCol1Asc == null)
        {
            _viewCol1Asc = new DataView(_table, null, "Col1 ASC", DataViewRowState.Unchanged);
            _viewCol2Desc = new DataView(_table, null, "Col2 DESC", DataViewRowState.Unchanged);
        }

        var rows = _viewCol1Asc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
        IterateRows(rows);

        rows = _viewCol2Desc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
        IterateRows(rows);
    }

    private void SortWithLinq()
    {
        var rows = _table.OrderBy(row => row.Col1).Take(_countToTake);
        IterateRows(rows);

        rows = _table.OrderByDescending(row => row.Col2).Take(_countToTake);
        IterateRows(rows);
    }

    private void IterateRows(IEnumerable<MyRow> rows)
    {
        foreach (var row in rows)
            if (row == null)
                throw new Exception("????");
    }
}