Faster way to do a List<T>.Contains() - c#

I am trying to do what I think is a "de-intersect" (I'm not sure what the proper name is, but that's what Tim Sweeney of EpicGames called it in the old UnrealEd)
// foo and bar have some identical elements (given a case-insensitive match)
List<string> foo = GetFoo();
List<string> bar = GetBar();
// remove non matches
foo = foo.Where(x => bar.Contains(x, StringComparer.InvariantCultureIgnoreCase)).ToList();
bar = bar.Where(x => foo.Contains(x, StringComparer.InvariantCultureIgnoreCase)).ToList();
Then later on, I do another thing where I subtract the result from the original, to see which elements I removed. That's super-fast using .Except(), so no troubles there.
There must be a faster way to do this, because this one is pretty bad-performing with ~30,000 elements (of string) in either List. Preferably, a method to do this step and the one later on in one fell swoop would be nice. I tried using .Exists() instead of .Contains(), but it's slightly slower. I feel a bit thick, but I think it should be possible with some combination of .Except() and .Intersect() and/or .Union().

This operation can be called a symmetric difference.
You need a different data structure, like a hash table. Add the intersection of both sets to it, then difference the intersection from each set.
UPDATE:
I got a bit of time to try this in code. I used HashSet<T> with a set of 50,000 strings, from 2 to 10 characters long with the following results:
Original: 79499 ms
Hashset: 33 ms
BTW, there is a method on HashSet called SymmetricExceptWith which I thought would do the work for me, but it actually adds the different elements from both sets to the set the method is called on. Maybe this is what you want, rather than leaving the initial two sets unmodified, and the code would be more elegant.
Here is the code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
class Program
{
static void Main(string[] args)
{
// foo and bar have some identical elements (given a case-insensitive match)
var foo = getRandomStrings();
var bar = getRandomStrings();
var timer = new Stopwatch();
timer.Start();
// remove non matches
var f = foo.Where(x => !bar.Contains(x)).ToList();
var b = bar.Where(x => !foo.Contains(x)).ToList();
timer.Stop();
Debug.WriteLine(String.Format("Original: {0} ms", timer.ElapsedMilliseconds));
timer.Reset();
timer.Start();
var intersect = new HashSet<String>(foo);
intersect.IntersectWith(bar);
var fSet = new HashSet<String>(foo);
var bSet = new HashSet<String>(bar);
fSet.ExceptWith(intersect);
bSet.ExceptWith(intersect);
timer.Stop();
var fCheck = new HashSet<String>(f);
var bCheck = new HashSet<String>(b);
Debug.WriteLine(String.Format("Hashset: {0} ms", timer.ElapsedMilliseconds));
Console.WriteLine("Sets equal? {0} {1}", fSet.SetEquals(fCheck), bSet.SetEquals(bCheck));
Console.ReadKey();
}
static Random _rnd = new Random();
private const int Count = 50000;
private static List<string> getRandomStrings()
{
var strings = new List<String>(Count);
var chars = new Char[10];
for (var i = 0; i < Count; i++)
{
var len = _rnd.Next(2, 11); // upper bound is exclusive, so lengths are 2..10
for (var j = 0; j < len; j++)
{
var c = (Char)_rnd.Next('a', 'z' + 1); // upper bound is exclusive, so add 1 to include 'z'
chars[j] = c;
}
strings.Add(new String(chars, 0, len));
}
return strings;
}
}
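The benchmark above compares strings case-sensitively, while the question asks for a case-insensitive match and for the removed elements as well. A minimal sketch of how the same hash-set idea might cover both in one pass is shown below; the class and method names are made up for illustration, and it assumes the comparer from the question (StringComparer.InvariantCultureIgnoreCase) is the one you want.

```csharp
using System;
using System.Collections.Generic;

static class CaseInsensitiveMatch
{
    // Splits 'items' into the elements that also occur in 'other' (ignoring case)
    // and the elements that do not. Hypothetical helper, not part of the original answer.
    public static void Split(List<string> items, IEnumerable<string> other,
                             out List<string> kept, out List<string> removed)
    {
        // Building the set is O(m); each lookup afterwards is O(1) on average.
        var lookup = new HashSet<string>(other, StringComparer.InvariantCultureIgnoreCase);
        kept = new List<string>();
        removed = new List<string>();
        foreach (string item in items)
        {
            if (lookup.Contains(item)) kept.Add(item);
            else removed.Add(item);
        }
    }
}
```

Calling Split(foo, bar, ...) and then Split(bar, foo, ...) gives the filtered lists and the removed elements for both sides, without a separate Except step.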

With intersect it would be done like this:
var matches = foo.Intersect(bar, StringComparer.InvariantCultureIgnoreCase);

If the elements are unique within each list, you should consider using a HashSet:
"The HashSet(T) class provides high performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order."

With a sorted list, you can use binary search.
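As a rough sketch of the binary-search idea (the helper names here are hypothetical): sort a copy of one list once, then test membership with List<T>.BinarySearch. The list must be sorted with the same comparer that is used for searching; StringComparer.OrdinalIgnoreCase is assumed here.

```csharp
using System;
using System.Collections.Generic;

static class SortedLookup
{
    // Returns a copy of 'items' sorted with the comparer used for later searches.
    public static List<string> BuildIndex(IEnumerable<string> items)
    {
        var sorted = new List<string>(items);
        sorted.Sort(StringComparer.OrdinalIgnoreCase);
        return sorted;
    }

    // O(log N) membership test; 'sorted' must come from BuildIndex.
    public static bool Contains(List<string> sorted, string value)
    {
        // BinarySearch returns a non-negative index when the value is found.
        return sorted.BinarySearch(value, StringComparer.OrdinalIgnoreCase) >= 0;
    }
}
```

Sorting costs O(N log N) up front, so this only pays off when many lookups follow.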

Contains on a list is an O(N) operation. If you had a different data structure, such as a sorted list or a Dictionary, you would dramatically reduce your time. Accessing a key in a sorted list is usually O(log N) time, and in a hash is usually O(1) time.

Related

c# HashSet init takes too long

I need to initialize a HashSet with a set of elements, but without any kind of comparison class.
After the initialization, any element added to the HashSet needs to be passed through a comparer.
How can I accomplish this?
Now I have this:
HashSet<Keyword> set = new HashSet<Keyword>(new KeyWordComparer());
The problem is that the initialization takes too long, and there is no need to apply the comparison during it.
KeywordComparer Class:
class KeyWordComparer : EqualityComparer<Keyword>
{
public override bool Equals(Keyword k1, Keyword k2)
{
int equals = 0;
int i = 0;
int j = 0;
// based on sorted ids
while (i < k1._lista_modelos.Count && j < k2._lista_modelos.Count)
{
if (k1._lista_modelos[i] < k2._lista_modelos[j])
{
i++;
}
else if (k1._lista_modelos[i] > k2._lista_modelos[j])
{
j++;
}
else
{
equals++;
i++;
j++;
}
}
return equals >= 8;
}
public override int GetHashCode(Keyword keyword)
{
return 0;//notice that using the same hash for all keywords gives you an O(n^2) time complexity though.
}
}
Note: This is a follow-up question to c# comparing list of IDs.
Every keyword has 20 IDs, so when I want to add a new Keyword to the HashSet, the KeywordComparer checks that the new one does not have more than 8 IDs in common with any keyword already in the HashSet. If it does, the new keyword is not included; otherwise, it is included.
Collecting these keywords is not a job for a hash set here. A hash set is generally not suited for items which depend on other elements of the set. You should only use it for things where a useful hash can be calculated for every item. Since it depends on the existing set of items whether a new item gets added to your set, this is totally the wrong tool.
Here’s an attempt to solve this problem according to your short description of what you actually want to do. Here, we are simply collecting the keywords in a list. In order to verify that they may be added, we use an additional hash set to collect the ids of the keywords. That way, we can quickly check, for a new item, whether 8 or more of its ids are already contained within the list of keywords.
var keywords = new List<Keyword>();
var selectedIds = new HashSet<int>(); // I’m assuming that the ids are ints here
foreach (var keyword in GetListOfAllKeywords())
{
// count the number of keyword ids that are already in the selectedIds set
var duplicateIdCount = keyword.Ids.Count(id => selectedIds.Contains(id));
if (duplicateIdCount <= 8)
{
// less or equal to 8 ids are already selected, so add this keyword
keywords.Add(keyword);
// and collect all the keyword’s ids
selectedIds.UnionWith(keyword.Ids); // HashSet<T> has no AddRange; UnionWith adds all ids
}
}
Setting aside whether a HashSet is the right type for the job at hand, or whether your comparer even makes sense, implementing a proper GetHashCode does seem to make a huge difference.
Here is an example implementation, based on an answer from Marc Gravell:
class KeyWordComparer : EqualityComparer<Keyword>
{
// omitted your Equals implementation for brevity
public override int GetHashCode(Keyword keyword)
{
//return 0; // this was the original
// Marc Gravell https://stackoverflow.com/a/371348/578411
int hash = 13;
// not sure what is up with the only 8 ID's but I take that as a given
for(var i=0; i < Math.Min(keyword._lista_modelos.Count, 8) ; i++)
{
hash = (hash * 7) + keyword._lista_modelos[i].GetHashCode();
}
return hash;
}
}
When I run this in LinqPad with this test rig
Random randNum = new Random();
var kc = new KeyWordComparer();
HashSet<Keyword> set = new HashSet<Keyword>(kc);
var sw = new Stopwatch();
sw.Start();
for(int i =0 ; i< 10000; i++)
{
set.Add(new Keyword(Enumerable
.Repeat(0, randNum.Next(1,10))
.Select(ir => randNum.Next(1, 256)).ToList()));
}
sw.Stop();
sw.ElapsedMilliseconds.Dump("ms");
this is what I measure:
7 ms for 10,000 items
If I switch back to your return 0; implementation for GetHashCode, I measure
4754 ms for 10,000 items
If I increase the testloop to insert 100,000 items the better GetHashCode still completes in 224 ms on my box. I didn't wait for your implementation to finish.
So, if anything, implement a proper GetHashCode method.

faster algorithm or technique for building sparse arrays in c#

I have a matrix-building problem. To build the matrix (for a 3rd party package), I need to do it row-by-row by passing a double[] array to the 3rd-party object. Here's my problem: I have a list of objects that represent paths on a graph. Each object is a path with a 'source' property (string) and a 'destination' property (also string). I need to build a 1-dimensional array where all the elements are 0 except where the source property is equal to a given name. The given name will occur multiple times in the path list. Here's my function for building the sparse array:
static double[] GetNodeSrcRow3(string nodeName)
{
double[] r = new double[cpaths.Count ];
for (int i = 1; i < cpaths.Count; i++)
{
if (cpaths[i].src == nodeName) r[i] = 1;
}
return r;
}
Now I need to call this function about 200k times with different names. The function itself takes between 0.05 and 0.1 seconds (timed with Stopwatch). As you can imagine, if we take the best possible case of 0.05 seconds * 200k calls = 10,000 seconds = 2.7 hours which is too long. The object 'cpaths' contains about 200k objects.
Can someone think of a way to accomplish this in a faster way?
I can't see the rest of your code, but I suspect most of the time is spent allocating and garbage collecting all the arrays. Assuming the size of cpaths doesn't change, you can reuse the same array.
private static double[] NodeSourceRow = null;
private static List<int> LastSetIndices = new List<int>();
static double[] GetNodeSrcRow3(string nodeName) {
// create new array *only* on the first call
NodeSourceRow = NodeSourceRow ?? new double[cpaths.Count];
// reset all elements to 0
foreach(int i in LastSetIndices) NodeSourceRow[i] = 0;
LastSetIndices.Clear();
// set the 1s
for (int i = 1; i < cpaths.Count; i++) {
if (cpaths[i].src == nodeName) {
NodeSourceRow[i] = 1;
LastSetIndices.Add(i);
}
}
// tada!!
return NodeSourceRow;
}
One potential drawback: if you need several of the arrays to be in use at the same time, they will always have identical contents, because they are all the same array. But if you only use one at a time, this should be much faster.
If cpaths is a normal list, then it is not suitable for your case. You need a dictionary from src to a list of indexes, like Dictionary<string, List<int>>.
Then you can fill the sparse array with random access. I would also suggest using a sparse list implementation for efficient memory usage, rather than a memory-inefficient double[]. A good implementation is SparseAList (written by David Piepgrass).
Before generating your sparse lists, you should convert your cpaths list into a suitable dictionary. This step may take a little while (up to a few seconds), but after that you will generate your sparse lists very quickly.
public static Dictionary<string, List<int>> _dictionary;
public static void CacheIndexes()
{
_dictionary = cpaths.Select((x, i) => new { index = i, value = x })
.GroupBy(x => x.value.src)
.ToDictionary(x => x.Key, x => x.Select(a => a.index).ToList());
}
You should call CacheIndexes before starting to generate your sparse arrays.
public static double[] GetNodeSrcRow3(string nodeName)
{
double[] r = new double[cpaths.Count];
List<int> indexes;
if(!_dictionary.TryGetValue(nodeName, out indexes)) return r;
foreach(var index in indexes) r[index] = 1;
return r;
}
Note that if you use SparseAList, it will occupy very little space. For example, if the double array has length 10K and only one index is set in it, with SparseAList you will virtually have 10K items, but in fact only one item is stored in memory. It's not hard to use that collection; I suggest you give it a try.
The same code using SparseAList:
public static SparseAList<double> GetNodeSrcRow3(string nodeName)
{
SparseAList<double> r = new SparseAList<double>();
r.InsertSpace(0, cpaths.Count); // allocates zero memory.
List<int> indexes;
if(!_dictionary.TryGetValue(nodeName, out indexes)) return r;
foreach(var index in indexes) r[index] = 1;
return r;
}
You could make use of multi-threading using the TPL's Parallel.For method.
static double[] GetNodeSrcRow3(string nodeName)
{
double[] r = new double[cpaths.Count];
Parallel.For(1, cpaths.Count, (i, state) =>
{
if (cpaths[i].src == nodeName) r[i] = 1;
});
return r;
}
Fantastic Answers!
If I may add some, to the already great examples:
System.Numerics.Tensors.SparseTensor<double> GetNodeSrcRow3(string nodeName)
{
// A quick NuGet System.Numerics.Tensors Install:
System.Numerics.Tensors.SparseTensor<double> SparseTensor = new System.Numerics.Tensors.SparseTensor<double>(new int[] { cpaths.Count }, true, 1);
Parallel.For(1, cpaths.Count, (i, state) =>
{
if (cpaths[i].src == nodeName) SparseTensor[i] = 1.0D;
});
return SparseTensor;
}
System.Numerics is heavily optimised and can use hardware acceleration. It is reportedly also thread-safe, but that is only from what I have read about it, so verify this before writing to the tensor from Parallel.For.
For speed and scalability, a small bit of code can make all the difference.

Median Maintenance Algorithm - Same implementation yields different results depending on Int32 or Int64

I found something interesting while doing a HW question.
The homework question asks to code the Median Maintenance algorithm.
The formal statement is as follows:
The goal of this problem is to implement the "Median Maintenance" algorithm (covered in the Week 5 lecture on heap applications). The text file contains a list of the integers from 1 to 10000 in unsorted order; you should treat this as a stream of numbers, arriving one by one. Letting xi denote the ith number of the file, the kth median mk is defined as the median of the numbers x1,…,xk. (So, if k is odd, then mk is ((k+1)/2)th smallest number among x1,…,xk; if k is even, then mk is the (k/2)th smallest number among x1,…,xk.)
To get O(log n) time per element (O(n log n) overall), this should be implemented using heaps. However, I coded it using brute force, which is at least O(n^2) (the deadline was too soon and I needed an answer right away), with the following steps:
Read data in
Sort array
Find Median
Add it to the running sum
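For comparison, the heap-based approach mentioned above might look roughly like the sketch below. It is not the code from the question. It assumes .NET 6 or later (for PriorityQueue<TElement, TPriority>) and positive input values, since priorities are negated to simulate a max-heap. The class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

static class MedianMaintenanceSketch
{
    // Returns the sum of the running medians m_1..m_n, accumulated in a long.
    public static long SumOfMedians(IEnumerable<int> stream)
    {
        // 'low' holds the smaller half; negated priorities make it act as a max-heap.
        var low = new PriorityQueue<int, int>();
        // 'high' holds the larger half as an ordinary min-heap.
        var high = new PriorityQueue<int, int>();
        long sum = 0;

        foreach (int x in stream)
        {
            if (low.Count == 0 || x <= low.Peek())
                low.Enqueue(x, -x);
            else
                high.Enqueue(x, x);

            // Rebalance so that 'low' has as many elements as 'high', or one more.
            if (low.Count > high.Count + 1)
            {
                int moved = low.Dequeue();
                high.Enqueue(moved, moved);
            }
            else if (high.Count > low.Count)
            {
                int moved = high.Dequeue();
                low.Enqueue(moved, -moved);
            }

            // With that invariant, the largest element of 'low' is the k-th median
            // as defined in the problem statement.
            sum += low.Peek();
        }
        return sum;
    }
}
```

Each element costs O(log n) heap work, and the integer accumulator avoids any floating-point rounding in the sum.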
I ran the algorithm through several test cases (with a known answer) and got the correct results; however, when I ran the same algorithm on a larger data set I got the wrong answer. I was doing all the operations using Int64 to represent the data.
Then I tried switching to Int32 and magically I got the correct answer which makes no sense to me.
The code is below, and it is also found here (the data is in the repo). The algorithm starts to give erroneous results after the 3810 index:
private static void Main(string[] args)
{
MedianMaintenance("Question2.txt");
}
private static void MedianMaintenance(string filename)
{
var txtData = File.ReadLines(filename).ToArray();
var inputData32 = new List<Int32>();
var medians32 = new List<Int32>();
var sums32 = new List<Int32>();
var inputData64 = new List<Int64>();
var medians64 = new List<Int64>();
var sums64 = new List<Int64>();
var sum = 0;
var sum64 = 0f;
var i = 0;
foreach (var s in txtData)
{
//Add to sorted list
var intToAdd = Convert.ToInt32(s);
inputData32.Add(intToAdd);
inputData64.Add(Convert.ToInt64(s));
//Compute sum
var count = inputData32.Count;
inputData32.Sort();
inputData64.Sort();
var index = 0;
if (count%2 == 0)
{
//Even number of elements
index = count/2 - 1;
}
else
{
//Number is odd
index = ((count + 1)/2) - 1;
}
var val32 = Convert.ToInt32(inputData32[index]);
var val64 = Convert.ToInt64(inputData64[index]);
if (i > 3810)
{
var t = sum;
var t1 = sum + val32;
}
medians32.Add(val32);
medians64.Add(val64);
//Debug.WriteLine("Median is {0}", val);
sum += val32;
sums32.Add(Convert.ToInt32(sum));
sum64 += val64;
sums64.Add(Convert.ToInt64(sum64));
i++;
}
Console.WriteLine("Median Maintenance result is {0}", (sum).ToString("N"));
Console.WriteLine("Median Maintenance result is {0}", (medians32.Sum()).ToString("N"));
Console.WriteLine("Median Maintenance result is {0} - Int64", (sum64).ToString("N"));
Console.WriteLine("Median Maintenance result is {0} - Int64", (medians64.Sum()).ToString("N"));
}
What's more interesting is that the running sum (in the sum64 variable) yields a different result than summing all items in the list with LINQ's Sum() function.
The results (the third one is the one that's wrong):
These are the computer details:
I'd appreciate it if someone could give me some insight into why this is happening.
Thanks,
0f is initializing a 32 bit float variable, you meant 0d or 0.0 to receive a 64 bit floating point.
As for linq, you'll probably get better results if you use strongly typed lists.
new List<int>()
new List<long>()
The first thing I notice is what the commenter did: var sum64 = 0f initializes sum64 as a float. As the median value of a collection of Int64s will itself be an Int64 (the specified rules don't use the mean between two midpoint values in a collection of even cardinality), you should instead declare this variable explicitly as a long. In fact, I would go ahead and replace all usages of var in this code example; the convenience of var is being lost here in causing type-related bugs.
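To illustrate why a float accumulator goes wrong on a large running sum: a 32-bit float has a 24-bit significand, so above 2^24 = 16,777,216 it can no longer represent every integer. The snippet below is only a small demonstration of that effect, not the code from the question; the names are hypothetical.

```csharp
using System;

static class FloatAccumulatorDemo
{
    // Adds 1 to 2^24 using a float accumulator and returns the result.
    public static float AddOneAsFloat()
    {
        float total = 16777216f;      // 2^24
        total = (float)(total + 1f);  // 16777217 is not representable as a float
        return total;
    }

    // The same addition with a long accumulator.
    public static long AddOneAsLong()
    {
        long total = 16777216L;
        total += 1;
        return total;
    }
}
```

AddOneAsFloat() returns 16777216 (the increment is lost to rounding), while AddOneAsLong() returns 16777217. Once the running sum of medians passes that threshold, the float total starts drifting away from the exact integer total.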

Subgroup of list

I have two lists of strings.
Is there a simple way to find out whether one list contains all the strings of the second list?
(By simple, I mean that I don't have to explicitly compare each string in one list to all the strings in the other.)
Use Enumerable.Except to find differences between lists. If there are no items in the result, then all items from list2 are in list1:
bool containsAll = !list2.Except(list1).Any();
Internally, Except uses a Set<T> to collect the unique items from list1 and returns only those items from list2 which are not in the set. If there is nothing to return, then all items of list2 are in the set.
Try this:
firstList.All(x=>secondList.Contains(x));
Shorter version (method group):
firstList.All(secondList.Contains)
You need to write using for Linq:
using System.Linq;
It checks whether all items from the first list are in the second list. Contains checks whether a given item is in the list. All returns true if all items of the collection match the predicate. The given predicate is "the item is in the second list", so the whole expression checks whether all items are in the second list (tested and working :)
For larger lists use a HashSet<T> (which results in linear Big O, as opposed to O(n^2) when just using two lists):
var hash = new HashSet<string>(list2);
bool containsAll = list1.All(hash.Contains);
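HashSet<T> also has built-in set-relation methods, so a similar check can be written without LINQ. The sketch below is an alternative under the assumption that you want to know whether list1 contains every string of list2; the helper name is hypothetical.

```csharp
using System.Collections.Generic;

static class SubsetCheck
{
    // True when every string in list2 also appears in list1.
    public static bool ContainsAll(IEnumerable<string> list1, IEnumerable<string> list2)
    {
        // Building the set is linear in list1; IsSupersetOf then scans list2 once.
        var set1 = new HashSet<string>(list1);
        return set1.IsSupersetOf(list2);
    }
}
```

Duplicates in either list do not affect the result, since only membership is tested.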
use LINQ
bool isSublistOf = list1.All(list2.Contains);
The All method returns true if the condition in the lambda is met for every element in the IEnumerable. All is passed the Contains method of list2 as a Func<string, bool>, which returns true if the element is found in list2. The net effect is that the statement returns true if ALL of the elements in list1 are found in list2.
Performance Note
Due to the nature of the All operator, it is worst case O(n^2), but will exit at the first chance (any mismatch). Using completely random 8 byte strings, I tried out each of the cases using a rudimentary performance harness.
static void Main(string[] args)
{
long count = 5000000;
//Get 5,000,000 random strings (the superset)
var strings = CreateRandomStrings(count);
//Get 1000 random strings (the subset)
var substrings = CreateRandomStrings(1000);
//Perform the hashing technique
var start = DateTime.Now;
var hash = new HashSet<string>(strings);
var mid = DateTime.Now;
var any = substrings.All(hash.Contains);
var end = DateTime.Now;
Console.WriteLine("Hashing took " + end.Subtract(start).TotalMilliseconds + " " + mid.Subtract(start).TotalMilliseconds + " of which was setting up the hash");
//Do the scanning all technique
start = DateTime.Now;
any = substrings.All(strings.Contains);
end = DateTime.Now;
Console.WriteLine("Scanning took " + end.Subtract(start).TotalMilliseconds);
//Do the Excepting technique
start = DateTime.Now;
any = !substrings.Except(strings).Any(); // negated so it answers the same question as the two checks above
end = DateTime.Now;
Console.WriteLine("Excepting took " + end.Subtract(start).TotalMilliseconds);
Console.ReadKey();
}
private static string[] CreateRandomStrings(long count)
{
var rng = new Random(DateTime.Now.Millisecond);
string[] strings = new string[count];
byte[] bytes = new byte[8];
for (long i = 0; i < count; i++) {
rng.NextBytes(bytes);
strings[i] = Convert.ToBase64String(bytes);
}
return strings;
}
The results ranked them in the following order fairly consistently:
Scanning - ~38 ms (list1.All(list2.Contains))
Hashing - ~750 ms (749 of which was spent setting up the hashset)
Excepting - ~1200 ms
The excepting method takes much longer because it requires all the work up front. Unlike the other methods, it will not exit on a mismatch, but continue to process all elements. The Hashing is much faster, but also does significant work up front in setting up the hash. This would be the fastest method if the strings were less random and intersections were more certain.
Disclaimer
All performance tuning at this level is next to irrelevant. This is just a mental exercise.

When should I use a List vs a LinkedList

When is it better to use a List vs a LinkedList?
In most cases, List<T> is more useful. LinkedList<T> will have less cost when adding/removing items in the middle of the list, whereas List<T> can only cheaply add/remove at the end of the list.
LinkedList<T> is only at its most efficient if you are accessing sequential data (either forwards or backwards). Random access is relatively expensive, since it must walk the chain each time (which is why it doesn't have an indexer). However, because a List<T> is essentially just an array (with a wrapper), random access is fine.
List<T> also offers a lot of support methods - Find, ToArray, etc; however, these are also available for LinkedList<T> with .NET 3.5/C# 3.0 via extension methods - so that is less of a factor.
Thinking of a linked list as a list can be a bit misleading. It's more like a chain. In fact, in .NET, LinkedList<T> does not even implement IList<T>. There is no real concept of index in a linked list, even though it may seem there is. Certainly none of the methods provided on the class accept indexes.
Linked lists may be singly linked, or doubly linked. This refers to whether each element in the chain has a link only to the next one (singly linked) or to both the prior/next elements (doubly linked). LinkedList<T> is doubly linked.
Internally, List<T> is backed by an array. This provides a very compact representation in memory. Conversely, LinkedList<T> involves additional memory to store the bidirectional links between successive elements. So the memory footprint of a LinkedList<T> will generally be larger than for List<T> (with the caveat that List<T> can have unused internal array elements to improve performance during append operations.)
They have different performance characteristics too:
Append
    LinkedList<T>.AddLast(item): constant time
    List<T>.Add(item): amortized constant time, linear worst case
Prepend
    LinkedList<T>.AddFirst(item): constant time
    List<T>.Insert(0, item): linear time
Insertion
    LinkedList<T>.AddBefore(node, item): constant time
    LinkedList<T>.AddAfter(node, item): constant time
    List<T>.Insert(index, item): linear time
Removal
    LinkedList<T>.Remove(item): linear time
    LinkedList<T>.Remove(node): constant time
    List<T>.Remove(item): linear time
    List<T>.RemoveAt(index): linear time
Count
    LinkedList<T>.Count: constant time
    List<T>.Count: constant time
Contains
    LinkedList<T>.Contains(item): linear time
    List<T>.Contains(item): linear time
Clear
    LinkedList<T>.Clear(): linear time
    List<T>.Clear(): linear time
As you can see, they're mostly equivalent. In practice, the API of LinkedList<T> is more cumbersome to use, and details of its internal needs spill out into your code.
However, if you need to do many insertions/removals within a list and already hold a reference to the node, LinkedList<T> offers constant time. List<T> offers linear time, as the items after the insertion/removal point must be shuffled around.
Linked lists provide very fast insertion or deletion of a list member. Each member in a linked list contains a pointer to the next member in the list so to insert a member at position i:
update the pointer in member i-1 to point to the new member
set the pointer in the new member to point to member i
The disadvantage to a linked list is that random access is not possible. Accessing a member requires traversing the list until the desired member is found.
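In .NET, the pointer updates described above are done for you by LinkedList<T>.AddBefore and AddAfter once you hold a node reference. A small illustrative sketch follows (the names are hypothetical); note that finding the node with Find is itself a linear traversal.

```csharp
using System.Collections.Generic;

static class LinkedListInsertSketch
{
    // Builds a -> b -> d, then inserts "c" in front of "d".
    public static List<string> InsertInMiddle()
    {
        var chain = new LinkedList<string>(new[] { "a", "b", "d" });

        // Locating the node walks the chain from the head: O(n).
        LinkedListNode<string> nodeD = chain.Find("d");

        // With the node in hand, the insertion only rewires neighbouring links: O(1).
        chain.AddBefore(nodeD, "c");

        return new List<string>(chain);
    }
}
```

The returned list contains a, b, c, d in that order.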
Edit
Please read the comments to this answer. People claim I did not do
proper tests. I agree this should not be an accepted answer. As I was
learning I did some tests and felt like sharing them.
Original answer...
I found interesting results:
// Temporary class to show the example
class Temp
{
public decimal A, B, C, D;
public Temp(decimal a, decimal b, decimal c, decimal d)
{
A = a; B = b; C = c; D = d;
}
}
Linked list (3.9 seconds)
LinkedList<Temp> list = new LinkedList<Temp>();
for (var i = 0; i < 12345678; i++)
{
var a = new Temp(i, i, i, i);
list.AddLast(a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
List (2.4 seconds)
List<Temp> list = new List<Temp>(); // 2.4 seconds
for (var i = 0; i < 12345678; i++)
{
var a = new Temp(i, i, i, i);
list.Add(a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
Even if you only access the data, it is noticeably slower! I say never use a LinkedList.
Here is another comparison performing a lot of inserts (we plan on inserting an item at the middle of the list)
Linked List (51 seconds)
LinkedList<Temp> list = new LinkedList<Temp>();
for (var i = 0; i < 123456; i++)
{
var a = new Temp(i, i, i, i);
list.AddLast(a);
var curNode = list.First;
for (var k = 0; k < i/2; k++) // In order to insert a node at the middle of the list we need to find it
curNode = curNode.Next;
list.AddAfter(curNode, a); // Insert it after
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
List (7.26 seconds)
List<Temp> list = new List<Temp>();
for (var i = 0; i < 123456; i++)
{
var a = new Temp(i, i, i, i);
list.Insert(i / 2, a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
Linked List having reference of location where to insert (.04 seconds)
LinkedList<Temp> list = new LinkedList<Temp>();
list.AddLast(new Temp(1,1,1,1));
var referenceNode = list.First;
for (var i = 0; i < 123456; i++)
{
var a = new Temp(i, i, i, i);
list.AddLast(a);
list.AddBefore(referenceNode, a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
So use a linked list only if you plan on inserting several items and you also already have a reference to where you plan to insert them. Having to insert a lot of items does not by itself make it faster, because searching for the location where you would like to insert takes time.
My previous answer was not accurate enough.
Truly, it was horrible :D
But now I can post a much more useful and correct answer.
I did some additional tests. You can find the source at the following link and re-check it in your own environment: https://github.com/ukushu/DataStructuresTestsAndOther.git
Short results:
Use an array:
As often as possible. It is fast and takes the least RAM for the same amount of information.
If you know the exact number of cells needed
If the data saved in the array is < 85,000 bytes (85,000 / 4 = 21,250 elements for 32-bit integer data)
If you need high random-access speed
Use a List:
If you need to add cells to the end of the list (often)
If you need to add cells at the beginning/middle of the list (NOT OFTEN)
If the data saved in the backing array is < 85,000 bytes (85,000 / 4 = 21,250 elements for 32-bit integer data)
If you need high random-access speed
Use a LinkedList:
If you need to add cells at the beginning/middle/end of the list (often)
If you only need sequential access (forward/backward)
If you need to save LARGE items, but the item count is low.
Better not to use it for a large number of items, as it uses additional memory for the links.
More details:
Interesting to know:
LinkedList<T> is internally not a List in .NET. It does not even implement IList<T>. That is why it has no indexes or index-related methods.
LinkedList<T> is a node-pointer-based collection. In .NET it is a doubly linked implementation. This means that each node holds links to both its previous and its next element. The data is also fragmented: different list objects can be located in different places in RAM. And more memory will be used for a LinkedList<T> than for a List<T> or an array.
List<T> in .NET is the counterpart of Java's ArrayList<T>. This means that it is an array wrapper, so it is allocated in memory as one contiguous block of data. If the allocated data size reaches 85,000 bytes, it is placed on the Large Object Heap. Depending on the size, this can lead to heap fragmentation (a mild form of memory leak). But at the same time, if the size is < 85,000 bytes, this provides a very compact and fast-access representation in memory.
A single contiguous block is preferred for random access performance and memory consumption, but for collections that need to change size regularly, a structure such as an array generally needs to be copied to a new location, whereas a linked list only needs to manage the memory for the newly inserted/deleted nodes.
The difference between List and LinkedList lies in their underlying implementation. List is an array-based collection (ArrayList). LinkedList is a node-pointer-based collection (LinkedListNode). At the API level, both of them are used in much the same way, since both implement the same set of interfaces such as ICollection, IEnumerable, etc.
The key difference comes when performance matters. For example, if you are implementing a list with heavy "INSERT" operations, LinkedList outperforms List, since LinkedList can do it in O(1) time, whereas List may need to expand the size of the underlying array. For more detail you might want to read up on the algorithmic differences between linked list and array data structures: http://en.wikipedia.org/wiki/Linked_list and Array
Hope this helps.
The primary advantage of linked lists over arrays is that the links provide us with the capability to rearrange the items efficiently.
Sedgewick, p. 91
A common circumstance to use LinkedList is like this:
Suppose you want to remove many specific strings from a large list of strings, say 100,000. The strings to remove can be looked up in a HashSet dic, and the list of strings is believed to contain between 30,000 and 60,000 such strings to remove.
Then what's the best type of List for storing the 100,000 Strings? The answer is LinkedList. If they are stored in an ArrayList, then iterating over it and removing matched Strings would take up to billions of operations, while it takes just around 100,000 operations by using an iterator and the remove() method.
LinkedList<String> strings = readStrings();
HashSet<String> dic = readDic();
Iterator<String> iterator = strings.iterator();
while (iterator.hasNext()){
String string = iterator.next();
if (dic.contains(string))
iterator.remove();
}
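The snippet above is Java. A rough C# counterpart is sketched below, with hypothetical names; it is not from the original answer. LinkedList<T> has no removing iterator, so the sketch walks the nodes directly. Worth noting: in .NET, List<T>.RemoveAll also removes all matches in a single linear pass, so the billions-of-operations concern applies to removing items one at a time rather than to List<T> as such.

```csharp
using System.Collections.Generic;

static class BulkRemovalSketch
{
    // Removes every string found in 'toRemove' from a linked list in one pass.
    public static void RemoveFromLinkedList(LinkedList<string> strings, HashSet<string> toRemove)
    {
        LinkedListNode<string> node = strings.First;
        while (node != null)
        {
            LinkedListNode<string> next = node.Next; // capture before the node is unlinked
            if (toRemove.Contains(node.Value))
                strings.Remove(node);                // constant time, given the node
            node = next;
        }
    }

    // The array-backed equivalent: RemoveAll compacts the list in a single linear pass.
    public static void RemoveFromList(List<string> strings, HashSet<string> toRemove)
    {
        strings.RemoveAll(toRemove.Contains);
    }
}
```

Both methods do O(n) work overall, with O(1) average cost per HashSet lookup.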
When you need built-in indexed access, sorting (and after this binary searching), and "ToArray()" method, you should use List.
Essentially, a List<> in .NET is a wrapper over an array. A LinkedList<> is a linked list. So the question comes down to, what is the difference between an array and a linked list, and when should an array be used instead of a linked list. Probably the two most important factors in your decision of which to use would come down to:
Linked lists have much better insertion/removal performance, so long as the insertions/removals are not on the last element in the collection. This is because an array must shift all remaining elements that come after the insertion/removal point. If the insertion/removal is at the tail end of the list however, this shift is not needed (although the array may need to be resized, if its capacity is exceeded).
Arrays have much better accessing capabilities. Arrays can be indexed into directly (in constant time). Linked lists must be traversed (linear time).
This is adapted from Tono Nam's accepted answer correcting a few wrong measurements in it.
The test:
static void Main()
{
    LinkedListPerformance.AddFirst_List(); // 12028 ms
    LinkedListPerformance.AddFirst_LinkedList(); // 33 ms
    LinkedListPerformance.AddLast_List(); // 33 ms
    LinkedListPerformance.AddLast_LinkedList(); // 32 ms
    LinkedListPerformance.Enumerate_List(); // 1.08 ms
    LinkedListPerformance.Enumerate_LinkedList(); // 3.4 ms
    //I tried below as fun exercise - not very meaningful, see code
    //sort of equivalent to insertion when having the reference to middle node
    LinkedListPerformance.AddMiddle_List(); // 5724 ms
    LinkedListPerformance.AddMiddle_LinkedList1(); // 36 ms
    LinkedListPerformance.AddMiddle_LinkedList2(); // 32 ms
    LinkedListPerformance.AddMiddle_LinkedList3(); // 454 ms
    Environment.Exit(-1);
}
And the code:
using System; // needed for Console and Environment
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
namespace stackoverflow
{
    static class LinkedListPerformance
    {
        class Temp
        {
            public decimal A, B, C, D;
            public Temp(decimal a, decimal b, decimal c, decimal d)
            {
                A = a; B = b; C = c; D = d;
            }
        }
        static readonly int start = 0;
        static readonly int end = 123456;
        static readonly IEnumerable<Temp> query = Enumerable.Range(start, end - start).Select(temp);
        static Temp temp(int i)
        {
            return new Temp(i, i, i, i);
        }
        static void StopAndPrint(this Stopwatch watch)
        {
            watch.Stop();
            Console.WriteLine(watch.Elapsed.TotalMilliseconds);
        }
        public static void AddFirst_List()
        {
            var list = new List<Temp>();
            var watch = Stopwatch.StartNew();
            for (var i = start; i < end; i++)
                list.Insert(0, temp(i));
            watch.StopAndPrint();
        }
        public static void AddFirst_LinkedList()
        {
            var list = new LinkedList<Temp>();
            var watch = Stopwatch.StartNew();
            for (int i = start; i < end; i++)
                list.AddFirst(temp(i));
            watch.StopAndPrint();
        }
        public static void AddLast_List()
        {
            var list = new List<Temp>();
            var watch = Stopwatch.StartNew();
            for (var i = start; i < end; i++)
                list.Add(temp(i));
            watch.StopAndPrint();
        }
        public static void AddLast_LinkedList()
        {
            var list = new LinkedList<Temp>();
            var watch = Stopwatch.StartNew();
            for (int i = start; i < end; i++)
                list.AddLast(temp(i));
            watch.StopAndPrint();
        }
        public static void Enumerate_List()
        {
            var list = new List<Temp>(query);
            var watch = Stopwatch.StartNew();
            foreach (var item in list)
            {
            }
            watch.StopAndPrint();
        }
        public static void Enumerate_LinkedList()
        {
            var list = new LinkedList<Temp>(query);
            var watch = Stopwatch.StartNew();
            foreach (var item in list)
            {
            }
            watch.StopAndPrint();
        }
        //for the fun of it, I tried to time inserting to the middle of
        //linked list - this is by no means a realistic scenario! or may be
        //these make sense if you assume you have the reference to middle node
        //insertion to the middle of list
        public static void AddMiddle_List()
        {
            var list = new List<Temp>();
            var watch = Stopwatch.StartNew();
            for (var i = start; i < end; i++)
                list.Insert(list.Count / 2, temp(i));
            watch.StopAndPrint();
        }
        //insertion in linked list in such a fashion that
        //it has the same effect as inserting into the middle of list
        public static void AddMiddle_LinkedList1()
        {
            var list = new LinkedList<Temp>();
            var watch = Stopwatch.StartNew();
            LinkedListNode<Temp> evenNode = null, oddNode = null;
            for (int i = start; i < end; i++)
            {
                if (list.Count == 0)
                    oddNode = evenNode = list.AddLast(temp(i));
                else if (list.Count % 2 == 1)
                    oddNode = list.AddBefore(evenNode, temp(i));
                else
                    evenNode = list.AddAfter(oddNode, temp(i));
            }
            watch.StopAndPrint();
        }
        //another hacky way
        public static void AddMiddle_LinkedList2()
        {
            var list = new LinkedList<Temp>();
            var watch = Stopwatch.StartNew();
            for (var i = start + 1; i < end; i += 2)
                list.AddLast(temp(i));
            for (int i = end - 2; i >= 0; i -= 2)
                list.AddLast(temp(i));
            watch.StopAndPrint();
        }
        //OP's original more sensible approach, but I tried to filter out
        //the intermediate iteration cost in finding the middle node.
        public static void AddMiddle_LinkedList3()
        {
            var list = new LinkedList<Temp>();
            var watch = Stopwatch.StartNew();
            for (var i = start; i < end; i++)
            {
                if (list.Count == 0)
                    list.AddLast(temp(i));
                else
                {
                    //pause the clock while walking to the middle node
                    watch.Stop();
                    var curNode = list.First;
                    for (var j = 0; j < list.Count / 2; j++)
                        curNode = curNode.Next;
                    watch.Start();
                    list.AddBefore(curNode, temp(i));
                }
            }
            watch.StopAndPrint();
        }
    }
}
You can see the results are in accordance with theoretical performance others have documented here. Quite clear - LinkedList<T> gains big time in case of insertions. I haven't tested for removal from the middle of list, but the result should be the same. Of course List<T> has other areas where it performs way better like O(1) random access.
Use LinkedList<> when
You don't know how many objects are coming through the flood gate. For example, a token stream.
You ONLY want to insert or delete at the ends.
For everything else, it is better to use List<>.
I do agree with most of the points made above. And I also agree that List looks like the more obvious choice in most cases.
But I just want to add that there are many instances where LinkedList is a far better choice than List for efficiency.
Suppose you are traversing the elements and you want to perform a lot of insertions/deletions; LinkedList does it in linear O(n) time, whereas List does it in quadratic O(n^2) time.
Suppose you want to access bigger objects again and again; LinkedList becomes much more useful.
Deques and queues are better implemented using LinkedList.
Growing a LinkedList is much easier and cheaper when you are dealing with many large objects.
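The first point above, inserting while traversing, can be sketched as follows. This is only an illustration: TraverseAndInsert and InsertAfterMatches are made-up names, and the element type is int to keep it short. The LinkedList<T> overload does constant work per insertion, while the List<T> overload shifts the tail of the array on every Insert, which is where the quadratic worst case comes from.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names; a sketch of inserting while traversing.
static class TraverseAndInsert
{
    // Inserts 'marker' after every node whose value equals 'target'.
    public static void InsertAfterMatches(LinkedList<int> list, int target, int marker)
    {
        for (var node = list.First; node != null; node = node.Next)
        {
            if (node.Value == target)
                node = list.AddAfter(node, marker); // O(1); continue from the new node
        }
    }

    // Same effect on an array-backed list; each Insert shifts later elements.
    public static void InsertAfterMatches(List<int> list, int target, int marker)
    {
        for (int i = 0; i < list.Count; i++)
        {
            if (list[i] == target)
                list.Insert(++i, marker); // O(n) shift; i now points at the marker
        }
    }
}
```

Both overloads skip the freshly inserted marker, so they terminate even when marker equals target.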
Hope someone would find these comments useful.
In .NET, Lists are backed by arrays. Therefore a normal List is considerably faster than a LinkedList for most operations. That is why people above see the results they see.
Why should you use the List?
I would say it depends. If you don't specify a capacity, List allocates room for 4 elements on the first add. The moment you exceed this limit, it copies everything to a new array with double the capacity (8 elements in this case), leaving the old one in the hands of the garbage collector. Imagine having a list with 1 million elements at full capacity, and you add 1 more. It will create a whole new array with a capacity of 2 million, although you only needed 1 million and 1, and leave the old array behind in GEN2 for the garbage collector. So it can actually end up being a huge bottleneck. You should be careful about that.
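The growth behaviour described above can be observed through the List<T>.Capacity property. The sketch below records each capacity the list passes through, and shows the usual workaround of passing an expected size to the constructor. CapacityDemo, ObserveGrowth and Presized are illustrative names, and the exact growth sequence is an implementation detail that could differ between runtimes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names; a sketch for watching List<T> grow.
static class CapacityDemo
{
    // Returns each distinct capacity seen while adding 'count' items one by one.
    public static List<int> ObserveGrowth(int count)
    {
        var seen = new List<int>();
        var list = new List<int>();
        for (int i = 0; i < count; i++)
        {
            list.Add(i);
            if (seen.Count == 0 || seen[seen.Count - 1] != list.Capacity)
                seen.Add(list.Capacity); // capacity changed: a new backing array was allocated
        }
        return seen;
    }

    // Pre-sizing avoids the repeated reallocations when the size is roughly known.
    public static List<int> Presized(int expectedCount)
    {
        return new List<int>(expectedCount);
    }
}
```

Printing string.Join(", ", CapacityDemo.ObserveGrowth(100)) shows the sequence of reallocations. If a list has already overshot, TrimExcess() can release the unused tail at the cost of one more copy.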
I asked a similar question related to the performance of the LinkedList collection, and discovered that Stephen Cleary's C# implementation of Deque was a solution. Unlike the Queue collection, Deque allows adding and removing items at both the front and the back. It is similar to a linked list, but with improved performance.
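To show why an array-backed deque can beat a linked list, here is a minimal circular-buffer sketch. It is not the library's code or API: RingDeque and its member names are invented for this example. Adds and removes at either end only adjust an index, and there is no per-element node allocation.

```csharp
using System;

// Hypothetical minimal deque over a circular array; names are illustrative.
sealed class RingDeque<T>
{
    private T[] _items = new T[4];
    private int _head;   // index of the first element
    private int _count;  // number of stored elements

    public int Count => _count;

    public void AddFirst(T item)
    {
        Grow();
        _head = (_head - 1 + _items.Length) % _items.Length; // step back, wrapping around
        _items[_head] = item;
        _count++;
    }

    public void AddLast(T item)
    {
        Grow();
        _items[(_head + _count) % _items.Length] = item;
        _count++;
    }

    public T RemoveFirst()
    {
        if (_count == 0) throw new InvalidOperationException("Deque is empty.");
        T item = _items[_head];
        _items[_head] = default(T); // drop the reference so it can be collected
        _head = (_head + 1) % _items.Length;
        _count--;
        return item;
    }

    public T RemoveLast()
    {
        if (_count == 0) throw new InvalidOperationException("Deque is empty.");
        int index = (_head + _count - 1) % _items.Length;
        T item = _items[index];
        _items[index] = default(T);
        _count--;
        return item;
    }

    // Doubles the buffer when full, copying elements so the head lands at index 0.
    private void Grow()
    {
        if (_count < _items.Length) return;
        var bigger = new T[_items.Length * 2];
        for (int i = 0; i < _count; i++)
            bigger[i] = _items[(_head + i) % _items.Length];
        _items = bigger;
        _head = 0;
    }
}
```

The trade-off is the same as for List<T>: occasional O(n) copies when the buffer grows, in exchange for contiguous storage and constant-time operations at both ends otherwise.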
