How to improve performance of ConcurrentDictionary.Count in C# - c#

Recently, I needed to choose between using SortedDictionary and SortedList, and settled on SortedList.
However, now I discovered that my C# program is slowing to a crawl when performing SortedList.Count, which I check using a function/method called thousands of times.
Usually my program would call the function 10,000 times within 35 ms, but while using SortedList.Count, it slowed to 300-400 ms, basically 10x slower.
I also tried SortedList.Keys.Count, but this reduced my performance another 10 times, to over 3000 ms.
I have only ~5000 keys/objects in SortedList<DateTime, object_name>.
I can easily and instantly retrieve data from my sorted list by SortedList[date] (in 35 ms), so I haven't found any problem with the list structure or objects its holding.
Is this performance normal?
What else can I use to obtain the number of records in the list, or just to check that the list is populated?
(besides adding a separate tracking flag, which I may do for now)
CORRECTION:
Sorry, I'm actually using:
ConcurrentDictionary<string, SortedList<DateTime, string>> dict_list = new ConcurrentDictionary<string, SortedList<DateTime, string>>();
And I had various counts in various places, sometimes checking items in the list and other times in ConcurrentDicitonary. So the issue applies to ConcurrentDicitonary and I wrote quick test code to confirm this, which takes 350 ms, without using concurrency.
Here is the test with ConcurrentDicitonary, showing 350 ms:
public static void CountTest()
{
//Create test ConcurrentDictionary
ConcurrentDictionary<int, string> test_dict = new ConcurrentDictionary<int, string>();
for (int i = 0; i < 50000; i++)
{
test_dict[i] = "ABC";
}
//Access .Count property 10,000 times
int tick_count = Environment.TickCount;
for (int i = 1; i <= 10000; i++)
{
int dict_count = test_dict.Count;
}
Console.WriteLine(string.Format("Time: {0} ms", Environment.TickCount - tick_count));
Console.ReadKey();
}

this article recommends calling this instead:
dictionary.Skip(0).Count()
The count could be invalid as soon as the call from the method returns. If you want to write the count to a log for tracing purposes, for example, you can use alternative methods, such as the lock-free enumerator

Well, ConcurrentDictionary<TKey, TValue> must work properly with many threads at once, so it needs some synchronization overhead.
Source code for Count property: https://referencesource.microsoft.com/#mscorlib/system/Collections/Concurrent/ConcurrentDictionary.cs,40c23c8c36011417
public int Count
{
[SuppressMessage("Microsoft.Concurrency", "CA8001", Justification = "ConcurrencyCop just doesn't know about these locks")]
get
{
int count = 0;
int acquiredLocks = 0;
try
{
// Acquire all locks
AcquireAllLocks(ref acquiredLocks);
// Compute the count, we allow overflow
for (int i = 0; i < m_tables.m_countPerLock.Length; i++)
{
count += m_tables.m_countPerLock[i];
}
}
finally
{
// Release locks that have been acquired earlier
ReleaseLocks(0, acquiredLocks);
}
return count;
}
}
Looks like you need to refactor your existing code. Since you didn't provide any code, we can't tell you what you could optimize.

For performance sensitive code I would not recommend to use ConcurrentDictionary.Count property as it has a locking implementation. You could use Interlocked.Increment instead to do the count yourself.

Related

Why is Queue consuming so much memory?

Basically I was doing a code kata on codewars site to kinda of 'warm up' before starting to code, and noticed a problem that I don't know if its because of my code, or just regular thing.
public static string WhoIsNext(string[] names, long n)
{
Queue<string> fifo = new Queue<string>(names);
for(int i = 0; i < n - 1; i++)
{
var name = fifo.Dequeue();
fifo.Enqueue(name);
fifo.Enqueue(name);
}
return fifo.Peek();
}
And Is called like this:
// Test 1
string[] names = { "Sheldon", "Leonard", "Penny", "Rajesh", "Howard" };
long n = 1;
var nth = CodeKata.WhoIsNext(names, n); // n = 1 Should return sheldon.
// test 2
string[] names = { "Sheldon", "Leonard", "Penny", "Rajesh", "Howard" };
long n = 52;
var nth = CodeKata.WhoIsNext(names, n); // n = 52 Should return Penny.
// test 3
string[] names = { "Sheldon", "Leonard", "Penny", "Rajesh", "Howard" };
long n = 7230702951;
var nth = CodeKata.WhoIsNext(names, n); // n = 52 Should return Leonard.
In this code When I put the long n with the value 7230702951 (a really high number...), it throws an out of memory exception. Is the number that high, or is the queue just not optimized for such numbers.
I say this because I tried using a List and the list memory usage stayed under 500 MB (the plateu was around 327MB btw), and this running for about 2/3min, whereas the queue throwed the exception in a matter of seconds, and went over 2GB in just that time alone.
Can someone explain to me the why of this happening, I just curious?
edit 1
I forgot to add the List code:
public static string WhoIsNext(string[] names, long n)
{
List<string> test = new List<string>(names);
for(int i = 0; i < n - 1; i++)
{
var name = test[0];
test.RemoveAt(0);
test.Add(name);
test.Add(name);
}
return test[0];
}
edit 2
For those saying that the code doubles the names and is inneficient, I already know that, the code isn't made to be useful, is just a kata. (I updated the link now!)
My question is as to why is Queue so much more inneficient thatn List with high count numbers.
Part of the reason is that the queue code is way faster than the List code, because queues are optimised for deletes due to the fact that they are a circular buffer. Lists aren't - the list copies the array contents every time you remove that first element.
Change the input value to 72307000 for example. On my machine, the queue finishes that in less than a second. The list is still chugging away minutes (and at this rate, hours) later. In 4 minutes i is now at 752408 - it has done almost 1% of the work).
Thus, I am not sure the queue is less memory efficient. It is just so fast that you run into the memory issue sooner. The list almost certainly has the same issue (the way that List and Queue do array size doubling is very similar) - it will just likely take days to run into it.
To a certain extent, you could predict this even without running your code. A queue with 7230702951 entries in it (running 64-bit) will take a minimum of 8 bytes per entry. So 57845623608 bytes. Which is larger than 50GB. Clearly your machine is going to struggle to fit that in RAM (plus .NET won't let you have an array that large)...
Additionally, your code has a subtle bug. The loop can't ever end (if n is greater than int.MaxValue). Your loop variable is an int but the parameter is a long. Your int will overflow (from int.MaxValue to int.MinValue with i++). So the loop will never exit, for large values of n (meaning the queue will grow forever). You likely should change the type of i to long.

How do I solve getting stuck issue of two nested for loops?

I have coded a C# Windows Form Application where there are two nested for loops like below:
List<int> My_LIST = new List<int>();
List<string> RESULT_LIST = new List<string>();
int[] arr = My_LIST.ToArray();
for (int i = 0; i < arr.Length; i++)
{
int counter = i;
for (int j = 1; j <= arr.Length; j++)
{
if (counter == arr.Length)
{
counter = 0;
}
sb.Append(arr[counter].ToString());
RESULT_LIST.Add(sb.ToString());
counter++;
}
sb.Clear();
}
I used this code to take any combination of characters that are inside my array.
When the array's length gets 3000 or more the program gets stuck. How can I fix this issue?
Because based on my application's needs the size of my array must be large.
First, I had used a normal string to get the combinations and my code was like below:
string s = "";
instead of this:
StringBuilder sb = new StringBuilder();
I thought that there may be a problem with the normal string's maximum length and therefore I changed it to StringBuilder but the problem was not fixed.
Is this problem solvable or I should use an alternative?
your program is not getting stuck, rather it's taking a long time to finish due to it only running on a single thread.
you may want to use Parallel.For which its iterations can run in parallel.
I would first suggest parallelising the outer for loop and then test if there are any performance improvements if not then also parallelise the inner loop. In the majority of cases parallelising the outer for loop can prove sufficient.
Also, note that using a StringBuilder as an accumulator and then performing sb.Append(arr[counter].ToString()); inside the loop is sub-optimal. Don't call toString at all, just append.
However, when performing parallel processing keep in mind that StringBuilder is not thread-safe, therefore, you're better off using a String rather than trying to introduce locking or anything of that sort to retain the StringBuilder.
Finally, you'll need to be cautious in regards to shared state, and therefore I'd recommend you do some research in terms of what to take into account when performing parallel processing.

Is 'yield return' slower than "old school" return?

I'm doing some tests about yield return perfomance, and I found that it is slower than normal return.
I tested value variables (int, double, etc.) and some references types (string, etc.)... And yield return were slower in both cases. Why use it then?
Check out my example:
public class YieldReturnTeste
{
private static IEnumerable<string> YieldReturnTest(int limite)
{
for (int i = 0; i < limite; i++)
{
yield return i.ToString();
}
}
private static IEnumerable<string> NormalReturnTest(int limite)
{
List<string> listaInteiros = new List<string>();
for (int i = 0; i < limite; i++)
{
listaInteiros.Add(i.ToString());
}
return listaInteiros;
}
public static void executaTeste()
{
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
List<string> minhaListaYield = YieldReturnTest(2000000).ToList();
stopWatch.Stop();
TimeSpan ts = stopWatch.Elapsed;
string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
ts.Hours, ts.Minutes, ts.Seconds,
ts.Milliseconds / 10);
Console.WriteLine("Yield return: {0}", elapsedTime);
//****
stopWatch = new Stopwatch();
stopWatch.Start();
List<string> minhaListaNormal = NormalReturnTest(2000000).ToList();
stopWatch.Stop();
ts = stopWatch.Elapsed;
elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
ts.Hours, ts.Minutes, ts.Seconds,
ts.Milliseconds / 10);
Console.WriteLine("Normal return: {0}", elapsedTime);
}
}
Consider the difference between File.ReadAllLines and File.ReadLines.
ReadAllLines loads all of the lines into memory and returns a string[]. All well and good if the file is small. If the file is larger than will fit in memory, you'll run out of memory.
ReadLines, on the other hand, uses yield return to return one line at a time. With it, you can read any size file. It doesn't load the whole file into memory.
Say you wanted to find the first line that contains the word "foo", and then exit. Using ReadAllLines, you'd have to read the entire file into memory, even if "foo" occurs on the first line. With ReadLines, you only read one line. Which one would be faster?
That's not the only reason. Consider a program that reads a file and processes each line. Using File.ReadAllLines, you end up with:
string[] lines = File.ReadAllLines(filename);
for (int i = 0; i < lines.Length; ++i)
{
// process line
}
The time it takes that program to execute is equal to the time it takes to read the file, plus time to process the lines. Imagine that the processing takes so long that you want to speed it up with multiple threads. So you do something like:
lines = File.ReadAllLines(filename);
Parallel.Foreach(...);
But the reading is single-threaded. Your multiple threads can't start until the main thread has loaded the entire file.
With ReadLines, though, you can do something like:
Parallel.Foreach(File.ReadLines(filename), line => { ProcessLine(line); });
That starts up multiple threads immediately, which are processing at the same time that other lines are being read. So the reading time is overlapped with the processing time, meaning that your program will execute faster.
I show my examples using files because it's easier to demonstrate the concepts that way, but the same holds true for in-memory collections. Using yield return will use less memory and is potentially faster, especially when calling methods that only need to look at part of the collection (Enumerable.Any, Enumerable.First, etc.).
For one, it's a convenience feature. Two, it lets you do lazy return, which means that it's only evaluated when the value's fetched. That can be invaluable in stuff like a DB query, or just a collection you don't want to completely iterate over. Three, it can be faster in some scenarios. Four, what was the difference? Probably tiny, so micro optimization.
Because C# compiler converts iterator blocks (yield return) into state machine. State machine is very expensive in this case.
You can read more here: http://csharpindepth.com/articles/chapter6/iteratorblockimplementation.aspx
I used yield return to give me results from an algorithm. Every result is based on previous result, but I don't need all all of them. I used foreach with yield return to inspect each result and break the foreach loop if I get a result meet my requirement.
The algorithm was decent complex, so I think there was some decent work involved for saving states between each yield returns.
I noticed it was 3%-5% percentage slower than traditional return, but the improvement I get form not needing to generate all results is much much bigger than the loss of performance.
The .ToList() while necessary to really completing the otherwise deferred iteration of IEnumerable, hinders of measuring the core part.
At least it is important of initializing the list to the known size:
const int listSize=2000000;
var tempList = new List(listSize);
...
List tempList = YieldReturnTest(listSize).ToList();
Remark: Both calls took about the same time on my machine.. No difference (Mono 4 on repl.it).

Performance reading large Dataset from Multiple parallel threads

I’m working on a Genetic Machine Learning project developed in .Net (as opposed to Matlab – My Norm). I’m no pro .net coder so excuse any noobish implementations.
The project itself is huge so I won’t bore you with the full details but basically a population of Artificial Neural Networks (like decision trees) are each evaluated on a problem domain that in this case uses a stream of sensory inputs. The top performers in the population are allowed to breed and produced offspring (that inherit tendencies from both parents) and the poor performers are killed off or breed-out of the population. Evolution continues until an acceptable solution is found. Once found, the final evolved ‘Network’ is extracted from the lab and placed in a light-weight real-world application. The technique can be used to develop very complex control solution that would be almost impossible or too time consuming to program normally, like automated Car driving, mechanical stability control, datacentre load balancing etc, etc.
Anyway, the project has been a huge success so far and is producing amazing results, but the only problem is the very slow performance once I move to larger datasets. I’m hoping is just my code, so would really appreciate some expert help.
In this project, convergence to a solution close to an ideal can often take around 7 days of processing! Just making a little tweak to a parameter and waiting for results is just too painful.
Basically, multiple parallel threads need to read sequential sections of a very large dataset (the data does not change once loaded). The dataset consists of around 300 to 1000 Doubles in a row and anything over 500k rows. As the dataset can exceed the .Net object limit of 2GB, it can’t be stored in normal 2d array – The simplest way round this was to use a Generic List of single arrays.
The parallel scalability seems to be a big limiting factor as running the code on a beast of a server with 32 Xeon cores that normally eats Big dataset for breakfast does not yield much of a performance gain over a Corei3 desktop!
Performance gains quickly dwindle away as the number of cores increases.
From profiling the code (with my limited knowledge) I get the impression that there is a huge amount of contention reading the dataset from multiple threads.
I’ve tried experimenting with different dataset implementations using Jagged arrays and various concurrent collections but to no avail.
I’ve knocked up a quick and dirty bit of code for benchmarking that is similar to the core implementation of the original and still exhibits the similar read performance issues and parallel scalability issues.
Any thoughts or suggestions would be much appreciated or confirmation that this is the best I’m going to get.
Many thanks
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;
//Benchmark script to time how long it takes to read dataset per iteration
namespace Benchmark_Simple
{
class Program
{
public static TrainingDataSet _DataSet;
public static int Features = 100; //Real test will require 300+
public static int Rows = 200000; //Real test will require 500K+
public static int _PopulationSize = 500; //Real test will require 1000+
public static int _Iterations = 10;
public static List<NeuralNetwork> _NeuralNetworkPopulation = new List<NeuralNetwork>();
static void Main()
{
Stopwatch _Stopwatch = new Stopwatch();
//Create Dataset
Console.WriteLine("Creating Training DataSet");
_DataSet = new TrainingDataSet(Features, Rows);
Console.WriteLine("Finished Creating Training DataSet");
//Create Neural Network Population
for (int i = 0; i <= _PopulationSize - 1; i++)
{
_NeuralNetworkPopulation.Add(new NeuralNetwork());
}
//Main Loop
for (int i = 0; i <= _Iterations - 1; i++)
{
_Stopwatch.Restart();
Parallel.ForEach(_NeuralNetworkPopulation, _Network => { EvaluateNetwork(_Network); });
//######## Removed for simplicity ##########
//Run Evolutionary Genetic Algorithm on population - I.E. Breed the strong, kill of the weak
//##########################################
//Repeat until acceptable solution is found
Console.WriteLine("Iteration time: {0}", _Stopwatch.ElapsedMilliseconds / 1000);
_Stopwatch.Stop();
}
Console.ReadLine();
}
private static void EvaluateNetwork(NeuralNetwork Network)
{
//Evaluate network on 10% of the Training Data at a random starting point
double Score = 0;
Random Rand = new Random();
int Count = (Rows / 100) * 10;
int RandonStart = Rand.Next(0, Rows - Count);
//The data must be read sequentially
for (int i = RandonStart; i <= RandonStart + Count; i++)
{
double[] NetworkInputArray = _DataSet.GetDataRow(i);
//####### Dummy Evaluation - just give it somthing to do for the sake of it
double[] Temp = new double[NetworkInputArray.Length + 1];
for (int j = 0; j <= NetworkInputArray.Length - 1; j++)
{
Temp[j] = Math.Log(NetworkInputArray[j] * Rand.NextDouble());
}
Score += Rand.NextDouble();
//##################
}
Network.Score = Score;
}
public class TrainingDataSet
{
//Simple demo class of fake data for benchmarking
private List<double[]> DataList = new List<double[]>();
public TrainingDataSet(int Features, int Rows)
{
Random Rand = new Random();
for (int i = 1; i <= Rows; i++)
{
double[] NewRow = new double[Features];
for (int j = 0; j <= Features - 1; j++)
{
NewRow[j] = Rand.NextDouble();
}
DataList.Add(NewRow);
}
}
public double[] GetDataRow(int Index)
{
return DataList[Index];
}
}
public class NeuralNetwork
{
//Simple Class to represent a dummy Neural Network -
private double _Score;
public NeuralNetwork()
{
}
public double Score
{
get { return _Score; }
set { _Score = value; }
}
}
}
}
The first thing is that the only way to answer any performance questions is by profiling the application. I'm using the VS 2012 builtin profiler - there are others https://stackoverflow.com/a/100490/19624
From an initial read through the code, i.e. a static analysis the only thing that jumped out at me was the continual reallocation of Temp inside the loop; this is not efficient and if possible needs moving outside of the loop.
With a profiler you can see what's happening:
I profiled first using the code you posted, (top marks to you for posting a full compilable example of the problem, if you hadn't I wouldn't be answering this now).
This shows me that the bulk is in the inside of the loop, I moved the allocation to the the Parallel.ForEach loop.
Parallel.ForEach(_NeuralNetworkPopulation, _Network =>
{
double[] Temp = new double[Features + 1];
EvaluateNetwork(_Network, Temp);
});
So what I can see from the above is that there is 4.4% wastage on the reallocation; but the probably unsurprising thing is that it is the inner loop that is taking 87.6%.
This takes me to my first rule of optimisation which is to first to review your algorithm rather than optimizing the code. A poor implementation of a good algorithm is usually faster than a highly optimized poor algorithm.
Removing the repeated allocate of Temp changes the picture slightly;
Also worth tuning a bit by specifying the parallelism; I've found that Parallel.ForEach is good enough for what I use it for, but again you may get better results from manually partitioning the work up into queues.
Parallel.ForEach(_NeuralNetworkPopulation,
new ParallelOptions { MaxDegreeOfParallelism = 32 },
_Network =>
{
double[] Temp = new double[Features + 1];
EvaluateNetwork(_Network, Temp);
});
Whilst running I'm getting what I'd expect in terms of CPU usage: although my machine was also running another lengthy process which was taking the base level (the peak in the chart below is when profiling this program).
So to summarize
Review the most frequently executed part and come up with new algorithm if possible.
Profile on the target machine
Only when you're sure about (1) above is it then worth looking at optimising the algorithm; considering the following
a) Code optimisations
b) Memory tuning / partioning of data to keep as much in cache
c) Improvements to threading usage

When should I use a List vs a LinkedList

When is it better to use a List vs a LinkedList?
In most cases, List<T> is more useful. LinkedList<T> will have less cost when adding/removing items in the middle of the list, whereas List<T> can only cheaply add/remove at the end of the list.
LinkedList<T> is only at it's most efficient if you are accessing sequential data (either forwards or backwards) - random access is relatively expensive since it must walk the chain each time (hence why it doesn't have an indexer). However, because a List<T> is essentially just an array (with a wrapper) random access is fine.
List<T> also offers a lot of support methods - Find, ToArray, etc; however, these are also available for LinkedList<T> with .NET 3.5/C# 3.0 via extension methods - so that is less of a factor.
Thinking of a linked list as a list can be a bit misleading. It's more like a chain. In fact, in .NET, LinkedList<T> does not even implement IList<T>. There is no real concept of index in a linked list, even though it may seem there is. Certainly none of the methods provided on the class accept indexes.
Linked lists may be singly linked, or doubly linked. This refers to whether each element in the chain has a link only to the next one (singly linked) or to both the prior/next elements (doubly linked). LinkedList<T> is doubly linked.
Internally, List<T> is backed by an array. This provides a very compact representation in memory. Conversely, LinkedList<T> involves additional memory to store the bidirectional links between successive elements. So the memory footprint of a LinkedList<T> will generally be larger than for List<T> (with the caveat that List<T> can have unused internal array elements to improve performance during append operations.)
They have different performance characteristics too:
Append
LinkedList<T>.AddLast(item) constant time
List<T>.Add(item) amortized constant time, linear worst case
Prepend
LinkedList<T>.AddFirst(item) constant time
List<T>.Insert(0, item) linear time
Insertion
LinkedList<T>.AddBefore(node, item) constant time
LinkedList<T>.AddAfter(node, item) constant time
List<T>.Insert(index, item) linear time
Removal
LinkedList<T>.Remove(item) linear time
LinkedList<T>.Remove(node) constant time
List<T>.Remove(item) linear time
List<T>.RemoveAt(index) linear time
Count
LinkedList<T>.Count constant time
List<T>.Count constant time
Contains
LinkedList<T>.Contains(item) linear time
List<T>.Contains(item) linear time
Clear
LinkedList<T>.Clear() linear time
List<T>.Clear() linear time
As you can see, they're mostly equivalent. In practice, the API of LinkedList<T> is more cumbersome to use, and details of its internal needs spill out into your code.
However, if you need to do many insertions/removals from within a list, it offers constant time. List<T> offers linear time, as extra items in the list must be shuffled around after the insertion/removal.
Linked lists provide very fast insertion or deletion of a list member. Each member in a linked list contains a pointer to the next member in the list so to insert a member at position i:
update the pointer in member i-1 to point to the new member
set the pointer in the new member to point to member i
The disadvantage to a linked list is that random access is not possible. Accessing a member requires traversing the list until the desired member is found.
Edit
Please read the comments to this answer. People claim I did not do
proper tests. I agree this should not be an accepted answer. As I was
learning I did some tests and felt like sharing them.
Original answer...
I found interesting results:
// Temporary class to show the example
class Temp
{
public decimal A, B, C, D;
public Temp(decimal a, decimal b, decimal c, decimal d)
{
A = a; B = b; C = c; D = d;
}
}
Linked list (3.9 seconds)
LinkedList<Temp> list = new LinkedList<Temp>();
for (var i = 0; i < 12345678; i++)
{
var a = new Temp(i, i, i, i);
list.AddLast(a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
List (2.4 seconds)
List<Temp> list = new List<Temp>(); // 2.4 seconds
for (var i = 0; i < 12345678; i++)
{
var a = new Temp(i, i, i, i);
list.Add(a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
Even if you only access data essentially it is much slower!! I say never use a linkedList.
Here is another comparison performing a lot of inserts (we plan on inserting an item at the middle of the list)
Linked List (51 seconds)
LinkedList<Temp> list = new LinkedList<Temp>();
for (var i = 0; i < 123456; i++)
{
var a = new Temp(i, i, i, i);
list.AddLast(a);
var curNode = list.First;
for (var k = 0; k < i/2; k++) // In order to insert a node at the middle of the list we need to find it
curNode = curNode.Next;
list.AddAfter(curNode, a); // Insert it after
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
List (7.26 seconds)
List<Temp> list = new List<Temp>();
for (var i = 0; i < 123456; i++)
{
var a = new Temp(i, i, i, i);
list.Insert(i / 2, a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
Linked List having reference of location where to insert (.04 seconds)
list.AddLast(new Temp(1,1,1,1));
var referenceNode = list.First;
for (var i = 0; i < 123456; i++)
{
var a = new Temp(i, i, i, i);
list.AddLast(a);
list.AddBefore(referenceNode, a);
}
decimal sum = 0;
foreach (var item in list)
sum += item.A;
So only if you plan on inserting several items and you also somewhere have the reference of where you plan to insert the item then use a linked list. Just because you have to insert a lot of items it does not make it faster because searching the location where you will like to insert it takes time.
My previous answer was not enough accurate.
As truly it was horrible :D
But now I can post much more useful and correct answer.
I did some additional tests. You can find it's source by the following link and reCheck it on your environment by your own: https://github.com/ukushu/DataStructuresTestsAndOther.git
Short results:
Array need to use:
So often as possible. It's fast and takes smallest RAM range for same amount information.
If you know exact count of cells needed
If data saved in array < 85000 b (85000/32 = 2656 elements for integer data)
If needed high Random Access speed
List need to use:
If needed to add cells to the end of list (often)
If needed to add cells in the beginning/middle of the list (NOT OFTEN)
If data saved in array < 85000 b (85000/32 = 2656 elements for integer data)
If needed high Random Access speed
LinkedList need to use:
If needed to add cells in the beginning/middle/end of the list (often)
If needed only sequential access (forward/backward)
If you need to save LARGE items, but items count is low.
Better do not use for large amount of items, as it's use additional memory for links.
More details:
Interesting to know:
LinkedList<T> internally is not a List in .NET. It's even does not implement IList<T>. And that's why there are absent indexes and methods related to indexes.
LinkedList<T> is node-pointer based collection. In .NET it's in doubly linked implementation. This means that prior/next elements have link to current element. And data is fragmented -- different list objects can be located in different places of RAM. Also there will be more memory used for LinkedList<T> than for List<T> or Array.
List<T> in .Net is Java's alternative of ArrayList<T>. This means that this is array wrapper. So it's allocated in memory as one contiguous block of data. If allocated data size exceeds 85000 bytes, it will be moved to Large Object Heap. Depending on the size, this can lead to heap fragmentation(a mild form of memory leak). But in the same time if size < 85000 bytes -- this provides a very compact and fast-access representation in memory.
Single contiguous block is preferred for random access performance and memory consumption but for collections that need to change size regularly a structure such as an Array generally need to be copied to a new location whereas a linked list only needs to manage the memory for the newly inserted/deleted nodes.
The difference between List and LinkedList lies in their underlying implementation. List is array based collection (ArrayList). LinkedList is node-pointer based collection (LinkedListNode). On the API level usage, both of them are pretty much the same since both implement same set of interfaces such as ICollection, IEnumerable, etc.
The key difference comes when performance matter. For example, if you are implementing the list that has heavy "INSERT" operation, LinkedList outperforms List. Since LinkedList can do it in O(1) time, but List may need to expand the size of underlying array. For more information/detail you might want to read up on the algorithmic difference between LinkedList and array data structures. http://en.wikipedia.org/wiki/Linked_list and Array
Hope this help,
The primary advantage of linked lists over arrays is that the links provide us with the capability to rearrange the items efficiently.
Sedgewick, p. 91
A common circumstance to use LinkedList is like this:
Suppose you want to remove many certain strings from a list of strings with a large size, say 100,000. The strings to remove can be looked up in HashSet dic, and the list of strings is believed to contain between 30,000 to 60,000 such strings to remove.
Then what's the best type of List for storing the 100,000 Strings? The answer is LinkedList. If the they are stored in an ArrayList, then iterating over it and removing matched Strings whould take up
to billions of operations, while it takes just around 100,000 operations by using an iterator and the remove() method.
LinkedList<String> strings = readStrings();
HashSet<String> dic = readDic();
Iterator<String> iterator = strings.iterator();
while (iterator.hasNext()){
String string = iterator.next();
if (dic.contains(string))
iterator.remove();
}
When you need built-in indexed access, sorting (and after this binary searching), and "ToArray()" method, you should use List.
Essentially, a List<> in .NET is a wrapper over an array. A LinkedList<> is a linked list. So the question comes down to, what is the difference between an array and a linked list, and when should an array be used instead of a linked list. Probably the two most important factors in your decision of which to use would come down to:
Linked lists have much better insertion/removal performance, so long as the insertions/removals are not on the last element in the collection. This is because an array must shift all remaining elements that come after the insertion/removal point. If the insertion/removal is at the tail end of the list however, this shift is not needed (although the array may need to be resized, if its capacity is exceeded).
Arrays have much better accessing capabilities. Arrays can be indexed into directly (in constant time). Linked lists must be traversed (linear time).
This is adapted from Tono Nam's accepted answer correcting a few wrong measurements in it.
The test:
static void Main()
{
LinkedListPerformance.AddFirst_List(); // 12028 ms
LinkedListPerformance.AddFirst_LinkedList(); // 33 ms
LinkedListPerformance.AddLast_List(); // 33 ms
LinkedListPerformance.AddLast_LinkedList(); // 32 ms
LinkedListPerformance.Enumerate_List(); // 1.08 ms
LinkedListPerformance.Enumerate_LinkedList(); // 3.4 ms
//I tried below as fun exercise - not very meaningful, see code
//sort of equivalent to insertion when having the reference to middle node
LinkedListPerformance.AddMiddle_List(); // 5724 ms
LinkedListPerformance.AddMiddle_LinkedList1(); // 36 ms
LinkedListPerformance.AddMiddle_LinkedList2(); // 32 ms
LinkedListPerformance.AddMiddle_LinkedList3(); // 454 ms
Environment.Exit(-1);
}
And the code:
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
namespace stackoverflow
{
static class LinkedListPerformance
{
class Temp
{
public decimal A, B, C, D;
public Temp(decimal a, decimal b, decimal c, decimal d)
{
A = a; B = b; C = c; D = d;
}
}
static readonly int start = 0;
static readonly int end = 123456;
static readonly IEnumerable<Temp> query = Enumerable.Range(start, end - start).Select(temp);
static Temp temp(int i)
{
return new Temp(i, i, i, i);
}
static void StopAndPrint(this Stopwatch watch)
{
watch.Stop();
Console.WriteLine(watch.Elapsed.TotalMilliseconds);
}
public static void AddFirst_List()
{
var list = new List<Temp>();
var watch = Stopwatch.StartNew();
for (var i = start; i < end; i++)
list.Insert(0, temp(i));
watch.StopAndPrint();
}
public static void AddFirst_LinkedList()
{
var list = new LinkedList<Temp>();
var watch = Stopwatch.StartNew();
for (int i = start; i < end; i++)
list.AddFirst(temp(i));
watch.StopAndPrint();
}
public static void AddLast_List()
{
var list = new List<Temp>();
var watch = Stopwatch.StartNew();
for (var i = start; i < end; i++)
list.Add(temp(i));
watch.StopAndPrint();
}
public static void AddLast_LinkedList()
{
var list = new LinkedList<Temp>();
var watch = Stopwatch.StartNew();
for (int i = start; i < end; i++)
list.AddLast(temp(i));
watch.StopAndPrint();
}
public static void Enumerate_List()
{
var list = new List<Temp>(query);
var watch = Stopwatch.StartNew();
foreach (var item in list)
{
}
watch.StopAndPrint();
}
public static void Enumerate_LinkedList()
{
var list = new LinkedList<Temp>(query);
var watch = Stopwatch.StartNew();
foreach (var item in list)
{
}
watch.StopAndPrint();
}
//for the fun of it, I tried to time inserting to the middle of
//linked list - this is by no means a realistic scenario! or may be
//these make sense if you assume you have the reference to middle node
//insertion to the middle of list
public static void AddMiddle_List()
{
var list = new List<Temp>();
var watch = Stopwatch.StartNew();
for (var i = start; i < end; i++)
list.Insert(list.Count / 2, temp(i));
watch.StopAndPrint();
}
//insertion in linked list in such a fashion that
//it has the same effect as inserting into the middle of list
public static void AddMiddle_LinkedList1()
{
var list = new LinkedList<Temp>();
var watch = Stopwatch.StartNew();
LinkedListNode<Temp> evenNode = null, oddNode = null;
for (int i = start; i < end; i++)
{
if (list.Count == 0)
oddNode = evenNode = list.AddLast(temp(i));
else
if (list.Count % 2 == 1)
oddNode = list.AddBefore(evenNode, temp(i));
else
evenNode = list.AddAfter(oddNode, temp(i));
}
watch.StopAndPrint();
}
//another hacky way
public static void AddMiddle_LinkedList2()
{
var list = new LinkedList<Temp>();
var watch = Stopwatch.StartNew();
for (var i = start + 1; i < end; i += 2)
list.AddLast(temp(i));
for (int i = end - 2; i >= 0; i -= 2)
list.AddLast(temp(i));
watch.StopAndPrint();
}
//OP's original more sensible approach, but I tried to filter out
//the intermediate iteration cost in finding the middle node.
public static void AddMiddle_LinkedList3()
{
var list = new LinkedList<Temp>();
var watch = Stopwatch.StartNew();
for (var i = start; i < end; i++)
{
if (list.Count == 0)
list.AddLast(temp(i));
else
{
watch.Stop();
var curNode = list.First;
for (var j = 0; j < list.Count / 2; j++)
curNode = curNode.Next;
watch.Start();
list.AddBefore(curNode, temp(i));
}
}
watch.StopAndPrint();
}
}
}
You can see the results are in accordance with theoretical performance others have documented here. Quite clear - LinkedList<T> gains big time in case of insertions. I haven't tested for removal from the middle of list, but the result should be the same. Of course List<T> has other areas where it performs way better like O(1) random access.
Use LinkedList<> when
You don't know how many objects are coming through the flood gate. For example, Token Stream.
When you ONLY wanted to delete\insert at the ends.
For everything else, it is better to use List<>.
I do agree with most of the point made above. And I also agree that List looks like a more obvious choice in most of the cases.
But, I just want to add that there are many instance where LinkedList are far better choice than List for better efficiency.
Suppose you are traversing through the elements and you want to perform lot of insertions/deletion; LinkedList does it in linear O(n) time, whereas List does it in quadratic O(n^2) time.
Suppose you want to access bigger objects again and again, LinkedList become very more useful.
Deque() and queue() are better implemented using LinkedList.
Increasing the size of LinkedList is much easier and better once you are dealing with many and bigger objects.
Hope someone would find these comments useful.
In .NET, Lists are represented as Arrays. Therefore using a normal List would be quite faster in comparison to LinkedList.That is why people above see the results they see.
Why should you use the List?
I would say it depends. List creates 4 elements if you don't have any specified. The moment you exceed this limit, it copies stuff to a new array, leaving the old one in the hands of the garbage collector. It then doubles the size. In this case, it creates a new array with 8 elements. Imagine having a list with 1 million elements, and you add 1 more. It will essentially create a whole new array with double the size you need. The new array would be with 2Mil capacity however, you only needed 1Mil and 1. Essentially leaving stuff behind in GEN2 for the garbage collector and so on. So it can actually end up being a huge bottleneck. You should be careful about that.
I asked a similar question related to performance of the LinkedList collection, and discovered Steven Cleary's C# implement of Deque was a solution. Unlike the Queue collection, Deque allows moving items on/off front and back. It is similar to linked list, but with improved performance.

Categories