If I want to insert data into these collection classes — Dictionary, List and SortedList — which will require the least time to perform the insertion? Can you give me code to explain that process?
List<T> will have the fastest insertion at the end.
LinkedList<T> will have the fastest insertion at the head.
The difference will be minuscule for most practical applications; you should use whichever one fits your needs.
Inserting a value into a LinkedList is an O(1) operation. A List (implemented on top of an array) may require additional allocation and copying of items.
If your performance requirements are strict, you should measure in your environment and on your data before deciding.
I'll offer some guesses that you should not take for granted until you perform your own measurements:
If you know the number of elements in advance, just use a pre-allocated List (or array).
If you don't:
Use a list of chunks (i.e. LinkedList<List<T>>) to avoid List resizes.
Or, for simplicity, you can just use the List and incur some performance penalty when it is resized to accept more elements. I'm not sure whether this penalty would justify using Dictionary or LinkedList instead - but you will be if you measure ;)
All this is under the assumption that you don't care where in the collection the new element is inserted or how you retrieve it later... If you do care, then you'll pick your data structure based on that, and not on insert performance alone.
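To give the code the question asks for, here is a minimal, unscientific benchmark sketch (timings vary by machine and data, and SortedList is shown in its best case of keys arriving in ascending order; for real decisions use a proper benchmarking tool):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class InsertBenchmark
{
    static void Main()
    {
        const int n = 1000000;

        var sw = Stopwatch.StartNew();
        var list = new List<int>();
        for (int i = 0; i < n; i++) list.Add(i);        // append at the end
        Console.WriteLine("List<int>.Add:    " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        var dict = new Dictionary<int, int>();
        for (int i = 0; i < n; i++) dict.Add(i, i);     // hash-based insert
        Console.WriteLine("Dictionary.Add:   " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        var sorted = new SortedList<int, int>();
        for (int i = 0; i < n; i++) sorted.Add(i, i);   // keys already in order (best case)
        Console.WriteLine("SortedList.Add:   " + sw.ElapsedMilliseconds + " ms");
    }
}

With random (unsorted) keys, SortedList.Add degrades to O(n) per insertion because existing entries have to be shifted, which usually makes it the slowest of the three for bulk insertion.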
Related
I'm implementing A* in C# (not for pathfinding) and I need a Dictionary to hold open nodes, because I need fast insertion and fast lookup. I want to get the first open node from the Dictionary (it can be any random node). Using Dictionary.First() is very slow. If I use an iterator, MoveNext() is still using 15% of the whole CPU time of my program. What is the fastest way to get any random element from a Dictionary?
I suggest you use a specialized data structure for this purpose, as the regular Dictionary was not made for this.
In Java, I would probably recommend LinkedHashMap, for which there are custom C# equivalents (sadly not built-in).
It is, however, rather easy to implement this yourself in a reasonable fashion. You could, for instance, use a regular dictionary with tuples that hold the actual data as well as a pointer to the next element. Or you could keep a secondary stack that simply stores all keys in order of addition. Just some ideas. I never implemented nor profiled this myself, but I'm sure you'll find a good way.
Oh, and if you didn't already, you might also want to check the hash code distribution, to make sure there is no problem there.
Finding the first (or an index) element in a dictionary is actually O(n) because it has to iterate over every bucket until a non-empty one is found, so MoveNext will actually be the fastest way.
If this were a problem, I would consider using something like a stack, where pop is an O(1) operation.
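As a rough sketch of the secondary-stack idea from the answer above (the class and member names here are made up for illustration, and it assumes entries are only removed through PopAny so the key stack stays in sync with the dictionary):

using System.Collections.Generic;

// Hypothetical wrapper: O(1) add, O(1) lookup by key, O(1) "give me any element".
class OpenSet<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _map = new Dictionary<TKey, TValue>();
    private readonly Stack<TKey> _keys = new Stack<TKey>();

    public int Count { get { return _map.Count; } }

    public void Add(TKey key, TValue value)
    {
        _map.Add(key, value);
        _keys.Push(key);
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        return _map.TryGetValue(key, out value);
    }

    // Removes and returns an arbitrary entry without scanning dictionary buckets.
    public KeyValuePair<TKey, TValue> PopAny()
    {
        TKey key = _keys.Pop();          // O(1)
        TValue value = _map[key];
        _map.Remove(key);
        return new KeyValuePair<TKey, TValue>(key, value);
    }
}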
Try
Enumerable.ToList(dictionary.Values)[new Random().Next(dictionary.Count)]
It should have pretty good performance, but watch out for memory usage if your dictionary is huge. Obviously, take care not to create the Random object every time, and you might be able to cache the return value of Enumerable.ToList if its members don't change too frequently.
I have been looking at the .NET libraries using ILSpy and have come across the List<T> class definition in the System.Collections.Generic namespace. I see that the class uses methods like this one:
// System.Collections.Generic.List<T>
/// <summary>Removes all elements from the <see cref="T:System.Collections.Generic.List`1" />.</summary>
public void Clear()
{
    if (this._size > 0)
    {
        Array.Clear(this._items, 0, this._size);
        this._size = 0;
    }
    this._version++;
}
So, the Clear() method of the List<T> class actually uses Array.Clear method. I have seen many other List<T> methods that use Array stuff in the body.
Does this mean that List<T> is actually an undercover Array, or does List only use some of Array's methods?
I know lists are type safe and don't require boxing/unboxing but this has confused me a bit.
The list class is not itself an array. In other words, it does not derive from an array. Instead it encapsulates an array that is used by the implementation to hold the list's member elements.
Since List<T> offers random access to its elements, and those elements are indexed 0..Count-1, using an array to store the elements is the obvious implementation.
This tends to surprise C++ programmers who know std::list, which is a linked list. Linked lists are covered in .NET as well, with the LinkedList class, and have the same perf characteristics: O(1) for inserts and deletes.
You should, however, in general avoid it. Linked lists do not perform well on modern processors, which greatly depend on the CPU caches to get reasonable performance out of memory that is many times slower than the execution core. A simple array is by far the data structure that takes most advantage of the cache: accessing an element gives very high odds that subsequent elements are present in the cache as well. That is not the case for a linked list; its elements tend to be scattered throughout the address space, making a cache miss likely. Cache misses can be very expensive, as much as 200 cycles with the CPU doing nothing but waiting on the memory subsystem to supply the data.
But do keep the perf characteristics in mind: adding or removing an element that is not at the end of the List costs O(n), just like an array. And a large List can generate a lot of garbage as the array needs to be expanded; setting the Capacity property up front can help a lot to avoid that. More about that in this answer. Otherwise, the exact same concerns apply as for std::vector<>.
Yes, List<T> uses an array internally to store the items, although in most cases the array is actually larger than the number of elements in the collection -- it has some extra "padding" at the end so that you can add new items without it having to reallocate memory every time. It keeps track of the actual size of the collection with a separate field (you can see this._size in your generated code). When you add more elements than the current array has room for, it will automatically allocate a new larger array -- twice as big, I think -- and copy over all the existing elements.
If you're concerned about a List<T> using more memory than necessary, you can set the size of the array explicitly with the constructor overload that accepts a capacity parameter, if you know the size in advance, or call the TrimExcess() method to make sure the array is (close to) the actual size of the collection.
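For example (a small illustrative snippet, not from the decompiled source):

// Pre-size the backing array when the final count is roughly known up front.
var names = new List<string>(10000);          // Capacity == 10000, Count == 0

for (int i = 0; i < 100; i++) names.Add("item" + i);

names.TrimExcess();                           // shrinks the backing array close to Count
Console.WriteLine(names.Count + " / " + names.Capacity);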
Random access memory is an array, so in that sense all data structures from linked-lists to heaps and beyond, that rely on random-access to memory for their performance behaviour, are built on the array that is system memory. It is more a question of how many-levels of abstraction are in between.
Of course in a modern virtual memory machine, the random-access system memory is itself an abstraction built on a complicated virtual-memory model of multi-tier pipelined caches, non-cached RAM, and disk.
When using ToList(), is there a performance impact that needs to be considered?
I was writing a query to retrieve files from a directory, which is the query:
string[] imageArray = Directory.GetFiles(directory);
However, since I like to work with List<> instead, I decided to put in...
List<string> imageList = Directory.GetFiles(directory).ToList();
So, is there some sort of performance impact that should be considered when deciding to do a conversion like this - or only to be considered when dealing with a large number of files? Is this a negligible conversion?
IEnumerable<T>.ToList()
Yes, IEnumerable<T>.ToList() does have a performance impact; it is an O(n) operation, though it will likely only require attention in performance-critical code.
The ToList() operation will use the List(IEnumerable<T> collection) constructor. This constructor must make a copy of the array (more generally, of the IEnumerable<T>); otherwise future modifications to the list would also change the source T[], which generally wouldn't be desirable.
I would like to reiterate that this will only make a difference with a huge list; copying chunks of memory is quite a fast operation to perform.
Handy tip: As vs To
You'll notice in LINQ there are several methods that start with As (such as AsEnumerable()) and To (such as ToList()). The methods that start with To require a conversion like the above (i.e. may impact performance), while the methods that start with As do not and just require a cast or a simple operation.
Additional details on List<T>
Here is a little more detail on how List<T> works in case you're interested :)
A List<T> uses a construct called a dynamic array, which needs to be resized on demand; this resize event copies the contents of the old array to a new array. So it starts off small and increases in size as required.
This is the difference between the Capacity and Count properties on List<T>. Capacity refers to the size of the array behind the scenes, while Count is the number of items in the List<T>, which is always <= Capacity. So when adding an item would push Count past Capacity, the capacity of the List<T> is doubled and the array is copied.
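You can watch this happen with a small snippet like the one below (the exact capacity values are an implementation detail and may differ between runtimes):

var list = new List<int>();
int lastCapacity = -1;
for (int i = 0; i < 100; i++)
{
    list.Add(i);
    if (list.Capacity != lastCapacity)
    {
        // Typically prints capacities of 4, 8, 16, 32, 64, 128 as the backing array doubles.
        Console.WriteLine("Count = " + list.Count + ", Capacity = " + list.Capacity);
        lastCapacity = list.Capacity;
    }
}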
Is there a performance impact when calling ToList()?
Yes, of course. Theoretically even i++ has a performance impact; it slows the program by maybe a few ticks.
What does .ToList do?
When you invoke .ToList, the code calls Enumerable.ToList(), which is an extension method that returns new List<TSource>(source). In the corresponding constructor, in the worst case, it goes through the item container and adds the items one by one into a new container. So its behavior has little effect on performance; it's very unlikely to be the performance bottleneck of your application.
What's wrong with the code in the question
Directory.GetFiles goes through the folder and returns all the file names into memory at once; there is a potential risk that the string[] costs a lot of memory, slowing everything down.
What should be done then
It depends. If you (as well as your business logic) guarantee that the number of files in the folder is always small, the code is acceptable. But it's still suggested to use the lazy version: Directory.EnumerateFiles, introduced in .NET 4. This is much more like a query, which will not be executed immediately; you can add more query operators to it, like:
Directory.EnumerateFiles(myPath).Any(s => s.Contains("myfile"))
which will stop searching the path as soon as a file whose name contains "myfile" is found. This obviously has better performance than .GetFiles().
Is there a performance impact when calling ToList()?
Yes, there is. Using the extension method Enumerable.ToList() will construct a new List<T> object from the IEnumerable<T> source collection, which of course has a performance impact.
However, understanding List<T> may help you determine if the performance impact is significant.
List<T> uses an array (T[]) to store the elements of the list. Arrays cannot be extended once they are allocated, so List<T> will use an over-sized array to store the elements of the list. When the List<T> grows beyond the size of the underlying array, a new array has to be allocated and the contents of the old array have to be copied to the new, larger array before the list can grow.
When a new List<T> is constructed from an IEnumerable<T> there are two cases:
The source collection implements ICollection<T>: then ICollection<T>.Count is used to get the exact size of the source collection, and a matching backing array is allocated before all elements of the source collection are copied to the backing array using ICollection<T>.CopyTo(). This operation is quite efficient and will probably map to a CPU instruction for copying blocks of memory. However, in terms of performance, memory is required for the new array and CPU cycles are required for copying all the elements.
Otherwise the size of the source collection is unknown and the enumerator of IEnumerable<T> is used to add each source element one at a time to the new List<T>. Initially the backing array is empty and an array of size 4 is created. Then when this array is too small the size is doubled so the backing array grows like this 4, 8, 16, 32 etc. Every time the backing array grows it has to be reallocated and all elements stored so far have to be copied. This operation is much more costly compared to the first case where an array of the correct size can be created right away.
Also, if your source collection contains say 33 elements the list will end up using an array of 64 elements wasting some memory.
In your case the source collection is an array which implements ICollection<T> so the performance impact is not something you should be concerned about unless your source array is very large. Calling ToList() will simply copy the source array and wrap it in a List<T> object. Even the performance of the second case is not something to worry about for small collections.
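To make the two cases concrete, here is a small illustrative snippet (variable names are made up, and it assumes using System.Linq):

int[] numbers = { 1, 2, 3 };                         // arrays implement ICollection<T>
List<int> fast = numbers.ToList();                   // exact-size backing array + CopyTo

IEnumerable<int> query = numbers.Where(x => x > 1);  // deferred query, count unknown
List<int> slower = query.ToList();                   // enumerator path: grow-and-copy as needed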
It will be as (in)efficient as doing:
var list = new List<T>(items);
If you disassemble the source code of the constructor that takes an IEnumerable<T>, you will see it will do a few things:
Check whether collection implements ICollection<T>; if it does (arrays, lists, etc.), reading its Count is O(1). If it is a plain deferred IEnumerable<T>, there is no count to read and the items must be enumerated, forcing execution of the query.
If collection implements ICollection<T>, it will save the items in an internal array using the ICollection<T>.CopyTo method. This is O(n), where n is the length of the collection.
If collection does not implement ICollection<T>, it will iterate through the items of the collection, and will add them to an internal list.
So, yes, it will consume more memory, since it has to create a new list, and in the worst case, it will be O(n), since it will iterate through the collection to make a copy of each element.
"is there a performance impact that needs to be considered?"
The issue with your precise scenario is that, first and foremost, your real performance concern would be the hard-drive speed and the efficiency of the drive's cache.
From that perspective, the impact is surely negligible, to the point that no, it need not be considered.
But only do it if you really need the features of the List<> structure, whether to make you more productive, make your algorithm friendlier, or gain some other advantage. Otherwise you're just deliberately adding an insignificant performance hit for no reason at all, in which case, naturally, you shouldn't do it! :)
ToList() creates a new List and puts the elements into it, which means there is an associated cost with calling ToList(). For a small collection this won't be a very noticeable cost, but a huge collection can cause a performance hit when using ToList.
Generally you should not use ToList() unless the work you are doing cannot be done without converting the collection to a List. For example, if you just want to iterate through the collection, you don't need to call ToList.
If you are performing queries against a data source, for example a database using LINQ to SQL, the cost of calling ToList is much higher, because instead of deferred execution, i.e. loading items when needed (which can be beneficial in many scenarios), ToList instantly loads the items from the database into memory.
For retrieving a file list, the performance impact of ToList() is negligible, but that is not necessarily true for other scenarios. It really depends on where you are using it.
When calling it on an array, list, or other collection, you create a copy of the collection as a List<T>. The performance here depends on the size of the list. You should only do it when really necessary.
In your example, you call it on an array. It iterates over the array and adds the items one by one to a newly created list. So the performance impact depends on the number of files.
When calling on an IEnumerable<T>, you materialize the IEnumerable<T> (usually a query).
ToList will create a new list and copy the elements from the original source into the newly created list, so the only cost is copying the elements from the original source, and that depends on the source size.
Let's look at another example.
If you work against a database, run the ToList() method and check SQL Profiler for this code:
var IsExist = (from inc in entities.be_Settings
where inc.SettingName == "Number"
select inc).ToList().Count > 0;
The auto-generated query will look like this:
SELECT [Extent1].[SettingName] AS [SettingName], [Extent1].[SettingValue] AS [SettingValue] FROM [dbo].[be_Settings] AS [Extent1] WHERE N'Number' = [Extent1].[SettingName]
The SELECT query is executed by the ToList call, its results are stored in memory, and the existence of a record is checked by looking at the number of elements in the List. For example, if there are 1000 records in your table matching the criteria, all 1000 records are first fetched from the database and materialized as objects, then put into a List, and you only check the number of elements of that List. So this is a very inefficient way to do it.
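A leaner alternative, assuming the same entities context and that the provider can translate the predicate, is to let the database answer the existence check itself, for example with Any():

// Translates to an EXISTS / TOP(1)-style query instead of fetching every matching row.
bool isExist = entities.be_Settings.Any(inc => inc.SettingName == "Number");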
It's not exactly about List performance, but if you have a high-dimensional array you can use a HashSet instead of a List.
I regularly use the LINQ extension method ToDictionary, but am wondering about the performance. There is no parameter to define the capacity for the dictionary and with a list of 100k items or more, this could become an issue:
IList<int> list = new List<int> { 1, 2, ... , 1000000 };
IDictionary<int, string> dictionary = list.ToDictionary(x => x, x => x.ToString("D7"));
Does the implementation actually take the list.Count and passes it to the constructor for the dictionary?
Or is the resizing of the dictionary fast enough, so I don't really have to worry about it?
Does the implementation actually take the list.Count and passes it to the constructor for the dictionary?
No. According to ILSpy, the implementation is basically this:
Dictionary<TKey, TElement> dictionary = new Dictionary<TKey, TElement>(comparer);
foreach (TSource current in source)
{
    dictionary.Add(keySelector(current), elementSelector(current));
}
return dictionary;
If you profile your code and determine that the ToDictionary operation is your bottleneck, it's trivial to make your own function based on the above code.
Does the implementation actually take the list.Count and passes it to the constructor for the dictionary?
This is an implementation detail and it shouldn't matter to you.
Or is the resizing of the dictionary fast enough, so I don't really have to worry about it?
Well, I don't know. Only you know whether or not this is actually a bottleneck in your application, and whether or not the performance is acceptable. If you want to know if it's fast enough, write the code and time it. As Eric Lippert is wont to say, if you want to know how fast two horses are, do you pit them in a race against each other, or do you ask random strangers on the Internet which one is faster?
That said, I'm having a really hard time imagining this being a bottleneck in any realistic application. If adding items to a dictionary is a bottleneck in your application, you're doing something wrong.
I don't think it'll be a bottleneck, TBH. And in case you do have real complaints and issues, you should look into it at that time to see if you can improve it; maybe you can do paging instead of converting everything at once.
I don't know about resizing the dictionary, but checking the implementation with dotPeek.exe suggests that the implementation does not take the list length.
What the code basically does is:
create a new dictionary
iterate over sequence and add items
If you find this a bottleneck, it would be trivial to create your own extension method ToDictionaryWithCapacity that works on something that can have its length actually computed without iterating the whole thing.
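A possible sketch of such an extension method (the name ToDictionaryWithCapacity comes from this answer; everything else here is illustrative):

using System;
using System.Collections.Generic;

static class DictionaryExtensions
{
    // Pre-sizes the dictionary when the source is a collection with a cheaply known count.
    public static Dictionary<TKey, TValue> ToDictionaryWithCapacity<TSource, TKey, TValue>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        Func<TSource, TValue> valueSelector)
    {
        var collection = source as ICollection<TSource>;
        var dictionary = collection != null
            ? new Dictionary<TKey, TValue>(collection.Count)
            : new Dictionary<TKey, TValue>();

        foreach (TSource item in source)
            dictionary.Add(keySelector(item), valueSelector(item));

        return dictionary;
    }
}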
Just scanned the Dictionary implementation. Basically, when it starts to fill up, the internal storage is resized to roughly double its size, rounded to a nearby prime. So that should not happen too frequently.
Does the implementation actually take the list.Count and passes it to the constructor for the dictionary?
It doesn't. That's because calling Count() would enumerate the source, and then adding the items to the dictionary would enumerate the source a second time. It's not a good idea to enumerate the source twice; for example, this would fail on DataReaders.
Or is the resizing of the dictionary fast enough, so I don't really have to worry about it?
The Dictionary.Resize method is used to expand the dictionary. It allocates new internal arrays and copies the existing entries into them (using Array.Copy). The dictionary size is increased in prime-number steps.
This is not the fastest way, but fast enough if you do not know the size.
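And if you do know the size up front, you can skip the resizes entirely by passing a capacity to the Dictionary constructor, e.g. for the list from the question:

var lookup = new Dictionary<int, string>(list.Count);   // single allocation, no resizes
foreach (int x in list)
    lookup.Add(x, x.ToString("D7"));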
A group of related data, like a list of parts etc., can be handled either using arrays (an array of Parts) or using collections. I understand that when arrays are used, insertion, deletion and some other operations have a performance impact compared with collections. Does this mean that arrays are not used internally by the collections? If so, what data structure is used for collections like List, Collection etc.?
How are the collections handled internally?
List<T> uses an internal array. Removing/inserting items near the beginning of the list will be more expensive than doing the same near the end of the list, since the entire contents of the internal array need to be shifted in one direction. Also, once you try to add an item when the internal list is full, a new, bigger array will be constructed, the contents copied, and the old array discarded.
The Collection<T> class, when used with the parameterless constructor, uses a List<T> internally. So performance-wise they will be identical, with the exception of overhead caused by wrapping. (Essentially one more level of indirection, which is going to be negligible in most scenarios.)
LinkedList<T> is, as its name implies, a linked list. This will sacrifice iteration speed for insertion/removal speed. Since iterating means traversing pointers-to-pointers-to-pointers ad infinitum, this is going to take more work overall. Aside from the pointer traversal, two nodes may not be allocated anywhere near each other, reducing the effectiveness of CPU RAM caches.
However, the amount of time required to insert or remove a node is constant, since it requires the same number of operations no matter the state of the list. (This does not take into account any work that must be done to actually locate the item to remove, or to traverse the list to find the insertion point!)
If your primary concern with your collection is testing if something is in the collection, you might consider a HashSet<T> instead. Addition of items to the set will be relatively fast, somewhere between insertion into a list and a linked list. Removal of items will again be relatively fast. But the real gain is in lookup time -- testing if a HashSet<T> contains an item does not require iterating the entire list. On average it will perform faster than any list or linked list structure.
However, a HashSet<T> cannot contain equivalent items. If part of your requirements is that two items that are considered equal (by an Object.Equals(Object) overload, or by implementing IEquatable<T>) coexist independently in the collection, then you simply cannot use a HashSet<T>. Also, HashSet<T> does not guarantee insertion order, so you also can't use a HashSet<T> if maintaining some sort of ordering is important.
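For example, a simple membership test (illustrative snippet):

var list = new List<int> { 1, 2, 3, 4, 5 };
var set = new HashSet<int> { 1, 2, 3, 4, 5 };

bool inList = list.Contains(4);   // O(n): walks the backing array
bool inSet = set.Contains(4);     // O(1) on average: hash lookup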
There are two basic ways to implement a simple collection:
contiguous array
linked list
Contiguous arrays have performance disadvantages for the operations you mentioned because the memory space of the collection is either preallocated or allocated based on the contents of the collection. Thus deletion or insertion requires moving many array elements to keep the entire collection contiguous and in the proper order.
Linked lists remove these issues because the items in the collection do not need to be stored in memory contiguously. Instead each element contains a reference to one or more of the other elements. Thus, when an insertion is made, the item in question is created anywhere in memory and only the references on one or two of the elements already in the collection need to be modified.
For example:
LinkedList<object> c = new LinkedList<object>(); // a linked list
object[] a = new object[] { }; // a contiguous array
This is simplified of course. The internals of LinkedList<> are doubtless more complex than a simple singly or doubly linked list, but that is the basic structure.
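As a small illustration of the difference (hypothetical values):

var linked = new LinkedList<int>(new[] { 1, 2, 4 });
LinkedListNode<int> node = linked.Find(2);   // locating the node is still O(n)
linked.AddAfter(node, 3);                    // O(1): only a couple of node references are rewired

var contiguous = new List<int> { 1, 2, 4 };
contiguous.Insert(2, 3);                     // O(n): elements after index 2 are shifted to the right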
I think that some collection classes might use arrays internally as well as linked lists or something similar. The benefit of using collections from the System.Collections namespace instead of arrays, is that you do not need to spend any extra time writing code to perform update operations.
Arrays will always be more lightweight, and if you know some very good search algorithms, then you might even be able to use them more efficiently, but most of the time you can avoid reinventing the wheel by using classes from System.Collections. These classes are meant to help the programmer avoid writing code that has already been written and tuned hundreds of times, so it is unlikely that you'll get a significant performance boost by manipulating arrays yourself.
When you need a static collection that doesn't require much adding, removing or editing, then perhaps it is a good time to use an array, since they don't require the extra memory that collections do.