.NET Collection Classes - c#

Group of related data like a list of parts etc., can be handled either using Arrays(Array of Parts) or using Collection. I understand that When Arrays are used, Insertion, Deletion and some other operations have performance impact when it is compared with Collections. Does this mean that Arrays are not used internally by the collections?, If so what is the data structure used for collections like List, Collection etc?
How the collections are handled internally?

List<T> uses an internal array. Removing/inserting items near the beginning of the list will be more expensive than doing the same near the end of the list, since the entire contents of the internal array need to be shifted in one direction. Also, once you try to add an item when the internal list is full, a new, bigger array will be constructed, the contents copied, and the old array discarded.
The Collection<T> class, when used with the parameterless constructor, uses a List<T> internally. So performance-wise they will be identical, with the exception of overhead caused by wrapping. (Essentially one more level of indirection, which is going to be negligible in most scenarios.)
LinkedList<T> is, as its name implies, a linked list. This will sacrifice iteration speed for insertion/removal speed. Since iterating means traversing pointers-to-pointers-to-pointers ad infinitum, this is going to take more work overall. Aside from the pointer traversal, two nodes may not be allocated anywhere near each other, reducing the effectiveness of CPU RAM caches.
However, the amount of time required to insert or remove a node is constant, since it requires the same number of operations no matter the state of the list. (This does not take into account any work that must be done to actually locate the item to remove, or to traverse the list to find the insertion point!)
If your primary concern with your collection is testing if something is in the collection, you might consider a HashSet<T> instead. Addition of items to the set will be relatively fast, somewhere between insertion into a list and a linked list. Removal of items will again be relatively fast. But the real gain is in lookup time -- testing if a HashSet<T> contains an item does not require iterating the entire list. On average it will perform faster than any list or linked list structure.
However, a HashSet<T> cannot contain equivalent items. If part of your requirements is that two items that are considered equal (by an Object.Equals(Object) overload, or by implementing IEquatable<T>) coexist independently in the collection, then you simply cannot use a HashSet<T>. Also, HashSet<T> does not guarantee insertion order, so you also can't use a HashSet<T> if maintaining some sort of ordering is important.

There are two basic ways to implement a simple collection:
contiguous array
linked list
Contiguous arrays have performance disadvantages for the operations you mentioned because the memory space of the collection is either preallocated or allocated based on the contents of the collection. Thus deletion or insertion requires moving many array elements to keep the entire collection contiguous and in the proper order.
Linked lists remove these issues because the items in the collection do not need to be stored in memory contiguously. Instead each element contains a reference to one or more of the other elements. Thus, when an insertion is made, the item in question is created anywhere in memory and only the references on one or two of the elements already in the collection need to be modified.
For example:
LinkedList<object> c = new LinkedList<object>(); // a linked list
object[] a = new object[] { }; // a contiguous array
This is simplified of course. The internals of LinkedList<> are doubtless more complex than a simple singly or doubly linked list, but that is the basic structure.

I think that some collection classes might use arrays internally as well as linked lists or something similar. The benefit of using collections from the System.Collections namespace instead of arrays, is that you do not need to spend any extra time writing code to perform update operations.
Arrays will always be more lightweight, and if you know some very good search algorithms, then you might even be able to use them more efficiently, but most of the the time you can avoid reinventing the wheel by using classes from System.Collections. These classes are meant to help the programmer avoid writing code that has already been written and tuned hundreds of times, so it is unlikely that you'll get a significant performance boost by manipulating arrays yourself.
When you need a static collection that doesn't require much adding, removing or editing, then perhaps it is a good time to use an array, since they don't require the extra memory that collections do.

Related

Most efficient way, Tags or List<GameObject>?

In my game I can use a list of game objects or tags to iterate but i prefer knows what is the most efficient way.
Save more memory using tags or unity requires many resources to do a search by tag?
public List<City> _Citys = new List<City>();
or
foreach(GameObject go in GameObject.FindGameObjectsWithTag("City"))
You're better of using a List of City objects and doing a standard for loop to iterate over the 'City' objects. The List just simply holds references to the 'City' objects, so impact on memory should be minimal - you could use an array of GameObjects[] instead of a List (which is what FindGameObjectsWithTag returns).
It's better for performance to use a populated List/Array rather than searching by Tags and of course you're explicitly pointing to an object rather than using 'magic' strings -- if you change the tag name later on then the FindGameObjectsWithTag method will silently break, as it will no longer find any objects.
Also, avoid using a foreach loop in Unity as this unfortunately creates a lot of garbage (the garbage collector in Unity isn't great so it's best to create as little garbage as possbile), instead just use a standard for loop:
Replace the “foreach” loops with simple “for” loops. For some reason, every iteration of every “foreach” loop generated 24 Bytes of garbage memory. A simple loop iterating 10 times left 240 Bytes of memory ready to be collected which was just unacceptable
EDIT: As mentioned in pid's answer - measure. You can use the built-in Unity profiler to inspect memory usage: http://docs.unity3d.com/Manual/ProfilerMemory.html
Per Microsoft's C# API rules, verbs such as Find* or Count* denote active code while terms such as Length stand for actual values that require no code execution.
Now, if the Unity3D folks respected those guidelines is a matter of debate, but from the name of the method I can already tell that it has a cost and should not be taken too lightly.
On the other side, your question is about performance, not correctness. Both ways are correct per se, but one is supposed to have better performance.
So, the main rule of refactoring for performance is: MEASURE.
It depends on memory allocation and garbage collection, it is impossible to tell which really is faster without measuring.
So the best advice I could give you is pretty general. Whenever you feel the need to enhance performance of code you have to actually measure what you are about to improve, before and after.
Your code examples are 2 distinctly different things. One is instantiating a list, and one is enumerating over an IEnumerable returned from a function call.
I assume you mean the difference between iterating over your declared list vs iterating over the return value from GameObject.FindObjectsWithTag() in which case;
Storing a List as a member variable in your class, populating it once and then iterating over it several times is more efficient than iterating over GameObject.FindObjectsWithTag several times.
This is because you keep your List and your references to the objects in your list at all times without having to repopulate it.
GameObject.FindObjectsWithTag will search your entire object hierarchy and compile a list of all the objects that it finds that matches your search criteria. This is done every time you call the function, so there is additional overhead even if the amount of objects it finds is the same as it still searches your hierarchy.
To be honest, you could just cache your results with a List object using GameObject.FindObjectWithTag providing the amount of objects returned will not change. (As in to say you are not instantiating or destroying any of those objects)

Is List<T> really an undercover Array in C#?

I have been looking at .NET libraries using ILSpy and have come across List<T> class definition in System.Collections.Generic namespace. I see that the class uses methods like this one:
// System.Collections.Generic.List<T>
/// <summary>Removes all elements from the <see cref="T:System.Collections.Generic.List`1" />.</summary>
public void Clear()
{
if (this._size > 0)
{
Array.Clear(this._items, 0, this._size);
this._size = 0;
}
this._version++;
}
So, the Clear() method of the List<T> class actually uses Array.Clear method. I have seen many other List<T> methods that use Array stuff in the body.
Does this mean that List<T> is actually an undercover Array or List only uses some part of Array methods?
I know lists are type safe and don't require boxing/unboxing but this has confused me a bit.
The list class is not itself an array. In other words, it does not derive from an array. Instead it encapsulates an array that is used by the implementation to hold the list's member elements.
Since List<T> offers random access to its elements, and those elements are indexed 0..Count-1, using an array to store the elements is the obvious implementation.
This tends to surprise C++ programmers that know std::list. A linked list, covered in .NET as well with the LinkedList class. And has the same perf characteristics, O(1) for inserts and deletes.
You should however in general avoid it. Linked lists do not perform well on modern processors. Which greatly depend on the cpu caches to get reasonable performance with memory that's many times slower than the execution core. A simple array is by far the data structure that takes most advantage of the cache. Accessing an element gives very high odds that subsequent elements are present in the cache as well. That is not the case for a linked list, elements tend to be scattered throughout the address space, make a cache miss likely. They can be very expensive, as much as 200 cycles with the cpu doing nothing but waiting on the memory sub-system to supply the data.
But do keep the perf characteristics in mind, adding or removing an element that is not at the end of the List costs O(n), just like an array. And a large List can generate a lot of garbage as the array needs to be expanded, setting the Capacity property up front can help a lot to avoid that. More about that in this answer. And otherwise the exact same concerns for std::vector<>.
Yes, List<T> uses an array internally to store the items, although in most cases the array is actually larger than the number of elements in the collection -- it has some extra "padding" at the end so that you can add new items without it having to reallocate memory every time. It keeps track of the actual size of the collection with a separate field (you can see this._size in your generated code). When you add more elements than the current array has room for, it will automatically allocate a new larger array -- twice as big, I think -- and copy over all the existing elements.
If you're concerned about a List<T> using more memory than necessary, you can set the size of the array explicitly with the constructor override that accepts a capacity parameter, if you know the size in advance, or call the TrimExcess() method to make sure the array is (close to) to actual size of the collection.
Random access memory is an array, so in that sense all data structures from linked-lists to heaps and beyond, that rely on random-access to memory for their performance behaviour, are built on the array that is system memory. It is more a question of how many-levels of abstraction are in between.
Of course in a modern virtual memory machine, the random-access system memory is itself an abstraction built on a complicated virtual-memory model of multi-tier pipelined caches, non-cached RAM, and disk.

Is there a performance impact when calling ToList()?

When using ToList(), is there a performance impact that needs to be considered?
I was writing a query to retrieve files from a directory, which is the query:
string[] imageArray = Directory.GetFiles(directory);
However, since I like to work with List<> instead, I decided to put in...
List<string> imageList = Directory.GetFiles(directory).ToList();
So, is there some sort of performance impact that should be considered when deciding to do a conversion like this - or only to be considered when dealing with a large number of files? Is this a negligible conversion?
IEnumerable<T>.ToList()
Yes, IEnumerable<T>.ToList() does have a performance impact, it is an O(n) operation though it will likely only require attention in performance critical operations.
The ToList() operation will use the List(IEnumerable<T> collection) constructor. This constructor must make a copy of the array (more generally IEnumerable<T>), otherwise future modifications of the original array will change on the source T[] also which wouldn't be desirable generally.
I would like to reiterate this will only make a difference with a huge list, copying chunks of memory is quite a fast operation to perform.
Handy tip, As vs To
You'll notice in LINQ there are several methods that start with As (such as AsEnumerable()) and To (such as ToList()). The methods that start with To require a conversion like above (ie. may impact performance), and the methods that start with As do not and will just require some cast or simple operation.
Additional details on List<T>
Here is a little more detail on how List<T> works in case you're interested :)
A List<T> also uses a construct called a dynamic array which needs to be resized on demand, this resize event copies the contents of an old array to the new array. So it starts off small and increases in size if required.
This is the difference between the Capacity and Count properties on List<T>. Capacity refers to the size of the array behind the scenes, Count is the number of items in the List<T> which is always <= Capacity. So when an item is added to the list, increasing it past Capacity, the size of the List<T> is doubled and the array is copied.
Is there a performance impact when calling toList()?
Yes of course. Theoretically even i++ has a performance impact, it slows the program for maybe a few ticks.
What does .ToList do?
When you invoke .ToList, the code calls Enumerable.ToList() which is an extension method that return new List<TSource>(source). In the corresponding constructor, under the worst circumstance, it goes through the item container and add them one by one into a new container. So its behavior affects little on performance. It's impossible to be a performance bottle neck of your application.
What's wrong with the code in the question
Directory.GetFiles goes through the folder and returns all files' names immediately into memory, it has a potential risk that the string[] costs a lot of memory, slowing down everything.
What should be done then
It depends. If you(as well as your business logic) gurantees that the file amount in the folder is always small, the code is acceptable. But it's still suggested to use a lazy version: Directory.EnumerateFiles in C#4. This is much more like a query, which will not be executed immediately, you can add more query on it like:
Directory.EnumerateFiles(myPath).Any(s => s.Contains("myfile"))
which will stop searching the path as soon as a file whose name contains "myfile" is found. This is obviously has a better performance then .GetFiles.
Is there a performance impact when calling toList()?
Yes there is. Using the extension method Enumerable.ToList() will construct a new List<T> object from the IEnumerable<T> source collection which of course has a performance impact.
However, understanding List<T> may help you determine if the performance impact is significant.
List<T> uses an array (T[]) to store the elements of the list. Arrays cannot be extended once they are allocated so List<T> will use an over-sized array to store the elements of the list. When the List<T> grows beyond the size the underlying array a new array has to be allocated and the contents of the old array has to be copied to the new larger array before the list can grow.
When a new List<T> is constructed from an IEnumerable<T> there are two cases:
The source collection implements ICollection<T>: Then ICollection<T>.Count is used to get the exact size of the source collection and a matching backing array is allocated before all elements of the source collection is copied to the backing array using ICollection<T>.CopyTo(). This operation is quite efficient and probably will map to some CPU instruction for copying blocks of memory. However, in terms of performance memory is required for the new array and CPU cycles are required for copying all the elements.
Otherwise the size of the source collection is unknown and the enumerator of IEnumerable<T> is used to add each source element one at a time to the new List<T>. Initially the backing array is empty and an array of size 4 is created. Then when this array is too small the size is doubled so the backing array grows like this 4, 8, 16, 32 etc. Every time the backing array grows it has to be reallocated and all elements stored so far have to be copied. This operation is much more costly compared to the first case where an array of the correct size can be created right away.
Also, if your source collection contains say 33 elements the list will end up using an array of 64 elements wasting some memory.
In your case the source collection is an array which implements ICollection<T> so the performance impact is not something you should be concerned about unless your source array is very large. Calling ToList() will simply copy the source array and wrap it in a List<T> object. Even the performance of the second case is not something to worry about for small collections.
It will be as (in)efficient as doing:
var list = new List<T>(items);
If you disassemble the source code of the constructor that takes an IEnumerable<T>, you will see it will do a few things:
Call collection.Count, so if collection is an IEnumerable<T>, it will force the execution. If collection is an array, list, etc. it should be O(1).
If collection implements ICollection<T>, it will save the items in an internal array using the ICollection<T>.CopyTo method. It should be O(n), being n the length of the collection.
If collection does not implement ICollection<T>, it will iterate through the items of the collection, and will add them to an internal list.
So, yes, it will consume more memory, since it has to create a new list, and in the worst case, it will be O(n), since it will iterate through the collection to make a copy of each element.
"is there a performance impact that needs to be considered?"
The issue with your precise scenario is that first and foremost your real concern about performance would be from the hard-drive speed and efficiency of the drive's cache.
From that perspective, the impact is surely negligible to the point that NO it need not be considered.
BUT ONLY if you really need the features of the List<> structure to possibly either make you more productive, or your algorithm more friendly, or some other advantage. Otherwise, you're just purposely adding an insignificant performance hit, for no reason at all. In which case, naturally, you shouldn’t do it! :)
ToList() creates a new List and put the elements in it which means that there is an associated cost with doing ToList(). In case of small collection it won't be very noticeable cost but having a huge collection can cause a performance hit in case of using ToList.
Generally you should not use ToList() unless work you are doing cannot be done without converting collection to List. For example if you just want to iterate through the collection you don't need to perform ToList
If you are performing queries against a data source for example a Database using LINQ to SQL then the cost of doing ToList is much more because when you use ToList with LINQ to SQL instead of doing Delayed Execution i.e. load items when needed (which can be beneficial in many scenarios) it instantly loads items from Database into memory
Considering the performance of retrieving file list, ToList() is negligible. But not really for other scenarios. That really depends on where you are using it.
When calling on an array, list, or other collection, you create a copy of the collection as a List<T>. The performance here depends on the size of the list. You should do it when really necessary.
In your example, you call it on an array. It iterates over the array and adds the items one by one to a newly created list. So the performance impact depends on the number of files.
When calling on an IEnumerable<T>, you materialize the IEnumerable<T> (usually a query).
ToList Will create a new list and copy elements from original source to the newly created list so only thing is to copy the elements from the original source and depends on the source size
Let's look for another example;
If you working on databases when you run ToList() method and check the SQL Profiler for this code;
var IsExist = (from inc in entities.be_Settings
where inc.SettingName == "Number"
select inc).ToList().Count > 0;
Auto created query will like this:
SELECT [Extent1].[SettingName] AS [SettingName], [Extent1].[SettingValue] AS [SettingValue] FROM [dbo].[be_Settings] AS [Extent1] WHERE N'Number' = [Extent1].[SettingName]
The select query is run with the ToList method, and the results of the query are stored in memory, and it is checked whether there is a record by looking at the number of elements of the List. For example, if there are 1000 records in your table with the relevant criteria, these 1000 records are first brought from the database and converted into objects, and then they are thrown into a List and you only check the number of elements of this List. So this is very inefficient way to choose.
its not exactly about list perfomance but if u have high dimensional array
u can use HashSet instead of List.

which collection class is efficient for insertion in C#

If i want to insert the data in these collection class "Dictionary","List" and "Sorted List" which will required the less time to perform the insertion? can you give give me a code to explain that process?
List<T> will have the fastest insertion.
LinkedList<T> will have the fastest insertion at the head.
The difference will be miniscule for most practical applications; you should use whichever one fits your needs.
Insert a value to LinkedList is O(1) operation. List (implemented by an array) may require additional allocation and copying of items.
If your performance requirements are strict, you should measure in your environment and on your data before deciding.
I'll offer some guesses that you should not take for granted until you perform your own measurements:
If you know the number of elements in advance, just use the
pre-allocated List (or array).
If you don't:
Use list of chunks (i.e. LinkedList<List<T>>) to avoid List resizes.
Or, for simplicity, you can just use the List and incur some
performance penalty when it is resized to accept more elements. I'm
not sure whether this penalty would justify using Dictionary or LinkedList
instead - but you will be if you measure ;)
All this is under assumption you don't care where in the collection is new element inserted and how you retrieve it later... If you do care, then you'll pick your data structure based on that, and not just insert performance alone.

Difference between List<T> and LinkedList<T>

We use List whenever we need a list. I notice now that there is a LinkedList.
I was wondering what was the difference between these 2, and when you should use one over the other.
Well, List<T> is basically backed by an array which is usually bigger than the current number of items. The elements are put in an array, and a new array is created when the old one runs out of space. This is fast for access by index, but slow at removing or inserting elements within the list or at the start. Adding/removing entries at the end of the list is reasonably cheap.
LinkedList<T> is a doubly-linked list - each node knows its previous entry and its next one. This is fast for inserting after/before a particular node (or the head/tail), but slow at access by index.
LinkedList<T> will usually take more memory than List<T> because it needs space for all those next/previous references - and the data will probably have less locality of reference, as each node is a separate object. On the other hand, a List<T> can have a backing array which is much larger than its current needs.
A List<T> is actually an array, meaning that its Add operation is O(1) at the end and O(n) at the front, but you can index into it in O(1). A LinkedList<T> is, as it says, a linked list. Since it's doubly-linked, you can add items to the front or back in O(1) but indexing into it is O(n).
In almost all scenarios a List is going to outperform a LinkedList. Real world results often differ with Big O Complexity Theory.

Categories