How and when to abandon the use of arrays in C#?


I've always been told that adding an element to an array happens like this:
An empty copy of the array, one element larger, is created; the data from the original array is copied into it; and then the data for the new element is loaded.
If this is true, then using an array within a scenario that requires a lot of element activity is contra-indicated due to memory and CPU utilization, correct?
If that is the case, shouldn't you try to avoid using an array as much as possible when you will be adding a lot of elements? Should you use iStringMap instead? If so, what happens if you need more than two dimensions AND need to add a lot of elements? Do you just take the performance hit or is there something else that should be used?

Look at the generic List<T> as a replacement for arrays. They support most of the same things arrays do, including allocating an initial storage size if you want.

This really depends on what you mean by "add."
If you mean:
T[] array;
int i;
T value;
...
if (i >= 0 && i < array.Length) // valid indexes run from 0 to Length - 1
    array[i] = value;
Then, no, this does not create a new array, and is in fact the fastest way to alter any kind of IList in .NET.
If, however, you're using something like ArrayList, List, Collection, etc. then calling the "Add" method may create a new array -- but they are smart about it: they don't just resize by 1 element, they grow geometrically, so if you're adding lots of values it only has to allocate a new array every once in a while. Even then, you can use the "Capacity" property to force it to grow beforehand, if you know how many elements you're adding (list.Capacity += numberOfAddedElements).
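The geometric growth described above can be observed by watching the Capacity property while adding items. This is only a rough sketch: the item count of 1000 is arbitrary, and the exact growth steps are an implementation detail that may differ between runtime versions.

```csharp
using System;
using System.Collections.Generic;

var list = new List<int>();
int reallocations = 0;
int lastCapacity = list.Capacity;

for (int i = 0; i < 1000; i++)
{
    list.Add(i);
    if (list.Capacity != lastCapacity)
    {
        // The capacity changed, so a new backing array was allocated.
        reallocations++;
        lastCapacity = list.Capacity;
    }
}

Console.WriteLine($"Count: {list.Count}, Capacity: {list.Capacity}, reallocations: {reallocations}");

// Reserving room up front avoids the intermediate reallocations.
var presized = new List<int>(1000);
int presizedCapacity = presized.Capacity;
for (int i = 0; i < 1000; i++) presized.Add(i);
bool presizedNeverGrew = presized.Capacity == presizedCapacity;
Console.WriteLine($"Presized list kept its original backing array: {presizedNeverGrew}");
```

The number of reallocations should come out far smaller than the number of Add calls, which is the point the answer is making.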

In general, I prefer to avoid array usage. Just use List<T>. It uses a dynamically-sized array internally, and is fast enough for most usage. If you're using multidimensional arrays, use List<List<List<T>>> if you have to. It's not that much worse in terms of memory, and is much simpler to add items to.
If you're in the 0.1% of usage that requires extreme speed, make sure it's your list accesses that are really the problem before you try to optimize it.

If you're going to be adding/removing elements a lot, just use a List. If it's multidimensional, you can always use a List<List<int>> or something.
On the other hand, lists are less efficient than arrays if what you're mostly doing is traversing the list, because arrays are all in one place in your CPU cache, where objects in a list are scattered all over the place.
If you want to use an array for efficient reading but you're going to be "adding" elements frequently, you have two main options:
1) Generate it as a List (or List of Lists) and then use ToArray() to turn it into an efficient array structure.
2) Allocate the array to be larger than you need, then put the objects into the pre-allocated cells. If you end up needing even more elements than you pre-allocated, you can just reallocate the array when it fills, doubling the size each time. This needs only O(log n) reallocations for n additions, instead of the n reallocations you would get with a reallocate-once-per-add array. Note that this is pretty much how StringBuilder works, giving you a faster way to continually append to a string.
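Option 2 might look roughly like the following. This is a minimal sketch, not a tuned implementation: the starting size of 4 and the Append helper are assumptions made for illustration.

```csharp
using System;

int[] buffer = new int[4];   // deliberately small starting size (arbitrary choice)
int count = 0;               // number of cells actually in use

// Hypothetical helper: appends a value, doubling the buffer when it is full.
void Append(int value)
{
    if (count == buffer.Length)
    {
        // Allocates an array twice as large and copies the old contents into it.
        Array.Resize(ref buffer, buffer.Length * 2);
    }
    buffer[count++] = value;
}

for (int i = 0; i < 100; i++) Append(i);

// Only the first `count` cells are meaningful; trim once at the end if an exact-size array is needed.
int[] result = new int[count];
Array.Copy(buffer, result, count);
Console.WriteLine($"{count} items stored in a buffer of length {buffer.Length}");
```

In practice List<T> already does this bookkeeping, so a hand-rolled buffer is only worth it when profiling shows a need.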

When to abandon the use of arrays
First and foremost, when the semantics of arrays don't match your intent - Need a dynamically growing collection? A set which doesn't allow duplicates? A collection that has to remain immutable? Avoid arrays in all those cases. That's 99% of the cases. Just stating the obvious basic point.
Secondly, when you are not coding for absolute performance criticalness - That's about 95% of the cases. Arrays perform marginally better, especially in iteration. It almost never matters.
When you're not forced by an argument with params keyword - I just wished params accepted any IEnumerable<T> or even better a language construct itself to denote a sequence (and not a framework type).
When you are not writing legacy code, or dealing with interop
In short, it's very rare that you would actually need an array. I will add why one might avoid it.
The biggest reason to avoid arrays imo is conceptual. Arrays are closer to implementation and farther from abstraction. Arrays convey more about how it is done than what is done, which is against the spirit of high level languages. That's not surprising, considering arrays are closer to the metal; they are straight out of a special type (though internally an array is a class). Not to be pedagogical, but arrays really do translate to a semantic meaning that is very, very rarely required. The most useful and frequent semantics are those of collections with arbitrary entries, sets with distinct items, key value maps etc, with any combination of addable, readonly, immutable, order-respecting variants. Think about this: you might want an addable collection, or a readonly collection with predefined items and no further modification, but how often does your logic look like "I want a dynamically addable collection but only a fixed number of them and they should be modifiable too"? Very rarely, I would say.
Array was designed during the pre-generics era and it mimics genericity with a lot of run time hacks, and it will show its oddities here and there. Some of the catches I found:
Broken covariance.
string[] strings = ...
object[] objects = strings;
objects[0] = 1; //compiles, but gives a runtime exception.
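A self-contained version of the covariance snippet above, with made-up sample values in place of the elided initializer, might look like this. The equivalent assignment from List<string> to List<object> is rejected at compile time, so the problem cannot arise there.

```csharp
using System;

string[] strings = { "a", "b" };     // sample data, chosen for illustration
object[] objects = strings;          // allowed: arrays of reference types are covariant

bool threw = false;
try
{
    objects[0] = 1;                  // compiles, but the runtime checks the real element type
}
catch (ArrayTypeMismatchException)
{
    threw = true;
}

Console.WriteLine($"ArrayTypeMismatchException thrown: {threw}");
Console.WriteLine(strings[0]);       // the original element is left untouched
```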
Arrays can give you a reference to a struct! That's unlike anywhere else. A sample:
struct Value { public int mutable; }
var array = new[] { new Value() };
array[0].mutable = 1; //<-- compiles !
//a List<Value>[0].mutable = 1; doesn't compile since editing a copy makes no sense
Console.WriteLine(array[0].mutable); // 1, expected or unexpected? confusing surely
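To see the difference side by side, here is a small sketch that uses a value tuple (which is a mutable struct) as a stand-in for the Value struct in the sample; the field names are invented for the example.

```csharp
using System;
using System.Collections.Generic;

// A value tuple is a mutable struct, used here in place of a custom struct.
var array = new (int Mutable, int Other)[1];
array[0].Mutable = 1;                 // writes directly into the array's storage

var list = new List<(int Mutable, int Other)> { (0, 0) };
// list[0].Mutable = 1;               // does not compile (CS1612): the indexer returns a copy
var copy = list[0];                   // take an explicit copy...
copy.Mutable = 1;                     // ...modify it...
list[0] = copy;                       // ...and write it back

Console.WriteLine(array[0].Mutable);
Console.WriteLine(list[0].Mutable);
```

With a list the copy-modify-write sequence is spelled out, whereas the array element is mutated in place.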
Run time implemented methods like ICollection<T>.Contains can be different for structs and classes. It's not a big deal, but if you forget to override non generic Equals correctly for reference types expecting generic collection to look for generic Equals, you will get incorrect results.
public class Class : IEquatable<Class>
{
    public bool Equals(Class other)
    {
        Console.WriteLine("generic");
        return true;
    }
    public override bool Equals(object obj)
    {
        Console.WriteLine("non generic");
        return true;
    }
}
public struct Struct : IEquatable<Struct>
{
    public bool Equals(Struct other)
    {
        Console.WriteLine("generic");
        return true;
    }
    public override bool Equals(object obj)
    {
        Console.WriteLine("non generic");
        return true;
    }
}
class[].Contains(test); //prints "non generic"
struct[].Contains(test); //prints "generic"
The Length property and [] indexer on T[] seem to be regular properties that you can access through reflection (which should involve some magic), but when it comes to expression trees you have to spit out the exact same code the compiler does. There are ArrayLength and ArrayIndex methods to do that separately. One such question here. Another example:
Expression<Func<string>> e = () => new[] { "a" }[0];
//e.Body.NodeType == ExpressionType.ArrayIndex
Expression<Func<string>> e = () => new List<string>() { "a" }[0];
//e.Body.NodeType == ExpressionType.Call;
Yet another one. string[].IsReadOnly returns false, but if you are casting, IList<string>.IsReadOnly returns true.
Type checking gone wrong: (object)new ConsoleColor[0] is int[] returns true, whereas new ConsoleColor[0] is int[] returns false. Same is true for uint[] and int[] comparisons. No such problems if you use any other collection types.
How to abandon the use of arrays.
The most commonly used substitute is List<T> which has a cleaner API. But it is a dynamically growing structure which means you can add to a List<T> at the end or insert anywhere to any capacity. There is no substitute for the exact behaviour of an array, but people mostly use arrays as readonly collection where you can't add anything to its end. A substitute is ReadOnlyCollection<T>.
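As a brief illustration of the ReadOnlyCollection<T> substitute, the sketch below wraps an array with Array.AsReadOnly; the colour names are just sample data.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

string[] source = { "red", "green", "blue" };
ReadOnlyCollection<string> colors = Array.AsReadOnly(source);

Console.WriteLine(colors[1]);
Console.WriteLine(colors.Count);

// The wrapper has no public Add method or indexer setter.
// Going through IList<T> compiles, but fails at run time.
bool rejected = false;
try
{
    ((IList<string>)colors).Add("yellow");
}
catch (NotSupportedException)
{
    rejected = true;
}
Console.WriteLine($"Add rejected: {rejected}");
```

Note that this is a read-only view, not a snapshot: code that still holds the original array can change what the wrapper shows.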

When the array is resized, a new array must be allocated, and the contents copied. If you are only modifying the contents of the array, it is just a memory assignment.
So, you should not use arrays when you don't know the size of the array, or the size is likely to change. However, if you have a fixed length array, they are an easy way of retrieving elements by index.

ArrayList and List grow the array by more than one when needed (I think it's by doubling the size, but I haven't checked the source). They are generally the best choice when you are building a dynamically sized array.
When your benchmarks indicate that array resize is seriously slowing down your application (remember - premature optimization is the root of all evil), you can evaluate writing a custom array class with tweaked resizing behavior.

Generally, if you must have the BEST indexed lookup performance it's best to build a List first and then turn it into an array, thus paying a small penalty at first but avoiding any later. If the issue is that you will be continually adding new data and removing old data then you may want to use an ArrayList or List for convenience, but keep in mind that they are just special-case arrays. When they "grow" they allocate a completely new array and copy everything into it, which is extremely slow.
ArrayList is just an Array which grows when needed.
Add is amortized O(1), just be careful to make sure the resize won't happen at a bad time.
Insert is O(n) all items to the right must be moved over.
Remove is O(n) all items to the right must be moved over.
Also important to keep in mind that List is not a linked list. It's just a typed ArrayList. The List documentation does note that it performs better in most cases but does not say why.
The best thing to do is to pick a data structure which is appropriate to your problem. This depends on a LOT of things and so you may want to browse the System.Collections.Generic Namespace.
In this particular case I would say that if you can come up with a good key, a Dictionary would be your best bet. It has insert and remove that approach O(1). However, even with a Dictionary you have to be careful not to let it resize its internal array (an O(n) operation). It's best to give it a lot of room by specifying a larger-than-you-expect-to-use initial capacity in the constructor.
-Rick

A standard array should be defined with a length, which reserves all of the memory that it needs in a contiguous block. Adding an item to the array would put it inside of the block of already reserved memory.

Arrays are great for few writes and many reads, particularly those of an iterative nature - for anything else, use one of the many other data structures.

You are correct: an array is great for lookups. However, modifications to the size of the array are costly.
You should use a container that supports incremental size adjustments in the scenario where you're modifying the size of the array. You could use an ArrayList which allows you to set the initial size, and you could continually check the size versus the capacity and then increment the capacity by a large chunk to limit the number of resizes.
Or you could just use a linked list. Then however look ups are slow...

If I think I'm going to be adding items to the collection a lot over its lifetime, then I'll use a List. If I know for sure what the size of the collection will be when it's declared, then I'll use an array.
Another time I generally use an array over a List is when I need to return a collection as a property of an object - I don't want callers adding items to that collection via List's Add methods, but instead want them to add items to the collection via my object's interface. In that case, I'll take the internal List and call ToArray and return an array.

If you are going to be doing a lot of adding, and you will not be doing random access (such as myArray[i]), you could consider using a linked list (LinkedList<T>), because it will never have to "grow" like the List<T> implementation. Keep in mind, though, that you can only really access items in a LinkedList<T> implementation using the IEnumerable<T> interface.
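A short sketch of what working with LinkedList<T> looks like; the string values are sample data. Besides enumeration, the class also exposes its nodes, which is how insertion in the middle is done without shifting elements.

```csharp
using System;
using System.Collections.Generic;

var items = new LinkedList<string>();
items.AddLast("b");
items.AddLast("c");
items.AddFirst("a");                 // adding at either end never reallocates or copies

// There is no indexer; locating an element means walking the list (O(n)).
var node = items.Find("b");
if (node != null)
{
    items.AddAfter(node, "b2");      // once you hold the node, insertion is O(1)
}

string joined = string.Join(",", items);
Console.WriteLine(joined);           // a,b,b2,c
```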

The best thing you can do is to allocate as much memory as you need upfront if possible. This will prevent .NET from having to make additional calls to get memory on the heap. Failing that then it makes sense to allocate in chunks of five or whatever number makes sense for your application.
This is a rule you can apply to anything really.

Related

How can I get a Span<T> from a List<T> while avoiding needless copies?

I have a List<T> containing some data. I would like to pass it to a function which accepts ReadOnlySpan<T>.
List<T> items = GetListOfItems();
// ...
void Consume<T>(ReadOnlySpan<T> buffer)
// ...
Consume(items??);
In this particular instance T is byte but it doesn't really matter.
I know I can use .ToArray() on the List, and then construct a span, e.g.
Consume(new ReadOnlySpan<T>(items.ToArray()));
However this creates a (seemingly) unnecessary copy of the items. Is there any way to get a Span directly from a List? List<T> is implemented in terms of T[] behind the scenes, so in theory it's possible, but not as far as I can see in practice?

In .Net 5.0, you can use CollectionsMarshal.AsSpan() (source, GitHub issue) to get the underlying array of a List<T> as a Span<T>.
Keep in mind that this is still unsafe: if the List<T> reallocates the array, the Span<T> previously returned by CollectionsMarshal.AsSpan won't reflect any further changes to the List<T>. (Which is why the method is hidden in the System.Runtime.InteropServices.CollectionsMarshal class.)
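The following sketch (assuming .NET 5 or later) shows both the zero-copy access and the hazard just described. Setting Capacity is used here only as a deterministic way to make the list switch to a new backing array.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

var items = new List<byte> { 1, 2, 3 };

// No copy: the span views the list's current backing array.
Span<byte> span = CollectionsMarshal.AsSpan(items);
span[0] = 42;
byte firstAfterWrite = items[0];            // the write is visible through the list

// Force the list onto a new backing array.
items.Capacity = items.Capacity * 2 + 1;

// The span still points at the OLD array, so this write is lost to the list.
span[1] = 99;
byte secondAfterStaleWrite = items[1];

Console.WriteLine($"{firstAfterWrite}, {secondAfterStaleWrite}");
```

Keeping the span short-lived, and not touching the list while it is in use, avoids the stale-view problem.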

Thanks for all the comments explaining that there's no actual way to do it and how exposing the internal Array inside List could lead to bad behaviour and a broken span.
I ended up refactoring my code not to use a list and just produce spans in the first place.
void Consume<T>(ReadOnlySpan<T> buffer)
// ...
var buffer = new T[512];
int itemCount = ProduceListOfItems(buffer); // produce now writes into the buffer
Consume(new ReadOnlySpan<T>(buffer, 0, itemCount));
I'm choosing to make the explicit tradeoff of over-allocating the buffer once to avoid making an extra copy later on.
I can do this in my specific case because I know there will be a maximum upper bound on the item count, and over-allocating slightly isn't a big deal. However, there doesn't appear to be a generalisation here, nor would one ever get added, as it would be dangerous.
As always, software performance is the art of making (hopefully favorable) trade-offs.
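A runnable variant of the approach in this answer could look like the sketch below. ProduceItems and its ten-item output are hypothetical stand-ins for the real producer, and byte is used because the question mentions it.

```csharp
using System;

// Hypothetical producer: fills the caller's buffer and returns how many items it wrote.
static int ProduceItems(byte[] destination)
{
    int produced = Math.Min(10, destination.Length);
    for (int i = 0; i < produced; i++) destination[i] = (byte)i;
    return produced;
}

// The consumer only sees the filled portion of the buffer; no copy is made.
static int Consume(ReadOnlySpan<byte> data)
{
    int sum = 0;
    foreach (byte b in data) sum += b;
    return sum;
}

var buffer = new byte[512];                 // over-allocated once, up front
int itemCount = ProduceItems(buffer);
int total = Consume(new ReadOnlySpan<byte>(buffer, 0, itemCount));
Console.WriteLine($"{itemCount} items, sum {total}");
```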

You can write your own CustomList<T> that exposes the underlying array. It is then on user code to use this class correctly.
In particular the CustomList<T> will not be aware of any Span<T> that you can obtain from the underlying backing array. After taking a Span<T> you should not make the list do anything to create a new array or create undefined data in the old array.
The C++ standard library allows user code to obtain direct pointers into vector<T> backing storage. They document the conditions under which this is safe. Resizing makes it unsafe for example.
.NET itself does something like this with MemoryStream. This class allows you access to the underlying buffer and indeed unsafe operations are possible.

C# List .ConvertAll Efficiency and overhead

I recently learned about List's .ConvertAll extension. I used it a couple times in code today at work to convert a large list of my objects to a list of some other object. It seems to work really well. However I'm unsure how efficient or fast this is compared to just iterating the list and converting the object. Does .ConvertAll use anything special to speed up the conversion process or is it just a short hand way of converting Lists without having to set up a loop?

No better way to find out than to go directly to the source, literally :)
http://referencesource.microsoft.com/#mscorlib/system/collections/generic/list.cs#dbcc8a668882c0db
As you can see, there's no special magic going on. It just iterates over the list and creates a new item by the converter function that you specify.
To be honest, I was not aware of this method. The more idiomatic .NET way to do this kind of projection is through the use of the Select extension method on IEnumerable<T> like so: source.Select(input => new Something(input.Name)). The advantage of this is threefold:
It's more idiomatic, as I said; ConvertAll is likely a remnant of the pre-C# 3.0 days. It's not a very arcane method by any means and ConvertAll is a pretty clear description, but it might still be better to stick to what other people know, which is Select.
It's available on all IEnumerable<T>, while ConvertAll only works on instances of List<T>. It doesn't matter if it's an array, a list or a dictionary, Select works with all of them.
Select is lazy. It doesn't do anything until you iterate over it. This means that it returns an IEnumerable<TOutput> which you can then convert to a list by calling ToList() or not if you don't actually need a list. Or if you just want to convert and retrieve the first two items out of a list of a million items, you can simply do source.Select(input => new Something(input.Name)).Take(2).
But if your question is purely about the performance of converting a whole list to another list, then ConvertAll is likely to be somewhat faster as it's less generic than a Select followed by a ToList (it knows that a list has a size and can directly access elements by index from the underlying array for instance).
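For comparison, here are the two approaches side by side on sample data. Both produce the same list; Array.ConvertAll is the corresponding static method for arrays.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var numbers = new List<int> { 1, 2, 3 };

// List<T>.ConvertAll: eager, returns a List<TOutput> sized up front.
List<string> viaConvertAll = numbers.ConvertAll(n => n.ToString());

// Select + ToList: works on any IEnumerable<T>, evaluated when ToList runs.
List<string> viaSelect = numbers.Select(n => n.ToString()).ToList();

bool same = viaConvertAll.SequenceEqual(viaSelect);
Console.WriteLine($"Same results: {same}");

// Arrays have no instance ConvertAll, but Select and Array.ConvertAll both work.
int[] array = { 1, 2, 3 };
string[] viaArraySelect = array.Select(n => n.ToString()).ToArray();
string[] viaArrayConvertAll = Array.ConvertAll(array, n => n.ToString());
```

Which one is faster for a given workload is something to measure rather than assume.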

Decompiled using ILSpy:
public List<TOutput> ConvertAll<TOutput>(Converter<T, TOutput> converter)
{
    if (converter == null)
    {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.converter);
    }
    List<TOutput> list = new List<TOutput>(this._size);
    for (int i = 0; i < this._size; i++)
    {
        list._items[i] = converter(this._items[i]);
    }
    list._size = this._size;
    return list;
}
Create a new list.
Populate the new list by iterating over the current instance, executing the specified delegate.
Return the new list.
Does .ConvertAll use anything special to speed up the conversion process or is it just a short hand way of converting Lists without having to set up a loop?
It doesn't do anything special with regards to conversion (what "special" thing could it do?) It is directly modifying the private _items and _size members, so it might be trivially faster under some circumstances.
As usual, if the solution makes you more productive, code easier to read, etc. use it until profiling reveals a compelling performance reason to not use it.

It's the second way you described it - basically a short-hand way without setting up a loop.
Here's the guts of ConvertAll():
List<TOutput> list = new List<TOutput>(this._size);
for (int index = 0; index < this._size; ++index)
    list._items[index] = converter(this._items[index]);
list._size = this._size;
return list;
Where TOutput is whatever type you're converting to, and converter is a delegate indicating the method that will do the conversion.
So it loops through the List you passed in, running each element through the method you specify, and then returns a new List of the specified type.

For precise timing in your scenarios you need to measure yourself.
Do not expect any miracles - it has to be an O(n) operation since each element needs to be converted and added to the destination list.
Consider using Enumerable.Select instead, as it does lazy evaluation that may allow you to avoid a second copy of a large list, especially if you need to do any filtering of items along the way.
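The effect of lazy evaluation can be made visible by counting how often the conversion actually runs. This is an illustrative sketch; the million-element list and the even-number filter are arbitrary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var source = Enumerable.Range(1, 1_000_000).ToList();
int conversions = 0;

// Nothing runs until ToList() pulls items through the pipeline,
// and Take(2) stops pulling as soon as it has two results.
List<string> firstTwoEven = source
    .Where(n => n % 2 == 0)
    .Select(n => { conversions++; return n.ToString(); })
    .Take(2)
    .ToList();

Console.WriteLine(string.Join(",", firstTwoEven));   // 2,4
Console.WriteLine($"Conversions performed: {conversions}");
```

An eager ConvertAll over the same list would convert all one million items before any filtering or truncation could happen.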

Are List<> elements sequentially located in heap like array?

I'm learning C# and basically know the difference between arrays and Lists, namely that the latter is generic and can dynamically grow, but I'm wondering:
are List elements sequentially located in heap like array or is each element located "randomly" in a different locations?
and if that is true, does that affect the speed of access & data retrieval from memory?
and if that is true, is this what makes arrays a little faster than Lists?

Let's see the second and the third questions first:
and if that true does that affect the speed of access & data retrieval from memory ?
and if that true is this what makes array little faster than list ?
There is only a single type of "native" collection in .NET (by .NET I mean the CLR, i.e. the runtime): the array. (Technically, if you consider a string a type of collection, then there are two native types of collections :-) ) (Technically, part 2: not all the arrays you think of as arrays are "native" arrays. Only the one-dimensional, zero-based arrays are "native" arrays. Arrays of type T[,] aren't, and arrays where the first element doesn't have an index of 0 aren't.) Every other collection (other than LinkedList<>) is built atop it. If you look at List<T> with ILSpy you'll see that at the base of it there is a T[] with an added int for the Count (the T[].Length is the Capacity). Clearly an array is a little faster than a List<T> because to use it, you have one less indirection (you access the array directly, instead of accessing the list that accesses the array).
Let's see the first question:
does List elements sequentially located in heap like array or each element is located randomly in different locations?
Being based on an array internally, the List<> clearly stores its elements like an array, that is, in a contiguous block of memory. (Be aware that with a List<SomeObject> where SomeObject is a reference type, the list is a list of references, not of objects, so it is the references that are put in a contiguous block of memory. We will ignore that, with the advanced memory management of modern computers, the phrase "contiguous block of memory" isn't exact; it would be better to say "a contiguous block of addresses".)
(yes, even Dictionary<> and HashSet<> are built atop arrays. Conversely a tree-like collection could be built without using an array, because it's more similar to a LinkedList)
Some additional details: there are four groups of instructions in the CIL language (the intermediate language used in compiled .NET programs) that are used with "native" arrays:
Newarr
Ldelem and family Ldelem_*
Stelem and family Stelem_*
ReadOnly (don't ask me its use, I don't know, and the documentation isn't clear)
if you look at OpCodes.Newarr you'll see this comment in the XML documentation:
// Summary:
// Pushes an object reference to a new zero-based, one-dimensional array whose
// elements are of a specific type onto the evaluation stack.

Yes, elements in a List are stored contiguously, just like an array. A List actually uses arrays internally, but that is an implementation detail that you shouldn't really need to be concerned with.
Of course, in order to get the correct impression from that statement, you also have to understand a bit about memory management in .NET. Namely, the difference between value types and reference types, and how objects of those types are stored. Value types will be stored in contiguous memory. With reference types, the references will be stored in contiguous memory, but not the instances themselves.
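The value-versus-reference distinction can be demonstrated without inspecting memory. In this sketch, int stands in for a value type and StringBuilder for a reference type; both choices are arbitrary.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Value type: the list stores its own copy of the value inside its backing array.
int number = 1;
var values = new List<int> { number };
number = 2;
Console.WriteLine(values[0]);                   // 1: the stored copy is independent

// Reference type: the list stores a reference; the object itself lives elsewhere on the heap.
var builder = new StringBuilder("a");
var references = new List<StringBuilder> { builder };
builder.Append("b");
Console.WriteLine(references[0]);               // ab: same object, seen through the list

bool sameObject = object.ReferenceEquals(references[0], builder);
Console.WriteLine($"Same object: {sameObject}");
```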
The advantage of using a List is that the logic inside of the class handles allocating and managing the items for you. You can add elements anywhere, remove elements from anywhere, and grow the entire size of the collection without having to do any extra work. This is, of course, also what makes a List slightly slower than an array. If any reallocation has to happen in order to comply with your request, there'll be a performance hit as a new, larger-sized array is allocated and the elements are copied to it. But it won't be any slower than if you wrote the code to do it manually with a raw array.
If your length requirement is fixed (i.e., you never need to grow/expand the total capacity of the array), you can go ahead and use a raw array. It might even be marginally faster than a List because it avoids the extra overhead and indirection (although that is subject to being optimized out by the JIT compiler).
If you need to be able to dynamically resize the collection, or you need any of the other features provided by the List class, just use a List. The performance difference will be virtually imperceptible.

what is difference between string array and list of string in c#

I hear on MSDN that an array is faster than a collection.
Can you tell me how string[] is faster than List<string>?

Arrays are a lower level abstraction than collections such as lists. The CLR knows about arrays directly, so there's slightly less work involved in iterating, accessing etc.
However, this should almost never dictate which you actually use. The performance difference will be negligible in most real-world applications. I rarely find it appropriate to use arrays rather than the various generic collection classes, and indeed some consider arrays somewhat harmful. One significant downside is that there's no such thing as an immutable array (other than an empty one)... whereas you can expose read-only collections through an API relatively easily.

The article is from 2004, which means it's about .NET 1.1, and there were no generics.
Array vs collection performance actually was a problem back then because collection types caused a lot of extra boxing-unboxing operations. But since .NET 2.0, where generics were introduced, the difference in performance is almost gone.

An array is not resizable. This means that when it is created one block of memory is allocated, large enough to hold as many elements as you specify.
A List on the other hand is implicitly resizable. Each time you Add an item, the framework may need to allocate more memory to hold the item you just added. This is an expensive operation, so we end up saying "List is slower than array".
Of course this is a very simplified explanation, but hopefully enough to paint the picture.

An array is the simplest form of collection, so it's faster than other collections. A List (and many other collections) actually uses an array internally to hold its items.
An array is of course also limited by its simplicity. Most notably you can't change the size of an array. If you want a dynamic collection you would use a List.

List<string> is class with a private member that is a string[]. The MSDN documentation states this fact in several places. The List class is basically a wrapper class around an array that gives the array other functionality.
The answer of which is faster all depends on what you are trying to do with the list/array. For accessing and assigning values to elements, the array is probably negligibly faster since the List is an abstraction of the array (as Jon Skeet has said).
If you intend on having a data structure that grows over time (gets more and more elements), performance (ave. speed) wise the List will start to shine. That is because each time you resize an array to add another element it is an O(n) operation. When you add an element to a List (and the list is already at capacity) the list will double itself in size. I won't get into the nitty gritty details, but basically this means that increasing the size of a List is on average a O(log n) operation. Of course this has drawbacks too (you could have almost twice the amount of memory allocated as you really need if you only go a couple items past its last capacity).
Edit: I got a little mixed up in the paragraph above. As Eric has said below, the number of resizes for a List is O(log n), but the actual cost associated with resizing the array is amortized to O(1).
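The amortized argument can be checked by counting element copies under the two growth strategies. The simulation below is a simplified model, not List<T> itself: it assumes a starting capacity of 1, and CountCopies is a helper invented for the example.

```csharp
using System;

// Simulates `appends` additions and returns how many element copies the resizes cost.
static long CountCopies(int appends, Func<int, int> grow)
{
    int capacity = 1, count = 0;
    long copies = 0;
    for (int i = 0; i < appends; i++)
    {
        if (count == capacity)
        {
            copies += count;            // a resize copies every existing element
            capacity = grow(capacity);
        }
        count++;
    }
    return copies;
}

long doubling = CountCopies(1024, c => c * 2);   // 1 + 2 + 4 + ... + 512 = 1023
long plusOne = CountCopies(1024, c => c + 1);    // 1 + 2 + 3 + ... + 1023 = 523776

Console.WriteLine($"doubling: {doubling} copies, grow-by-one: {plusOne} copies");
```

With doubling, the total copy work stays below twice the number of appends, which is why each Add is O(1) amortized; growing by one makes the total quadratic.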

ASP.NET C# Lists Which and When?

In C# There seem to be quite a few different lists. Off the top of my head I was able to come up with a couple, however I'm sure there are many more.
List<String> Types = new List<String>();
ArrayList Types2 = new ArrayList();
LinkedList<String> Types4 = new LinkedList<String>();
My question is when is it beneficial to use one over the other?
More specifically I am returning lists of unknown size from functions and I was wondering if there is a particular list that was better at this.

List<String> Types = new List<String>();
LinkedList<String> Types4 = new LinkedList<String>();
are generic lists, i.e. you define the data type that goes in there, which reduces boxing and unboxing.
For the difference between List and LinkedList, see this --> When should I use a List vs a LinkedList
ArrayList is a non-generic collection, which can be used to store any type of data type.

99% of the time List is what you'll want. Avoid the non-generic collections at all costs.
LinkedList is useful for adding or removing without shuffling items around, although you have to forego random access as a result. One advantage it does have is you can remove items whilst iterating through the nodes.
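Removing while walking the nodes might look like this; the numbers are sample data. The one thing to watch is to read Next before unlinking the current node.

```csharp
using System;
using System.Collections.Generic;

var numbers = new LinkedList<int>(new[] { 1, 2, 3, 4, 5, 6 });

var node = numbers.First;
while (node != null)
{
    var next = node.Next;                 // remember the successor before unlinking
    if (node.Value % 2 == 0)
    {
        numbers.Remove(node);             // O(1): nothing has to be shifted
    }
    node = next;
}

string remaining = string.Join(",", numbers);
Console.WriteLine(remaining);             // 1,3,5
```

Doing the same inside a foreach over a List<T> would throw, because the list is modified during enumeration.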

ArrayList is a holdover from before Generics. There's really no reason to use them ... they're slow and use more memory than List<>. In general, there's probably no reason to use LinkedList either unless you are inserting midway through VERY large lists.
The only thing you'll find in .NET faster than a List<> is a fixed array ... but the performance difference is surprisingly small.

See the article on Commonly Used Collection Types from MSDN for a list of the various types of collections available to you, and their intended uses.

ArrayList is a .Net 1.0 list type.
List is a generic list introduced with generics in .Net 2.0.
Generic lists provide better compile-time support. Generic lists are type safe: you cannot add objects of the wrong type, so you always know the type of the stored objects. No type checks or casts are necessary.
I don't know about performance differences.
This question says something about the difference between List and LinkedList.
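The type-safety point above can be illustrated with a small sketch using sample values: ArrayList accepts anything and defers errors to run time, while List<string> rejects the wrong type at compile time.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

var untyped = new ArrayList();
untyped.Add("text");
untyped.Add(42);                      // accepted: everything is stored as object (the int is boxed)

bool castFailed = false;
int totalLength = 0;
try
{
    foreach (object item in untyped)
    {
        totalLength += ((string)item).Length;   // a cast is required, and fails on the boxed int
    }
}
catch (InvalidCastException)
{
    castFailed = true;
}
Console.WriteLine($"Cast failed at run time: {castFailed}");

var typed = new List<string> { "text" };
// typed.Add(42);                     // does not compile: cannot convert int to string
string first = typed[0];              // no cast needed
Console.WriteLine(first);
```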

As mentioned, don't use ArrayList if at all possible.
Here's a bit on Wikipedia about the differences between arrays and linked lists.
In summary:
Arrays
Fast random access
Fast inserting/deleting at end
Good memory locality
Linked Lists
Fast inserting/deleting at beginning
Fast inserting/deleting at end
Fast inserting/deleting at middle (with enumerator)

Generally, use List. Don't use ArrayList; it's obsolete. Use LinkedList in the rare cases where you need to be able to add without resizing and don't mind the overhead and loss of random access.

ArrayList is probably smaller, memory-wise, since it is based on an array. It also has fast random-access to elements. However, adding or removing to the list will take longer. This might be sped up slightly if the object over-allocates under the assumption that you are going to keep adding. (That will, of course, reduce the memory advantage.)
The other lists will be slightly larger (4-to-8 bytes more memory per element), and will have poor random access times. However, it is very fast to add or remove objects to the ends of the list. Also, memory usage is usually spot-on for what you need.
