I am working in C# and I am creating a list (newList) whose length I can get from another list (otherList). Is List<T> implemented in a way that preallocating the capacity from otherList.Count is better for performance, or should I just use newList.Add(obj) and not worry about the length?
The following constructor for List<T> is implemented for the purpose of improving performance in scenarios like yours:
http://msdn.microsoft.com/en-us/library/dw8e0z9z.aspx
public List(int capacity)
Just pass the capacity in the constructor.
newList = new List<string>(otherList.Count);
If you know the exact length of the new list, creating it with that capacity does perform a bit better.
The reason is that the implementation of List<T> internally uses an array. If this gets too small, a new array is created and the items from the old array are copied over to the new array.
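As a quick illustration (a minimal sketch, not taken from the linked documentation), you can watch the internal array being reallocated by observing Capacity as items are added:
var list = new List<string>();
int lastCapacity = list.Capacity;   // 0 before anything is added
for (int i = 0; i < 100; i++)
{
    list.Add(i.ToString());
    if (list.Capacity != lastCapacity)
    {
        // Each change means a new, larger internal array was allocated
        // and the existing items were copied over.
        Console.WriteLine($"Count {list.Count}: capacity is now {list.Capacity}");
        lastCapacity = list.Capacity;
    }
}
// On current implementations this prints capacities of 4, 8, 16, 32, 64, 128.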
Taken from the Remarks section on MSDN:
The capacity of a List<T> is the number of elements that the List<T>
can hold. As elements are added to a List<T>, the capacity is automatically increased as required by reallocating the internal array.
If the size of the collection can be estimated, specifying the initial capacity eliminates the need to perform a number of resizing operations while adding elements to the List<T>.
The capacity can be decreased by calling the TrimExcess method or by
setting the Capacity property explicitly. Decreasing the capacity
reallocates memory and copies all the elements in the List<T>.
So this suggests a performance gain if you have an estimate of the size of the list you are going to populate. The other side of this is that allocating a capacity that is too big uses up memory unnecessarily.
To be honest, I would not worry about this sort of micro optimization unless I really need to.
So I benchmarked the scenario of preallocation, and just wanted to share some numbers. The code simply times this:
var repetitions = 100000000;
var list = new List<object>();
for (var i = 0; i < repetitions; i++)
    list.Add(new object());
The comparison is between preallocating the list with the exact number of repetitions, preallocating twice that number, and not preallocating at all. These are the times:
List not preallocated: 6561, 6394, 6556, 6283, 6466
List preallocated: 5951, 6037, 5885, 6044, 5996
List double preallocated: 6710, 6665, 6729, 6760, 6624
So is preallocation worth it? Yes, if the number of objects in the list is known, or if a lower bound on that number is known (how many items the list will hold at least).
If not, you risk actually wasting time and memory, since the preallocation spends both up front to make room that may never be used.
Note also that in terms of memory, exact preallocation is the most efficient, since letting the list grow on its own will usually overshoot the tightest needed capacity.
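For reference, a minimal harness along these lines (a sketch with assumed details such as Stopwatch timing, not the exact code behind the numbers above) could look like:
const int repetitions = 100000000;

long Time(int capacity)
{
    var sw = System.Diagnostics.Stopwatch.StartNew();
    var list = capacity > 0 ? new List<object>(capacity) : new List<object>();
    for (var i = 0; i < repetitions; i++)
        list.Add(new object());
    return sw.ElapsedMilliseconds;
}

Console.WriteLine("Not preallocated:     " + Time(0));
Console.WriteLine("Exact preallocation:  " + Time(repetitions));
Console.WriteLine("Double preallocation: " + Time(repetitions * 2));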
Related
I've been making a lot of use of LINQ queries in the application I'm currently writing, and one of the situations that I keep running into is having to convert the LINQ query results into lists for further processing (I have my reasons for wanting lists).
I'd like to have a better understanding of what happens in this list conversion in case there are inefficiencies, since I've used it repeatedly now. So, given I execute a line like this:
var matches = (from x in list1 join y in list2 on x equals y select x).ToList();
Questions:
Is there any overhead here aside from the creation of a new list and its population with references to the elements in the Enumerable returned from the query?
Would you consider this inefficient?
Is there a way to get the LINQ query to directly generate a list to avoid the need for a conversion in this circumstance?
Well, it creates a copy of the data. That could be inefficient - but it depends on what's going on. If you need a List<T> at the end, List<T> is usually going to be close to as efficient as you'll get. The one exception to that is if you're going to just do a conversion and the source is already a list - then using ConvertAll will be more efficient, as it can create the backing array of the right size to start with.
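For example (a hedged sketch; the Person type and GetPeople method are made up for illustration), ConvertAll can size its result from the source list's Count, whereas Select(...).ToList() starts small and grows:
List<Person> people = GetPeople();   // hypothetical source that is already a List<T>

// ConvertAll allocates the destination at people.Count up front:
List<string> names = people.ConvertAll(p => p.Name);

// Select(...).ToList() starts at the default capacity and doubles as needed:
List<string> namesViaLinq = people.Select(p => p.Name).ToList();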
If you only need to stream the data - e.g. you're just going to do a foreach on it, and taking actions which don't affect the original data sources - then calling ToList is definitely a potential source of inefficiency. It will force the whole of list1 to be evaluated - and if that's a lazily-evaluated sequence (e.g. "the first 1,000,000 values from a random number generator") then that's not good. Note that as you're doing a join, list2 will be evaluated anyway as soon as you try to pull the first value from the sequence (whether that's in order to populate a list or not).
You might want to read my Edulinq post on ToList to see what's going on - at least in one possible implementation - in the background.
There is no other overhead apart from the ones you already mentioned.
I would say yes, but it depends on the concrete application scenario. In general it's better to avoid additional calls (I think that is obvious).
I'm afraid not. A LINQ query returns a sequence of data, which could potentially be an infinite sequence. Converting it to a List<T> makes it finite, and also gives you index access, which is not possible on a sequence or stream.
Suggestion: avoid situations where you need the List<T>. If you do need it, pull in only as much data as you need at the current moment.
Hope this helps.
In addition to what has been said, if the initial two lists that you're joining were already quite large, creating a third (creating an "intersection" of the two) could cause out of memory errors. If you just iterate the result of the LINQ statement, you'll reduce the memory usage dramatically.
Most of the overhead happens before the list creation: the connection to the database, getting the data into an adapter, and resolving the data type/structure behind the var declaration.
Efficiency is a very relative term. For a programmer who is not strong in SQL, LINQ is efficient: development is faster (relative to old ADO), at the cost of the overheads detailed above.
On the other hand, LINQ can call stored procedures in the database itself, which is already faster.
I suggest you run the following test:
Run your program on the maximal amount of data and measure the time.
Use a database procedure to export the data to a file (XML, CSV, ...), build your list from that file, and measure the time.
Then you can see whether the difference is significant.
The second way is less convenient for the programmer, but it can reduce the run time.
Enumerable.ToList(source) is essentially just a call to new List<T>(source).
This constructor will test whether source is an ICollection<T>, and if it is allocate an array of the appropriate size. In other cases, i.e. most cases where the source is a LINQ query, it will allocate an array with the default initial capacity (four items) and grow it by doubling the capacity as needed. Each time the capacity doubles, a new array is allocated and the old one is copied over into the new one.
This may introduce some overhead in cases where your list will have a lot of items (we're probably talking thousands at least). The overhead can be significant as soon as the list grows over 85 KB, as it is then allocated on the Large Object Heap, which is not compacted and may suffer from memory fragmentation. Note that I'm referring to the array in the list. If T is a reference type, that array contains only references, not the actual objects. Those objects then don't count towards the 85 KB limitation.
You could remove some of this overhead if you can accurately estimate the size of your sequence (where it is better to overestimate a little bit than it is to underestimate a little bit). For example, if you are only running a .Select() operator on something that implements ICollection<T>, you know the size of the output list.
In such cases, this extension method would reduce this overhead:
public static List<T> ToList<T>(this IEnumerable<T> source, int initialCapacity)
{
// parameter validation omitted for brevity
var result = new List<T>(initialCapacity);
foreach (T item in source)
{
result.Add(item);
}
return result;
}
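Usage would then look something like this (assuming people implements ICollection<T>, so its Count is cheap to read):
List<string> names = people.Select(p => p.Name).ToList(people.Count);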
In some cases, the list you create is just going to replace a list that was already there, e.g. from a previous run. In those cases, you can avoid quite a few memory allocations if you reuse the old list. That would only work if you don't have concurrent access to that old list though, and I wouldn't do it if new lists will typically be significantly smaller than old lists. If that's the case, you can use this extension method:
public static void CopyToList<T>(this IEnumerable<T> source, List<T> destination)
{
// parameter validation omitted for brevity
destination.Clear();
foreach (T item in source)
{
destination.Add(item);
}
}
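And a usage sketch for that one (the _matches field is hypothetical; this is only safe without concurrent readers of the old list):
var query = from x in list1 join y in list2 on x equals y select x;
query.CopyToList(_matches);   // refills the existing list instead of allocating a new one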
That being said, would I consider .ToList() inefficient? No, provided you have the memory and you're going to use the list repeatedly, either for a lot of random indexing into it or for iterating over it multiple times.
Now back to your specific example:
var matches = (from x in list1 join y in list2 on x equals y select x).ToList();
It may be more efficient to do this in some other way, for example:
var matches = list1.Intersect(list2).ToList();
which would yield the same results if list1 and list2 don't contain duplicates, and is very efficient if list2 is small.
The only way to really know though, as usual, is to measure using typical workloads.
I am looking for the most efficient way to store a collection of integers. Right now they're being stored in a HashSet<T>, but profiling has shown that these collections weigh heavily on some performance-critical code and I suspect there's a better option.
Some more details:
Random lookups must be O(1) or close to it.
The collections can grow large, so space efficiency is desirable.
The values are uniformly distributed in a 64-bit space.
Mutability is not needed.
There's no clear upper bound on size, but tens of millions of elements is not uncommon.
The most painful performance hit right now is creating them. That seems to be allocation-related - clearing and reusing HashSets helps a lot in benchmarks, but unfortunately that is not a feasible option in the application code.
(added) Implementing a data structure that's tailored to the task is fine. Is a hash table still the way to go? A trie also seems like a possibility at first glance, but I don't have any practical experience with them.
HashSet is usually the best general purpose collection in this case.
If you have any specific information about your collection you may have better options.
If you have a fixed upper bound that is not incredibly large you can use a bit vector of suitable size (see the sketch below).
If you have a very dense collection you can instead store the missing values.
If you have very small collections, <= 4 items or so, you can store them in a regular array. A full scan of such a small array may be faster than the hashing required to use the hash-set.
If you don't have any more specific characteristics of your data than "large collections of int" HashSet is the way to go.
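To make the bit-vector suggestion concrete, here is a minimal sketch using BitArray (the 10,000,000 upper bound is an assumption for illustration; it does not apply to the full 64-bit range in the question):
var maxValue = 10000000;                                       // assumed known upper bound
var present = new System.Collections.BitArray(maxValue + 1);   // one bit per possible value

void Add(int value) => present[value] = true;
bool Contains(int value) => present[value];

Add(42);
Console.WriteLine(Contains(42));   // True
Console.WriteLine(Contains(43));   // False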
If the size of the values is bounded you could use a bitset. It stores one bit per possible integer value, so in total the memory use would be n bits, with n being the greatest value you need to store.
Another option is a Bloom filter. Bloom filters are very compact, but you have to be prepared for an occasional false positive in lookups. You can find more about them on Wikipedia.
A third option is using a simple sorted array. Lookups are O(log n), with n being the number of integers. It may be fast enough.
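A sketch of that third option (sorted array plus Array.BinarySearch), under the assumption that the set is built once and then only queried; LoadValues is a hypothetical source:
long[] values = LoadValues();      // hypothetical source of the integers
Array.Sort(values);                // one-time cost: O(n log n)

bool Contains(long value) => Array.BinarySearch(values, value) >= 0;   // O(log n) per lookup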
I decided to try and implement a special purpose hash-based set class that uses linear probing to handle collisions:
Backing store is a simple array of longs
The array is sized to be larger than the expected number of elements to be stored.
For a value's hash code, use the least-significant 31 bits.
Searching for the position of a value in the backing store is done using a basic linear probe, like so:
int FindIndex(long value)
{
var index = (int)(value & 0x7FFFFFFF) % _storage.Length;
var slotValue = _storage[index];
if(slotValue == 0x0 || slotValue == value) return index;
for(++index; ; index++)
{
if (index == _storage.Length) index = 0;
slotValue = _storage[index];
if(slotValue == 0x0 || slotValue == value) return index;
}
}
(I was able to determine that the data being stored will never include 0, so that number is safe to use for empty slots.)
The array needs to be larger than the number of elements stored. (Load factor less than 1.) If the set is ever completely filled then FindIndex() will go into an infinite loop if it's used to search for a value that isn't already in the set. In fact, it will want to have quite a lot of empty space, otherwise search and retrieval may suffer as the data starts to form large clumps.
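For completeness, a sketch of how Contains and Add could sit on top of FindIndex, under the same assumptions (0 marks an empty slot, the set is never allowed to fill completely; the _count field is hypothetical):
public bool Contains(long value)
{
    return _storage[FindIndex(value)] == value;
}

public bool Add(long value)
{
    int index = FindIndex(value);
    if (_storage[index] == value) return false;   // already present
    _storage[index] = value;                      // claim the empty slot found by the probe
    _count++;
    return true;
}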
I'm sure there's still room for optimization, and I may end up using some sort of BigArray<T> or sharding for the backing store on large sets. But initial results are promising. It performs over twice as fast as HashSet<T> at a load factor of 0.5, nearly twice as fast with a load factor of 0.8, and even at 0.9 it's still working 40% faster in my tests.
Overhead is 1 / load factor, so if those performance figures hold out in the real world then I believe it will also be more memory-efficient than HashSet<T>. I haven't done a formal analysis, but judging by the internal structure of HashSet<T> I'm pretty sure its overhead is well above 10%.
--
So I'm pretty happy with this solution, but I'm still curious if there are other possibilities. Maybe some sort of trie?
--
Epilogue: Finally got around to doing some competitive benchmarks of this vs. HashSet<T> on live data. (Before I was using synthetic test sets.) It's even beating my optimistic expectations from before. Real-world performance is turning out to be as much as 6x faster than HashSet<T>, depending on collection size.
What I would do is just create an array of integers with a size sufficient to handle however many integers you need. Is there any reason for staying away from the generic List<T>? http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx
The most painful performance hit right now is creating them...
As you've obviously observed, HashSet<T> does not have a constructor that takes a capacity argument to initialize its capacity.
One trick which I believe would work is the following:
int capacity = ... some appropriate number;
int[] items = new int[capacity];
HashSet<int> hashSet = new HashSet<int>(items);
hashSet.Clear();
...
Looking at the implementation with Reflector, this will initialize the capacity to the size of the items array, ignoring the fact that this array contains duplicates. It will, however, only actually add one value (zero), so I'd assume that initializing and clearing should be reasonably efficient.
I haven't tested this so you'd have to benchmark it. And be willing to take the risk of depending on an undocumented internal implementation detail.
It would be interesting to know why Microsoft didn't provide a constructor with a capacity argument like they do for other collection types.
I'm currently working with an app that will do the following.
// Initialize a list:
myList = new List<aPoint>();

while (WeHaveMoreData)
    myList.AddRange(ReturnNext1000Points());
I have no way of knowing the total size of the list from the beginning. From what I've read, List<> is the best way to handle this much data incoming (could be upwards of 500k records).
I'm wondering if I should manage the capacity of the list (give it an initial value, or increase the capacity when it needs it)?
How do I approach optimizing such a procedure?
If you have an approximation of the total records, you could set the capacity of the list; otherwise leave it to grow. It is pretty well optimized; just ensure you don't run out of memory. Another approach is to use a lazy iterator which won't load the entire list into memory:
public IEnumerable<aPoint> GetPoints()
{
while(WeHaveMoreData)
{
yield return new aPoint();
}
}
It is only once you start iterating that records will begin to be fetched, one by one and released immediately:
foreach (var point in GetPoints())
{
/// TODO: do something with the point
}
First rule: premature optimization is the root of all evil. If performance is not an issue, leave it as is. Otherwise you should try setting the initial size of the list to about AverageExpectedSize / 0.7.
I also think you can't optimize it much. I guess you could do slightly better in some specific cases, so I have a question: what do you do with the data afterwards? Also, do you want to optimize for memory or speed?
A typical list implementation will grow capacity by a factor of 2x every time, so maybe you could save some space by having a List<aPoint[]>, which would have far fewer elements, making it less likely that you have a few hundred thousand items of spare capacity. But that would only matter if you were just about to run out of memory; it's likely that much more memory is spent on the data itself in any case.
In general, I would say that if you don't know the number of elements to within, say, +/- 20%, then you probably should just add blindly to the List instead of guessing the capacity.
List is different than an array when it comes to matters of adding when at capacity. Remember that the List will double its capacity once you exceed the capacity. So for example, if your list has a current capacity of 128 elements and you add an element that makes it 129 elements, the list will resize its capacity to 256 elements. Then for the next 128 Adds you don't resize the list at all. Once you get to 257, it will double to 512, and the process repeats itself.
Thus you will have O(log(n)) resizes to your list.
In my program I have a bunch of growing arrays where new elements are appended one by one at the end of the array. I identified Lists to be a speed bottleneck in a critical part of my program due to their slow access time in comparison with an array; switching to an array increased performance tremendously, to an acceptable level. So to grow the array I'm using Array.Resize. This works well, as my implementation restricts the array size to approximately 20 elements, so the O(N) performance of Array.Resize is bounded.
But it would be better if there was a way to just add one element at the end of an array without having to use Array.Resize, which I believe copies the old array into the newly sized array.
So my question is: is there a more efficient method for adding one element to the end of an array without using List or Array.Resize?
A List has constant time access just like an array. For 'growing arrays' you really should be using List.
When you know that you may be adding elements to an array-backed structure, you don't want to add one new slot at a time. Usually it is best to grow an array by doubling its size when it fills up.
As has been previously mentioned, List<T> is what you are looking for. If you know the initial size of the list, you can supply an initial capacity to the constructor, which will increase your performance for your initial allocations:
List<int> values = new List<int>(5);
values.Add(1);
values.Add(2);
values.Add(3);
values.Add(4);
values.Add(5);
Lists allocate space for 4 elements to begin with (unless you specify a capacity when you construct one) and then double their capacity each time they run out of room.
Why don't you try a similar thing with an array? I.e. create it with 4 elements, then when you insert the fifth element, first grow the array (by doubling it, or by another fixed chunk) before adding.
There is no way to resize an array in place; Array.Resize simply creates a new, larger array and copies the existing elements over, so that is the only way to get a bigger one.
Why not just create the arrays to have 20 elements from start (or whatever capacity you need at most), and use a variable to keep track of how many elements are used in the array? That way you never have to resize any arrays.
Growing an array AFAIK means that a new array is allocated, the existing content being copied to the new instance. I doubt that this should be faster than using List...?
It's much faster to resize an array in chunks (of 10, say), store the capacity in a separate variable, and only resize the array when that capacity is reached. This is how a list works, but if you prefer to use arrays then you should look into resizing them in larger chunks, especially if you have a large number of Array.Resize calls.
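A minimal sketch of that idea (the class name, chunk size, and field names are illustrative): track the logical count separately and only call Array.Resize when the physical capacity runs out.
class GrowingLongArray
{
    private const int ChunkSize = 10;            // illustrative chunk size
    private long[] _items = new long[ChunkSize];
    private int _count;

    public int Count => _count;

    public void Append(long value)
    {
        if (_count == _items.Length)
            Array.Resize(ref _items, _items.Length + ChunkSize);   // grow in chunks, not per element
        _items[_count++] = value;
    }

    public long this[int index] => _items[index];   // read access without exposing spare capacity
}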
I think that any approach based on a plain array will never be well optimized for this, because an array is a fixed-size structure, so it's better to use a dynamic structure like List<T> or something similar.
Are C# lists fast? What are the good and bad sides of using lists to handle objects?
Extensive use of lists will make software slower? What are the alternatives to lists in C#?
How many objects is "too many objects" for lists?
List<T> uses a backing array to hold items:
Indexer access (i.e. fetch/update) is O(1)
Remove from tail is O(1)
Remove from elsewhere requires existing items to be shifted up, so O(n) effectively
Add to end is O(1) unless it requires resizing, in which case it's O(n). (This doubles the size of the buffer, so the amortized cost is O(1).)
Add to elsewhere requires existing items to be shifted down, so O(n) effectively
Finding an item is O(n) unless it's sorted, in which case a binary search gives O(log n)
It's generally fine to use lists fairly extensively. If you know the final size when you start populating a list, it's a good idea to use the constructor which lets you specify the capacity, to avoid resizing. Beyond that: if you're concerned, break out the profiler...
Compared to what?
If you mean List<T>, then that is essentially a wrapper around an array; so it is fast to read/write by index, relatively fast to append (since it keeps extra space at the end, doubling in size when necessary) and to remove from the end, but more expensive for other operations (insert/delete anywhere other than the end).
An array is again fast by index, but fixed size (no append/delete)
Dictionary<,> etc offer better access by key
A list isn't intrinsically slow, especially if you know you always need to look at all the data or can access it by index. But for large lists it may be better (and more convenient) to search via a key. There are various dictionary implementations in .NET, each with different trade-offs in size and performance.
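As a rough sketch of that trade-off (the Customer type, GetCustomers method, and targetId are made up for illustration): a key lookup against a list scans linearly, while a dictionary built once gives near-constant-time lookups afterwards.
List<Customer> customers = GetCustomers();        // hypothetical list

// O(n) per lookup - scans the list:
Customer slow = customers.Find(c => c.Id == targetId);

// Build a dictionary once, then each lookup is O(1) on average:
var byId = customers.ToDictionary(c => c.Id);
Customer fast = byId[targetId];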