Why is the HashSet<T> class not used to implement Enumerable.Distinct? - C#

I needed to assess the asymptotic time and space complexity of IEnumerable.Distinct in big O notation.
So I was looking at the implementation of the extension method Enumerable.Distinct, and I see it is implemented using an internal class Set<T>, which is almost a classical implementation of a hash table with "open addressing".
What quickly catches the eye is that a lot of code in Set<T> is just a copy-paste from HashSet<T>, with some omissions.
However, this simplified Set<T> implementation has some obvious flaws; for example, the Resize method does not use prime numbers for the size of the slots, like HashSet<T> does (see HashHelpers.ExpandPrime).
So, my questions are:
What is the reason for the code duplication here, why not stick to the DRY principle? Especially given the fact that both of these classes are in the same assembly, System.Core.
It looks like HashSet<T> will perform better, so should I avoid using the Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>?

which is almost a classical implementation of a hash table with "open addressing"
Look again. It's separate chaining with list head cells. While the slots are all in an array, finding the next slot in the case of collision is done by examining the next field of the current slot. This has better cache efficiency than using linked lists with each node as a separate heap object, though not as good as open addressing in that regard. At the same time, it avoids some of the cases where open addressing does poorly.
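To illustrate the layout (the type and field names below are mine, not the actual Set<T> source), separate chaining with list head cells looks roughly like this:
// All slots live in one array; a collision is handled by following the
// "next" index within that same array instead of a separate linked-list node.
internal struct Slot<T>
{
    internal int hashCode;   // cached hash code of the value
    internal T value;        // the stored element
    internal int next;       // index of the next slot in this bucket's chain, or -1
}

internal sealed class SketchSet<T>
{
    private int[] buckets;    // buckets[hash % buckets.Length] holds the index of a chain head
    private Slot<T>[] slots;  // contiguous storage, hence the cache-friendliness
    // ... Add/Contains/Resize omitted ...
}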
a lot of code in Set is just a copy-paste from HashSet, with some omissions
AFAICT the reason a private implementation of a hash-set was used is that Enumerable and HashSet were developed independently at about the same time. That's just conjecture on my part, but they were both introduced with .NET 3.5 so it's feasible.
It's quite possible that HashSet<T> started by copying Set<T> and then making it better suited to being exposed publicly, though it's also possible that the two were both based on the same principle of separate chaining with list head cells.
In terms of performance, HashSet<T>'s use of prime numbers means it's more likely to avoid collisions with poor hashes (though just how much of an advantage that is, is not a simple question), but Set<T> is lighter in a lot of ways, especially in .NET Core where some things it doesn't need were removed. In particular, that version of Set<T> takes advantage of the fact that once an item is removed (which happens, for example, during Intersect) there will never be an item added, which allows it to leave out the freelist and any work related to it, which HashSet<T> couldn't do. Even the initial implementation is lighter in not tracking a version to catch changes during enumeration, which is a small cost, but a cost to every addition and removal nevertheless.
As such, with different sets of data with different distributions of hash codes sometimes one performs better, sometimes the other.
Especially given the fact that both of these classes are in the same assembly System.Core
Only in some versions of .NET, in some they're in separate assemblies. In .NET Core we had two versions of Set<T>, one in the assembly that has System.Linq and one in the separate assembly that has System.Linq.Expressions. The former got trimmed down as described above, the latter replaced with a use of HashSet<T> as it was doing less there.
Of course System.Core came first, but the fact that those elements could be separated out at all speaks of System.Core not being a single monolithic blob of inter-dependencies.
That there is now a ToHashSet() method in .NET Core's version of Linq makes the possibility of replacing Set<T> with HashSet<T> more justifiable, though not a no-brainer. I think #james-ko was considering testing the benefits of doing that.
It looks like HashSet<T> will perform better
For the reasons explained above, that might not be the case, though it might indeed, depending on source data. That's before getting into considerations of optimisations that go across a few different linq methods (not many in the initial versions of linq, but a good few in .NET Core).
so should I avoid using Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>.
Use Distinct(). If you have a bottleneck then it might be that HashSet<T> will win with a given data-set, but if you do try that, make sure your profiling closely matches the real values your code will encounter in real life. There's no point deciding one approach is the faster one based on some arbitrary tests if your application hits cases where the other does better. (And if I was finding this a problem spot, I'd take a look at whether the GetHashCode() of the types in question could be improved for either speed or distribution of bits, first.)
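If you do want to experiment with that, the HashSet<T>-backed version is only a few lines; a sketch for benchmarking purposes (not a drop-in replacement for the real implementation's argument checks and optimisations):
using System.Collections.Generic;

public static class MyEnumerableExtensions
{
    // Hypothetical HashSet<T>-backed alternative to Enumerable.Distinct,
    // intended only for comparing against the built-in implementation.
    public static IEnumerable<TSource> DistinctWithHashSet<TSource>(
        this IEnumerable<TSource> source,
        IEqualityComparer<TSource> comparer = null)
    {
        var seen = new HashSet<TSource>(comparer);
        foreach (var item in source)
        {
            // HashSet<T>.Add returns false for values already seen,
            // so each distinct value is yielded exactly once, lazily.
            if (seen.Add(item))
                yield return item;
        }
    }
}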

Related

Should ConditionalWeakTable<TKey, TValue> be used for non-compiler purposes?

I've recently come across the ConditionalWeakTable<TKey,TValue> class in my search for an IDictionary which uses weak references, as suggested in answers here and here.
There is a definitive MSDN article which introduced the class and which states:
You can find the class ... in the System.Runtime.CompilerServices namespace. It’s in CompilerServices because it’s not a general-purpose dictionary type: we intend for it to only be used by compiler writers.
and later again:
...the conditional weak table is not intended to be a general purpose collection... But if you’re writing a .NET language of your own and need to expose the ability to attach properties to objects you should definitely look into the Conditional Weak Table.
In line with this, the MSDN entry description of the class reads:
Enables compilers to dynamically attach object fields to managed objects.
So obviously it was originally created for a very specific purpose - to help the DLR - and the System.Runtime.CompilerServices namespace embodies this. But it seems to have found a much wider use than that - even within the CLR. If I search for references to ConditionalWeakTable in ILSpy, for example, I can see that it is used in the MEF class CatalogExportProvider and in the internal WPF DataGridHelper class, amongst others.
My question is whether it is okay to use ConditionalWeakTable outside of compiler writing and language tools, and whether there is any risk in doing so in terms of incurring additional overhead or of the implementation changing significantly in future .NET versions. (Or should it be avoided and a custom implementation like this one be used instead).
There is also further reading here, here and here about how ConditionalWeakTable makes use of a hidden CLR implementation of ephemerons (via System.Runtime.CompilerServices.DependentHandle) to deal with the problem of cycles between keys and values, and how this cannot easily be accomplished in a custom manner.
I don't see anything wrong with using ConditionalWeakTable. If you need ephemerons, you pretty much have no other choice.
I don't think future .NET versions will be a problem - even if only compilers would use this class, Microsoft still couldn't change it without breaking compatibility with existing binaries.
As for overhead - there certainly will be overhead compared to a normal Dictionary. Having many DependentHandles will probably be expensive, similar to how many WeakReferences are more expensive than normal references (the GC has to do additional work to scan them to see if they need to be nulled out). But that's not a problem unless you have lots (several million) of entries.
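For completeness, basic usage is simple. A small, hypothetical sketch of attaching extra state to objects you don't own (the Metadata type and Tagger class are invented for illustration):
using System.Runtime.CompilerServices;

class Metadata { public string Tag; }

static class Tagger
{
    // Entries do not keep the key alive; once a key is collected,
    // its value becomes eligible for collection too (ephemeron semantics).
    private static readonly ConditionalWeakTable<object, Metadata> table =
        new ConditionalWeakTable<object, Metadata>();

    public static void SetTag(object target, string tag) =>
        table.GetOrCreateValue(target).Tag = tag;

    public static string GetTag(object target) =>
        table.TryGetValue(target, out var meta) ? meta.Tag : null;
}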

Immutable Data Structures in C#

I was reading some entries in Eric Lippert's blog about immutable data structures and I got to thinking, why doesn't C# have this built into the standard libraries? It seems strange for something with obvious reuse to not be already implemented out of the box.
EDIT: I feel I might be misunderstood on my question. I'm not asking how to implement one in C#, I'm asking why some of the basic data structures (Stack, Queue, etc.) aren't already available as immutable variants.
It does now.
.NET just shipped their first immutable collections, which I suggest you try out.
Any framework, language, or combination thereof that is not a purely experimental exercise has a market. Some purely experimental ones go on to develop a market.
In this sense, "market" does not necessarily refer to market economics; it's as true whether the producers of the framework/language/both are commercially or non-commercially oriented and distributing the framework/language/both (I'm just going to say "framework" from now on) at a cost or for free. Indeed, free-as-in-both-beer-and-speech projects can be even more heavily dependent on their markets than commercial projects in this way, because their producers are a subset of their market. The market is anyone and everyone who uses it.
The nature of this market will affect the framework in several ways both by organic processes (some parts prove more popular than others and get a bigger share of the mindspace within the community that educates its own members about them) and by fiat (someone decides a feature will serve the market and therefore adds it).
When .NET was developed, it was developed to serve its future market. Ideas about what would serve them therefore influenced decisions as to what should and should not be included in the FCL, the runtime, and the languages that work with it.
Someone decided that we'd quite likely need System.Collections.ArrayList. Someone decided we'd quite likely need System.IO.DirectoryInfo to have a Delete() method. Nobody decided we'd be likely to need a System.Collections.ImmutableStack.
Maybe nobody thought of it at all. Maybe someone did and even implemented it and then it was decided not to be of enough general use. Maybe it was debated at length within Microsoft. I've never worked for MS, so I don't have a clue.
What I can consider, though, is the question of what the people who were using the .NET framework in 2002 were using in 2001.
Well, COM, ActiveX, ("Classic") ASP, and VB6 & VBScript are now much less used than they were, so they can be said to have been replaced by .NET. Indeed, that could be said to have been an intention.
As well as VB6 & VBScript, a considerable number who were writing in C++ and Java with Windows as a sole or primary target platform are now at least partly using .NET instead. Again, I think that could be said to be an intention, or at the very least I don't think MS were surprised that it went that way.
In COM we had an enumerator-object based foreach approach to iteration that had direct support in some languages (the VB family*), and in .NET we have an enumerator-object based foreach approach to iteration that has direct support in some languages (C#, VB.NET, and others)†.
In C++ we had a rich set of collection types from the STL, and in .NET we have a rich set of collection types from the FCL (and typesafe generic types from .NET2.0 on).
In Java we had a strong everything-is-an-object style of OOP with a small set of methods provided by a common base-type and a boxing mechanism to allow for simple efficient primitives while keeping to this style. In .NET we have a strong everything-is-an-object style of OOP with a small set of methods provided by a common base-type and a (different) boxing mechanism to allow for simple efficient primitives while keeping to this style.
These cases show choices that are unsurprising considering who was likely to end up being the market for .NET (though such broad statements above shouldn't be read in a way that underestimates the amount of work and subtlety of issues within each of them). Another thing that relates to this is when .NET differs from COM or classic VB or C++ or Java, there may well be a bit of an explanation given in the documentation. When .NET differs from Haskell or Lisp, nobody feels the need to point it out!
Now, of course there are things done differently in .NET than in any of the above (or there'd have been no point and we could have stayed with COM etc.).
However, my point is that out of the near-infinite range of possible things that could end up in a framework like .NET, there are some complete no-brainers ("they might need some sort of string type..."), some close-to-obvious ("this is really easy to do in COM, so it must be easy to do in .NET"), some harder calls ("this will be more complicated than in VB6, but the benefits are worth it"), some improvements ("locale support could really be made a lot easier for developers, so lets build a new approach to the old issue") and some that were less related to the above.
At the other extreme, we can probably all imagine something that would be so out there as to be bizarre ("hey, all coders like Conway's Life - let's put Conway's Life right into the framework") and hence there's no surprise at not finding it supported.
So far I've quickly brushed over a lot of hard work and difficult design balances in a way that makes the design they came up with seem simpler than it no doubt was. Most likely, the more "obvious" it seems to an outsider after the fact, the more difficult it was for the designers.
Immutable collection types fall into the large range of possible components to the FCL that, while not as bizarre as the built-in Conway's Life idea, were not as strongly called for by examining the market as a mutable list or a way to encapsulate locale information nicely. Immutable collections would have been novel to much of the initial market, and therefore at risk of ending up not being used. In an alternate universe there's a .NET 1.0 with immutable collections, but it's not very surprising that there wasn't one here.
*At least for consuming. Producing IEnumVARIANT implementations in VB6 wasn't simple, and could involve writing pointer values straight into v-tables in a rather nasty way that it suddenly occurs to me, is possibly not even allowed by today's DEP.
†With a sometimes impossible to implement .Reset() method. Is there any reason for this other than it was in IEnumVARIANT? Was it even ever much used in IEnumVARIANT?
I'll quote from that Eric Lippert blog that you've been reading:
because no one ever designed, specified, implemented, tested, documented and shipped that feature.
In other words, there is no reason other than it hasn't been high enough value or priority to get done ahead of all the other things they're working on.
Why can't you make an immutable struct? Consider the following:
public struct Coordinate
{
    public int X
    {
        get { return _x; }
    }
    private int _x;

    public int Y
    {
        get { return _y; }
    }
    private int _y;

    public Coordinate(int x, int y)
    {
        _x = x;
        _y = y;
    }
}
It's an immutable value type.
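A quick usage sketch of why this counts as immutable: there is no way to change an existing Coordinate, so "modification" means constructing a new value.
var origin = new Coordinate(0, 0);
// origin.X = 5;  // does not compile: X has no setter and _x is private
var moved = new Coordinate(origin.X + 5, origin.Y);  // create a new value instead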
It's hard to work with immutable data structures unless you have some functional programming constructs. Suppose you wanted to create an immutable vector containing every other capital letter. How would you do it unless you
A) had functions that did things like range(65, 91), filter(only even) and map(int -> char) to create the sequence in one shot and then turn it into an array
B) created the vector as mutable, added the characters in a loop, then "froze" it, making it immutable?
By the way, C# does have the B option to some extent -- ReadOnlyCollection can wrap a mutable collection and prevent people from mutating it. However, it's a pain in the ass to do that all the time (and obviously it's hard to support sharing structure between copies when you don't know if something is going to become immutable or not.) A is a better option.
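For illustration (not something you could have written in C# 1.0), option A for the capital-letters example is a one-liner with today's LINQ:
// Every other capital letter: range -> filter -> map -> freeze into an array.
// Requires "using System.Linq;".
char[] everyOtherCapital = Enumerable.Range(65, 26)   // code points for 'A'..'Z'
    .Where(i => i % 2 == 0)                           // keep only the even code points
    .Select(i => (char)i)                             // int -> char
    .ToArray();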
Remember, when C# 1.0 existed, it didn't have anonymous functions, it didn't have language support for generators or other laziness, it didn't have any functional APIs like LINQ -- not even map or filter -- it didn't have concise array initialization syntax (you couldn't write new int[] { 1, 2, 5 }) and it didn't have generic types; just putting stuff into and getting stuff out of collections normally was a pain. So I don't think it would have been a great choice to spend time on making robust immutable collections with such poor language support for using them.
It would be nice if .net had some really solid support for immutable data holders (classes and structures). One difficulty with adding really good support for such things, though, is that taking maximum advantage of mutable and immutable data structures would require some fundamental changes to the way inheritance works. While I would like to see such support in the next major object-oriented framework, I don't know that it can be efficiently worked into existing frameworks like .net or Java.
To see the problem, imagine that there are two basic data types: basicItem and deluxeItem (which is a basicItem with a few extra fields added). Each can exist in two concrete forms: mutable and immutable. Each can also be described in an abstract form: readable. Thus, there should be six data types; all but ReadableBasicItem should be substitutable for at least one other:
ReadableBasicItem: Not substitutable for anything
MutableBasicItem: ReadableBasicItem
ImmutableBasicItem: ReadableBasicItem
ReadableDeluxeItem: ReadableBasicItem
MutableDeluxeItem: ReadableDeluxeItem, MutableBasicItem (also ReadableBasicItem)
ImmutableDeluxeItem: ReadableDeluxeItem, ImmutableBasicItem (also ReadableBasicItem)
Even though the underlying data type has just one base and one derived type, the inheritance graph has two "diamonds": MutableDeluxeItem has two parents (MutableBasicItem and ReadableDeluxeItem) and ImmutableDeluxeItem has two parents (ImmutableBasicItem and ReadableDeluxeItem), all of which inherit from ReadableBasicItem. Existing class architectures cannot effectively deal with that. Note that it wouldn't be necessary to support generalized multiple inheritance; merely to allow some specific variants such as those above (which, despite having "diamonds" in the inheritance graph, have an internal structure such that both ReadableDeluxeItem and MutableBasicItem would inherit from "the same" ReadableBasicItem).
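For what it's worth, the nearest approximation in C# today is to express the "readable" views as interfaces, which permits the diamond shape but without sharing any implementation; a rough, hypothetical sketch (names mirror the ones above):
interface IReadableBasicItem { int BasicField { get; } }
interface IReadableDeluxeItem : IReadableBasicItem { int ExtraField { get; } }

// A class can implement both "readable" views, but it cannot also inherit
// from a MutableBasicItem base class - that second parent is what C# forbids.
sealed class MutableDeluxeItem : IReadableDeluxeItem
{
    public int BasicField { get; set; }
    public int ExtraField { get; set; }
}

sealed class ImmutableDeluxeItem : IReadableDeluxeItem
{
    public ImmutableDeluxeItem(int basicField, int extraField)
    {
        BasicField = basicField;
        ExtraField = extraField;
    }
    public int BasicField { get; }
    public int ExtraField { get; }
}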
Also, while support for that style of inheritance of mutable and immutable types might be nice, the biggest payoff wouldn't happen unless the system had a means of distinguishing heap-stored objects that should expose value semantics from those which should expose reference semantics, could distinguish mutable objects from immutable ones, and could allow objects to start out in an "uncommitted" state (neither mutable nor guaranteed immutable). Copying a reference to a heap object with mutable value semantics should perform a memberwise clone on that object and any nested objects with mutable value semantics, except in cases where the original reference would be guaranteed destroyed; the clones should start life as uncommitted, but be CompareExchange'd to mutable or immutable as needed.
Adding framework support for such features would allow copy-on-write value semantics to be implemented much more efficiently than would be possible without framework support, but such support would really have to be built into the framework from the ground up. I don't think it could very well be overlaid onto an existing framework.

Performance concern when using LINQ "everywhere"?

After upgrading to ReSharper5 it gives me even more useful tips on code improvements. One I see everywhere now is a tip to replace foreach-statements with LINQ queries. Take this example:
private Ninja FindNinjaById(int ninjaId)
{
    foreach (var ninja in Ninjas)
    {
        if (ninja.Id == ninjaId)
            return ninja;
    }
    return null;
}
ReSharper suggests replacing this with the following LINQ query:
private Ninja FindNinjaById(int ninjaId)
{
    return Ninjas.FirstOrDefault(ninja => ninja.Id == ninjaId);
}
This looks all fine, and I'm sure it's no problem regarding performance to replace this one foreach. But is it something I should do in general? Or might I run into performance problems with all these LINQ queries everywhere?
You need to understand what the LINQ query is going to do "under the hood" and compare that to running your code before you can know whether you should change it. Generally, I don't mean that you need to know the exact code that will be generated, but you do need to know the basic idea of how it would go about performing the operation. In your example, I would surmise that LINQ would basically work about the same as your code and because the LINQ statement is more compact and descriptive, I would prefer it. There are times, though, when LINQ may not be the ideal choice, though probably not many. Generally I would think that just about any looping construct would be replaceable by an equivalent LINQ construct.
Let me start by saying that I love LINQ for its expressiveness and use it all the time without any problem.
There are however some differences in performance. Normally they are small enough to ignore, but in the critical path of your application, there might be times you want to optimize them away.
Here is the set of differences that you should be aware of, that could matter with performance:
LINQ uses delegate calls excessively, and delegate invocations are (a very tiny bit) slower than method invocations and of course slower than inline code.
A delegate is a method pointer inside an object. That object needs to be created.
LINQ operators usually return a new object (an iterator) that allows looping through the collection. Chained LINQ operators thus create multiple new objects.
When your lambda uses variables from the enclosing scope (captured in a closure), those variables have to be hoisted into an object as well (which needs to be created).
Many LINQ operators call the GetEnumerator method on a collection to iterate it. Calling GetEnumerator usually entails the creation of yet another object.
Iterating the collection is done using the IEnumerator interface. Interface calls are a bit slower than normal method calls.
IEnumerator objects often need to be disposed, or at least Dispose has to be called.
When performance is a concern, also try using for over foreach.
Again, I love LINQ and I can't remember ever deciding not to use a LINQ (to objects) query because of performance. So, don't do any premature optimizations. Start with the most readable solution first, then optimize when needed. And profile, profile, profile.
One thing we identified as a performance problem is creating lots of lambdas and iterating over small collections. What happens in the converted sample?
Ninjas.FirstOrDefault(ninja => ninja.Id == ninjaId)
First, a new instance of the (compiler-generated) closure type is created. That's a new object on the managed heap, and some work for the GC.
Second, a new delegate instance is created from a method on that closure.
Then the method FirstOrDefault is called. What does it do?
It iterates the collection (same as your original code) and calls the delegate.
So basically, you have 4 things added here:
1. Create closure
2. Create delegate
3. Call through delegate
4. Collect closure and delegate
If you call FindNinjaById lots of times, this adds up to what may be an important performance hit. Of course, measure it.
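Roughly speaking (class and method names here are invented; the real compiler-generated names are unpronounceable), the converted sample expands to something like this:
// Approximate expansion of: Ninjas.FirstOrDefault(ninja => ninja.Id == ninjaId)
private sealed class DisplayClass          // the closure type
{
    public int ninjaId;                    // the captured parameter
    public bool Predicate(Ninja ninja)     // the lambda body
    {
        return ninja.Id == this.ninjaId;
    }
}

private Ninja FindNinjaById(int ninjaId)
{
    var closure = new DisplayClass { ninjaId = ninjaId };   // 1. create closure
    Func<Ninja, bool> predicate = closure.Predicate;        // 2. create delegate
    return Ninjas.FirstOrDefault(predicate);                // 3. call through delegate
}                                                           // 4. closure and delegate become garbage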
If you replace it with (equivalent)
Ninjas.Where(ninja => ninja.Id == ninjaId).FirstOrDefault()
it adds
5. Creating a state machine for the iterator (Where is a yielding function)
Profile
The only way to know for sure is to profile. Yes, certain queries can be slower. But when you look at what ReSharper has replaced here, it's essentially the same thing, done in a different manner. The ninjas are looped, each Id is checked. If anything, you could argue this refactoring comes down to readability. Which of the two do you find easier to read?
Larger data sets will have a bigger impact sure, but as I've said, profile. It's the only way to be sure if such enhancements have a negative effect.
We've built massive apps, with LINQ sprinkled liberally throughout. It's never, ever slowed us down.
It's perfectly possible to write LINQ queries that will be very slow, but it's easier to fix simple LINQ statements than enormous for/if/for/return algorithms.
Take ReSharper's advice :)
An anecdote: when I was just getting to know C# 3.0 and LINQ, I was still in my "when you have a hammer, everything looks like a nail" phase. As a school assignment, I was supposed to write a Connect Four (four in a row) game as an exercise in adversarial search algorithms. I used LINQ throughout the program. In one particular case, I needed to find the row a game-piece would land on if I dropped it in a particular column. Perfect use-case for a LINQ query! This turned out to be really slow. However, LINQ wasn't the problem; the problem was that I was searching to begin with. I optimized this by just keeping a look-up table: an integer array containing the row number for every column of the game-board, updating that table when inserting a game-piece. Needless to say, this was much, much faster.
Lesson learned: optimize your algorithm first, and high level constructs like LINQ might actually make that easier.
That said, there is a definite cost to creating all those delegates. On the other hand, there can also be a performance benefit by utilizing LINQ's lazy nature. If you manually loop over a collection, you're pretty much forced to create intermediate List<>'s whereas with LINQ, you basically stream the results.
The suggested replacement does the exact same thing as the original loop.
As long as you use your LINQ queries correctly you will not suffer from performance issues. If you use it correctly it is more likely to be faster due to the skill of the people creating LINQ.
The only reason to write your own is if you want full control, or LINQ does not offer what you need, or you want better debugging ability.
The cool thing about LINQ queries is that it makes it dead simple to convert to a parallel query. Depending on what you're doing, it may or may not be faster (as always, profile), but it's pretty neat, nonetheless.
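For example, a sketch of that conversion on the earlier code (ExpensiveCheck stands in for whatever per-item work you actually do; with a predicate as cheap as an Id comparison, the parallel version would almost certainly lose):
// Parallelising a LINQ-to-objects query is mostly a matter of inserting AsParallel().
var matches = Ninjas
    .AsParallel()                          // partitions the source across worker threads
    .Where(ninja => ExpensiveCheck(ninja)) // ExpensiveCheck is a hypothetical costly predicate
    .ToList();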
To add my own experience of using LINQ where performance really does matter - with Monotouch - the difference there is still insignificant.
You're 'handicapped' on the 3GS iPhone to around 46 MB of RAM and a 620 MHz ARM processor. Admittedly the code is AOT compiled, but even on the simulator, where it is JIT'd and going through a long series of indirection, the difference is tenths of a millisecond for sets of 1000s of objects.
Along with Windows Mobile, this is where you have to worry about the performance costs - not in huge ASP.NET applications running on quad-core 8 GB servers, or desktops with dual cores. One exception to this would be with large object sets, although arguably you would lazy load anyway, and the initial query task would be performed on the database server.
It's a bit of a cliché on Stack Overflow, but use the shorter, more readable code until 100s of milliseconds really do matter.

Should I be concerned about .NET dictionary speed?

I will be creating a project that will use dictionary lookups and inserts quite a bit. Is this something to be concerned about?
Also, if I do benchmarking and such and it is really bad, then what is the best way of replacing dictionary with something else? Would using an array with "hashed" keys even be faster? That wouldn't help on insert time though will it?
Also, I don't think I'm micro-optimizing because this really will be a significant part of code on a production server, so if this takes an extra 100ms to complete, then we will be looking for new ways to handle this.
You are micro-optimizing. Do you even have working code yet? Remember, "If it doesn't work, it doesn't matter how fast it doesn't work." (Mich Ravera) http://www.codingninja.co.uk/best-programmers-quotes/.
You have no idea where the bottlenecks will be, and already you're focused on Dictionary. What if the problem is somewhere else?
How do you know how the Dictionary class is implemented? Maybe it already uses an array with hashed keys!
P.S. It's really ".NET Dictionaries", not "C# Dictionaries", because C# is just one of several programming languages that use the framework.
Hello, I will be creating a project that will use dictionary lookups and inserts quite a bit. Is this something to be concerned about?
Yes. It is always wise to consider performance factors up front.
The form that your concern should take is as follows: your concern should be encouraging you to write realistic, user-focused performance specifications. It should be encouraging you to start writing performance tests early, and running them often, so that you can see how every single change to the product affects performance. That way you will be informed immediately when a code change causes a user-affecting change in performance. And it should be encouraging you to run profiles often, so that you are reasoning about performance based on empirical measurements, rather than random guesses and hunches.
Also, if I do benchmarking and such and it is really bad, then what is the best way of replacing dictionary with something else?
The best way to do this is to build a reasonable abstraction layer. If you have a class (or interface) which represents the "insert" and "lookup" abstract data type, then you can replace its internals without changing any of the callers.
Note that adding a layer of abstraction itself has a performance cost. If your profiling shows that the abstraction layer is too expensive, if the extra couple of nanoseconds per call is too much, then you might have to get rid of the abstraction layer. Again, this decision will be driven by real-world performance data.
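A minimal sketch of such an abstraction layer (the interface and its members are invented names, purely for illustration):
public interface IKeyValueStore<TKey, TValue>
{
    void Insert(TKey key, TValue value);
    bool TryLookup(TKey key, out TValue value);
}

// The first implementation just delegates to Dictionary<TKey, TValue>;
// callers only see IKeyValueStore, so the internals can be swapped later.
public sealed class DictionaryStore<TKey, TValue> : IKeyValueStore<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> map = new Dictionary<TKey, TValue>();
    public void Insert(TKey key, TValue value) => map[key] = value;
    public bool TryLookup(TKey key, out TValue value) => map.TryGetValue(key, out value);
}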
Would using an array with "hashed" keys even be faster? That wouldn't help on insert time though, will it?
Neither you nor anyone reading this can possibly know which one is faster until you write it both ways and then benchmark it both ways under real-world conditions. Doing it under "lab" conditions will skew your results; you'll need to understand how things work when the GC is under realistic memory pressure, and so on. You might as well ask us which horse will run faster in next year's Kentucky Derby. If we knew the answer just by looking at the racing form, we'd all be rich already. You can't possibly expect anyone to know which of two entirely hypothetical, unwritten pieces of code will be faster under unspecified conditions!
The Dictionary<TKey, TValue> class is actually implemented as a hash table which makes lookups very fast (close to O(1)). See the API documentation for more information. I doubt you could make a better implementation yourself.
Wait and see if the performance of your application is below expectations
If it is then use a profiler to determine if the Dictionary lookup is the source of the problem
If it is then do some tests with representative data to see if another collection type would be quicker.
In short - no, in general you shouldn't worry about the performance of implementation details until after you have a problem.
I would do a benchmark of Dictionary<TKey, TValue>, the non-generic Hashtable, and perhaps a home-grown class, and see which works out best under your typical usage conditions.
Normally I would say it's fine (insert Stack Overflow's favorite premature ejaculation quote here), but if this is a core piece of the application: Benchmark, Benchmark, Benchmark.
The only concern that I can think of is that the speed of the dictionary relies on the key class having a reasonably fast GetHashCode method. Lookups and inserts are really fast, so you shouldn't have any problem there.
Regarding using an array, that's essentially what the Dictionary class does already. Internally it uses a bucket array plus an entries array that holds the keys and values (along with their hash codes and chain links).
If you would have any performance problems with a Dictionary, it would be quite easy to make a wrapper for any kind of storage, that has the same methods and behaviour as a Dictionary so that you can replace it seamlessly.
I'm not sure that anyone has really answered this part yet:
Also, if I do benchmarking and such and it is really bad, then what is the best way of replacing dictionary with something else?
For this, wherever possible, declare your variables as IDictionary<TKey, TValue>. That's the main interface that Dictionary implements. (I'm assuming that if you care that much about performance, then you aren't considering non-generic collections.) Then, in the future, you can change the underlying implementation class without having to change any of the code that uses that dictionary. For example:
IDictionary<string, int> myDict = new Dictionary<string, int>();
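Later, swapping the implementation touches only the construction site; for example (SortedDictionary is just one possible alternative):
IDictionary<string, int> myDict = new SortedDictionary<string, int>();  // callers are unchanged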
If your application is multithreaded then the key part of performance is going to be synchronizing this Dictionary correctly.
If it is single-threaded then almost certainly bottleneck will be elsewhere. Such as reading these objects from wherever you are reading them.
I use a Dictionary for a UDP relay server. Each time a packet arrives it performs Dictionary.ContainsKey and Dictionary[key], and it works great (with a massive number of clients). I had concerns when I was building it, but it turned out that was the last thing I should worry about.
Have a look at C# HybridDictionary Usage
HybridDictionary Class
This class is recommended for cases where the number of elements in a dictionary is unknown. It takes advantage of the improved performance of a ListDictionary with small collections, and offers the flexibility of switching to a Hashtable which handles larger collections better than ListDictionary.
You may consider using the C5 library. I've found it to be very fast and thoughtfully designed. Others on Stack Overflow have found the same. With C5 you have the option of using general type interfaces (with a capital I), or directly the data structures underneath. Naturally the interfaces allow you to swap out different implementations, but I have found in performance testing that the interfaces will cost you.
You may want to look at the KeyedCollection class in System.ObjectModel. From the MSDN description, "provides the abstract base class for a collection whose keys are embedded in the values."

Do C# Generics Have a Performance Benefit?

I have a number of data classes representing various entities.
Which is better: writing a generic class (say, to print or output XML) using generics and interfaces, or writing a separate class to deal with each data class?
Is there a performance benefit or any other benefit (other than it saving me the time of writing separate classes)?
There's a significant performance benefit to using generics -- you do away with boxing and unboxing. Compared with developing your own classes, it's a coin toss (with one side of the coin weighted more than the other). Roll your own only if you think you can out-perform the authors of the framework.
Not only yes, but HECK YES. I didn't believe how big of a difference they could make. We did testing in VistaDB after rewriting a small percentage of core code from ArrayLists and Hashtables over to generics. The speed improvement was 250% or more.
Read my blog about the testing we did on generics vs weakly typed collections. The results blew our minds.
I have started rewriting lots of old code that used the weakly typed collections into strongly typed ones. One of my biggest gripes with the ADO.NET interface is that it doesn't expose more strongly typed ways of getting data in and out. The casting time from an object and back is an absolute killer in high volume applications.
Another side effect of strongly typing is that you often will find weakly typed reference problems in your code. We found that through implementing structs in some cases to avoid putting pressure on the GC we could further speed up our code. Combine this with strongly typing for your best speed increase.
Sometimes you have to use weakly typed interfaces within the dot net runtime. Whenever possible though look for ways to stay strongly typed. It really does make a huge difference in performance for non trivial applications.
Generics in C# are truly generic types from the CLR perspective. There should not be any fundamental difference between the performance of a generic class and a specific class that does the exact same thing. This is different from Java generics, which are more of an automated type cast where needed, or C++ templates, which expand at compile time.
Here's a good paper, somewhat old, that explains the basic design:
"Design and Implementation of Generics for the
.NET Common Language Runtime".
If you hand-write classes for specific tasks chances are you can optimize some aspects where you would need additional detours through an interface of a generic type.
In summary, there may be a performance benefit but I would recommend the generic solution first, then optimize if needed. This is especially true if you expect to instantiate the generic with many different types.
I did some simple benchmarking of ArrayLists vs generic Lists for a different question: Generics vs. Array Lists. Your mileage will vary, but the generic List was 4.7 times faster than the ArrayList.
So yes, boxing / unboxing are critical if you are doing a lot of operations. If you are doing simple CRUD stuff, I wouldn't worry about it.
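To make the boxing point concrete, a small sketch of the difference (ArrayList stores everything as object, so each int is boxed on the way in and unboxed on the way out):
var weak = new System.Collections.ArrayList();
weak.Add(42);                  // boxes the int into a heap object
int a = (int)weak[0];          // cast plus unbox on every read

var strong = new System.Collections.Generic.List<int>();
strong.Add(42);                // stored inline in the list's int[] - no allocation
int b = strong[0];             // no cast, no unbox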
Generics are one of the way to parameterize code and avoid repetition. Looking at your program description and your thought of writing a separate class to deal with each and every data object, I would lean to generics. Having a single class taking care of many data objects, instead of many classes that do the same thing, increases your performance. And of course your performance, measured in the ability to change your code, is usually more important than the computer performance. :-)
According to Microsoft, Generics are faster than casting (boxing/unboxing primitives) which is true.
They also claim generics provide better performance than casting between reference types, which seems to be untrue (no one can quite prove it).
Tony Northrup - co-author of MCTS 70-536: Application Development Foundation - states in the same book the following:
I haven't been able to reproduce the performance benefits of generics; however, according to Microsoft, generics are faster than using casting. In practice, casting proved to be several times faster than using a generic. However, you probably won't notice performance differences in your applications. (My tests over 100,000 iterations took only a few seconds.) So you should still use generics because they are type-safe.
I haven't been able to reproduce such performance benefits with generics compared to casting between reference types - so I'd say the performance gain is "supposed" more than "significant".
If you compare a generic list (for example) to a specific list for exactly the type you use, then the difference is minimal; the results from the JIT compiler are almost the same.
If you compare a generic list to a list of objects, then there are significant benefits to the generic list - no boxing/unboxing for value types and no type checks for reference types.
Also, the generic collection classes in the .NET library were heavily optimized, and you are unlikely to do better yourself.
In the case of generic collections vs. boxing et al, with older collections like ArrayList, generics are a performance win. But in the vast majority of cases this is not the most important benefit of generics. I think there are two things that are of much greater benefit:
Type safety.
Self documenting aka more readable.
Generics promote type safety, forcing a more homogeneous collection. Imagine stumbling across a string when you expect an int. Ouch.
Generic collections are also more self documenting. Consider the two collections below:
ArrayList listOfNames = new ArrayList();
List<NameType> listOfNames = new List<NameType>();
Reading the first line you might think listOfNames is a list of strings. Wrong! It is actually storing objects of type NameType. The second example not only enforces that the type must be NameType (or a descendant), but the code is more readable. I know right away that I need to go find NameType and learn how to use it just by looking at the code.
I have seen a lot of these "does x perform better than y" questions on StackOverflow. The question here was very fair, and as it turns out generics are a win any way you skin the cat. But at the end of the day the point is to provide the user with something useful. Sure your application needs to be able to perform, but it also needs to not crash, and you need to be able to quickly respond to bugs and feature requests. I think you can see how these last two points tie in with the type safety and code readability of generic collections. If it were the opposite case, if ArrayList outperformed List<>, I would probably still take the List<> implementation unless the performance difference was significant.
As far as performance goes (in general), I would be willing to bet that you will find the majority of your performance bottlenecks in these areas over the course of your career:
Poor design of database or database queries (including indexing, etc),
Poor memory management (forgetting to call dispose, deep stacks, holding onto objects too long, etc),
Improper thread management (too many threads, not calling IO on a background thread in desktop apps, etc),
Poor IO design.
None of these are fixed with single-line solutions. We as programmers, engineers and geeks want to know all the cool little performance tricks. But it is important that we keep our eyes on the ball. I believe focusing on good design and programming practices in the four areas I mentioned above will further that cause far more than worrying about small performance gains.
Generics are faster!
I also discovered that Tony Northrup wrote incorrect things about the performance of generics and non-generics in his book.
I wrote about this on my blog:
http://andriybuday.blogspot.com/2010/01/generics-performance-vs-non-generics.html
Here is great article where author compares performance of generics and non-generics:
nayyeri.net/use-generics-to-improve-performance
If you're thinking of a generic class that calls methods on some interface to do its work, that will be slower than specific classes using known types, because calling an interface method is slower than a (non-virtual) function call.
Of course, unless the code is the slow part of a performance-critical process, you should focus on clarity.
See Rico Mariani's Blog at MSDN too:
http://blogs.msdn.com/ricom/archive/2005/08/26/456879.aspx
Q1: Which is faster?
The Generics version is considerably faster, see below.
The article is a little old, but gives the details.
Not only can you do away with boxing but the generic implementations are somewhat faster than the non generic counterparts with reference types due to a change in the underlying implementation.
The originals were designed with a particular extension model in mind. This model was never really used (and would have been a bad idea anyway) but the design decision forced a couple of methods to be virtual and thus uninlineable (based on the current and past JIT optimisations in this regard).
This decision was rectified in the newer classes but cannot be altered in the older ones without it being a potential binary breaking change.
In addition, iteration via foreach on a List<T> (rather than through IList<T>) is faster because List<T>'s struct enumerator avoids the heap allocation that an interface-based enumerator (as with ArrayList) requires. Admittedly this did lead to an obscure bug
