Should I be concerned about .NET dictionary speed? - c#

I will be creating a project that will use dictionary lookups and inserts quite a bit. Is this something to be concerned about?
Also, if I do benchmarking and such and it is really bad, then what is the best way of replacing dictionary with something else? Would using an array with "hashed" keys even be faster? That wouldn't help with insert time though, would it?
Also, I don't think I'm micro-optimizing because this really will be a significant part of code on a production server, so if this takes an extra 100ms to complete, then we will be looking for new ways to handle this.

You are micro-optimizing. Do you even have working code yet? Remember, "If it doesn't work, it doesn't matter how fast it doesn't work." (Mich Ravera) http://www.codingninja.co.uk/best-programmers-quotes/.
You have no idea where the bottlenecks will be, and already you're focused on Dictionary. What if the problem is somewhere else?
How do you know how the Dictionary class is implemented? Maybe it already uses an array with hashed keys!
P.S. It's really ".NET Dictionaries", not "C# Dictionaries", because C# is just one of several programming languages that use the framework.

Hello, I will be creating a project that will use dictionary lookups and inserts quite a bit. Is this something to be concerned about?
Yes. It is always wise to consider performance factors up front.
The form that your concern should take is as follows: your concern should be encouraging you to write realistic, user-focused performance specifications. It should be encouraging you to start writing performance tests early, and running them often, so that you can see how every single change to the product affects performance. That way you will be informed immediately when a code change causes a user-affecting change in performance. And it should be encouraging you to run profiles often, so that you are reasoning about performance based on empirical measurements, rather than random guesses and hunches.
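As a concrete (and deliberately crude) illustration of that advice, here is a minimal sketch of the kind of early, repeatable timing test you can run on every build; the class name and iteration count are placeholders, not anything prescribed by the question:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class DictionaryTimings
{
    static void Main()
    {
        const int n = 1000000;
        var dict = new Dictionary<int, string>(n);

        // Time n inserts.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
            dict[i] = "value";
        sw.Stop();
        Console.WriteLine("{0} inserts: {1} ms", n, sw.ElapsedMilliseconds);

        // Time n lookups.
        sw = Stopwatch.StartNew();
        string s;
        long found = 0;
        for (int i = 0; i < n; i++)
            if (dict.TryGetValue(i, out s)) found++;
        sw.Stop();
        Console.WriteLine("{0} lookups ({1} hits): {2} ms", n, found, sw.ElapsedMilliseconds);
    }
}

Run something like this against realistic data and realistic key types, not lab toys, and track the numbers over time so regressions show up immediately.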
Also, if I do benchmarking and such and it is really bad, then what is the best way of replacing dictionary with something else?
The best way to do this is to build a reasonable abstraction layer. If you have a class (or interface) which represents the "insert" and "lookup" abstract data type, then you can replace its internals without changing any of the callers.
Note that adding a layer of abstraction itself has a performance cost. If your profiling shows that the abstraction layer is too expensive, if the extra couple nanoseconds per call is too much, then you might have to get rid of the abstraction layer. Again, this decision will be driven by real-world performance data.
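For example, a minimal sketch of such an abstraction layer might look like this (IKeyValueStore and DictionaryStore are invented names for illustration, not anything from the question):

using System.Collections.Generic;

// Callers depend only on this abstraction, never on Dictionary itself.
public interface IKeyValueStore<TKey, TValue>
{
    void Insert(TKey key, TValue value);
    bool TryLookup(TKey key, out TValue value);
}

// Default implementation backed by Dictionary; swap it out later if
// profiling shows a different structure is needed.
public class DictionaryStore<TKey, TValue> : IKeyValueStore<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _inner = new Dictionary<TKey, TValue>();

    public void Insert(TKey key, TValue value) { _inner[key] = value; }

    public bool TryLookup(TKey key, out TValue value)
    {
        return _inner.TryGetValue(key, out value);
    }
}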
Would using an array with "hashed" keys even be faster? That wouldn't help with insert time though, would it?
Neither you nor anyone reading this can possibly know which one is faster until you write it both ways and then benchmark it both ways under real-world conditions. Doing it under "lab" conditions will skew your results; you'll need to understand how things work when the GC is under realistic memory pressure, and so on. You might as well ask us which horse will run faster in next year's Kentucky Derby. If we knew the answer just by looking at the racing form, we'd all be rich already. You can't possibly expect anyone to know which of two entirely hypothetical, unwritten pieces of code will be faster under unspecified conditions!

The Dictionary<TKey, TValue> class is implemented as a hash table, which makes lookups very fast (close to O(1)). See the API documentation for more information. I doubt you could make a better implementation yourself.

Wait and see if the performance of your application is below expectations.
If it is, use a profiler to determine whether the Dictionary lookup is the source of the problem.
If it is, do some tests with representative data to see if another collection type would be quicker.
In short: no, in general you shouldn't worry about the performance of implementation details until you actually have a problem.

I would do a benchmark of Dictionary<TKey, TValue>, the non-generic Hashtable, and perhaps a home-grown class, and see which works out best under your typical usage conditions.
Normally I would say it's fine (insert StackOverflow's favorite premature ejaculation quote here), but if this is a core piece of the application: Benchmark, Benchmark, Benchmark.

The only concern that I can think of is that the speed of the dictionary relies on the key class having a reasonably fast GetHashCode method. Lookups and inserts are really fast, so you shouldn't have any problem there.
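To illustrate what a "reasonably fast GetHashCode" might look like for a composite key, here is a hedged sketch (the OrderKey type and its fields are invented for this example): the hash is a cheap, allocation-free combination of the fields rather than, say, a string concatenation.

public sealed class OrderKey
{
    public readonly int CustomerId;
    public readonly int OrderId;

    public OrderKey(int customerId, int orderId)
    {
        CustomerId = customerId;
        OrderId = orderId;
    }

    public override bool Equals(object obj)
    {
        var other = obj as OrderKey;
        return other != null
            && other.CustomerId == CustomerId
            && other.OrderId == OrderId;
    }

    // Cheap, allocation-free combination of the two fields.
    public override int GetHashCode()
    {
        unchecked { return CustomerId * 31 + OrderId; }
    }
}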
Regarding using an array, that's essentially what the Dictionary class already does internally: it uses arrays under the hood, a bucket array plus an entries array that holds the keys and values (along with each entry's cached hash code and chain link).
If you would have any performance problems with a Dictionary, it would be quite easy to make a wrapper for any kind of storage, that has the same methods and behaviour as a Dictionary so that you can replace it seamlessly.

I'm not sure that anyone has really answered this part yet:
Also, if I do benchmarking and such and it is really bad, then what is the best way of replacing dictionary with something else?
For this, wherever possible, declare your variables as IDictionary<TKey, TValue>. That's the main interface that Dictionary implements. (I'm assuming that if you care that much about performance, then you aren't considering non-generic collections.) Then, in the future, you can change the underlying implementation class without having to change any of the code that uses that dictionary. For example:
IDictionary<string, int> myDict = new Dictionary<string, int>();

If your application is multithreaded then the key part of performance is going to be synchronizing this Dictionary correctly.
If it is single-threaded then almost certainly bottleneck will be elsewhere. Such as reading these objects from wherever you are reading them.
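For illustration, on .NET 4 and later, ConcurrentDictionary<TKey, TValue> handles that synchronization for you. A minimal sketch (the Counters class and its members are invented):

using System.Collections.Concurrent;

class Counters
{
    private readonly ConcurrentDictionary<string, int> _cache =
        new ConcurrentDictionary<string, int>();

    public void Hit(string key)
    {
        // Thread-safe insert-or-update without explicit locks.
        _cache.AddOrUpdate(key, 1, (k, old) => old + 1);
    }

    public int Get(string key)
    {
        int value;
        return _cache.TryGetValue(key, out value) ? value : 0;
    }
}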

I use Dictionary for a UDP relay server. Each time a packet arrives it performs Dictionary.ContainsKey and Dictionary[Key], and it works great (massive number of clients). I had concerns when I was making the thing, but it turned out that was the last thing I should worry about.
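One small note on that pattern: ContainsKey followed by the indexer hashes the key twice, whereas TryGetValue does the same job with a single lookup. A hedged sketch of the alternative (the Relay class and its members are invented for illustration):

using System.Collections.Generic;

class Relay
{
    // Placeholder key/value types; a real server might key on the
    // client's endpoint.
    private readonly Dictionary<string, string> _clients = new Dictionary<string, string>();

    public string Resolve(string endpoint)
    {
        string target;
        // One hash lookup instead of the two done by ContainsKey + indexer.
        if (_clients.TryGetValue(endpoint, out target))
            return target;
        return null;
    }
}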

Have a look at C# HybridDictionary Usage
HybridDictionary Class
This class is recommended for cases where the number of elements in a dictionary is unknown. It takes advantage of the improved performance of a ListDictionary with small collections, and offers the flexibility of switching to a Hashtable which handles larger collections better than ListDictionary.
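A minimal usage sketch (note that HybridDictionary is non-generic, so values come back as object and need a cast; the keys and values here are invented):

using System.Collections.Specialized;

class HybridDictionaryDemo
{
    static void Main()
    {
        // Starts out as a ListDictionary internally and switches to a
        // Hashtable automatically once the collection grows.
        var settings = new HybridDictionary();
        settings.Add("timeout", 30);
        settings.Add("retries", 3);

        int timeout = (int)settings["timeout"]; // cast required: values are object
    }
}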

You may consider using the C5 library. I've found it to be very fast and thoughtfully designed. Others on Stack Overflow have found the same. With C5 you have the option of using the general type interfaces (with a capital I), or the underlying data structures directly. Naturally the interfaces allow you to swap out different implementations, but I have found in performance testing that the interfaces will cost you.

You may want to look at the KeyedCollection class in System.Collections.ObjectModel. From the MSDN description, it "provides the abstract base class for a collection whose keys are embedded in the values."
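A hedged sketch of how that looks in practice: you derive from it and tell it where each item's key lives (the Employee type is invented for this example).

using System.Collections.ObjectModel;

public class Employee
{
    public int Id { get; set; }
    public string Name { get; set; }
}

// Items carry their own key, so you get list-like storage plus
// dictionary-like lookup by Id, e.g. staff[42].
public class EmployeeCollection : KeyedCollection<int, Employee>
{
    protected override int GetKeyForItem(Employee item)
    {
        return item.Id;
    }
}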

Related

Performance impact of class inheritance

So, this may seem like a very odd question for many, but here it is:
Say you have an abstract class "Object" with an abstract method doStuff() which 10,000 classes inherit from.
Then in another class you have an "Object" dictionary with 100 random objects of the "Object" type in it. You call doStuff() on them.
Does the number of classes have any performance impact? How does the executable find which class's method to execute? Is that a jump table, a pointer table, the equivalent logic of a huge switch-case, or something else?
If it has any performance impact, are there ways to structure your code differently to eliminate this problem?
I feel I am really overthinking this.
There is no noticeable performance impact when you call doStuff.
At runtime, the actual type of the object you are calling doStuff on is known for sure; a giant switch statement would only be needed if you had to pick the method at compile time, when the type is unknown. The CLR dispatches the call through the object's method table: it sees that you are calling doStuff on a Subclass0679, looks up that class's override, and invokes it. Simple as that.
Think about it this way. ToString() is declared in Object and all classes inherit Object. When you call ToString() on something, is it really slow? No.
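A tiny sketch of the mechanism being described (the shape classes are invented; mentally substitute doStuff and its 10,000 subclasses):

using System;

public abstract class Shape
{
    public abstract double Area();
}

public class Circle : Shape
{
    public double Radius;
    public override double Area() { return Math.PI * Radius * Radius; }
}

public class Square : Shape
{
    public double Side;
    public override double Area() { return Side * Side; }
}

class Dispatch
{
    // Each s.Area() call is a single indirect jump through the object's
    // method table; the cost does not grow with the number of subclasses.
    static double Total(Shape[] shapes)
    {
        double total = 0;
        foreach (Shape s in shapes) total += s.Area();
        return total;
    }
}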
The number of derived classes can have some impact.
In particular, with ten thousand derived classes and only 100 objects, chances are pretty good that each call to doStuff will actually be to a unique function, separate from the others. That means your cache won't be very effective.
This is a fairly unrealistic scenario though. A collection that could be any one of ten thousand different derived classes is unlikely to ever arise in actual practice.
To put this in perspective, .NET (in its entirety) consists of about nine thousand nine hundred classes. A decent sized application could easily add the hundred or so more needed to get to ten thousand--but you're talking about a collection that could include anything in .NET, plus anything in your application. I find it difficult to imagine a situation in which this is likely to make sense.
If you're asking this out of curiosity as a hypothetical question, then fair enough.
However, if you're trying to prematurely optimise some code and this is the level of decisions you're making, I would highly recommend you concentrate on making your code work first and then use a profiler to identify hotspots as areas to optimise.
Also, super optimised code is usually far less readable and maintainable.
Unless your code is a game engine or performs some enormous calculation, does it really need to be so optimised? If the code communicates with the outside world at all - network, disk, db, etc. - then that latency will completely dwarf any imperceptible difference in timing due to using inheritance.

Why is the HashSet<T> class not used to implement Enumerable.Distinct?

I needed to assess the asymptotic time and space complexity of IEnumerable.Distinct in big-O notation.
So I was looking at the implementation of the extension method Enumerable.Distinct and I see it is implemented using an internal class Set<T>, which is almost a classical implementation of a hash table with "open addressing".
What quickly catches the eye is that a lot of code in Set<T> is just a copy-paste from HashSet<T>, with some omissions.
However, this simplified Set<T> implementation has some obvious flaws, for example the Resize method not using prime numbers for the size of the slots, like HashSet<T> does; see HashHelpers.ExpandPrime.
So, my questions are:
What is the reason for code duplication here, why not stick with DRY principle? Especially given the fact that both of these classes are in the same assembly System.Core
It looks like HashSet<T> will perform better, so should I avoid using Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>?
which is almost a classical implementation of a hash table with "open addressing"
Look again. It's separate chaining with list head cells. While the slots are all in an array, finding the next slot in the case of collision is done by examining the next field of the current slot. This has better cache efficiency than using linked lists with each node as a separate heap object, though not as good as open addressing in that regard. At the same time, it avoids some of the cases where open addressing does poorly.
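Roughly, the layout being described looks like this simplified sketch (not the actual BCL source; names are illustrative):

// Separate chaining with list head cells: all slots live in one array,
// and collisions are chained via the 'next' index instead of separate
// heap-allocated list nodes.
struct Slot<T>
{
    internal int hashCode;   // cached hash of the value
    internal int next;       // index of the next slot in this chain, or -1
    internal T value;
}

// int[] buckets;   // buckets[hash % buckets.Length] holds the index of
//                  // the first slot in the chain (the "list head cell")
// Slot<T>[] slots; // the chained entries themselves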
a lot of code in Set is just a copy-paste from HashSet, with some omissions
AFAICT the reason a private implementation of a hash-set was used is that Enumerable and HashSet were developed independently at about the same time. That's just conjecture on my part, but they were both introduced with .NET 3.5 so it's feasible.
It's quite possible that HashSet<T> started by copying Set<T> and then making it better serve being exposed publicly, though it's also possible that the two were both based on the same principle of separate chaining with list head cells.
In terms of performance, HashSet<T>'s use of prime numbers means it's more likely to avoid collisions with poor hashes (but just how much of an advantage that is, is not a simple question), but Set<T> is lighter in a lot of ways, especially in .NET Core where some things it doesn't need were removed. In particular, that version of Set<T> takes advantage of the fact that once an item is removed (which happens, for example, during Intersect) there will never be an item added, which allows it to leave out the freelist and any work related to it, which HashSet<T> couldn't do. Even the initial implementation is lighter in not tracking a version to catch changes during enumeration, which is a small cost, but a cost to every addition and removal nevertheless.
As such, with different sets of data with different distributions of hash codes sometimes one performs better, sometimes the other.
Especially given the fact that both of these classes are in the same assembly System.Core
Only in some versions of .NET, in some they're in separate assemblies. In .NET Core we had two versions of Set<T>, one in the assembly that has System.Linq and one in the separate assembly that has System.Linq.Expressions. The former got trimmed down as described above, the latter replaced with a use of HashSet<T> as it was doing less there.
Of course System.Core came first, but the fact that those elements could be separated out at all speaks of System.Core not being a single monolithic blob of inter-dependencies.
That there is now a ToHashSet() method in .NET Core's version of Linq makes the possibility of replacing Set<T> with HashSet<T> more justifiable, though not a no-brainer. I think #james-ko was considering testing the benefits of doing that.
It looks like HashSet<T> will perform better
For the reasons explained above, that might not be the case, though it might indeed, depending on source data. That's before getting into considerations of optimisations that go across a few different linq methods (not many in the initial versions of linq, but a good few in .NET Core).
so should I avoid using Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>?
Use Distinct(). If you have a bottleneck then it might be that HashSet<T> will win with a given data-set, but if you do try that, make sure your profiling closely matches the real values your code will encounter in real life. There's no point deciding one approach is faster based on some arbitrary tests if your application hits cases where the other does better. (And if I was finding this a problem spot, I'd first take a look at whether the GetHashCode() of the types in question could be improved for either speed or distribution of bits.)
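If you do end up profiling that alternative, the extension method itself is only a few lines. A sketch (the method name is invented; this is for benchmarking against the built-in Distinct, not a drop-in replacement):

using System.Collections.Generic;

public static class MyEnumerable
{
    // Hypothetical HashSet-backed alternative to Enumerable.Distinct.
    public static IEnumerable<T> DistinctViaHashSet<T>(this IEnumerable<T> source)
    {
        var seen = new HashSet<T>();
        foreach (T item in source)
        {
            if (seen.Add(item))   // Add returns false for duplicates
                yield return item;
        }
    }
}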

Do C# Generics Have a Performance Benefit?

I have a number of data classes representing various entities.
Which is better: writing a generic class (say, to print or output XML) using generics and interfaces, or writing a separate class to deal with each data class?
Is there a performance benefit or any other benefit (other than it saving me the time of writing separate classes)?
There's a significant performance benefit to using generics -- you do away with boxing and unboxing. Compared with developing your own classes, it's a coin toss (with one side of the coin weighted more than the other). Roll your own only if you think you can out-perform the authors of the framework.
Not only yes, but HECK YES. I didn't believe how big of a difference they could make. We did testing in VistaDB after rewriting a small percentage of core code that used ArrayLists and Hashtables over to generics. The speed improvement was 250% or more.
Read my blog about the testing we did on generics vs weak type collections. The results blew our mind.
I have started rewriting lots of old code that used the weakly typed collections into strongly typed ones. One of my biggest gripes with the ADO.NET interface is that it doesn't expose more strongly typed ways of getting data in and out. The casting time from an object and back is an absolute killer in high-volume applications.
Another side effect of strong typing is that you often will find weakly typed reference problems in your code. We found that by implementing structs in some cases to avoid putting pressure on the GC we could further speed up our code. Combine this with strong typing for your best speed increase.
Sometimes you have to use weakly typed interfaces within the .NET runtime. Whenever possible, though, look for ways to stay strongly typed. It really does make a huge difference in performance for non-trivial applications.
Generics in C# are truly generic types from the CLR perspective. There should not be any fundamental difference between the performance of a generic class and a specific class that does the exact same thing. This is different from Java Generics, which are more of an automated type cast where needed or C++ templates that expand at compile time.
Here's a good paper, somewhat old, that explains the basic design: "Design and Implementation of Generics for the .NET Common Language Runtime".
If you hand-write classes for specific tasks, chances are you can optimize some aspects where the generic version would need an additional detour through an interface.
In summary, there may be a performance benefit but I would recommend the generic solution first, then optimize if needed. This is especially true if you expect to instantiate the generic with many different types.
I did some simple benchmarking of ArrayLists vs. generic Lists for a different question: Generics vs. Array Lists. Your mileage will vary, but the generic List was 4.7 times faster than the ArrayList.
So yes, boxing / unboxing are critical if you are doing a lot of operations. If you are doing simple CRUD stuff, I wouldn't worry about it.
Generics are one of the way to parameterize code and avoid repetition. Looking at your program description and your thought of writing a separate class to deal with each and every data object, I would lean to generics. Having a single class taking care of many data objects, instead of many classes that do the same thing, increases your performance. And of course your performance, measured in the ability to change your code, is usually more important than the computer performance. :-)
According to Microsoft, generics are faster than casting (boxing/unboxing of primitives), which is true.
They also claim generics provide better performance than casting between reference types, which seems to be untrue (no one can quite prove it).
Tony Northrup - co-author of MCTS 70-536: Application Development Foundation - states in the same book the following:
I haven't been able to reproduce the performance benefits of generics; however, according to Microsoft, generics are faster than using casting. In practice, casting proved to be several times faster than using a generic. However, you probably won't notice performance differences in your applications. (My tests over 100,000 iterations took only a few seconds.) So you should still use generics because they are type-safe.
I haven't been able to reproduce such performance benefits with generics compared to casting between reference types - so I'd say the performance gain is "supposed" more than "significant".
If you compare a generic list (for example) to a specific list for exactly the type you use, then the difference is minimal; the results from the JIT compiler are almost the same.
If you compare a generic list to a list of objects, then there are significant benefits to the generic list - no boxing/unboxing for value types and no type checks for reference types.
Also, the generic collection classes in the .NET library were heavily optimized and you are unlikely to do better yourself.
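To make the boxing/unboxing point concrete, a small sketch:

using System.Collections;
using System.Collections.Generic;

class BoxingDemo
{
    static void Main()
    {
        // Non-generic: each int is boxed to object on Add and must be
        // unboxed with a cast on the way out.
        ArrayList boxedList = new ArrayList();
        boxedList.Add(42);              // boxing allocation
        int a = (int)boxedList[0];      // unbox + runtime type check

        // Generic: the int is stored directly; no allocation, no cast.
        List<int> plainList = new List<int>();
        plainList.Add(42);
        int b = plainList[0];
    }
}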
In the case of generic collections vs. boxing et al, with older collections like ArrayList, generics are a performance win. But in the vast majority of cases this is not the most important benefit of generics. I think there are two things that are of much greater benefit:
Type safety.
Self-documenting, aka more readable.
Generics promote type safety, forcing a more homogeneous collection. Imagine stumbling across a string when you expect an int. Ouch.
Generic collections are also more self documenting. Consider the two collections below:
ArrayList listOfNames = new ArrayList();
List<NameType> listOfNames = new List<NameType>();
Reading the first line you might think listOfNames is a list of strings. Wrong! It is actually storing objects of type NameType. The second example not only enforces that the type must be NameType (or a descendant), but the code is more readable. I know right away that I need to go find NameType and learn how to use it just by looking at the code.
I have seen a lot of these "does x perform better than y" questions on StackOverflow. The question here was very fair, and as it turns out generics are a win any way you skin the cat. But at the end of the day the point is to provide the user with something useful. Sure your application needs to be able to perform, but it also needs to not crash, and you need to be able to quickly respond to bugs and feature requests. I think you can see how these last two points tie in with the type safety and code readability of generic collections. If it were the opposite case, if ArrayList outperformed List<>, I would probably still take the List<> implementation unless the performance difference was significant.
As far as performance goes (in general), I would be willing to bet that you will find the majority of your performance bottlenecks in these areas over the course of your career:
Poor design of database or database queries (including indexing, etc),
Poor memory management (forgetting to call dispose, deep stacks, holding onto objects too long, etc),
Improper thread management (too many threads, not calling IO on a background thread in desktop apps, etc),
Poor IO design.
None of these are fixed with single-line solutions. We as programmers, engineers and geeks want to know all the cool little performance tricks. But it is important that we keep our eyes on the ball. I believe focusing on good design and programming practices in the four areas I mentioned above will further that cause far more than worrying about small performance gains.
Generics are faster!
I also discovered that Tony Northrup wrote incorrect things about the performance of generics vs. non-generics in his book.
I wrote about this on my blog:
http://andriybuday.blogspot.com/2010/01/generics-performance-vs-non-generics.html
Here is a great article where the author compares the performance of generics and non-generics:
nayyeri.net/use-generics-to-improve-performance
If you're thinking of a generic class that calls methods on some interface to do its work, that will be slower than specific classes using known types, because calling an interface method is slower than a (non-virtual) function call.
Of course, unless the code is the slow part of a performance-critical process, you should focus on clarity.
See Rico Mariani's Blog at MSDN too:
http://blogs.msdn.com/ricom/archive/2005/08/26/456879.aspx
Q1: Which is faster? The Generics version is considerably faster, see below.
The article is a little old, but gives the details.
Not only can you do away with boxing, but the generic implementations are somewhat faster than the non-generic counterparts with reference types due to a change in the underlying implementation.
The originals were designed with a particular extension model in mind. This model was never really used (and would have been a bad idea anyway) but the design decision forced a couple of methods to be virtual and thus uninlineable (based on the current and past JIT optimisations in this regard).
This decision was rectified in the newer classes but cannot be altered in the older ones without it being a potential binary breaking change.
In addition, iteration via foreach on a List<> (rather than an IList<>) is faster because List<>'s enumerator is a struct, whereas enumerating through the interface requires a heap-allocated enumerator (as the old ArrayList's always did). Admittedly this did lead to an obscure bug.
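A hedged sketch of that difference: both methods below do the same work, but the first uses List<int>'s struct enumerator while the second, going through the interface, forces the enumerator to be boxed onto the heap.

using System.Collections.Generic;

class EnumeratorDemo
{
    static long SumDirect(List<int> list)
    {
        long total = 0;
        // foreach over the concrete List<int> uses its struct
        // List<int>.Enumerator: no heap allocation for the loop.
        foreach (int x in list) total += x;
        return total;
    }

    static long SumViaInterface(IList<int> list)
    {
        long total = 0;
        // foreach over the interface calls IEnumerable<int>.GetEnumerator(),
        // which boxes the struct enumerator into an IEnumerator<int>.
        foreach (int x in list) total += x;
        return total;
    }
}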

Generics vs. Array Lists

The system I work on here was written before .NET 2.0 and didn't have the benefit of generics. It was eventually updated to 2.0, but none of the code was refactored due to time constraints. There are a number of places where the code uses ArrayLists etc. that store things as objects.
From a performance perspective, how important is it to change the code to use generics? I know that from a performance perspective boxing and unboxing etc. are inefficient, but how much of a performance gain will there really be from changing it? Are generics something to use on a go-forward basis, or is there enough of a performance change that a conscious effort should be made to update old code?
Technically the performance of generics is, as you say, better. However, unless performance is hugely important AND you've already optimised in other areas you're likely to get MUCH better improvements by spending your time elsewhere.
I would suggest:
use generics going forward.
if you have solid unit tests then refactor to generics as you touch code
spend other time doing refactorings/measurement that will significantly improve performance (database calls, changing data structures, etc) rather than a few milliseconds here and there.
Of course there's reasons other than performance to change to generics:
less error prone, since you have compile-time checking of types
more readable, you don't need to cast all over the place and it's obvious what type is stored in a collection
if you're using generics going forward, then it's cleaner to use them everywhere
Here are the results I got from a simple parsing of a string from a 100KB file, 100,000 times. The generic List(Of Char) took 612.293 seconds to go 100,000 times through the file.
The ArrayList took 2,880.415 seconds to go 100,000 times through the file. This means that in this scenario (your mileage will vary) the generic List(Of Char) is 4.7 times faster.
Here is the code I ran through 100,000 times:
Public Sub Run(ByVal strToProcess As String) Implements IPerfStub.Run
    ' ArrayList version: every Char is boxed on Add and unboxed on read.
    Dim arrList As New ArrayList
    For Each ch As Char In strToProcess.ToCharArray
        arrList.Add(ch)
    Next
    Dim dummy As New System.Text.StringBuilder()
    For i As Integer = 0 To arrList.Count - 1
        dummy.Append(arrList(i))
    Next
End Sub

Public Sub Run(ByVal strToProcess As String) Implements IPerfStub.Run
    ' Generic version: List(Of Char) stores the chars directly, no boxing.
    Dim genList As New List(Of Char)
    For Each ch As Char In strToProcess.ToCharArray
        genList.Add(ch)
    Next
    Dim dummy As New System.Text.StringBuilder()
    For i As Integer = 0 To genList.Count - 1
        dummy.Append(genList(i))
    Next
End Sub
The only way to know for sure is to profile your code using a tool like dotTrace.
http://www.jetbrains.com/profiler/
It's possible that the boxing/unboxing is trivial in your particular application and wouldn't be worth refactoring. Going forward, you should still consider using generics due to the compile-time type safety.
Generics, whether Java or .NET, should be used for design and type safety, not for performance. Autoboxing is different from generics (it is essentially an implicit conversion between primitives and objects), and as you mentioned, you should NOT use it in place of a primitive if there is to be a lot of arithmetic or other operations that would cause a performance hit from the repeated implicit object creation/destruction.
Overall I would suggest using generics going forward, and only updating existing code if it needs to be cleaned up for type safety / design purposes, not performance.
It depends, the best answer is to profile your code and see. I like AQTime but a number of packages exist for this.
In general, if an ArrayList is being used a LOT it may be worth switching it to a generic version. Really though, it's most likely that you wouldn't even be able to measure the performance difference. Boxing and unboxing are extra steps, but modern computers are so fast that it makes almost no difference. As an ArrayList is really just a normal array with a nice wrapper, you would probably see much more performance gained from better data structure selection (ArrayList.Remove is O(n)!) than from the conversion to generics.
Edit: Outlaw Programmer has a good point: you will still be casting with generics, it just happens implicitly. All the code around checking for exceptions and nulls from casting and "is/as" keywords would help a bit though.
The biggest gains you will find in the maintenance phase. Generics are much easier to deal with and update, without having to deal with conversion and casting issues. If this is code that you continually visit, then by all means make the effort. If this is code that hasn't been touched in years, I wouldn't really bother.
What does autoboxing/unboxing have to do with generics? This is just a type-safety issue. With a non-generic collection, you are required to explicitly cast back to an object's actual type. With generics, you can skip this step. I don't think there is a performance difference one way or the other.
My old company actually considered this problem. The approach we took was: if it's easy to refactor, do it; if not (i.e. it will touch too many classes), leave it for a later time. It really depends on whether or not you have the time to do it, or whether there are more important items to be coding (i.e. features you should be implementing for clients).
Then again, if you're not working on something for a client, go ahead and spend time refactoring. It'll improve readability of the code for yourself.
It depends on how much is out there in your code. If you bind or display large lists in the UI, you would probably see a great gain in performance.
If your ArrayLists are just sprinkled about here and there, then it probably wouldn't be a big deal to just get them cleaned up, but it also wouldn't impact overall performance very much.
If you are using a lot of ArrayLists throughout your code and it would be a big undertaking to replace them (something that may impact your schedules), then you could adopt an if-you-touch-it-change-it approach.
The main thing is, though, that generics are a lot easier to read, and are more stable across the app due to the strong typing you get from them. You'll see gains not just in performance, but in code maintainability and stability. If you can do it quickly, I'd say do it.
If you can get buy-in from the Product Owner, I'd recommend getting it cleaned up. You'll love your code more afterward.
If the entities in the ArrayLists are Object types, you'll gain a little from not casting them to the correct type. If they're Value types (structs or primitives like Int32), then the boxing/unboxing process adds a lot of overhead, and Generic collections should be much faster.
Here's an MSDN article on the subject
Generics have much better performance, especially if you'll be using value types (int, bool, struct, etc.), where you'll see a noticeable performance gain.
Using ArrayList with value types causes boxing/unboxing, which if done several hundred times is substantially slower than using a generic List.
When storing value types as object you'll use up to four times the memory per item. While this amount won't drain your RAM, the cache, which is smaller, can hold fewer items. That means that while iterating a long collection there will be many copies from main memory to the cache, which will slow your application.
I wrote about this here.
Using generics should also mean that your code will be simpler and easier to work with if you want to leverage things like LINQ in the later C# versions.
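A rough way to see the per-item memory difference described above for yourself (GC.GetTotalMemory is coarse, so treat the numbers as indicative only; the class name and count are placeholders):

using System;
using System.Collections;
using System.Collections.Generic;

class MemoryDemo
{
    static void Main()
    {
        const int n = 1000000;

        long before = GC.GetTotalMemory(true);
        var boxed = new ArrayList(n);
        for (int i = 0; i < n; i++) boxed.Add(i);   // each int boxed separately
        long withBoxing = GC.GetTotalMemory(true) - before;

        before = GC.GetTotalMemory(true);
        var plain = new List<int>(n);
        for (int i = 0; i < n; i++) plain.Add(i);   // ints stored inline in one array
        long withoutBoxing = GC.GetTotalMemory(true) - before;

        Console.WriteLine("ArrayList: {0:N0} bytes, List<int>: {1:N0} bytes",
                          withBoxing, withoutBoxing);
    }
}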

In C# (or any language), what are your favourite ways of removing repetition?

I've just coded a 700 line class. Awful. I hang my head in shame. It's as opposite to DRY as a British summer.
It's full of cut and paste with minor tweaks here and there. This makes it a prime candidate for refactoring. Before I embark on this, I thought I'd ask: when you have lots of repetition, what are the first refactoring opportunities you look for?
For the record, mine are probably using:
Generic classes and methods
Method overloading/chaining.
What are yours?
I like to start refactoring when I need to, rather than at the first opportunity that I get. You might say this is somewhat of an agile approach to refactoring. When do I feel I need to? Usually when I feel that the ugly parts of my code are starting to spread. I think ugliness is okay as long as it is contained, but the moment it starts having the urge to spread, that's when you need to take care of business.
The techniques you use for refactoring should start with the simplest. I would strongly recommend Martin Fowler's book. Combining common code into functions, removing unneeded variables, and other simple techniques get you a lot of mileage. For list operations, I prefer using functional programming idioms. That is to say, I use internal iterators, map, filter and reduce (in Python speak; there are corresponding things in Ruby, Lisp and Haskell) whenever I can; this makes code a lot shorter and more self-contained.
#region
I made a 1,000-line class only one line with it!
In all seriousness, the best ways to avoid repetition are the things covered in your list, as well as fully utilizing polymorphism: examine your class and discover what would best be done in a base class, and how different components of it can be broken away as subclasses.
Sometimes by the time you "complete functionality" using copy and paste code, you've come to a point that it is maimed and mangled enough that any attempt at refactoring will actually take much, much longer than refactoring it at the point where it was obvious.
In my personal experience my favorite "way of removing repetition" has been the "Extract Method" functionality of Resharper (although this is also available in vanilla Visual Studio).
Many times I would see repeated code (some legacy app I'm maintaining) not as whole methods but in chunks within completely separate methods. That gives a perfect opportunity to turn those chunks into methods.
Monster classes also tend to reveal that they contain more than one functionality. That in turn becomes an opportunity to separate each distinct functionality into its own (hopefully smaller) class.
I have to reiterate that doing all of these is not a pleasurable experience at all (for me), so I really would rather do it right while it's a small ball of mud, rather than let the big ball of mud roll and then try to fix that.
First of all, I would recommend refactoring much sooner than when you are done with the first version of the class. Anytime you see duplication, eliminate it ASAP. This may take a little longer initially, but I think the results end up being a lot cleaner, and it helps you rethink your code as you go to ensure you are doing things right.
As for my favorite way of removing duplication: closures, especially in my favorite language (Ruby). They tend to be a really concise way of taking 2 pieces of code and merging the similarities. Of course (like any "best practice" or tip), this cannot be done blindly... I just find them really fun to use when I can.
One of the things I do is try to make small and simple methods that I can see on a single page in my editor (Visual Studio).
I've learnt from experience that making code simple makes it easier for the compiler to optimise it. The larger the method, the harder the compiler has to work!
I've also recently seen a problem where large methods have caused a memory leak. Basically I had a loop very much like the following:
while (true)
{
    // While this call blocks, the previous iteration's largeObject is
    // still referenced by the local below, so the GC keeps it alive.
    var smallObject = WaitForSomethingToTurnUp();
    var largeObject = DoSomethingWithSmallObject(smallObject);
}
I was finding that my application was keeping a large amount of data in memory because, even while 'WaitForSomethingToTurnUp' was blocking, the previous iteration's 'largeObject' was still referenced by a local variable, so the garbage collector could still see it.
I easily solved this by moving the 'DoSomethingWithSmallObject()' and other associated code to another method.
Also, if you make small methods, your reuse within a class will become significantly higher. I generally try to make sure that none of my methods look like any others!
Hope this helps.
Nick
"cut and paste with minor tweaks here and there" is the kind of code repetition I usually solve with an entirely non-exotic approach- Take the similar chunk of code, extract it out to a seperate method. The little bit that is different in every instance of that block of code, change that to a parameter.
There are also some easy techniques for removing repetitive-looking if/else-if and switch blocks, courtesy of Scott Hanselman:
http://www.hanselman.com/blog/CategoryView.aspx?category=Source+Code&page=2
I might go something like this:
Create custom (private) types for data structures and put all the related logic in there. Dictionary<string, List<int>> etc.
Make inner functions or properties that guarantee behaviour. If you're continually checking conditions from a publicly accessible property, then create a private getter with all of the checking baked in (see the sketch after this list).
Split methods apart that have too much going on. If you can't describe something succinctly or give it a good name, then start breaking the function apart until the code is manageable (even if these "child" functions aren't used anywhere else).
If all else fails, slap a [SuppressMessage("Microsoft.Maintainability", "CA1502:AvoidExcessiveComplexity")] on it and comment why.
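As a hedged sketch of the second point above (the field and key/value types are invented), the condition checking moves into one private getter so no caller ever repeats the guard; this fragment is meant to live inside the class that owns the field:

private Dictionary<string, List<int>> _scoresByUser;

// All initialisation/null checking is baked in here, so callers can use
// ScoresByUser directly and never repeat the guard themselves.
private Dictionary<string, List<int>> ScoresByUser
{
    get
    {
        if (_scoresByUser == null)
            _scoresByUser = new Dictionary<string, List<int>>();
        return _scoresByUser;
    }
}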
