So, this may seem like a very odd question for many, but here it is:
Say you have an abstract class "Object" with an abstract method doStuff() which 10.000 classes inherit from.
Then in another class you have an "Object" dictionary with 100 random objects of the "Object" type in it. You call doStuff() on them.
Does the amount of classes have any performance impact? How does the executable find which class to execute the method of? Is that a jumptable, a pointertable, the equivalent logic of a huge switch-case, ..?
If it has any performance impact, are there ways to structure your code differently to eliminate this problem?
I feel I am really overthinking this.
There is no noticeable performance impact when you call doStuff.
At runtime, the type of object you are calling doStuff on is known for sure. At compile time you'd need a giant switch statement because you don't know the type. CLR sees that you are trying to call doStuff on Subclass0679, goes into that class, and invokes the method. Simple as that.
Think about it this way. ToString() is declared in Object and all classes inherit Object. When you call ToString() on something, is it really slow? No.
The number of derived classes can have some impact.
In particular, with ten thousand derived classes and only 100 objects, chances are pretty good that each call to doStuff will actually be to a unique function, separate from the others. That means your cache won't be very effective.
This is a fairly unrealistic scenario though. A collection that could be any one of ten thousand different derived classes is unlikely to ever arise in actual practice.
To put this in perspective, .NET (in its entirety) consists of about nine thousand nine hundred classes. A decent sized application could easily add the hundred or so more needed to get to ten thousand--but you're talking about a collection that could include anything in .NET, plus anything in your application. I find it difficult to imagine a situation in which this is likely to make sense.
If you're asking this out of curiosity as a hypothetical question, then fair enough.
However, if you're trying to prematurely optimise some code and this is the level of decisions you're making, I would highly recommend you concentrate on making your code work first and then use a profiler to identify hotspots as areas to optimise.
Also, super optimised code is usually far less readable and maintainable.
Unless your code is a game engine or perform some enormous calculation, then does it really need to be so optimised? If the code communicates with the outside world at all - network, disk, db etc - then that latency will completely dwarf any imperceptible difference in timing because of using inheritance.
Related
I'm pretty new to C# so bear with me.
One of the first things I noticed about C# is that many of the classes are static method heavy. For example...
Why is it:
Array.ForEach(arr, proc)
instead of:
arr.ForEach(proc)
And why is it:
Array.Sort(arr)
instead of:
arr.Sort()
Feel free to point me to some FAQ on the net. If a detailed answer is in some book somewhere, I'd welcome a pointer to that as well. I'm looking for the definitive answer on this, but your speculation is welcome.
Because those are utility classes. The class construction is just a way to group them together, considering there are no free functions in C#.
Assuming this answer is correct, instance methods require additional space in a "method table." Making array methods static may have been an early space-saving decision.
This, along with avoiding the this pointer check that Amitd references, could provide significant performance gains for something as ubiquitous as arrays.
Also see this rule from FXCOP
CA1822: Mark members as static
Rule Description
Members that do not access instance data or call instance methods can
be marked as static (Shared in Visual Basic). After you mark the
methods as static, the compiler will emit nonvirtual call sites to
these members. Emitting nonvirtual call sites will prevent a check at
runtime for each call that makes sure that the current object pointer
is non-null. This can achieve a measurable performance gain for
performance-sensitive code. In some cases, the failure to access the
current object instance represents a correctness issue.
Perceived functionality.
"Utility" functions are unlike much of the functionality OO is meant to target.
Think about the case with collections, I/O, math and just about all utility.
With OO you generally model your domain. None of those things really fit in your domain--it's not like you are coding and go "Oh, we need to order a new hashtable, ours is getting full". Utility stuff often just doesn't fit.
We get pretty close, but it's still not very OO to pass around collections (where is your business logic? where do you put the methods that manipulate your collection and that other little piece or two of data you are always passing around with it?)
Same with numbers and math. It's kind of tough to have Integer.sqrt() and Long.sqrt() and Float.sqrt()--it just doesn't make sense, nor does "new Math().sqrt()". There are a lot of areas it just doesn't model well. If you are looking for mathematical modeling then OO may not be your best bet. (I made a pretty complete "Complex" and "Matrix" class in Java and made them fairly OO, but making them really taught me some of the limits of OO and Java--I ended up "Using" the classes from Groovy mostly)
I've never seen anything anywhere NEAR as good as OO for modeling your business logic, being able to demonstrate the connections between code and managing your relationship between data and code though.
So we fall back on a different model when it makes more sense to do so.
The classic motivations against static:
Hard to test
Not thread-safe
Increases code size in memory
1) C# has several tools available that make testing static methods relatively easy. A comparison of C# mocking tools, some of which support static mocking: https://stackoverflow.com/questions/64242/rhino-mocks-typemock-moq-or-nmock-which-one-do-you-use-and-why
2) There are well-known, performant ways to do static object creation/logic without losing thread safety in C#. For example implementing the Singleton pattern with a static class in C# (you can jump to the fifth method if the inadequate options bore you): http://www.yoda.arachsys.com/csharp/singleton.html
3) As #K-ballo mentions, every method contributes to code size in memory in C#, rather than instance methods getting special treatment.
That said, the 2 specific examples you pointed out are just a problem of legacy code support for the static Array class before generics and some other code sugar was introduced back in C# 1.0 days, as #Inerdia said. I tried to answer assuming you had more code you were referring to, possibly including outside libraries.
The Array class isn't generic and can't be made fully generic because this would break backwards compatibility. There's some sort of magic going on where arrays implement IList<T>, but that's only for single-dimension arrays with a lower bound of 0 β "list-ish" arrays.
I'm guessing the static methods are the only way to add generic methods that work over any shape of array regardless of whether it qualifies for the above-mentioned compiler magic.
I have a method that is over 700+ lines long. In the beginning of the method, there are around 50 local variables declared. I decided to take the local variables out and put them into a separate class as properties so I could just declare the class in the method and use the properties through out it. Is this perfectly fine or does another data type fit in here such as a struct? This method was written during classic ASP times.
I have a method that is over 700+ lines long. In the beginning of the method, there are around 50 local variables declared.
Ok, so, the length of that method is also a problem. 700 lines is just too much to keep straight in one normal person's head all at once. When you have to fix a bug in there you end up scrolling up and down and up and down and... you get the idea. It really makes things hard to maintain.
So my answer is, yes, you should likely split up your data into a structure of some sort assuming that it actually makes sense to do so (i.e., I probably wouldn't create a SomeMethodParmaters class). The next thing to do is to split that method out into smaller pieces. You may even find that you no longer need a data structure as now each method only has a handful of variables declared for the work it needs to do.
Also, this is subjective, but there is really no good reason to declare all variables at the top of the method. Try declaring them as close to when they are actually used as possible. Again, this just keeps things nice and clean for maintenance in the future. It's much easier to concentrate on one section of code when you can see it all on the screen.
Hrm... I think you'd probably be better off refactoring the method to not have to operate on so many variables at all. For instance, five methods operating on ten variables each would infinitely better. As it stands now, it feels like you're simply trying to mask an issue rather than solve it.
I would strongly recommend you take a read through this book and/or any number of web sites concerned with refactoring. http://www.amazon.com/Refactoring-Improving-Design-Existing-Code/dp/0201485672
Although you can't look at 700+ lines in a single method and automatically say that is bad, it does indicate a bad code smell. Methods should be small units of code with a single purpose. This makes it easier for you to maintain or those who come behind you. It can also help you to figure out improvements to your design and make altering your design in the future much easier.
Creating a class just to hold properties without looking at what the overall structure should be is just hiding a problem. That is not to say in this particular instance that is not a perfectly acceptable and correct solution, just that you should make sure you are taking the time to provide a properly thought out design where your classes have the properties, state, and functionality they deserve.
Hope this helps.
Unlike Java, why does C# treat methods as non-virtual functions by default? Is it more likely to be a performance issue rather than other possible outcomes?
I am reminded of reading a paragraph from Anders Hejlsberg about several advantages the existing architecture is bringing out. But, what about side effects? Is it really a good trade-off to have non-virtual methods by default?
Classes should be designed for inheritance to be able to take advantage of it. Having methods virtual by default means that every function in the class can be plugged out and replaced by another, which is not really a good thing. Many people even believe that classes should have been sealed by default.
virtual methods can also have a slight performance implication. This is not likely to be the primary reason, however.
I'm surprised that there seems to be such a consensus here that non-virtual-by-default is the right way to do things. I'm going to come down on the other - I think pragmatic - side of the fence.
Most of the justifications read to me like the old "If we give you the power you might hurt yourself" argument. From programmers?!
It seems to me like the coder who didn't know enough (or have enough time) to design their library for inheritance and/or extensibility is the coder who's produced exactly the library I'm likely to have to fix or tweak - exactly the library where the ability to override would come in most useful.
The number of times I've had to write ugly, desperate work-around code (or to abandon usage and roll my own alternative solution) because I can't override far, far outweighs the number of times I've ever been bitten (e.g. in Java) by overriding where the designer might not have considered I might.
Non-virtual-by-default makes my life harder.
UPDATE: It's been pointed out [quite correctly] that I didn't actually answer the question. So - and with apologies for being rather late....
I kinda wanted to be able to write something pithy like "C# implements methods as non-virtual by default because a bad decision was made which valued programs more highly than programmers". (I think that could be somewhat justified based on some of the other answers to this question - like performance (premature optimisation, anyone?), or guaranteeing the behaviour of classes.)
However, I realise I'd just be stating my opinion and not that definitive answer that Stack Overflow desires. Surely, I thought, at the highest level the definitive (but unhelpful) answer is:
They're non-virtual by default because the language-designers had a decision to make and that's what they chose.
Now I guess the exact reason that they made that decision we'll never.... oh, wait! The transcript of a conversation!
So it would seem that the answers and comments here about the dangers of overriding APIs and the need to explicitly design for inheritance are on the right track but are all missing an important temporal aspect: Anders' main concern was about maintaining a class's or API's implicit contract across versions. And I think he's actually more concerned about allowing the .Net / C# platform to change under code rather than concerned about user-code changing on top of the platform. (And his "pragmatic" viewpoint is the exact opposite of mine because he's looking from the other side.)
(But couldn't they just have picked virtual-by-default and then peppered "final" through the codebase? Perhaps that's not quite the same.. and Anders is clearly smarter than me so I'm going to let it lie.)
Because it's too easy to forget that a method may be overridden and not design for that. C# makes you think before you make it virtual. I think this is a great design decision. Some people (such as Jon Skeet) have even said that classes should be sealed by default.
To summarize what others said, there are a few reasons:
1- In C#, there are many things in syntax and semantics that come straight from C++. The fact that methods where not-virtual by default in C++ influenced C#.
2- Having every method virtual by default is a performance concern because every method call must use the object's Virtual Table. Moreover, this strongly limits the Just-In-Time compiler's ability to inline methods and perform other kinds of optimization.
3- Most importantly, if methods are not virtual by default, you can guarantee the behavior of your classes. When they are virtual by default, such as in Java, you can't even guarantee that a simple getter method will do as intended because it could be overridden to do anything in a derived class (of course you can, and should, make the method and/or the class final).
One might wonder, as Zifre mentioned, why the C# language did not go a step further and make classes sealed by default. That's part of the whole debate about the problems of implementation inheritance, which is a very interesting topic.
C# is influenced by C++ (and more). C++ does not enable dynamic dispatch (virtual functions) by default. One (good?) argument for this is the question: "How often do you implement classes that are members of a class hiearchy?". Another reason to avoid enabling dynamic dispatch by default is the memory footprint. A class without a virtual pointer (vpointer) pointing to a virtual table, is ofcourse smaller than the corresponding class with late binding enabled.
The performance issue is not so easy to say "yes" or "no" to. The reason for this is the Just In Time (JIT) compilation which is a run time optimization in C#.
Another, similar question about "speed of virtual calls.."
The simple reason is design and maintenance cost in addition to performance costs. A virtual method has additional cost as compared with a non-virtual method because the designer of the class must plan for what happens when the method is overridden by another class. This has a big impact if you expect a particular method to update internal state or have a particular behavior. You now have to plan for what happens when a derived class changes that behavior. It's much harder to write reliable code in that situation.
With a non-virtual method you have total control. Anything that goes wrong is the fault of the original author. The code is much easier to reason about.
If all C# methods were virtual then the vtbl would be much bigger.
C# objects only have virtual methods if the class has virtual methods defined. It is true that all objects have type information that includes a vtbl equivalent, but if no virtual methods are defined then only the base Object methods will be present.
#Tom Hawtin: It is probably more accurate to say that C++, C# and Java are all from the C family of languages :)
Coming from a perl background I think C# sealed the doom of every developer who might have wanted to extend and modify the behaviour of a base class' thru a non virtual method without forcing all users of the new class to be aware of potentially behind the scene details.
Consider the List class' Add method. What if a developer wanted to update one of several potential databases whenever a particular List is 'Added' to? If 'Add' had been virtual by default the developer could develop a 'BackedList' class that overrode the 'Add' method without forcing all client code to know it was a 'BackedList' instead of a regular 'List'. For all practical purposes the 'BackedList' can be viewed as just another 'List' from client code.
This makes sense from the perspective of a large main class which might provide access to one or more list components which themselves are backed by one or more schemas in a database. Given that C# methods are not virtual by default, the list provided by the main class cannot be a simple IEnumerable or ICollection or even a List instance but must instead be advertised to the client as a 'BackedList' in order to ensure that the new version of the 'Add' operation is called to update the correct schema.
It is certainly not a performance issue. Sun's Java interpreter uses he same code to dispatch (invokevirtual bytecode) and HotSpot generates exactly the same code whether final or not. I believe all C# objects (but not structs) have virtual methods, so you are always going to need the vtbl/runtime class identification. C# is a dialect of "Java-like languages". To suggest it comes from C++ is not entirely honest.
There is an idea that you should "design for inheritance or else prohibit it". Which sounds like a great idea right up to the moment you have a severe business case to put in a quick fix. Perhaps inheriting from code that you don't control.
Performance.
Imagine a set of classes that override a virtual base method:
class Base {
public virtual int func(int x) { return 0; }
}
class ClassA: Base {
public override int func(int x) { return x + 100; }
}
class ClassB: Base {
public override int func(int x) { return x + 200; }
}
Now imagine you want to call the func method:
Base foo;
//...sometime later...
int x = foo.func(42);
Look at what the CPU has to actually do:
mov ecx, bfunc$ -- load the address of the "ClassB.func" method from the VMT
push 42 -- push the 42 argument
call [eax] -- call ClassB.func
No problem? No, problem!
The assembly isn't that hard to follow:
mov ecx, foo$: This needs to reach into memory, and hit the part of the object's Virtual Method Table (VMT) to get the address of the overridden foo method. The CPU will begin the fetch of the data from memory, and then it will continue on:
push 42: Push the argument 42 onto the stack for the call to the function. No problem, that can run right away, and then we continue to:
call [ecx] Call the address of the ClassB.func function. β πππΈππ!!!
That's a problem. The address of ClassB.func function has not been fetched from the VMT yet. This means that the CPU doesn't know where the go to next. Ideally it would follow a jump and continue spectatively executing instructions as it waits for the address of ClassB.func to come back from memory. But it can't; so we wait.
If we are lucky: the data already is in the L2 cache. Getting a value out of the L2 cache into a place where it can be used is going to take 12-15 cycles. The CPU can't know where to go next without having to wait for memory for 12-15 cycles.
πππ ββπ ππ€ π€π₯πππππ ππ π£ ππ-ππ ππͺππππ€
Our program is stuck doing nothing for 12-15 cycles.
The CPU core has 7 execution engines. The main job of the CPU is keeping those 7 pipelines full of stuff to do. That means:
JITing your machine code into a different order
Starting the fetch from memory as soon as possible, letting us move on to other things
executing 100, 200, 300 instructions ahead. It will be executing 17 iterations ahead in your loop, across multiple function call and returns
it has a branch predictor to try to guess which way a comparison will go, so that it can continue executing ahead while we wait. If it guesses wrong, then it does have to undo all that work. But the branch predictor is not stupid - it's right 94% of the time.
Your CPU has all this power, and capability, and it's just STALLED FOR 15 CYCLES!?
This is awful. This is terrible. And you suffer this penalty every time you call a virtual method - whether you actually overrode it or not.
Our program is 12-15 cycles slower every method call because the language designer made virtual methods opt-out rather than opt-in.
This is why Microsoft decided to not make all methods virtual by default: they learned from Java's mistakes.
Someone ported Android to C#, and it was faster
In 2012, the Xamarin people ported all of Android's Dalvik (i.e. Java) to C#. From them:
Performance
When C# came around, Microsoft modified the language in a couple of significant ways that made it easier to optimize. Value types were introduced to allow small objects to have low overheads and virtual methods were made opt-in, instead of opt-out which made for simpler VMs.
(emphasis mine)
I have a number of data classes representing various entities.
Which is better: writing a generic class (say, to print or output XML) using generics and interfaces, or writing a separate class to deal with each data class?
Is there a performance benefit or any other benefit (other than it saving me the time of writing separate classes)?
There's a significant performance benefit to using generics -- you do away with boxing and unboxing. Compared with developing your own classes, it's a coin toss (with one side of the coin weighted more than the other). Roll your own only if you think you can out-perform the authors of the framework.
Not only yes, but HECK YES. I didn't believe how big of a difference they could make. We did testing in VistaDB after a rewrite of a small percentage of core code that used ArrayLists and HashTables over to generics. 250% or more was the speed improvement.
Read my blog about the testing we did on generics vs weak type collections. The results blew our mind.
I have started rewriting lots of old code that used the weakly typed collections into strongly typed ones. One of my biggest grips with the ADO.NET interface is that they don't expose more strongly typed ways of getting data in and out. The casting time from an object and back is an absolute killer in high volume applications.
Another side effect of strongly typing is that you often will find weakly typed reference problems in your code. We found that through implementing structs in some cases to avoid putting pressure on the GC we could further speed up our code. Combine this with strongly typing for your best speed increase.
Sometimes you have to use weakly typed interfaces within the dot net runtime. Whenever possible though look for ways to stay strongly typed. It really does make a huge difference in performance for non trivial applications.
Generics in C# are truly generic types from the CLR perspective. There should not be any fundamental difference between the performance of a generic class and a specific class that does the exact same thing. This is different from Java Generics, which are more of an automated type cast where needed or C++ templates that expand at compile time.
Here's a good paper, somewhat old, that explains the basic design:
"Design and Implementation of Generics for the
.NET Common Language Runtime".
If you hand-write classes for specific tasks chances are you can optimize some aspects where you would need additional detours through an interface of a generic type.
In summary, there may be a performance benefit but I would recommend the generic solution first, then optimize if needed. This is especially true if you expect to instantiate the generic with many different types.
I did some simple benchmarking on ArrayList's vs Generic Lists for a different question: Generics vs. Array Lists, your mileage will vary, but the Generic List was 4.7 times faster than the ArrayList.
So yes, boxing / unboxing are critical if you are doing a lot of operations. If you are doing simple CRUD stuff, I wouldn't worry about it.
Generics are one of the way to parameterize code and avoid repetition. Looking at your program description and your thought of writing a separate class to deal with each and every data object, I would lean to generics. Having a single class taking care of many data objects, instead of many classes that do the same thing, increases your performance. And of course your performance, measured in the ability to change your code, is usually more important than the computer performance. :-)
According to Microsoft, Generics are faster than casting (boxing/unboxing primitives) which is true.
They also claim generics provide better performance than casting between reference types, which seems to be untrue (no one can quite prove it).
Tony Northrup - co-author of MCTS 70-536: Application Development Foundation - states in the same book the following:
I havenβt been able to reproduce the
performance benefits of generics;
however, according to Microsoft,
generics are faster than using
casting. In practice, casting proved
to be several times faster than using
a generic. However, you probably wonβt
notice performance differences in your
applications. (My tests over 100,000
iterations took only a few seconds.)
So you should still use generics
because they are type-safe.
I haven't been able to reproduce such performance benefits with generics compared to casting between reference types - so I'd say the performance gain is "supposed" more than "significant".
if you compare a generic list (for example) to a specific list for exactly the type you use then the difference is minimal, the results from the JIT compiler are almost the same.
if you compare a generic list to a list of objects then there is significant benefits to the generic list - no boxing/unboxing for value types and no type checks for reference types.
also the generic collection classes in the .net library were heavily optimized and you are unlikely to do better yourself.
In the case of generic collections vs. boxing et al, with older collections like ArrayList, generics are a performance win. But in the vast majority of cases this is not the most important benefit of generics. I think there are two things that are of much greater benefit:
Type safety.
Self documenting aka more readable.
Generics promote type safety, forcing a more homogeneous collection. Imagine stumbling across a string when you expect an int. Ouch.
Generic collections are also more self documenting. Consider the two collections below:
ArrayList listOfNames = new ArrayList();
List<NameType> listOfNames = new List<NameType>();
Reading the first line you might think listOfNames is a list of strings. Wrong! It is actually storing objects of type NameType. The second example not only enforces that the type must be NameType (or a descendant), but the code is more readable. I know right away that I need to go find TypeName and learn how to use it just by looking at the code.
I have seen a lot of these "does x perform better than y" questions on StackOverflow. The question here was very fair, and as it turns out generics are a win any way you skin the cat. But at the end of the day the point is to provide the user with something useful. Sure your application needs to be able to perform, but it also needs to not crash, and you need to be able to quickly respond to bugs and feature requests. I think you can see how these last two points tie in with the type safety and code readability of generic collections. If it were the opposite case, if ArrayList outperformed List<>, I would probably still take the List<> implementation unless the performance difference was significant.
As far as performance goes (in general), I would be willing to bet that you will find the majority of your performance bottlenecks in these areas over the course of your career:
Poor design of database or database queries (including indexing, etc),
Poor memory management (forgetting to call dispose, deep stacks, holding onto objects too long, etc),
Improper thread management (too many threads, not calling IO on a background thread in desktop apps, etc),
Poor IO design.
None of these are fixed with single-line solutions. We as programmers, engineers and geeks want to know all the cool little performance tricks. But it is important that we keep our eyes on the ball. I believe focusing on good design and programming practices in the four areas I mentioned above will further that cause far more than worrying about small performance gains.
Generics are faster!
I also discovered that Tony Northrup wrote wrong things about performance of generics and non-generics in his book.
I wrote about this on my blog:
http://andriybuday.blogspot.com/2010/01/generics-performance-vs-non-generics.html
Here is great article where author compares performance of generics and non-generics:
nayyeri.net/use-generics-to-improve-performance
If you're thinking of a generic class that calls methods on some interface to do its work, that will be slower than specific classes using known types, because calling an interface method is slower than a (non-virtual) function call.
Of course, unless the code is the slow part of a performance-critical process, you should focus of clarity.
See Rico Mariani's Blog at MSDN too:
http://blogs.msdn.com/ricom/archive/2005/08/26/456879.aspx
Q1: Which is faster?
The Generics version is considerably
faster, see below.
The article is a little old, but gives the details.
Not only can you do away with boxing but the generic implementations are somewhat faster than the non generic counterparts with reference types due to a change in the underlying implementation.
The originals were designed with a particular extension model in mind. This model was never really used (and would have been a bad idea anyway) but the design decision forced a couple of methods to be virtual and thus uninlineable (based on the current and past JIT optimisations in this regard).
This decision was rectified in the newer classes but cannot be altered in the older ones without it being a potential binary breaking change.
In addition iteration via foreach on an List<> (rather than IList<>) is faster due to the ArrayList's Enumerator requiring a heap allocation. Admittedly this did lead to an obscure bug
I've just coded a 700 line class. Awful. I hang my head in shame. It's as opposite to DRY as a British summer.
It's full of cut and paste with minor tweaks here and there. This makes it's a prime candidate for refactoring. Before I embark on this, I'd thought I'd ask when you have lots of repetition, what are the first refactoring opportunities you look for?
For the record, mine are probably using:
Generic classes and methods
Method overloading/chaining.
What are yours?
I like to start refactoring when I need to, rather than the first opportunity that I get. You might say this is somewhat of an agile approach to refactoring. When do I feel I need to? Usually when I feel that the ugly parts of my codes are starting to spread. I think ugliness is okay as long as they are contained, but the moment when they start having the urge to spread, that's when you need to take care of business.
The techniques you use for refactoring should start with the simplest. I would strongly recommand Martin Fowler's book. Combining common code into functions, removing unneeded variables, and other simple techniques gets you a lot of mileage. For list operations, I prefer using functional programming idioms. That is to say, I use internal iterators, map, filter and reduce(in python speak, there are corresponding things in ruby, lisp and haskell) whenever I can, this makes code a lot shorter and more self-contained.
#region
I made a 1,000 line class only one line with it!
In all seriousness, the best way to avoid repetition is the things covered in your list, as well as fully utilizing polymorphism, examine your class and discover what would best be done in a base class, and how different components of it can be broken away a subclasses.
Sometimes by the time you "complete functionality" using copy and paste code, you've come to a point that it is maimed and mangled enough that any attempt at refactoring will actually take much, much longer than refactoring it at the point where it was obvious.
In my personal experience my favorite "way of removing repetition" has been the "Extract Method" functionality of Resharper (although this is also available in vanilla Visual Studio).
Many times I would see repeated code (some legacy app I'm maintaining) not as whole methods but in chunks within completely separate methods. That gives a perfect opportunity to turn those chunks into methods.
Monster classes also tend to reveal that they contain more than one functionality. That in turn becomes an opportunity to separate each distinct functionality into its own (hopefully smaller) class.
I have to reiterate that doing all of these is not a pleasurable experience at all (for me), so I really would rather do it right while it's a small ball of mud, rather than let the big ball of mud roll and then try to fix that.
First of all, I would recommend refactoring much sooner than when you are done with the first version of the class. Anytime you see duplication, eliminate it ASAP. This may take a little longer initially, but I think the results end up being a lot cleaner, and it helps you rethink your code as you go to ensure you are doing things right.
As for my favorite way of removing duplication.... Closures, especially in my favorite language (Ruby). They tend to be a really concise way of taking 2 pieces of code and merging the similarities. Of course (like any "best practice" or tip), this can not be blindly done... I just find them really fun to use when I can use them.
One of the things I do, is try to make small and simple methods that I can see on a single page in my editor (visual studio).
I've learnt from experience that making code simple makes it easier for the compiler to optimise it. The larger the method, the harder the compiler has to work!
I've also recently seen a problem where large methods have caused a memory leak. Basically I had a loop very much like the following:
while (true)
{
var smallObject = WaitForSomethingToTurnUp();
var largeObject = DoSomethingWithSmallObject();
}
I was finding that my application was keeping a large amount of data in memory because even though 'largeObject' wasn't in scope until smallObject returned something, the garbage collector could still see it.
I easily solved this by moving the 'DoSomethingWithSmallObject()' and other associated code to another method.
Also, if you make small methods, your reuse within a class will become significantly higher. I generally try to make sure that none of my methods look like any others!
Hope this helps.
Nick
"cut and paste with minor tweaks here and there" is the kind of code repetition I usually solve with an entirely non-exotic approach- Take the similar chunk of code, extract it out to a seperate method. The little bit that is different in every instance of that block of code, change that to a parameter.
There's also some easy techniques for removing repetitive-looking if/else if and switch blocks, courtesy of Scott Hanselman:
http://www.hanselman.com/blog/CategoryView.aspx?category=Source+Code&page=2
I might go something like this:
Create custom (private) types for data structures and put all the related logic in there. Dictionary<string, List<int>> etc.
Make inner functions or properties that guarantee behaviour. If youβre continually checking conditions from a publically accessible property then create an private getter method with all of the checking baked in.
Split methods apart that have too much going on. If you canβt put something succinct into the or give it a good name, then start breaking the function apart until the code is (even if these βchildβ functions arenβt used anywhere else).
If all else fails, slap a [SuppressMessage("Microsoft.Maintainability", "CA1502:AvoidExcessiveComplexity")] on it and comment why.