Alright, so I've been trying to figure this out for far too long now.
I've read countless articles/questions on static, and also singletons.
This is how I have been using static for quite some time:
I have a GameManager.cs, which holds a TileMap.cs and a Player.cs. TileMap.cs needs to access the player's camera to update the position of the map:
GameManager.cs:
public static TileMap map;
public static Player player;
// Update, Draw, etc. below...
There will only be one of each of these.
TileMap.cs:
Vector2 mapPosition = new Vector2(GameManager.player.position.X, GameManager.player.position.Y);
Is this acceptable? What downsides are there to this?
Edit: Seeing as this question is too broad, let me see if I can be more specific.
Is there a better way to do this than through statics? I have multiple Lists of objects inside TileMap.cs, some of which contain Lists themselves (thinking mostly of my particle engine), so would passing the player down via Update(Player player) be more efficient, or would it not really matter?
P.S. I have noticed that when the player moves, the map sort of "jitters" (lags for a fraction of a second). Could this approach be causing it?
Thanks,
Shyy
As you are asking about downsides:
1) Is the code that uses these static variables involved in multithreading? If so, you may have to manage locking, which easily turns apparently simple code into complicated code.
2) Are Player, position and the others structs? If so, every time you access them via a property you create a copy of the instance instead of accessing the reference directly. Considering that the code suggests some 2D engine, and you are creating a Vector, so probably rendering-pipeline code, this may have serious performance implications.
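For example (a minimal sketch; the Position property returning a struct is my assumption, not your actual code):

struct Vector2 { public float X, Y; }

class Player
{
    public Vector2 Position { get; set; } // a property returning a struct

    void Example()
    {
        // Position.X = 5f;  // CS1612: this would modify a temporary copy,
        //                   // so the compiler rejects it outright
        var p = Position;    // copy #1: the getter returns a copy
        p.X = 5f;            // modify the copy
        Position = p;        // write the whole struct back
    }
}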
I've come across similar questions quite often professionally. So let me give you a straight answer, of which I am well aware that it doesn't apply to the general case. See it as an opinion on what I consider 'best practice for beginners'.
static is often used to make variables available across class boundaries, as if they were singletons. The singleton pattern here is just a design-pattern wrapper around the same idea (and it doesn't solve most of the problems). While this might make programs easier to write, using static can also make programs much more complex as soon as you want to make your application multi-threaded.
In general I therefore think it's a good idea to avoid using static altogether and simply pass objects around.
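To make that concrete, here is a minimal sketch of what passing objects around could look like for the question above (the constructor injection is my suggestion; Player and Vector2 are the question's existing types):

public class TileMap
{
    private readonly Player player;

    public TileMap(Player player) // the dependency is handed over once...
    {
        this.player = player;
    }

    public void Update()
    {
        // ...and used here without any static access, which keeps the
        // dependency explicit, testable, and friendlier to threading.
        Vector2 mapPosition = new Vector2(player.position.X, player.position.Y);
    }
}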
There is one exception: when you need data that is accessible across thread boundaries. If that's your situation, you'll quickly find yourself in a world of hurt, and it is best to learn as much as you can about locking and to use it for all static variables (including their members, if they are structs/classes!).
If even that isn't good enough, you can continue down that path and learn about things like interlocked operations and memory barriers (but I wouldn't recommend it if you don't need it).
Fortunately, most applications are fast enough to work in a single thread. I imagine your application to be no exception (and you can probably use a standard game framework to do the multi-threaded part of the application -- PS: if so, be careful with class variables as well, since they might be passed across thread boundaries as well).
As for your lag: I think you should use a good profiler to find performance hotspots; it's probably unrelated to the use of static.
So, this may seem like a very odd question for many, but here it is:
Say you have an abstract class "Object" with an abstract method doStuff() which 10,000 classes inherit from.
Then in another class you have an "Object" dictionary holding 100 random objects of the "Object" type. You call doStuff() on each of them.
Does the number of classes have any performance impact? How does the executable find which class's method to execute? Is it a jump table, a pointer table, the equivalent of a huge switch-case, ...?
If it has any performance impact, are there ways to structure your code differently to eliminate this problem?
I feel I am really overthinking this.
There is no noticeable performance impact when you call doStuff.
At runtime, the type of the object you are calling doStuff on is known for sure. At compile time you'd need a giant switch statement because you don't know the type. The CLR sees that you are trying to call doStuff on a Subclass0679 instance, looks the method up in that class's method table, and invokes it. Simple as that.
Think about it this way. ToString() is declared in Object and all classes inherit Object. When you call ToString() on something, is it really slow? No.
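To illustrate the mechanism (a minimal sketch; the class and method names are made up beyond those in the question):

using System.Collections.Generic;

abstract class Base // the "Object" from your example
{
    public abstract void DoStuff();
}

class Subclass0679 : Base
{
    public override void DoStuff() { /* ... */ }
}

static class Dispatch
{
    static void RunAll(Dictionary<int, Base> objects)
    {
        foreach (Base obj in objects.Values)
            obj.DoStuff(); // one indirect call through obj's method table;
                           // the cost does not grow with the number of subclasses
    }
}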
The number of derived classes can have some impact.
In particular, with ten thousand derived classes and only 100 objects, chances are pretty good that each call to doStuff will actually be to a unique function, separate from the others. That means your instruction cache won't be very effective.
This is a fairly unrealistic scenario though. A collection that could be any one of ten thousand different derived classes is unlikely to ever arise in actual practice.
To put this in perspective, .NET (in its entirety) consists of roughly nine thousand nine hundred classes. A decent-sized application could easily add the hundred or so more needed to get to ten thousand -- but you're talking about a collection that could include anything in .NET, plus anything in your application. I find it difficult to imagine a situation in which that is likely to make sense.
If you're asking this out of curiosity as a hypothetical question, then fair enough.
However, if you're trying to prematurely optimise some code and this is the level of decisions you're making, I would highly recommend you concentrate on making your code work first and then use a profiler to identify hotspots as areas to optimise.
Also, super optimised code is usually far less readable and maintainable.
Unless your code is a game engine or performs some enormous calculation, does it really need to be so optimised? If the code communicates with the outside world at all -- network, disk, DB, etc. -- then that latency will completely dwarf any imperceptible difference in timing caused by using inheritance.
Suppose that, for various reasons and constraints, we would like to program efficiently.
Should we put OOP aside?
Let's illustrate with an example:
public class CPlayer {
    Vector3 m_position;
    Quaternion m_rotation;
    // other fields
}

public class CPlayerController {
    CPlayer[] players;

    public CPlayerController(int _count) {
        players = new CPlayer[_count];
    }

    public void ComputeClosestPlayer(CPlayer _player) {
        for (int i = 0; i < players.Length; i++) {
            // find the closest player to _player
        }
    }
}
If we convert the class to a struct, we can take advantage of the CPU cache when iterating the player array and get better performance. When we iterate the array in the ComputeClosestPlayer function, we know the player structs are stored consecutively, so they are pulled into the cache as soon as we read the first element of the array.
public struct CPlayerController {
    CPlayer[] players;

    public CPlayerController(int _count) {
        players = new CPlayer[_count];
    }

    public void ComputeClosestPlayer(CPlayer _player) {
        for (int i = 0; i < players.Length; i++) {
            // find the closest player to _player
        }
    }
}
If we want to achieve more performance, we can separate the position field from the class:
public Vector3[] m_positions;
So now, only the positions (12 bytes each) are pulled into the cache when we call the function, whereas in the previous approach we had to cache whole objects, which take up more memory.
Finally, I do not know whether separating some fields out of a class for better performance is a standard approach or something to avoid. Please share your approach to getting the most performance in strategy games where you have a lot of items and soldiers.
Put aside OO design patterns to achieve better performance in strategy games?
I tend to favor this broad approach of putting aside OOP for visual FX at the central architecture level, specifically with an entity-component system, where the components are just data (structs with no functionality of their own). If there's any functionality at all inside a component, it's pure data-structure functionality (like the functions you find in std::vector in C++ or ArrayList in C#).
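For instance, components and a system in this style might look roughly like this (a hedged sketch of the idea, not any particular ECS library):

// Components: pure data, no behavior of their own.
struct Position { public float X, Y, Z; }
struct Velocity { public float X, Y, Z; }

// A system: it just fetches component data and transforms it.
static class MotionSystem
{
    public static void Update(Position[] positions, Velocity[] velocities, float dt)
    {
        for (int i = 0; i < positions.Length; i++)
        {
            positions[i].X += velocities[i].X * dt;
            positions[i].Y += velocities[i].Y * dt;
            positions[i].Z += velocities[i].Z * dt;
        }
    }
}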
It does allow efficient things to be done more easily, but the primary benefit for me wasn't efficiency; it was flexibility and maintainability. When I need brand-new behavior, I can just make a small local modification to a system, or add a new component, or add a new system, because the wide dependencies flow towards data rather than abstractions. The need to face cascading design changes has been a thing of the past ever since I embraced this approach.
"No Central Design"
Every new design idea, no matter how crazy, tends to be fairly easy to extend and add to the system without breaking any central designs, since there are no central abstract designs in the first place (at least none that contain functionality) besides the ECS database itself. And with the exception of dependencies on the ECS database and a handful of components (raw data) in each system, the whole thing is uber-decoupled.
And it makes every system easy to reason about, for everything from thread safety to which side effects go on where and when. Each system performs a very well-defined role that maps very directly to business requirements. It's much harder to reason about designs and responsibilities when you instead have a tangled web of communication and interaction between medium- and teeny-sized objects.
Note that such a web is not a dependency diagram. In terms of coupling there might be an abstraction between every pair of objects to decouple them (ex: a Modeling object depending on IMesh, not on a concrete mesh), but they still talk to each other, and all that interaction and communication can make it difficult to reason about what's going on, as well as to write the most efficient loopy code.
Meanwhile the first setup, where each independent system processes data from a central database in a flat, pipeline-style fashion, makes it very easy to figure out what's going on, as well as to implement the loopy, critical paths of execution very efficiently. It also means you can sit down and work on a system without knowing what the other systems are doing: to implement a physics system, all you have to do is read things like motion components from the database and transform the data correctly. You don't need to know much to implement and maintain the physics system besides a few component types and how to fetch them from the ECS "database".
That also makes it easier to work in a team, and to hire new developers who can come up to speed quickly without spending two years being taught how the central abstractions of the overall system work just so they can do their job. They can start doing things as bold and as centrally impacting to the software's design as introducing a brand-new physics engine or rendering engine within weeks.
Efficiency
If we convert the class to a struct, we can take advantage of the CPU
cache when iterating the player array and get better performance.
That's only at a granular level where objects start to get in the way of performance. For example, if you try to represent a single pixel of an image with an abstract Pixel object or IPixel interface that encapsulates and hides its data, it can very easily become a performance barrier, even before considering the cost of dynamic dispatch. Such a granular object forces you to work one pixel at a time while fudging about with its public interface, so optimizations like processing the image on the GPU, or SIMD processing on the CPU, are out when we have this barrier between ourselves and the pixel data.
Aside from that, you generally can't code to an interface and expect efficient solutions at a level this granular (one pixel). We can't hide concrete details like pixel formats behind abstractions and still expect to write efficient video filters that loop through millions of pixels every frame. At a low enough level, we have to write code against concrete details for reasonable efficiency; abstractions and coding to an interface become helpful as you move towards high-level operations.
But naturally that doesn't apply if you turn Pixel into just raw data stored inside an Image. An image object, which is really a container of often millions of pixels, isn't a practical barrier to a very efficient solution. We don't have to abandon OOP to write very efficient loops; we just might need to avoid it for the teeniest, most granular objects that barely store any data of their own.
So one alternative strategy is simply to model your objects at a coarser level. You don't have to design a Human class; you can design a Humans class which inherits from Creatures. Now the implementation of Humans might consist of loopy, multithreaded SIMD code that processes thousands of humans' worth of data at once (ex: SoA fields stored in parallel arrays).
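A rough sketch of that coarser modeling (the field choices and names are illustrative assumptions):

abstract class Creatures
{
    public abstract void Update(float dt);
}

class Humans : Creatures // one object for *all* humans, not one per human
{
    // SoA: parallel arrays, processed in bulk.
    readonly float[] posX, posY, posZ;
    readonly float[] health;

    public Humans(int count)
    {
        posX = new float[count];
        posY = new float[count];
        posZ = new float[count];
        health = new float[count];
    }

    public override void Update(float dt)
    {
        // Loopy, cache-friendly code over thousands of humans at once;
        // this body could later be multithreaded or vectorized without
        // touching the public interface.
        for (int i = 0; i < health.Length; i++)
        {
            posY[i] += dt; // placeholder per-human work
        }
    }
}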
Alternatives to OOP
The appeal to me of abandoning OOP at the broadest level of the design (I still use OOP plenty inside each subsystem) in favor of value aggregates at the central level, as with the components in an ECS, is mostly the flexibility of leaving the data wide open and accessible through a central "database". It makes it easier to implement just the systems you need and to access the data effectively, without jumping through hoops and going through layers upon layers of abstractions while fighting against the abstractions you designed.
Of course there are downsides, like that your data now has to be very stable (we can't keep changing it) or else your systems will break. And you also need a decent system organization so that your data is accessed and modified in the minimum number of places, letting you effectively maintain invariants, reason about side effects, etc. But I've found an ECS does that beautifully and naturally, since it becomes very easy to tell which systems access which components.
In my case, I found a nice fit using an ECS for the large-scale design; then, when you zoom in to the implementation of a particular system, like a physics or rendering system, it uses OOP for auxiliary data structures and medium-sized objects that make the system's implementation more comprehensible. I find OOP very useful at that medium scale of complexity, but sometimes very difficult to maintain and optimize at the largest scale.
Hot/Cold Field Splitting and Value Aggregates
Finally, I do not know whether separating some fields out of a class
for better performance is a standard approach or something to avoid.
Please share your approach to getting the most performance in strategy
games where you have a lot of items and soldiers.
This is getting a bit C#-specific, and I'm more of a C++ and C programmer, but I believe C# structs, as long as they aren't boxed, can be stored contiguously in an array. That contiguity can make a world of difference in terms of reducing cache misses.
Specifically in GC languages, the initial set of object allocations can often be done quickly using a sequential allocator (the "Eden" space in Java; I believe C# does something similar, though I haven't read any papers on C#'s implementation details), which makes for very efficient GC implementations. But after the first GC cycle, memory can be shuffled around to allow it to be reclaimed on an individual-object basis. That loss of spatial locality can really hurt performance if you need very efficient sequential loops, so storing an array of structs or of primitive types like int or float might be a useful optimization in some key loopy areas of your game.
As for the approach of separating fields, that's useful for SIMD processing and for hot/cold field splitting. Hot/cold field splitting means separating the frequently-accessed data fields from the ones that aren't. For example, a particle system might spend the bulk of its time moving particles around and detecting collisions; it has absolutely no interest in something like a particle's color during those critical paths.
So an effective optimization might be to avoid storing color directly inside a particle and instead hoist it out into its own separate, parallel array. That way the hot data that is constantly accessed can be loaded into a 64-byte cache line without irrelevant data, like color, being dragged in needlessly and slowing down the critical paths by making them plow through irrelevant data to get at the relevant data.
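As a sketch (the exact fields are assumptions), that particle split might look like:

// Hot data: touched every frame in the critical loops.
struct ParticleHot
{
    public float X, Y;   // position
    public float VX, VY; // velocity
}

class ParticleSystem
{
    readonly ParticleHot[] hot; // tightly packed: cache lines carry only hot data
    readonly uint[] colors;     // cold data, hoisted out into a parallel array

    public ParticleSystem(int count)
    {
        hot = new ParticleHot[count];
        colors = new uint[count];
    }

    public void Simulate(float dt)
    {
        for (int i = 0; i < hot.Length; i++)
        {
            hot[i].X += hot[i].VX * dt; // never drags colors into the cache
            hot[i].Y += hot[i].VY * dt;
        }
    }
}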
All non-trivial optimizations tend to boil down to exchanges that skew performance towards the common case at cost to the rare case. To strike a good bargain and find a good trade, you want to make the common case faster even if it makes the rare case a little bit slower. Beyond blatant inefficiencies, you generally can't make everything fast for everything, though you can achieve something that looks that way and seems super fast for everything to the user if you optimize the common cases, the critical paths, really well.
Memory Access and Representation
If we want to achieve more performance, we can separate the position
field from the class:
public Vector3[] m_positions;
This would be working towards an SoA (Structure of Arrays) approach, and it makes sense if the bulk of your critical loops access a player's position, in sequential or random-access patterns, but not, say, its rotation. If rotation and position are frequently accessed together in a random-access pattern, a struct storing both in an AoS (Array of Structures) layout might make the most sense. If both are accessed predominantly in a sequential pattern with no random access, then SoA may perform better: not because it reduces cache misses (it would be close to on par with AoS there), but because it allows the optimizer to select more efficient instructions. It can, say, load eight single-precision position fields at once into a YMM register, with no rotation fields interleaved, for more homogeneous vertical arithmetic in the processing loops.
A full-blown SoA approach might even separate your position components. Instead of:
xyzxyzxyzxyz...
It might favor this if the access patterns for critical paths are all sequential and process a large amount of data:
xxxxxxxx...
yyyyyyyy...
zzzzzzzz...
Like so:
float[] m_x;
float[] m_y;
float[] m_z;
That tends to be the friendliest memory layout for all kinds of SIMD instructions (it allows you, or the optimizer, to use SIMD in a way that looks identical to scalar code, except applied to 4+ fields at a time), though you generally want sequential access patterns for this kind of layout. With random access, you might end up with almost three times the cache misses.
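With that layout, SIMD-style code can look almost identical to scalar code. A hedged sketch using System.Numerics (assuming, to keep it short, that the array length is a multiple of Vector<float>.Count):

using System.Numerics;

static class SoaDemo
{
    // Translate all positions by (dx, dy, dz), several floats per operation.
    public static void Translate(float[] x, float[] y, float[] z,
                                 float dx, float dy, float dz)
    {
        int w = Vector<float>.Count;
        for (int i = 0; i < x.Length; i += w)
        {
            (new Vector<float>(x, i) + new Vector<float>(dx)).CopyTo(x, i);
            (new Vector<float>(y, i) + new Vector<float>(dy)).CopyTo(y, i);
            (new Vector<float>(z, i) + new Vector<float>(dz)).CopyTo(z, i);
        }
    }
}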
As for which you choose: first figure out how you're going to access the data in your most critical loops, so you understand how to represent it effectively. And you generally have to make exchanges that slow down the rare case in favor of the common one. If you go with an SoA design, some places in the system might still benefit more from AoS, so you're speeding up the common case (if your critical paths are sequential) while slowing down the rare case (if your non-critical paths use random access). There's a lot to take in, and compromises to be made, to arrive at the most efficient solution. Naturally it helps to measure, but to design things efficiently up front you also have to think about memory access patterns: a memory layout that's good for one access pattern isn't necessarily good for another.
If in doubt, I'd favor AoS until you hit a hotspot, since it's generally easier to maintain than parallel arrays; then you can selectively apply SoA optimizations in hindsight. The key is to find the breathing room to do that, which you will have if you design coarser objects: Image, not Pixel; Humans, not Human; ParticleSystem, not Particle. If you design teeny little objects you use everywhere, you might find yourself trapped in a sub-optimal representation you can't change without breaking everything.
Finally, I do not know whether separating some fields out of a class
for better performance is a standard approach or something to avoid [...]
Actually it's very common, and discussed fairly widely (at least it's not esoteric knowledge), in fields like computer graphics and gaming when using languages like C++ that give you very explicit control over memory layout. But the techniques are just as applicable in GC languages, since those can still give you plenty of control over memory layout in the ways that matter, provided they at least offer something like a struct that can be stored contiguously in an array.
All of this stuff about efficient memory access relates predominantly to contiguity and spatial locality, because we're dealing with loops: ideally, when we load some data into a cache line, it covers not only the one element we're interested in but also the next one, and the next, and so forth. We want as much relevant data in the line as possible and as little irrelevant data as possible. Most of it becomes pointless (except for temporal locality) if nothing is stored contiguously, since we'd be loading irrelevant data every time we load an element; but just about every language gives you some way to store data in a purely contiguous fashion, even if you can't use objects for that.
I've actually seen a small interactive path tracer written in Java which rivaled the speed of equally small interactive ones written in C or C++. The main thing was that it avoided using objects for the critical parts involving BVH traversal and ray/triangle intersection; there it used big arrays of floats and integers, while everything else used OOP. If you apply these kinds of optimizations selectively in such languages, they can produce extremely impressive results from a performance standpoint without the code getting bogged down in maintenance issues.
Deciding between struct and class can be really tricky. Structs are value types while classes are reference types. Value types are passed by value, meaning the contents of the struct are copied into arrays, function parameters, and so on; reference types only have their reference passed. Another performance issue can arise from frequent boxing and un-boxing of value types.
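A small sketch of the difference, boxing included:

struct PointStruct { public int X; }
class PointClass  { public int X; }

static class ValueVsReference
{
    static void Main()
    {
        var s1 = new PointStruct { X = 1 };
        var s2 = s1;       // value copy: s2 is independent
        s2.X = 2;          // s1.X is still 1

        var c1 = new PointClass { X = 1 };
        var c2 = c1;       // reference copy: both point at the same object
        c2.X = 2;          // c1.X is now 2

        object boxed = s1;                 // boxing: the struct is copied to the heap
        var unboxed = (PointStruct)boxed;  // un-boxing: copied back out
    }
}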
I might not explain it well; fortunately MSDN has a pretty good guide here.
The one thing I would copy here from the guide is this:
✓ CONSIDER defining a struct instead of a class if instances of the type are small and commonly short-lived or are commonly embedded in other objects.
X AVOID defining a struct unless the type has all of the following characteristics:
It logically represents a single value, similar to primitive types (int, double, etc.).
It has an instance size under 16 bytes.
It is immutable.
It will not have to be boxed frequently.
In all other cases, you should define your types as classes.
I'm pretty new to C# so bear with me.
One of the first things I noticed about C# is that many of the classes are heavy on static methods. For example...
Why is it:
Array.ForEach(arr, proc)
instead of:
arr.ForEach(proc)
And why is it:
Array.Sort(arr)
instead of:
arr.Sort()
Feel free to point me to some FAQ on the net. If a detailed answer is in some book somewhere, I'd welcome a pointer to that as well. I'm looking for the definitive answer on this, but your speculation is welcome.
Because those are utility classes. The class is just a way to group the methods together, since there are no free functions in C#.
Assuming this answer is correct, instance methods require additional space in a "method table." Making array methods static may have been an early space-saving decision.
This, along with avoiding the this pointer check that Amitd references, could provide significant performance gains for something as ubiquitous as arrays.
Also see this rule from FxCop:
CA1822: Mark members as static
Rule Description
Members that do not access instance data or call instance methods can be marked as static (Shared in Visual Basic). After you mark the methods as static, the compiler will emit nonvirtual call sites to these members. Emitting nonvirtual call sites will prevent a check at runtime for each call that makes sure that the current object pointer is non-null. This can achieve a measurable performance gain for performance-sensitive code. In some cases, the failure to access the current object instance represents a correctness issue.
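In other words, something like this trivial illustration (mine, not from the rule's documentation):

class Geometry
{
    // Touches no instance state, so it can be static: callers get a
    // nonvirtual call site and skip the runtime null check on 'this'.
    public static double Area(double width, double height)
    {
        return width * height;
    }
}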
Perceived functionality.
"Utility" functions are unlike much of the functionality OO is meant to target.
Think about the case with collections, I/O, math and just about all utility.
With OO you generally model your domain. None of those things really fit in your domain--it's not like you are coding and go "Oh, we need to order a new hashtable, ours is getting full". Utility stuff often just doesn't fit.
We get pretty close, but it's still not very OO to pass around collections (where is your business logic? where do you put the methods that manipulate your collection and that other little piece or two of data you are always passing around with it?)
Same with numbers and math. It's kind of tough to have Integer.sqrt() and Long.sqrt() and Float.sqrt()--it just doesn't make sense, nor does "new Math().sqrt()". There are a lot of areas it just doesn't model well. If you are looking for mathematical modeling then OO may not be your best bet. (I made a pretty complete "Complex" and "Matrix" class in Java and made them fairly OO, but making them really taught me some of the limits of OO and Java--I ended up "Using" the classes from Groovy mostly)
I've never seen anything anywhere NEAR as good as OO for modeling your business logic, demonstrating the connections within your code, and managing the relationship between data and code, though.
So we fall back on a different model when it makes more sense to do so.
The classic motivations against static:
Hard to test
Not thread-safe
Increases code size in memory
1) C# has several tools available that make testing static methods relatively easy. A comparison of C# mocking tools, some of which support mocking statics: https://stackoverflow.com/questions/64242/rhino-mocks-typemock-moq-or-nmock-which-one-do-you-use-and-why
2) There are well-known, performant ways to do static object creation/logic without losing thread safety in C#. For example, implementing the Singleton pattern with a static class (you can jump to the fifth version if the inadequate options bore you): http://www.yoda.arachsys.com/csharp/singleton.html
3) As @K-ballo mentions, every method contributes to code size in memory in C#; instance methods don't get special treatment.
That said, the two specific examples you pointed out are just a matter of legacy support for the static Array class from before generics and other syntactic sugar were introduced back in the C# 1.0 days, as @Inerdia said. I tried to answer assuming you had more code in mind, possibly including outside libraries.
The Array class isn't generic, and it can't be made fully generic because that would break backwards compatibility. There's some compiler magic going on whereby arrays implement IList<T>, but only for single-dimension arrays with a lower bound of 0 -- "list-ish" arrays.
I'm guessing the static methods are the only way to add generic methods that work over any shape of array regardless of whether it qualifies for the above-mentioned compiler magic.
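You can see both halves of that in a few lines (a sketch; this is standard C#/.NET behavior as far as I know):

using System;
using System.Collections.Generic;

static class ArrayMagic
{
    static void Main()
    {
        int[] flat = { 3, 1, 2 };
        IList<int> asList = flat;  // works: single-dimension, zero-based, "list-ish"
        Array.Sort(flat);          // the static API handles it too

        int[,] grid = new int[2, 2];
        // IList<int> nope = grid; // won't compile: int[,] is not IList<int>,
        Array.Clear(grid, 0, grid.Length); // but static Array methods still work
    }
}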
There is this famous quote that says:
"Procedural code gets information then makes decisions. Object-oriented code tells objects to do things." -- Alec Sharp
The subject of this post is precisely that.
Let's assume we are developing a game in which we have a Game where there is a Board.
When facing the problem of deciding which methods we are going to implement on the Board class, I always think of two different approaches:
The first approach is to populate the Board class with getSize(), getPieceAt(x, y), and setPieceAt(x, y, piece). This seems reasonable and is what is generally found in libraries/frameworks: the Board class has a set of internal features it wants to share, and a set of methods that allow the client of the class to control it as he wishes. The client is supposed to ask for the things he needs and decide what to do with them. If he wants to set all board pieces to black, he will "manually" iterate over them to accomplish that goal.
The second approach is about looking at Board's dependent classes and seeing what they are "telling" it to do. ClassA wants to count how many pieces are red, so I'd implement calculateNumberOfRedPieces(). ClassB intends to clear all the pieces on the Board (set all of them to NullPiece, for example), so I'd add a clearBoard() method to the Board class. This approach is less general, but allows a lot more flexibility in other respects. If I "hide" Board behind an IBoard interface and later decide I want a board of infinite size, the first approach would leave me stuck, as I'd have to iterate over an infinite number of items! With the second, I'd do fine (I could, for instance, assume all pieces are null other than the ones contained in a hashtable!).
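To make the contrast concrete, here is a sketch of the two surfaces (method names are from my examples above; the C# shape of it is illustrative):

class Piece { }

// Approach 1: the client asks for data and decides what to do with it.
interface IBoardAsk
{
    int GetSize();
    Piece GetPieceAt(int x, int y);
    void SetPieceAt(int x, int y, Piece piece);
}

// Approach 2: the client tells the board what to do.
interface IBoardTell
{
    int CalculateNumberOfRedPieces(); // what ClassA needs
    void ClearBoard();                // what ClassB needs; an infinite board
                                      // can implement this without iterating
}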
So...
I am aware that if I intend to make a library, I am probably stuck with the first approach, as it is far more general. On the other hand, I'd like to know which approach to follow when I am in total control of the system that will use the Board class -- when I am the one who will also design all the classes that make use of it -- both now and in the future (won't the second approach cause problems if I later add new Board-dependent classes with different "desires"?).
The quote is really warning you away from data structures that don't do anything with the data they hold. So your Board class in the first approach might be done away with entirely and replaced by a generic collection.
Regardless, the Single Responsibility Principle still applies, so you need to treat the second approach with caution.
What I would do is invoke YAGNI (you aren't gonna need it) and try to see how far I could go using a generic collection rather than a Board class. If you find that later you do need the Board class its responsibility will likely be much more clear by then.
Let me offer the contrarian point of view. I think the second approach has legs. I agree with the single responsibility principle, but it seems to me that there's a defensible single mission/concern for a Board class: Maintaining the playing field.
I can imagine a very reasonable set of methods such as getSize(), getPiece(x, y), setPiece(x, y, color), removePiece(x, y), movePiece(x1, y1, x2, y2), clear(), countPieces(color), listPiecePositions(color), read(filename), write(filename), etc. that share a cogent and clear mission. Handling those board-management concerns in an abstracted way would allow other classes to implement game logic more cleanly, and would let either Board or Game be more readily extended in the future.
YAGNI is all well and good, but my understanding is that it urges you to not start building beautiful edifices with the hope that one day they'll be usefully occupied. For example, I wouldn't spend any time working toward the future possibility of an infinite playing surface, a 3D playing surface, or a playing surface that can be embedded onto a sphere. If I wanted to take YAGNI very seriously, I wouldn't write even straightforward Board methods until they were needed.
But that doesn't mean I would discard Board as a conceptual organization or possible class. And it certainly doesn't mean that I wouldn't put any thought at all into how to separate concerns in my program. At least YAGNI in my world doesn't require you start with the lowest-level data structures, little or nothing by way of encapsulation, and a completely procedural approach.
I disagree with the notion that the first approach is more general (in any useful way), or with what appears to be the consensus that one should "just see how far you can get without abstracting anything." Honestly, that sounds like how we solved eight queens. In 1983. In Pascal.
YAGNI is a great guiding principle that helps avoid a lot of second-system effect and similar bottom-up, we-can-do-it-so-we-should mistakes. But YAGNI that's crossed the Agile Practice Stupidity Threshold is not a virtue.
CurtainDog is right: invoke YAGNI and figure out what you actually need right now, implement that, then make sure it won't prevent any features that may be desirable in the future.
The second approach violates the principle that a superclass should not know about its subclasses. I think the element you're missing is that the base class can define template methods, like getBoardSize, countRedPieces, and countBlackPieces, that subclasses override; the superclass then has code that uses those template methods, thereby telling its subclasses what to do, but not how to do it.
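A compact sketch of that idea (the wiring is my assumption; names are from the thread):

abstract class BoardBase
{
    // Template methods: subclasses say *how*...
    protected abstract int CountRedPieces();
    protected abstract int CountBlackPieces();

    // ...while the superclass says *what* and *when*, knowing no subclass.
    public bool IsBalanced()
    {
        return CountRedPieces() == CountBlackPieces();
    }
}

class HashtableBoard : BoardBase
{
    protected override int CountRedPieces() { return 0; /* sparse lookup */ }
    protected override int CountBlackPieces() { return 0; }
}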
I've just coded a 700-line class. Awful. I hang my head in shame. It's as opposite to DRY as a British summer.
It's full of cut-and-paste with minor tweaks here and there. This makes it a prime candidate for refactoring. Before I embark on this, I thought I'd ask: when you have lots of repetition, what are the first refactoring opportunities you look for?
For the record, mine are probably using:
Generic classes and methods
Method overloading/chaining.
What are yours?
I like to start refactoring when I need to, rather than at the first opportunity I get. You might call this a somewhat agile approach to refactoring. When do I feel I need to? Usually when I feel the ugly parts of my code are starting to spread. Ugliness is okay as long as it's contained, but the moment it starts having the urge to spread, that's when you need to take care of business.
The techniques you use for refactoring should start with the simplest; I would strongly recommend Martin Fowler's book. Combining common code into functions, removing unneeded variables, and other simple techniques get you a lot of mileage. For list operations, I prefer functional programming idioms. That is to say, I use internal iterators, map, filter and reduce (in Python-speak; there are corresponding things in Ruby, Lisp and Haskell) whenever I can; this makes code a lot shorter and more self-contained.
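For what it's worth, the same internal-iteration style is available in C# through LINQ; a rough map/filter/reduce equivalent:

using System.Linq;

int[] values = { 1, 2, 3, 4, 5 };

int sumOfSquaredEvens = values
    .Where(v => v % 2 == 0)             // filter
    .Select(v => v * v)                 // map
    .Aggregate(0, (acc, v) => acc + v); // reduce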
#region
I made a 1,000-line class just one line with it!
In all seriousness, the best way to avoid repetition is to do the things covered in your list, as well as to fully utilize polymorphism: examine your class and discover what would best be done in a base class, and how different components of it could be broken away as subclasses.
Sometimes, by the time you "complete functionality" using copy-and-paste code, it has become maimed and mangled enough that any attempt at refactoring will take much, much longer than refactoring at the point where the duplication first became obvious.
In my personal experience, my favorite way of removing repetition has been the "Extract Method" functionality of ReSharper (although this is also available in vanilla Visual Studio).
Many times I see repeated code (in some legacy app I'm maintaining) not as whole methods but as chunks within completely separate methods. That gives a perfect opportunity to turn those chunks into methods of their own.
Monster classes also tend to reveal that they contain more than one responsibility. That in turn becomes an opportunity to separate each distinct piece of functionality into its own (hopefully smaller) class.
I have to reiterate that doing all of this is not a pleasurable experience (for me), so I really would rather do it right while it's a small ball of mud than let the big ball of mud roll and then try to fix that.
First of all, I would recommend refactoring much sooner than when you are done with the first version of the class. Any time you see duplication, eliminate it ASAP. This may take a little longer initially, but I think the results end up a lot cleaner, and it helps you rethink your code as you go to ensure you are doing things right.
As for my favorite way of removing duplication: closures, especially in my favorite language (Ruby). They tend to be a really concise way of taking two pieces of code and merging the similarities. Of course (like any "best practice" or tip) this can't be applied blindly; I just find them really fun to use when I can.
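The same trick carries over to C# with lambdas. A hedged sketch of merging two near-identical routines (ImportFiles/ExportFiles are hypothetical):

using System;

static class Wrap
{
    // The varying middle step is passed in as a closure; the duplicated
    // setup/teardown lives in exactly one place.
    public static void WithLogging(string name, Action body)
    {
        Console.WriteLine($"start {name}");
        body();
        Console.WriteLine($"end {name}");
    }
}

// Usage (hypothetical call sites):
// Wrap.WithLogging("import", () => ImportFiles());
// Wrap.WithLogging("export", () => ExportFiles());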
One of the things I do is try to make methods small and simple enough that I can see each one on a single page in my editor (Visual Studio).
I've learnt from experience that making code simple makes it easier for the compiler to optimise it. The larger the method, the harder the compiler has to work!
I've also recently seen a problem where large methods have caused a memory leak. Basically I had a loop very much like the following:
while (true)
{
    var smallObject = WaitForSomethingToTurnUp();
    var largeObject = DoSomethingWithSmallObject(smallObject);
}
I found that my application was keeping a large amount of data in memory because, even though largeObject was no longer needed once the loop went back to waiting, the garbage collector could still see it as reachable through the method's local variables.
I easily solved this by moving 'DoSomethingWithSmallObject()' and the other associated code into another method.
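Roughly, the fix looked like this (SmallObject and ProcessItem are made-up names beyond what's in the snippet above):

while (true)
{
    var smallObject = WaitForSomethingToTurnUp();
    ProcessItem(smallObject);
}

// Extracted method: largeObject's lifetime now ends when this returns,
// so nothing keeps it reachable while the loop blocks waiting.
void ProcessItem(SmallObject smallObject)
{
    var largeObject = DoSomethingWithSmallObject(smallObject);
    // ... other associated code ...
}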
Also, if you make small methods, your reuse within a class will become significantly higher. I generally try to make sure that none of my methods look like any others!
Hope this helps.
Nick
"cut and paste with minor tweaks here and there" is the kind of code repetition I usually solve with an entirely non-exotic approach- Take the similar chunk of code, extract it out to a seperate method. The little bit that is different in every instance of that block of code, change that to a parameter.
There are also some easy techniques for removing repetitive-looking if/else-if and switch blocks, courtesy of Scott Hanselman:
http://www.hanselman.com/blog/CategoryView.aspx?category=Source+Code&page=2
I might go something like this:
Create custom (private) types for data structures and put all the related logic there: Dictionary<string, List<int>> and the like (see the sketch after this list).
Make inner functions or properties that guarantee behaviour. If you're continually checking conditions on a publicly accessible property, create a private getter method with all of the checking baked in.
Split apart methods that have too much going on. If you can't describe a method succinctly or give it a good name, start breaking the function apart until you can (even if these "child" functions aren't used anywhere else).
If all else fails, slap a [SuppressMessage("Microsoft.Maintainability", "CA1502:AvoidExcessiveComplexity")] on it and comment why.
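For the first point, a sketch of what wrapping a Dictionary<string, List<int>> in a private type might look like (all names invented):

using System.Collections.Generic;
using System.Linq;

// Private wrapper: the raw Dictionary<string, List<int>> never leaks out,
// and all the related logic has one home.
class ScoresByPlayer
{
    private readonly Dictionary<string, List<int>> scores =
        new Dictionary<string, List<int>>();

    public void Add(string player, int score)
    {
        if (!scores.TryGetValue(player, out var list))
            scores[player] = list = new List<int>();
        list.Add(score);
    }

    public int Total(string player)
    {
        return scores.TryGetValue(player, out var list) ? list.Sum() : 0;
    }
}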