High Performance Cloning

I'm after a means of deep cloning an object graph in a performant way. I'm going to have multiple threads cloning a graph extremely quickly such that they can play with some state and throw away the results if they're not interesting, returning to the original to try again.
I'm currently using a deep clone via binary serialization, which although it works, isn't amazingly fast. I've seen other libraries like protobuf, but the classes in my object graph may be defined in external assemblies, inheriting from classes in the main assembly, and I don't wish to add any complexity in those consuming assemblies if possible.
One of the interesting things I did come across was cloning using automatically generated IL. It seems it's not quite finished and I've posted to see if the author has done any more on it, but I'm guessing not. Has anyone else developed or seen a more fully functional way of deep cloning via IL? Or another method that is going to be fast?

Beyond plain serialisation, I'd only consider three options:
Stick with serialisation, but customise it. This might be useful if you want to declaratively exclude members, and there are very likely performance gains to be had.
Reflection-based object walking, in conjunction with an IL emitter such as Fasterflect.
Code-gen or code your own cloning by literally assigning properties to each other (we have some old code that uses what we call a copy-constructor for this, takes an instance of itself and manually copies the properties / fields across).
We have some instances of code where we control the binary serialisation so that we can serialise an interned GUID table (we have lots of repeating GUIDs and serialise very large lists over .NET Remoting). It works well for us and we haven't needed a third party serialisation framework, however, it's hand-crafted stuff with a little code-gen.
The CSLA.NET framework features a class called UndoableBase that uses reflection to serialise a Hashtable of property/field values. Used for allowing rollbacks on objects in memory. This might fit with your "returning to original to try again" sentence.
Personally I'd look further into a reflection-based solution (preferably with emitted IL for better performance), as this allows you to take advantage of class/member attributes for control over the cloning process. If performance is king, though, this may not cut it.
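To make the third option (the hand-written "copy-constructor") concrete, here is a minimal sketch. The Order and Item types are hypothetical and exist only for illustration; the approach as written assumes the graph has no cycles or shared references, which a real cloner would need to handle.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types, used only to illustrate the copy-constructor option.
public sealed class Item
{
    public string Name;
    public int Quantity;

    public Item(string name, int quantity) { Name = name; Quantity = quantity; }

    // Copy constructor: strings are immutable, so sharing the reference is safe.
    public Item(Item other) : this(other.Name, other.Quantity) { }
}

public sealed class Order
{
    public Guid Id;
    public List<Item> Items = new List<Item>();

    public Order(Guid id) { Id = id; }

    // Copy constructor: builds a new list and copies each element,
    // so the clone shares no mutable state with the original.
    public Order(Order other)
    {
        Id = other.Id;
        foreach (Item item in other.Items)
        {
            Items.Add(new Item(item));
        }
    }
}
```

Each worker thread can then call new Order(original), mutate the copy freely and discard it. The cost is that every new field has to be added to the copy constructor by hand, which is where code-gen or emitted IL starts to pay off.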

Related

Tracking Changes done to a Class without using Entity Framework

I am in need of tracking any changes done to a complex model (a very complex model must I say with all kinds of relationships). Once I have identified these changes, I must save them into a separate table, in order to be approved by an administrator at a later stage.
I've tried using the change tracker of Entity Framework and have even tried to customize it but it has just been giving me problem after problem.
What do you suggest I could use in order to track these changes, which does not involve Entity Framework?
UPDATE: I ended up solving this by creating my own custom checker. Took more time but in the end it was more worth it as I had total control over the changes.
Thanks for your opinions,
Steve :)

Sorry for not providing a code example. As commented, this is more of an idea (too broad for this Exchange), but it is a high-level approach that I have used before. Back when "reflection" was highly frowned upon we called it "meta data", but it essentially employed reflection, and for that reason it is known today as meta programming.
Your problem is a lovely use case for meta programming. Reflection used to be very slow in the "80's", but only due to low memory and restricted CPU.
Serialisers, such as those for JSON or the infamously slow (but not anymore) XML, use reflection.
Dependency Injection is the mother of meta programming.
Helpers like AutoMapper are mostly reflection too.
Today it has been highly optimised and works extremely well thanks to excellent computational power. As long as you do not write hacky code or try to optimise it further, you will be OK. You should trust the framework and compilers for that.
You can do some fancy things such as intercepting changes, but that can get quite complex. To keep it a bit simpler, all you have to do is follow a bit of DDD.
Your classes should only allow changes via the properties you expose. Each setter or operation that mutates the state can then pass the change to your lovely state tracking code.
In .NET 4.5 reflection is really fast, and meta programming is already used in Dependency Injection all over the place.
To remember changes, use an optimised collection such as a Dictionary or HashSet, depending on your needs. Use the type obtained from GetType as the key; the value can be the new value, or a class that holds metadata such as Old Value, New Value, Version (for rolling back), etc.
Once you get that going in your class, you then move all the logic into a singleton and define some generic methods that you will reuse on all your "entities".
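A rough sketch of the idea, under some assumptions: TrackedEntity, Change and Customer are hypothetical names, and the dictionary here is keyed by property name per instance rather than by type, which is one possible reading of the description above. Every setter funnels through a single helper that records the old value, the new value and a version counter.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record of one pending change.
public sealed class Change
{
    public object OldValue;
    public object NewValue;
    public int Version;
}

// Hypothetical base class: every setter funnels through SetValue.
public abstract class TrackedEntity
{
    private readonly Dictionary<string, Change> _changes = new Dictionary<string, Change>();

    public IReadOnlyDictionary<string, Change> Changes { get { return _changes; } }

    protected void SetValue<T>(ref T field, T value, string propertyName)
    {
        // Ignore assignments that do not change anything.
        if (EqualityComparer<T>.Default.Equals(field, value)) return;

        Change change;
        if (_changes.TryGetValue(propertyName, out change))
        {
            // Keep the first old value so the change can be rolled back.
            change.NewValue = value;
            change.Version++;
        }
        else
        {
            _changes[propertyName] = new Change { OldValue = field, NewValue = value, Version = 1 };
        }
        field = value;
    }
}

public sealed class Customer : TrackedEntity
{
    private string _name;

    public string Name
    {
        get { return _name; }
        set { SetValue(ref _name, value, nameof(Name)); }
    }
}
```

The Changes dictionary is what would be written to the separate approval table; applying or rejecting a change is then a matter of replaying NewValue or OldValue.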

Why we need Reflection at all?

I was studying Reflection, I got some of it but I am not getting everything related to this concept. Why do we need Reflection? What things we couldn't achieve that we need Reflection?

There are many, many scenarios that reflection enables, but I group them primarily into two buckets.
Reflection enables us to write code that analyzes other code.
Consider for example the most basic question about an assembly: what types are in it? Assemblies are self-describing and reflection is the mechanism by which that description is surfaced to other code.
Suppose for example you wanted to write a program which took an assembly and did a graphical display of the relationships between the various classes in that assembly, to help you understand that code. There are such tools. They're in Visual Studio. Someone wrote those tools. They did not appear by magic. Reflection is the mechanism designed into the .NET framework that enables you or me or anyone else to write tools that understand code.
Reflection enables us to move compile time bindings to runtime.
Suppose you have a static method Foo.Bar(). When you put a call to Foo.Bar() in your program, you know with 100% certainty that the method you think is going to be called is actually going to be called. We call static methods "static" because the binding from the name Bar to the code that gets called can be understood statically -- that is, without running the program.
Now consider a virtual method Blah() on a base class. When you call whatever.Blah() you don't know exactly which Blah() will be called at compile time, but you know that some method Blah() with no arguments will be called on some type that is the runtime type of whatever, and that type is equal to or derived from the type which declares Blah(). (In fact you know more: you know that it is equal to or derived from the compile time type of whatever.) Virtual binding is a form of dynamic binding, but it is not fully dynamic. There's no way for the user to decide that this call should be to a different method on a different type hierarchy.
Reflection enables us to make calls that are bound entirely at runtime, based entirely on user choices if we like. We pay a performance penalty, and we lose compile-time type safety, but we gain the flexibility to decide 100% at runtime what code we call. There are scenarios where that's a reasonable tradeoff.
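Both buckets can be illustrated in a few lines. This is only a sketch: Greeter and LateBinding are hypothetical names, and the method to call is passed in as a string to stand in for "a user choice made at runtime".

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical target type for the demonstration.
public static class Greeter
{
    public static string Hello() { return "hello"; }
    public static string Goodbye() { return "goodbye"; }
}

public static class LateBinding
{
    // Bucket one: ask an assembly which types it contains.
    public static string[] TypeNames(Assembly assembly)
    {
        return assembly.GetTypes().Select(t => t.FullName).ToArray();
    }

    // Bucket two: the method name is only known at runtime,
    // so the binding happens here rather than in the compiler.
    public static object InvokeByName(Type type, string methodName)
    {
        MethodInfo method = type.GetMethod(
            methodName,
            BindingFlags.Public | BindingFlags.Static,
            null,
            Type.EmptyTypes,
            null);
        if (method == null)
        {
            throw new MissingMethodException(type.FullName, methodName);
        }
        return method.Invoke(null, null);
    }
}
```

Note what was given up: a misspelled method name is now a runtime exception rather than a compile error, and the result comes back as object.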

Reflection is such a deep part of the .NET framework that you often don't know that you're doing it (see Attributes and LINQ for instance). And when you do know you're doing it, even if it feels wrong, it might be the only way to achieve a particular objective.
Apart from the two broad areas that Eric mentioned here are a few others. There are lots more, these are just some that come to mind immediately.
Serialization (and similar)
Whether you're using XML or JSON or rolling your own, serializing objects is much easier when you don't have to write specific code for each class to enable serialization. Reflection enables you to enumerate the properties in your object that have been flagged for (or not flagged against) serialization and write them to the output.
This isn't about saving state though. Reflection allows us to write generic methods that can produce business output too, like CSV or XLSX files from an arbitrary collection. I get a lot of mileage out of my ToCSV(...) and ToExcel(...) extensions for things like producing downloadable versions of data sets on my web-based reporting.
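As an illustration of that kind of generic output method, here is a minimal sketch of a reflection-driven CSV writer. It is not the ToCSV(...) extension mentioned above, whose code isn't shown; the ToCsv name and its behaviour are assumptions, and it deliberately skips quoting, escaping and culture handling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using System.Text;

public static class CsvExtensions
{
    // Hypothetical helper: a header row built from public property names,
    // then one row per item. Assumes T has no indexer properties.
    public static string ToCsv<T>(this IEnumerable<T> items)
    {
        PropertyInfo[] props = typeof(T).GetProperties(BindingFlags.Public | BindingFlags.Instance);
        var sb = new StringBuilder();
        sb.AppendLine(string.Join(",", props.Select(p => p.Name)));
        foreach (T item in items)
        {
            // Convert.ToString turns a null property value into an empty cell.
            sb.AppendLine(string.Join(",", props.Select(p => Convert.ToString(p.GetValue(item, null)))));
        }
        return sb.ToString();
    }
}
```

The same method works for any class without that class knowing anything about CSV, which is the point being made about reflection.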
Accessing Hidden Data
Yes, I know, this is a dodgy one. And yeah, Eric is probably going to slap me for this, but...
There's a lot of code out there - I'm looking at you, ASP.NET - that hides interesting and useful stuff behind private or protected. Sometimes the only way to get them out is to use reflection. Sometimes it's not the only way, but it can be the simpler way.
Attributes
Every time you tag an Attribute onto one of your classes, methods, etc. you are implicitly providing data that is going to be accessed through reflection. Want to use those attributes yourself? Reflection is the only way you can get at them.
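A small sketch of reading your own attribute back. DisplayLabelAttribute, Person and Labels are hypothetical names invented for the example; Attribute.GetCustomAttribute is the standard framework call.

```csharp
using System;
using System.Reflection;

// Hypothetical attribute for illustration.
[AttributeUsage(AttributeTargets.Property)]
public sealed class DisplayLabelAttribute : Attribute
{
    public string Text { get; private set; }

    public DisplayLabelAttribute(string text) { Text = text; }
}

public sealed class Person
{
    [DisplayLabel("Full name")]
    public string Name { get; set; }

    public int Age { get; set; }
}

public static class Labels
{
    // Returns the label if the attribute is present, otherwise the property name.
    public static string For(PropertyInfo property)
    {
        var attr = (DisplayLabelAttribute)Attribute.GetCustomAttribute(property, typeof(DisplayLabelAttribute));
        return attr != null ? attr.Text : property.Name;
    }
}
```

Labels.For(typeof(Person).GetProperty("Name")) yields the attribute text, while the untagged Age property falls back to its own name.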
LINQ and Other Expressions
This is really important stuff these days. If you've ever used LINQ to SQL, Entity Framework, etc. then you've used Expression in some way. You write a simple little POCO to represent a row in your database table and everything else gets handled by reflection. When you write a predicate expression the system is using the reflection model to build structures that are then processed (visited) to build an SQL statement.
Expressions aren't just for LINQ either; you can do some really interesting things yourself, once you know what you're doing. I have code to generate line parsers for CSV import that run pretty damn quickly when compiled to Func<string, TRecord>. These days I tend to use a mapper somebody else wrote, but at the time I needed to slice a few more % off the total import time for a file with 20K records that was uploaded to a website periodically.
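The original parser generator isn't shown, so the following is only a sketch of the general technique: reflect over the record type once, build an expression tree that splits a line and assigns each cell to a property, and compile it to a Func<string, TRecord>. ParserBuilder and Reading are hypothetical names, and the sketch assumes comma-separated cells with no quoting and property types that Convert.ChangeType can handle.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;
using System.Reflection;

// Hypothetical record type.
public sealed class Reading
{
    public string Sensor { get; set; }
    public int Value { get; set; }
}

public static class ParserBuilder
{
    // columns lists property names in the order they appear on each line.
    // An unknown property name will fail here, at build time.
    public static Func<string, TRecord> Build<TRecord>(params string[] columns) where TRecord : new()
    {
        ParameterExpression line = Expression.Parameter(typeof(string), "line");
        ParameterExpression parts = Expression.Variable(typeof(string[]), "parts");

        MethodInfo split = typeof(string).GetMethod("Split", new[] { typeof(char[]) });
        MethodInfo changeType = typeof(Convert).GetMethod("ChangeType", new[] { typeof(object), typeof(Type) });

        var bindings = new List<MemberBinding>();
        for (int i = 0; i < columns.Length; i++)
        {
            PropertyInfo prop = typeof(TRecord).GetProperty(columns[i]);
            Expression cell = Expression.ArrayIndex(parts, Expression.Constant(i));
            // (PropertyType)Convert.ChangeType(parts[i], typeof(PropertyType))
            Expression converted = Expression.Convert(
                Expression.Call(changeType, cell, Expression.Constant(prop.PropertyType, typeof(Type))),
                prop.PropertyType);
            bindings.Add(Expression.Bind(prop, converted));
        }

        // parts = line.Split(','); return new TRecord { ... };
        Expression body = Expression.Block(
            new[] { parts },
            Expression.Assign(parts, Expression.Call(line, split, Expression.Constant(new[] { ',' }))),
            Expression.MemberInit(Expression.New(typeof(TRecord)), bindings));

        return Expression.Lambda<Func<string, TRecord>>(body, line).Compile();
    }
}
```

The reflection cost is paid once in Build; the returned delegate is ordinary compiled code, so calling it per line avoids repeated property lookups.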
P/Invoke Marshalling
This one is a big deal behind the scenes and occasionally in the foreground too. When you want to call a Windows API function or use a native DLL, P/Invoke gives you ways to achieve this without having to mess about with building memory buffers in both directions. The marshalling methods use reflection to do translation of certain things - strings and so on being the obvious example - so that you don't have to get your hands dirty. All based on the Type object that is the foundation of reflection.
Fact is, without reflection the .NET framework wouldn't be what it is. No Attributes, no Expressions, probably a lot less interop between the languages. No automatic marshalling. No LINQ... at least in the way we often use it now.

Why HashSet<T> class is not used to implement Enumerable.Distinct

I needed to assess the asymptotic time and space complexity of IEnumerable.Distinct in big O notation.
So I was looking at the implementation of the extension method Enumerable.Distinct and I see it is implemented using an internal class Set<T>, which is almost a classical implementation of a hash table with "open addressing".
What quickly catches the eye is that a lot of code in Set<T> is just a copy-paste from HashSet<T>, with some omissions
However, this simplified Set<T> implementation has some obvious flaws, for example the Resize method not using prime numbers for the size of the slots, like HashSet<T> does, see HashHelpers.ExpandPrime
So, my questions are:
What is the reason for code duplication here, why not stick with DRY principle? Especially given the fact that both of these classes are in the same assembly System.Core
It looks like HashSet<T> will perform better, so should I avoid using Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>?

which is almost a classical implementation of a hash table with "open addressing"
Look again. It's separate chaining with list head cells. While the slots are all in an array, finding the next slot in the case of collision is done by examining the next field of the current slot. This has better cache efficiency than using linked lists with each node as a separate heap object, though not as good as open addressing in that regard. At the same time, it avoids some of the cases where open addressing does poorly.
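To show what "separate chaining with list head cells" looks like in code, here is a simplified sketch. It is not the framework source: ChainedSet is a hypothetical name, removal is omitted, and the growth policy (double plus one, no primes) is chosen only to mirror the behaviour described in the question.

```csharp
using System;
using System.Collections.Generic;

public sealed class ChainedSet<T>
{
    private struct Slot
    {
        public int HashCode;
        public T Value;
        public int Next;   // index of the next slot in the same chain, or -1
    }

    private int[] _buckets = new int[7];   // holds slot index + 1; 0 means empty
    private Slot[] _slots = new Slot[7];   // all entries live in one array
    private int _count;
    private readonly IEqualityComparer<T> _comparer = EqualityComparer<T>.Default;

    public int Count { get { return _count; } }

    public bool Contains(T value)
    {
        int hash = Hash(value);
        // Follow the chain through the Next fields rather than probing neighbours.
        for (int i = _buckets[hash % _buckets.Length] - 1; i >= 0; i = _slots[i].Next)
        {
            if (_slots[i].HashCode == hash && _comparer.Equals(_slots[i].Value, value)) return true;
        }
        return false;
    }

    public bool Add(T value)
    {
        if (Contains(value)) return false;
        if (_count == _slots.Length) Resize();

        int hash = Hash(value);
        int bucket = hash % _buckets.Length;
        _slots[_count].HashCode = hash;
        _slots[_count].Value = value;
        _slots[_count].Next = _buckets[bucket] - 1;   // link to the previous chain head
        _buckets[bucket] = _count + 1;                // the new slot becomes the head
        _count++;
        return true;
    }

    private int Hash(T value)
    {
        return value == null ? 0 : (_comparer.GetHashCode(value) & 0x7FFFFFFF);
    }

    private void Resize()
    {
        int newSize = checked(_count * 2 + 1);
        var newBuckets = new int[newSize];
        var newSlots = new Slot[newSize];
        Array.Copy(_slots, newSlots, _count);
        // Rebuild every chain against the new bucket count.
        for (int i = 0; i < _count; i++)
        {
            int bucket = newSlots[i].HashCode % newSize;
            newSlots[i].Next = newBuckets[bucket] - 1;
            newBuckets[bucket] = i + 1;
        }
        _buckets = newBuckets;
        _slots = newSlots;
    }
}
```

A collision never moves an entry to a neighbouring bucket, which is what distinguishes this from open addressing; the chain lives in the Next indices of a contiguous array instead of in separately allocated nodes.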
a lot of code in Set is just a copy-paste from HashSet, with some omissions
AFAICT the reason a private implementation of a hash-set was used is that Enumerable and HashSet were developed independently at about the same time. That's just conjecture on my part, but they were both introduced with .NET 3.5 so it's feasible.
It's quite possible that HashSet<T> started by copying Set<T> and then making it better serve being exposed publicly, though it's also possible that the two were both based on the same principle of separate chaining with list head cells
In terms of performance, HashSet's use of prime numbers means it's more likely to avoid collisions with poor hashes (but just how much of an advantage that is, is not a simple question), but Set is lighter in a lot of ways, especially in .NET Core where some things it doesn't need were removed. In particular, that version of Set takes advantage of the fact that once an item is removed (which happens, for example, during Intersect) there will never be an item added, which allows it to leave out freelist and any work related to it, which HashSet couldn't do. Even the initial implementation is lighter in not tracking a version to catch changes during enumeration, which is a small cost, but a cost to every addition and removal nevertheless.
As such, with different sets of data with different distributions of hash codes sometimes one performs better, sometimes the other.
Especially given the fact that both of these classes are in the same assembly System.Core
Only in some versions of .NET, in some they're in separate assemblies. In .NET Core we had two versions of Set<T>, one in the assembly that has System.Linq and one in the separate assembly that has System.Linq.Expressions. The former got trimmed down as described above, the latter replaced with a use of HashSet<T> as it was doing less there.
Of course System.Core came first, but the fact that those elements could be separated out at all speaks of System.Core not being a single monolithic blob of inter-dependencies.
That there is now a ToHashSet() method in .NET Core's version of Linq makes the possibility of replacing Set<T> with HashSet<T> more justifiable, though not a no-brainer. I think #james-ko was considering testing the benefits of doing that.
It looks like HashSet<T> will perform better
For the reasons explained above, that might not be the case, though it might indeed, depending on source data. That's before getting into considerations of optimisations that go across a few different linq methods (not many in the initial versions of linq, but a good few in .NET Core).
so should I avoid using Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>.
Use Distinct(). If you've a bottle neck then it might be that HashSet<T> will win with a given data-set, but if you do try that make sure your profiling closely matches real values your code will encounter in real life. There's no point deciding one approach is the faster based on some arbitrary tests if your application hits cases where the other does better. (And if I was finding this a problem spot, I'd take a look at whether the GetHashCode() of the types in question could be improved for either speed or distribution of bits, first).
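If you do want to profile the alternative, a HashSet<T>-based version is short. This is a sketch for comparison only: DistinctViaHashSet is a hypothetical name, and the advice above to prefer Distinct() still stands unless measurements on realistic data say otherwise.

```csharp
using System;
using System.Collections.Generic;

public static class DistinctExtensions
{
    // Hypothetical alternative to Enumerable.Distinct(), meant for profiling.
    public static IEnumerable<T> DistinctViaHashSet<T>(this IEnumerable<T> source, IEqualityComparer<T> comparer = null)
    {
        // Validate eagerly; the iterator below only runs when enumerated.
        if (source == null) throw new ArgumentNullException("source");
        return Iterator(source, comparer);
    }

    private static IEnumerable<T> Iterator<T>(IEnumerable<T> source, IEqualityComparer<T> comparer)
    {
        // A null comparer makes HashSet<T> fall back to the default comparer.
        var seen = new HashSet<T>(comparer);
        foreach (T item in source)
        {
            // Add returns false for items already seen, so only first occurrences are yielded.
            if (seen.Add(item)) yield return item;
        }
    }
}
```

Like Distinct(), this streams lazily and uses O(n) extra space in the worst case, with expected O(n) time given a reasonable GetHashCode().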

Should ConditionalWeakTable<TKey, TValue> be used for non-compiler purposes?

I've recently come across the ConditionalWeakTable<TKey,TValue> class in my search for an IDictionary which uses weak references, as suggested in answers here and here.
There is a definitive MSDN article which introduced the class and which states:
You can find the class ... in the System.Runtime.CompilerServices namespace. It’s in CompilerServices because it’s not a general-purpose dictionary type: we intend for it to only be used by compiler writers.
and later again:
...the conditional weak table is not intended to be a general purpose collection... But if you’re writing a .NET language of your own and need to expose the ability to attach properties to objects you should definitely look into the Conditional Weak Table.
In line with this, the MSDN entry description of the class reads:
Enables compilers to dynamically attach object fields to managed objects.
So obviously it was originally created for a very specific purpose - to help the DLR, and the System.Runtime.CompilerServices namespace embodies this. But it seems to have found a much wider use than that - even within the CLR. If I search for references of ConditionalWeakTable in ILSpy, for example, I can see that it is used in the MEF class CatalogExportProvider and in the internal WPF DataGridHelper class, amongst others.
My question is whether it is okay to use ConditionalWeakTable outside of compiler writing and language tools, and whether there is any risk in doing so in terms of incurring additional overhead or of the implementation changing significantly in future .NET versions. (Or should it be avoided and a custom implementation like this one be used instead).
There is also further reading here, here and here about how the ConditionalWeakTable makes use of a hidden CLR implementation of ephemerons (via System.Runtime.CompilerServices.DependentHandle) to deal with the problem of cycles between keys and values, and how this cannot easily be accomplished in a custom manner.

I don't see anything wrong with using ConditionalWeakTable. If you need ephemerons, you pretty much have no other choice.
I don't think future .NET versions will be a problem - even if only compilers would use this class, Microsoft still couldn't change it without breaking compatibility with existing binaries.
As for overhead - there certainly will be overhead compared to a normal Dictionary. Having many DependentHandles probably will be expensive similarly to how many WeakReferences are more expensive than normal references (the GC has to do additional work to scan them to see if they need to be nulled out). But that's not a problem unless you have lots (several million) of entries.
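For reference, a typical non-compiler use is attaching extra data to objects you don't own. A minimal sketch, where Tag and Tags are hypothetical names and GetValue is the standard ConditionalWeakTable method:

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical "attached property" holder.
public sealed class Tag
{
    public string Note;
}

public static class Tags
{
    // Keys are compared by reference and held weakly: an entry
    // does not keep its key alive.
    private static readonly ConditionalWeakTable<object, Tag> Table =
        new ConditionalWeakTable<object, Tag>();

    // Returns the tag attached to target, creating it on first use.
    public static Tag For(object target)
    {
        return Table.GetValue(target, _ => new Tag());
    }
}
```

Because the value is kept alive only as long as the key is, a Tag that references its own key does not create a leak, which is the ephemeron behaviour an ordinary dictionary of WeakReference objects can't reproduce.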

Code Generators or T4 Templates, are they really evil?

I have heard people state that Code Generators and T4 templates should not be used. The logic behind that is that if you are generating code with a generator then there is a better more efficient way to build the code through generics and templating.
While I slightly agree with the statement above, I have not really found effective ways to build templates that can, say, instantiate themselves. In other words I can never do:
return new T();
Additionally, if I want to generate code based on database values I have found that using Microsoft.SqlServer.Management.SMO in conjunction with T4 templates has been wonderful at generating mass amounts of code without having to copy / paste or use ReSharper.
Another problem I have found with Generics is that, to my shock, there are a lot of developers who do not understand them. When I do examine generics for a solution, there are times where it gets complicated because C# states that you cannot do something that may seem logical in my mind.
What are your thoughts? Do you prefer to build a generator, or do you prefer to use generics? Also, how far can generics go? I know a decent amount about generics, but there are traps and pitfalls that I always run into that cause me to resort to a T4 template.
What is the more proper way to handle scenarios where you need a large amount of flexibility? Oh and as a bonus to this question, what are good resources on C# and Generics?

You can do new T(); if you do this
public class Meh<T>
    where T : new()
{
    public static T CreateOne()
    {
        return new T();
    }
}
As for code-generators. I use one every day without any problems. I'm using one right now in fact :-)
Generics solve one problem, code-generators solve another. For example, creating a business model using a UML editor and then generating your classes with persistence code as I do all of the time using this tool couldn't be achieved with generics, because each persistent class is completely different.
As for a good source on generics. The best has got to be Jon Skeet's book of course! :-)

As the originator of T4, I've had to defend this question quite a few times as you can imagine :-)
My belief is that at its best code generation is a step on the way to producing equivalent value using reusable libraries.
As many others have said, the key concept to maintain DRY is never, ever changing generated code manually, but rather preserving your ability to regenerate when the source metadata changes or you find a bug in the code generator. At that point the generated code has many of the characteristics of object code and you don't run into copy/paste type problems.
In general, it's much less effort to produce a parameterized code generator (especially with template-based systems) than it is to correctly engineer a high quality base library that gets the usage cost down to the same level, so it's a quick way to get value from consistency and remove repetition errors.
However, I still believe that the finished system would most often be improved by having less total code. If nothing else, its memory footprint would almost always be significantly smaller (although folks tend to think of generics as cost free in this regard, which they most certainly are not).
If you've realised some value using a code generator, then this often buys you some time or money or goodwill to invest in harvesting a library from the generated codebase. You can then incrementally reengineer the code generator to target the new library and hopefully generate much less code. Rinse and repeat.
One interesting counterpoint that has been made to me and that comes up in this thread is that rich, complex, parametric libraries are not the easiest thing in terms of learning curve, especially for those not deeply immersed in the platform. Sticking with code generation onto simpler basic frameworks can produce verbose code, but it can often be quite simple and easy to read.
Of course, where you have a lot of variance and extremely rich parameterization in your generator, you might just be trading off complexity in your product for complexity in your templates. This is an easy path to slide into and can make maintenance just as much of a headache - watch out for that.

Generating code isn't evil and it doesn't smell! The key is to generate the right code at the right time. I think T4 is great--I only use it occasionally, but when I do it is very helpful. To say, unconditionally, that generating code is bad is unconditionally crazy!

It seems to me code generators are fine as long as the code generation is part of your normal build process, rather than something you run once and then keep its output. I add this caveat because if you just use the code generator once and discard the data that created it, you're just automatically creating a massive DRY violation and maintenance headache; whereas generating the code every time effectively means that whatever you are using to do the generating is the real source code, and the generated files are just intermediate compile stages that you should mostly ignore.
Lex and yacc are classic examples of tools that allow you to specify functionality in an efficient manner and generate efficient code from it. Trying to do their jobs by hand will lengthen your development time and probably produce less efficient and less readable code. And while you could certainly incorporate something like lex and yacc directly into your code and do their jobs at run time instead of at compile time, that would certainly add considerable complexity to your code and slow it down. If you actually need to change your specification at run time it might be worth it, but in most normal cases using lex/yacc to generate code for you at compile time is a big win.

A good percentage of what is in Visual Studio 2010 would not be possible without code generation. Entity Framework would not be possible. The simple act of dragging and dropping a control onto a form would not be possible, nor would Linq. To say that code generation should not be used is strange as so many use it without even thinking about it.

Maybe it is a bit harsh, but for me code generation smells.
That code generation is used means that there are numerous underlying common principles which may be expressed in a "Don't repeat yourself" fashion. It may take a bit longer, but it is satisfying when you end up with classes that only contain the bits that really change, based on an infrastructure that contains the mechanics.
As to Generics...no I don't have too many issues with it. The only thing that currently doesn't work is saying that
List<Animal> a = new List<Animal>();
List<object> o = a;
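A limited form of that arrived with C# 4: generic interfaces and delegates can be variant, so IEnumerable<object> o = a; compiles, although the List<object> assignment above still does not.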
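For what it's worth, the variance that shipped in C# 4 applies to generic interfaces and delegates over reference types, not to classes such as List<T>. A short sketch of what does and does not compile (Dog and VarianceDemo are hypothetical names added for the example):

```csharp
using System.Collections.Generic;

public class Animal { }
public sealed class Dog : Animal { }

public static class VarianceDemo
{
    public static int CountAll(IEnumerable<object> items)
    {
        int n = 0;
        foreach (object item in items) n++;
        return n;
    }

    public static int Run()
    {
        List<Dog> dogs = new List<Dog> { new Dog(), new Dog() };

        // List<Animal> animals = dogs;      // does not compile: List<T> is invariant
        IEnumerable<Animal> animals = dogs;  // compiles: IEnumerable<out T> is covariant

        // IEnumerable<Animal> converts to IEnumerable<object> the same way.
        return CountAll(animals);
    }
}
```

List<T> has to stay invariant because it both returns and accepts T; if the commented-out line were allowed, adding a non-Dog Animal through the animals reference would corrupt the list.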

Code generation is for me a workaround for many problems found in language, frameworks, etc. They are not evil by themselves, I would say it is very very bad (i.e. evil) to release a language (C#) and framework which forces you to copy&paste (swap on properties, events triggering, lack of macros) or use magical numbers (wpf binding).
So, I cry, but I use them, because I have to.

I've used T4 for code generation and also Generics. Both are good, have their pros and cons, and are suited for different purposes.
In my case, I use T4 to generate Entities, DAL and BLL based on a database schema. However, DAL and BLL reference a mini-ORM I built, based on Generics and Reflection. So I think you can use them side by side, as long as you keep in control and keep it small and simple.
T4 generates static code, while my Generics-based code is dynamic: it relies on Reflection, which is said to be less performant than a "hard-coded" solution. Of course you can cache reflection results.
Regarding "return new T();", I use Dynamic Methods like this:
using System;
using System.Reflection;
using System.Reflection.Emit;

public class ObjectCreateMethod
{
    delegate object MethodInvoker();
    MethodInvoker methodHandler = null;

    public ObjectCreateMethod(Type type)
    {
        // Look up the public parameterless constructor (null if there is none).
        CreateMethod(type.GetConstructor(Type.EmptyTypes));
    }

    public ObjectCreateMethod(ConstructorInfo target)
    {
        CreateMethod(target);
    }

    void CreateMethod(ConstructorInfo target)
    {
        if (target == null)
            throw new ArgumentNullException("target", "No matching constructor was found.");

        // Emit a tiny method equivalent to "return new T();" and cache it as a delegate.
        DynamicMethod dynamic = new DynamicMethod(string.Empty,
            typeof(object),
            new Type[0],
            target.DeclaringType);
        ILGenerator il = dynamic.GetILGenerator();
        il.DeclareLocal(target.DeclaringType);
        il.Emit(OpCodes.Newobj, target);
        il.Emit(OpCodes.Stloc_0);
        il.Emit(OpCodes.Ldloc_0);
        il.Emit(OpCodes.Ret);
        methodHandler = (MethodInvoker)dynamic.CreateDelegate(typeof(MethodInvoker));
    }

    public object CreateInstance()
    {
        return methodHandler();
    }
}
Then, I call it like this:
ObjectCreateMethod _MetodoDinamico = new ObjectCreateMethod(info.PropertyType);
object _nuevaEntidad = _MetodoDinamico.CreateInstance();

More code means more complexity. More complexity means more places for bugs to hide, which means longer fix cycles, which in turn means higher costs throughout the project.
Whenever possible, I prefer to minimize the amount of code to provide equivalent functionality; ideally using dynamic (programmatic) approaches rather than code generation. Reflection, attributes, aspects and generics provide lots of options for a DRY strategy, leaving generation as a last resort.

Generics and code generation are two different things. In some cases you could use generics instead of code generation and for those I believe you should. For the other cases code generation is a powerful tool.
For all the cases where you simply need to generate code based on some data input, code generation is the way to go. The most obvious, but by no means the only example is the forms editor in Visual Studio. Here the input is the designer data and the output is the code. In this case generics is really no help at all, but it is very nice that VS simply generates the code based on the GUI layout.

Code generators could be considered a code smell that indicates a flaw or lack of functionality in the target language.
For example, while it has been said here that "Objects that persist can not be generalized", it would be better to think of it as "Objects in C# that automatically persist their data can not be generalized in C#", because I surely can in Python through the use of various methods.
The Python approach could, however, be emulated in static languages through the use of operator[ ](method_name as string), which either returns a functor or a string, depending on requirements. Unfortunately that solution is not always applicable, and returning a functor can be inconvenient.
The point I am making is that code generators indicate a flaw in a chosen language, which they address by providing a more convenient specialised syntax for the specific problem at hand.

The copy/paste type of generated code (like ORMs make) can also be very useful...
You can create your database, and then have the ORM generate a copy of that database definition expressed in your favorite language.
The advantage comes when you change your original definition (the database), press compile and the ORM (if you have a good one) can regenerate your copy of the definition. Now all references to your database can be checked by the compiler's type checker and your code will fail to compile when you're using tables or columns that do not exist anymore.
Think about this: If I call a method a few times in my code, am I not referring to the name I gave to this method originally? I keep repeating that name over and over... Language designers recognized this problem and came up with "Type-safety" as the solution. Not removing the copies (as DRY suggests we should do), but checking them for correctness instead.
The ORM generated code brings the same solution when referring to table and column names. Not removing the copies/references, but bringing the database definition into your (type-safe) language where you can refer to classes and properties instead. Together with the compiler's type checking, this solves a similar problem in a similar way: it guarantees compile-time errors instead of runtime ones when you refer to outdated or misspelled tables (classes) or columns (properties).

quote:
I have not really found effective ways to build templates that can say for instance instantiate themselves. In other words I can never do :
return new T();
public abstract class MehBase<TSelf, TParam1, TParam2>
    where TSelf : MehBase<TSelf, TParam1, TParam2>, new()
{
    public static TSelf CreateOne()
    {
        return new TSelf();
    }
}

public class Meh<TParam1, TParam2> : MehBase<Meh<TParam1, TParam2>, TParam1, TParam2>
{
    public void Proof()
    {
        Meh<TParam1, TParam2> instanceOfSelf1 = Meh<TParam1, TParam2>.CreateOne();
        Meh<int, string> instanceOfSelf2 = Meh<int, string>.CreateOne();
    }
}

Why does being able to copy/paste really, really fast, make it any more acceptable?
That's the only justification for code generation that I can see.
Even if the generator provides all the flexibility you need, you still have to learn how to use that flexibility - which is yet another layer of learning and testing required.
And even if it runs in zero time, it still bloats the code.
I rolled my own data access class. It knows everything about connections, transactions, stored procedure parms, etc, etc, and I only had to write all the ADO.NET stuff once.
It's now been so long since I had to write (or even look at) anything with a connection object in it, that I'd be hard pressed to remember the syntax offhand.

Code generation, like generics, templates, and other such shortcuts, is a powerful tool. And as with most powerful tools, it amplifies the capability of its user for good and for evil - they can't be separated.
So if you understand your code generator thoroughly, anticipate everything it will produce, and why, and intend it to do so for valid reasons, then have at it. But don't use it (or any of the other techniques) to get you past a place where you're not too sure where you're headed, or how to get there.
Some people think that, if you get your current problem solved and some behavior implemented, you're golden. It's not always obvious how much cruft and opaqueness you leave in your trail for the next developer (which might be yourself.)
