Should ConditionalWeakTable<TKey, TValue> be used for non-compiler purposes? - c#

I've recently come across the ConditionalWeakTable<TKey,TValue> class in my search for an IDictionary which uses weak references, as suggested in answers here and here.
There is a definitive MSDN article which introduced the class and which states:
You can find the class ... in the System.Runtime.CompilerServices namespace. It’s in CompilerServices because it’s not a general-purpose dictionary type: we intend for it to only be used by compiler writers.
and later again:
...the conditional weak table is not intended to be a general purpose collection... But if you’re writing a .NET language of your own and need to expose the ability to attach properties to objects you should definitely look into the Conditional Weak Table.
In line with this, the MSDN entry description of the class reads:
Enables compilers to dynamically attach object fields to managed objects.
So obviously it was originally created for a very specific purpose - to help the DLR, and the System.Runtime.CompilerServices namespace embodies this. But it seems to have found a much wider use than that - even within the CLR. If I search for references of ConditionalWeakTable in ILSpy, for example, I can see that is used in the MEF class CatalogExportProvider and in the internal WPF DataGridHelper class, amongst others.
My question is whether it is okay to use ConditionalWeakTable outside of compiler writing and language tools, and whether there is any risk in doing so in terms of incurring additional overhead or of the implementation changing significantly in future .NET versions. (Or should it be avoided and a custom implementation like this one be used instead).
There is also further reading here, here and here about how the ConditionalWeakTable makes use of a hidden CLR implementation of ephemerons (via System.Runtime.Compiler.Services. DependentHandle) to deal with the problem of cycles between keys and values, and how this cannot easily be accomplished in a custom manner.

I don't see anything wrong with using ConditionalWeakTable. If you need ephemerons, you pretty much have no other choice.
I don't think future .NET versions will be a problem - even if only compilers would use this class, Microsoft still couldn't change it without breaking compatibility with existing binaries.
As for overhead - there certainly will be overhead compared to a normal Dictionary. Having many DependentHandles probably will be expensive similarly to how many WeakReferences are more expensive than normal references (the GC has to do additional work to scan them to see if they need to be nulled out). But that's not a problem unless you have lots (several million) of entries.

Related

Why HashSet<T> class is not used to implement Enumerable.Distinct

I needed to access the asymptotic time and space complexity of the IEnumerable.Distinct in big O notation
So I was looking at the implementation of extension method Enumerable.Distinct and I see it is implemented using and internal class Set<T>, which is almost a classical implementation of a hash table with "open addressing"
What quickly catches the eye is that a lot of code in Set<T> is just a copy-paste from HashSet<T>, with some omissions
However, this simplified Set<T> implementation has some obvious flaws, for example the Resize method not using prime numbers for the size of the slots, like HashSet<T> does, see HashHelpers.ExpandPrime
So, my questions are:
What is the reason for code duplication here, why not stick with DRY principle? Especially given the fact that both of these classes are in the same assembly System.Core
It looks like HashSet<T> will perform better, so should I avoid using Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>?
which is almost a classical implementation of a hash table with "open addressing"
Look again. It's separate chaining with list head cells. While the slots are all in an array, finding the next slot in the case of collision is done by examining the next field of the current slot. This has better cache efficiency than using linked lists with each node as a separate heap object, though not as good as open addressing in that regard. At the same time, it avoids some of the cases where open addressing does poorly.
a lot of code in Set is just a copy-paste from HashSet, with some omissions
AFAICT the reason a private implementation of a hash-set was used is that Enumerable and HashSet were developed independently at about the same time. That's just conjecture on my part, but they were both introduced with .NET 3.5 so it's feasible.
It's quite possible that HashSet<T> started by copying Set<T> and then making it better serve being exposed publicly, though it's also possible that the two were both based on the same principle of separate chaining with list head cells
In terms of performance, HashSet's using prime numbers means its more likely to avoid collisions with poor hashes (but just how much an advantage that is, is not a simple question), but Set is lighter in a lot of ways, especially in .NET Core where some things it doesn't need were removed. In particular, that version of Set takes advantage of the fact that once an item is removed (which happens, for example, during Intersect) there will never be an item added, which allows it to leave out freelist and any work related to it, which HashSet couldn't do. Even the initial implementation is lighter in not tracking a version to catch changes during enumeration, which is a small cost, but a cost to every addition and removal nevertheless.
As such, with different sets of data with different distributions of hash codes sometimes one performs better, sometimes the other.
Especially given the fact that both of these classes are in the same assembly System.Core
Only in some versions of .NET, in some they're in separate assemblies. In .NET Core we had two versions of Set<T>, one in the assembly that has System.Linq and one in the separate assembly that has System.Linq.Expressions. The former got trimmed down as described above, the latter replaced with a use of HashSet<T> as it was doing less there.
Of course System.Core came first, but the fact that those elements could be separated out at all speaks of System.Core not being a single monolithic blob of inter-dependencies.
That there is now a ToHashSet() method in .NET Core's version of Linq makes the possibility of replacing Set<T> with HashSet<T> more justifiable, though not a no-brainer. I think #james-ko was considering testing the benefits of doing that.
It looks like HashSet<T> will perform better
For the reasons explained above, that might not be the case, though it might indeed, depending on source data. That's before getting into considerations of optimisations that go across a few different linq methods (not many in the initial versions of linq, but a good few in .NET Core).
so should I avoid using Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>.
Use Distinct(). If you've a bottle neck then it might be that HashSet<T> will win with a given data-set, but if you do try that make sure your profiling closely matches real values your code will encounter in real life. There's no point deciding one approach is the faster based on some arbitrary tests if your application hits cases where the other does better. (And if I was finding this a problem spot, I'd take a look at whether the GetHashCode() of the types in question could be improved for either speed or distribution of bits, first).

Tool for 'Flattening' (simplifying) C# Source

I need to provide a copy of the source code to a third party, but given it's a nifty extensible framework that could be easily repurposed, I'd rather provide a less OO version (a 'procedural' version for want of a better term) that would allow minor tweaks to values etc but not reimplementation using the full flexibility of how it is currently structured.
The code makes use of the usual stuff: classes, constructors, etc. Is there a tool or method for 'simplifying' this into what is still the 'source' but using only plain variables etc.
For example, if I had a class instance 'myclass' which initialised this.blah in the constructor, the same could be done with a variable called myclass_blah which would then be manipulated in a more 'flat' way. I realise some things like polymorphism would probably not be possible in such a situation. Perhaps an obfuscator, set to a 'super mild' setting would achieve it?
Thanks
My experience with nifty extensible frameworks has been that most shops have their own nifty extensible frameworks (usually more than one) and are not likely to steal them from vendor-provided source code. If you are under obligation to provide source code (due to some business relationship), then, at least in my mind, there's an ethical obligation to provide the actual source code, in a maintainable form. How you protect the source code is a legal matter and I can't offer legal advice, but really you should be including some license with your release and dealing with clients who are not going to outright steal your IP (assuming it's actually yours under the terms you're developing it.)
As had already been said, if this is a requirement based on restrictions of contracts then don't do it. In short, providing a version of the source that differs from what they're actually running becomes a liability and I doubt that it is one that your company should be willing to take. Proving that the code provided matches the code they are running is simple. This is also true if you're trying to avoid license restrictions of libraries your application uses (e.g. GPL).
If that isn't the case then why not provide a limited version of your extensibility framework that only works with internal types and statically compile any required extensions in your application? This will allow the application to continue to function as what they currently run while remaining maintainable without giving up your sacred framework. I've never done it myself but this sounds like something ILMerge could help with.
If you don't want to give out framework - just don't. Provide only source you think is required. Otherwise most likely you'll need to either support both versions in the future OR never work/interact with these people (and people they know) again.
Don't forget that non-obfuscated .Net assemblies have IL in easily de-compilable form. It is often easier to use ILSpy/Reflector to read someone else code than looking at sources.
If the reason to provide code is some sort of inspection (even simply looking at the code) you'd better have semi-decent code. I would seriously consider throwing away tool if its code looks written in FORTRAN-style using C# ( http://www.nikhef.nl/~templon/fortran/fortran_style ).
Side note: I believe "nifty extensible frameworks" are one of the roots of "not invented here" syndrome - I'd be more worried about comments on the framework (like "this code is ##### because it does not use YYY pattern and spacing is wrong") than reuse.

Immutable Data Structures in C#

I was reading some entries in Eric Lippert's blog about immutable data structures and I got to thinking, why doesn't C# have this built into the standard libraries? It seems strange for something with obvious reuse to not be already implemented out of the box.
EDIT: I feel I might be misunderstood on my question. I'm not asking how to implement one in C#, I'm asking why some of the basic data structures (Stack, Queue, etc.) aren't already available as immutable variants.
It does now.
.NET just shipped their first immutable collections, which I suggest you try out.
Any framework, language, or combination thereof that is not a purely experimental exercise has a market. Some purely experimental ones go on to develop a market.
In this sense, "market" does not necessarily refer to market economics, it's as true whether the producers of the framework/language/both are commercially or non-commercially oriented and distributing the framework/language/both (I'm just going to say "framework" for now on) at a cost or for free. Indeed, free-as-in-both-beer-and-speech projects can be even more heavily dependent on their markets than commercial projects in this way, because their producers are a subset of their market. The market is anyone and everyone who uses it.
The nature of this market will affect the framework in several ways both by organic processes (some parts prove more popular than others and get a bigger share of the mindspace within the community that educates its own members about them) and by fiat (someone decides a feature will serve the market and therefore adds it).
When .NET was developed, it was developed to serve its future market. Ideas about what would serve them therefore influenced decisions as to what should and should not be included in the FCL, the runtime, and the languages that work with it.
Someone decided that we'd quite likely need System.Collections.ArrayList. Someone decided we'd quite likely need System.IO.DirectoryInfo to have a Delete() method. Nobody decided we'd be likely to need a System.Collections.ImmutableStack.
Maybe nobody thought of it at all. Maybe someone did and even implemented it and then it was decided not to be of enough general use. Maybe it was debated at length within Microsoft. I've never worked for MS, so I don't have a clue.
What I can consider though, is the question as to what the people who were using the .NET framework in 2002 using in 2001.
Well, COM, ActiveX, ("Classic") ASP, and VB6 & VBScript is now much less used than it was, so it can be said to have been replaced by .NET. Indeed, that could be said to have been an intention.
As well as VB6 & VBScript, a considerable number who were writing in C++ and Java with Windows as a sole or primary target platform are now at least partly using .NET instead. Again, I think that could be said to be an intention, or at the very least I don't think MS were surprised that it went that way.
In COM we had an enumerator-object based foreach approach to iteration that had direct support in some languages (the VB family*), and .NET we have an enumerator-object based foreach approach to iteration that has direct support in some languages (C#, VB.NET, and others)†.
In C++ we had a rich set of collection types from the STL, and in .NET we have a rich set of collection types from the FCL (and typesafe generic types from .NET2.0 on).
In Java we had a strong everything-is-an-object style of OOP with a small set of methods provided by a common base-type and a boxing mechanism to allow for simple efficient primitives while keeping to this style. In .NET we have a strong everything-is-an-object style of OOP with a small set of methods provided by a common base-type and a (different) boxing mechanism to allow for simple efficient primitives while keeping to this style.
These cases show choices that are unsurprising considering who was likely to end up being the market for .NET (though such broad statements above shouldn't be read in a way that underestimates the amount of work and subtlety of issues within each of them). Another thing that relates to this is when .NET differs from COM or classic VB or C++ or Java, there may well be a bit of an explanation given in the documentation. When .NET differs from Haskell or Lisp, nobody feels the need to point it out!
Now, of course there are things done differently in .NET than to any of the above (or there'd have been no point and we could have stayed with COM etc.)
However, my point is that out of the near-infinite range of possible things that could end up in a framework like .NET, there are some complete no-brainers ("they might need some sort of string type..."), some close-to-obvious ("this is really easy to do in COM, so it must be easy to do in .NET"), some harder calls ("this will be more complicated than in VB6, but the benefits are worth it"), some improvements ("locale support could really be made a lot easier for developers, so lets build a new approach to the old issue") and some that were less related to the above.
At the other extreme, we can probably all imagine something that would be so out there as to be bizarre ("hey, all coders like Conway's Life - let's put a Conway's Life right into the framework") and hence there's no surprise at not finding it supported.
So far I've quickly brushed over a lot of hard work and difficult design balances in a way that makes the design they came up with seem simpler than it no doubt was. Most likely, the more "obvious" it seems to an outsider after the fact, the more difficult it was for the designers.
Immutable collection types falls into the large range of possible components to the FCL that while not as bizarre as a built-in-conway-support idea, was not as strongly called for by examining the market as a mutable list or a way to encapsulate locale information nicely. It would have been novel to much of the initial market, and therefore at risk of ending up not being used. In an alternate universe there's a .NET1.0 with immutable collections, but it's not very surprising that there wasn't one here.
*At least for consuming. Producing IEnumVARIANT implementations in VB6 wasn't simple, and could involve writing pointer values straight into v-tables in a rather nasty way that it suddenly occurs to me, is possibly not even allowed by today's DEP.
†With a sometimes impossible to implement .Reset() method. Is there any reason for this other than it was in IEnumVARIANT? Was it even ever much used in IEnumVARIANT?
I'll quote from that Eric Lippert blog that you've been reading:
because no one ever designed, specified, implemented, tested, documented and shipped that feature.
In other words, there is no reason other than it hasn't been high enough value or priority to get done ahead of all the other things they're working on.
Why can't you make an immutable struct? Consider the following:
public struct Coordinate
{
public int X
{
get { return _x; }
}
private int _x;
public int Y
{
get { return _y; }
}
private int _y;
public Coordinate(int x, int y)
{
_x = x;
_y = y;
}
}
It's an immutable value type.
It's hard to work with immutable data structures unless you have some functional programming constructs. Suppose you wanted to create an immutable vector containing every other capital letter. How would you do it unless you
A) had functions that did things like range(65, 91), filter(only even) and map(int -> char) to create the sequence in one shot and then turn it into an array
B) created the vector as mutable, added the characters in a loop, then then "froze" it, making it immutable?
By the way, C# does have the B option to some extent -- ReadOnlyCollection can wrap a mutable collection and prevent people from mutating it. However, it's a pain in the ass to do that all the time (and obviously it's hard to support sharing structure between copies when you don't know if something is going to become immutable or not.) A is a better option.
Remember, when C# 1.0 existed, it didn't have anonymous functions, it didn't have language support for generators or other laziness, it didn't have any functional APIs like LINQ -- not even map or filter -- it didn't have concise array initialization syntax (you couldn't write new int[] { 1, 2, 5 }) and it didn't have generic types; just putting stuff into and getting stuff out of collections normally was a pain. So I don't think it would have been a great choice to spend time on making robust immutable collections with such poor language support for using them.
It would be nice if .net had some really solid support for immutable data holders (classes and structures). One difficulty with adding really good support for such things, though, is that taking maximum advantage of mutable and immutable data structures would require some fundamental changes to the way inheritance works. While I would like to see such support in the next major object-oriented framework, I don't know that it can be efficiently worked into existing frameworks like .net or Java.
To see the problem, imagine that there are two basic data types: basicItem and deluxeItem (which is a basicItem with a few extra fields added). Each can exist in two concrete forms: mutable and immutable. Each can also be described in an abstract form: readable. Thus, there should be six data types; all but ReadableBasicItem should be substitutable for at least one other:
ReadableBasicItem: Not substitutable for anything
MutableBasicItem: ReadableBasicItem
ImmutableBasicItem: ReadableBasicItem
ReadableDeluxeItem: ReadableBasicItem
MutableDeluxeItem: ReadableDeluxeItem, MutableBasicItem (also ReadableBasicItem)
ImmutableDeluxeItem: ReadableDeluxeItem, ImmutableBasicItem (also ReadableBasicItem)
Even thought the underlying data type has just one base and one derived type, there inheritance graph has two "diamonds" since both "MutableDeluxeItem" and "ImmutableDeluxeItem" have two parents (MutableBasicItem and ReadableDeluxeItem), both of which inherit from ReadableBasicItem. Existing class architectures cannot effectively deal with that. Note that it wouldn't be necessary to support generalized multiple inheritance; merely to allow some specific variants such as those above (which, despite having "diamonds" in the inheritance graph, have an internal structure such that both ReadableDeluxeItem and MutableBasicItem would inherit from "the same" ReadableBasicItem).
Also, while support for that style of inheritance of mutable and immutable types might be nice, the biggest payoff wouldn't happen unless the system had a means of distinguishing heap-stored objects that should expose value semantics from those which should expose reference semantics, could distinguish mutable objects from immutable ones, and could allow objects to start out in an "uncommitted" state (neither mutable nor guaranteed immutable). Copying a reference to a heap object with mutable value semantics should perform a memberwise clone on that object and any nested objects with mutable value semantics, except in cases where the original reference would be guaranteed destroyed; the clones should start life as uncommitted, but be CompareExchange'd to mutable or immutable as needed.
Adding framework support for such features would allow copy-on-write value semantics to be implemented much more efficiently than would be possible without framework support, but such support would really have to be built into the framework from the ground up. I don't think it could very well be overlaid onto an existing framework.

C# classes - Why so many static methods?

I'm pretty new to C# so bear with me.
One of the first things I noticed about C# is that many of the classes are static method heavy. For example...
Why is it:
Array.ForEach(arr, proc)
instead of:
arr.ForEach(proc)
And why is it:
Array.Sort(arr)
instead of:
arr.Sort()
Feel free to point me to some FAQ on the net. If a detailed answer is in some book somewhere, I'd welcome a pointer to that as well. I'm looking for the definitive answer on this, but your speculation is welcome.
Because those are utility classes. The class construction is just a way to group them together, considering there are no free functions in C#.
Assuming this answer is correct, instance methods require additional space in a "method table." Making array methods static may have been an early space-saving decision.
This, along with avoiding the this pointer check that Amitd references, could provide significant performance gains for something as ubiquitous as arrays.
Also see this rule from FXCOP
CA1822: Mark members as static
Rule Description
Members that do not access instance data or call instance methods can
be marked as static (Shared in Visual Basic). After you mark the
methods as static, the compiler will emit nonvirtual call sites to
these members. Emitting nonvirtual call sites will prevent a check at
runtime for each call that makes sure that the current object pointer
is non-null. This can achieve a measurable performance gain for
performance-sensitive code. In some cases, the failure to access the
current object instance represents a correctness issue.
Perceived functionality.
"Utility" functions are unlike much of the functionality OO is meant to target.
Think about the case with collections, I/O, math and just about all utility.
With OO you generally model your domain. None of those things really fit in your domain--it's not like you are coding and go "Oh, we need to order a new hashtable, ours is getting full". Utility stuff often just doesn't fit.
We get pretty close, but it's still not very OO to pass around collections (where is your business logic? where do you put the methods that manipulate your collection and that other little piece or two of data you are always passing around with it?)
Same with numbers and math. It's kind of tough to have Integer.sqrt() and Long.sqrt() and Float.sqrt()--it just doesn't make sense, nor does "new Math().sqrt()". There are a lot of areas it just doesn't model well. If you are looking for mathematical modeling then OO may not be your best bet. (I made a pretty complete "Complex" and "Matrix" class in Java and made them fairly OO, but making them really taught me some of the limits of OO and Java--I ended up "Using" the classes from Groovy mostly)
I've never seen anything anywhere NEAR as good as OO for modeling your business logic, being able to demonstrate the connections between code and managing your relationship between data and code though.
So we fall back on a different model when it makes more sense to do so.
The classic motivations against static:
Hard to test
Not thread-safe
Increases code size in memory
1) C# has several tools available that make testing static methods relatively easy. A comparison of C# mocking tools, some of which support static mocking: https://stackoverflow.com/questions/64242/rhino-mocks-typemock-moq-or-nmock-which-one-do-you-use-and-why
2) There are well-known, performant ways to do static object creation/logic without losing thread safety in C#. For example implementing the Singleton pattern with a static class in C# (you can jump to the fifth method if the inadequate options bore you): http://www.yoda.arachsys.com/csharp/singleton.html
3) As #K-ballo mentions, every method contributes to code size in memory in C#, rather than instance methods getting special treatment.
That said, the 2 specific examples you pointed out are just a problem of legacy code support for the static Array class before generics and some other code sugar was introduced back in C# 1.0 days, as #Inerdia said. I tried to answer assuming you had more code you were referring to, possibly including outside libraries.
The Array class isn't generic and can't be made fully generic because this would break backwards compatibility. There's some sort of magic going on where arrays implement IList<T>, but that's only for single-dimension arrays with a lower bound of 0 – "list-ish" arrays.
I'm guessing the static methods are the only way to add generic methods that work over any shape of array regardless of whether it qualifies for the above-mentioned compiler magic.

High Performance Cloning

I'm after a means of deep cloning an object graph in a perfomant way. I'm going to have multiple threads cloning a graph extremely quickly such that they can play with some state and throw away the results if they're not interesting, returning to the original to try again.
I'm currently using a deep clone via binary serialization, which although it works, isn't amazingly fast. I've seen other libraries like protobuf, but the classes in my object graph may be defined in external assemblies, inheriting from classes in the main assembly and don't wish to add any complexity in those consuming assemblies if possible.
One of the interesting things I did come across was cloning using automatically generated IL. It seems it's not quite finished and I've posted to see if the author has done any more on it, but I'm guessing not. Has anyone else developed or seen a more fully functional way of deep cloning via IL? Or another method that is going to be fast?
Other than serialisation, I only consider three options:
Stick with serialisation, but customise it. This might be useful if you want to declaratively bin stuff and there are very likely performance gains to be had.
Reflection-based object walking, in conjunction with an IL emitter such as Fasterflect.
Code-gen or code your own cloning by literally assigning properties to each other (we have some old code that uses what we call a copy-constructor for this, takes an instance of itself and manually copies the properties / fields across).
We have some instances of code where we control the binary serialisation so that we can serialise an interned GUID table (we have lots of repeating GUIDs and serialise very large lists over .NET Remoting). It works well for us and we haven't needed a third party serialisation framework, however, it's hand-crafted stuff with a little code-gen.
The CSLA.NET framework features a class called UndoableBase that uses reflection to serialise a Hashtable of property/field values. Used for allowing rollbacks on objects in memory. This might fit with your "returning to original to try again" sentence.
Personally I'd look further into a reflection-based (preferably with emitted IL for better performance) solution, this then allows you to take advantage of class/member attributes for control over the cloning process. If performance is king, this may not cut it.

Categories