C# creating a fixed-size hashtable

I want to be able to create a fixed-size hash map of, say, 100 buckets, and if I need to store more than 100 items then collisions and overwriting will just have to happen. The Hashtable class has an IsFixedSize property, but it is read-only.
Am I thinking about this completely wrongly, or is there a solution to this?

Collections in the .NET Framework don't allow for a lot of fine-tuning, although you might find one efficient enough for your needs. Try some viable ones out before optimizing.
If you don't roll your own then you might find a third-party alternative that has more fine-grained control. For example, see The C5 Generic Collection Library for C# and CLI as a possible start. Check into the various Hash* classes on their documentation page.
If you decide to roll your own then you'll want to implement some of the standard interfaces for collections and/or lists, enumeration, etc., so your type works as expected with C# foreach and other language and .NET features.
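As a rough sketch of what rolling your own could look like for the original question (the class name, the overwrite-on-collision policy, and the fixed bucket count are assumptions chosen to mirror the question, not a recommended design):
using System.Collections.Generic;

// Minimal fixed-size hash map: a colliding key simply overwrites whatever
// currently occupies the target bucket, as the question describes.
public class FixedSizeHashMap<TKey, TValue>
{
    private readonly KeyValuePair<TKey, TValue>?[] _buckets;
    private readonly IEqualityComparer<TKey> _comparer = EqualityComparer<TKey>.Default;

    public FixedSizeHashMap(int capacity)
    {
        _buckets = new KeyValuePair<TKey, TValue>?[capacity];
    }

    private int BucketFor(TKey key)
    {
        // Mask off the sign bit so negative hash codes still map to a valid bucket.
        return (_comparer.GetHashCode(key) & int.MaxValue) % _buckets.Length;
    }

    public void Put(TKey key, TValue value)
    {
        _buckets[BucketFor(key)] = new KeyValuePair<TKey, TValue>(key, value);
    }

    public bool TryGet(TKey key, out TValue value)
    {
        KeyValuePair<TKey, TValue>? slot = _buckets[BucketFor(key)];
        if (slot.HasValue && _comparer.Equals(slot.Value.Key, key))
        {
            value = slot.Value.Value;
            return true;
        }
        value = default(TValue);
        return false;
    }
}
With new FixedSizeHashMap<string, int>(100), anything that hashes to an occupied bucket silently replaces the previous entry, which is exactly the trade-off the question is willing to accept.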
You might also reuse an efficient C++ implementation if you have one; there are ways of using it from C#/.NET. It might take a bit of finagling, but there are answers on SO about how to accomplish this kind of thing.

Related

.NET - Is there a singly-linked list in System.Collections.Immutable somewhere?

In functional programming, singly linked lists are extremely popular because it is easy to reuse sub-lists without copying values. You can remove items from the head with no allocation at all, and prepend an item by allocating only a single node. The F# list is like this.
From what I've read, it sounds like System.Collections.Immutable.ImmutableList<T> is like an immutable version of System.Collections.Generic.List<T>, which is an abstraction over arrays. This is more optimized for random access than a linked list, but requires copying the entire list when adding or removing items.
System.Collections.Generic.LinkedList<T> is a mutable doubly-linked list, which means that mutation or copying is required to add or remove items.
There is no System.Collections.Immutable.ImmutableLinkedList<T> that I've been able to find.
Is there really no immutable singly-linked list in the System.Collections.Immutable package? Is using Microsoft.FSharp.Core.List<T> the best option here?
There is (as of this writing) no great answer to this in the .NET base class library.
The closest thing in System.Collections.Immutable is ImmutableStack<T>, which is implemented as a singly-linked list under the hood. However, the exposed operations are very limited, so unless you actually want a stack this is probably a poor choice. You are correct that ImmutableList<T> is a different data structure with different operations and performance characteristics.
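As a hedged illustration (assuming the System.Collections.Immutable package is referenced), ImmutableStack<T> already behaves like a persistent cons list if head operations are all you need:
using System.Collections.Immutable;

// Push allocates a single node that points at the existing list; Pop simply
// returns the existing tail, so sub-lists are shared rather than copied.
ImmutableStack<int> tail = ImmutableStack<int>.Empty.Push(3).Push(2);
ImmutableStack<int> list = tail.Push(1);   // "cons" 1 onto the shared tail
int head = list.Peek();                    // 1
ImmutableStack<int> rest = list.Pop();     // the very same instance as tail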
There is FSharpList<T> in the F# standard library, but being an F# primitive it is awkward to use from C# (it's possible, it just won't result in clean, idiomatic code). It also requires taking the full F# standard library as a dependency.
Having faced the same problem in one of my projects, I wrote a NuGet package "ImmutableLinkedList" (github here) for this, so that could be an option as well.

Immutable Data Structures in C#

I was reading some entries in Eric Lippert's blog about immutable data structures and I got to thinking, why doesn't C# have this built into the standard libraries? It seems strange for something with obvious reuse to not be already implemented out of the box.
EDIT: I feel I might be misunderstood on my question. I'm not asking how to implement one in C#, I'm asking why some of the basic data structures (Stack, Queue, etc.) aren't already available as immutable variants.
It does now.
.NET just shipped their first immutable collections, which I suggest you try out.
Any framework, language, or combination thereof that is not a purely experimental exercise has a market. Some purely experimental ones go on to develop a market.
In this sense, "market" does not necessarily refer to market economics, it's as true whether the producers of the framework/language/both are commercially or non-commercially oriented and distributing the framework/language/both (I'm just going to say "framework" for now on) at a cost or for free. Indeed, free-as-in-both-beer-and-speech projects can be even more heavily dependent on their markets than commercial projects in this way, because their producers are a subset of their market. The market is anyone and everyone who uses it.
The nature of this market will affect the framework in several ways both by organic processes (some parts prove more popular than others and get a bigger share of the mindspace within the community that educates its own members about them) and by fiat (someone decides a feature will serve the market and therefore adds it).
When .NET was developed, it was developed to serve its future market. Ideas about what would serve them therefore influenced decisions as to what should and should not be included in the FCL, the runtime, and the languages that work with it.
Someone decided that we'd quite likely need System.Collections.ArrayList. Someone decided we'd quite likely need System.IO.DirectoryInfo to have a Delete() method. Nobody decided we'd be likely to need a System.Collections.ImmutableStack.
Maybe nobody thought of it at all. Maybe someone did and even implemented it and then it was decided not to be of enough general use. Maybe it was debated at length within Microsoft. I've never worked for MS, so I don't have a clue.
What I can consider, though, is what the people who were using the .NET Framework in 2002 had been using in 2001.
Well, COM, ActiveX, ("Classic") ASP, and VB6 & VBScript are now much less used than they were, so they can be said to have been replaced by .NET. Indeed, that could be said to have been the intention.
As well as VB6 & VBScript, a considerable number of people who were writing in C++ and Java with Windows as a sole or primary target platform are now at least partly using .NET instead. Again, I think that could be said to be an intention, or at the very least I don't think MS were surprised that it went that way.
In COM we had an enumerator-object based foreach approach to iteration that had direct support in some languages (the VB family*), and in .NET we have an enumerator-object based foreach approach to iteration that has direct support in some languages (C#, VB.NET, and others)†.
In C++ we had a rich set of collection types from the STL, and in .NET we have a rich set of collection types from the FCL (and typesafe generic types from .NET2.0 on).
In Java we had a strong everything-is-an-object style of OOP with a small set of methods provided by a common base-type and a boxing mechanism to allow for simple efficient primitives while keeping to this style. In .NET we have a strong everything-is-an-object style of OOP with a small set of methods provided by a common base-type and a (different) boxing mechanism to allow for simple efficient primitives while keeping to this style.
These cases show choices that are unsurprising considering who was likely to end up being the market for .NET (though such broad statements shouldn't be read in a way that underestimates the amount of work and the subtlety of the issues within each of them). Another thing that relates to this: when .NET differs from COM or classic VB or C++ or Java, there may well be a bit of an explanation given in the documentation. When .NET differs from Haskell or Lisp, nobody feels the need to point it out!
Now, of course there are things done differently in .NET than in any of the above (or there'd have been no point, and we could have stayed with COM etc.).
However, my point is that out of the near-infinite range of possible things that could end up in a framework like .NET, there are some complete no-brainers ("they might need some sort of string type..."), some close-to-obvious ("this is really easy to do in COM, so it must be easy to do in .NET"), some harder calls ("this will be more complicated than in VB6, but the benefits are worth it"), some improvements ("locale support could really be made a lot easier for developers, so let's build a new approach to the old issue"), and some that were less related to the above.
At the other extreme, we can probably all imagine something that would be so out there as to be bizarre ("hey, all coders like Conway's Life - let's put Conway's Life right into the framework") and hence there's no surprise at not finding it supported.
So far I've quickly brushed over a lot of hard work and difficult design balances in a way that makes the design they came up with seem simpler than it no doubt was. Most likely, the more "obvious" it seems to an outsider after the fact, the more difficult it was for the designers.
Immutable collection types fall into the large range of possible components of the FCL that, while not as bizarre as built-in Conway's Life support, were not as strongly called for by examining the market as a mutable list or a way to encapsulate locale information nicely. They would have been novel to much of the initial market, and therefore at risk of ending up unused. In an alternate universe there's a .NET 1.0 with immutable collections, but it's not very surprising that there wasn't one here.
*At least for consuming. Producing IEnumVARIANT implementations in VB6 wasn't simple, and could involve writing pointer values straight into v-tables in a rather nasty way that, it suddenly occurs to me, is possibly not even allowed by today's DEP.
†With a sometimes impossible to implement .Reset() method. Is there any reason for this other than it was in IEnumVARIANT? Was it even ever much used in IEnumVARIANT?
I'll quote from that Eric Lippert blog that you've been reading:
because no one ever designed, specified, implemented, tested, documented and shipped that feature.
In other words, there is no reason other than it hasn't been high enough value or priority to get done ahead of all the other things they're working on.
Why can't you make an immutable struct? Consider the following:
public struct Coordinate
{
    public int X
    {
        get { return _x; }
    }
    private int _x;

    public int Y
    {
        get { return _y; }
    }
    private int _y;

    public Coordinate(int x, int y)
    {
        _x = x;
        _y = y;
    }
}
It's an immutable value type.
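A brief usage note (the variable names are just for illustration): since there is no setter, "modifying" a Coordinate means constructing a new value.
var origin = new Coordinate(0, 0);
var moved = new Coordinate(origin.X + 1, origin.Y);
// origin.X = 5;   // would not compile: X has no setter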
It's hard to work with immutable data structures unless you have some functional programming constructs. Suppose you wanted to create an immutable vector containing every other capital letter. How would you do it unless you
A) had functions that did things like range(65, 91), filter(only even) and map(int -> char) to create the sequence in one shot and then turn it into an array
B) created the vector as mutable, added the characters in a loop, and then "froze" it, making it immutable?
By the way, C# does have the B option to some extent -- ReadOnlyCollection can wrap a mutable collection and prevent people from mutating it. However, it's a pain in the ass to do that all the time (and obviously it's hard to support sharing structure between copies when you don't know if something is going to become immutable or not). A is a better option.
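In today's C#, option A reads almost exactly like that description (a sketch assuming the System.Collections.Immutable package; the filter mirrors the "only even" example above):
using System.Linq;
using System.Collections.Immutable;

// Build the sequence declaratively, then freeze it in one shot.
ImmutableArray<char> everyOtherCapital =
    Enumerable.Range('A', 26)            // code points 65..90
              .Where(i => i % 2 == 0)    // keep every other letter
              .Select(i => (char)i)      // int -> char
              .ToImmutableArray();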
Remember, when C# 1.0 existed, it didn't have anonymous functions, it didn't have language support for generators or other laziness, it didn't have any functional APIs like LINQ -- not even map or filter -- it didn't have concise array initialization syntax (you couldn't write new[] { 1, 2, 5 }), and it didn't have generic types; just putting stuff into and getting stuff out of collections normally was a pain. So I don't think it would have been a great choice to spend time on making robust immutable collections with such poor language support for using them.
It would be nice if .NET had some really solid support for immutable data holders (classes and structures). One difficulty with adding really good support for such things, though, is that taking maximum advantage of mutable and immutable data structures would require some fundamental changes to the way inheritance works. While I would like to see such support in the next major object-oriented framework, I don't know that it can be efficiently worked into existing frameworks like .NET or Java.
To see the problem, imagine that there are two basic data types: BasicItem and DeluxeItem (a BasicItem with a few extra fields added). Each can exist in two concrete forms: mutable and immutable. Each can also be described in an abstract form: readable. Thus, there should be six data types; all but ReadableBasicItem should be substitutable for at least one other:
ReadableBasicItem: Not substitutable for anything
MutableBasicItem: ReadableBasicItem
ImmutableBasicItem: ReadableBasicItem
ReadableDeluxeItem: ReadableBasicItem
MutableDeluxeItem: ReadableDeluxeItem, MutableBasicItem (also ReadableBasicItem)
ImmutableDeluxeItem: ReadableDeluxeItem, ImmutableBasicItem (also ReadableBasicItem)
Even though the underlying data type has just one base and one derived type, the inheritance graph has two "diamonds", since MutableDeluxeItem and ImmutableDeluxeItem each have two parents (MutableBasicItem plus ReadableDeluxeItem, and ImmutableBasicItem plus ReadableDeluxeItem, respectively), all of which inherit from ReadableBasicItem. Existing class architectures cannot effectively deal with that. Note that it wouldn't be necessary to support generalized multiple inheritance; merely to allow some specific variants such as those above (which, despite having "diamonds" in the inheritance graph, have an internal structure such that both ReadableDeluxeItem and MutableBasicItem would inherit from "the same" ReadableBasicItem).
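A rough sketch of how close C# can get today, using interfaces for the readable views (the type names mirror the list above; everything else is invented for illustration):
// The readable views become interfaces, which are allowed to form a diamond.
public interface IReadableBasicItem { int Value { get; } }
public interface IReadableDeluxeItem : IReadableBasicItem { string Extra { get; } }

// Mutable variants: MutableDeluxeItem is substitutable for both
// MutableBasicItem and IReadableDeluxeItem, matching the list above.
public class MutableBasicItem : IReadableBasicItem { public int Value { get; set; } }
public class MutableDeluxeItem : MutableBasicItem, IReadableDeluxeItem { public string Extra { get; set; } }

// Immutable variants, built the same way.
public class ImmutableBasicItem : IReadableBasicItem
{
    public ImmutableBasicItem(int value) { Value = value; }
    public int Value { get; private set; }
}
public class ImmutableDeluxeItem : ImmutableBasicItem, IReadableDeluxeItem
{
    public ImmutableDeluxeItem(int value, string extra) : base(value) { Extra = extra; }
    public string Extra { get; private set; }
}
This covers the substitution table, but the interfaces cannot carry shared state or enforce immutability, which is exactly the gap the rest of this answer is about.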
Also, while support for that style of inheritance of mutable and immutable types might be nice, the biggest payoff wouldn't happen unless the system had a means of distinguishing heap-stored objects that should expose value semantics from those which should expose reference semantics, could distinguish mutable objects from immutable ones, and could allow objects to start out in an "uncommitted" state (neither mutable nor guaranteed immutable). Copying a reference to a heap object with mutable value semantics should perform a memberwise clone on that object and any nested objects with mutable value semantics, except in cases where the original reference would be guaranteed destroyed; the clones should start life as uncommitted, but be CompareExchange'd to mutable or immutable as needed.
Adding framework support for such features would allow copy-on-write value semantics to be implemented much more efficiently than would be possible without framework support, but such support would really have to be built into the framework from the ground up. I don't think it could very well be overlaid onto an existing framework.

How do I avoid 'Binding to Large CLR Objects'?

This guide on optimizing DataBinding says:
There is a significant performance impact when you data bind to a single CLR object with thousands of properties. You can minimize this impact by dividing the single object into multiple CLR objects with fewer properties.
What does this mean? I am still trying to get familiar with data binding, but my analogy here is that properties are like SQL table fields, and objects are rows. This advice then translates to "to avoid problems with a large number of fields, use fewer fields and create more rows". As this doesn't make any sense to me, possibly my understanding of data binding is completely askew?
Does this advice actually apply? I am unsure if it is specific to .NET 4/WPF, since I am using 3.5 and a custom WinForms-based control library (DevExpress).
As an aside: am I correct in thinking data binding uses reflection when using an IList-style data source?
This is not just an academic question. I am currently trying to speed up loading an XtraGridView (DevExpress control) with ~100,000 objects that have 50 or so properties each.
This advice then translates to "to avoid problems with a large number of fields, use fewer fields and create more rows"
I think it should translate to "use fewer fields and create smaller tables" (i.e. with fewer fields). And the original advice should read "[...] dividing the single class into multiple classes", each with fewer properties. As you correctly noted, it wouldn't make sense to create more "rows"...
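As a hypothetical sketch (the class and property names are invented) of what "dividing the single class into multiple classes" can look like for binding purposes:
// Before: one object exposing everything directly (imagine dozens more properties).
public class OrderRowFlat
{
    public string CustomerName { get; set; }
    public string CustomerPhone { get; set; }
    public decimal NetTotal { get; set; }
    public decimal TaxTotal { get; set; }
}

// After: the same data regrouped into smaller CLR objects with fewer properties each.
public class CustomerInfo { public string Name { get; set; } public string Phone { get; set; } }
public class OrderTotals { public decimal Net { get; set; } public decimal Tax { get; set; } }

public class OrderRow
{
    public CustomerInfo Customer { get; set; }
    public OrderTotals Totals { get; set; }
}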
Anyway, if you do have a class that exposes hundreds or thousands of properties, you have a far more serious problem than binding performance... This is a serious design flaw that you should fix after reading some OO principles.
Does this advice actually apply? I am unsure if it is specific to .NET 4/WPF, since I am using 3.5 and a custom WinForms-based control library (DevExpress)
Well, the page you mentioned is about WPF, but I think the idea of binding to smaller objects can apply to WinForms too (because the more properties need to be watched, the slower it will be)
As an aside: am I correct in thinking data binding uses reflection when using an IList-style data source?
You're partially correct... it actually uses TypeDescriptor, which in turn uses reflection to examine regular CLR objects. But this mechanism is much more flexible than reflection: a type can implement ICustomTypeDescriptor to provide its own description, list of members, etc. (DataTable is one example of such a type).
You are solving the wrong problem. It will take a typical user well over a week to find what she is looking for when she's got 5 million fields to search through. The speed of your UI becomes irrelevant. Only a machine can do a better job of finding the data.
You've got one: the database engine. Help the user narrow down what she is searching for by letting her enter search terms, so that the total query result doesn't contain more than, say, a hundred rows. The database engine helps you make that fast, and it automatically solves your grid performance problem.

Definition of C# data structures and algorithms

This may be a silly question (with MSDN and all), but maybe some of you will be able to help me sift through amazing amounts of information.
I need to know the specifics of the implementations of common data structures and algorithms in C#. That is, I need to know, for example, how linked lists are handled and represented, and how they and their methods are defined.
Is there a good centralized source of documentation for this (with code), or should I just reconstruct it? Have you ever had to know the specifics of these things to decide what to use?
Regards, and thanks.
Scott Mitchell has a great 6-part article that covers many .NET data structures:
An Extensive Examination of Data Structures
For an algorithmic overview of data structures, I suggest reading the classic algorithms textbook "Introduction to Algorithms" by Cormen et al.
For details on each .NET data structure, the MSDN page on that specific class is good.
When all of those fail to address the issue, Reflector is always there. You can use it to dig through the actual source and see things for yourself.
If you really want to learn it, try making your own.
Googling for linked lists will give you a lot of hits and sample code to work from. Wikipedia will also be a good resource.
Depends on the language. Most languages now come with the very basics pre-built, but that doesn't mean their implementations are the same. The same-named type, LinkedList, in C# is completely different from the LinkedList in Java or std::list in C++. Even the string library is different: C#, for instance, creates a new String object every time you assign a string a new value (strings are immutable), which is something you learn quickly when it brings your program to a grinding halt the first time you do heavy substring work in C#.
So the answer to your question is massively complicated, because I don't know quite what you're after. If you're just going to be teaching a class what generic versions of these algorithms and data structures look like, you can present them without getting into the problems I mentioned above. You'll just need to pick a particular implementation, look it up, and read about it. For a linked list, for example, you need to be able to instantiate the list, destroy it, copy it, add to it somewhere (usually the front or back), remove from it, and so on. You could get fancy and add as many methods as you want.
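As a rough, purely illustrative sketch of such a hand-rolled structure (this is not how System.Collections.Generic.LinkedList<T> is implemented; that class is a doubly-linked list):
using System.Collections.Generic;

// Minimal singly-linked list: each node points to the next, the list keeps only the head.
public class SinglyLinkedList<T>
{
    private sealed class Node
    {
        public T Value;
        public Node Next;
    }

    private Node _head;
    public int Count { get; private set; }

    // O(1): the new node becomes the head and points at the old head.
    public void AddFirst(T value)
    {
        _head = new Node { Value = value, Next = _head };
        Count++;
    }

    // O(1): drop the head and let the old node be garbage collected.
    public bool RemoveFirst()
    {
        if (_head == null) return false;
        _head = _head.Next;
        Count--;
        return true;
    }

    // O(n): walk the chain looking for the value.
    public bool Contains(T value)
    {
        var comparer = EqualityComparer<T>.Default;
        for (var node = _head; node != null; node = node.Next)
            if (comparer.Equals(node.Value, value)) return true;
        return false;
    }
}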

Should I be concerned about .NET dictionary speed?

I will be creating a project that will use dictionary lookups and inserts quite a bit. Is this something to be concerned about?
Also, if I do benchmarking and such and it is really bad, then what is the best way of replacing the dictionary with something else? Would using an array with "hashed" keys even be faster? That wouldn't help on insert time though, will it?
Also, I don't think I'm micro-optimizing because this really will be a significant part of code on a production server, so if this takes an extra 100ms to complete, then we will be looking for new ways to handle this.
You are micro-optimizing. Do you even have working code yet? Remember, "If it doesn't work, it doesn't matter how fast it doesn't work." (Mich Ravera) http://www.codingninja.co.uk/best-programmers-quotes/.
You have no idea where the bottlenecks will be, and already you're focused on Dictionary. What if the problem is somewhere else?
How do you know how the Dictionary class is implemented? Maybe it already uses an array with hashed keys!
P.S. It's really ".NET Dictionaries", not "C# Dictionaries", because C# is just one of several programming languages that use the framework.
Hello, I will be creating a project that will use dictionary lookups and inserts quite a bit. Is this something to be concerned about?
Yes. It is always wise to consider performance factors up front.
The form that your concern should take is as follows: your concern should be encouraging you to write realistic, user-focused performance specifications. It should be encouraging you to start writing performance tests early, and running them often, so that you can see how every single change to the product affects performance. That way you will be informed immediately when a code change causes a user-affecting change in performance. And it should be encouraging you to run profiles often, so that you are reasoning about performance based on empirical measurements, rather than random guesses and hunches.
Also, if I do benchmarking and such and it is really bad, then what is the best way of replacing dictionary with something else?
The best way to do this is to build a reasonable abstraction layer. If you have a class (or interface) which represents the "insert" and "lookup" abstract data type, then you can replace its internals without changing any of the callers.
Note that adding a layer of abstraction itself has a performance cost. If your profiling shows that the abstraction layer is too expensive, if the extra couple nanoseconds per call is too much, then you might have to get rid of the abstraction layer. Again, this decision will be driven by real-world performance data.
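A hedged sketch of such an abstraction layer (the interface and class names are invented for illustration):
using System.Collections.Generic;

// Callers only see the abstract "insert and lookup" operations.
public interface IKeyValueStore<TKey, TValue>
{
    void Insert(TKey key, TValue value);
    bool TryLookup(TKey key, out TValue value);
}

// Default implementation backed by Dictionary<TKey, TValue>; it can be swapped
// for another data structure later without touching any of the callers.
public class DictionaryStore<TKey, TValue> : IKeyValueStore<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _inner = new Dictionary<TKey, TValue>();

    public void Insert(TKey key, TValue value) { _inner[key] = value; }
    public bool TryLookup(TKey key, out TValue value) { return _inner.TryGetValue(key, out value); }
}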
Would using an array with "hashed" keys even be faster? That wouldn't help on insert time though, will it?
Neither you nor anyone reading this can possibly know which one is faster until you write it both ways and then benchmark it both ways under real-world conditions. Doing it under "lab" conditions will skew your results; you'll need to understand how things work when the GC is under realistic memory pressure, and so on. You might as well ask us which horse will run faster in next year's Kentucky Derby. If we knew the answer just by looking at the racing form, we'd all be rich already. You can't possibly expect anyone to know which of two entirely hypothetical, unwritten pieces of code will be faster under unspecified conditions!
The Dictionary<TKey, TValue> class is actually implemented as a hash table which makes lookups very fast (close to O(1)). See the API documentation for more information. I doubt you could make a better implementation yourself.
Wait and see if the performance of your application is below expectations.
If it is, then use a profiler to determine whether the Dictionary lookup is the source of the problem.
If it is, then do some tests with representative data to see if another choice of collection would be quicker.
In short - no, in general you shouldn't worry about the performance of implementation details until after you have a problem.
I would do a benchmark of Dictionary<TKey, TValue>, the (non-generic) Hashtable, and perhaps a home-grown class, and see which works out best under your typical usage conditions.
Normally I would say it's fine (insert Stack Overflow's favorite premature optimization quote here), but if this is a core piece of the application: benchmark, benchmark, benchmark.
The only concern that I can think of is that the speed of the dictionary relies on the key class having a reasonably fast GetHashCode method. Lookups and inserts are really fast, so you shouldn't have any problem there.
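To illustrate that point (a sketch with an invented key type): keep GetHashCode cheap and consistent with Equals, and the dictionary stays fast.
using System;

// Hypothetical composite key with an allocation-free GetHashCode/Equals pair.
public struct AccountKey : IEquatable<AccountKey>
{
    public readonly int Region;
    public readonly int Number;

    public AccountKey(int region, int number) { Region = region; Number = number; }

    public bool Equals(AccountKey other) { return Region == other.Region && Number == other.Number; }
    public override bool Equals(object obj) { return obj is AccountKey && Equals((AccountKey)obj); }

    // Cheap combination of the two fields; no string building, no boxing.
    public override int GetHashCode() { return (Region * 397) ^ Number; }
}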
Regarding using an array, that's essentially what the Dictionary class does already. Internally it uses an array of buckets plus an array of entries that hold the keys and values.
If you did run into performance problems with a Dictionary, it would be quite easy to make a wrapper for any kind of storage that has the same methods and behaviour as a Dictionary, so that you can replace it seamlessly.
I'm not sure that anyone has really answered this part yet:
Also, if I do benchmarking and such and it is really bad, then what is the best way of replacing dictionary with something else?
For this, wherever possible, declare your variables as IDictionary<TKey, TValue>. That's the main interface that Dictionary implements. (I'm assuming that if you care that much about performance, then you aren't considering non-generic collections.) Then, in the future, you can change the underlying implementation class without having to change any of the code that uses that dictionary. For example:
IDictionary<string, int> myDict = new Dictionary<string, int>();
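Swapping the implementation later is then a one-line change at the construction site (SortedDictionary<TKey, TValue> is shown purely as an example of another IDictionary implementation):
IDictionary<string, int> myDict = new SortedDictionary<string, int>();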
If your application is multithreaded then the key part of performance is going to be synchronizing this Dictionary correctly.
If it is single-threaded then the bottleneck will almost certainly be elsewhere, such as reading these objects from wherever you are reading them.
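If the multithreaded case does apply, one option (a sketch assuming .NET 4 or later is available; on older versions you would take a lock around a plain Dictionary instead) is ConcurrentDictionary<TKey, TValue> from System.Collections.Concurrent:
using System.Collections.Concurrent;

var counts = new ConcurrentDictionary<string, int>();
counts.TryAdd("key", 1);                        // thread-safe insert
counts.AddOrUpdate("key", 1, (k, v) => v + 1);  // thread-safe read-modify-write
int value;
counts.TryGetValue("key", out value);           // thread-safe lookup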
I use a Dictionary for a UDP relay server. Each time a packet arrives it performs Dictionary.ContainsKey and Dictionary[Key], and it works great (massive number of clients). I had concerns when I was building it, but it turned out that was the last thing I should have worried about.
Have a look at C# HybridDictionary Usage
HybridDictionary Class
This class is recommended for cases where the number of elements in a dictionary is unknown. It takes advantage of the improved performance of a ListDictionary with small collections, and offers the flexibility of switching to a Hashtable which handles larger collections better than ListDictionary.
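A minimal usage sketch (HybridDictionary lives in System.Collections.Specialized and is not generic, so values come back as object):
using System.Collections.Specialized;

var settings = new HybridDictionary();
settings["timeout"] = 30;                 // backed by a ListDictionary while the collection is small
settings["retries"] = 5;
int timeout = (int)settings["timeout"];   // cast required: entries are stored as object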
You may consider using the C5 library. I've found it to be very fast and thoughtfully designed. Others on Stack Overflow have found the same. With C5 you have the option of using general type interfaces (with a capital I), or the underlying data structures directly. Naturally the interfaces allow you to swap out different implementations, but I have found in performance testing that the interfaces will cost you.
You may want to look at the KeyedCollection class in System.ObjectModel. From the MSDN description, "provides the abstract base class for a collection whose keys are embedded in the values."
