Consider the following:
A list of approximately 10,000 folders.
For these folders, a list of rules determines whether each one qualifies to go to the next stage.
The rules are text-based comparisons, such that:
if the folder name contains (...any of the strings from a list of exceptions). That is, there is a one-to-many comparison for each folder, and the folder name string must contain (or must NOT contain) any of the strings it is compared to.
I'm relatively new to C# so I'm not entirely sure what's under the hood of each class
Any advice in some general direction would be greatly appreciated.
Do you have a performance problem, or are you trying to optimize the code before it has been written?
The Comparer class is typically not the fastest class in the .NET Framework, because it has to cater for quite a lot of scenarios.
If you know the source and target types, you're usually better off implementing your own specific comparer class.
However, unless you know that you have a performance problem, I wouldn't worry too much about it.
First of all, 10K folders is not a large number, so you might not need to worry about performance yet.
So don't optimize...
After that, you might want to consider the way you search your names. Instead of a separate search for every single element, you could create a regular expression that performs all the searches at once, but that is optimization without a real need...
First you need a reason to change the code
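For what it's worth, the straightforward version is only a few lines. Here is a minimal sketch, assuming the rule is "the name contains at least one required term and none of the excluded terms" and that matching is case-insensitive; the class, method, and parameter names are all made up for illustration. The second method shows the single-regex idea mentioned above, and note that an empty term list would produce a pattern that matches everything.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FolderFilter
{
    // Hypothetical rule: a folder qualifies when its name contains at least one
    // required term (if any are given) and none of the excluded terms.
    public static bool Qualifies(string folderName,
                                 IReadOnlyCollection<string> mustContainAny,
                                 IReadOnlyCollection<string> mustNotContain)
    {
        bool hasRequired = mustContainAny.Count == 0 ||
            mustContainAny.Any(t => folderName.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0);
        bool hasExcluded =
            mustNotContain.Any(t => folderName.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0);
        return hasRequired && !hasExcluded;
    }

    // Folds a term list into one alternation pattern so a single IsMatch call
    // checks all terms at once. Terms are escaped, so they match literally.
    public static Regex BuildAnyTermRegex(IEnumerable<string> terms)
    {
        string pattern = string.Join("|", terms.Select(Regex.Escape));
        return new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
    }
}
```

With 10,000 folder names and a handful of terms, either variant should finish almost instantly, so measure before picking the regex version for speed alone.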
I needed to assess the asymptotic time and space complexity of IEnumerable.Distinct in big O notation.
So I was looking at the implementation of the extension method Enumerable.Distinct, and I see it is implemented using an internal class Set<T>, which is almost a classical implementation of a hash table with "open addressing".
What quickly catches the eye is that a lot of code in Set<T> is just a copy-paste from HashSet<T>, with some omissions
However, this simplified Set<T> implementation has some obvious flaws, for example the Resize method not using prime numbers for the size of the slots, like HashSet<T> does, see HashHelpers.ExpandPrime
So, my questions are:
What is the reason for code duplication here, why not stick with DRY principle? Especially given the fact that both of these classes are in the same assembly System.Core
It looks like HashSet<T> will perform better, so should I avoid using Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>?
which is almost a classical implementation of a hash table with "open addressing"
Look again. It's separate chaining with list head cells. While the slots are all in an array, finding the next slot in the case of collision is done by examining the next field of the current slot. This has better cache efficiency than using linked lists with each node as a separate heap object, though not as good as open addressing in that regard. At the same time, it avoids some of the cases where open addressing does poorly.
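To make the layout concrete, here is a small illustration of separate chaining where the chain links are indices into one slot array. This is not the actual Set<T> source, just a sketch of the same idea for int values; the type name, the growth policy, and the bucket function are my own choices.

```csharp
using System;

// Illustrative only: a tiny set of ints using separate chaining in which every
// chain node lives in one shared array and links to the next node by index.
public sealed class ChainedIntSet
{
    private struct Slot
    {
        public int Value;
        public int Next; // index of the next slot in the same chain, -1 = end of chain
    }

    private int[] _buckets; // _buckets[b] - 1 is the index of the chain head (0 means empty)
    private Slot[] _slots;
    private int _count;

    public ChainedIntSet(int capacity = 7)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException(nameof(capacity));
        _buckets = new int[capacity];
        _slots = new Slot[capacity];
    }

    public int Count => _count;

    public bool Contains(int value)
    {
        int b = Bucket(value, _buckets.Length);
        // On a collision we follow Next, rather than probing neighbouring cells.
        for (int i = _buckets[b] - 1; i >= 0; i = _slots[i].Next)
            if (_slots[i].Value == value) return true;
        return false;
    }

    public bool Add(int value)
    {
        if (Contains(value)) return false;
        if (_count == _slots.Length) Resize();
        int b = Bucket(value, _buckets.Length);
        _slots[_count].Value = value;
        _slots[_count].Next = _buckets[b] - 1; // the new slot becomes the chain head
        _buckets[b] = _count + 1;
        _count++;
        return true;
    }

    private void Resize()
    {
        int newSize = checked(_count * 2 + 1); // deliberately not a prime search
        var newBuckets = new int[newSize];
        var newSlots = new Slot[newSize];
        Array.Copy(_slots, newSlots, _count);
        for (int i = 0; i < _count; i++) // rebuild every chain for the new size
        {
            int b = Bucket(newSlots[i].Value, newSize);
            newSlots[i].Next = newBuckets[b] - 1;
            newBuckets[b] = i + 1;
        }
        _buckets = newBuckets;
        _slots = newSlots;
    }

    private static int Bucket(int value, int size) => (value & 0x7FFFFFFF) % size;
}
```

Because all slots sit in one array, walking a chain touches far fewer separate heap objects than a linked list of nodes would, which is the cache-efficiency point made above.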
a lot of code in Set is just a copy-paste from HashSet, with some omissions
AFAICT the reason a private implementation of a hash-set was used is that Enumerable and HashSet were developed independently at about the same time. That's just conjecture on my part, but they were both introduced with .NET 3.5 so it's feasible.
It's quite possible that HashSet<T> started by copying Set<T> and then making it better serve being exposed publicly, though it's also possible that the two were both based on the same principle of separate chaining with list head cells
In terms of performance, HashSet's using prime numbers means it's more likely to avoid collisions with poor hashes (but just how much of an advantage that is, is not a simple question), but Set is lighter in a lot of ways, especially in .NET Core where some things it doesn't need were removed. In particular, that version of Set takes advantage of the fact that once an item is removed (which happens, for example, during Intersect) there will never be an item added, which allows it to leave out freelist and any work related to it, which HashSet couldn't do. Even the initial implementation is lighter in not tracking a version to catch changes during enumeration, which is a small cost, but a cost to every addition and removal nevertheless.
As such, with different sets of data with different distributions of hash codes sometimes one performs better, sometimes the other.
Especially given the fact that both of these classes are in the same assembly System.Core
Only in some versions of .NET, in some they're in separate assemblies. In .NET Core we had two versions of Set<T>, one in the assembly that has System.Linq and one in the separate assembly that has System.Linq.Expressions. The former got trimmed down as described above, the latter replaced with a use of HashSet<T> as it was doing less there.
Of course System.Core came first, but the fact that those elements could be separated out at all speaks of System.Core not being a single monolithic blob of inter-dependencies.
That there is now a ToHashSet() method in .NET Core's version of Linq makes the possibility of replacing Set<T> with HashSet<T> more justifiable, though not a no-brainer. I think #james-ko was considering testing the benefits of doing that.
It looks like HashSet<T> will perform better
For the reasons explained above, that might not be the case, though it might indeed, depending on source data. That's before getting into considerations of optimisations that go across a few different linq methods (not many in the initial versions of linq, but a good few in .NET Core).
so should I avoid using Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>.
Use Distinct(). If you have a bottleneck then it might be that HashSet<T> will win with a given data set, but if you do try that, make sure your profiling closely matches the real values your code will encounter in real life. There's no point deciding one approach is the faster based on some arbitrary tests if your application hits cases where the other does better. (And if I were finding this a problem spot, I'd first take a look at whether the GetHashCode() of the types in question could be improved for either speed or distribution of bits.)
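If you do want something to profile against Distinct(), a HashSet<T>-based equivalent is short. This is only a sketch for comparison purposes; the method name DistinctViaHashSet is made up, and it is not a claim that this version is faster.

```csharp
using System.Collections.Generic;

public static class DistinctExtensions
{
    // Hypothetical alternative to Enumerable.Distinct, for benchmarking only.
    // Like Distinct, it is lazy and yields items in first-seen order.
    public static IEnumerable<T> DistinctViaHashSet<T>(this IEnumerable<T> source,
                                                       IEqualityComparer<T> comparer = null)
    {
        var seen = new HashSet<T>(comparer); // a null comparer means the default comparer
        foreach (T item in source)
        {
            if (seen.Add(item)) // Add returns false when the item is already present
                yield return item;
        }
    }
}
```

Run both over data that looks like your production data before drawing any conclusion.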
I have the following problem. I have to store a list of unique words in multiple languages in memory and of course when I add new words I have to check whether the new word already exist.
Of course this needs to be blazingly fast, primarily because of the huge number of words.
I was thinking about implementing a Suffix Tree, but I wondered whether there is an easier approach with some already implemented internal structures.
P.S. Number of words ≈ 10^7.
First, note that suffix trees might be overkill here, since they allow fast search for any suffix of any word, which is more than what you are looking for. A trie is a very similar data structure that also allows fast search for a word, but since it does not support fast search for any suffix, its creation is simpler (both to program and in terms of efficiency).
Another, simpler alternative is a plain hash table, which is available in C# as HashSet<T>. While a HashSet is in theory slower in the worst case, the average case for each lookup takes constant time, and it might be enough for your application.
My suggestion is:
First try using a HashSet, which requires much less effort to implement, benchmark it and check if it is enough.
Make sure your data structure is swappable, so you can replace it with very little effort if you later decide to. This is usually done by introducing an interface that is responsible for the addition and lookup of words; if you need it changed, just introduce a different implementation of the interface.
If you do decide to add suffix tree or trie - use community resources, no need to reinvent the wheel - someone has already implemented most of these data structures, and they are available online.
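As a rough illustration of the first two suggestions, the interface can be as small as this. The names IWordSet and HashWordSet are invented for the example, and ordinal (code-unit by code-unit) comparison is an assumption; multilingual data may need normalization before insertion.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical abstraction: callers depend on this, not on the concrete structure,
// so a trie-backed implementation could replace the hash-based one later.
public interface IWordSet
{
    bool Add(string word);      // returns false if the word was already present
    bool Contains(string word);
    int Count { get; }
}

public sealed class HashWordSet : IWordSet
{
    // Ordinal comparison: exact equality with no culture-specific rules (an assumption).
    private readonly HashSet<string> _words = new HashSet<string>(StringComparer.Ordinal);

    public bool Add(string word) => _words.Add(word);
    public bool Contains(string word) => _words.Contains(word);
    public int Count => _words.Count;
}
```

Since Add already reports whether the word was new, there is no need for a separate Contains call before each insertion.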
I have "successfully" implemented a non recombining trinomial tree to price certain fixed-income derivatives. (Something like shown in the picture below - but with three branches that don't reconnect)
Unfortunately it turned out that the number of nodes I can use is severely limited by the available memory. If I build a tree with 20 time-steps, this results in 3^19 nodes in the final time step alone (about 1.16 billion nodes).
The nodes of each time step are saved in a List<Node>, and these lists are stored in a Dictionary<double,List<Node>>.
Each node is instantiated via new Node(...). I also instantiate each of the lists and the dictionary via new Class(). Perhaps this is the source of my error.
Also, System.OutOfMemoryException isn't thrown because the Dictionary/List object is too large (as is often the case) but because I seem to have too many Nodes - after a while new Node(...) can't allocate any further memory. Eventually the 2GB max List capacity will also kick in, I think, seeing as the List grows exponentially larger with each time step.
Perhaps my data-structure is too wasteful or not really suited for the task at hand.
A possible solution could be to save the tree to a text-file thus avoiding the memory-problem completely. This however would necessitate a HUGE workaround.
Edit:
To add some more background: I need the tree to price path-dependent products. This means that unfortunately I will have to access all the nodes. What is more, after the tree has been built I start from the leaves and go backwards in time to determine the price. I also already generate only the nodes I need.
Edit2:
I have given the topic some thought and also considered the various responses. Could it be that I just need to serialize the respective tree levels to the hard drive? So basically: I create one time-step (List<Node>), write it to disk, and so on. Later, when I start from the leaves, I will just have to load the levels in reverse order.
You basically have two choices: evaluate only the branches you care about (Andrew's yield) and don't store results, or build up your tree, save it to disk, and implement a custom collection interface on top of it that accesses the right part of the disk. In the second case you still keep only a minimal amount of data in your process memory and rely on the OS to do proper disk caching to make access fast. If you start working with large data sets, the second option is a good tool to have in your tool belt, so you should probably write this with reuse in mind.
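A very rough sketch of the disk-backed option, writing one file per time step and streaming it back node by node. Everything here is an assumption for illustration: the node payload is reduced to two doubles, and the LevelStore name and the level_N.bin file naming are invented.

```csharp
using System.Collections.Generic;
using System.IO;

public static class LevelStore
{
    // Hypothetical node payload: two doubles per node; real nodes will differ.
    public static void WriteLevel(string dir, int step,
                                  IEnumerable<(double Rate, double Value)> nodes)
    {
        using (var writer = new BinaryWriter(File.Create(PathFor(dir, step))))
        {
            foreach (var node in nodes)
            {
                writer.Write(node.Rate);
                writer.Write(node.Value);
            }
        }
    }

    // Streams a level back one node at a time, so a whole level never has to
    // be held in memory at once.
    public static IEnumerable<(double Rate, double Value)> ReadLevel(string dir, int step)
    {
        using (var reader = new BinaryReader(File.OpenRead(PathFor(dir, step))))
        {
            long count = reader.BaseStream.Length / (2 * sizeof(double));
            for (long i = 0; i < count; i++)
            {
                double rate = reader.ReadDouble();
                double value = reader.ReadDouble();
                yield return (rate, value);
            }
        }
    }

    private static string PathFor(string dir, int step) =>
        Path.Combine(dir, "level_" + step + ".bin");
}
```

Working backwards then means calling ReadLevel for the last step first. Keep in mind that at 16 bytes per node a level of 3^19 nodes is still about 18.6 GB on disk, so this moves the problem rather than removing it.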
What we have here is a classic problem of doing an enormous amount of processing up front... and then storing EVERYTHING into memory to be processed at a later time.
While simple, given harsh enough conditions (like having a billion entries), it will eat up all the memory.
Now, the OP didn't really specify what the intention of the tree was or how it was going to be used... but I would propose that instead of building it all at once... build it as you need it.
Lazy Evaluation with yield
Instead of doing everything all at once and having to store it... it might be ideal to do it ONLY when you actually require it. Check out this post for more info and examples of using yield.
This won't work great, though, if you need to traverse the tree a bunch of times... but it might still allow you to go deeper than you currently do.
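Here is a minimal sketch of what the lazy approach could look like. The Node type, the additive rate step (delta), and the method names are placeholders I made up; a real non-recombining tree would carry whatever path state the product needs.

```csharp
using System.Collections.Generic;

public static class LazyTree
{
    // Hypothetical node: just a time step and a rate; real nodes carry more state.
    public readonly struct Node
    {
        public Node(int step, double rate) { Step = step; Rate = rate; }
        public int Step { get; }
        public double Rate { get; }
    }

    // The three children (up / middle / down) are produced on demand via yield
    // instead of being stored in a list.
    public static IEnumerable<Node> Children(Node parent, double delta)
    {
        yield return new Node(parent.Step + 1, parent.Rate + delta);
        yield return new Node(parent.Step + 1, parent.Rate);
        yield return new Node(parent.Step + 1, parent.Rate - delta);
    }

    // Depth-first walk: only the current path is alive at any time, so memory
    // use grows with the depth of the tree, not with the number of nodes.
    public static long CountLeaves(Node node, int lastStep, double delta)
    {
        if (node.Step == lastStep) return 1;
        long total = 0;
        foreach (Node child in Children(node, delta))
            total += CountLeaves(child, lastStep, delta);
        return total;
    }
}
```

A pricing routine would have the same shape as CountLeaves: return the payoff at a leaf, and otherwise combine the values returned by the children, which gives you the "start from the leaves and go backwards" order for free. The running time is still proportional to the number of nodes.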
I don't think Serializing to disk will help much. One, when you attempt to deserialize the list you will still run out of memory (as, to the best of my knowledge, there is no way to partially deserialize an object).
Have you considered changing your data structure into a relational database model and storing it in a SQL Server Express database?
This would give you the added benefit of performing queries with indexes instead of your custom tree traversal logic.
I will be creating a project that will use dictionary lookups and inserts quite a bit. Is this something to be concerned about?
Also, if I do benchmarking and such and it is really bad, then what is the best way of replacing dictionary with something else? Would using an array with "hashed" keys even be faster? That wouldn't help on insert time though will it?
Also, I don't think I'm micro-optimizing because this really will be a significant part of code on a production server, so if this takes an extra 100ms to complete, then we will be looking for new ways to handle this.
You are micro-optimizing. Do you even have working code yet? Remember, "If it doesn't work, it doesn't matter how fast it doesn't work." (Mich Ravera) http://www.codingninja.co.uk/best-programmers-quotes/.
You have no idea where the bottlenecks will be, and already you're focused on Dictionary. What if the problem is somewhere else?
How do you know how the Dictionary class is implemented? Maybe it already uses an array with hashed keys!
P.S. It's really ".NET Dictionaries", not "C# Dictionaries", because C# is just one of several programming languages that use the framework.
Hello, I will be creating a project that will use dictionary lookups and inserts quite a bit. Is this something to be concerned about?
Yes. It is always wise to consider performance factors up front.
The form that your concern should take is as follows: your concern should be encouraging you to write realistic, user-focused performance specifications. It should be encouraging you to start writing performance tests early, and running them often, so that you can see how every single change to the product affects performance. That way you will be informed immediately when a code change causes a user-affecting change in performance. And it should be encouraging you to run profiles often, so that you are reasoning about performance based on empirical measurements, rather than random guesses and hunches.
Also, if I do benchmarking and such and it is really bad, then what is the best way of replacing dictionary with something else?
The best way to do this is to build a reasonable abstraction layer. If you have a class (or interface) which represents the "insert" and "lookup" abstract data type, then you can replace its internals without changing any of the callers.
Note that adding a layer of abstraction itself has a performance cost. If your profiling shows that the abstraction layer is too expensive, if the extra couple nanoseconds per call is too much, then you might have to get rid of the abstraction layer. Again, this decision will be driven by real-world performance data.
Would using an array with "hashed" keys even be faster? That wouldn't help on insert time though will it?
Neither you nor anyone reading this can possibly know which one is faster until you write it both ways and then benchmark it both ways under real-world conditions. Doing it under "lab" conditions will skew your results; you'll need to understand how things work when the GC is under realistic memory pressure, and so on. You might as well ask us which horse will run faster in next year's Kentucky Derby. If we knew the answer just by looking at the racing form, we'd all be rich already. You can't possibly expect anyone to know which of two entirely hypothetical, unwritten pieces of code will be faster under unspecified conditions!
The Dictionary<TKey, TValue> class is actually implemented as a hash table which makes lookups very fast (close to O(1)). See the API documentation for more information. I doubt you could make a better implementation yourself.
Wait and see if the performance of your application is below expectations
If it is then use a profiler to determine if the Dictionary lookup is the source of the problem
If it is then do some tests with representative data to see if another choice of collection would be quicker.
In short - no, in general you shouldn't worry about the performance of implementation details until after you have a problem.
I would benchmark Dictionary<TKey, TValue>, the non-generic Hashtable, and perhaps a home-grown class, and see which works out best under your typical usage conditions.
Normally I would say it's fine (insert StackOverflow's favorite premature optimization quote here), but if this is a core piece of the application: benchmark, benchmark, benchmark.
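A bare-bones timing harness might look like the following. It is only a starting point: sequential int keys are close to a best case, and the DictionaryBenchmark name is made up. Substitute your real key type, data volume, and access pattern before trusting any number it prints.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class DictionaryBenchmark
{
    // Times n inserts followed by n successful lookups and returns the elapsed time.
    public static TimeSpan Run(int n)
    {
        var dict = new Dictionary<int, int>();
        var stopwatch = Stopwatch.StartNew();

        for (int i = 0; i < n; i++)
            dict[i] = i;

        long sum = 0;
        for (int i = 0; i < n; i++)
        {
            if (dict.TryGetValue(i, out int value))
                sum += value;
        }

        stopwatch.Stop();

        // Using the result keeps the loops from being optimized away and
        // doubles as a sanity check on the lookups.
        if (sum != (long)n * (n - 1) / 2)
            throw new InvalidOperationException("Unexpected lookup result.");

        return stopwatch.Elapsed;
    }
}
```

Run it several times and discard the first run, since JIT compilation is included in the first measurement.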
The only concern that I can think of is that the speed of the dictionary relies on the key class having a reasonably fast GetHashCode method. Lookups and inserts are really fast, so you shouldn't have any problem there.
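Regarding using an array, that's what the Dictionary class does already. Actually it uses two arrays: one of bucket indexes, and one of entries that each hold the key, the value, the hash code, and the index of the next entry in the same bucket.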
If you would have any performance problems with a Dictionary, it would be quite easy to make a wrapper for any kind of storage, that has the same methods and behaviour as a Dictionary so that you can replace it seamlessly.
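To illustrate the GetHashCode point, here is a sketch of a custom key type. GridPoint is a made-up example; the multiplier 397 is just a commonly used odd prime, not something the framework requires.

```csharp
using System;

// Hypothetical key type with a cheap, consistent Equals/GetHashCode pair.
public readonly struct GridPoint : IEquatable<GridPoint>
{
    public GridPoint(int x, int y) { X = x; Y = y; }
    public int X { get; }
    public int Y { get; }

    // Implementing IEquatable<T> lets Dictionary compare keys without boxing.
    public bool Equals(GridPoint other) => X == other.X && Y == other.Y;

    public override bool Equals(object obj) => obj is GridPoint other && Equals(other);

    // Mixes both fields; equal keys must always produce equal hash codes.
    public override int GetHashCode() => unchecked((X * 397) ^ Y);
}
```

A hash that is slow, or that returns the same value for many different keys, is what actually makes dictionary lookups slow, because every colliding key has to be compared with Equals.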
I'm not sure that anyone has really answered this part yet:
Also, if I do benchmarking and such and it is really bad, then what is the best way of replacing dictionary with something else?
For this, wherever possible, declare your variables as IDictionary<TKey, TValue>. That's the main interface that Dictionary implements. (I'm assuming that if you care that much about performance, then you aren't considering non-generic collections.) Then, in the future, you can change the underlying implementation class without having to change any of the code that uses that dictionary. For example:
IDictionary<string, int> myDict = new Dictionary<string, int>();
If your application is multithreaded, then the key part of performance is going to be synchronizing this Dictionary correctly.
If it is single-threaded, then the bottleneck will almost certainly be elsewhere, such as in reading these objects from wherever you are reading them.
I use Dictionary for a UDP relay server. Each time a packet arrives it performs Dictionary.ContainsKey and Dictionary[Key], and it works great (massive number of clients). I had concerns when I was making the thing, but it turned out that it was the last thing I should worry about.
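As a side note on the ContainsKey plus indexer pattern: TryGetValue does the same job with a single hash lookup. A minimal sketch, with a made-up client table of string ids mapped to string targets:

```csharp
using System.Collections.Generic;

public static class RelayLookup
{
    // Hypothetical client table: client id -> relay target.
    // Returns null when the client is unknown.
    public static string FindTarget(Dictionary<string, string> clients, string clientId)
    {
        // One lookup instead of ContainsKey followed by clients[clientId].
        return clients.TryGetValue(clientId, out string target) ? target : null;
    }
}
```

Whether the saved lookup is measurable depends entirely on the workload; as noted above, the two-lookup version was already fast enough here.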
Have a look at C# HybridDictionary Usage
HybridDictionary Class
This class is recommended for cases where the number of elements in a dictionary is unknown. It takes advantage of the improved performance of a ListDictionary with small collections, and offers the flexibility of switching to a Hashtable which handles larger collections better than ListDictionary
You may consider using the C5 library. I've found it to be very fast and thoughtfully designed. Others on stackoverflow have found the same. With C5 you have the option of using general type interfaces (with a capital I), or directly the data structures underneath. Naturally the interfaces allow you to swap out different implementations, but I have found in performance testing that the interfaces will cost you.
You may want to look at the KeyedCollection class in System.Collections.ObjectModel. From the MSDN description, "provides the abstract base class for a collection whose keys are embedded in the values."
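For anyone who hasn't used it, deriving from KeyedCollection only requires telling it how to extract the key. The Customer type below is invented for the example.

```csharp
using System.Collections.ObjectModel;

// Hypothetical item type whose key (Id) is stored inside the item itself.
public sealed class Customer
{
    public Customer(string id, string name) { Id = id; Name = name; }
    public string Id { get; }
    public string Name { get; }
}

public sealed class CustomerCollection : KeyedCollection<string, Customer>
{
    // The only member that must be overridden: extract the key from an item.
    protected override string GetKeyForItem(Customer item) => item.Id;
}
```

After customers.Add(new Customer("a1", "Ann")), the item is reachable both by position (customers[0]) and by key (customers["a1"]), and adding a second item with the same Id throws an ArgumentException.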
In our desktop application, we have implemented a simple search engine using an inverted index.
Unfortunately, some of our users' datasets can get very large, e.g. taking up ~1GB of memory before the inverted index has been created. The inverted index itself takes up a lot of memory, almost as much as the data being indexed (another 1GB of RAM).
Obviously this creates problems with out of memory errors, as the 32 bit Windows limit of 2GB memory per application is hit, or users with lesser spec computers struggle to cope with the memory demand.
Our inverted index is stored as a:
Dictionary<string, List<ApplicationObject>>
And this is created during the data load when each object is processed such that the applicationObject's key string and description words are stored in the inverted index.
So, my question is: is it possible to store the search index more efficiently space-wise? Perhaps a different structure or strategy needs to be used? Alternatively is it possible to create a kind of CompressedDictionary? As it is storing lots of strings I would expect it to be highly compressible.
If it's going to be 1GB... put it on disk. Use something like Berkeley DB. It will still be very fast.
Here is a project that provides a .net interface to it:
http://sourceforge.net/projects/libdb-dotnet
I see a few solutions:
If you have the ApplicationObjects in an array, store just the index - might be smaller.
You could use a bit of C++/CLI to store the dictionary, using UTF-8.
Don't bother storing all the different strings, use a Trie
I suspect you may find you've got a lot of very small lists.
I suggest you find out roughly what the frequency is like - how many of your dictionary entries have single element lists, how many have two element lists etc. You could potentially store several separate dictionaries - one for "I've only got one element" (direct mapping) then "I've got two elements" (map to a Pair struct with the two references in) etc until it becomes silly - quite possibly at about 3 entries - at which point you go back to normal lists. Encapsulate the whole lot behind a simple interface (add entry / retrieve entries). That way you'll have a lot less wasted space (mostly empty buffers, counts etc).
If none of this makes much sense, let me know and I'll try to come up with some code.
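A rough sketch of that idea, combined with the earlier suggestion of storing an index instead of an object reference. It only goes as far as one special case (single-element entries) before falling back to lists; the CompactIndex name is made up, and the int ids are assumed to be positions in some array of ApplicationObjects held elsewhere.

```csharp
using System.Collections.Generic;

// Sketch: terms with a single posting store the object id directly; a List<int>
// is allocated only once a term has two or more postings.
public sealed class CompactIndex
{
    private readonly Dictionary<string, int> _single = new Dictionary<string, int>();
    private readonly Dictionary<string, List<int>> _multi = new Dictionary<string, List<int>>();

    public void Add(string term, int objectId)
    {
        if (_multi.TryGetValue(term, out List<int> list))
        {
            list.Add(objectId);
            return;
        }

        if (_single.TryGetValue(term, out int first))
        {
            // Second posting for this term: promote it to a real list.
            _single.Remove(term);
            _multi[term] = new List<int>(2) { first, objectId };
            return;
        }

        _single[term] = objectId;
    }

    public IReadOnlyList<int> Lookup(string term)
    {
        if (_single.TryGetValue(term, out int id)) return new[] { id };
        if (_multi.TryGetValue(term, out List<int> list)) return list;
        return new int[0];
    }
}
```

Whether this saves anything worthwhile depends on the frequency distribution mentioned above, so measure the share of single-element lists first.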
I agree with bobwienholt, but if you are indexing datasets, I assume these came from a database somewhere. Would it make sense to just search that with a search engine like DTSearch or Lucene.net?
You could take the approach Lucene did. First, you create a random-access in-memory stream (System.IO.MemoryStream); this stream mirrors an on-disk one, but only a portion of it (if you have the wrong portion, load up another one off the disk). This does cause one headache: you need a file-mappable format for your dictionary. Wikipedia has a description of the paging technique.
On the file-mappable scenario: if you open up Reflector and reflect the Dictionary class, you will see that it is made up of buckets. You can probably use each of these buckets as a page and physical file (this way inserts are faster). You can then also loosely delete values by simply inserting an "item x deleted" value into the file and every so often cleaning the file up.
By the way, a bucket holds the entries whose hash codes map to the same slot. It is very important that the key type you store overrides GetHashCode() with a well-distributed hash, and overrides Equals() to match (the compiler warns you if you override Equals() without GetHashCode()). A custom key type with a poor hash will slow lookups down significantly.
How about using Memory Mapped File Win32 API to transparently back your memory structure?
http://www.eggheadcafe.com/articles/20050116.asp has the PInvokes necessary to enable it.
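For readers on .NET 4.0 or later, the System.IO.MemoryMappedFiles namespace offers a managed API for this, so the PInvokes are no longer needed. A minimal sketch, assuming the data can be flattened to an array of ints (for example, posting ids); the MappedInts class and its methods are made up for the example.

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

public static class MappedInts
{
    // Writes the values into a new file through a memory-mapped view.
    // The array must be non-empty, because the mapping needs a positive capacity.
    public static void Write(string path, int[] values)
    {
        long bytes = (long)values.Length * sizeof(int);
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Create, null, bytes))
        using (var view = mmf.CreateViewAccessor(0, bytes))
        {
            view.WriteArray(0, values, 0, values.Length);
        }
    }

    // Maps only the four bytes that hold the requested element and reads them.
    public static int ReadAt(string path, long index)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor(index * sizeof(int), sizeof(int)))
        {
            return view.ReadInt32(0);
        }
    }
}
```

In a real index you would keep the mapping open for the lifetime of the index rather than reopening it per read, and on 32-bit processes you would map modest-sized views, since the address space is still limited.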
Is the index only added to or do you remove keys from it as well?