Help with C#.NET generic collections performance and optimization

I am trying to optimize a piece of .NET 2.0 C# code that looks like this:
Dictionary<myType, string> myDictionary = new Dictionary<myType, string>();
// some other stuff
// inside a loop check if key is there and if not add element
if(!myDictionary.ContainsKey(currentKey))
{
myDictionary.Add(currentKey, "");
}
It looks like whoever wrote this code used a Dictionary even though one is not needed (only the keys are used, to store a list of unique values), presumably because it is faster to search than a List of myType objects.
This seems obviously wrong, as only the key of the dictionary is used, but I am trying to understand the best way to fix it.
Questions:
1) I seem to understand I would get a good performance boost even just using .NET 3.5 HashSet. Is this correct?
2) What would be the best way to optimize the code above in .NET 2.0 and why?
EDIT:
This is existing code I am trying to optimize. It loops through tens of thousands of items and calls ContainsKey for each one of them. There's gotta be a better way of doing it (even in .NET 2.0)! :)

I think you need to break this down into 2 questions
Is Dictionary<myType,string> the best available type for this scenario
No. Based on your breakdown, HashSet<myType> is clearly the better choice because its usage pattern more accurately fits the scenario
Will switching to HashSet<myType> give me a performance boost?
This is really subjective and only a profiler can give you the answer to this question. Likely you'll see a very minor memory size improvement per element in the collection. But in terms of raw computing power I doubt you'll see a huge difference. Only a profiler can tell you if there is one.
Before you ever make a performance related change to your code remember the golden rule.
Don't make any performance related changes until a profiler has told you precisely what is wrong with your code.
Changes that violate this rule are just guesses. A profiler is the only way to measure the success of a performance fix.
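As a rough illustration of why HashSet<T> fits this usage pattern, here is a minimal sketch. It assumes .NET 3.5 or later and uses string in place of myType; the UniqueKeys and Collect names are made up for the example. Whether it is measurably faster is still something only a profiler can tell you.

```csharp
using System;
using System.Collections.Generic;

public static class UniqueKeys
{
    // Collects the distinct values from a sequence. HashSet<T>.Add returns
    // false when the item is already present, so no separate Contains call
    // is needed before adding.
    public static HashSet<string> Collect(IEnumerable<string> items)
    {
        HashSet<string> seen = new HashSet<string>();
        foreach (string item in items)
        {
            seen.Add(item); // duplicates are ignored, no exception is thrown
        }
        return seen;
    }

    public static void Main()
    {
        HashSet<string> seen = Collect(new string[] { "a", "b", "a", "c", "b" });
        Console.WriteLine(seen.Count); // prints 3
    }
}
```

The set stores only the keys, which matches the intent of the original code more directly than a dictionary with dummy values.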

1) No. A dictionary does a hash on the key, so your lookup should be O(1). A HashSet should need less memory, though. But honestly, the difference is too small for you to really see a performance boost.
2) Give us some more detail as to what you are trying to accomplish. The code you posted is pretty simple. Have you measured yet? Are you seeing that this method is slow? Don't forget "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." -- Donald Knuth

Depending on the size of your keys, you may actually see performance degrade.
One way in 2.0 would be to try to insert it and catch the exception (of course, this depends on how many duplicate keys you plan on having):
foreach(string key in keysToAdd)
{
try
{
dictionary.Add(key, "myvalue");
}
catch(ArgumentException)
{
// do something about extra key
}
}

The obvious mistake (if we discuss performance) I can see is the double work done when calling ContainsKey and then adding the key-value pair. When the pair is added using the Add method, the key is checked for presence again internally. The whole if block can be safely replaced by this:
...
myDictionary[currentKey] = "";
...
If the key already exists, the value will simply be replaced and no exception will be thrown. Moreover, if the value is not used at all, I would personally fill it with null values. I can see no reason for using a string constant there.
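A small sketch of the indexer approach, written to compile on .NET 2.0. The IndexerDemo and CountDistinct names are illustrative, and string stands in for myType.

```csharp
using System;
using System.Collections.Generic;

public static class IndexerDemo
{
    public static int CountDistinct(IEnumerable<string> keys)
    {
        // The value is never read, so null serves as a placeholder.
        Dictionary<string, object> seen = new Dictionary<string, object>();
        foreach (string key in keys)
        {
            // The indexer setter adds the key if it is missing and overwrites
            // the value otherwise, so each item costs one lookup, not two.
            seen[key] = null;
        }
        return seen.Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountDistinct(new string[] { "x", "y", "x" })); // prints 2
    }
}
```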

The possible performance degrade mentioned by scottm is not for doing simple lookups. It is for calculating the intersection between 2 sets. HashSet does have slightly faster lookups than Dictionary. The performance difference really is going to be very small, though, as everyone says -- the lookup takes most of the time & creating the KeyValuePair takes very little.
For 2.0, you could make the "Value" object one of these:
public struct Empty {}
It may do slightly better than the "".
Or you could try making a reference to System.Core.dll in your 2.0 project, so you can use the HashSet.
Also, make sure that GetHashCode and Equals are as efficient as possible for MyType. I've been bitten by using a dictionary on something with a really slow GetHashCode (I believe we tried to use a delegate as a key or something like that.)
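To make the GetHashCode/Equals point concrete, here is a sketch of a cheap key type. PartKey and its fields are hypothetical stand-ins for MyType; the idea shown is only that an immutable key can compute its hash once and compare fields directly.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type; the fields are invented for the example.
public sealed class PartKey : IEquatable<PartKey>
{
    private readonly int _vendorId;
    private readonly string _code;
    private readonly int _hash;

    public PartKey(int vendorId, string code)
    {
        if (code == null) throw new ArgumentNullException("code");
        _vendorId = vendorId;
        _code = code;
        // The fields never change, so the hash can be computed once here.
        unchecked { _hash = (_vendorId * 397) ^ _code.GetHashCode(); }
    }

    // Typed Equals avoids a cast when the dictionary compares keys.
    public bool Equals(PartKey other)
    {
        if (ReferenceEquals(other, null)) return false;
        return _vendorId == other._vendorId && _code == other._code;
    }

    public override bool Equals(object obj) { return Equals(obj as PartKey); }

    public override int GetHashCode() { return _hash; }
}

public static class PartKeyDemo
{
    public static void Main()
    {
        Dictionary<PartKey, string> map = new Dictionary<PartKey, string>();
        map[new PartKey(1, "A")] = "first";
        // A separately constructed but equal key finds the same entry.
        Console.WriteLine(map.ContainsKey(new PartKey(1, "A"))); // prints True
    }
}
```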

Related

Maintaining data locality in a Dictionary<TKey,TValue>

I'm making a game and I decided that for reasons, I'd give each game object an int entity ID that I could easily search them by instead of having to linearly search a list or worse, many lists. The idea was inspired by the ECS pattern and I figured if I made sure to re-use ints when they were destroyed, it would help keep all the data close together in memory and reduce cache misses by a bit. (I know that depends more on access order, just thinking in the abstract here). The problem is I'm now doubting myself and I've read so much that I can't keep the ideas straight in my head.
The question is essentially if I keep endlessly adding higher numbered keys to a Dictionary<int, SomeClass>, will the speed/memory usage be worse than if I try to re-use lower numbers?
Note: I feel like the answer is going to be "write your own class" but I was trying to avoid that and I don't think I'd do a good job if I don't understand this concept.

No, it makes no difference at all. From MSDN:
The Dictionary generic class provides a mapping from a set of keys to a set of values. Each addition to the dictionary consists of a value and its associated key. Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table.
So, the speed will always be O(1) because it internally uses a hash table; the value of the key doesn't affect it at all.
The only problem you could face is reaching int.MaxValue, and whether that matters depends on your scenario.

Okay here's my best effort at answering this myself, apologies if I get anything wrong.
Short answer: No. If you add higher numbers they just get stuck somewhere into the array until it's full. The solution to the example problem is to just replace the dictionary with a GameObject array and use the int as an index, and if necessary write a class to handle expanding it.
Longer answer: I think my confusion came from reading somewhere that a dictionary was just a pair of parallel arrays or something like that. I guess that's true but since it's indexed by hash codes, it's not intended for contiguous index values. So it's doing a bunch of redundant work to handle cases that I'm never going to use it for.
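The array-indexed approach described above could be sketched roughly as follows. EntityTable is a made-up name and this is only one possible design, assuming IDs are small non-negative ints handed out by the table itself and reused after removal.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical ID-indexed store: the int ID is the array index.
public sealed class EntityTable<T> where T : class
{
    private T[] _slots = new T[4];
    private int _next;                                    // next never-used index
    private readonly Stack<int> _free = new Stack<int>(); // released IDs to reuse

    // Stores the item and returns its ID, preferring a previously freed ID.
    public int Add(T item)
    {
        if (item == null) throw new ArgumentNullException("item");
        int id;
        if (_free.Count > 0)
        {
            id = _free.Pop();
        }
        else
        {
            if (_next == _slots.Length) Array.Resize(ref _slots, _slots.Length * 2);
            id = _next++;
        }
        _slots[id] = item;
        return id;
    }

    // Returns the item for the ID, or null if the ID is unused.
    public T Get(int id)
    {
        return (id >= 0 && id < _next) ? _slots[id] : null;
    }

    public void Remove(int id)
    {
        if (id < 0 || id >= _next || _slots[id] == null) return;
        _slots[id] = null;
        _free.Push(id);
    }
}

public static class EntityTableDemo
{
    public static void Main()
    {
        EntityTable<string> table = new EntityTable<string>();
        int a = table.Add("player"); // 0
        int b = table.Add("enemy");  // 1
        table.Remove(a);
        int c = table.Add("pickup"); // reuses 0
        Console.WriteLine(a + " " + b + " " + c); // prints 0 1 0
    }
}
```

Note that the array only keeps the references contiguous. If T is a class, the objects themselves can still live anywhere on the heap.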

Mass Update a property on multiple records inside a dictionary (VB.NET / C#)

I have a Dictionary (of Long, Class), where Class has multiple properties (assume we have a property called Updated as Boolean).
I want to update this (Updated) property to (True) at once for let's say all Odd key records (or based on any specific rule). What is the best way to do so?
My thoughts are to use Linq to fetch those records then (for each) them, but is there any better way to do so like doing a mass update where a condition happens (like what we do in the database)?
An example of my approach is below. Appreciate it if there is a better way to do such an update...
Thanks
Dim ReturnedObjs = From Obj In Dictionary Where Obj.Key Mod 2 = 1
For Each item As KeyValuePair(Of Long, Class) In ReturnedObjs
item.Value.Updated = True
Next

First, this sounds like an obvious case for the speed rant:
https://ericlippert.com/2012/12/17/performance-rant/
Second:
The best way is to keep this in the database. You are not going to beat the speed of a DB query with indexes designed for quick matching by transferring the data over the network twice (once to get it, once to return it) and doubling the search load (once to get all odd ones, once to update all the ones you just changed). My standing advice is to always keep as much work as possible on the DB side. Your client code will never be able to beat it.
Third:
If you do need to use client side processing:
Now a lot of my answer depends on details of the implementation, how the JIT and general compiler optimisations work, etc.
foreach works on enumerators, not collections. But if you feed a collection to foreach, an enumerator is implicitly created. Enumerators have two properties:
If the collection changes, the enumerator becomes invalid. Most people learn about enumerators because they ran into this issue.
It is an extra function call and set of checks for accessing a collection. So it will be a slowdown. How much is hard to say, as the optimisations and JIT are pretty good.
So you probably want to use a for loop instead.
If you could turn the Dictionary into a collection where the primary key is used as the index, it might be a bit faster. But that has the danger of running into a lot of "dry spells" (unused gaps) in the data, so it depends a lot on your source data.
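For the in-memory case, a C# sketch of the update loop from the question might look like this. Record is a hypothetical stand-in for the question's class. Setting a property on a stored object does not modify the dictionary itself, so the enumerator stays valid during the loop.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical value type standing in for the question's class.
public sealed class Record
{
    public bool Updated;
}

public static class MassUpdate
{
    // Sets Updated on every record with an odd key; returns how many matched.
    public static int MarkOddKeys(Dictionary<long, Record> records)
    {
        int count = 0;
        foreach (KeyValuePair<long, Record> pair in records)
        {
            if (pair.Key % 2 != 0)
            {
                pair.Value.Updated = true; // mutates the object, not the dictionary
                count++;
            }
        }
        return count;
    }

    public static void Main()
    {
        Dictionary<long, Record> records = new Dictionary<long, Record>();
        for (long i = 1; i <= 5; i++) records[i] = new Record();
        Console.WriteLine(MarkOddKeys(records)); // prints 3
    }
}
```

This is a single pass over the dictionary, which avoids the separate query-then-update steps of the LINQ version.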

C# Efficiency for method parameters

Am I correct in saying that this:
public static void MethodName(bool first, bool second, bool third)
{
//Do something
}
Is more efficient than this:
public static void MethodName(bool[] boolArray)
{
bool first = boolArray[0];
bool second = boolArray[1];
bool third = boolArray[2];
//Do something
}
My thoughts are that for both they would have to declare first, second and third - just in different places. But for the second one it has to add it into an array and then unpack it again.
Unless you declared the array like this:
MethodName(new[] { true, true, true });
In which case I am not sure which is faster?
I ask because I am thinking of using the second one but wanted to know if/what the implications are on performance.
In this case performance is not particularly important, but it would be helpful for me to clarify this point.
Also, the second one has the advantage that you can pass as many values as you like to it, and it is also easier to read I think?
The reason I am thinking of using this is because there are already about 30 parameters being passed into the method and I feel it is becoming confusing to keep adding more. All these bools are closely related so I thought it may make the code more manageable to package them up.
I am working on existing code and it is not in my project scope to spend time reworking the method to decrease the number of parameters that are passed into the method, but I thought it would be good practice to understand the implications of this change.

In terms of performance, there's just an answer for your question:
"Programmers waste enormous amounts of time thinking about, or
worrying about, the speed of noncritical parts of their programs, and
these attempts at efficiency actually have a strong negative impact
when debugging and maintenance are considered. We should forget about
small efficiencies, say about 97% of the time: premature optimization
is the root of all evil. Yet we should not pass up our opportunities
in that critical 3%."
In terms of productivity, parameters > arrays.
Side note
Everyone should know that this was said by Donald Knuth in 1974. More than 40 years later, we still fall into premature optimization (or even pointless optimization) very often!
Further reading
I would take a look at this other Q&A on Software Engineering

Am I correct in saying that this:
Is more efficient than this:
In isolation, yes. Unless the caller already has that array, in which case the second is the same or even (for larger argument types or more arguments) minutely faster.
I ask because I am thinking of using the second one but wanted to know if/what the implications are on performance.
Why are you thinking about the second one? If it is more natural at the point of the call then the reasons making it more natural are likely going to also have a performance impact that makes the second the better one in the wider context that outweighs this.
If you're starting off with three separate bools and you're wrapping them just to unwrap them again then I don't see what this offers in practice except for more typing.
So your reason for considering this at all is the more important thing here.
In this case performance is not particularly important
Then really don't worry about it. It's certainly known for hot-path code that hits params to offer overloads that take set numbers of individual parameters, but it really does only make a difference in hot paths. If you aren't in a hot path the lifetime saving of computing time of picking whichever of the two is indeed more efficient is unlikely to add up to the amount of time it took you to write your post here.
If you are in a hot path and really need to shave off every nanosecond you can because you're looping so much that it will add up to something real, then you have to measure. Isolated changes have non-isolated effects when it comes to performance, so it doesn't matter whether the people on the Internet tell you A is faster than B if the wider context means the code calling A is slower than B. Measure. Measurement number one is "can I even notice?", if the answer to that measurement is "no" then leave it alone and find somewhere where the performance impact is noticeable to optimise instead.
Write "natural" code to start with, before seeing if little tweaks can have a performance impact in the bits that are actually hurting you. This isn't just because of the importance of readability and so on, but also because:
The more "natural" code in a given language very often is the more efficient. Even if you think it can't be, it's more likely to benefit from some compiler optimisation behind the scenes.
The more "natural" code is a lot easier to tweak for performance when it is necessary than code doing a bunch of strange things.

I don't think this would affect the performance of your app at all.
Personally
I'd go with the first option for two reasons:
Naming each parameter: if the project is a large scale project and there is a lot of coding or for possible future edits and enhancements.
Usability: if you are sending a list of similar parameters then you should use an array or a list; if it is just a couple of parameters that happen to be of the same type then you should send them separately.

A third way would be to use params (see Params - MSDN).
In the end I don't think it will change much in performance.
An array, though, inherits from the abstract Array class, which implements ICloneable, IList, ICollection, IEnumerable, IStructuralComparable and IStructuralEquatable (and, for T[], the generic IEnumerable<T>). This means the object is heavier than three value-type parameters, which will make it slower.
Array - MSDN

You could test performance differences on both, but I doubt there would be much difference.
You have to consider maintainability, is another programmer, or even yourself going to understand why you did it that way in a few weeks, or a few months time when it's time for review? Is it easily extended, can you pass different object types through to your method?
If you're passing a collection of items, then certainly packing them into an array would be quicker than specifying a new parameter for each additional item?
If you have to, you can do it that way, but have you considered a params array?
Why use the params keyword?
public static void MethodName(params bool[] boolArray)
{
//extract data here
}
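A short sketch of how a params method can be called. ParamsDemo and CountTrue are illustrative names, not part of the original code. The compiler builds the array at the call site when individual arguments are passed, and passes an existing array through unchanged.

```csharp
using System;

public static class ParamsDemo
{
    // Accepts any number of bools, either as separate arguments or as an array.
    public static int CountTrue(params bool[] values)
    {
        int count = 0;
        foreach (bool value in values)
        {
            if (value) count++;
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountTrue(true, false, true));       // prints 2
        Console.WriteLine(CountTrue(new bool[] { true }));     // prints 1
        Console.WriteLine(CountTrue());                        // prints 0
    }
}
```

The call with separate arguments still allocates an array behind the scenes, so params is a convenience for the caller rather than a performance gain.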

Agreed with Matias' answer.
I also want to add that you need to add error checking, as you are passed an array, and nowhere is stated how many elements in your array you will receive. So you must first check that you have three elements in your array. This will balance the small perf gain that you may have earned.
Also, if you ever want to make this method available to other developers (as part of an API, public or private), IntelliSense will not help them at all with which parameters they're supposed to set...
While using three parameters, you can do this :
///<summary>
///This method does something
///</summary>
///<param name="first">The first parameter</param>
///<param name="second">The second parameter</param>
///<param name="third">The third parameter</param>
public static void MethodName(bool first, bool second, bool third)
{
//Do something
}
And it will be displayed nicely and helpfully to others...

I would take a different approach and use Flags;
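const int FIRST = 1; // example bit value; the original does not define FIRST
public static void MethodName(int Flag)
{
// In C# the result of & on two ints is an int, so compare it explicitly
if ((Flag & FIRST) != 0) { }
}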
Chances are the compiler will do its own optimizations;
Check http://rextester.com/QRFL3116 Added method from Jamiec comment
M1 took 5ms
M2 took 23ms
M3 took 4ms
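The flags idea could also be sketched with a [Flags] enum, which gives each option a name. The Options enum and the IsSet helper are assumptions made for the example; the timings above do not come from this code.

```csharp
using System;

[Flags]
public enum Options
{
    None = 0,
    First = 1,
    Second = 2,
    Third = 4
}

public static class FlagsDemo
{
    // True when every bit of 'flag' is set in 'value'.
    public static bool IsSet(Options value, Options flag)
    {
        return (value & flag) == flag;
    }

    public static void Main()
    {
        Options options = Options.First | Options.Third;
        Console.WriteLine(IsSet(options, Options.Third));  // prints True
        Console.WriteLine(IsSet(options, Options.Second)); // prints False
    }
}
```

A single enum parameter keeps the call site readable (Options.First | Options.Third) where a long list of positional bools would not be.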

Fastest way to get any element from a Dictionary

I'm implementing A* in C# (not for pathfinding) and I need Dictionary to hold open nodes, because I need fast insertion and fast lookup. I want to get the first open node from the Dictionary (it can be any random node). Using Dictionary.First() is very slow. If I use an iterator, MoveNext() is still using 15% of the whole CPU time of my program. What is the fastest way to get any random element from a Dictionary?

I suggest you use a specialized data structure for this purpose, as the regular Dictionary was not made for this.
In Java, I would probably recommend LinkedHashMap, for which there are custom C# equivalents (not built-in sadly) (see).
It is, however, rather easy to implement this yourself in a reasonable fashion. You could, for instance, use a regular dictionary with tuples that point to the next element as well as the actual data. Or you could keep a secondary stack that simply stores all keys in order of addition. Just some ideas. I never implemented or profiled this myself, but I'm sure you'll find a good way.
Oh, and if you didn't already, you might also want to check the hash code distribution, to make sure there is no problem there.

Finding the first (or an indexed) element in a dictionary is actually O(n) because it has to iterate over every bucket until a non-empty one is found, so MoveNext will actually be the fastest way.
If this were a problem, I would consider using something like a stack, where pop is an O(1) operation.
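One way to combine the dictionary with a stack, as suggested in the answers above, is sketched below. OpenSet is a hypothetical wrapper, not a built-in type. Keys removed directly are left in the stack and skipped later (lazy deletion), which keeps taking an arbitrary element at amortized O(1).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper: a dictionary for lookup plus a stack of keys
// so that "give me any element" does not need to scan buckets.
public sealed class OpenSet<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _items = new Dictionary<TKey, TValue>();
    private readonly Stack<TKey> _order = new Stack<TKey>();

    public int Count { get { return _items.Count; } }

    public void Add(TKey key, TValue value)
    {
        _items.Add(key, value); // throws on duplicate keys, like Dictionary
        _order.Push(key);
    }

    public bool Contains(TKey key) { return _items.ContainsKey(key); }

    // The key stays in the stack; TryTakeAny skips it later.
    public bool Remove(TKey key) { return _items.Remove(key); }

    // Removes and returns some element, or returns false if the set is empty.
    public bool TryTakeAny(out TKey key, out TValue value)
    {
        while (_order.Count > 0)
        {
            key = _order.Pop();
            if (_items.TryGetValue(key, out value))
            {
                _items.Remove(key);
                return true;
            }
            // Otherwise the key was already removed; keep popping.
        }
        key = default(TKey);
        value = default(TValue);
        return false;
    }
}

public static class OpenSetDemo
{
    public static void Main()
    {
        OpenSet<int, string> open = new OpenSet<int, string>();
        open.Add(1, "a");
        open.Add(2, "b");
        open.Add(3, "c");
        open.Remove(3);

        int key;
        string value;
        open.TryTakeAny(out key, out value);
        Console.WriteLine(key + " " + value); // prints 2 b
    }
}
```

If the algorithm actually needs the lowest-cost node rather than any node, a priority queue would be the more usual structure; this sketch only covers the "any element" case asked about.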

Try
Enumerable.ToList(dictionary.Values)[new Random().Next(dictionary.Count)].
Should have pretty good performance but watch out for memory usage if your dictionary is huge. Obviously take care of not creating the random object every time and you might be able to cache the return value of Enumerable.ToList if its members don't change too frequently.

Dictionary Performance

What's the difference between the two snippets?
Snippet 1:
{
Dictionary<MyCLass, bool> dic;
MyFunc(out dic);
}
Snippet 2:
{
Dictionary<MyCLass, bool> dic = null;
MyFunc(out dic);
}
Is snippet 2 better in performance?

Technically speaking the second code snippet will likely execute more instructions than the first by doing a redundant null set. I'm hedging with likely here because the C# spec may allow for the flexibility of ignoring this set. I don't know off hand.
However I would seriously doubt that would ever noticeably affect performance of an application. I certainly would not code for that optimization but would instead prefer the solution which I found more understandable.

Do not worry about these when you haven't measured the performance of the application.
Things like this are very unlikely to have a huge impact, in fact, most of the time things like this will not be noticeable compared to other lines you wrote.
Measure first, then worry about performance.

I like snippet 2. It's slower, but initializing variables explicitly is better practice for reducing errors and a good habit overall. The JIT may even be able to optimize the assignment away, so that you lose only a little performance at compile and load time rather than at execution (I haven't verified this with a debugger or disassembler, but the JIT is quite 'smart' for a computer program, so it may be able to do it).

Compile them both and compare the IL. I imagine it would be the same. The storage for the out parameter should be initialized to zero (null, if it is a reference type) before it is passed to the called method.
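For reference, a compilable version of both snippets. MyFunc's body and the string key type are assumptions, since the question does not show them. The C# compiler requires the called method to assign an out parameter before it returns, so the caller never observes the initial null from snippet 2 after the call.

```csharp
using System;
using System.Collections.Generic;

public static class OutDemo
{
    // Assumed body: an out parameter must be assigned before returning.
    public static void MyFunc(out Dictionary<string, bool> dic)
    {
        dic = new Dictionary<string, bool>();
    }

    public static void Main()
    {
        // Snippet 1: no initializer is needed before an out call.
        Dictionary<string, bool> first;
        MyFunc(out first);

        // Snippet 2: the null is overwritten by the call.
        Dictionary<string, bool> second = null;
        MyFunc(out second);

        Console.WriteLine(first.Count + " " + second.Count); // prints 0 0
    }
}
```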
