C# Dictionary Loop Enhancement - c#

I have a dictionary with around 1 million items. I am constantly looping through the dictionary:
public void DoAllJobs()
{
foreach (KeyValuePair<uint, BusinessObject> p in _dictionnary)
{
if(p.Value.MustDoJob)
p.Value.DoJob();
}
}
The execution is a bit long, around 600 ms, and I would like to decrease it. Here are the constraints:
1. MustDoJob values mostly stay the same between two calls to DoAllJobs()
2. 60-70% of the MustDoJob values == false
3. From time to time MustDoJob changes for 200 000 pairs
4. Some p.Value.DoJob() calls cannot be computed at the same time (COM object call)
Here, I do not need the key part of the _dictionnary object, but I really do need it somewhere else.
I wanted to do the following :
Parallelize, but I am not sure it is going to be effective due to 4.
Sort the dictionary, given 1. and 2. (and stop when I find the first MustDoJob == false), but I am wondering what 3. would result in.
I did not implement any of the previous ideas since it could be a lot of work, and I would like to investigate other options first. So... any ideas?

What I would suggest is that your business object could raise an event when MustDoJob becomes true. You could subscribe to that event, store references to those objects in a simple list, and then process the contents of that list when the DoAllJobs() method is called.
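A rough sketch of that idea, using hypothetical names (`MustDoJobChanged`, `JobScheduler`, `_pending`) that are not from the original post:

```csharp
using System;
using System.Collections.Generic;

// Sketch only: the object raises an event when MustDoJob flips to true,
// so the scheduler never has to scan the full dictionary.
public class BusinessObject
{
    public event Action<BusinessObject> MustDoJobChanged; // hypothetical event

    private bool _mustDoJob;
    public bool MustDoJob
    {
        get => _mustDoJob;
        set
        {
            if (_mustDoJob == value) return;   // no duplicate notifications
            _mustDoJob = value;
            if (value) MustDoJobChanged?.Invoke(this);
        }
    }

    public int JobsDone { get; private set; }
    public void DoJob() { JobsDone++; MustDoJob = false; }
}

public class JobScheduler
{
    private readonly List<BusinessObject> _pending = new List<BusinessObject>();

    public void Track(BusinessObject o) => o.MustDoJobChanged += _pending.Add;

    // Only the objects that actually need work are visited.
    public void DoAllJobs()
    {
        foreach (var o in _pending) o.DoJob();
        _pending.Clear();
    }
}
```

With this shape, DoAllJobs() is proportional to the number of pending jobs rather than the size of the dictionary.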

My first suggestion would be to use just the values from the dictionary:
foreach (BusinessObject value in _dictionnary.Values)
{
if(value.MustDoJob)
{
value.DoJob();
}
}
With LINQ this could be even easier:
foreach (BusinessObject value in _dictionnary.Values.Where(v => v.MustDoJob))
{
value.DoJob();
}
That makes it clearer. However, it's not clear what else is actually causing you a problem. How quickly do you need to be able to iterate over the dictionary? I expect it's already pretty nippy... is anything actually wrong with this brute force approach? What's the impact of it taking 600ms to iterate over the collection? Is that 600ms when nothing needs to do any work?
One thing to note: you can't change the contents of the dictionary while you're iterating over it - whether in this thread or another. That means not adding, removing or replacing key/value pairs. It's okay for the contents of a BusinessObject to change, but the dictionary relationship between the key and the object can't change. If you want to minimise the time during which you can't modify the dictionary, you can take a copy of the list of references to objects which need work doing, and then iterate over that:
foreach (BusinessObject value in _dictionnary.Values
.Where(v => v.MustDoJob)
.ToList())
{
value.DoJob();
}

Try using a profiler first. Constraint 4 makes me curious - 600 ms may not be that much if the COM object uses most of the time, and then it is either parallelize or live with it.
I would make sure first - with a profiler run - that you don't target the totally wrong issue here.

Having established that the loop really is the problem (see TomTom's answer), I would maintain a list of the items on which MustDoJob is true -- e.g., when MustDoJob is set, add the item to the list, and when you process and clear the flag, remove it from the list. (This might be done directly by the code manipulating the flag, or by raising an event when the flag changes; it depends on what you need.) Then you loop through the list (which, given your numbers, will only hold the 30-40% of items whose flag is true), not the dictionary. The list might contain the object itself or just its key in the dictionary, although it will be more efficient if it holds the object itself, as you avoid the dictionary lookup. It does depend on how frequently you're queuing 200k of them, and how time-critical the queuing vs. the execution is.
But again: Step 1 is make sure you're solving the right problem.

The use of a dictionary to me implies that the intention is to find items by a key, rather than visit every item. On the other hand, 600ms for looping through a million items is respectable.
Perhaps alter your logic so that you can simply pick the relevant items satisfying the condition directly out of the dictionary.

Use a List of KeyValuePairs instead. This means you can iterate over it super-quickly by doing
List<KeyValuePair<string,object>> list = ...;
int totalItems = list.Count;
for (int x = 0; x < totalItems; x++)
{
// whatever you plan to do with them, you have access to both KEY and VALUE.
}
I know this post is old, but I was looking for a way to iterate over a dictionary without the increased overhead of the Enumerator being created (GC and all), or generally a faster way to iterate over it.

Related

Mass Update a property on multiple records inside a dictionary (VB.NET / C#)

I have a Dictionary(Of Long, Class), where the class has multiple properties (assume we have a Boolean property called Updated).
I want to set this Updated property to True at once for, let's say, all odd-key records (or based on any specific rule). What is the best way to do so?
My thought is to use LINQ to fetch those records and then For Each over them, but is there a better way, like doing a mass update where a condition matches (like we do in a database)?
An example of my approach is below. Appreciate it if there is a better way to do such an update...
Thanks
Dim ReturnedObjs = From Obj In Dictionary Where Obj.Key Mod 2 = 1
For Each item As KeyValuePair(Of Long, Class) In ReturnedObjs
item.Value.Updated = True
Next
First, this sounds like an obvious case for the speed rant:
https://ericlippert.com/2012/12/17/performance-rant/
Second:
The best way is to keep this in the database. You are not going to beat the speed of a DB query with indexes designed for quick matching by transferring the data over the network twice (once to get it, once to return it) and doubling the search load (once to get all odd ones, once to update all the ones you just changed). My standing advice is to always keep as much work as possible on the DB side. Your client code will never be able to beat it.
Third:
If you do need to use client side processing:
Now a lot of my answer depends on details of the implementation, how the JiT and general compiler optimisations work, etc.
foreach works on enumerators, not collections. If you feed a collection to foreach, an enumerator is implicitly created. Enumerators have two notable properties:
If the collection changes, the enumerator becomes invalid. Most people learn about enumerators because they ran into this issue.
Accessing the collection involves an extra function call and set of checks, so it will be a slowdown. How much is hard to say, as the optimisations and the JiT are pretty good.
So you probably want to use a for loop instead.
If you could turn the Dictionary into a collection where the primary key is used as the index, it might be a bit faster. But that has the danger of running into a lot of "dry spells" in the data, so it depends a lot on your source data.
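A minimal sketch of that for-loop approach, taking an array snapshot of the dictionary once and then indexing into it (the names, the string values, and the odd-key rule are illustrative assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: snapshot the dictionary entries into an array once, then use an
// index-based for loop, avoiding enumerator overhead on every pass.
var data = new Dictionary<long, string>
{
    [1] = "updated?", [2] = "updated?", [3] = "updated?"
};

// The snapshot must be refreshed whenever the dictionary's contents change.
KeyValuePair<long, string>[] snapshot = data.ToArray();

for (int i = 0; i < snapshot.Length; i++)
{
    if (snapshot[i].Key % 2 == 1)          // e.g. "all odd key records"
        data[snapshot[i].Key] = "updated";
}
```

Because the loop mutates values through the dictionary indexer rather than adding or removing keys, it is safe to do while the snapshot is being walked.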

Poor use of dictionary?

I've read on here that iterating through a dictionary is generally considered abusing the data structure, and that something else should be used instead.
However, I'm having trouble coming up with a better way to accomplish what I'm trying to do.
When a tag is scanned I use its ID as the key and the value is a list of zones it was seen in. About every second I check to see if a tag in my dictionary has been seen in two or more zones and if it has, queue it up for some calculations.
for (int i = 0; i < TagReads.Count; i++)
{
var tag = TagReads.ElementAt(i).Value;
if (tag.ZoneReads.Count > 1)
{
Report.Tags.Enqueue(tag);
Boolean success = false;
do
{
success = TagReads.TryRemove(tag.Epc, out TagInfo outTag);
} while (!success);
}
}
I feel like a dictionary is the correct choice here because there can be many tags to look up but something about this code nags me as being poor.
As far as efficiency goes. The speed is fine for now in our small scale test environment but I don't have a good way to find out how it will work on a massive scale until it is put to use, hence my concern.
I believe that there's an alternative approach which doesn't involve iterating a big dictionary.
First of all, you need to create a HashSet<T> of tags on which you'll store those tags that have been detected in more than 2 zones. We'll call it tagsDetectedInMoreThanTwoZones.
And you may refactor your code flow as follows:
A. Whenever you detect a tag in one zone...
Add the tag and the zone to the main dictionary.
Create an exclusive lock against tagsDetectedInMoreThanTwoZones to avoid undesired behaviors in B..
Check if the key has more than one zone. If this is true, add it to tagsDetectedInMoreThanTwoZones.
Release the lock against tagsDetectedInMoreThanTwoZones.
B. Whenever you need to process a tag which has been detected in more than one zone...
Create an exclusive lock against tagsDetectedInMoreThanTwoZones to avoid more than a thread trying to process them.
Iterate tagsDetectedInMoreThanTwoZones.
Use each tag in tagsDetectedInMoreThanTwoZones to get the zones in your current dictionary.
Clear tagsDetectedInMoreThanTwoZones.
Release the exclusive lock against tagsDetectedInMoreThanTwoZones.
Now you'll iterate those tags that you already know that have been detected in more than a zone!
In the long run, you can even make per-region partitions so you never get a tagsDetectedInMoreThanTwoZones set with too many items to iterate, and each set could be consumed by a dedicated thread!
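The A/B flow above could be sketched roughly like this (all names, and the choice of a plain `lock` around the set, are assumptions):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

var tagZones = new ConcurrentDictionary<string, List<string>>();
var multiZoneTags = new HashSet<string>();  // tags seen in 2+ zones
var gate = new object();                    // exclusive lock for the set

// A: record a detection; flag the tag once it has been seen in 2+ zones.
void OnTagDetected(string epc, string zone)
{
    var zones = tagZones.GetOrAdd(epc, _ => new List<string>());
    lock (gate)
    {
        if (!zones.Contains(zone)) zones.Add(zone);
        if (zones.Count > 1) multiZoneTags.Add(epc);
    }
}

// B: drain only the flagged tags instead of scanning the whole dictionary.
List<string> DrainMultiZoneTags()
{
    lock (gate)
    {
        var batch = new List<string>(multiZoneTags);
        multiZoneTags.Clear();
        return batch;
    }
}
```

The periodic job then iterates only the drained batch, whose size is bounded by the number of multi-zone tags rather than the total tag count.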
If you are going to do a lot of lookups in your code and only sometimes iterate through the whole thing, then I think the dictionary use is OK. I would like to point out, though, that your use of ElementAt is more alarming. ElementAt performs very poorly when used on objects that do not implement IList<T>, and the dictionary does not. For an IEnumerable<T> that does not implement IList<T>, the nth element is found through iteration, so your for loop will iterate the dictionary once for each element. You are better off with a standard foreach.
I feel like this is a good use for a dictionary, giving you good access speed when you want to check if an ID is already in the collection.

Does foreach loop work more slowly when used with a not stored list or array?

I am wondering whether a foreach loop works more slowly when the sequence being iterated is not stored in a list or array first.
I mean like this:
foreach (int number in list.OrderBy(x => x.Value))
{
// DoSomething();
}
Does the loop in this code recalculate the sorting on every iteration or not?
The loop using stored value:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>;
foreach (int number in list)
{
// DoSomething();
}
And if it does, which code performs better: storing the value or not?
This is often counter-intuitive, but generally speaking, the option that is best for performance is to wait as long as possible to materialize results into a concrete structure like a list or array. Please keep in mind that this is a generalization, so there are plenty of cases where it doesn't hold. Nevertheless, the better first instinct is to avoid creating the list for as long as possible.
To demonstrate with your sample, we have these two options:
var list = tours.OrderBy(x => x.Value).ToList();
foreach (int number in list)
{
// DoSomething();
}
vs this option:
foreach (int number in list.OrderBy(x => x.Value))
{
// DoSomething();
}
To understand what is going on here, you need to look at the .OrderBy() extension method. Reading the linked documentation, you'll see it returns an IOrderedEnumerable<TSource> object. With an IOrderedEnumerable, all of the sorting needed for the foreach loop is finished by the time you start iterating over the object (and that, I believe, is the crux of your question: no, it does not re-sort on each iteration). Also note that both samples use the same OrderBy() call. Therefore, both samples have the same ordering problem to solve, and they accomplish it the same way, meaning they take exactly the same amount of time to reach that point in the code.
The difference in the code samples, then, is entirely in using the foreach loop directly vs first calling .ToList(), because in both cases we start from an IOrderedEnumerable. Let's look closely at those differences.
When you call .ToList(), what do you think happens? This method is not magic. There is still code here which must execute in order to produce the list. This code still effectively uses its own foreach loop that you can't see. Additionally, where once you only needed enough RAM to handle one object at a time, you are now forcing your program to allocate a new block of RAM large enough to hold references for the entire collection. Moving beyond references, you may also need new memory allocations for the full objects, if you were reading from a stream or database reader that really only needed one object in RAM at a time. This is an especially big deal on systems where memory is the primary constraint, which is often the case with web servers, where you may be serving and maintaining session RAM for many, many sessions, but each session only occasionally uses any CPU time to request a new page.
Now I am making one assumption here: that you are working with something that is not already a list. The previous paragraphs talked about converting an IOrderedEnumerable into a List, not about converting a List into some form of IEnumerable. I need to admit that there is some small overhead in creating and operating the state machine that .NET uses to implement those objects. However, I think this is a good assumption; it turns out to be true far more often than we realize. Even in the samples for this question, we're paying this cost regardless, by the simple virtue of calling the OrderBy() function.
In summary, there can be some additional overhead in using a raw IEnumerable vs converting to a List, but there probably isn't. Additionally, you are almost certainly saving yourself some RAM by avoiding the conversions to List whenever possible... potentially a lot of RAM.
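To make the "no re-sort per iteration" point concrete, here is a small sketch that counts how often the OrderBy key selector runs (the counter is mine, not from the question):

```csharp
using System;
using System.Linq;

int selectorCalls = 0;
int[] tours = { 3, 1, 2 };

// OrderBy is lazy: building the query runs nothing.
var ordered = tours.OrderBy(x => { selectorCalls++; return x; });
// selectorCalls is still 0 here.

foreach (int t in ordered) { /* DoSomething(t); */ }
// The selector ran once per element (3 calls) for the whole loop,
// not once on every loop iteration.

foreach (int t in ordered) { /* DoSomething(t); */ }
// A second enumeration sorts again: 3 more calls, 6 in total.
// Materializing with .ToList() after the first sort would avoid the repeat.
```

So within a single foreach the sort happens exactly once; the cost only repeats if you enumerate the lazy query again.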
Yes and no.
Yes the foreach statement will seem to work slower.
No your program has the same total amount of work to do so you will not be able to measure a difference from the outside.
What you need to focus on is not using a lazy operation (in this case OrderBy) multiple times without a .ToList() or .ToArray(). In this case you are only using it once (in the foreach), but it is an easy thing to miss.
Edit: Just to be clear, the `as` cast in the question will not work as intended, but my answer assumes no .ToList() after OrderBy.
This line won't run:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>; // Returns null.
Instead, you want to store the results this way:
List<Tour> list = tours.OrderBy(x => x.Value).ToList();
And yes, the second option (storing the results) will enumerate much faster as it will skip the sorting operation.

Performance Comparison: Check and set a flag Vs. Just set the flag

I have a dictionary of flags, and on each event I set the associated flag. Some events are of the same type and have the same id, so it's possible to set their flags multiple times. Would it be faster to check a flag before setting it?
Dictionary<int, bool> flags = new Dictionary<int, bool>();
foreach(var eventType in eventTypes)
{
flags.Add(eventType.id, false);
}
for (int i = 0; i < iterations; i++)
{
// resetting flags for current iteration.
foreach(var k in flags.Keys)
{
flags[k] = false;
}
Event[] events = GetEvents();
foreach(var e in events)
{
flags[e.id] = true; // Would it be better to check the flag before setting it?
}
// Do related works for events occurred
}
The operation which is the most expensive in this case happens when you index the dictionary. Whenever you try to add, update or read a value for a specific key, Dictionary needs to compute the hash value of your key, find the appropriate bucket, and then iterate through the elements in that bucket (if there is more than one due to hash collisions) and compare them to the key.
If you checked the value before updating it, you would need to do this twice, in case that you need to change the value.
On a side note, this is exactly why TryGetValue is the preferred (more performant) method for getting a value which may or may not be in the dictionary: because calling ContainsKey and then indexing the dictionary would take two operations (or, even worse, catching the KeyNotFoundException).
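A side-by-side sketch of the two patterns (the flag values are illustrative):

```csharp
using System;
using System.Collections.Generic;

// TryGetValue hashes the key once; ContainsKey + indexer hashes it twice.
var flags = new Dictionary<int, bool> { [1] = true };

// One lookup:
if (flags.TryGetValue(1, out bool isSet))
{
    // use isSet here
}

// Two lookups (avoid this pattern):
if (flags.ContainsKey(1))
{
    bool alsoSet = flags[1];   // second hash + bucket walk for the same key
}
```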
Also, if the range of your int keys is limited, you may consider using a BitArray instead of a Dictionary.
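A minimal BitArray sketch; the 0..999 id range is an assumption for illustration:

```csharp
using System;
using System.Collections;

// If event ids fall in a small known range, a BitArray replaces the
// Dictionary with a plain array index - no hashing at all.
var flags = new BitArray(1000);   // all bits start out false

flags.SetAll(false);              // reset between iterations
flags[42] = true;                 // equivalent of flags[e.id] = true;

bool occurred = flags[42];
bool other = flags[7];
```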
[Edit]
Presumably, you use the dictionary to check if a given event occurred at least once. To do this, you can simply write:
var eventIds = new HashSet<int>(GetEvents().Select(e => e.id));
To check if an event occurred efficiently, you would query the hashset:
if (eventIds.Contains(id))
{
// do something
}
In any Dictionary based operation, lookup time is likely to far surpass the time needed to set a flag, which typically will involve simply setting a single value in memory.
If you just set the flag without checking, then that involves only one lookup.
If you check first, then this potentially involves two lookups. Don't count on the compiler optimizing the second one away: a dictionary lookup is an ordinary method call, and the JIT will not collapse two of them into one.
Conclusion:
Not checking is faster, because it avoids the second lookup.
If you're really concerned about performance, consider whether you can use arrays instead of dictionaries in your application, since arrays tend to be faster.
However that's just based on my knowledge and instincts. If you're really serious about performance, then the only real solution is to write some code to test how fast various approaches are!
Setting a flag only after checking its current value is slower, because you would need to query the dictionary twice.
If you set the flag directly it is only one operation.
Additionally, in a multi-threaded environment you would also need to prevent race conditions, e.g. by putting a lock around the check-and-set logic, which would slow it down even more.
The only reason for checking that I can think of is if you have a custom dictionary that triggers something on a set event even when the value is unchanged, and you want to avoid that.

What is the fastest way of changing Dictionary<K,V>?

This is an algorithmic question.
I have got Dictionary<object,Queue<object>>. Each queue contains one or more elements in it. I want to remove all queues with only one element from the dictionary. What is the fastest way to do it?
Pseudo-code: foreach(item in dict) if(item.Length==1) dict.Remove(item);
It is easy to do it in a loop (not foreach, of course), but I'd like to know which approach is the fastest one here.
Why I want it: I use that dictionary to find duplicate elements in a large set of objects. The Key in dictionary is kind of a hash of the object, the Value is a queue of all objects found with the same hash. Since I want only duplicates, I need to remove all items with just a single object in associated queue.
Update:
It may be important to know that in a regular case there are just a few duplicates in a large set of objects. Let's assume 1% or less. So possibly it could be faster to leave the Dictionary as is and create a new one from scratch with just the selected elements from the first one... and then delete the first Dictionary completely. I think it depends on the complexity of the Dictionary class's methods used in the particular algorithms.
I really want to see this problem on a theoretical level because as a teacher I want to discuss it with students. I didn't provide any concrete solution myself because I think it is really easy to do it. The question is which approach is the best, the fastest.
var itemsWithOneEntry = dict.Where(x => x.Value.Count == 1)
.Select(x => x.Key)
.ToList();
foreach (var item in itemsWithOneEntry) {
dict.Remove(item);
}
Instead of trying to optimize the traversal of the collection, how about optimizing the content of the collection so that it only includes the duplicates? That would require changing your collection algorithm to something like this:
var duplicates = new Dictionary<object, Queue<object>>();
var possibleDuplicates = new Dictionary<object, object>();
foreach (var item in original)
{
    if (possibleDuplicates.ContainsKey(item))
    {
        // Second sighting: promote the pair into the duplicates dictionary.
        duplicates.Add(item, new Queue<object>(new[] { possibleDuplicates[item], item }));
        possibleDuplicates.Remove(item);
    }
    else if (duplicates.ContainsKey(item))
    {
        duplicates[item].Enqueue(item);
    }
    else
    {
        possibleDuplicates.Add(item, item);
    }
}
Note that you should probably measure the impact of this on the performance in a realistic scenario before you bother to make your code any more complex than it really needs to be. Most imagined performance problems are not in fact the real cause of slow code.
But supposing you do find that you could get a speed advantage by avoiding a linear search for queues of length 1, you could solve this problem with a technique called indexing.
As well as your dictionary containing all the queues, you maintain an index container (probably another dictionary) that only contains the queues of length 1, so when you need them they are already available separately.
To do this, you need to enhance all the operations that modify the length of the queue, so that they have the side-effect of updating the index container.
One way to do it is to define a class ObservableQueue. This would be a thin wrapper around Queue except it also has a ContentsChanged event that fires when the number of items in the queue changes. Use ObservableQueue everywhere instead of the plain Queue.
Then when you create a new queue, enlist on its ContentsChanged event a handler that checks to see if the queue only has one item. Based on this you can either insert or remove it from the index container.
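A minimal sketch of such an ObservableQueue (a thin wrapper; the event signature is one of several reasonable choices):

```csharp
using System;
using System.Collections.Generic;

// A thin Queue wrapper that fires an event whenever its count changes,
// so an index of single-item queues can be kept up to date as a side effect.
public class ObservableQueue<T>
{
    private readonly Queue<T> _inner = new Queue<T>();

    public event Action<ObservableQueue<T>> ContentsChanged;

    public int Count => _inner.Count;

    public void Enqueue(T item)
    {
        _inner.Enqueue(item);
        ContentsChanged?.Invoke(this);
    }

    public T Dequeue()
    {
        T item = _inner.Dequeue();
        ContentsChanged?.Invoke(this);
        return item;
    }
}
```

The index container then subscribes a handler that adds the queue when its Count reaches 1 and removes it otherwise, so the set of length-1 queues is always available without a linear scan.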
