What is the fastest way to compare these two collections? - c#

I am noticing a huge performance issue when trying to get the list of keys in a ConcurrentDictionary whose values match entries in an IEnumerable collection, as follows:
Customer object has:
string CustomerNumber;
string Location;
var CustomerDict = new ConcurrentDictionary<string, Customer>();
IEnumerable<string> customers = new List<string>();
I am trying to get a list of the keys in the dictionary whose value's CustomerNumber appears in customers. What I have is below; building removeItems takes a very long time:
var removeItems = CustomerDict
.Where(w => customers.Any(c => c == w.Value.CustomerNumber))
.Select(s => s.Key)
.ToList();
foreach(var item in removeItems)
{
CustomerDict.TryRemove(item, out _);
}
Any help on how best to handle this would be much appreciated.

Make customers a HashSet<string>, whose Contains method is O(1):
var customers = new HashSet<string>();
var removeItems = CustomerDict
.Where(w => customers.Contains(w.Value.CustomerNumber))
.Select(s => s.Key);
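Currently, Any iterates over customers for every dictionary entry, and each of those scans is O(n).
Also, your call to ToList is superfluous: it buffers the matching keys into an extra list before the removal loop, which costs an additional pass and extra memory. A ConcurrentDictionary can safely be modified while it is being enumerated, so the lazy query can feed the foreach directly.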

I think it's better to create a HashSet from customers to make the lookups faster:
HashSet<string> customersHashSet = new HashSet<string>(customers);
var removeItems = CustomerDict
.Where(c => customersHashSet.Contains(c.Value.CustomerNumber))
.Select(s => s.Key);
foreach (var item in removeItems)
{
CustomerDict.TryRemove(item, out _);
}
When removing, consider this: if the HashSet has many items relative to the dictionary, it may be better to iterate over the dictionary and search in the HashSet, like this:
foreach (var item in CustomerDict.ToArray())
{
if (customersHashSet.Contains(item.Value.CustomerNumber))
CustomerDict.TryRemove(item.Key, out _);
}

The problem is that .Any does a linear scan of the underlying collection, which in your case is customers, and it does so once for every entry in the concurrent dictionary. Each check therefore takes linear effort. It would be better to dump the customer numbers into a local HashSet and then check for inclusion via .Contains(w.Value.CustomerNumber). Each check then takes nearly constant effort.
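To make the HashSet suggestion concrete, here is a minimal sketch, assuming C# 9 or later (top-level statements). The sample keys, customer numbers, and the tuple standing in for the Customer class are made up for illustration and are not from the question.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: the dictionary key is an internal id,
// the value carries the customer number (modelled here as a tuple).
var customerDict = new ConcurrentDictionary<string, (string CustomerNumber, string Location)>();
customerDict["k1"] = ("C001", "Berlin");
customerDict["k2"] = ("C002", "Paris");
customerDict["k3"] = ("C003", "Oslo");

IEnumerable<string> customers = new[] { "C001", "C003", "C999" };

// Build the set once: O(m). Each Contains call is then O(1) on average.
var customerSet = new HashSet<string>(customers);

// ConcurrentDictionary tolerates removal while it is being enumerated.
foreach (var pair in customerDict)
{
    if (customerSet.Contains(pair.Value.CustomerNumber))
    {
        customerDict.TryRemove(pair.Key, out _);
    }
}

Console.WriteLine(string.Join(", ", customerDict.Keys)); // prints "k2"
```

The total work is roughly one pass over customers plus one pass over the dictionary, instead of one pass over customers per dictionary entry.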

If the dictionary is keyed by the customer number, why not simply do this:
foreach (var customer in customers) // enumerate customers
CustomerDict.TryRemove(customer, out _); // try to remove the customer; does nothing if the key isn't found

Related

Can I use LINQ to find out if a given property in a list repeats itself?

I have a list of objects that have a name field on them.
I want to know if there's a way to tell if all the name fields are unique in the list.
I could just do two loops and iterate over the list for each value, but I wanted to know if there's a cleaner way to do this using LINQ?
I've found a few examples where they compare each item of the list to a hard coded value but in my case I want to compare the name field on each object between each other and obtain a boolean value.
A common "trick" to check for uniqueness is to compare the length of a list with duplicates removed with the length of the original list:
bool allNamesAreUnique = myList.Select(x => x.Name).Distinct().Count() == myList.Count();
Select(x => x.Name) transforms your list into a list of just the names, and
Distinct() removes the duplicates.
The performance should be close to O(n), which is better than the O(n²) nested-loop solution.
Another option is to group your list by the name and check the size of those groups. This has the additional advantage of telling you which values are not unique:
var duplicates = myList.GroupBy(x => x.Name).Where(g => g.Count() > 1);
bool hasDuplicates = duplicates.Any(); // or
List<string> duplicateNames = duplicates.Select(g => g.Key).ToList();
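As a quick illustration of both checks, here is a small sketch on made-up data, assuming C# 9 or later (top-level statements). The anonymous objects and the sample names are hypothetical stand-ins for the real class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data; anonymous objects stand in for the real class.
var myList = new[]
{
    new { Name = "Ann" },
    new { Name = "Bob" },
    new { Name = "Ann" },
    new { Name = "Cid" },
};

// Distinct-count check: 3 distinct names vs. 4 items, so not unique.
bool allNamesAreUnique = myList.Select(x => x.Name).Distinct().Count() == myList.Length;

// GroupBy check: also reports which names repeat.
List<string> duplicateNames = myList
    .GroupBy(x => x.Name)
    .Where(g => g.Count() > 1)
    .Select(g => g.Key)
    .ToList();

Console.WriteLine(allNamesAreUnique);                 // False
Console.WriteLine(string.Join(", ", duplicateNames)); // Ann
```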
While you can use LINQ to group or create a distinct list, and then compare item-wise with the original list, that incurs a bit of overhead you might not want, especially for a very large list. A more efficient solution would store the keys in a HashSet, which has better lookup capability, and check for duplicates in a single loop. This solution still uses a little bit of LINQ so it satisfies your requirements.
static public class ExtensionMethods
{
static public bool HasDuplicates<TItem,TKey>(this IEnumerable<TItem> source, Func<TItem,TKey> func)
{
var found = new HashSet<TKey>();
foreach (var key in source.Select(func))
{
if (found.Contains(key)) return true;
found.Add(key);
}
return false;
}
}
If you are looking for duplicates in a field named Name, use it like this:
var hasDuplicates = list.HasDuplicates( item => item.Name );
If you want case-insensitivity:
var hasDuplicates = list.HasDuplicates( item => item.Name.ToUpper() );
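ToUpper allocates a new string per item and depends on the current culture. One possible alternative is to hand the HashSet a case-insensitive comparer instead. The sketch below assumes C# 9 or later; the three-parameter HasDuplicates is a hypothetical variant of the method above, written as a local function so the snippet runs on its own.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical variant of HasDuplicates that takes an equality comparer.
// Written as a local function so the snippet runs as-is; in a real project
// it could be an extension method next to the original.
static bool HasDuplicates<TItem, TKey>(
    IEnumerable<TItem> source,
    Func<TItem, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    var found = new HashSet<TKey>(comparer);
    foreach (var item in source)
    {
        // Add returns false when an equal key is already in the set.
        if (!found.Add(keySelector(item))) return true;
    }
    return false;
}

var names = new[] { "alice", "Bob", "ALICE" };

bool ordinal = HasDuplicates(names, n => n, StringComparer.Ordinal);
bool ignoreCase = HasDuplicates(names, n => n, StringComparer.OrdinalIgnoreCase);

Console.WriteLine(ordinal);    // False
Console.WriteLine(ignoreCase); // True
```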

Categorical sorting optimization

Question: What is the best way to sort items(T) into buckets(ConcurrentBag)?
Ok, so I have not yet taken an Algorithms class, so I am unsure of the best approach to the problem I have come across.
Preconditions:
Each bucket has a unique identifier (within each sBucket).
Each sBucket has a unique identifier.
Each item has a unique identifier.
Each item has a property (bucketId) corresponding to the bucket it belongs to.
Each item has a property (sBucketId) corresponding to the superBucket it belongs to.
Bucket and sBucket ids are unique.
I have a ConcurrentBag of items I wish to sort into these buckets.
There are several hundred items.
There are several dozen buckets.
There are 3 super-buckets which contain the buckets.
Each super-bucket contains the same buckets, though with different items within the buckets.
I am currently using brute force via a Parallel.ForEach loop on the collection of items to compare the item's bucketId to each individual bucket using LINQ. This is incredibly slow and cumbersome though, so I'd like to find a better method.
I have thought about sorting the items based on their superBucket then Bucket, and then iterating through each superbucket->bucket to insert the items. Should this be the path I take?
Thanks for any help you can provide.
Example of current code
ConcurrentBag<Item> items ...
List<SuperBuckets> ListOfSuperBuckets ...
Parallel.ForEach(items, item =>
{
ListOfSuperBuckets
.Where(sBucket => sBucket.id == item.sBucketId)
.First()
.buckets
.Where(bucket => bucket.id == item.bucketId)
.First()
.items
.Add(item);
});
I wouldn't use parallelism for this, but there are a bunch of options.
var groupedBySBucket = ListOfSuperBuckets
.GroupJoin(items, a => a.id, b => b.sBucketId, (a,b) => new
{
sBucket = a,
buckets = a.buckets
.GroupJoin(b, c => c.id, x => x.bucketId, (c, x) => new
{
bucket = c,
items = x
})
});
foreach (var g in groupedBySBucket)
{
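// This works because buckets are reference types: the existing instances are updated in place.
foreach (var b in g.buckets)
{
// Assumes Bucket.items is a List<Item>, since AddRange is a List<T> method.
b.bucket.items.AddRange(b.items);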
}
}
Or if that's too much code for you, this is comparable.
var groupedByBucket = ListOfSuperBuckets
.SelectMany(c => c.buckets, (a,b) => new { sBucketId = a.id, bucket = b })
.GroupJoin(items, a => new { a.sBucketId, bucketId = a.bucket.id }, b => new { b.sBucketId, b.bucketId }, (a, b) => new
{
bucket = a.bucket,
items = b
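});
foreach (var g in groupedByBucket)
{
// This works because buckets are reference types: the existing instances are updated in place.
// Assumes Bucket.items is a List<Item>, since AddRange is a List<T> method.
g.bucket.items.AddRange(g.items);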
}
All of this assumes ListOfSuperBuckets is a given. If it was simply an artifact of your implementation, there is an even simpler way: build the list directly from the items.
Beware that the two approaches differ: this one produces no empty buckets where there is no data, whereas the first implementation can. It also creates new buckets, which the first implementation doesn't; that's good if you need them, bad if you've already created them elsewhere. The first one could easily be modified to create them, of course.
var ListOfSuperBuckets = items
.GroupBy(c => new { c.bucketId, c.sBucketId })
.GroupBy(c => c.Key.sBucketId)
.Select(c => new SuperBucket
{
id = c.Key,
buckets = c.Select(b => new Bucket
{
id = b.Key.bucketId,
items = b.ToList()
}).ToList()
})
.ToList();
For what it's worth, all these ToList calls are meant to preserve the contract I assume you have. If you don't need them, you could benefit from LINQ's deferred execution by leaving them off. It's really a matter of how you're using the code, but that's worth consideration.
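To illustrate the deferred-execution point, here is a minimal sketch on a plain list of integers, assuming C# 9 or later (top-level statements). The data is made up; only the difference between a lazy query and a ToList snapshot matters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var source = new List<int> { 1, 2, 3 };

// Deferred: nothing is evaluated until the query is enumerated.
IEnumerable<int> deferred = source.Where(n => n > 1);

// Materialized: ToList runs the query now and stores a snapshot.
List<int> snapshot = source.Where(n => n > 1).ToList();

source.Add(4);

Console.WriteLine(deferred.Count()); // 3 (sees the element added later)
Console.WriteLine(snapshot.Count);   // 2 (fixed at the time ToList ran)
```

The flip side is that a deferred query is re-evaluated on every enumeration, so ToList is still the safer choice when the result is read several times.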
You should use Dictionary so you can look up buckets and SuperBuckets by ID instead of searching for them.
SuperBucket should have a Dictionary<id_type,Bucket> that you can use to look up buckets by ID, and you should keep the SuperBuckets in a Dictionary<id_type,SuperBucket>. (id_type is the type of your IDs -- probably string or int, but I can't tell from your code)
If you don't want to modify the existing classes, then build a Dictionary<id_type, Dictionary<id_type, Bucket>> and use that.
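A nested dictionary along those lines might look like the sketch below, assuming C# 9 or later (top-level statements). The id types (string for super-buckets, int for buckets), the sample items, and the use of a List<int> of item ids in place of a Bucket object are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical ids and items; tuples stand in for the Item class.
var items = new[]
{
    (Id: 1, SBucketId: "A", BucketId: 10),
    (Id: 2, SBucketId: "A", BucketId: 11),
    (Id: 3, SBucketId: "B", BucketId: 10),
    (Id: 4, SBucketId: "A", BucketId: 10),
};

// Outer key: super-bucket id. Inner key: bucket id. Value: the bucket's item ids.
var index = new Dictionary<string, Dictionary<int, List<int>>>();

foreach (var item in items)
{
    if (!index.TryGetValue(item.SBucketId, out var buckets))
    {
        buckets = new Dictionary<int, List<int>>();
        index[item.SBucketId] = buckets;
    }
    if (!buckets.TryGetValue(item.BucketId, out var bucketItems))
    {
        bucketItems = new List<int>();
        buckets[item.BucketId] = bucketItems;
    }
    bucketItems.Add(item.Id); // two hash lookups per item, no scanning
}

Console.WriteLine(string.Join(", ", index["A"][10])); // 1, 4
```

With a few hundred items, a single-threaded loop like this is likely cheap enough that Parallel.ForEach is not needed; note that Dictionary and List are not thread-safe if the loop is parallelized.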

Finding the list of common objects between two lists

I have a list of objects of a class, for example:
class MyClass
{
public string id;
public string name;
public string lastname;
}
so, for example: List<MyClass> myClassList;
and I also have a list of strings containing some ids, for example:
List<string> myIdList;
Now I am looking for a method that accepts these two as parameters and returns a List<MyClass> of the objects whose id appears in myIdList.
NOTE: myClassList is always the bigger list, and myIdList is always a smaller subset of it.
How can we find this intersection?
So you're looking to find all the elements in myClassList where myIdList contains the ID? That suggests:
var query = myClassList.Where(c => myIdList.Contains(c.id));
Note that if you could use a HashSet<string> instead of a List<string>, each Contains test will potentially be more efficient - certainly if your list of IDs grows large. (If the list of IDs is tiny, there may well be very little difference at all.)
It's important to consider the difference between a join and the above approach in the face of duplicate elements in either myClassList or myIdList. A join will yield every matching pair - the above will yield either 0 or 1 element per item in myClassList.
Which of those you want is up to you.
EDIT: If you're talking to a database, it would be best if you didn't use a List<T> for the entities in the first place - unless you need them for something else, it would be much more sensible to do the query in the database than fetching all the data and then performing the query locally.
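The difference between a join and the Contains filter in the face of duplicates can be seen in a small sketch, assuming C# 9 or later (top-level statements). The sample data is made up, and tuples stand in for MyClass.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical data: the id "2" appears twice in the id list.
var myClassList = new[]
{
    (id: "1", name: "Ann"),
    (id: "2", name: "Bob"),
    (id: "3", name: "Cid"),
};
var myIdList = new List<string> { "2", "2", "3" };

// Where + Contains: each element of myClassList is yielded at most once.
var filtered = myClassList.Where(c => myIdList.Contains(c.id)).ToList();

// Join: one result per matching (item, id) pair, so Bob appears twice.
var joined = myClassList
    .Join(myIdList, item => item.id, id => id, (item, id) => item)
    .ToList();

Console.WriteLine(filtered.Count); // 2
Console.WriteLine(joined.Count);   // 3
```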
That isn't strictly an intersection (unless the ids are unique), but you can simply use Contains, i.e.
var sublist = myClassList.Where(x => myIdList.Contains(x.id));
You will, however, get significantly better performance if you create a HashSet<T> first:
var hash = new HashSet<string>(myIdList);
var sublist = myClassList.Where(x => hash.Contains(x.id));
You can use a join between the two lists:
return myClassList.Join(
myIdList,
item => item.id,
id => id,
(item, id) => item)
.ToList();
It is a kind of intersection between the two lists, so read it as: I want the items from one list that are present in the second list. Here the ToList() call executes the query immediately.
var lst = myClassList.Where(x => myIdList.Contains(x.id)).ToList();
You have to use the code below:
var samedata = myClassList.Where(p => myIdList.Any(q => q == p.id));
myClassList.Where(x => myIdList.Contains(x.id));
Try
List<MyClass> GetMatchingObjects(List<MyClass> classList, List<string> idList)
{
return classList.Where(myClass => idList.Any(x => myClass.id == x)).ToList();
}
var q = myClassList.Where(x => myIdList.Contains(x.id));

Using LINQ to select a list of every item in a collection, except if it exists in another collection

This works, but it not only looks bad, it also doesn't seem terribly efficient (I have not evaluated performance yet since I know there must be a better way to do this).
public IEnumerable<Observation> AvailableObservations
{
get
{
foreach (var observation in db.Observations)
{
if (Observations.Any(x => x.Id == observation.Id))
{
}
else
{
yield return observation;
}
}
}
}
Essentially, I want everything in the list db.Observations (which is pulled from the db via EF6), minus all the entries currently selected in this.Observations, which is an ICollection<Observation>.
I've tried using .Except(this.Observations) but get an error that I believe might be related to using except with an ICollection on an entity that is an IEnumerable.
Anything that will remove the foreach loop would be a good start.
Well your loop is equivalent to:
return db.Observations.Where(o => !Observations.Any(oo => oo.Id == o.Id));
but that's no more efficient than what you have.
A more efficient method would be to create a HashSet of IDs and filter off of that:
HashSet<int> ids = new HashSet<int>(Observations.Select(o => o.Id));
return db.Observations.Where(o => !ids.Contains(o.Id));
That way you traverse Observations only once to build a HashSet that can then be searched in O(1) time.
You can do two optimizations here:
Limit the number of observations you fetch from the database
Make the lookup of the selected observation IDs quicker
The current implementation has a O(N * M) complexity, where N is the number of items in db.Observations and M is the number of items in this.Observations.
What would help performance would be to first create a HashSet of the IDs in this.Observations:
var observationIds = new HashSet<int>(this.Observations.Select(x => x.Id));
This will allow you to do quick lookups on the IDs.
Combine this with a where clause (using LINQ's Where()) to get an efficient query:
public IEnumerable<Observation> AvailableObservations
{
get
{
var observationIds = new HashSet<int>(this.Observations.Select(x => x.Id));
return db.Observations.Where(x => !observationIds.Contains(x.Id));
}
}
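The O(N * M) estimate above can be checked by counting comparisons. The sketch below assumes C# 9 or later (top-level statements) and uses plain in-memory lists of ints as hypothetical stand-ins for db.Observations and this.Observations; it says nothing about how EF6 translates the query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins for db.Observations and this.Observations.
// The id ranges do not overlap, so Any never exits early (worst case).
List<int> allIds = Enumerable.Range(0, 1000).ToList();         // N = 1000
List<int> selectedIds = Enumerable.Range(1000, 1000).ToList(); // M = 1000

long comparisons = 0;
var slow = allIds
    .Where(id => !selectedIds.Any(s => { comparisons++; return s == id; }))
    .ToList();

// Build the set once (M insertions), then do one lookup per element (N lookups).
var selectedSet = new HashSet<int>(selectedIds);
long lookups = 0;
var fast = allIds
    .Where(id => { lookups++; return !selectedSet.Contains(id); })
    .ToList();

Console.WriteLine(comparisons);              // 1000000
Console.WriteLine(lookups);                  // 1000
Console.WriteLine(slow.SequenceEqual(fast)); // True
```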

Adding IEnumerable<Type> to IList<Type> where IList doesn't contain primary key - LAMBDA

I have an IList<Price> SelectedPrices. I also have an IEnumerable<Price> that gets retrieved at a later date. I would like to add everything from the latter to the former where the former does NOT contain the primary key defined in the latter. So for instance:
IList<Price> contains Price.ID = 1, Price.ID = 2, and IEnumerable<Price> contains Price.ID = 2, Price.ID = 3, and Price.ID = 4. What's the easiest way to use a lambda to add those items so that I end up with the IList containing 4 unique Prices? I know I have to call ToList() on the IList to get access to the AddRange() method so that I can add multiple items at once, but how do I select only the items that DON'T exist in that list from the enumerable?
"I know I have to call ToList() on the IList to get access to the AddRange() method"
This is actually not safe. This will create a new List<T>, so you won't add the items to your original IList<T>. You'll need to add them one at a time.
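A short sketch of why this goes wrong, assuming C# 9 or later (top-level statements) and using integer ids as a hypothetical stand-in for Price objects:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in: integer ids instead of Price objects.
IList<int> selectedPrices = new List<int> { 1, 2 };

// ToList() returns a new List<int>; AddRange changes only that copy.
List<int> copy = selectedPrices.ToList();
copy.AddRange(new[] { 3, 4 });

Console.WriteLine(selectedPrices.Count); // 2
Console.WriteLine(copy.Count);           // 4
```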
The simplest option is just to loop and check for an existing ID with Any:
var itemsToAdd = enumerablePrices.Where(p => !SelectedPrices.Any(sel => sel.ID == p.ID));
foreach(var item in itemsToAdd)
{
SelectedPrices.Add(item);
}
However, this is going to be quadratic in nature, so if the collections are very large, it may be slow. Depending on how large the collections are, it might actually be better to build a set of the IDs in advance:
var existing = new HashSet<int>(SelectedPrices.Select(p => p.ID));
var itemsToAdd = enumerablePrices.Where(p => !existing.Contains(p.ID));
foreach(var item in itemsToAdd)
{
SelectedPrices.Add(item);
}
This will prevent the routine from going quadratic if your collection (SelectedPrices) is large.
You can try this:
var newPrices = prices.Where(p => !SelectedPrices.Any(sp => sp.ID == p.ID));
foreach(var p in newPrices)
SelectedPrices.Add(p);
"I know I have to call ToList() on the IList to get access to the AddRange() method so that I can add multiple items at once"
ToList will create a new instance of List<Price>, so you will be modifying another list, not the original one... No, you need to add the items one by one.
Try yourEnumerable.Where(x => !yourList.Any(y => y.ID == x.ID)) for the selection part of your question.
If you want to add new elements to the existing list in the most performant way, you should probably do it the conventional way, like this:
IList<Price> selectedPrices = ...;
IEnumerable<Price> additionalPrices = ...;
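// Index the existing prices by ID for O(1) lookups.
IDictionary<int, Price> pricesById = new Dictionary<int, Price>();
foreach (var price in selectedPrices)
{
pricesById.Add(price.ID, price);
}
foreach (var price in additionalPrices)
{
if (!pricesById.ContainsKey(price.ID))
{
selectedPrices.Add(price);
// Keep the index in sync so duplicates within additionalPrices are skipped too.
pricesById.Add(price.ID, price);
}
}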
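If only the IDs are needed for the lookup, a HashSet<int> can replace the dictionary, and the return value of Add can serve as the membership test. This is a sketch under assumptions: C# 9 or later (top-level statements), tuples as hypothetical stand-ins for Price, and made-up amounts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins: tuples instead of Price objects.
IList<(int ID, decimal Amount)> selectedPrices = new List<(int ID, decimal Amount)>
{
    (1, 9.99m),
    (2, 19.99m),
};
IEnumerable<(int ID, decimal Amount)> additionalPrices = new (int ID, decimal Amount)[]
{
    (2, 19.99m), (3, 29.99m), (4, 39.99m), (3, 29.99m),
};

var seenIds = new HashSet<int>(selectedPrices.Select(p => p.ID));

foreach (var price in additionalPrices)
{
    // Add returns true only the first time an ID is seen.
    if (seenIds.Add(price.ID))
    {
        selectedPrices.Add(price);
    }
}

Console.WriteLine(string.Join(", ", selectedPrices.Select(p => p.ID))); // 1, 2, 3, 4
```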
