LINQ to find collection items from second collection very slow - C#

Please let me know if this is already answered somewhere; I can't find it.
In memory, I have a collection of objects in firstList and related objects in secondList. I want to find all items from firstList whose Id matches the RelatedId of some item in secondList. The query is fairly straightforward:
var items = firstList.Where(item => secondList.Any(secondItem => item.Id == secondItem.RelatedId));
But when both the first and second lists get even moderately large, this call is extremely expensive and takes many seconds to complete. Is there a way to break this up, or redesign it into multiple queries, to make it more efficient?

The reason this code is so inefficient is that, for every element of the first list, it has to iterate over (in the worst case) the entire second list looking for an item with a matching Id.
A more efficient way to do this with LINQ is the Join method:
var items = firstList.Join(secondList, item => item.Id, secondItem => secondItem.RelatedId, (item, _) => item);
If the second collection may contain duplicate IDs, you will additionally have to run Distinct() (possibly with some changes depending on the equality semantics for the members of the first list) on the result to maintain the semantics of your original code.
This code resulted in a roughly 100x speedup for me with a test using two lists of 10000 elements each.
If you're running this operation often and one of the collections does not change, you could consider caching the Ids or RelatedIds in a HashSet instead.
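For example, a minimal sketch of the HashSet approach (assuming Id and RelatedId are ints and the usual System.Linq extensions are in scope):

// Build the set of RelatedIds once; each lookup is then O(1) on average.
var relatedIds = new HashSet<int>(secondList.Select(s => s.RelatedId));

var items = firstList.Where(item => relatedIds.Contains(item.Id));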

Related

How to update a list with another list efficiently in C#

I have two lists of objects: one big list, let's call it List1, and another small list, List2.
I need to update values in List1 with values from List2, based on a condition defined in a function that returns a boolean from the values in the two objects.
I have come up with the following implementation, which takes a very long time for larger lists.
The function that checks whether an item should be updated:
private static bool CheckMatch(Item item1, Item item2) {
    // do some stuff here and return a boolean
}
The query I'm using to update the items. In the snippet below, I need to update List1 (the larger list) with some values from List2 (the small list):
foreach (var item1 in List1)
{
    var matchingItems = List2.Where(item2 => CheckMatch(item1, item2));
    if (matchingItems.Any())
    {
        item1.IsExclude = matchingItems.First().IsExcluded;
        item1.IsInclude = matchingItems.First().IsIncluded;
        item1.Category = matchingItems.First().Category;
    }
}
I'm hoping for a solution that is much better than this. I also need to maintain the position of elements in List1.
Here is a sample of what I'm doing.
As LP13's answer points out, you're doing a large amount of re-computation by re-executing a query instead of executing it once and caching the result.
But the larger problem here is that if you have n items in List1 and m potential matches in List2, and you are looking for any match, then in the worst case you will definitely do n * m matches. If n and m are large, their product is rather large. And since we're looking for any match, the worst case is when there is no match; you'll definitely try all m possibilities.
Is this cost avoidable? Maybe, but only if we know some trick to take advantage of, and you've made the problem so abstract -- we have two lists and a relation, and no information about either the lists or the relation -- that there is no structure that we can take advantage of.
That said: if you happen to know that there is an element in List2 that is likely to match many items in List1 then put that element first. Any, or FirstOrDefault, will stop executing the Where query after getting the first match, so you can turn an O(n * m) problem into an O(n) problem.
Without knowing more about what the relation is, it's hard to say how to improve the performance.
UPDATE: A commenter points out that we can do better if we know that the relation is an equivalence relation. Is it an equivalence relation? That is, suppose we have your method that checks two items. Are we guaranteed the following?
The relation is reflexive: CheckMatch(a, a) is always true.
The relation is symmetric: CheckMatch(a, b) is always the same as CheckMatch(b, a)
The relation is transitive: if CheckMatch(a, b) is true and CheckMatch(b, c) is true then CheckMatch(a, c) is always true
If we have those three conditions then you can do considerably better. Such a relation partitions elements into equivalence classes. What you do is associate each item in List1 and List2 with a canonical value. That canonical value is the same for every member of the equivalence class. From that dictionary you can then do fast lookups and solve your problem quickly.
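As an illustration of that idea, here is a rough sketch. It assumes a hypothetical GetCanonicalValue(item) that returns the same key for every member of an equivalence class (for example, a normalized string); nothing like it exists in your code, it is only to show the shape of the solution:

// Group List2 by canonical value once; each lookup below is then O(1) on average.
var byCanonical = List2
    .GroupBy(item2 => GetCanonicalValue(item2))
    .ToDictionary(g => g.Key, g => g.First());

foreach (var item1 in List1)
{
    if (byCanonical.TryGetValue(GetCanonicalValue(item1), out var match))
    {
        item1.IsExclude = match.IsExcluded;
        item1.IsInclude = match.IsIncluded;
        item1.Category = match.Category;
    }
}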
But if your relation is not an equivalence relation, this does not work.
Can you try this? .Where by itself only produces a lazy IEnumerable, and you are then calling Any() and First() on it, each of which re-runs the query.
foreach (var item1 in List1)
{
    var matchingItem = List2.Where(item2 => CheckMatch(item1, item2)).FirstOrDefault();
    if (matchingItem != null)
    {
        item1.IsExclude = matchingItem.IsExcluded;
        item1.IsInclude = matchingItem.IsIncluded;
        item1.Category = matchingItem.Category;
    }
}

How to filter a list by another list

I have two lists:
List<myObject> mainList;
And
List<myObject> blackList;
I'm trying to build a new list from mainList based on a condition, plus the additional condition that the elements must not be in the blacklist.
Here is my attempt :
List<myObject> newList = mainList.Where(x => x.Id == 5 && !blackList.Contains(x)).ToList();
This newList is generated inside a loop. In the first iteration of the loop, blackList is empty and it works; in the second iteration, blackList contains about 200k elements, and when the line above runs it doesn't move on, it sits there for minutes. How can I do the filtering more efficiently so that I don't get elements which are in the blackList? Thanks.
The problem you're facing is due to the way List<T> implements Contains: it searches linearly through the list, which is slow and inefficient for long lists.
To get better performance you could use a better data structure for the blacklist, one with a much faster Contains for large collections, such as a HashSet<T>:
var blackListSet = new HashSet<myObject>(blackList);
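A sketch of the full filter with that change (assuming myObject overrides Equals/GetHashCode, or that reference equality is what you want, since HashSet<T> relies on them for membership tests):

List<myObject> newList = mainList
    .Where(x => x.Id == 5 && !blackListSet.Contains(x))   // Contains is now O(1) on average
    .ToList();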

Most efficient collection for storing data from LINQ to Entities?

I have read several different sources over the years that indicate that when storing a collection of data, a List<T> is efficient when you want to insert objects, and an IEnumerable<T> is best for enumerating over a collection.
In LINQ-to-Entities, there is the AsEnumerable() function, that will return an IEnumerable<T>, but it will not resolve the SQL created by the LINQ statement until you start enumerating over the list.
What if I want to store objects from LINQ to Entities in a collection and then query on that collection later?
Using this strategy causes the SQL to be resolved by adding a WHERE clause and querying each record separately. I specifically don't want to do that because I'm trying to limit network chatter:
var myDataToLookup = context.MyData.AsEnumerable();
foreach (var myOtherDatum in myOtherDataList)
{
    // gets a single record from the database each time
    var myDatum = myDataToLookup.SingleOrDefault(w => w.key == myOtherDatum.key);
}
How do I resolve the SQL upfront so myDataToLookup actually contains the data in memory? I've tried ToArray:
var myDataToLookup = context.MyData.ToArray();
But I recently learned that it actually uses more memory than ToList does:
Is it better to call ToList() or ToArray() in LINQ queries?
Should I use a join instead?
var myCombinedData = from o in myOtherDataList
                     join d in myDataToLookup on o.key equals d.key
                     select new { myOtherData = o, myData = d };
Should I use ToDictionary and store my key as the key to the dictionary? Or am I worrying too much about this?
If you're using LINQ to Entities then you should not worry if ToArray is slower than ToList. There is almost no difference between them in terms of performance and LINQ to Entities itself will be a bottleneck anyway.
Regarding a dictionary: it is a structure optimized for reads by key. There is an additional cost to adding new items, though. So if you will read by key a lot and add new items only occasionally, that's the way to go. But to be honest, you probably should not bother at all; if the data set is not big, you won't see a difference.
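If the lookups in your loop are by that key, a sketch of the dictionary approach (assuming key is unique in MyData; otherwise use ToLookup):

// One round trip: materialize the data, then build an O(1) lookup by key.
var myDataByKey = context.MyData.ToDictionary(d => d.key);

foreach (var myOtherDatum in myOtherDataList)
{
    if (myDataByKey.TryGetValue(myOtherDatum.key, out var myDatum))
    {
        // work with myDatum in memory; no further database calls
    }
}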
Think of IEnumerable, ICollection and IList/IDictionary as a hierarchy, each one building on the previous. Put simply, IEnumerable gives you iteration only; ICollection adds counting; IList then gives richer functionality, including finding, adding and removing elements by index or via lambda expressions. Dictionaries provide efficient access via a key. Arrays are much more static, adding a level of restriction on top of lists.
So, the answer then depends on your requirements. If it is appropriate to hold the data in memory and you need to frequently re-query it then I usually convert the Entity result to a List. This also loads the data.
If access via a set of keys is paramount then I use a Dictionary.
I cannot remember the last time I used an array except for infrequent and very specific purposes.
So, not a direct answer, but as your question and the other replies indicate, there isn't a single answer and the solution will be a compromise.
When I code and measure both performance and the data carried over the network, here is how I look at things, based on your example above.
Let's say your result returns 100 records. Your code has now run a query on the server and performed 1 second of processing (I made the number up for the sake of argument).
Then you need to cast it to a list, which is going to be 1 more second of processing. Then you want to find all records that have a value of 1. The code will now loop through the entire list to find the values with 1 and then return you the result. Let's say that's another 1 second of processing, and it finds 10 records.
Your network is going to carry over 10 records that took 3 seconds to process.
If you move your logic to your data layer and make your query search right away for the records that you want, you save 2 seconds of processing and still carry only 10 records across the network. The bonus is also that you can just use IEnumerable<T> as the result and not have to cast it to a list, thus eliminating the 1 second of casting to a list and the 1 second of iterating through the list.
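For example, a sketch of pushing the filter into the query itself (the Value property and the value 1 are just the made-up numbers from above; with LINQ to Entities the Where is translated to SQL, so only the matching rows cross the network):

// Filtering happens on the server; only the ~10 matching rows are returned.
var matches = context.MyData.Where(d => d.Value == 1);

foreach (var match in matches)
{
    // process each matching record
}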
I hope this helps answer your question.

Best way to get an ordered list of groups by value from an unordered list

I'd like to know if there's a more efficient way to get an ordered list of groups by value from an initially unordered list, than using GroupBy() followed by OrderBy(), like this:
List<int> list = new List<int>();
IEnumerable<IEnumerable<int>> orderedGroups = list.GroupBy(x => x).OrderBy(x => x.Key);
For more detail, I have a large List<T> which I'd like to sort, however there are lots of duplicate values so I want to return the results as IEnumerable<IEnumerable<T>>, much as GroupBy() returns an IEnumerable of groups. If I use OrderBy(), I just get IEnumerable<T>, with no easy way to know whether the value has changed from one item to the next. I could group the list then sort the groups, but the list is large so this ends up being slow. Since OrderBy() returns an OrderedEnumerable which can then be sorted on a secondary field using ThenBy(), it must internally distinguish between adjacent items with the same or different values.
Is there any way I can make use of the fact that OrderedEnumerable<T> must internally group its results by value (in order to facilitate ThenBy()), or otherwise what's the most efficient way to use LINQ to get an ordered list of groups?
You can use ToLookup, which returns an IEnumerable<IGrouping<TKey, TElement>>, and then do the OrderBy for the values of each key on demand. This will be O(n) to create the lookup and O(h log h) to order the elements under each group (the values for a key), where h is the number of elements under a group.
You can improve the performance to amortized O(n) by using an IDictionary<TKey, IOrderedEnumerable<T>>. But if you want to order by multiple properties, it will again be O(h log h) per group. See this answer for more info on IOrderedEnumerable. You can also use SortedList<TKey, TValue> instead of IOrderedEnumerable.
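A minimal sketch of that approach for the int example from the question (the keys are ordered once; each lookup[key] is already the group for that key):

ILookup<int, int> lookup = list.ToLookup(x => x);

IEnumerable<IEnumerable<int>> orderedGroups =
    lookup.Select(g => g.Key)   // one entry per distinct value
          .OrderBy(k => k)      // sort only the distinct keys
          .Select(k => lookup[k]);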
[Update]:
Here is another answer you could take a look at. But again, it involves doing an OrderBy on top of the result.
Further, you could come up with your own data structure, as I don't see any data structure in the BCL meeting this requirement.
One possible implementation:
You can have a binary search tree, which does search/delete/insert in O(log n) on average. An in-order traversal will give you the keys in sorted order. Each node in the tree holds an ordered collection of the values for that key.
A node would look roughly like this:
public class MyNode
{
    public string Key { get; set; }

    // the values for this key, kept in sorted order
    public List<int> Values { get; set; }
}
You can traverse over the initial collection once and create this special data structure which can be queried to get fast results.
[Update 2]:
If the number of possible keys is below 100k, then I feel implementing your own data structure is overkill. Generally the OrderBy will return pretty fast and the time taken is tiny. Unless you have large data and you order by multiple times, ToLookup should work fairly well.
Honestly, you're not going to do much better than
items.GroupBy(i => i.KeyProperty).OrderBy(g => g.Key);
GroupBy is an O(n) operation. The OrderBy is then O(k log k) where k is the number of groups.
If you call OrderBy first... well, firstly, your O(n log n) is now in your number of items rather than your number of groups, so it's already slower than the above.
And secondly, an IOrderedEnumerable doesn't have the internal magic you think it does. It isn't an ordered sequence that contains groups of same-ordered items which can then be reordered with ThenBy; it's an unordered sequence with a list of sort keys which ThenBy adds to, and which is eventually ordered by each of those keys when you iterate over it.
You may be able to eke out a little more speed by rolling your own "group and sort" loop, maybe manually adding to a SortedDictionary<TKey, IList<TItem>>, but I don't think you're going to get a better big O than what out-of-the-box LINQ gets you.
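If you do want to try that, here is a rough sketch of such a loop as a small generic helper (TKey and TItem are placeholders; SortedDictionary keeps its keys ordered, so enumerating the result yields the buckets in key order):

static SortedDictionary<TKey, List<TItem>> GroupAndSort<TKey, TItem>(
    IEnumerable<TItem> items, Func<TItem, TKey> keySelector)
    where TKey : IComparable<TKey>
{
    var groups = new SortedDictionary<TKey, List<TItem>>();
    foreach (var item in items)
    {
        var key = keySelector(item);
        if (!groups.TryGetValue(key, out var bucket))
        {
            bucket = new List<TItem>();
            groups.Add(key, bucket);
        }
        bucket.Add(item);   // one pass; the dictionary keeps the keys sorted
    }
    return groups;
}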
I think iterating through the list with a plain for loop while populating a Dictionary<T, int>, where the value is the count of repeated elements, will be faster.
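For the List<int> case in the question, a sketch of that counting idea (it only works when equal values carry no extra data, so a count per value is enough to reconstruct each group):

var counts = new Dictionary<int, int>();
foreach (var value in list)
{
    counts.TryGetValue(value, out var count);   // count is 0 if the value is new
    counts[value] = count + 1;
}

// Order the distinct values and expand each back into its "group".
IEnumerable<IEnumerable<int>> orderedGroups =
    counts.Keys.OrderBy(k => k).Select(k => Enumerable.Repeat(k, counts[k]));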

Converting IEnumerable<T> to List<T> on a LINQ result, huge performance loss

On a LINQ result like this:
var result = from x in Items select x;
List<T> list = result.ToList<T>();
However, the ToList<T> call is really slow. Does it make the list mutable, and is that why the conversion is slow?
In most cases I can manage with just my IEnumerable or a Parallel.DistinctQuery, but now I want to bind the items to a DataGridView, so I need something other than an IEnumerable. Any suggestions on how I can gain performance on ToList, or on a replacement for it?
On 10 million records in the IEnumerable, the .ToList<T> takes about 6 seconds.
.ToList() is slow in comparison to what?
If you are comparing
var result = from x in Items select x;
List<T> list = result.ToList<T>();
to
var result = from x in Items select x;
you should note that since the query is evaluated lazily, the first line doesn't do much at all. It doesn't retrieve any records. Deferred execution makes this comparison completely unfair.
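A small sketch of what that means in practice (hypothetical timing code, only to show where the work actually happens):

var sw = System.Diagnostics.Stopwatch.StartNew();

var result = from x in Items select x;       // builds a query object; nothing is enumerated yet
Console.WriteLine(sw.ElapsedMilliseconds);   // ~0 ms

sw.Restart();
var list = result.ToList();                  // enumeration happens here, so the cost shows up here
Console.WriteLine(sw.ElapsedMilliseconds);   // the "6 seconds" land on this line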
It's because LINQ likes to be lazy and do as little work as possible. This line:
var result = from x in Items select x;
despite your choice of name, isn't actually a result, it's just a query object. It doesn't fetch any data.
List<T> list = result.ToList<T>();
Now you've actually requested the result, hence it must fetch the data from the source and make a copy of it. ToList guarantees that a copy is made.
With that in mind, it's hardly surprising that the second line is much slower than the first.
No, it's not creating the list that takes time, it's fetching the data that takes time.
Your first code line doesn't actually fetch the data, it only sets up an IEnumerable that is capable of fetching the data. It's when you call the ToList method that it will actually get all the data, and that is why all the execution time is in the second line.
You should also consider if having ten million lines in a grid is useful at all. No user is ever going to look through all the lines, so there isn't really any point in getting them all. Perhaps you should offer a way to filter the result before getting any data at all.
I think it's because of memory reallocations: ToList cannot know the size of the collection beforehand, so it cannot allocate enough storage up front to keep all the items. Therefore it has to reallocate the List<T>'s internal array as it grows.
If you can estimate the size of your result set, it'll be much faster to preallocate enough elements using the List<T>(int) constructor overload, and then manually add items to it.
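A sketch of that idea (estimatedCount is a hypothetical estimate you would supply, and T stands for your element type):

var list = new List<T>(estimatedCount);   // allocate the backing array once
foreach (var item in result)
{
    list.Add(item);                       // no reallocations as long as the estimate holds
}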
