I have two lists of objects: one big list, let's call it List1, and another small list, List2.
I need to update values in List1 with values from List2, based on a condition defined in a function that returns a boolean computed from the values in the objects.
I have come up with the following implementation, which takes a really long time for larger lists.
The function that checks whether an item should be updated:
private static bool CheckMatch(Item item1, Item item2) {
//do some stuff here and return a boolean
}
The query I'm using to update the items. In the snippet below, I need to update List1 (the larger list) with some values from List2 (the small list):
foreach (var item1 in List1)
{
    var matchingItems = List2.Where(item2 => CheckMatch(item1, item2));
    if (matchingItems.Any())
    {
        item1.IsExclude = matchingItems.First().IsExcluded;
        item1.IsInclude = matchingItems.First().IsIncluded;
        item1.Category = matchingItems.First().Category;
    }
}
I'm hoping I will get a solution that is much better than this. I also need to maintain the position of elements in List1
Here is a sample of what I'm doing.
As LP13's answer points out, you're doing a large amount of re-computation by re-executing a query instead of executing it once and caching the result.
But the larger problem here is that if you have n items in List1 and m potential matches in List2, and you are looking for any match, then in the worst case you will do n * m match checks. If n and m are large, their product is very large. And since we're looking for any match, the worst case is when there is no match; you'll definitely try all m possibilities.
Is this cost avoidable? Maybe, but only if we know some trick to take advantage of, and you've made the problem so abstract -- we have two lists and a relation, and no information about either the lists or the relation -- that there is no structure that we can take advantage of.
That said: if you happen to know that there is an element in List2 that is likely to match many items in List1 then put that element first. Any, or FirstOrDefault, will stop executing the Where query after getting the first match, so you can turn an O(n * m) problem into an O(n) problem.
Without knowing more about what the relation is, it's hard to say how to improve the performance.
UPDATE: A commenter points out that we can do better if we know that the relation is an equivalence relation. Is it an equivalence relation? That is, suppose we have your method that checks two items. Are we guaranteed the following?
The relation is reflexive: CheckMatch(a, a) is always true.
The relation is symmetric: CheckMatch(a, b) is always the same as CheckMatch(b, a).
The relation is transitive: if CheckMatch(a, b) is true and CheckMatch(b, c) is true, then CheckMatch(a, c) is always true.
If we have those three conditions then you can do considerably better. Such a relation partitions elements into equivalence classes. What you do is associate each item in List1 and List2 with a canonical value that is the same for every member of the equivalence class, and build a dictionary keyed on that canonical value. From that dictionary you can then do fast lookups and solve your problem quickly.
But if your relation is not an equivalence relation, this does not work.
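To illustrate, here is a minimal sketch of the equivalence-class approach, assuming a hypothetical Canonicalize method that returns the same key for every member of a class:

// Sketch only: Canonicalize is hypothetical; it must satisfy
// CheckMatch(a, b) == (Canonicalize(a) == Canonicalize(b)).
var canonical = new Dictionary<string, Item>();
foreach (var item2 in List2)
    canonical[Canonicalize(item2)] = item2; // one representative per equivalence class

foreach (var item1 in List1)
{
    if (canonical.TryGetValue(Canonicalize(item1), out var match))
    {
        item1.IsExclude = match.IsExcluded;
        item1.IsInclude = match.IsIncluded;
        item1.Category = match.Category;
    }
}

This does O(m) work to build the dictionary and n O(1) lookups, for O(n + m) overall instead of O(n * m).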
Can you try this? When you use only .Where, it produces a lazily-evaluated IEnumerable, and every subsequent call to Any() and First() on that IEnumerable re-executes the query. Calling FirstOrDefault() once runs it a single time:
foreach (var item1 in List1)
{
    var matchingItem = List2.Where(item2 => CheckMatch(item1, item2)).FirstOrDefault();
    if (matchingItem != null)
    {
        item1.IsExclude = matchingItem.IsExcluded;
        item1.IsInclude = matchingItem.IsIncluded;
        item1.Category = matchingItem.Category;
    }
}
Related
Please let me know if this is already answered somewhere; I can't find it.
In memory, I have a collection of objects in firstList and related objects in secondList. I want to find all items from firstList whose Id matches a secondList item's RelatedId. So it's fairly straightforward:
var items = firstList.Where(item => secondList.Any(secondItem => item.Id == secondItem.RelatedId));
But when both first and second lists get even moderately larger, this call is extremely expensive and takes many seconds to complete. Is there a way to break this up, or redesign it into multiple queries to make it more efficient?
The reason this code is so inefficient is that for every element of the first list, it has to iterate over (in the worst case) the entirety of the second list to look for an item with matching id.
A more efficient way to do this using LINQ would be using the Join method as follows:
var items = firstList.Join(secondList, item => item.Id, secondItem => secondItem.RelatedId, (item, _) => item);
If the second collection may contain duplicate IDs, you will additionally have to run Distinct() (possibly with some changes depending on the equality semantics for the members of the first list) on the result to maintain the semantics of your original code.
This code resulted in a roughly 100x speedup for me with a test using two lists of 10000 elements each.
If you're running this operation often and one of the collections does not change, you could consider caching the Ids or RelatedIds in a HashSet instead.
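For example, a sketch of that caching, assuming the IDs are ints and secondList is the collection that doesn't change:

// Build the set once: O(m). Each Contains check is then O(1) on average.
var relatedIds = new HashSet<int>(secondList.Select(s => s.RelatedId));
var items = firstList.Where(item => relatedIds.Contains(item.Id)).ToList();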
I'd like to know if there's a more efficient way to get an ordered list of groups by value from an initially unordered list, than using GroupBy() followed by OrderBy(), like this:
List<int> list = new List<int>();
IEnumerable<IEnumerable<int>> orderedGroups = list.GroupBy(x => x).OrderBy(x => x.Key);
For more detail, I have a large List<T> which I'd like to sort, however there are lots of duplicate values so I want to return the results as IEnumerable<IEnumerable<T>>, much as GroupBy() returns an IEnumerable of groups. If I use OrderBy(), I just get IEnumerable<T>, with no easy way to know whether the value has changed from one item to the next. I could group the list then sort the groups, but the list is large so this ends up being slow. Since OrderBy() returns an OrderedEnumerable which can then be sorted on a secondary field using ThenBy(), it must internally distinguish between adjacent items with the same or different values.
Is there any way I can make use of the fact that OrderedEnumerable<T> must internally group its results by value (in order to facilitate ThenBy()), or otherwise what's the most efficient way to use LINQ to get an ordered list of groups?
You can use ToLookup, which returns an IEnumerable<IGrouping<TKey, TElement>>, and then do an OrderBy over the values of each key on demand. This will be O(n) to create the lookup and O(h log h) to sort the elements under a given group (the values for one key), where h is the number of elements in that group.
You can improve the performance to amortized O(n) by using an IDictionary<TKey, IOrderedEnumerable<T>>. But if you want to order by multiple properties, it will again be O(h) per group. See this answer for more info on IOrderedEnumerable. You can also use SortedList<TKey, TValue> instead of IOrderedEnumerable.
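For concreteness, a minimal sketch of the ToLookup approach with the int list from the question:

// O(n) to build the lookup; each group is only materialized/ordered when you need it.
ILookup<int, int> lookup = list.ToLookup(x => x);
foreach (int key in lookup.Select(g => g.Key).OrderBy(k => k))
{
    IEnumerable<int> valuesForKey = lookup[key]; // on-demand access to one group
    // ... process valuesForKey ...
}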
[Update]:
Here is another answer you can take a look at. But again, it involves doing an OrderBy on top of the result.
Further, you could come up with your own data structure, as I don't see any data structure in the BCL meeting this requirement.
One possible implementation:
You can have a binary search tree, which does search/delete/insert in O(log N) on average. Doing an in-order traversal will give you the keys in sorted order. Each node in the tree holds an ordered collection for the values.
A node looks roughly like this:
public class MyNode
{
    public string Key { get; set; }
    public SortedSet<int> Values { get; set; } // any sorted collection of the values works here
}
You can traverse the initial collection once and create this special data structure, which can then be queried to get fast results.
[Update 2]:
If your possible keys number below 100k or so, then I feel implementing your own data structure is overkill; an OrderBy over that many groups generally returns quickly and the time taken is tiny. Unless you have large data and you order it many times, ToLookup should work fairly well.
Honestly, you're not going to do much better than
items.GroupBy(i => i.KeyProperty).OrderBy(g => g.Key);
GroupBy is an O(n) operation. The OrderBy is then O(k log k) where k is the number of groups.
If you call OrderBy first... well, firstly, your O(n log n) is now in your number of items rather than your number of groups, so it's already slower than the above.
And secondly, an IOrderedEnumerable doesn't have the internal magic you think it does. It isn't an ordered sequence that contains groups of same-ordered items which can then be reordered with ThenBy; it's an unordered sequence with a list of sort keys which ThenBy adds to, and which is eventually ordered by each of those keys when you iterate over it.
You may be able to eke out a little more speed by rolling your own "group and sort" loop, maybe manually adding to a SortedDictionary<TKey, IList<TItem>>, but I don't think you're going to get a better big O than what out-of-the-box LINQ gets you.
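For what it's worth, a sketch of that hand-rolled loop, using KeyProperty from the snippet above (the Item element type is an assumed name):

// One pass to group; SortedDictionary keeps keys ordered, so insertion is O(n log k) overall.
var groups = new SortedDictionary<int, List<Item>>();
foreach (var item in items)
{
    if (!groups.TryGetValue(item.KeyProperty, out var bucket))
        groups[item.KeyProperty] = bucket = new List<Item>();
    bucket.Add(item);
}
// Enumerating groups now yields each bucket in ascending key order.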
I think iterating through the list with a plain for loop while you populate a Dictionary<T, int>, where the value is the count of repeated elements, will be faster.
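A sketch of that counting pass, assuming you only need each distinct value and its count rather than the grouped elements themselves:

// Single pass over the list: O(n).
var counts = new Dictionary<int, int>();
foreach (int x in list)
    counts[x] = counts.TryGetValue(x, out int c) ? c + 1 : 1;
// counts.Keys.OrderBy(k => k) then yields the distinct values in order.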
I have two different lists of objects, one of them an IQueryable set (rolled up into an array) and the other a List set. Objects in both sets share a field called ID; each of the objects in the second set will match an object in the first set, but not necessarily vice versa. I need to be able to handle both groups (matched and unmatched). The size of both collections is between 300 and 350 objects in this case (for reference, the XML generated for the objects in the second set is usually no more than 7k, so think maybe half to two-thirds of that size for the actual memory used by each object in each set).
The way I have it currently set up is a for-loop that iterates through an array representation of the IQueryable set, using a LINQ statement to query the List set for the matching record. This takes too much time; I'm running a Core i7 with 10GB of RAM and it's taking anywhere from 10 seconds to 2.5 minutes to match and compare the objects. Task Manager doesn't show any huge memory usage--a shade under 25MB. None of my system threads are being taxed either.
Is there a method or algorithm that would allow me to pair up the objects in each set one time and thus iterate through the pairs and unmatched objects at a faster pace? This set of objects is just a small subset of the 8000+ this program will have to chew through each day once it goes live...
EDIT: Here's the code I'm actually running...
for (int i = 0; i < draftRecords.Count(); i++)
{
    sRecord record = (from r in sRecords
                      where r.id == draftRecords.ToArray()[i].ID
                      select r).FirstOrDefault();
    if (record != null)
    {
        // Do stuff with the draftRecords element based on the rest of the content of the sRecord object
    }
}
You should use a method such as Enumerable.Join or Enumerable.GroupJoin to match items from the two collections. This will be far faster than doing nested for loops.
Since you want to match each key from the first collection to an item in the second list that may or may not exist, GroupJoin is likely more appropriate. This would look something like:
var results = firstSet.GroupJoin(secondSet, f => f.Id, s => s.Id, (f, sset) => new { First = f, Seconds = sset });

foreach (var match in results)
{
    Console.WriteLine("Item {0} matches:", match.First);
    foreach (var second in match.Seconds)
        Console.WriteLine("  {0}", second); // each matching second item, one at a time
}
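Since GroupJoin yields an empty Seconds sequence when there is no match, both the matched and unmatched cases you need fall out of the same loop; a sketch:

foreach (var match in results)
{
    if (!match.Seconds.Any())
    {
        // Unmatched: no counterpart in secondSet.
        Console.WriteLine("Item {0} is unmatched", match.First);
        continue;
    }
    foreach (var second in match.Seconds)
        Console.WriteLine("Item {0} matches {1}", match.First, second);
}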
Your question is lacking in sample code/information, but I would personally look at using methods like Join, Intersect, or Contains. If necessary, use Select to do a projection of the fields you want to match, or define a custom IEqualityComparer.
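As an illustration, a minimal sketch of a custom IEqualityComparer keyed on the shared ID field (the Record type name is assumed; the ID property comes from the question):

// Hypothetical element type with the shared ID field.
class RecordIdComparer : IEqualityComparer<Record>
{
    public bool Equals(Record x, Record y) => x.ID == y.ID;
    public int GetHashCode(Record obj) => obj.ID.GetHashCode();
}

// Usage: items present in both sets, compared by ID only.
var matched = firstSet.Intersect(secondSet, new RecordIdComparer());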
Hey Linq experts out there,
I just asked a very similar question and know the solution is probably SUPER easy, but still find myself not being able to wrap my head around how to do this fairly simple task in the most efficient manner using linq.
My basic scenario is that I have a list of values, for example, say:
Lst1:
a
a
b
b
c
b
a
c
a
And I want to create a new list that will hold all the indexes from Lst1 where, say, the value = "a".
So, in this example, we would have:
LstIndexes:
0
1
6
8
Now, I know I can do this with loops (which I would rather avoid in favor of LINQ), and I even figured out how to do it with LINQ in the following way:
LstIndexes = Lst1.Select(Function(item As String, index As Integer) index) _
                 .Where(Function(index As Integer) Lst1(index) = "a").ToList
My challenge with this is that it iterates over the list twice and is therefore inefficient.
How can I get my result in the most efficient way using Linq?
Thanks!!!!
First off, your code doesn't actually iterate over the list twice; it only iterates over it once.
That said, your Select is really just getting a sequence of all of the indexes; that is more easily done with Enumerable.Range:
var result = Enumerable.Range(0, lst1.Count)
.Where(i => lst1[i] == "a")
.ToList();
Understanding why the list isn't actually iterated twice will take some getting used to. I'll try to give a basic explanation.
You should think of most of the LINQ methods, such as Select and Where, as a pipeline. Each method does some tiny bit of work. In the case of Select you give it a method, and it essentially says, "Whenever someone asks me for my next item I'll first ask my input sequence for an item, then use the method I have to convert it into something else, and then give that item to whoever is using me." Where, more or less, is saying, "Whenever someone asks me for an item I'll ask my input sequence for an item; if the function says it's good I'll pass it on, if not I'll keep asking for items until I get one that passes."
So when you chain them, what happens is: ToList asks Where for its first item, Where asks Select for its first item, and Select asks the list for its first item. The list provides its first item. Select then transforms that item into what it needs to spit out (in this case, just the int 0) and gives it to Where. Where takes that item, runs its function, determines that it's true, and so spits out 0 to ToList, which adds it to the list. That whole thing then happens nine more times. This means that Select will end up asking for each item from the list exactly once, feeding each of its results directly to Where, which feeds the results that "pass the test" directly to ToList, which stores them in a list. All of the LINQ methods are carefully designed to only ever iterate the source sequence once (when they are iterated once).
Note that, while this all seems complicated at first, it's actually pretty easy for the computer to do, and it's not as performance intensive as it may seem.
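If you want to see the single pass for yourself, here's a toy sketch that logs from inside the lambdas:

var source = new List<string> { "a", "b", "a" };
var indexes = source
    .Select((item, index) => { Console.WriteLine("Select produced " + index); return index; })
    .Where(i => { Console.WriteLine("Where tested " + i); return source[i] == "a"; })
    .ToList();
// "Select produced" prints exactly once per element: the list is enumerated a single
// time, with each index flowing straight through Where into the final list.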
This works too, but is arguably not as neat:
var result = list1.Select((x, i) => new {x, i})
.Where(x => x.x == "a")
.Select(x => x.i);
How about this one? It works fine for me.
static void Main(string[] args)
{
    List<char> Lst1 = new List<char> { 'a', 'a', 'b', 'b', 'c', 'b', 'a', 'c', 'a' };

    var result = Lst1.Select((c, i) => new { character = c, index = i })
                     .Where(x => x.character == 'a')
                     .ToList();
    // result holds { character, index } pairs; append .Select(x => x.index)
    // if you want just the indexes.
}
At the moment I am using a custom class derived from HashSet. There's a point in the code where I select items under a certain condition:
var c = clusters.Where(x => x.Label != null && x.Label.Equals(someLabel));
It works fine and I get those elements. But is there a way to receive the index of such an element within the collection, to use with the ElementAt method, instead of the whole object?
It would look more or less like this:
var c = select element index in collection under certain condition;
int index = c.ElementAt(0); //get first index
clusters.ElementAt(index).RunObjectMthod();
Is manually iterating over the whole collection a better way? I need to add that it's in a bigger loop, so this Where clause is performed multiple times for different someLabel strings.
Edit
What do I need this for? clusters is a set of clusters over a collection of documents. Documents are grouped into clusters by topic similarity. One of the last steps of the algorithm is to discover a label for each cluster. But the algorithm is not perfect, and sometimes it produces two or more clusters with the same label. What I want to do is simply merge those clusters into one big one.
Sets don't generally have indexes. If position is important to you, you should be using a List<T> instead of (or possibly as well as) a set.
Now SortedSet<T> in .NET 4 is slightly different, in that it maintains a sorted value order. However, it still doesn't implement IList<T>, so access by index with ElementAt is going to be slow.
If you could give more details about why you want this functionality, it would help. Your use case isn't really clear at the moment.
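If you do go the List<T> route, a sketch of finding the index directly, using the names from the question:

// Fix an ordering once; FindIndex does a single linear scan.
var clusterList = clusters.ToList();
int index = clusterList.FindIndex(x => x.Label != null && x.Label.Equals(someLabel));
if (index >= 0)
    clusterList[index].RunObjectMthod(); // method name as in the question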
In the case where you hold elements in a HashSet and sometimes need to get elements by index, consider using the ToList() extension method in such situations. That way you keep the features of HashSet and can still take advantage of indexes:
HashSet<T> hashset = new HashSet<T>();

// The special situation where we need index-based access to the elements:
List<T> list = hashset.ToList();

// Do our special job, for example mapping the elements to an EF entities collection (that was my case).
// We can still operate on the hashset, for example when we want to keep the elements unique.
There's no such thing as an index with a hash set. One of the ways hash sets gain efficiency in some cases is by not having to maintain one.
I also don't see what the advantage would be here. If you were to obtain the index and then use it, that would be less efficient than just obtaining the element (obtaining the index would be equally expensive, and then you've added an extra operation).
If you want to do several operations on the same object, just hold onto that object.
If you want to do something with several objects, do so by iterating through them (a normal foreach, or foreach over the results of a Where(), etc.). If you want to do something with several objects, then do something else with those same objects, and you have to work in such batches rather than doing all the operations in the same foreach, then store the results of the Where() in a List<T>.
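A sketch of that batching, with the Cluster element type name assumed:

// Materialize once so both passes reuse the same matched objects
// without re-running the Where query.
List<Cluster> matches = clusters.Where(x => x.Label != null && x.Label.Equals(someLabel)).ToList();
foreach (var m in matches) { /* first batch operation */ }
foreach (var m in matches) { /* second batch operation */ }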
Why not use a dictionary?
Dictionary<string, int> dic = new Dictionary<string, int>();
for (int i = 0; i < 10; i++)
{
    dic.Add("value " + i, dic.Count + 1); // stores the 1-based position of each key
}

string find = "value 3";
int position = dic[find];
Console.WriteLine("the position of " + find + " is " + position);