C# Matching items in different lists

I have two different lists of objects, one of them an IQueryable set (rolled up into an array) and the other a List set. Objects in both sets share a field called ID; each of the objects in the second set will match an object in the first set, but not necessarily vice versa. I need to be able to handle both groups (matched and unmatched). The size of both collections is between 300 and 350 objects in this case (for reference, the XML generated for the objects in the second set is usually no more than 7k, so think maybe half to two-thirds of that size for the actual memory used by each object in each set).
The way I have it currently set up is a for-loop that iterates through an array representation of the IQueryable set, using a LINQ statement to query the List set for the matching record. This takes too much time; I'm running a Core i7 with 10GB of RAM and it's taking anywhere from 10 seconds to 2.5 minutes to match and compare the objects. Task Manager doesn't show any huge memory usage--a shade under 25MB. None of my system threads are being taxed either.
Is there a method or algorithm that would allow me to pair up the objects in each set one time and thus iterate through the pairs and unmatched objects at a faster pace? This set of objects is just a small subset of the 8000+ this program will have to chew through each day once it goes live...
EDIT: Here's the code I'm actually running...
for (int i = 0; i < draftRecords.Count(); i++)
{
    sRecord record = (from r in sRecords
                      where r.id == draftRecords.ToArray()[i].ID
                      select r).FirstOrDefault();
    if (record != null)
    {
        // Do stuff with the draftRecords element based on the rest of the content of the sRecord object
    }
}

You should use a method such as Enumerable.Join or Enumerable.GroupJoin to match items from the two collections. Both build a hash-based lookup over one side instead of rescanning it, so this will be far faster than doing nested for loops.
Since you want to match a collection of keys to an item in the second list which may or may not exist, GroupJoin is likely more appropriate. This would look something like:
var results = firstSet.GroupJoin(secondSet, f => f.Id, s => s.Id,
                                 (f, sset) => new { First = f, Seconds = sset });
foreach (var match in results)
{
    Console.WriteLine("Item {0} matches:", match.First);
    foreach (var second in match.Seconds)
        Console.WriteLine("    {0}", second); // each matching second item, one at a time
}
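GroupJoin also covers the unmatched group from the question: it keeps every element of the first set, and an element with no partner simply gets an empty Seconds sequence, so both groups fall out of the same result (a sketch against the anonymous type above):
foreach (var match in results)
{
    if (match.Seconds.Any())
    {
        // matched: process match.First together with its match.Seconds
    }
    else
    {
        // unmatched: match.First has no partner in secondSet
    }
}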

Your question is lacking in sample code/information, but I would personally look to use methods like Join, Intersect, or Contains. If necessary, use Select to do a projection of the fields you want to match, or define a custom IEqualityComparer.
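For instance, a dictionary-based sketch against the types in the question (assuming id is unique within sRecords; the variable names are mine):
var byId = sRecords.ToDictionary(r => r.id); // one pass over the List set
foreach (var draft in draftRecords)
{
    sRecord record;
    if (byId.TryGetValue(draft.ID, out record))
    {
        // matched: update draft using record
    }
    else
    {
        // unmatched: handle draft on its own
    }
}
This enumerates draftRecords once instead of calling ToArray() on every iteration, and each lookup is O(1), so the whole pass is linear in the size of the two sets.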

Related

LINQ to find collection items from second collection very slow

Please let me know if this is already answered somewhere; I can't find it.
In memory, I have a collection of objects in firstList and related objects in secondList. I want to find all items from firstList whose Id matches a secondList item's RelatedId. So it's fairly straightforward:
var items = firstList.Where(item => secondList.Any(secondItem => item.Id == secondItem.RelatedId));
But when both first and second lists get even moderately larger, this call is extremely expensive and takes many seconds to complete. Is there a way to break this up, or redesign it into multiple queries to make it more efficient?
The reason this code is so inefficient is that for every element of the first list, it has to iterate over (in the worst case) the entirety of the second list to look for an item with matching id.
A more efficient way to do this using LINQ would be using the Join method as follows:
var items = firstList.Join(secondList, item => item.Id, secondItem => secondItem.RelatedId, (item, _) => item);
If the second collection may contain duplicate IDs, you will additionally have to run Distinct() (possibly with some changes depending on the equality semantics for the members of the first list) on the result to maintain the semantics of your original code.
This code resulted in a roughly 100x speedup for me with a test using two lists of 10000 elements each.
If you're running this operation often and one of the collections does not change, you could consider caching the Ids or RelatedIds in a HashSet instead.
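For instance, a minimal sketch of that caching idea (assuming int ids):
// Build the set once; reuse it as long as secondList does not change.
var relatedIds = new HashSet<int>(secondList.Select(s => s.RelatedId));
var items = firstList.Where(item => relatedIds.Contains(item.Id)).ToList();
HashSet<T>.Contains is O(1), so the whole pass is O(n + m), and unlike Join this keeps the exact semantics of the original Any-based query with no need for Distinct.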

How to Update a list with another list efficiently C#

I have two lists that have object elements, one big list let's call it List1 and another small list List2.
I need to update values in List1 with values in List2 based on a condition that is defined in a function that returns a boolean based on the values in the objects.
I have come up with the following implementation which is really taking a lot of time for larger lists.
Function to check whether an item will be updated:
private static bool CheckMatch(Item item1, Item item2)
{
    // do some stuff here and return a boolean
}
Query I'm using to update the items:
In the snippet below, I need to update List1(larger list) with some values in List2(small list)
foreach (var item1 in List1)
{
    var matchingItems = List2.Where(item2 => CheckMatch(item1, item2));
    if (matchingItems.Any())
    {
        item1.IsExclude = matchingItems.First().IsExcluded;
        item1.IsInclude = matchingItems.First().IsIncluded;
        item1.Category = matchingItems.First().Category;
    }
}
I'm hoping I will get a solution that is much better than this. I also need to maintain the position of elements in List1.
As LP13's answer points out, you're doing a large amount of re-computation by re-executing a query instead of executing it once and caching the result.
But the larger problem here is that if you have n items in List1 and m potential matches in List2, and you are looking for any match, then worst case you will definitely do n * m matches. If n and m are large, their product is rather larger. And since we're looking for any match, the worst case is when there is no match; you'll definitely try all m possibilities.
Is this cost avoidable? Maybe, but only if we know some trick to take advantage of, and you've made the problem so abstract -- we have two lists and a relation, and no information about either the lists or the relation -- that there is no structure that we can take advantage of.
That said: if you happen to know that there is an element in List2 that is likely to match many items in List1 then put that element first. Any, or FirstOrDefault, will stop executing the Where query after getting the first match, so you can turn an O(n * m) problem into an O(n) problem.
Without knowing more about what the relation is, it's hard to say how to improve the performance.
UPDATE: A commenter points out that we can do better if we know that the relation is an equivalence relation. Is it an equivalence relation? That is, suppose we have your method that checks two items. Are we guaranteed the following?
The relation is reflexive: CheckMatch(a, a) is always true.
The relation is symmetric: CheckMatch(a, b) is always the same as CheckMatch(b, a).
The relation is transitive: if CheckMatch(a, b) is true and CheckMatch(b, c) is true, then CheckMatch(a, c) is always true.
If we have those three conditions then you can do considerably better. Such a relation partitions elements into equivalence classes. What you do is associate each item in List1 and List2 with a canonical value. That canonical value is the same for every member of the equivalence class. From that dictionary you can then do fast lookups and solve your problem quickly.
But if your relation is not an equivalence relation, this does not work.
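Assuming it is an equivalence relation, here is a minimal sketch of the canonical-value idea. The Canonicalize function is hypothetical (it is not in the original post); it stands for whatever per-item computation makes CheckMatch(a, b) equivalent to Canonicalize(a) == Canonicalize(b):
// Build a lookup from canonical key to one representative item of List2.
var byCanonical = new Dictionary<string, Item>();
foreach (var item2 in List2)
{
    string key = Canonicalize(item2); // hypothetical canonicalization
    if (!byCanonical.ContainsKey(key))
        byCanonical[key] = item2;
}

// One O(1) lookup per element of List1; List1's order is untouched.
foreach (var item1 in List1)
{
    Item match;
    if (byCanonical.TryGetValue(Canonicalize(item1), out match))
    {
        item1.IsExclude = match.IsExcluded;
        item1.IsInclude = match.IsIncluded;
        item1.Category = match.Category;
    }
}
This replaces the O(n * m) scan with O(n + m) work, and it preserves the positions of the elements in List1, which the question requires.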
Can you try this? .Where alone produces a lazy IEnumerable, and then you are executing that query several times by calling Any() and then First() on it.
foreach (var item1 in List1)
{
    var matchingItem = List2.Where(item2 => CheckMatch(item1, item2)).FirstOrDefault();
    if (matchingItem != null)
    {
        item1.IsExclude = matchingItem.IsExcluded;
        item1.IsInclude = matchingItem.IsIncluded;
        item1.Category = matchingItem.Category;
    }
}

Efficiency of C# Find on 1000+ records

I am trying to essentially see if entities exist in a local context and sort them accordingly. This function seems to be faster than the others we have tried; it runs in about 50 seconds for 1000 items, but I am wondering if there is something I can do to improve the efficiency. I believe the Find here is slowing it down significantly: a simple foreach iteration over 1000 items takes milliseconds, and benchmarking shows the bottleneck there. Any ideas would be helpful. Thank you.
Sample code:
foreach (var entity in entities)
{
    var localItem = db.Set<T>().Find(Key);
    if (localItem != null)
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
If this is a database (which, from the comments, I gather it is...)
You would be better off doing fewer queries.
list1.AddRange(db.Set<T>().Where(x => x.Key == Key));
list2.AddRange(db.Set<T>().Where(x => x.Key != Key));
This would be 2 queries instead of 1000+.
Also be aware that by adding each one to a List<T>, you're keeping two large collections in memory. So if 1000+ turns into 10,000,000, you're going to have interesting memory issues.
See this post on my blog for more information: http://www.artisansoftware.blogspot.com/2014/01/synopsis-creating-large-collection-by.html
If I understand correctly, the database seems to be the bottleneck? If you want to efficiently select data from a database relation whose attribute x should match an ==-criterion, you should consider creating a secondary access path for that attribute (an index structure). Depending on your database system and the distribution in your table, this might be a hash index (especially good for == checks) or a B+-tree (an all-rounder) or whatever else your system offers.
However, this only works if...
you are actually querying the database for this, rather than fetching the full data set once and then having to live with it in your application;
adding (another) index to the relation is not out of the question (e.g., it is worth having for more than a single need);
an index would actually be effective - which it is not, for example, if the attribute you are querying on has very few unique values.
I found your answers very helpful, but here is ultimately how I solved the problem. It seemed .Find was the bottleneck.
var tableDictionary = db.Set<T>().ToDictionary(x => x.KeyValue, x => x);
foreach (var entity in entities)
{
    if (tableDictionary.ContainsKey(entity.KeyValue))
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
This ran with 900+ rows in about a tenth of a second, which for our purposes was efficient enough.
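As an aside, since the dictionary's values are never read here, a HashSet of just the keys would do the same job with less memory (a sketch; the key type is assumed to be int for illustration):
var keys = new HashSet<int>(db.Set<T>().Select(x => x.KeyValue));
foreach (var entity in entities)
{
    if (keys.Contains(entity.KeyValue))
        list1.Add(entity);
    else
        list2.Add(entity);
}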
Rather than querying the DB for each item, you can just do one query, get all of the data (since you want all of the data from the DB eventually) and you can then group it in memory, which can be done (in this case) about as efficiently as in the database. By creating a lookup of whether or not the key is equal, we can easily get the two groups:
var lookup = db.Set<T>().ToLookup(item => item.Key == Key);
var list1 = lookup[true].ToList();
var list2 = lookup[false].ToList();
(You can use AddRange instead if the lists have previous values that should also be in them.)
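For example:
var lookup = db.Set<T>().ToLookup(item => item.Key == Key);
list1.AddRange(lookup[true]);  // items whose Key matches
list2.AddRange(lookup[false]); // everything else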

Is there any way to loop through my sql results and store certain name/value pairs elsewhere in C#?

I have a large result set coming from a pretty complex SQL query. Among the values are a string which represents a location (that will later help me determine the page location that the value came from), an int which is a priority number calculated for each row based on other values from the row, and another string which contains a value I must remember for display later.
The problem is that the sql query is so complex (it has UNIONS, JOINS, and complex calculations with aliases) that I can't logically fit anything else into it without messing with the way it works.
Suffice it to say, though, after the query is done and the calculations performed, I need something that perhaps aggregate functions might solve, but that IS NOT an option, as all the columns do not come from other aggregate functions.
I have been wracking my brain for days now as to how I can iterate through the results and store a pair of values in a list (or two separate lists tied together somehow), where one value is the sum of all the priority values for each location and the other is a distinct location value. That is, as the results are looped through, it should not create another list item for a location value that has already been seen; however, it still needs to add each row's priority value into the sum for its identical location. Also, the results need to be ordered by priority in descending order (hence the problem with using two lists).
EXAMPLE:
EDIT: I forgot, the preserved value should be the value from the row with the highest priority from the sql query.
If I had the following results:
location    priority    value
--------------------------------------------------------------------------------
page1       1           some text!
page2       3           more text!
page2       4           even more text!
page3       3           text again
page3       1           text
page3       1           still more text!
page4       6           text
If I was able to do what I wanted I would be able to achieve something like this after iteration (and in this order):
location    priority    value
--------------------------------------------------------------------------------
page2       7           even more text!
page4       6           text
page3       5           text again
page1       1           some text!
I have done research after research after research but absolutely nothing really even gets close to solving this dilemma.
Is what I'm asking too tough for even the powerful C# language?
THINGS I HAVE CONSIDERED:
Looping through the sql results and checking each location for repeats, adding together all priority values as I go, and storing these two, plus the value, in two or three separate lists.
Why I still need help
I can't use a foreach because the logic didn't pan out, and I can't use a for loop because I can't access by index whatever IEnumerable-like type it is that stores what's returned from Database.Open.Query() (this makes sense, of course). Also, I need to sort on priority, but can't let one list get out of sync with the others.
Using LINQ to select and store what I need
Why I still need help
I don't know LINQ (at all!) mainly because I don't understand lambda expressions (no matter HOW MUCH I read up about it).
Using an instantiated class to store the name/value pairs
Why I still need help
Not only do I expect sorting on this sort of thing to be difficult, but while I do know how to use .cs files in my C#.net webpages in the WebMatrix environment, I have mainly only ever used static classes and would also need a little refresher course on constructors and how to set this up appropriately.
Somehow fitting this functionality into the already sizeable and complex SQL query
Why I still need help
While this is probably where I would ideally like this functionality to be, I stress again that this IS NOT AN OPTION. I have tried using aggregate functions, but only get an error saying how not all the other columns come from aggregate functions.
Making another query based on values from the first query's result set
Why I still need help
I can't select distinct results based on only one column (i.e., location) alone.
Assuming I could get the loop logic correct, storing the values in a 3 dimensional array
Why I still need help
I can't declare the array, because I do not know all of its dimensions before I need to use it.
Your post has amazed me in a number of ways, like saying you have 'mainly only ever used static classes' and expecting sorting on an instantiated class to be impossible.. really strange things to say. I can only respond with a quote from Charles Babbage:
I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
Anyways.. As you say you find lambdas hard, let's trace the problem in the classic 'manual' way.
Let's assume you have a list of ROWS that contains LOCATIONS and PRIORITIES.
List<DataRow> rows = .... ; // datatable, sqldatareader, whatever
You say you need:
list of unique locations
a "list" of locations paired up with summed up priorites
Let's start with the first objective.
To gather a list of unique 'values', a HashSet is just perfect:
HashSet<string> locations = new HashSet<string>();
foreach (var row in rows)
    locations.Add((string)row["LOCATION"]);
well, and that's all. After that, the locations hashset will remember only the unique locations. Add does not result in duplicate elements: the HashSet checks and "uniquifies" all values that are put inside it. One small tricky thing: the hashset does not have the [index] operator. You'll have to enumerate the hashset to get the values:
foreach (string loc in locations)
{
    Console.WriteLine(loc);
}
or convert/rewrite it to a list:
List<string> locList = new List<string>(locations);
Console.WriteLine(locList[2]); // of course, assuming there were at least three..
Let's get to the second objective.
To gather a list of values related to some thing behaving like a "logical key", a Dictionary<Key,Val> may be useful. It allows you to store/associate a "value" with some "key", ie:
Dictionary<string, double> dict = new Dictionary<string, double>();
dict["mamma"] = 123.45;
double d = dict["mamma"]; // d == 123.45
dict["mamma"] += 101;     // possible!
double e = dict["mamma"]; // e == 224.45
However, it has a behavior of happily throwing exceptions when you try to read from an unknown key:
Dictionary<string, double> dict = new Dictionary<string, double>();
dict["mamma"] = 123.45;
double d = dict["daddy"]; // throws KeyNotFoundException
dict["daddy"] += 101;     // would throw too! += tries to read the old/current value first
So, one has to be very careful with "keys" it does not yet know. Fortunately, you can always ask the dictionary whether it already knows a key:
Dictionary<string, double> dict = new Dictionary<string, double>();
dict["mamma"] = 123.45;
bool knowIt = dict.ContainsKey("daddy"); // == false
So you can easily check-and-initialize-when-unknown:
Dictionary<string, double> dict = new Dictionary<string, double>();
bool knowIt = dict.ContainsKey("daddy"); // == false
if (!knowIt)
    dict["daddy"] = 5;
dict["daddy"] += 101; // now 106
So.. let's try summing up the priorities location-wise:
Dictionary<string, double> prioSums = new Dictionary<string, double>();
foreach (var row in rows)
{
    string location = (string)row["LOCATION"];
    double priority = (double)row["PRIORITY"];
    if (!prioSums.ContainsKey(location))
        prioSums[location] = 0.0; // make sure the dictionary knows the location
    prioSums[location] += priority;
}
And, really, that's all. Now the prioSums will know all locations and all sums of priorities:
var sss = prioSums["NewYork"]; // 9123, assuming NewYork was some location
However, it'd be quite useless to have to hardcode all the locations. Hence, you can also ask the dictionary what keys it currently knows:
foreach(string key in prioSums.Keys)
Console.WriteLine(key);
and you can immediately use them:
foreach (string key in prioSums.Keys)
{
    Console.WriteLine(key);
    Console.WriteLine(prioSums[key]);
}
that should print all locations with all their sums.
You might have already noticed an interesting thing: the dictionary can tell you what keys it has remembered. Hence, you do not need the HashSet from the first objective. Simply by summing up the priorities inside the Dictionary, you get the uniquified list of locations for free: just ask the dict for its keys.
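As a small aside, the check-then-add accumulation above can also be written with TryGetValue, which avoids looking each key up twice (a minimal variant with the same behavior, not from the original answer):
Dictionary<string, double> prioSums = new Dictionary<string, double>();
foreach (var row in rows)
{
    string location = (string)row["LOCATION"];
    double priority = (double)row["PRIORITY"];
    double sum;
    prioSums.TryGetValue(location, out sum); // sum is left at 0.0 when the key is unknown
    prioSums[location] = sum + priority;
}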
EDIT:
I noticed you had a few more requests (like sort-descending or find-highest-prio-value), but I think I'll leave them for now. If you understood how I used a dictionary to collect the priorities, then you will easily build a similar Dictionary<string,string> to collect the highest-ranking value for a location. And the 'descending order' is done very easily if you just take the pairs out of the dictionary and sort them as, e.g., a List.. This text got far too tl;dr already I think :)
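For completeness, a minimal sketch of those two skipped pieces in the same manual style (the bestPrio and bestValue names are mine, under the same rows assumption as above):
// Track the highest priority seen per location, and the value from that row.
Dictionary<string, double> bestPrio = new Dictionary<string, double>();
Dictionary<string, string> bestValue = new Dictionary<string, string>();
foreach (var row in rows)
{
    string location = (string)row["LOCATION"];
    double priority = (double)row["PRIORITY"];
    string value = (string)row["VALUE"];
    if (!bestPrio.ContainsKey(location) || priority > bestPrio[location])
    {
        bestPrio[location] = priority;
        bestValue[location] = value;
    }
}

// Descending order by summed priority: pull the pairs out and sort them.
List<KeyValuePair<string, double>> ordered = new List<KeyValuePair<string, double>>(prioSums);
ordered.Sort((a, b) => b.Value.CompareTo(a.Value));
foreach (var pair in ordered)
    Console.WriteLine("{0} {1} {2}", pair.Key, pair.Value, bestValue[pair.Key]);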
LINQ is really the tool to use for this kind of problems.
Suppose you have a variable pages which is an IEnumerable<Page>, where Page is a class with properties location, priority, and value. You could do:
var query = from page in pages
            group page by page.location into grp
            select new
            {
                location = grp.Key,
                priority = grp.Sum(page => page.priority),
                value = grp.OrderByDescending(page => page.priority)
                           .First().value
            };
You say you don't understand LINQ, so let me try to begin explain this statement.
The rows are grouped by location, which results in 4 groups of pages, with page.location as the key:
location    priority    value
--------------------------------------
page1       1           some text!
page2       3           more text!
            4           even more text!
page3       1           text
            1           still more text!
            3           text again
page4       6           text
The select loops through these 4 groups and for each group it creates an anonymous type with 3 properties:
location: the key of the group
priority: the sum of priorities in one group
value: the first value in one group when its pages are sorted by priority in descending order.
The lambda expressions are a way to express which property should be used for a LINQ function like Sum. In short, they say "transform page to page.priority": page => page.priority.
You want these new rows in descending order of priority, so finally you can do
result = query.OrderByDescending(x => x.priority).ToList();
The x is just an arbitrary placeholder representing one item in the collection at hand, query (likewise, in the query above, page could have been any other name).
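For reference, here is the same query written in method syntax, with lambdas throughout - a direct translation of the query above, not a different algorithm:
var result = pages
    .GroupBy(page => page.location)
    .Select(grp => new
    {
        location = grp.Key,
        priority = grp.Sum(page => page.priority),
        value = grp.OrderByDescending(page => page.priority).First().value
    })
    .OrderByDescending(x => x.priority)
    .ToList();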

Select an element by index from a .NET HashSet

At the moment I am using a custom class derived from HashSet. There's a point in the code where I select items under a certain condition:
var c = clusters.Where(x => x.Label != null && x.Label.Equals(someLabel));
It works fine and I get those elements. But is there a way that I could receive the index of that element within the collection, to use with the ElementAt method, instead of whole objects?
It would look more or less like this:
var c = select element index in collection under certain condition;
int index = c.ElementAt(0); // get first index
clusters.ElementAt(index).RunObjectMethod();
Is manually iterating over the whole collection a better way? I need to add that it's in a bigger loop, so this Where clause is performed multiple times for different someLabel strings.
Edit
What do I need this for? clusters is a set of clusters over some document collection. Documents are grouped into clusters by topic similarity. One of the last steps of the algorithm is to discover the label for each cluster. But the algorithm is not perfect and sometimes it makes two or more clusters with the same label. What I want to do is simply merge those clusters into one big one.
Sets don't generally have indexes. If position is important to you, you should be using a List<T> instead of (or possibly as well as) a set.
Now SortedSet<T> in .NET 4 is slightly different, in that it maintains a sorted value order. However, it still doesn't implement IList<T>, so access by index with ElementAt is going to be slow.
If you could give more details about why you want this functionality, it would help. Your use case isn't really clear at the moment.
In the case where you hold elements in a HashSet and sometimes need to get elements by index, consider using the ToList() extension method in such situations. You keep the features of HashSet and then take advantage of indexes.
HashSet<T> hashset = new HashSet<T>();
// the special situation where we need an index-based way of getting elements
List<T> list = hashset.ToList();
// doing our special job, for example mapping the elements to an EF entities collection (that was my case)
// we can still operate on the hashset, for example when we want to keep uniqueness across the elements
There's no such thing as an index with a hash set. One of the ways that hash sets gain efficiency in some cases is by not having to maintain one.
I also don't see what the advantage is here. If you were to obtain the index and then use it, this would be less efficient than just obtaining the element (obtaining the index would be equally expensive, and then you've added an extra operation).
If you want to do several operations on the same object, just hold onto that object.
If you want to do something with several objects, do so by iterating through them (a normal foreach, or a foreach over the results of a Where(), etc.). If you want to do something with several objects, then do something else with those same objects, and you have to do it in such batches rather than doing all the operations in the same foreach, then store the results of the Where() in a List<T>.
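As for the merge itself from the question's edit: since the cluster class derives from HashSet, one possible sketch is to group the clusters by label and union each group into its first member (assuming HashSet.UnionWith is the right merge semantics for your cluster class, and ignoring unlabeled clusters):
var merged = clusters
    .Where(c => c.Label != null)
    .GroupBy(c => c.Label)
    .Select(g =>
    {
        var first = g.First();
        foreach (var other in g.Skip(1))
            first.UnionWith(other); // fold duplicate-labeled clusters into the first one
        return first;
    })
    .ToList();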
Why don't you use a dictionary?
Dictionary<string, int> dic = new Dictionary<string, int>();
for (int i = 0; i < 10; i++)
{
    dic.Add("value " + i, dic.Count + 1); // map each value to its 1-based insertion position
}
string find = "value 3";
int position = dic[find];
Console.WriteLine("the position of " + find + " is " + position);
