LINQ Intersection of two different types - c#

I have two lists of different types. I need to keep only the elements of list1 that also appear in list2, where the matching list2 element satisfies certain criteria.
Here is what I tried, seems to work but each element is listed twice.
var filteredTracks =
from mtrack in mTracks
join ftrack in tracksFileStatus on mtrack.id equals ftrack.Id
where mtrack.id == ftrack.Id && ftrack.status == "ONDISK" && ftrack.content_type_id == 234
select mtrack;
Ideally I don't want to create a new copy as filteredTracks. Is it possible to modify mTracks in place?

If you're getting duplicates, it's because your id fields are not unique in one or both of the two sequences. Also, you don't need to say where mtrack.id == ftrack.Id since that condition already has to be met for the join to succeed.
I would probably use loops here, but if you are dead set on LINQ, you may need to group tracksFileStatus by its Id field. It's hard to tell by what you posted.
As far as "modifying mTracks in place", this is probably not possible or worthwhile (I'm assuming that mTracks is some type derived from IEnumerable<T>). If you're worried about the efficiency of this approach, then you may want to consider using another kind of data structure, like a dictionary with Id values as the keys.
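If mTracks happens to be a List<T>, one way to avoid both the duplicates and the extra copy is sketched below. This is only an illustration under assumptions: the MTrack and FileStatus classes are hypothetical stand-ins for the question's types, and KeepOnDisk is a made-up helper name. It collects the qualifying Ids into a HashSet (which collapses duplicate Ids) and then calls List<T>.RemoveAll.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the two element types in the question.
public class MTrack { public int id; }
public class FileStatus { public int Id; public string status; public int content_type_id; }

public static class TrackFilter
{
    // Removes from mTracks every element whose id has no qualifying match
    // in tracksFileStatus. Mutates mTracks in place.
    public static void KeepOnDisk(List<MTrack> mTracks, IEnumerable<FileStatus> tracksFileStatus)
    {
        // A HashSet collapses duplicate Ids, so no track can match twice.
        var wanted = new HashSet<int>(
            tracksFileStatus
                .Where(f => f.status == "ONDISK" && f.content_type_id == 234)
                .Select(f => f.Id));

        mTracks.RemoveAll(m => !wanted.Contains(m.id));
    }

    public static void Main()
    {
        var mTracks = new List<MTrack> { new MTrack { id = 1 }, new MTrack { id = 2 }, new MTrack { id = 3 } };
        var statuses = new List<FileStatus>
        {
            new FileStatus { Id = 1, status = "ONDISK", content_type_id = 234 },
            new FileStatus { Id = 1, status = "ONDISK", content_type_id = 234 }, // duplicate Id
            new FileStatus { Id = 2, status = "MISSING", content_type_id = 234 },
        };
        KeepOnDisk(mTracks, statuses);
        Console.WriteLine(string.Join(",", mTracks.Select(m => m.id))); // prints "1"
    }
}
```

If mTracks is not a List<T>, RemoveAll is not available and building a new sequence is probably the simpler route.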

Since the Q was about lists primarily...
this is probably better linq wise...
var test = (from m in mTracks
from f in fTracks
where m.Id == f.Id && ...
select m);
However you should optimize, e.g.
Are your lists sorted? If they are, see e.g. Best algorithm for synchronizing two IList in C# 2.0
If it's coming from Db (it's not clear here), then you need to build your linq query based on the SQL / relations and indexes you have in the Db and go a bit different route.
If I were you, I'd sort both lists first (presuming they're not Db bound), using whatever key you compare them on,
then enumerate them in parallel (using enumerators), comparing other things in the process (like in that link).
That's likely the most efficient way.
If/when the data comes from a database, optimize at the 'source', i.e. fetch data already sorted and filtered as much as you can. Basically, build the SQL first, or inspect the SQL generated by the LINQ query (let me know if you need the link).
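The "enumerate in parallel" idea could look roughly like the sketch below. It is not the code from the linked answer: it walks two already-sorted lists of ids with indexes rather than enumerators, and the name IntersectSorted is made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class SortedMerge
{
    // Returns the ids present in both inputs. Both lists must already be
    // sorted ascending; each is walked once, so the cost is O(n + m).
    public static List<int> IntersectSorted(IReadOnlyList<int> a, IReadOnlyList<int> b)
    {
        var result = new List<int>();
        int i = 0, j = 0;
        while (i < a.Count && j < b.Count)
        {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else
            {
                // Emit each shared id once, even if the inputs repeat it.
                if (result.Count == 0 || result[result.Count - 1] != a[i]) result.Add(a[i]);
                i++;
                j++;
            }
        }
        return result;
    }

    public static void Main()
    {
        var shared = IntersectSorted(new[] { 1, 2, 4, 7 }, new[] { 2, 3, 4, 8 });
        Console.WriteLine(string.Join(",", shared)); // prints "2,4"
    }
}
```

Sorting first costs O(n log n), so this only pays off when the lists are already sorted or are reused many times.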

Related

Most efficient collection for storing data from LINQ to Entities?

I have read several different sources over the years that indicate that when storing a collection of data, a List<T> is efficient when you want to insert objects, and an IEnumerable<T> is best for enumerating over a collection.
In LINQ-to-Entities, there is the AsEnumerable() function, that will return an IEnumerable<T>, but it will not resolve the SQL created by the LINQ statement until you start enumerating over the list.
What if I want to store objects from LINQ to Entities in a collection and then query on that collection later?
Using this strategy causes the SQL to be resolved by adding a WHERE clause and querying each record separately. I specifically don't want to do that because I'm trying to limit network chatter:
var myDataToLookup = context.MyData.AsEnumerable();
foreach (var myOtherDatum in myOtherDataList)
{
    // gets singular record from database each time.
    var myDatum = myDataToLookup.SingleOrDefault(w => w.key == myOtherDatum.key);
}
How do I resolve the SQL upfront so myDataToLookup actually contains the data in memory? I've tried ToArray:
var myDataToLookup = context.MyData.ToArray();
But I recently learned that it actually uses more memory than ToList does:
Is it better to call ToList() or ToArray() in LINQ queries?
Should I use a join instead?
var myCombinedData = from o in myOtherDataList
                     join d in myDataToLookup on
                     o.key equals d.key
                     select new { myOtherData = o, myData = d };
Should I use ToDictionary and store my key as the key to the dictionary? Or am I worrying too much about this?
If you're using LINQ to Entities then you should not worry if ToArray is slower than ToList. There is almost no difference between them in terms of performance and LINQ to Entities itself will be a bottleneck anyway.
Regarding a dictionary. It is a structure optimized for reads by keys. There is an additional cost on adding new items though. So, if you will read by key a lot and add new items not that often then that's the way to go. But to be honest - you probably should not bother at all. If data size is not big enough, you won't see a difference.
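For what it's worth, the dictionary route might look like the sketch below. The MyDatum class and BuildIndex helper are hypothetical, and an in-memory list stands in for context.MyData; with a real LINQ to Entities query, the ToDictionary call is where the SQL would run, once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type; "key" mirrors the property used in the question.
public class MyDatum { public int key; public string name; }

public static class LookupDemo
{
    // Indexes the rows by key once, so each later lookup is a hash probe
    // instead of another pass over the data.
    public static Dictionary<int, MyDatum> BuildIndex(IEnumerable<MyDatum> rows)
    {
        // ToDictionary enumerates the source immediately and throws an
        // ArgumentException on duplicate keys, so keys are assumed unique.
        return rows.ToDictionary(d => d.key);
    }

    public static void Main()
    {
        var index = BuildIndex(new[]
        {
            new MyDatum { key = 1, name = "a" },
            new MyDatum { key = 2, name = "b" },
        });

        foreach (var wantedKey in new[] { 2, 3 })
        {
            // Prints "b", then "missing".
            Console.WriteLine(index.TryGetValue(wantedKey, out var hit) ? hit.name : "missing");
        }
    }
}
```

If keys can repeat, ToLookup is the usual alternative, since it groups rows per key instead of throwing.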
Think of IEnumerable<T>, ICollection<T> and IList<T> as a hierarchy, each one extending the previous one; IDictionary<TKey, TValue> sits beside IList<T> and also extends ICollection<T>. Simply put, IEnumerable<T> gives you iteration only. ICollection<T> adds Count, Add, Remove and Contains, and IList<T> adds access, insertion and removal by index. The concrete List<T> class adds richer functionality on top, such as Find with a lambda predicate. Dictionaries provide efficient access via a key. Arrays are much more static: their length is fixed once they are created.
So, the answer then depends on your requirements. If it is appropriate to hold the data in memory and you need to frequently re-query it then I usually convert the Entity result to a List. This also loads the data.
If access via a set of keys is paramount then I use a Dictionary.
I cannot remember the last time I used an array except for infrequent and very specific purposes.
So, not a direct answer, but as your question and the other replies indicate, there isn't a single answer and the solution will be a compromise.
When I code and measure performance and data carried over the network, here is how I look at things based on your example above.
Let's say your result returns 100 records. Your code has now run a query on the server and performed 1 second of processing (I made the number up for sake of argument).
Then you need to convert it to a list, which is going to be 1 more second of processing. Then you want to find all records that have a value of 1. The code will now loop through the entire list to find the values with 1 and then return the result. Let's say this is another 1 second of processing and it finds 10 records.
Your network is going to carry over 10 records that took 3 seconds to process.
If you move your logic to your data layer and make your query search right away for the records that you want, you can save 2 seconds of processing and still only carry 10 records across the network. The bonus is that you can just use IEnumerable<T> as the result and not have to convert it to a list, thus eliminating the 1 second of converting to a list and the 1 second of iterating through the list.
I hope this helps answer your question.

Efficient design to search for objects by multiple parameters with range

I have a set of objects of the same type in memory and each has multiple immutable int properties (but not only them).
I need to find an object there (or multiple) whose properties are in small range near specified values. E.g. a == 5+-1 && b == 21+-2 && c == 9 && any d.
What's the best way to store objects so I can efficiently retrieve them like this?
I thought about making SortedList for each property and using BinarySearch but I have a lot of properties so I would like to have a more generic way instead of so many SortedLists.
It's important that the set itself is not immutable: I need an ability to add/remove items.
Is there something like memory db for objects (not just data)?
Just to expand on @j_random_hacker's answer a bit: The usual approach to 'estimates of the selectivity' is to build a histogram for the index. But you might already intuitively know which criterion is going to yield the smallest initial result set out of "a == 5+-1 && b == 21+-2 && c == 9". Most likely it is "c == 9", unless there's an exceptionally high number of duplicate values and a small universe of potential values for 'c'.
So, a simple analysis of the predicates would be an easy starting point. Equality conditions are highly likely to be the most selective (exhibit the highest selectivity).
From that point, an RDBMS will conduct a sequential scan of the records in the result set to filter for the remaining predicates. That's probably your best approach, too.
Or, there are any number of in-memory, small-footprint, SQL-capable DBMSs that will do the heavy lifting for you (eXtremeDB, SQLite, RDM, ...; Google is your friend), and some have lower-level interfaces that won't do all the work for you (still, most of it) but also won't impose SQL on you.
First, having lots of SortedLists is not bad design. It's essentially the way that all modern RDBMSes solve the same problem.
Further to this: If there was a simple, general, close-to-optimally-efficient way to answer such queries, RDBMSes would not bother with the comparatively complicated and slow hack of query plan optimisation: that is, generating large numbers of candidate query plans and then heuristically estimating which one will take the least time to execute.
Admittedly, queries with many joins between tables are what tends to make the space of possible plans huge in practice with RDBMSes, and you don't seem to have those here. But even with just a single table (set of objects), if there are k fields that can be used for selecting rows (objects), then you could theoretically have k! different indices (SortedLists of (key, value) pairs in which the key is some ordered sequence of the k field values, and the value is e.g. a memory pointer to the object) to choose from. If the outcome of the query is a single object (or alternatively, if the query contains a non-range clause for all k fields) then the index used won't matter -- but in every other case, each index will in general perform differently, so a query planner would need to have accurate estimates of the selectivity of each clause in order to choose the best index to use.
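As a rough sketch of the "use the most selective clause first, then scan" approach from the answers above: the store below keeps a single hash index on c (assumed here to be the property queried by equality) and filters the remaining range clauses by scanning the matching bucket. The Item and ItemStore names are hypothetical, and a real design would pick the indexed property from knowledge of the data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical object with several int properties.
public class Item { public int A, B, C, D; }

public class ItemStore
{
    // Index on C, the property assumed to be queried by equality.
    private readonly Dictionary<int, List<Item>> byC = new Dictionary<int, List<Item>>();

    public void Add(Item item)
    {
        if (!byC.TryGetValue(item.C, out var bucket))
            byC[item.C] = bucket = new List<Item>();
        bucket.Add(item);
    }

    public bool Remove(Item item) =>
        byC.TryGetValue(item.C, out var bucket) && bucket.Remove(item);

    // Narrow by the equality clause first, then scan the survivors for the
    // range clauses: a == a0 +- da && b == b0 +- db && c == c0, any d.
    public List<Item> Find(int a0, int da, int b0, int db, int c0)
    {
        if (!byC.TryGetValue(c0, out var bucket)) return new List<Item>();
        return bucket
            .Where(x => Math.Abs(x.A - a0) <= da && Math.Abs(x.B - b0) <= db)
            .ToList();
    }

    public static void Main()
    {
        var store = new ItemStore();
        store.Add(new Item { A = 5, B = 20, C = 9 }); // matches
        store.Add(new Item { A = 7, B = 21, C = 9 }); // a is out of range
        store.Add(new Item { A = 5, B = 21, C = 8 }); // c differs
        Console.WriteLine(store.Find(5, 1, 21, 2, 9).Count); // prints "1"
    }
}
```

Adds and removes stay cheap because only one index has to be maintained; the trade-off is that queries without an equality clause on c fall back to a full scan, which this sketch does not implement.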

EF Many to Many select intersection

I'm trying to implement a tagging system with C# entity framework. I cannot get the query required for the case that two or more tags are expected to all be present to return a result. I have a many to many relationship (just FKs, DB first) and I am attempting to get an object when all selected tags exist. Object - LookupTable - Attributes.
I parse the selected tags into a list and then try to get only those objects for which all tags in this list are present. It appears to result in what I'd expect from an "Any" operator, not the "All".
List<string> intersectTags = new List<string>();
foreach (object i in ef.objects.Where(o => o.Attributes.All(attribute =>
intersectTags.Contains(attribute.AttributeNK))))
Update: Also needed to get instances where ef.Object had more tags than intersectTags. Filtering for instances where intersectTags is a subset of Object.Attributes.
Your code fails in case your Attributes is a subset of selected tags.
If you are looking to match when intersectTags is a subset of o.Attributes, try reversing the check.
Unfortunately, LINQ to Entities does not support this kind of syntax, so we need ToList() to load the objects and then perform the check with LINQ to Objects.
It should work, but there are performance implications, because every object is loaded into memory first (I'll post an update if I have a better solution):
List<string> intersectTags = new List<string>();
foreach (object i in ef.objects.ToList().Where(o => intersectTags.All(tag =>
    o.Attributes.Any(attribute => attribute.AttributeNK == tag))))
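To make the reversed check concrete, here is a small LINQ to Objects sketch. The TaggedObject and ObjAttribute classes are hypothetical stand-ins for the EF entities, and HavingAllTags is a made-up helper name; only the AttributeNK property comes from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the EF entities in the question.
public class ObjAttribute { public string AttributeNK; }
public class TaggedObject
{
    public string Name;
    public List<ObjAttribute> Attributes = new List<ObjAttribute>();
}

public static class TagFilter
{
    // Keeps objects whose attributes include every requested tag
    // (intersectTags is a subset of the object's tags; extra tags are fine).
    public static List<TaggedObject> HavingAllTags(
        IEnumerable<TaggedObject> objects, ICollection<string> intersectTags)
    {
        return objects
            .Where(o => intersectTags.All(tag => o.Attributes.Any(a => a.AttributeNK == tag)))
            .ToList();
    }

    public static void Main()
    {
        var x = new TaggedObject { Name = "X" };
        x.Attributes.AddRange(new[] { "red", "big", "old" }.Select(t => new ObjAttribute { AttributeNK = t }));
        var y = new TaggedObject { Name = "Y" };
        y.Attributes.Add(new ObjAttribute { AttributeNK = "red" });

        var hits = HavingAllTags(new[] { x, y }, new List<string> { "red", "big" });
        Console.WriteLine(string.Join(",", hits.Select(h => h.Name))); // prints "X"
    }
}
```

Note that All returns true for an empty sequence, so an empty intersectTags list matches every object.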
I don't know if I understood correctly; if so, I can give a solution in plain SQL. You have to look up all the records that contain one of the requested tags and then group them by ProductId, with a HAVING COUNT clause equal to the number of tags you are passing.
SELECT ProductId FROM ProductTag
WHERE TagId IN (2,3,4)
GROUP BY ProductId
HAVING COUNT(*) = 3
Here's a demo:
http://sqlfiddle.com/#!3/dd4023/3
I'm sorry, currently I cannot give you an implementation in EF (don't have Visual Studio with me), I did something similar for LINQ TO SQL and it uses the PredicateBuilder class, you can find it here:
http://www.codeproject.com/Articles/36178/How-to-manage-product-options-with-different-price
Paolo
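In case it helps, the same GROUP BY / HAVING COUNT idea can be sketched in LINQ to Objects as below. This is not the PredicateBuilder implementation mentioned above and has not been checked against Entity Framework's SQL translation; the ProductTag class simply mirrors the table in the SQL example, and ProductsWithAllTags is a made-up name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Mirrors the ProductTag link table from the SQL example.
public class ProductTag { public int ProductId; public int TagId; }

public static class TagQuery
{
    // Returns the ProductIds that carry every one of the requested tags.
    public static List<int> ProductsWithAllTags(IEnumerable<ProductTag> rows, IEnumerable<int> tagIds)
    {
        var wanted = new HashSet<int>(tagIds);
        return rows
            .Where(r => wanted.Contains(r.TagId))   // WHERE TagId IN (...)
            .GroupBy(r => r.ProductId)              // GROUP BY ProductId
            // HAVING COUNT = number of tags; counting distinct TagIds guards
            // against duplicate (ProductId, TagId) rows.
            .Where(g => g.Select(r => r.TagId).Distinct().Count() == wanted.Count)
            .Select(g => g.Key)
            .ToList();
    }

    public static void Main()
    {
        var rows = new[]
        {
            new ProductTag { ProductId = 1, TagId = 2 },
            new ProductTag { ProductId = 1, TagId = 3 },
            new ProductTag { ProductId = 1, TagId = 4 },
            new ProductTag { ProductId = 2, TagId = 2 },
            new ProductTag { ProductId = 2, TagId = 3 },
            new ProductTag { ProductId = 3, TagId = 4 },
        };
        Console.WriteLine(string.Join(",", ProductsWithAllTags(rows, new[] { 2, 3, 4 }))); // prints "1"
    }
}
```

The plain SQL version with COUNT(*) relies on each (ProductId, TagId) pair appearing at most once, which a composite primary key on the link table would guarantee.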

How do I apply the LINQ to SQL Distinct() operator to a List<T>?

I have a serious (it's driving me crazy) problem with LINQ to SQL. I am developing an ASP.NET MVC3 application using C# and Razor in Visual Studio 2010.
I have two database tables, Product and Categories:
Product(Prod_Id[primary key], other attributes)
Categories((Dept_Id, Prod_Id) [primary keys], other attributes)
Obviously Prod_Id in Categories is a foreign key. Both classes are mapped using the Entity Framework (EF). I do not mention the context of the application for simplicity.
In Categories there are multiple rows containing the Prod_Id. I want to make a projection of all Distinct Prod_Id in Categories. I did it using plain (T)SQL in SQL Server MGMT Studio according to this (really simple) query:
SELECT DISTINCT Prod_Id
FROM Categories
and the result is correct. Now I need to make this query in my application so I used:
var query = _StoreDB.Categories.Select(m => m.Prod_Id).Distinct();
I go to check the result of my query by using:
query.Select(m => m.Prod_Id);
or
foreach(var item in query)
{
item.Prod_Id;
//other instructions
}
and it does not work. First of all, when I attempt to write query.Select(m => m. or item. the Intellisense shows only suggestions about methods (such as Equals, etc...) and not properties. I thought that maybe there was something wrong with Intellisense (I guess most of you many times hoped that Intellisense was wrong :-D) but when I launch the application I receive an error at runtime.
Before giving your answer, keep in mind that:
I checked many forums, I tried the normal LINQ to SQL (without using lambdas) but it does not work. The fact that it works in (T)SQL means that there is something wrong with the LINQ to SQL instruction (other queries in my application work perfectly).
For application-related reasons, I used a List<T> variable instead of _StoreDB.Categories and I thought that was the problem. A solution that does not use a List<T> would be appreciated as well.
This line:
var query = _StoreDB.Categories.Select(m => m.Prod_Id).Distinct();
Your LINQ query most likely returns an IEnumerable of ints (judging by Select(m => m.Prod_Id)). You have a list of integers, not a list of entity objects. Try printing them and see what you get.
Calling _StoreDB.Categories.Select(m => m.Prod_Id) means that query will contain Prod_Id values only, not the entire entity. It would be roughly equivalent to this SQL, which selects only one column (instead of the entire row):
SELECT Prod_Id FROM Categories;
So when you iterate through query using foreach (var item in query), the type of item is probably int (or whatever your Prod_Id column is), not your entity. That's why Intellisense doesn't show the entity properties that you expect when you type "item."...
If you want all of the columns in Categories to be included in query, you don't even need to use .Select(m => m). You can just do this:
var query = _StoreDB.Categories.Distinct();
Note that if you don't explicitly pass an IEqualityComparer<T> to Distinct(), EqualityComparer<T>.Default will be used (which may or may not behave the way you want it to, depending on the type of T, whether or not it implements System.IEquatable<T>, etc.).
For more info on getting Distinct to work in situations similar to yours, take a look at this question or this question and the related discussions.
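Since the question mentions working against a List<T>, here is a small in-memory sketch of the IEqualityComparer<T> point. The Category class is a hypothetical stand-in for the entity and ByProdId is a made-up comparer; this illustrates LINQ to Objects only, not how a database query would be translated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Categories entity.
public class Category { public int Dept_Id; public int Prod_Id; }

// Treats two Category objects as equal when their Prod_Id matches.
public sealed class ByProdId : IEqualityComparer<Category>
{
    public bool Equals(Category x, Category y) =>
        ReferenceEquals(x, y) || (x != null && y != null && x.Prod_Id == y.Prod_Id);

    public int GetHashCode(Category c) => c.Prod_Id.GetHashCode();
}

public static class DistinctDemo
{
    public static void Main()
    {
        var categories = new List<Category>
        {
            new Category { Dept_Id = 1, Prod_Id = 10 },
            new Category { Dept_Id = 2, Prod_Id = 10 },
            new Category { Dept_Id = 1, Prod_Id = 11 },
            new Category { Dept_Id = 1, Prod_Id = 10 },
        };

        // Default comparer on a class without Equals overrides compares
        // references, so every object counts as distinct.
        Console.WriteLine(categories.Distinct().Count());               // prints "4"

        // With the custom comparer, one object per Prod_Id survives.
        Console.WriteLine(categories.Distinct(new ByProdId()).Count()); // prints "2"

        // Projecting to the id first needs no comparer, since int compares by value.
        Console.WriteLine(categories.Select(c => c.Prod_Id).Distinct().Count()); // prints "2"
    }
}
```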
As has been explained by the other answers, the error that the OP ran into was because the result of his code was a collection of ints, not a collection of Categories.
What hasn't been answered was his question about how to use the collection of ints in a join or something in order to get at some useful data. I will attempt to do that here.
Now, I'm not really sure why the OP wanted to get a distinct list of Prod_Ids from Categories, rather than just getting the Prod_Ids from Products. Perhaps he wanted to find out which Products are related to one or more Categories, so that any uncategorized Products would be excluded from the results. I'll assume this is the case and that the desired result is a collection of distinct Products that have associated Categories. I'll first answer the question about what to do with the Prod_Ids, and then offer some alternatives.
We can take the collection of Prod_Ids exactly as they were created in the question as a query:
var query = _StoreDB.Categories.Select(m => m.Prod_Id).Distinct();
Then we would use join, like so:
var products = query.Join(_StoreDB.Products, id => id, p => p.Prod_Id,
(id,p) => p);
This takes the query, joins it with the Products table, specifies the keys to use, and finally says to return the Product entity from each matching set. Because we know that the Prod_Ids in query are unique (because of Distinct()) and the Prod_Ids in Products are unique (by definition because it is the primary key), we know that the results will be unique without having to call Distinct().
Now, the above will get the desired results, but it's definitely not the cleanest or simplest way to do it. If the Category entities are defined with a relational property that returns the related record from Products (which would likely be called Product), the simplest way to do what we're trying to do would be the following:
var products = _StoreDB.Categories.Select(c => c.Product).Distinct();
This gets the Product from each Category and returns a distinct collection of them.
If the Category entity doesn't have the Product relational property, then we can go back to using the Join function to get our Products.
var products = _StoreDB.Categories.Join(_StoreDB.Products, c => c.Prod_Id,
p => p.Prod_Id, (c,p) => p).Distinct();
Finally, if we don't just want a simple collection of Products, then some more thought would have to go into this, and perhaps the simplest thing would be to handle that when iterating through the Products. Another example would be getting a count of the Categories each Product belongs to. If that's the case, I would reverse the logic and start with Products, like so:
var productsWithCount = _StoreDB.Products.Select(p => new { Product = p,
NumberOfCategories = _StoreDB.Categories.Count(c => c.Prod_Id == p.Prod_Id)});
This would result in a collection of anonymously typed objects that reference the Product and the NumberOfCategories related to that Product. If we still needed to exclude any uncategorized Products, we could append .Where(r => r.NumberOfCategories > 0) before the semicolon. Of course, if the Product entity is defined with a relational property for the related Categories, you wouldn't need this because you could just take any Product and do the following:
int NumberOfCategories = product.Categories.Count();
Anyway, sorry for rambling on. I hope this proves helpful to anyone else that runs into a similar issue. ;)

Are groups in Linq to Sql already sorted by Count() descending?

It appears so, but I can't find any definitive documentation on the subject.
What I'm asking is if the result of this query:
from x
in Db.Items
join y in Db.Sales on x.Id equals y.ItemId
group x by x.Id into g
orderby g.Count() descending
select g.First()
is ALWAYS THE SAME as the following query:
from x
in Db.Items
join y in Db.Sales on x.Id equals y.ItemId
group x by x.Id into g
select g.First()
note that the second query lets Linq decide the ordering of the group, which the first query sets as number sold, from most to least.
My ad-hoc tests seem to indicate that Linq automatically sorts groups this way, while the documentation seems to indicate that the opposite is true--items are returned in the order they appear in the select. I figure if it comes sorted this way, adding the extra sort is pointless and wastes cycles, and would be better left out.
You're likely seeing this because the query result returned from SQL Server is always in the same order in your tests. However, this is a fallacy: by definition, sets in SQL have no order unless it's explicitly specified with an ORDER BY. So if your queries don't have an ORDER BY clause, your sets might look like they're ordered, but that's not guaranteed; in edge cases the order may be different (e.g. when the server has to load pages of the table in a different order due to memory constraints or otherwise). So, rule of thumb: if you want an order, you have to specify one.
If the specs don't guarantee an ordering you should consider it accidental, and subject to change with any new version of the software.
And don't take it out unless
a) you've measured it and it makes a significant difference
b) you are willing to monitor, test, and change it back (everywhere) after every tiny change in your software environment.
LINQ grouping does not guarantee such a thing. While it might work for that specific circumstance, it might not work in another situation. Avoid relying on this side effect.
By the way, if the output is really intentionally sorted by SQL Server due to clustered index or something, adding an ORDER BY clause won't hurt because query optimizer should be smart enough to know that the result is already sorted, so you won't lose anything.
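A quick in-memory illustration of why the explicit orderby matters. This uses LINQ to Objects, where GroupBy yields groups in the order their keys first appear in the source; a database gives no such promise. The ByTimesSold helper and the list of sold item ids are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupOrder
{
    // Returns item ids ordered by how many sales rows reference them,
    // most-sold first. The ordering is requested explicitly.
    public static List<int> ByTimesSold(IEnumerable<int> saleItemIds)
    {
        return saleItemIds
            .GroupBy(id => id)
            .OrderByDescending(g => g.Count())
            .ThenBy(g => g.Key) // deterministic tie-break
            .Select(g => g.Key)
            .ToList();
    }

    public static void Main()
    {
        var saleItemIds = new[] { 3, 1, 2, 2, 3, 2 };

        // Without an explicit order, groups come back by first appearance.
        var unordered = saleItemIds.GroupBy(id => id).Select(g => g.Key);
        Console.WriteLine(string.Join(",", unordered)); // prints "3,1,2"

        Console.WriteLine(string.Join(",", ByTimesSold(saleItemIds))); // prints "2,3,1"
    }
}
```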
