Linq Efficiently find if collection contains duplicates using IQueryable.GroupBy - c#

The key point here is that this uses Queryable.GroupBy, not Enumerable.GroupBy.
I use Entity Framework and I want to check that there are no duplicate values. Several answers on StackOverflow suggest using GroupBy:
IQueryable<MyType> myItems = ...
IQueryable<IGrouping<string, MyType>> groupsWithSameName = myItems
.GroupBy(myItem => myItem.Name);
// note: IQueryable!
bool containsDuplicates = groupsWithSameName.Any(group => group.Skip(1).Any());
Although this is allowed on an IEnumerable, Skip is not supported on an unordered IQueryable sequence. The NotSupportedException suggests calling OrderBy before Skip.
As an alternative, I could check whether there are groups with more than one element using Count:
bool containsDuplicates = groupsWithSameName.Any(group => group.Count() > 1);
Both methods need to scan all elements in the collection, and this is the second scan, because the elements were already scanned once to group them.
Is there a method to check for duplicates on an IQueryable more efficiently?

I think that scanning of all the elements will not be avoided. In any case, the process of finding a duplicate with SQL will look like this:
SELECT
name, COUNT(*)
FROM
MyType
GROUP BY
name
HAVING
COUNT(*) > 1
It may be worth looking for a solution along these lines:
Linq with group by having count
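The SQL above maps fairly directly onto LINQ. What follows is only a minimal sketch: it assumes an in-memory array wrapped with AsQueryable() in place of the real EF set, and the sample data and variable names are made up. Whether a given EF provider turns this into a single GROUP BY ... HAVING query is something to check against the generated SQL.

```csharp
using System;
using System.Linq;

// Hypothetical sample data; in real code this would be the EF query.
var myItems = new[]
{
    new { Id = 1, Name = "alpha" },
    new { Id = 2, Name = "beta" },
    new { Id = 3, Name = "alpha" },
}.AsQueryable();

// GROUP BY Name HAVING COUNT(*) > 1, projected to just the duplicated names.
var duplicateNames = myItems
    .GroupBy(item => item.Name)
    .Where(g => g.Count() > 1)
    .Select(g => g.Key);

// Any() only asks whether at least one such group exists.
bool containsDuplicates = duplicateNames.Any();

Console.WriteLine(containsDuplicates);
```

Projecting to the key keeps the result small: only the duplicated names come back, not the grouped rows.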

Related

ordering of OrderBy, Where, Select in the Linq query

Considering this sample code
System.Collections.ArrayList fruits = new System.Collections.ArrayList();
fruits.Add("mango");
fruits.Add("apple");
fruits.Add("lemon");
IEnumerable<string> query = fruits.Cast<string>()
.OrderBy(fruit => fruit)
.Where(fruit => fruit.StartsWith("m"))
.Select(fruit => fruit);
I have two questions:
Do I need to write the last Select clause if Where returns the same type by itself? The example is from msdn, why do they always write it?
What is the correct order of these methods? Does the order affect something? What if I swap Select and Where, or OrderBy?
No, the Select is not necessary if you are not actually transforming the returned type.
In this case, the ordering of the method calls could have an impact on performance. Sorting all the objects before filtering is sure to take longer than filtering and then sorting a smaller data set.
The .Select is unnecessary in this case because .Cast already guarantees that you're working with IEnumerable<string>.
The ordering of .OrderBy and .Where doesn't affect the results of the query, but in general if you use .Where first you'll get better performance because there will be fewer elements to sort.
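To make the point about ordering concrete, here is a small sketch using the question's fruit list (with "melon" added as a made-up extra item so that the filter keeps more than one element). Both pipelines give the same result; they differ only in how many elements reach the sort.

```csharp
using System;
using System.Collections;
using System.Linq;

var fruits = new ArrayList { "mango", "apple", "lemon", "melon" };

// Filter first: only the two matching items reach the sort.
var filteredThenSorted = fruits.Cast<string>()
    .Where(fruit => fruit.StartsWith("m"))
    .OrderBy(fruit => fruit)
    .ToList();

// Sort first: all four items are sorted, then two are discarded.
var sortedThenFiltered = fruits.Cast<string>()
    .OrderBy(fruit => fruit)
    .Where(fruit => fruit.StartsWith("m"))
    .ToList();

Console.WriteLine(filteredThenSorted.SequenceEqual(sortedThenFiltered));
```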

Selecting items in an ordered list after a certain entry

I have an ordered list of objects. I can easily find an item in the list by using the following code:
purchaseOrders.FirstOrDefault(x => x.OurRef.Equals(lastPurchaseOrder, StringComparison.OrdinalIgnoreCase))
What I want to do is select all the items in the list that appear after this entry. How best to achieve this? Would it to be to get the index of this item and select a range?
It sounds like you want SkipWhile:
var orders = purchaseOrders.SkipWhile(x => !x.OurRef.Equals(...));
Once the iterator has stopped skipping, it doesn't evaluate the predicate for later entries.
Note that that code will include the entry that doesn't match the predicate, i.e. the one with the given reference. It will basically give you all entries from that order onwards. You can always use .Skip(1) if you want to skip that:
// Skip the exact match
var orders = purchaseOrders.SkipWhile(x => !x.OurRef.Equals(...)).Skip(1);
This will be linear, mind you... if the list is ordered by x.OurRef you could find the index with a binary search and take the range from there onwards... but I wouldn't do that unless you find that the simpler code causes you problems.
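As a self-contained illustration of the two variants above, here is a sketch that uses a plain list of reference strings in place of purchaseOrders and x.OurRef (the sample values are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical references standing in for purchaseOrders and x.OurRef.
var ourRefs = new List<string> { "PO-001", "PO-002", "PO-003", "PO-004" };
string lastPurchaseOrder = "po-002";

// Everything from the matching entry onwards (match included).
var fromMatch = ourRefs
    .SkipWhile(r => !r.Equals(lastPurchaseOrder, StringComparison.OrdinalIgnoreCase))
    .ToList();

// Everything strictly after the matching entry.
var afterMatch = ourRefs
    .SkipWhile(r => !r.Equals(lastPurchaseOrder, StringComparison.OrdinalIgnoreCase))
    .Skip(1)
    .ToList();

Console.WriteLine(string.Join(", ", afterMatch));
```

If nothing matches, SkipWhile skips the whole sequence, so both results are empty.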
If I understand your question correctly, you could also look at combining LINQ's Reverse and TakeWhile methods.
It may look like purchaseOrders.Reverse().TakeWhile(x => !x.OurRef.Equals(lastPurchaseOrder, StringComparison.OrdinalIgnoreCase)). Note that the result comes back in reverse order.
Sorry if code is unformatted, I'm from mobile web right now.
Maybe you want something like this:
int itemIndex = list.IndexOf(list.FirstOrDefault(x => x.OurRef.Equals(lastPurchaseOrder, StringComparison.OrdinalIgnoreCase)));
// i >= itemIndex keeps the matched entry; use i > itemIndex to get only the entries after it
var newList = list.Where((f, i) => i >= itemIndex);
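A slightly more direct variant of the index-based idea is sketched below. It assumes the data is a List<T>, so that FindIndex is available, and it uses a plain list of hypothetical reference strings instead of the question's objects. It also handles the "not found" case explicitly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical references standing in for the OurRef values.
var list = new List<string> { "PO-001", "PO-002", "PO-003", "PO-004" };
string lastPurchaseOrder = "PO-002";

// FindIndex returns -1 when nothing matches.
int itemIndex = list.FindIndex(
    r => r.Equals(lastPurchaseOrder, StringComparison.OrdinalIgnoreCase));

// Treat "not found" as "no items after"; otherwise skip up to and including the match.
var newList = itemIndex < 0
    ? new List<string>()
    : list.Skip(itemIndex + 1).ToList();

Console.WriteLine(string.Join(", ", newList));
```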

EF Many to Many select intersection

I'm trying to implement a tagging system with C# entity framework. I cannot get the query required for the case that two or more tags are expected to all be present to return a result. I have a many to many relationship (just FKs, DB first) and I am attempting to get an object when all selected tags exist. Object - LookupTable - Attributes.
I parse the selected tags into a list and then try to get only those objects for which all tags in this list are present. It appears to result in what I'd expect from an "Any" operator, not the "All".
List<string> intersectTags = new List<string>();
foreach (object i in ef.objects.Where(o => o.Attributes.All(attribute =>
intersectTags.Contains(attribute.AttributeNK))))
Update: Also needed to get instances where ef.Object had more tags than intersectTags. Filtering for instances where intersectTags is a subset of Object.Attributes.
Your code fails in case your Attributes is a subset of selected tags.
If you are looking to match when intersectTags is a subset of o.Attributes, try reversing the check.
Unfortunately, LINQ to Entities does not support this kind of syntax, so we need ToList() to load the objects and then use LINQ to Objects.
It should work, but there are performance implications (I'll post an update if I have a better solution):
List<string> intersectTags = new List<string>();
foreach (object i in ef.objects.ToList().Where(o => intersectTags.All(tag =>
o.Attributes.Any(attribute => attribute.AttributeNK == tag))))
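The reversed check can be tried out in memory. The sketch below uses made-up objects whose Tags array stands in for o.Attributes and AttributeNK. It only demonstrates the subset logic in LINQ to Objects, not what an EF provider can translate.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical objects with their tag names.
var objects = new[]
{
    new { Name = "A", Tags = new[] { "red", "big" } },
    new { Name = "B", Tags = new[] { "red", "big", "round" } },
    new { Name = "C", Tags = new[] { "red" } },
};
var intersectTags = new List<string> { "red", "big" };

// Keep objects for which every selected tag appears among the object's tags,
// i.e. intersectTags is a subset of the object's tags.
var matches = objects
    .Where(o => intersectTags.All(tag => o.Tags.Contains(tag)))
    .Select(o => o.Name)
    .ToList();

Console.WriteLine(string.Join(", ", matches));
```

Note that All returns true for an empty intersectTags list, so with no tags selected every object matches.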
I'm not sure I understood correctly; if so, I can give a solution in plain SQL. Look up all the records that contain any of the requested tags, then group them by ProductId with a HAVING COUNT clause equal to the number of tags you are passing.
SELECT ProductId FROM ProductTag
WHERE TagId IN (2,3,4)
GROUP BY ProductId
HAVING COUNT(*) = 3
Here's a demo:
http://sqlfiddle.com/#!3/dd4023/3
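For comparison, the same IN / GROUP BY / HAVING idea can be written in LINQ. This is a sketch over made-up in-memory rows wrapped with AsQueryable(); it assumes each (ProductId, TagId) pair occurs at most once, just as the SQL does, and it has not been checked against any particular EF provider.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical ProductTag rows; each (ProductId, TagId) pair is assumed unique.
var productTags = new[]
{
    new { ProductId = 1, TagId = 2 },
    new { ProductId = 1, TagId = 3 },
    new { ProductId = 1, TagId = 4 },
    new { ProductId = 2, TagId = 2 },
    new { ProductId = 2, TagId = 5 },
    new { ProductId = 3, TagId = 3 },
    new { ProductId = 3, TagId = 4 },
}.AsQueryable();
var tagIds = new List<int> { 2, 3, 4 };

var productIds = productTags
    .Where(pt => tagIds.Contains(pt.TagId))   // WHERE TagId IN (2,3,4)
    .GroupBy(pt => pt.ProductId)              // GROUP BY ProductId
    .Where(g => g.Count() == tagIds.Count)    // HAVING COUNT(*) = 3
    .Select(g => g.Key)
    .ToList();

Console.WriteLine(string.Join(", ", productIds));
```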
I'm sorry, currently I cannot give you an implementation in EF (don't have Visual Studio with me), I did something similar for LINQ TO SQL and it uses the PredicateBuilder class, you can find it here:
http://www.codeproject.com/Articles/36178/How-to-manage-product-options-with-different-price
Paolo

best way to use LINQ to query against a list

I have a collection
IEnumerable<Project>
and I want to filter it on the project's Id property, to include any project whose Id is in a list:
List<int> Ids
What is the best way to write a where clause that checks whether a property is contained in a list?
var filteredProjectCollection = projectCollection.Where(p => Ids.Contains(p.id));
You may be able to get a more efficient implementation using the Except method:
var specialProjects = Ids.Select(id => new Project(id));
var filtered = projects.Except(specialProjects, comparer);
The tricky thing is that Except works with two collections of the same type - so you want to have two collections of projects. You can get that by creating new "dummy" projects and using comparer that compares projects just based on the ID.
Alternatively, you could use Except just on collections of IDs, but then you may need to lookup projects by the ID, which makes this approach less appealing.
var nonExcludedProjects = from p in allprojects where Ids.Contains(p.Id) select p;
If you're going to use one of the .Where(p => list.Contains(p)) answers, you should consider first making a HashSet out of the list so that it doesn't have to do an O(n) search each time. This cuts running time from O(mn) to O(m+n).
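A minimal sketch of the HashSet suggestion, using made-up project data (the anonymous objects stand in for the question's Project type):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical projects; only the Id matters for the filter.
var projectCollection = new[]
{
    new { Id = 1, Name = "one" },
    new { Id = 2, Name = "two" },
    new { Id = 3, Name = "three" },
    new { Id = 4, Name = "four" },
};
var Ids = new List<int> { 2, 4, 2 };

// Build the set once; each Contains call is then an average O(1) lookup.
var idSet = new HashSet<int>(Ids);

var filteredProjectCollection = projectCollection
    .Where(p => idSet.Contains(p.Id))
    .ToList();

Console.WriteLine(string.Join(", ", filteredProjectCollection.Select(p => p.Id)));
```

For a short Ids list the difference is unlikely to matter; the set pays off as both collections grow.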
I'm not sure that I understand your question but I'll have a shot.
If you have: IEnumerable<T> enumerable,
and you want to filter it such that it only contains items that are also present in the list: List<T> list,
then: IEnumerable<T> final = enumerable.Where(e => list.Contains(e));

How to use LINQ to SQL to create ranked search results?

I am looking for a way to use LINQ to SQL (l2s) to return ranked results based on keywords.
I would like to take a keyword and be able to search the table for that keyword using .Contains(). The trick that I haven't been able to figure out is how to get a count of how many times that keyword appears, and then .OrderByDescending() based on that count.
So if I had something like:
string keyword = "SomeKeyword";
IQueryable<Article> searchResults = from a in GenesisRepository.Article
where a.Body.Contains(keyword)
select a;
What is the best way to order searchResults based on the number of times keyword appears in a.Body?
Thanks for any help.
Try adding orderby a.Body.Split(' ').Count(w => w == keyword) to the query. That should let you see that the concept works. However, I STRONGLY recommend that the final version include this as part of the select projection, possibly using a key-value pair, and order by the property name:
string keyword = "SomeKeyword";
//EDIT: restructured query to force the ordering to be done on the projection,
//not the source.
IQueryable<KeyValuePair<int, Article>> searchResults = (from a in GenesisRepository.Article
where a.Body.Contains(keyword)
select new KeyValuePair<int, Article>(
a.Body.Split(' ').Count(w => w == keyword), a))
.OrderByDescending(kvp => kvp.Key); // highest count first
The reason is performance; the Split().Count() method chain is linear-complexity, and will be evaluated for every comparison of two values, making the overall sort N^2logN complexity (slow).
EDIT: Also, understand that a.Body.Contains(keyword) will not search by whole words, and so will return articles that contain "SomeKeywordLongerThanSearch" and "ThisIsSomeKeyword" as well as "SomeKeyword". You can avoid this with a Regex match on the pattern "\bSomeKeyword\b", which will only match instances of SomeKeyword with a word boundary immediately before and after.
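The whole-word counting can be sketched in memory as follows. The sample bodies are made up, and this runs as LINQ to Objects only; it does not assume that LINQ to SQL can translate Regex calls, so in practice the articles would need to be loaded first.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string keyword = "SomeKeyword";

// Hypothetical article bodies.
var bodies = new[]
{
    "SomeKeyword appears once here",
    "ThisIsSomeKeyword should not count",
    "SomeKeyword twice: SomeKeyword",
};

// \b anchors the match at word boundaries; Regex.Escape guards against special characters.
var pattern = @"\b" + Regex.Escape(keyword) + @"\b";

// Count the hits once per body in the projection, then sort on the stored count.
var ranked = bodies
    .Select(body => new { Body = body, Hits = Regex.Matches(body, pattern).Count })
    .Where(x => x.Hits > 0)
    .OrderByDescending(x => x.Hits)
    .ToList();

foreach (var r in ranked)
    Console.WriteLine($"{r.Hits}: {r.Body}");
```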
This is a little hack I came up with, pretty simple but definitely not a "best practices" one.
IQueryable<Article> searchResults = from a in GenesisRepository.Article
where a.Body.Contains(keyword)
orderby a.Body.Split(new string[] { keyword }, StringSplitOptions.RemoveEmptyEntries).Count() descending
select a;
Maybe this will work...
IQueryable<Article> searchResults = from a in GenesisRepository.Article
where a.Body.Contains(keyword)
select a;
searchResults = searchResults.OrderByDescending(s => Regex.Matches(s.Body, keyword).Count);
