C# LINQ Union question

Could someone explain how Union in LINQ works?
It is said to merge two sequences and remove duplicates.
But can I somehow customize the duplicate removal behavior - let's say if I wish to use the element from the second sequence in case of duplicate or from the first sequence.
Or even if I wish to somehow combine those values in the resulting sequence?
How should that be implemented?
Update
I guess I described the problem incorrectly, let's say we have some value:
class Value {
    public string name;
    public int whatever;
}
and the comparer used performs a x.name == y.name check.
And let's say that sometimes I know I should take the element from the second sequence, because its whatever field is newer / better than the whatever field of the element from the first sequence.
Anyway, I would use the sequence1.Union(sequence2) or sequence2.Union(sequence1) variation of the methods.
Thank you

You can use second.Union(first) instead of first.Union(second). That way, it will keep the items from second rather than the items from first.

When the object returned by this method is enumerated, Union enumerates first and second in that order and yields each element that has not already been yielded.
http://msdn.microsoft.com/en-us/library/bb341731.aspx
So the elements from whichever sequence you use as the left parameter take precedence over the elements from the right parameter.
The important thing about this is that it's well-defined and documented behavior, and not just an implementation detail that might change in the next version of .NET.
As a side note, when you implement an IEqualityComparer<T>, it's important to keep Equals and GetHashCode consistent. In this case I prefer to explicitly supply an equality comparer to the Union method instead of having the object's own Equals return true for objects which are not identical for all purposes.
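To illustrate the precedence rule and the explicit comparer, here is a minimal sketch. The Value class is modeled on the one in the question (with conventional casing), and NameComparer and UnionPrecedenceDemo are hypothetical names introduced only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical element type modeled on the question's Value class.
public class Value
{
    public string Name;
    public int Whatever;
    public Value(string name, int whatever) { Name = name; Whatever = whatever; }
}

// Hypothetical comparer: two values count as duplicates when their names match.
public class NameComparer : IEqualityComparer<Value>
{
    public bool Equals(Value x, Value y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.Name == y.Name;
    }

    // Kept consistent with Equals: equal names always give equal hash codes.
    public int GetHashCode(Value obj)
    {
        return (obj == null || obj.Name == null) ? 0 : obj.Name.GetHashCode();
    }
}

public static class UnionPrecedenceDemo
{
    public static List<Value> Run()
    {
        var first = new[] { new Value("a", 1), new Value("b", 2) };
        var second = new[] { new Value("b", 20), new Value("c", 30) };

        // second is the left operand, so its "b" is yielded first and
        // the "b" from first is dropped as a duplicate.
        return second.Union(first, new NameComparer()).ToList();
    }
}
```

With this input the result holds three elements, and the "b" element carries Whatever == 20, that is, the one from second.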

If the elements are duplicates then it doesn't matter which list they are taken from - unless your equality comparer doesn't take all properties of the element into account of course.
If they aren't really duplicates then they'll both appear in the resultant union.
UPDATE
From your new information, at the minimum you should write a new equality comparer that takes whatever into account. You can't just use sequence1.Union(sequence2) or sequence2.Union(sequence1) unless all the elements should be taken from one sequence or the other.
At the extreme you'll have to write your own Union extension method which does this for you.
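One possible shape for such an extension method is sketched below. MergingUnion is a hypothetical name, not part of LINQ, and the sketch assumes duplicates are identified by a key and resolved pairwise by a caller-supplied function.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MergingUnionExtensions
{
    // Hypothetical helper: concatenates both sequences, groups elements by key,
    // and folds every group into a single element using resolve.
    public static IEnumerable<T> MergingUnion<T, TKey>(
        this IEnumerable<T> first,
        IEnumerable<T> second,
        Func<T, TKey> keySelector,
        Func<T, T, T> resolve)
    {
        return first.Concat(second)
                    .GroupBy(keySelector)
                    .Select(group => group.Aggregate(resolve));
    }
}

public static class MergingUnionDemo
{
    public static List<KeyValuePair<string, int>> Run()
    {
        var first = new[]
        {
            new KeyValuePair<string, int>("a", 1),
            new KeyValuePair<string, int>("b", 2)
        };
        var second = new[]
        {
            new KeyValuePair<string, int>("b", 20),
            new KeyValuePair<string, int>("c", 3)
        };

        // For duplicates, keep whichever element has the larger value.
        return first.MergingUnion(second, p => p.Key,
                                  (x, y) => x.Value >= y.Value ? x : y)
                    .ToList();
    }
}
```

Note that this sketch also merges duplicates that occur within a single sequence, and the resolve function could just as well build a new combined element instead of picking one of the two.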

Related

Forcing/Requiring Sorted List Argument for Method

Suppose a method has been written that expects a sorted list as one of its input.
Of course this will be commented and documented in the code, param will be named "sortedList" but if someone forgets, then there will be a bug.
Is there a way to FORCE the input must be sorted?
I was thinking about creating a new object class with a list and a boolean "sorted", and the passed object has to be that object, and then the method checks immediately if the "sorted" boolean is true. But I feel like there must be a better/standard way.
*This method is called in a loop, so don't want to sort inside the method.
Assuming that you only need to iterate this collection, and not perform any other operations, you can accept an IOrderedEnumerable<T>, which would require that the sequence has been ordered by something. (Keep in mind that it may have been sorted by some other criteria than you expected, so it's still possible that, by the criteria you're using internally, the data is not sorted.)
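A small sketch of what accepting an IOrderedEnumerable<T> looks like; OrderedInputDemo and Smallest are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrderedInputDemo
{
    // Hypothetical method: the parameter type forces callers to pass the
    // result of OrderBy/OrderByDescending/ThenBy rather than a raw list.
    public static int Smallest(IOrderedEnumerable<int> sortedValues)
    {
        // Relies on ascending order, which the type alone cannot guarantee.
        return sortedValues.First();
    }

    public static int Run()
    {
        var values = new List<int> { 3, 1, 2 };

        // Smallest(values) would not compile:
        // List<int> is not an IOrderedEnumerable<int>.
        return Smallest(values.OrderBy(v => v));
    }
}
```

Passing values.OrderByDescending(v => v) compiles just as well and makes Smallest return the largest element, which is the caveat described above. Also keep in mind that OrderBy is deferred, so the sort runs when the sequence is enumerated inside the method.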
The other option is to simply sort the data after you receive it, instead of requiring the caller to sort it. Note that for many common sorting algorithms an already sorted data set is the best case (typically O(n) instead of O(n*log(n))), so even if the data set is sometimes already sorted and sometimes not, this is not necessarily terrible, so long as you don't have a huge data set.
First, let's answer the question asked here.
Is there a way to FORCE the input must be sorted?
Well, yes and no. You can specify that you need one of the data structures in .NET that has a sort order. On the other hand, no, you can't specify that it uses a sort order you care about. As such, it could be sorted by a random number, which would be the same as "unsorted" (probably) in your context.
Let me expand on that. There is no way for you to declare a type or method with a requirement that the compiler can verify that the data passed to the method is sorted according to some rules you decide upon. There simply isn't a syntax that will allow you to declare such a requirement. You either have to trust the calling code to have sorted the data correctly, or not.
So what have you got left?
My advice would be to create a method where the calling code would tell you that the data has been sorted according to some predefined requirement for calling that method. If the caller said "no, I haven't or cannot guarantee that the data is in that sort order", then you will have to sort it yourself.
Other than that you could create your own data structure that would imply the correct type of sorting.
It is possible to express and enforce such constraints in more powerful type systems, but not in the type system of C# or .NET. You could flag the collection in some way, as you suggested, but this will not really make sure that the collection is actually sorted. You could use a boolean flag as you suggested, or a special class or interface.
Personally I would not try to enforce it this way, but would instead check at runtime that the collection is sorted, which costs O(n) time. If you are iterating over the collection anyway, it would be easy to just check in every iteration that the current value is larger than the last one and throw an exception if this condition is violated.
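The check-while-iterating idea could be sketched roughly as follows. SortedGuard and EnsureSorted are hypothetical names, and the sketch assumes ascending order under the default comparer, with equal neighbors allowed.

```csharp
using System;
using System.Collections.Generic;

public static class SortedGuard
{
    // Hypothetical helper: passes items through unchanged and throws as soon
    // as an item is smaller than its predecessor. The check costs O(n) in total
    // and only runs while the caller is enumerating the result.
    public static IEnumerable<T> EnsureSorted<T>(IEnumerable<T> source)
    {
        IComparer<T> comparer = Comparer<T>.Default;
        bool hasPrevious = false;
        T previous = default(T);

        foreach (T item in source)
        {
            if (hasPrevious && comparer.Compare(previous, item) > 0)
                throw new ArgumentException("Input is not sorted in ascending order.");

            previous = item;
            hasPrevious = true;
            yield return item;
        }
    }
}
```

The method that needs sorted input would then loop over SortedGuard.EnsureSorted(sortedList) instead of sortedList itself, so a caller who forgot to sort gets an exception rather than a silent bug.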
Another option would be to use a sorting algorithm that runs in O(n) on a sorted list and just sort the collection every time. This will not add too much overhead in the case where the list really is already sorted, but it will still work if it is not. Insertion sort has the required property of running in O(n) on a sorted list. Bubble sort has the property too, but is really slow in other cases.

Does `Any()` force LINQ execution?

I have a LINQ to Entities query.
Will Any() force LINQ execution (like ToList() does)?
There is a very good MSDN article, Classification of Standard Query Operators by Manner of Execution, which describes all the standard LINQ operators. As you can see from the table, Any is executed immediately (as are all operators that return a single value). You can always refer to this table if you have doubts about how an operator executes.
Yes and no. The Any method will read items from the source right away, but it's not guaranteed to read all items.
The Any method will enumerate items from the source, but only as many as needed to determine the result.
Without any parameter, the Any method will only try to read the first item from the source.
With a parameter, the Any method will only read items from the source until it finds one that satisfies the condition. All items are read only if no item before the last one satisfies the condition.
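The behavior described above can be observed with a counting iterator. This is only a sketch for LINQ to Objects; AnyShortCircuitDemo, Numbers and CountReads are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AnyShortCircuitDemo
{
    // Counts how many items the source actually produced.
    public static int ItemsRead;

    // Yields 1..5 and records every item handed out.
    static IEnumerable<int> Numbers()
    {
        for (int i = 1; i <= 5; i++)
        {
            ItemsRead++;
            yield return i;
        }
    }

    // Runs the given query against a fresh source and reports the read count.
    public static int CountReads(Func<IEnumerable<int>, bool> query)
    {
        ItemsRead = 0;
        query(Numbers());
        return ItemsRead;
    }
}
```

Here CountReads(s => s.Any()) reports 1, CountReads(s => s.Any(n => n == 3)) reports 3, and CountReads(s => s.Any(n => n > 10)) reports 5, because only the last query has to exhaust the source.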
This is easy to discover: Any() returns a simple bool. Since a bool is always a bool, and not an IQueryable or IEnumerable (or any other type) that can have a custom implementation, we must conclude that Any() itself must calculate the boolean value to return.
The exception is of course if the Any() is used inside a subquery on an IQueryable, in which case the LINQ provider will typically just analyse the presence of the call to Any() and convert it to corresponding SQL (for example).
Short question, short answer: Yes it will.
To find out if any element of the list matches the given condition (or if there is any element at all), the list will have to be enumerated. As MSDN states:
This method does not return any one element of a collection. Instead, it determines whether the collection contains any elements.
The enumeration of source is stopped as soon as the result can be determined.
Deferred execution does not apply here, because this method delivers the result of an enumeration, not another IEnumerable.

LINQ Ring: Any() vs Contains() for Huge Collections

Given a huge collection of objects, is there a performance difference between the following?
Collection.Contains:
myCollection.Contains(myElement)
Enumerable.Any:
myCollection.Any(currentElement => currentElement == myElement)
Contains() is an instance method, and its performance depends largely on the collection itself. For instance, Contains() on a List is O(n), while Contains() on a HashSet is O(1).
Any() is an extension method, and will simply go through the collection, applying the delegate on every object. It therefore has a complexity of O(n).
Any() is more flexible however since you can pass a delegate. Contains() can only accept an object.
It depends on the collection. If you have an ordered collection, then Contains might do a smart search (binary, hash, b-tree, etc.), while with Any() you are basically stuck with enumerating until you find it (assuming LINQ-to-Objects).
Also note that in your example, Any() is using the == operator which will check for referential equality, while Contains will use IEquatable<T> or the Equals() method, which might be overridden.
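The difference between == and Equals() can be seen with a small sketch. Point and EqualityDemo are hypothetical names; the class overrides Equals but does not overload the == operator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type with value-style Equals but no overloaded == operator.
public class Point
{
    public readonly int X;
    public readonly int Y;
    public Point(int x, int y) { X = x; Y = y; }

    public override bool Equals(object obj)
    {
        var other = obj as Point;
        return other != null && X == other.X && Y == other.Y;
    }

    public override int GetHashCode() { return X * 31 + Y; }
}

public static class EqualityDemo
{
    static readonly List<Point> Points = new List<Point> { new Point(1, 2) };

    // List<T>.Contains goes through Equals, so an equal copy is found.
    public static bool ViaContains()
    {
        return Points.Contains(new Point(1, 2));
    }

    // == on a class without an overloaded operator compares references,
    // so a distinct but equal instance is not found.
    public static bool ViaAny()
    {
        var target = new Point(1, 2);
        return Points.Any(p => p == target);
    }
}
```

ViaContains() returns true while ViaAny() returns false. Writing the predicate as p => p.Equals(target) makes the two calls agree.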
I suppose that would depend on the type of myCollection, which dictates how Contains() is implemented. A sorted binary tree, for example, could search smarter. It may also take the element's hash into account. Any(), on the other hand, will enumerate through the collection until the first element that satisfies the condition is found. There are no optimizations for when the collection has a smarter search method.
Contains() is also an extension method which can work fast if you use it in the correct way.
For ex:
var result = context.Projects.Where(x => lstBizIds.Contains(x.businessId)).Select(x => x.projectId).ToList();
This will give the query
SELECT Id
FROM Projects
INNER JOIN (VALUES (1), (2), (3), (4), (5)) AS Data(Item) ON Projects.UserId = Data.Item
while Any(), on the other hand, always iterates through the collection in O(n).
Hope this will work....

Calling Distinct<>() on HashSet<T>

I'm just curious. When I call Distinct<>() (from LINQ) on a HashSet, does .NET know that this IEnumerable always contains a distinct set of values, and optimize the call away?
Judging by looking at the code through Reflector, I would have to say no.
The code ends up constructing an instance of an iterator-method-generated class regardless of what type you give it.
This problem is also compounded by the fact that you can specify comparer objects for both the Hashset and the Distinct method, which means the optimization would only be used in very few cases.
For instance, in the following case it could actually optimize the call away, but it wouldn't be able to know that:
var set = new HashSet<int>(new MyOwnInt32Comparer());
var distinct = set.Distinct(new MyOwnInt32Comparer());
Since I give it two instances of the comparer class, and such classes usually don't implement equality methods, the Distinct method would have no way of knowing that the two comparer implementations are actually identical.
In any case, this is a case where the programmer knows more about the code than the runtime, so take advantage of it. Linq may be very good but it's not omnipotent, so use your knowledge to your advantage.
I think not, because the Distinct method of the Enumerable class takes an IEnumerable as input, and nothing there checks specifically whether it is a hash set (so no optimization is done).
No, looking at the implementation in reflector, it doesn't check if the enumeration is a HashSet<T>. The underlying iterator creates a new set and fills it during enumeration, so the overhead shouldn't be that large though.

Fastest way to find out whether two ICollection<T> collections contain the same objects

What is the fastest way to find out whether two ICollection<T> collections contain precisely the same entries? Brute force is clear, I was wondering if there is a more elegant method.
We are using C# 2.0, so no extension methods if possible, please!
Edit: the answer would be interesting both for ordered and unordered collections, and would hopefully be different for each.
use C5
http://www.itu.dk/research/c5/
ContainsAll
"Check if all items in a supplied collection is in this bag (counting multiplicities).
Parameter items: The items to look for.
Returns: True if all items are found."
[Tested]
public virtual bool ContainsAll<U>(SCG.IEnumerable<U> items) where U : T
{
HashBag<T> res = new HashBag<T>(itemequalityComparer);
foreach (T item in items)
if (res.ContainsCount(item) < ContainsCount(item))
res.Add(item);
else
return false;
return true;
}
First compare the .Count of the collections. If they have the same count, then do a brute-force comparison of all elements. The worst case is O(n). This applies when the order of elements needs to be the same.
The second case, where the order is not the same, requires a dictionary to store the count of elements found in the collections. Here's a possible algorithm:
Compare the collections' Count: return false if they are different
Iterate the first collection
If the item doesn't exist in the dictionary, then add an entry with Key = Item, Value = 1 (the count)
If the item exists, increment the count for the item in the dictionary
Iterate the second collection
If the item is not in the dictionary, then return false
If the item is in the dictionary, decrement the count for the item
If count == 0, then remove the item
return Dictionary.Count == 0;
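The steps above might be written as follows. This is a sketch that sticks to C# 2.0 features (no LINQ, no extension methods); CollectionComparer and HaveSameItems are hypothetical names, and it assumes the collections contain no null items, since Dictionary does not accept null keys.

```csharp
using System.Collections.Generic;

public static class CollectionComparer
{
    // Returns true when both collections hold the same items with the same
    // multiplicities, regardless of order. Uses the default equality of T.
    public static bool HaveSameItems<T>(ICollection<T> a, ICollection<T> b)
    {
        if (a.Count != b.Count) return false;

        // Count how often each item occurs in the first collection.
        Dictionary<T, int> counts = new Dictionary<T, int>();
        foreach (T item in a)
        {
            int n;
            counts.TryGetValue(item, out n); // n stays 0 when the item is absent
            counts[item] = n + 1;
        }

        // Cancel the counts out against the second collection.
        foreach (T item in b)
        {
            int n;
            if (!counts.TryGetValue(item, out n)) return false;
            if (n == 1) counts.Remove(item);
            else counts[item] = n - 1;
        }

        return counts.Count == 0;
    }
}
```

Assuming hashing and equality of T are O(1), the whole comparison runs in O(n) time with O(n) extra space.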
For ordered collections, you can use the SequenceEqual() extension method defined by System.Linq.Enumerable:
if (firstCollection.SequenceEqual(secondCollection))
You mean the same entries or the same entries in the same order?
Anyway, assuming you want to check whether they contain the same entries in the same order, "brute force" is really your only option in C# 2.0. I know what you mean by inelegant, but if the atomic comparison itself is O(1), the whole process should be O(N), which is not that bad.
If the entries need to be in the same order (besides being the same), then I suggest - as an optimization - that you iterate both collections at the same time and compare the current entry in each collection. Otherwise, the brute force is the way to go.
Oh, and another suggestion - you could override Equals for the collection class and implement the equality stuff in there (depends on your project, though).
Again, using the C5 library, having two sets, you could use:
C5.ICollection<T> set1 = new C5.HashSet<T>();
C5.ICollection<T> set2 = new C5.HashSet<T>();
if (set1.UnsequencedEquals (set2)) {
// Do something
}
The C5 library includes a heuristic that actually tests the unsequenced hash codes of the two sets first (see C5.ICollection<T>.GetUnsequencedHashCode()) so that if the hash codes of the two sets are unequal, it doesn't need to iterate over every item to test for equality.
Also something of note to you is that C5.ICollection<T> inherits from System.Collections.Generic.ICollection<T>, so you can use C5 implementations while still using the .NET interfaces (though you have access to less functionality through .NET's stingy interfaces).
Brute force takes O(n) - comparing all elements (assuming they are sorted), which I would think is the best you could do - unless there is some property of the data that makes it easier.
I guess for the unsorted case, it's O(n*n).
In which case, I would think a solution based around a merge sort would probably help.
For example, could you re-model it so that there was only one collection? Or 3 collections, one for those in collection A only, one for B only and for in both - so if the A only and B only are empty - then they are the same... I am probably going off on totally the wrong tangent here...
