What is the difference between these two method calls?
HashSet<T>.IsSubsetOf()
HashSet<T>.IsProperSubsetOf()
See here
If the current set is a proper subset of other, other must have at least one element that the current set does not have.
vs here
If other contains the same elements as the current set, the current set is still considered a subset of other.
The difference is set.IsSubsetOf(set) == true, whereas set.IsProperSubsetOf(set) == false
This comes from set theory:
S = {e,s,t}, T = {e,s,t}
T is a subset of S because every element in T is also in S. However, it is not a proper subset: a proper subset, like an ordinary subset, contains only elements of the superset, but it must also have fewer elements than the superset. Example:
S = {e,s,t}, T = {e,t}
T is a proper subset of S.
With IsProperSubsetOf, the current set cannot contain all of the other set's elements; it must be missing at least one of them.
With IsSubsetOf, any subset counts, including one containing every element of the other set.
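A quick illustration (a minimal sketch using the sets from the example above):

using System;
using System.Collections.Generic;

class SubsetDemo
{
    static void Main()
    {
        var s = new HashSet<char> { 'e', 's', 't' };
        var t = new HashSet<char> { 'e', 's', 't' };

        Console.WriteLine(t.IsSubsetOf(s));        // True: every element of t is in s
        Console.WriteLine(t.IsProperSubsetOf(s));  // False: s has no element that t lacks

        t.Remove('s');                             // t = { e, t }
        Console.WriteLine(t.IsSubsetOf(s));        // True
        Console.WriteLine(t.IsProperSubsetOf(s));  // True: s now has an element t does not
    }
}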
From the "Examples" section found here:
"The following example creates two disparate HashSet objects and compares them to each other. In this example, lowNumbers is both a subset and a proper subset of allNumbers until allNumbers is modified, using the IntersectWith method, to contain only values that are present in both sets. Once allNumbers and lowNumbers are identical, lowNumbers is still a subset of allNumbers but is no longer a proper subset."
Related
In C# code dealing with a HashSet, do we ever need to use the LongCount() extension method of IEnumerable to get the number of elements in the set, or should one always use the Count property, which is of type Int32 and has a maximum value of 2^31 - 1?
If such a limitation on the maximum number of elements in a HashSet exists, then what type should be used when we need to deal with a set that has numberOfElements >= 2^31?
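For reference, a minimal sketch showing both calls in question (the set contents are arbitrary); Count is the Int32 property on the collection itself, while LongCount() is the LINQ extension method that returns an Int64:

using System;
using System.Collections.Generic;
using System.Linq;

class CountDemo
{
    static void Main()
    {
        var numbers = new HashSet<int> { 1, 2, 3 };

        int count = numbers.Count;             // Count property: Int32, max 2^31 - 1
        long longCount = numbers.LongCount();  // LINQ extension method: returns Int64

        Console.WriteLine(count + " / " + longCount);
    }
}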
Update:
The exact answer can be found if one runs this as part of their code:
var set = new HashSet<bool>(Int32.MaxValue - 1);
Then this exception is thrown -- note what the message says:
But anyway, one shouldn't be forced to experiment just to find a simple fact, whose place is in the language documentation!
So far I have
#if (!(Model.CurrentVersion.LRC.List == Model.PrevVersion.LRC.List))
I want to see if the list from the previous version matches the current version; however, this only returns true (with the !). Both of the lists are empty, but it still doesn't return false.
Is there a better way of seeing if lists match? And why is it always returning true?
Thanks!
You need to check whether the contents of the lists are equal. There are several ways of doing so.
If the order of items is important then try
SequenceEqual
#if(!Model.CurrentVersion.LRC.List.SequenceEqual(Model.PrevVersion.LRC.List))
If you don't care about the order of the items in the lists, you could use
!Model.CurrentVersion.LRC.List.All(Model.PrevVersion.LRC.List.Contains)
(note that this only checks one direction; if the lists can differ in length, check the reverse as well).
Now you still have the problem that you are comparing whether the items in the lists are the same object references. You might want to check whether these items are equal in content. For that you need to implement an IEqualityComparer<T>. The SequenceEqual documentation page has a good example of how to do this.
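For illustration, here is a minimal comparer sketch; the Item type and its properties are invented here, since the actual element type of LRC.List isn't shown:

using System.Collections.Generic;
using System.Linq;

// Hypothetical element type standing in for whatever LRC.List actually contains.
class Item
{
    public string Code { get; set; }
    public int Amount { get; set; }
}

// Compares two Items by content rather than by reference.
class ItemComparer : IEqualityComparer<Item>
{
    public bool Equals(Item x, Item y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.Code == y.Code && x.Amount == y.Amount;
    }

    public int GetHashCode(Item obj)
    {
        // Simple hash combining both properties.
        return (obj.Code == null ? 0 : obj.Code.GetHashCode()) ^ obj.Amount;
    }
}

The check then becomes Model.CurrentVersion.LRC.List.SequenceEqual(Model.PrevVersion.LRC.List, new ItemComparer()).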
I am using:
mybookings.listOfBookings.Sort(Function(x, y) x.checkIn.Date.CompareTo(y.checkIn.Date))
Now, this works great, but I have run into a situation which introduces a bug in the rest of my code.
When I have a list of objects like the above, and both objects' checkIn has the same date value, the sort re-arranges the list. I expect it to keep the list as is, because there is nothing to sort.
When it swaps the items around, it introduces issues further down in my code, which expects the list to have been kept as is. Yes, one can argue that I need to fix that, but in this case there are many other factors that expect the first occurrence in the list to be the actual first occurrence (based on other properties of the object).
In a nutshell, objects get "paid" from earliest (checkIn) to latest (checkIn) elsewhere in the software. What happens here is that there is now an "unpaid" object at position 1 in the list, when it should be, and originally was, at position 2, for example.
The list can have any number of objects
The list normally has multiple objects, and it correctly sorts it based on different Date values
The field sorted on is a Date type.
I can't check positions before and after the sort... well, I could, but that defeats the point of using the sort function. I could then roll my own sorting routine (which I don't want to do unless absolutely necessary).
Can one somehow force the sort to not re-order the list if there are no changes?
This is documented:
This implementation performs an unstable sort; that is, if two
elements are equal, their order might not be preserved. In contrast, a
stable sort preserves the order of elements that are equal.
You could first check whether all the dates are the same:
Dim first As Date = mybookings.listOfBookings.First().checkIn.Date
Dim allSameDate As Boolean = mybookings.listOfBookings.Skip(1).
All(Function(x) x.checkIn.Date = first)
If Not allSameDate Then
    ' now you can sort
    mybookings.listOfBookings.Sort(Function(x, y) x.checkIn.Date.CompareTo(y.checkIn.Date))
End If
Another way is to use LINQ to create a new List(Of T); Enumerable.OrderBy is stable:
This method performs a stable sort; that is, if the keys of two
elements are equal, the order of the elements is preserved
Stop using List.Sort and use Enumerable.OrderBy instead. The latter implements a stable sort, in contrast to the former:
This method performs a stable sort; that is, if the keys of two
elements are equal, the order of the elements is preserved. In
contrast, an unstable sort does not preserve the order of elements
that have the same key.
I'm not comfortable with VB.NET, but in C# you would write that as
var sorted = mybookings.listOfBookings.OrderBy(b => b.checkIn.Date);
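Note that OrderBy returns a lazily evaluated IEnumerable rather than a List; if the surrounding code still needs a List, materialize it with ToList() (a sketch, assuming listOfBookings is a writable List property):

mybookings.listOfBookings = mybookings.listOfBookings.OrderBy(b => b.checkIn.Date).ToList();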
This question already has answers here:
Why does List<T>.Sort method reorder equal IComparable<T> elements?
I've found that when I implement CompareTo(..) for one of my classes, the ordering is inconsistent between machines. When two objects come out equal, they don't always sort in the same order. I'd have assumed some single-threaded iterative approach would be used to sort, so I would have expected a consistent ordering.
Given the following class...
class Property : IComparable<Property>
{
    public int Value;
    public string Name;

    public int CompareTo(Property other)
    {
        if (this.Value > other.Value)
            return 1;
        if (this.Value < other.Value)
            return -1;
        return 0;
    }
}
And given the following objects
{
    list.Add(new Property { Name = "apple",  Value = 1 });
    list.Add(new Property { Name = "grape",  Value = 2 });
    list.Add(new Property { Name = "banana", Value = 1 });
}
When I execute
list.Sort();
Then, when stepping through the list using indexes, the order of banana and apple changes depending on which PC I am executing the code on. Why is this?
List.Sort doesn't provide a stable sort, as per MSDN:
This implementation performs an unstable sort; that is, if two elements are equal, their order might not be preserved. In contrast, a stable sort preserves the order of elements that are equal.
If you need a stable sort consider using OrderBy from LINQ, which is a stable sort.
Since the CompareTo function only compares along the integer (Value) axis, there is no guarantee about the ordering along the other (Name) axis.
Even an unstable sort (e.g. List.Sort) with the same implementation and the same input sequence should produce the same sorted output. Thus, given an incomplete compare function as above, the results may differ across environments (PCs) for the following reasons:
The sort implementation uses a different or environment-sensitive algorithm, or;
The initial order of the items differs across environments.
A stable sort (e.g. OrderBy in LINQ to Objects) will produce consistent output across different environments if and only if the input sequence is in the same order in each environment: a stable sort is free of issue #1 but still affected by #2.
(While the input order is guaranteed given the posted sample code, it might not be in a "real" situation, such as if the items initially came from an unordered source.)
I would likely update the compare function such that it defines a total ordering over all item attributes - then the results will be consistent for all input sequences over all environments and sorting methods.
An example of such an implementation:
public int CompareTo(Property other)
{
    var valueCmp = this.Value.CompareTo(other.Value);
    if (valueCmp == 0) {
        // Same Value, order by Name
        return this.Name.CompareTo(other.Name);
    } else {
        // Different Values
        return valueCmp;
    }
}
Some sorting methods guarantee that items which compare equal will appear in the same sequence relative to each other after sorting as they did before sorting, but such a guarantee generally comes with a cost. Very few sorting algorithms can move an item directly from its original position to its final position in a single step (and those which do so are generally slow). Instead, sorting algorithms move items around a few times, and generally don't keep track of where items came from. In Anthony Hoare's "quicksort" algorithm, a typical step will pick some value and move things around so that everything which compares less than that value will end up in array elements which precede all the items which are greater than that value. If one ignores array elements that precisely equal the "pivot" value, there are four categories of items:
Items which are less than the pivot value, and started in parts of the array that precede the pivot value's final location.
Items which are less than the pivot value, but started in parts of the array that follow the pivot value's final location.
Items which are greater than the pivot value, but started in parts of the array that precede the pivot value's final location.
Items which are greater than the pivot value, and started in parts of the array that follow the pivot value's final location.
When moving elements to separate those above and below the pivot value, items in the first and fourth categories (i.e. those which started and ended on the same side of the pivot value's final location) generally remain in their original order, but are intermixed with those which started on the other side, which get copied in reverse order. Because the values from each side get intermixed, there's no way the system can restore the sequence of items which crossed from one side of the pivot to the other unless it keeps track of which items those are. Such tracking would require extra time and memory, which Sort opts not to spend.
If two runs of Sort on a given list always perform the same sequence of steps, then the final ordering should match. The documentation for Sort, however, does not specify the exact sequence of steps it performs, and it would be possible that one machine has a version of .NET which does things slightly differently from the other. The essential thing to note is that if you need a "stable" sort [one in which the relative sequence of equal items after sorting is guaranteed to match the sequence before sorting], you will either need to use a sorting method which guarantees such behavior, or else add an additional field to your items which could be preloaded with their initial position in the array, and make your CompareTo function check that field when items are otherwise equal.
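A minimal sketch of that last approach, reusing the Property class from the question (the OriginalIndex field and the demo scaffolding are additions for illustration):

using System;
using System.Collections.Generic;

class Property : IComparable<Property>
{
    public int Value;
    public string Name;
    public int OriginalIndex;            // preloaded with the pre-sort position

    public int CompareTo(Property other)
    {
        int byValue = Value.CompareTo(other.Value);
        if (byValue != 0) return byValue;
        // Equal values: fall back to the original position to keep the sort stable.
        return OriginalIndex.CompareTo(other.OriginalIndex);
    }
}

class StableSortDemo
{
    static void Main()
    {
        var list = new List<Property>
        {
            new Property { Name = "apple",  Value = 1 },
            new Property { Name = "grape",  Value = 2 },
            new Property { Name = "banana", Value = 1 },
        };

        for (int i = 0; i < list.Count; i++)
            list[i].OriginalIndex = i;   // record the pre-sort positions

        list.Sort();                     // apple now always precedes banana

        foreach (var p in list)
            Console.WriteLine(p.Name);
    }
}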
In some library code, I have a List that can contain 50,000 items or more.
Callers of the library can invoke methods that result in strings being added to the list. How do I efficiently check for uniqueness of the strings being added?
Currently, just before adding a string, I scan the entire list and compare each string to the to-be-added string. This starts showing scale problems above 10,000 items.
I will benchmark this, but I'm interested in insight.
If I replace the List<> with a Dictionary<>, will ContainsKey() be appreciably faster as the list grows to 10,000 items and beyond?
If I defer the uniqueness check until after all items have been added, will it be faster? At that point I would need to check every element against every other element, still an O(n^2) operation.
EDIT
Some basic benchmark results. I created an abstract class that exposes 2 methods: Fill and Scan. Fill just fills the collection with n items (I used 50,000). Scan scans the list m times (I used 5000) to see if a given value is present. Then I built an implementation of that class for List, and another for HashSet.
The strings used were uniformly 11 characters in length, and randomly generated via a method in the abstract class.
A very basic micro-benchmark.
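The harness looked roughly like this; the code below is a reconstruction from the description above, not the original (class names, the string alphabet, and the timing details are assumptions):

using System;
using System.Collections.Generic;
using System.Diagnostics;

abstract class Tester
{
    static readonly Random rng = new Random();

    // Generates a random string of the given length (11 in the benchmark).
    protected static string RandomString(int length)
    {
        const string alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
        var chars = new char[length];
        for (int i = 0; i < length; i++)
            chars[i] = alphabet[rng.Next(alphabet.Length)];
        return new string(chars);
    }

    public abstract void Fill(int n);         // add n random strings to the collection
    public abstract bool Scan(string value);  // membership test

    public void Run(int fillCount, int scanCount)
    {
        Console.WriteLine("Hello from " + GetType().Name);

        var sw = Stopwatch.StartNew();
        Fill(fillCount);
        Console.WriteLine("Time to fill: " + sw.Elapsed);

        sw.Restart();
        for (int i = 0; i < scanCount; i++)
            Scan(RandomString(11));
        Console.WriteLine("Time to scan: " + sw.Elapsed);
    }
}

class ListTester : Tester
{
    readonly List<string> items = new List<string>();
    public override void Fill(int n) { for (int i = 0; i < n; i++) items.Add(RandomString(11)); }
    public override bool Scan(string value) { return items.Contains(value); }  // O(n) per lookup
}

class HashSetTester : Tester
{
    readonly HashSet<string> items = new HashSet<string>();
    public override void Fill(int n) { for (int i = 0; i < n; i++) items.Add(RandomString(11)); }
    public override bool Scan(string value) { return items.Contains(value); }  // O(1) average per lookup
}

Running new ListTester().Run(50000, 5000) and new HashSetTester().Run(50000, 5000) produced the output below.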
Hello from Cheeso.Tests.ListTester
filling 50000 items...
scanning 5000 items...
Time to fill: 00:00:00.4428266
Time to scan: 00:00:13.0291180
Hello from Cheeso.Tests.HashSetTester
filling 50000 items...
scanning 5000 items...
Time to fill: 00:00:00.3797751
Time to scan: 00:00:00.4364431
So, for strings of that length, HashSet is roughly 25x faster than List when scanning for uniqueness. Also, for this size of collection, HashSet has zero penalty over List when adding items to the collection.
The results are interesting but not strictly valid. To get valid results, I'd need to add warmup intervals and multiple trials, with random selection of the implementation. But I feel confident that would move the bar only slightly.
Thanks everyone.
EDIT2
After adding randomization and multiple trials, HashSet consistently outperforms List in this case, by about 20x.
These results don't necessarily hold for strings of variable length, more complex objects, or different collection sizes.
You should use the HashSet<T> class, which is specifically designed for what you're doing.
Use HashSet<string> instead of List<string>, then it should scale very well.
From my tests, HashSet<string> takes no time compared to List<string> :)
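For example, a minimal sketch of the check (the TryAdd wrapper and field names are illustrative; HashSet<T>.Add itself returns false when the value is already present, so no separate Contains call is needed):

using System;
using System.Collections.Generic;

class UniqueStrings
{
    private readonly HashSet<string> seen = new HashSet<string>(StringComparer.Ordinal);
    private readonly List<string> items = new List<string>();

    // Returns true if the string was added; false if it was already present.
    public bool TryAdd(string value)
    {
        if (!seen.Add(value))      // O(1) average, versus O(n) for List.Contains
            return false;
        items.Add(value);          // keep the List only if callers depend on insertion order
        return true;
    }
}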
Possibly off-topic, but if you want to scale very large unique sets of strings (millions+) in a language-independent way, you might check out Bloom Filters.
Does the Contains(T) function not work for you?
I have read that Dictionary<> is implemented as an associative array. In some languages (not necessarily anything related to .NET), string indexes are stored as a tree structure that forks at each node based upon the character at that node. Please see http://en.wikipedia.org/wiki/Associative_arrays.
A similar data structure was devised by Aho and Corasick in 1973 (I think). If you store 50,000 strings in such a structure, then it matters not how many strings you are storing; what matters more is the length of the strings. If they are all about the same length, then you will likely never see a slow-down in lookups, because the search algorithm is linear in run-time with respect to the length of the string you are searching for. Even for a red-black tree or AVL tree, the search run-time depends more upon the length of the string you are searching for than on the number of elements in the index. However, if you choose to implement your index keys with a hash function, you now incur the cost of hashing the string (O(m), where m = string length) and also the lookup of the string in the index, which will likely be on the order of O(log(n)), where n = number of elements in the index.
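For illustration only, a bare-bones trie along those lines (a sketch; in .NET the built-in HashSet<string> is usually the simpler choice in practice):

using System.Collections.Generic;

// Minimal trie: lookup cost depends on the key length, not on how many keys are stored.
class Trie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsTerminal;
    }

    private readonly Node root = new Node();

    public void Add(string key)
    {
        var node = root;
        foreach (char c in key)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsTerminal = true;
    }

    public bool Contains(string key)
    {
        var node = root;
        foreach (char c in key)
        {
            if (!node.Children.TryGetValue(c, out node))
                return false;
        }
        return node.IsTerminal;
    }
}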
edit: I'm not a .NET guru. Other more experienced people suggest another structure. I would take their word over mine.
edit2: your analysis is a little off for comparing uniqueness. If you use a hashing structure or dictionary, then it will not be an O(n^2) operation because of the reasoning I posted above. If you continue to use a list, then you are correct that it is O(n^2) * (max length of a string in your set) because you must examine each element in the list each time.