Fast collection comparison - c#

I have the following data type:
ISet<IEnumerable<Foo>>
So, I need to be able to create sets of sequences. E.g. this is ok:
ABC,AC,A
but this is not (since "AB" is repeated here):
AB,A,ABC,BCA,AB
But, in order to do this - for the "set" to not contain duplicates - I need to wrap my IEnumerable in some other data type:
ISet<Seq>
//where
Seq : IEnumerable<Foo>, IEquatable<Seq>
Thus, I will be able to compare two sequences, and provide the Set data structure with a way of eliminating duplicates.
My question is: is there a fast data structure that allows for comparing sequences? I am thinking that somehow when Seq gets created, or added to, some kind of cumulative value is computed.
In other words, is it possible to implement Seq in such a way that I could do this:
var seq1 = new Seq( IList<Foo> );
var seq2 = new Seq( IList<Foo> )
seq1.equals(seq2) // O(1)
Thanks.

I have provided an implementation of your sequence below. There are several points to note:
This only works if the IEnumerable<T> returns the same items every time it is enumerated, and those items are not mutated during the lifetime of this object.
The hash code is cached. The first time it is requested it is calculated (feel free to improve the hash algorithm if you know a better one) based on a full iteration of the underlying sequence. Because it only needs to be calculated once, it can effectively be considered O(1) if you query it often. Adding to the set will likely be a bit slower (first-time computation of the hash value), but searching or removing will be very quick.
The Equals method first compares the hash codes. If the hash codes are different then the objects cannot possibly be equal (assuming the hash codes are properly implemented on all objects in the sequence, and nothing was mutated). As long as you have a low collision rate, and are usually comparing items that aren't actually equal, most Equals checks will not get past that hash code comparison. If they do, an iteration of the sequence is needed (there is no way around that). Because of that, Equals is likely to average O(1), even though its worst case is still O(n).
// requires: using System.Collections;
//           using System.Collections.Generic;
//           using System.Linq;
public class Foo<T> : IEnumerable<T>
{
    private readonly IEnumerable<T> sequence;
    private int? myHashCode = null;

    public Foo(IEnumerable<T> sequence)
    {
        this.sequence = sequence;
    }

    public IEnumerator<T> GetEnumerator()
    {
        return sequence.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return sequence.GetEnumerator();
    }

    public override bool Equals(object obj)
    {
        Foo<T> other = obj as Foo<T>;
        if (other == null)
            return false;

        //if the hash codes are different we don't need to bother doing a deep equals check
        //the hash code is cached, so it's fast.
        if (GetHashCode() != other.GetHashCode())
            return false;

        return Enumerable.SequenceEqual(sequence, other.sequence);
    }

    public override int GetHashCode()
    {
        //note that the hash code is cached, so the underlying sequence
        //needs to not change.
        return myHashCode ?? populateHashCode();
    }

    private int populateHashCode()
    {
        unchecked
        {
            const int somePrimeNumber = 37;
            int hash = 1;
            foreach (T item in sequence)
            {
                hash = (hash * somePrimeNumber) + item.GetHashCode();
            }
            myHashCode = hash;
            return hash;
        }
    }
}
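A minimal usage sketch (assuming the class above, with char standing in for the question's Foo element type) showing that a HashSet then eliminates duplicate sequences via the cached hash code and the Equals override:

// Illustrative only: two wrappers over equal contents count as one set member.
var set = new HashSet<Foo<char>>();
bool addedFirst  = set.Add(new Foo<char>(new List<char> { 'A', 'B' }));
bool addedSecond = set.Add(new Foo<char>(new[] { 'A', 'B' })); // same contents
Console.WriteLine($"{addedFirst} {addedSecond} {set.Count}");  // True False 1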

O(1) essentially means you are not allowed to compare the values of the elements. If you can represent a sequence as a list of immutable objects (with caching on add so there are no duplicates across all instances) you can achieve it, as you'd only need to compare the first element - similar to how string interning works.
Insert will have to search all existing instances of elements for "current" + "with this next" element. Some sort of dictionary may be a reasonable approach...
EDIT: I think I simply tried to come up with a suffix tree.

Looking for a data structure that is optimized for finding the next closest element

I have two classes, let's call them foo and bar, that both have a DateTime property called ReadingTime.
I then have long lists of these classes, let's say foos and bars, where foos is List<foo>, bars is List<bar>.
My goal is for every element in foos to find the events in bars that happened right before and right after foo.
Some code to clarify:
var foos = new List<foo>();
var bars = new List<bar>();
...
foreach (var foo in foos)
{
bar before = bars.Where(b => b.ReadingTime <= foo.ReadingTime).OrderByDescending(b => b.ReadingTime).FirstOrDefault();
bar after = bars.Where(b => b.ReadingTime > foo.ReadingTime).OrderBy(b => b.ReadingTime).FirstOrDefault();
...
}
My issue here is performance. Is it possible to use some other data structure than a list to speed up the comparisons? In particular, the OrderBy on every single iteration seems like a huge waste; having the data pre-ordered should also speed up the comparisons, right?
I just don't know which data structure is best: SortedList, SortedSet, SortedDictionary, etc., there seem to be so many. Also, all the information I find is about lookups, inserts, deletes, etc.; no one writes about finding the next closest element, so I'm not sure if anything is optimized for that.
I'm on .net core 3.1 if that matters.
Thanks in advance!
Edit: Okay so to wrap this up:
First I tried implementing @derloopkat's approach. For this I figured I needed a data type that could keep the data in sorted order, so I just left it as IOrderedEnumerable (which is what LINQ returns). Probably not very smart, as that actually brought things to a crawl. I then tried going with SortedList. Had to remove some duplicates first, which was no problem in my case. Thanks for the help @Olivier Rogier! This got me up to roughly 2x the original performance, though I suspect it's mostly the removed LINQ OrderBys. For now this is good enough; if/when I need more performance I'm going to go with what @CamiloTerevinto suggested.
Lastly @Aldert, thank you for your time, but I'm too noob and under too much time pressure to understand what you suggested. Still appreciate it and might revisit this later.
Edit2: Ended up going with @CamiloTerevinto's suggestion. Cut my runtime down from 10 hours to a couple of minutes.
You don't need to sort bars ascending and descending on each iteration. Order bars just once before the loop by calling .OrderBy(f => f.ReadingTime) and then use LastOrDefault() and FirstOrDefault().
bars = bars.OrderBy(b => b.ReadingTime).ToList(); // sort once, outside the loop

foreach (var foo in foos)
{
    bar before = bars.LastOrDefault(b => b.ReadingTime <= foo.ReadingTime);
    bar after = bars.FirstOrDefault(b => b.ReadingTime > foo.ReadingTime);
    //...
}
This produces the same output you get with your code and runs faster.
For memory performance and strong typing, you can use a SortedDictionary or a SortedList, but they manipulate objects. Because you compare DateTime values, you don't need to implement a comparer.
What's the difference between SortedList and SortedDictionary?
SortedList<>, SortedDictionary<> and Dictionary<>
Difference between SortedList and SortedDictionary in C#
For speed optimization you can use a doubly linked list where each item points to the next and the previous items:
Doubly Linked List in C#
Linked List Implementation in C#
Using a linked list or a doubly linked list requires more memory because you store the next and previous references in a cell that wraps each instance, but it can sometimes be the fastest way to parse and compare data, as well as to search, sort, reorder, add, remove and move items, because you don't manipulate an array, just linked references.
You can also build powerful trees and manage data in a better way than with arrays.
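As a rough sketch of that idea applied to the question (not part of the answer above; names are illustrative and it assumes both lists are processed in ReadingTime order), a LinkedList<T> lets a cursor step forward instead of re-scanning:

// Illustrative only: a forward-moving cursor over a sorted doubly linked list.
var sortedBars = new LinkedList<bar>(bars.OrderBy(b => b.ReadingTime));
var node = sortedBars.First;

foreach (var f in foos.OrderBy(x => x.ReadingTime))
{
    // advance until node is the first bar strictly after f (or null if none)
    while (node != null && node.Value.ReadingTime <= f.ReadingTime)
        node = node.Next;

    bar after = node?.Value;
    bar before = node == null ? sortedBars.Last?.Value   // every bar is <= f
                              : node.Previous?.Value;    // bar just before the cursor
    // ...
}

Because the cursor never moves backwards, the whole pass is roughly O(n + m) after the initial sorts.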
You can use binary search for quick lookup. Below is code where bars is sorted and each foo is looked up. You can do some reading on binary searches yourself and enhance the code by also sorting foos; in that case you can minimize the search range of bars...
The code generates two lists with 100 items, then sorts bars and does a binary search 100 times.
using System;
using System.Collections.Generic;

namespace ConsoleApp2
{
    class BaseReading
    {
        private DateTime readingTime;

        public BaseReading(DateTime dt)
        {
            readingTime = dt;
        }

        public DateTime ReadingTime
        {
            get { return readingTime; }
            set { readingTime = value; }
        }
    }

    class Foo : BaseReading
    {
        public Foo(DateTime dt) : base(dt)
        { }
    }

    class Bar : BaseReading
    {
        public Bar(DateTime dt) : base(dt)
        { }
    }

    class ReadingTimeComparer : IComparer<BaseReading>
    {
        public int Compare(BaseReading x, BaseReading y)
        {
            return x.ReadingTime.CompareTo(y.ReadingTime);
        }
    }

    class Program
    {
        static private List<BaseReading> foos = new List<BaseReading>();
        static private List<BaseReading> bars = new List<BaseReading>();
        static private Random ran = new Random();

        static void Main(string[] args)
        {
            for (int i = 0; i < 100; i++)
            {
                foos.Add(new Foo(GetRandomDate()));
                bars.Add(new Bar(GetRandomDate()));
            }

            var rtc = new ReadingTimeComparer();
            bars.Sort(rtc);

            foreach (BaseReading br in foos)
            {
                // index >= 0 is an exact match; a negative result is the bitwise
                // complement of the index of the next larger element.
                int index = bars.BinarySearch(br, rtc);
            }
        }

        static DateTime GetRandomDate()
        {
            long randomTicks = ran.Next((int)(DateTime.MaxValue.Ticks >> 32));
            randomTicks = (randomTicks << 32) + ran.Next();
            return new DateTime(randomTicks);
        }
    }
}
The only APIs available in the .NET platform for finding the next closest element, with a computational complexity better than O(N), are the List.BinarySearch and Array.BinarySearch methods:
// Returns the zero-based index of item in the sorted List<T>, if item is found;
// otherwise, a negative number that is the bitwise complement of the index of
// the next element that is larger than item or, if there is no larger element,
// the bitwise complement of Count.
public int BinarySearch (T item, IComparer<T> comparer);
These APIs are not 100% robust, because the correctness of the results depends on whether the underlying data structure is already sorted, and the platform does not check or enforce this condition. It's up to you to ensure that the list or array is sorted with the correct comparer, before attempting to BinarySearch on it.
These APIs are also cumbersome to use, because in case a direct match is not found you'll get the next largest element as a bitwise complement, which is a negative number, and you'll have to use the ~ operator to get the actual index. And then subtract one to get the closest item from the other direction.
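For illustration, a small sketch of that index arithmetic (the names and the Bar type are placeholders; it assumes the list is already sorted by ReadingTime with the same comparer, and it ignores duplicate timestamps):

// Illustrative helper: interpret List<T>.BinarySearch results to get the
// items just before (<=) and just after (>) a probe value.
static (Bar Before, Bar After) FindNeighbours(List<Bar> sortedBars, Bar probe, IComparer<Bar> comparer)
{
    int index = sortedBars.BinarySearch(probe, comparer);

    // A negative result is the bitwise complement of the index of the next larger element.
    int nextLarger = index >= 0 ? index + 1 : ~index;

    Bar after = nextLarger < sortedBars.Count ? sortedBars[nextLarger] : null;
    Bar before = nextLarger - 1 >= 0 ? sortedBars[nextLarger - 1] : null;
    return (before, after);
}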
If you don't mind adding a third-party dependency to your app, you could consider the C5 library, which contains the TreeDictionary collection, with the interesting methods below:
// Find the entry in the dictionary whose key is the predecessor of the specified key.
public bool TryPredecessor(K key, out SCG.KeyValuePair<K, V> res);

// Find the entry in the dictionary whose key is the successor of the specified key.
public bool TrySuccessor(K key, out SCG.KeyValuePair<K, V> res);
There are also the TryWeakPredecessor and TryWeakSuccessor methods available, that consider an exact match as a predecessor or successor respectively. In other words they are analogous to the <= and >= operators.
C5 is a powerful and feature-rich library that offers lots of specialized collections, its main con being its somewhat idiosyncratic API.
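A hedged usage sketch, based only on the method signatures quoted above (the key and value types are illustrative, and it assumes ReadingTime values are unique since they become dictionary keys):

// Illustrative only: index bars by ReadingTime and query the neighbours of each foo.
var index = new C5.TreeDictionary<DateTime, Bar>();
foreach (var b in bars)
    index.Add(b.ReadingTime, b);

foreach (var f in foos)
{
    if (index.TryWeakPredecessor(f.ReadingTime, out var before))
    {
        // before.Value is the bar at or just before f.ReadingTime (<=)
    }
    if (index.TrySuccessor(f.ReadingTime, out var after))
    {
        // after.Value is the bar strictly after f.ReadingTime (>)
    }
}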
You should get excellent performance by any of these options.

Null vs Empty Collections in GetHashCode

In the implementation of GetHashCode below, a null Collection and an empty Collection both result in a hash code of 0.
A colleague suggested returning an arbitrary hard-coded number like 19 to differentiate the null collection from the empty one. Why would I want to do this? Why would I care that a null or empty collection produces a different hash code?
public class Foo
{
public List<int> Collection { get; set; }
// Other properties omitted.
public override int GetHashCode()
{
var hashCode = 0;
if (this.Collection != null)
{
foreach (var item in this.Collection)
{
var itemHashCode = item == null ? 0 : item.GetHashCode();
hashCode = ((hashCode << 5) + hashCode) ^ itemHashCode;
}
}
return hashCode;
}
}
The design goal of GetHashCode is to minimize the number of collisions that take place, as best it can. While having some hash collisions is inevitable, you'll want to be mindful of which types of objects are colliding, what kind of data is going to be stored in your hash-based collections, and work to ensure that objects stored together in the same collection are less likely to collide.
So if you happen to know something about how hash-based collections of this type are going to be used, and that there are likely to be both null and empty objects in them, then it would improve the performance to have them not collide. If you suspect that having both a null and empty value in the same collection is not particularly likely, then having them collide isn't actually a concern.
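If you did decide the two cases should differ, a minimal sketch of the colleague's suggestion applied to the question's method (19 is just the arbitrary constant from the question):

public override int GetHashCode()
{
    // Arbitrary non-zero constant so a null collection no longer hashes
    // to the same value (0) as an empty one.
    if (this.Collection == null)
        return 19;

    var hashCode = 0;
    foreach (var item in this.Collection)
    {
        var itemHashCode = item == null ? 0 : item.GetHashCode();
        hashCode = ((hashCode << 5) + hashCode) ^ itemHashCode;
    }
    return hashCode;
}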

Using IEqualityComparer GetHashCode with a tolerance

I am trying to implement an IEqualityComparer that has a tolerance on a date comparison. I have also looked into this question. The problem is that I can't use a workaround because I am using the IEqualityComparer in a LINQ .GroupJoin(). I have tried a few implementations that allow for tolerance. I can get the Equals() to work because I have both objects but I can't figure out how to implement GetHashCode().
My best attempt looks something like this:
public class ThingWithDateComparer : IEqualityComparer<IThingWithDate>
{
private readonly int _daysToAdd;
public ThingWithDateComparer(int daysToAdd)
{
_daysToAdd = daysToAdd;
}
public int GetHashCode(IThingWithDate obj)
{
unchecked
{
var hash = 17;
hash = hash * 23 + obj.BirthDate.AddDays(_daysToAdd).GetHashCode();
return hash;
}
}
public bool Equals(IThingWithDate x, IThingWithDate y)
{
throw new NotImplementedException();
}
}
public interface IThingWithDate
{
DateTime BirthDate { get; set; }
}
With .GroupJoin() building a hash table out of the GetHashCode() values, the days get added to both/all objects. This doesn't work.
The problem is impossible, conceptually. You're trying to compare objects in a way that doesn't have the form of equality that is necessary for the operations you're trying to perform. For example, GroupJoin is dependent on the assumption that if A is equal to B, and B is equal to C, then A is equal to C, but in your situation that's not true. A and B may be "close enough" together for you to want to group them, but A and C may not be.
You're going to need to not implement IEqualityComparer at all, because you cannot fulfill the contract that it requires. If you want to create a mapping of items in one collection to all of the items in another collection that are "close enough" to it, then you're going to need to write that algorithm yourself (doing so efficiently is likely to be hard, but doing so inefficiently shouldn't be that difficult), rather than using GroupJoin, because it's not capable of performing that operation; see the sketch below.
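For illustration only (not the answer's implementation), a naive version of that hand-written matching, assuming the IThingWithDate interface from the question and a symmetric tolerance in days:

// Illustrative only: brute-force pairing within +/- toleranceDays.
// O(n*m), but it does not rely on the (impossible) transitive equality.
static IEnumerable<(IThingWithDate Item, List<IThingWithDate> Matches)> GroupByTolerance(
    IEnumerable<IThingWithDate> left,
    IEnumerable<IThingWithDate> right,
    int toleranceDays)
{
    var rightList = right.ToList();
    foreach (var l in left)
    {
        var matches = rightList
            .Where(r => Math.Abs((r.BirthDate - l.BirthDate).TotalDays) <= toleranceDays)
            .ToList();
        yield return (l, matches);
    }
}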
I can't see any way to generate a logical hash code for your given criteria.
The hash code is used to determine if 2 dates should stick together. If they should group together, then they must return the same hash code.
If your "float" is 5 days, that means that 1/1/2000 must generate the same hash code as 1/4/2000, and 1/4/2000 must generate the same hash code as 1/8/2000 (since they are both within 5 days of each other). That implies that 1/1/2000 has the same code as 1/8/2000 (since if a=b and b=c, then a=c).
But 1/1/2000 and 1/8/2000 are outside the 5 day "float", so they should not share a hash code - a contradiction.

Implementation of Dictionary where equivalent contents are equal and return the same hash code regardless of order of insertion

I need Dictionary<long, string> collections such that, given two instances d1 and d2 that have the same KeyValuePair<long, string> contents, which could have been inserted in any order:
(d1 == d2) evaluates to true
d1.GetHashCode() == d2.GetHashCode()
The first requirement was achieved most easily by using a SortedDictionary instead of a regular Dictionary.
The second requirement is necessary because I have one point where I need to store Dictionary<Dictionary<long, string>, List<string>> - the main Dictionary type is used as the key for another Dictionary, and if the hash codes don't evaluate based on identical contents, then using ContainsKey() will not work the way that I want (i.e. if there is already an item inserted into the dictionary with d1 as its key, then dictionary.ContainsKey(d2) should evaluate to true).
To achieve this, I have created a new object class ComparableDictionary : SortedDictionary<long, string>, and have included the following:
public override int GetHashCode() {
StringBuilder str = new StringBuilder();
foreach (var item in this) {
str.Append(item.Key);
str.Append("_");
str.Append(item.Value);
str.Append("%%");
}
return str.ToString().GetHashCode();
}
In my unit testing, this meets the criteria for both equality and hashcodes. However, in reading Guidelines and Rules for GetHashCode, I came across the following:
Rule: the integer returned by GetHashCode must never change while the object is contained in a data structure that depends on the hash code remaining stable
It is permissible, though dangerous, to make an object whose hash code value can mutate as the fields of the object mutate. If you have such an object and you put it in a hash table then the code which mutates the object and the code which maintains the hash table are required to have some agreed-upon protocol that ensures that the object is not mutated while it is in the hash table. What that protocol looks like is up to you.
If an object's hash code can mutate while it is in the hash table then clearly the Contains method stops working. You put the object in bucket #5, you mutate it, and when you ask the set whether it contains the mutated object, it looks in bucket #74 and doesn't find it.
Remember, objects can be put into hash tables in ways that you didn't expect. A lot of the LINQ sequence operators use hash tables internally. Don't go dangerously mutating objects while enumerating a LINQ query that returns them!
Now, the Dictionary<ComparableDictionary, List<String>> is used only once in code, in a place where the contents of all ComparableDictionary collections should be set. Thus, according to these guidelines, I think that it would be acceptable to override GetHashCode as I have done (basing it completely on the contents of the dictionary).
After that introduction my questions are:
I know that the performance of SortedDictionary is very poor compared to Dictionary (and I can have hundreds of object instantiations). The only reason for using SortedDictionary is so that I can have the equality comparison work based on the contents of the dictionary, regardless of order of insertion. Is there a better way to achieve this equality requirement without having to use a SortedDictionary?
Is my implementation of GetHashCode acceptable based on the requirements? Even though it is based on mutable contents, I don't think that should pose any risk, since the only place where it is used (I think) is after the contents have been set.
Note: while I have been setting these up using Dictionary or SortedDictionary, I am not wedded to these collection types. The main need is a collection that can store pairs of values, and meet the equality and hashing requirements defined out above.
Your GetHashCode implementation looks acceptable to me, but it's not how I'd do it.
This is what I'd do:
Use composition rather than inheritance. Aside from anything else, inheritance gets odd in terms of equality
Use a Dictionary<TKey, TValue> variable inside the class
Implement GetHashCode by taking an XOR of the individual key/value pair hash codes
Implement equality by checking whether the sizes are the same, then checking every key in "this" to see if its value is the same in the other dictionary.
So something like this:
public sealed class EquatableDictionary<TKey, TValue>
    : IDictionary<TKey, TValue>, IEquatable<EquatableDictionary<TKey, TValue>>
{
    private readonly Dictionary<TKey, TValue> dictionary;

    public override bool Equals(object other)
    {
        return Equals(other as EquatableDictionary<TKey, TValue>);
    }

    public bool Equals(EquatableDictionary<TKey, TValue> other)
    {
        if (ReferenceEquals(other, null))
        {
            return false;
        }
        if (Count != other.Count)
        {
            return false;
        }
        foreach (var pair in this)
        {
            TValue otherValue;
            if (!other.TryGetValue(pair.Key, out otherValue))
            {
                return false;
            }
            if (!EqualityComparer<TValue>.Default.Equals(pair.Value,
                                                         otherValue))
            {
                return false;
            }
        }
        return true;
    }

    public override int GetHashCode()
    {
        int hash = 0;
        foreach (var pair in this)
        {
            int miniHash = 17;
            miniHash = miniHash * 31 +
                EqualityComparer<TKey>.Default.GetHashCode(pair.Key);
            miniHash = miniHash * 31 +
                EqualityComparer<TValue>.Default.GetHashCode(pair.Value);
            hash ^= miniHash;
        }
        return hash;
    }

    // Implementation of IDictionary<,> which just delegates to the dictionary
}
Also note that I can't remember whether EqualityComparer<T>.Default.GetHashCode copes with null values - I have a suspicion that it does, returning 0 for null. Worth checking though :)
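A quick usage sketch, assuming the delegating IDictionary<,> members (Add, Count, TryGetValue, GetEnumerator, ...) have been filled in:

// Same contents, different insertion order: equal, same hash code,
// and usable as a content-based key for the outer dictionary from the question.
var d1 = new EquatableDictionary<long, string> { { 1, "a" }, { 2, "b" } };
var d2 = new EquatableDictionary<long, string> { { 2, "b" }, { 1, "a" } };

Console.WriteLine(d1.Equals(d2));                        // True
Console.WriteLine(d1.GetHashCode() == d2.GetHashCode()); // True

var outer = new Dictionary<EquatableDictionary<long, string>, List<string>>();
outer[d1] = new List<string> { "x" };
Console.WriteLine(outer.ContainsKey(d2));                // True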

Why do we need iterators in c#?

Can somebody provide a real-life example regarding the use of iterators? I tried searching Google but was not satisfied with the answers.
You've probably heard of arrays and containers - objects that store a list of other objects.
But in order for an object to represent a list, it doesn't actually have to "store" the list. All it has to do is provide you with methods or properties that allow you to obtain the items of the list.
In the .NET framework, the interface IEnumerable is all an object has to support to be considered a "list" in that sense.
To simplify it a little (leaving out some historical baggage):
public interface IEnumerable<T>
{
IEnumerator<T> GetEnumerator();
}
So you can get an enumerator from it. That interface (again, simplifying slightly to remove distracting noise):
public interface IEnumerator<T>
{
bool MoveNext();
T Current { get; }
}
So to loop through a list, you'd do this:
var e = list.GetEnumerator();
while (e.MoveNext())
{
var item = e.Current;
// blah
}
This pattern is captured neatly by the foreach keyword:
foreach (var item in list)
// blah
But what about creating a new kind of list? Yes, we can just use List<T> and fill it up with items. But what if we want to discover the items "on the fly" as they are requested? There is an advantage to this, which is that the client can abandon the iteration after the first three items, and they don't have to "pay the cost" of generating the whole list.
To implement this kind of lazy list by hand would be troublesome. We would have to write two classes, one to represent the list by implementing IEnumerable<T>, and the other to represent an active enumeration operation by implementing IEnumerator<T>.
Iterator methods do all the hard work for us. We just write:
IEnumerable<int> GetNumbers(int stop)
{
for (int n = 0; n < stop; n++)
yield return n;
}
And the compiler converts this into two classes for us. Calling the method is equivalent to constructing an object of the class that represents the list.
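For example, a trivial usage of the method above:

// Numbers are produced lazily, one per MoveNext, as foreach asks for them.
foreach (int n in GetNumbers(3))
{
    Console.WriteLine(n); // prints 0, 1, 2
}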
Iterators are an abstraction that decouples the concept of position in a collection from the collection itself. The iterator is a separate object storing the necessary state to locate an item in the collection and move to the next item in the collection. I have seen collections that kept that state inside the collection (i.e. a current position), but it is often better to move that state to an external object. Among other things it enables you to have multiple iterators iterating the same collection.
Simple example : a function that generates a sequence of integers :
static IEnumerable<int> GetSequence(int fromValue, int toValue)
{
if (toValue >= fromValue)
{
for (int i = fromValue; i <= toValue; i++)
{
yield return i;
}
}
else
{
for (int i = fromValue; i >= toValue; i--)
{
yield return i;
}
}
}
To do it without an iterator, you would need to create an array then enumerate it...
Iterate through the students in a class
The Iterator design pattern provides us with a common method of enumerating a list of items or an array, while hiding the details of the list's implementation. This provides a cleaner use of the array object and hides unnecessary information from the client, ultimately leading to better code reuse, enhanced maintainability, and fewer bugs. The iterator pattern can enumerate the list of items regardless of their actual storage type.
Iterate through a set of homework questions.
But seriously, Iterators can provide a unified way to traverse the items in a collection regardless of the underlying data structure.
Read the first two paragraphs here for a little more info.
A couple of things they're great for:
a) For 'perceived performance' while maintaining code tidiness - the iteration of something separated from other processing logic.
b) When the number of items you're going to iterate through is not known.
Although both can be done through other means, with iterators the code can be made nicer and tidier, as whoever calls the iterator doesn't need to worry about how it finds the stuff to iterate through...
Real life example: enumerating directories and files, and finding the first [n] that fulfill some criteria, e.g. a file containing a certain string or sequence etc...
Besides everything else, iterators let you work with lazy sequences (IEnumerables): each next element of such a sequence may be evaluated/initialized only upon the iteration step, which makes it possible to iterate through infinite sequences using a finite amount of resources...
The canonical and simplest example is that it makes infinite sequences possible without the complexity of having to write the class to do that yourself:
// generate every prime number
public IEnumerator<int> GetPrimeEnumerator()
{
    yield return 2;
    var primes = new List<int>();
    primes.Add(2);
    Func<int, bool> IsPrime = n => primes.TakeWhile(
        p => p <= (int)Math.Sqrt(n)).FirstOrDefault(p => n % p == 0) == 0;
    for (int i = 3; true; i += 2)
    {
        if (IsPrime(i))
        {
            yield return i;
            primes.Add(i);
        }
    }
}
Obviously this would not be truly infinite unless you used a BigInteger instead of int, but it gives you the idea.
Writing this code (or similar) for each generated sequence would be tedious and error prone. The iterators do that for you. If the above example seems too complex for you, consider:
// generate every power of a number from start^0 to start^n
public IEnumerator<int> GetPowersEnumerator(int start)
{
yield return 1; // anything ^0 is 1
var x = start;
while(true)
{
yield return x;
x *= start;
}
}
They come at a cost though. Their lazy behaviour means you cannot spot common errors (null parameters and the like) until the generator is first consumed rather than when it is created, unless you write wrapping functions to check first. The current implementation is also incredibly bad(1) if used recursively.
Writing enumerations over complex structures like trees and object graphs is much easier, as the state maintenance is largely done for you; you simply write code to visit each item and don't need to worry about getting back to it.
(1) I don't use this word lightly - an O(n) iteration can become O(n^2).
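As a sketch of that point about trees (a hypothetical minimal binary tree, not from the answer), an in-order traversal becomes a few lines because yield return handles "getting back" to each node; note that the nested foreach is exactly the recursive usage the footnote warns about:

// Hypothetical minimal binary tree, shown only to illustrate the point above.
public class Node<T>
{
    public T Value;
    public Node<T> Left, Right;

    // In-order traversal: left subtree, this node, right subtree.
    public IEnumerable<T> InOrder()
    {
        if (Left != null)
            foreach (var v in Left.InOrder())
                yield return v;

        yield return Value;

        if (Right != null)
            foreach (var v in Right.InOrder())
                yield return v;
    }
}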
An iterator is an easy way of implementing the IEnumerator interface. Instead of making a class that has the methods and properties required for the interface, you just make a method that returns the values one by one and the compiler creates a class with the methods and properties needed to implement the interface.
If you for example have a large list of numbers, and you want to return a collection where each number is multiplied by two, you can make an iterator that returns the numbers instead of creating a copy of the list in memory:
public IEnumerable<int> GetDouble() {
foreach (int n in originalList) yield return n * 2;
}
In C# 3 you can do something quite similar using extension methods and lambda expressions:
originalList.Select(n => n * 2)
Or using LINQ:
from n in originalList select n * 2
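// Iterating explicitly with the enumerator: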
IEnumerator<Question> myIterator = listOfStackOverFlowQuestions.GetEnumerator();
while (myIterator.MoveNext())
{
Question q;
q = myIterator.Current;
if (q.Pertinent == true)
PublishQuestion(q);
else
SendMessage(q.Author.EmailAddress, "Your question has been rejected");
}
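// The equivalent written with foreach: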
foreach (Question q in listOfStackOverFlowQuestions)
{
if (q.Pertinent == true)
PublishQuestion(q);
else
SendMessage(q.Author.EmailAddress, "Your question has been rejected");
}
