How to create a HashSet<List<Int>> with distinct elements? - c#

I have a HashSet that contains multiple lists of integers - i.e. HashSet<List<int>>
In order to maintain uniqueness I am currently having to do two things:
1. Manually loop through the existing lists, looking for duplicates using SequenceEqual.
2. Sort the individual lists so that SequenceEqual works correctly.
Is there a better way to do this? Is there an existing IEqualityComparer that I can provide to the HashSet so that HashSet.Add() can automatically handle uniqueness?
var hashSet = new HashSet<List<int>>();
for(/* some condition */)
{
List<int> list = new List<int>();
...
/* for eliminating duplicate lists */
list.Sort();
foreach(var set in hashSet)
{
if (list.SequenceEqual(set))
{
validPartition = false;
break;
}
}
if (validPartition)
hashSet.Add(list);
}

Here is a possible comparer that compares an IEnumerable<T> by its elements. You still need to sort manually before adding.
One could build the sorting into the comparer, but I don't think that's a wise choice. Adding a canonical form of the list seems wiser.
This code only works in .NET 4 or later, since it relies on generic variance (IEqualityComparer<in T> is contravariant). For earlier versions you need to either replace IEnumerable with List, or add a second generic parameter for the collection type.
class SequenceComparer<T>:IEqualityComparer<IEnumerable<T>>
{
public bool Equals(IEnumerable<T> seq1,IEnumerable<T> seq2)
{
return seq1.SequenceEqual(seq2);
}
public int GetHashCode(IEnumerable<T> seq)
{
int hash = 1234567;
foreach(T elem in seq)
hash = unchecked(hash * 37 + elem.GetHashCode());
return hash;
}
}
void Main()
{
var hashSet = new HashSet<List<int>>(new SequenceComparer<int>());
List<int> test=new int[]{1,3,2}.ToList();
test.Sort();
hashSet.Add(test);
List<int> test2=new int[]{3,2,1}.ToList();
test2.Sort();
hashSet.Contains(test2).Dump();
}

This starts off wrong: it has to be a HashSet<ReadOnlyCollection<>>, because you cannot allow the lists to change after they are added and invalidate the set predicate. Immutability lets you calculate the hash code once, in O(n), when you add the collection to the set. The test for whether a collection is already in the set is then O(n) as well, with a very uncommon O(n^2) worst case if all the hashes turn out to be equal. Store the computed hash with the collection.
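A minimal sketch of that idea, assuming lists of int. The wrapper name FrozenIntList and its members are hypothetical, not framework types; it copies and sorts the input, exposes it as a ReadOnlyCollection<int>, and stores the hash computed at construction time:

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Linq;

// Hypothetical wrapper: an immutable, sorted snapshot of a list with its hash computed once.
public sealed class FrozenIntList : IEquatable<FrozenIntList>
{
    private readonly int _hash;

    public ReadOnlyCollection<int> Items { get; }

    public FrozenIntList(IEnumerable<int> source)
    {
        // Copy and sort so that {1,3,2} and {3,2,1} share one canonical form.
        List<int> copy = source.OrderBy(x => x).ToList();
        Items = copy.AsReadOnly();

        // O(n) hash, computed once because the contents can no longer change.
        int hash = 17;
        foreach (int item in copy)
            hash = unchecked(hash * 31 + item);
        _hash = hash;
    }

    public bool Equals(FrozenIntList other)
    {
        // Cheap rejections first; SequenceEqual only runs for likely matches.
        return other != null
            && _hash == other._hash
            && Items.Count == other.Items.Count
            && Items.SequenceEqual(other.Items);
    }

    public override bool Equals(object obj) => Equals(obj as FrozenIntList);

    public override int GetHashCode() => _hash;
}
```

Because the type implements IEquatable<FrozenIntList>, a plain new HashSet<FrozenIntList>() should treat two wrappers with the same elements as duplicates without a separate comparer.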

Is there a reason you aren't just using an array? int[] will perform better. Also I assume the lists contain duplicates, otherwise you'd just be using sets and not have a problem.
It appears that their contents won't change (much) once they've been added to the HashSet. At the end of the day, you are going to have to use a comparer that falls back on SequenceEqual, but you don't have to do it every single time. Instead of doing a SequenceEqual against each existing member on every insert (so the total number of compares grows quadratically as the hashset grows), create a good hash code up front and you may have to do very few such compares. While the overhead of generating a good hash code is probably about the same as doing one SequenceEqual, you're only doing it a single time for each list.
So, the first time you operate on a particular List<int>, you should generate a hash based on the ordered sequence of numbers and cache it. Then the next time the list is compared, the cached value can be used. I'm not sure how you might do this with a comparer off the top of my head (maybe a static dictionary?) -- but you could implement a List wrapper that does this easily.
Here's a basic idea. You'd need to be careful to ensure that it isn't brittle (e.g. make sure you void any cached hash code when members change) but it doesn't look like that's going to be a typical situation for the way you're using this.
public class FasterComparingList<T>: IList<T>, IList, ...
/// whatever you need to implement
{
// Implement your interfaces against InnerList
// Any methods that change members of the list need to
// set _LongHash=null to force it to be regenerated
public List<T> InnerList { ... lazy load a List }
public override int GetHashCode()
{
if (_LongHash==null) {
_LongHash=GetLongHash();
}
return (int)_LongHash;
}
private int? _LongHash=null;
public bool Equals(FasterComparingList<T> list)
{
// lists of different lengths can never be equal
if (InnerList.Count != list.Count) {
return false;
}
// you could also cache the sorted state and skip this if a list hasn't
// changed since the last sort
// not sure if native `List` does
list.Sort();
InnerList.Sort();
return InnerList.SequenceEqual(list);
}
protected int GetLongHash()
{
return .....
// something to create a reasonably good hash code -- which depends on the
// data. Adding all the numbers is probably fine, even if it fails a couple
// percent of the time you're still orders of magnitude ahead of sequence
// compare each time
}
}
If the lists won't change once added, this should be very fast. Even in situations where the lists could change frequently, the time to create a new hash code is not likely very different (if even greater at all) than doing a sequence compare.

If you don't specify an IEqualityComparer, the type's default will be used, so I think you'll need to create your own implementation of IEqualityComparer and pass it to the constructor of your HashSet. Here is a good example.

When comparing hashsets of lists one option you always have is that instead of comparing each element, you sort lists and join them using a comma and compare generated strings.
So, in this case, when you create custom comparer instead of iterating over elements and calculating custom hash function, you can apply this logic.
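A rough sketch of what such a comparer could look like, assuming lists of int. The class name JoinedStringComparer is made up for illustration; note that it builds a new string on every comparison and hash, so it trades speed for simplicity:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: two lists are equal when their sorted, comma-joined forms match.
public sealed class JoinedStringComparer : IEqualityComparer<List<int>>
{
    private static string Key(List<int> list)
    {
        // OrderBy returns a sorted copy, so the caller's list is not mutated.
        return string.Join(",", list.OrderBy(x => x));
    }

    public bool Equals(List<int> a, List<int> b)
    {
        if (ReferenceEquals(a, b)) return true;
        if (a == null || b == null) return false;
        return Key(a) == Key(b);
    }

    public int GetHashCode(List<int> list)
    {
        return list == null ? 0 : Key(list).GetHashCode();
    }
}
```

It would be passed to the set as new HashSet<List<int>>(new JoinedStringComparer()). The comma separator matters: without it, {1, 2} and {12} would produce the same string.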

Related

How to efficiently filter objects out of an (initially) large list of objects

I need to filter a large list of complex (20+ properties) objects into multiple sub lists. To create the sub-lists, I have a list of filter specifications. Requirements are: a) An item is not allowed to be part of two sub lists and b) it must be possible to get hold of all undivided items after processing has finished.
Currently I use the following algorithm:
1. Put the objects to be filtered in a generic List.
2. For each filter specification:
   - Create a Where expression (Expression<Func<T, bool>>).
   - Apply the expression to the list of objects using Linq > Where.
   - Get the resulting IEnumerable of selected objects and store them in a list, together with the description of the filter.
   - Remove the items found from the source list using Linq > Except, creating a new list to continue working with and preventing an object from being put in more than one sub list.
3. Check whether there are still (undivided) objects in the working list.
My initial list can contain over 400,000 objects, and I've noticed that both the filtering and the reduction of the working list take some time. So I would like to know:
Filtering to create the sub-lists takes place on a maximum of 7 properties of my object. Is there a way to improve performance of a Linq > Where selection?
Is there a way to prevent items from being selected into multiple sub-lists, without reducing the working collection by using Except or RemoveAll (possible improvement)?
Thanks in advance!
If you can not leverage any indexes in the incoming list you are trying to classify then you are better off just iterating through the whole list only once and classifying the items as you go. This way you avoid unnecessary remove and except operations that are seriously hurting the performance with pointless iterations and equality comparisons.
I was thinking about something along the lines of:
public static IDictionary<string, List<T>> Classify<T>(this IEnumerable<T> items, IDictionary<string, Predicate<T>> predicates, out List<T> defaultBucket)
{
var classifiedItems = new Dictionary<string, List<T>>(predicates.Count);
defaultBucket = new List<T>();
foreach (var predicate in predicates)
{
classifiedItems.Add(predicate.Key, new List<T>());
}
foreach (var item in items)
{
var matched = false;
foreach (var predicate in predicates)
{
if (predicate.Value(item))
{
matched = true;
classifiedItems[predicate.Key].Add(item);
break;
}
}
if (!matched)
{
defaultBucket.Add(item);
}
}
return classifiedItems;
}
Any given predicate can be as complex as you need it to be. The only condition is that it takes in a T and returns a bool. If that is not enough, nothing prevents you from implementing your own MyPredicate<???> with whatever signature you need.
EDIT: Edited the code to handle a "default bucket" where items that don't comply with any of the specified predicates go.
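One thing to watch with the approach above: if predicates is a plain Dictionary, its enumeration order is not guaranteed, so for an item that matches several predicates it is unspecified which bucket wins. A sketch of a variant that makes the priority explicit by taking an ordered list of rules; OrderedClassifier, Demo and the sample rules are hypothetical names for illustration only:

```csharp
using System;
using System.Collections.Generic;

public static class OrderedClassifier
{
    // Hypothetical variant: rules are tried in list order, first match wins.
    public static Dictionary<string, List<T>> Classify<T>(
        IEnumerable<T> items,
        IList<KeyValuePair<string, Predicate<T>>> rules,
        out List<T> unmatched)
    {
        var buckets = new Dictionary<string, List<T>>(rules.Count);
        foreach (var rule in rules)
            buckets[rule.Key] = new List<T>();

        unmatched = new List<T>();
        foreach (T item in items)
        {
            bool matched = false;
            foreach (var rule in rules)
            {
                if (rule.Value(item))
                {
                    buckets[rule.Key].Add(item);
                    matched = true;
                    break; // an item lands in at most one bucket
                }
            }
            if (!matched)
                unmatched.Add(item);
        }
        return buckets;
    }

    // Sample call: "even" is checked before "big", so 8 goes to "even".
    public static Dictionary<string, List<int>> Demo(out List<int> rest)
    {
        var rules = new List<KeyValuePair<string, Predicate<int>>>
        {
            new KeyValuePair<string, Predicate<int>>("even", n => n % 2 == 0),
            new KeyValuePair<string, Predicate<int>>("big", n => n > 5),
        };
        return Classify(new[] { 1, 2, 7, 8, 3 }, rules, out rest);
    }
}
```

The source list is still walked only once, and the unmatched list plays the role of the default bucket.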

How to reverse a generic list without changing the same list?

I have a generic list that is being used inside a method that's being called 4 times. This method writes a table in a PDF with the values of this generic list.
My problem is that I need to reverse this generic list inside the method, but I'm calling the method 4 times so the list is being reversed every time I call the method and I don't want that... what can I do? Is there a way to reverse the list without mutating the original?
This is inside the method:
t.List.Reverse();
foreach (string t1 in t.List)
{
//Some code
}
The "easy" option would be to just iterate the list in reverse order without actually changing the list itself instead of trying to reverse it the first time and know to do nothing the other times:
foreach (string t1 in t.List.AsEnumerable().Reverse())
{
//Some code
}
By using the LINQ Reverse method instead of the List Reverse, we can iterate it backwards without mutating the list. The AsEnumerable needs to be there to prevent the List Reverse method from being used.
You're asking how to create a reversed copy of the list, without mutating the original.
LINQ can do that:
foreach (var t1 in Enumerable.Reverse(t.List))
You can do one of two things. Either take out reversing the list from the method and reverse the list once before calling the method four times, or do:
List<string> new_list = new List<string>(t.List);
new_list.Reverse();
That takes a copy of the list before reversing it so you don't touch the original list.
I would recommend the first approach, because right now you are calling Reverse four times instead of just once.
All the options given so far internally copy all elements to another list in reverse order, either explicitly (new_list.Reverse()) or implicitly (t.List.AsEnumerable().Reverse()). I prefer something that does not have this cost, such as the extension method below.
public static class ListExtensions {
// Walks the list from the last index to the first without copying it
public static IEnumerable<T> GetReverseEnumerator<T>(this List<T> list) {
for (int i = list.Count - 1; i >= 0; i--)
yield return list[i];
}
}
And could be used as
foreach (var a in list.GetReverseEnumerator()) {
}

c# Find an item in 2 / multiple lists

I have the presumably common problem of having elements that I wish to place in 2 (or more) lists. However sometimes I want to find an element that could be in one of the lists. Now there is more than one way of doing this eg using linq or appending, but all seem to involve the unnecessary creation of an extra list containing all the elements of the separate lists and hence waste processing time.
So I was considering creating my own generic FindInLists class, which would take 2 lists as its constructor parameters and provide Find() and Exists() methods. The Find and Exists methods would only need to search the second or subsequent lists if the item was not found in the first list. The FindInLists class could be instantiated in the getter of a property (with no setter). A second constructor for the FindInLists class could take an array of lists as its parameter.
Is this useful or is there already a way to search multiple lists without incurring the wasteful overhead of the creation of a super list?
You could use the LINQ Concat function.
var query = list1.Concat(list2).Where(x => x.Category=="my category");
Linq already has this functionality by virtue of the FirstOrDefault method. It uses deferred execution so will stream from any input and will short circuit the return when a matching element is found.
var matched = list1.Concat(list2).FirstOrDefault(e => element.Equals(e));
Update
BaseType matched = list1.Concat(list2).Concat(list3).FirstOrDefault(e => element.Equals(e));
I believe IEnumerable<T>.Concat() is what you need. It doesn't create an extra list; it only iterates through the given pair of collections when queried.
Concat() uses deferred execution, so at the time it's called it only creates an iterator which stores the reference to both concatenated IEnumerables. At the time the resulting collection is enumerated, it iterates through first and then through the second.
Here's the decompiled code for the iterator - no rocket science going on there:
private static IEnumerable<TSource> ConcatIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second)
{
foreach (TSource iteratorVariable0 in first)
{
yield return iteratorVariable0;
}
foreach (TSource iteratorVariable1 in second)
{
yield return iteratorVariable1;
}
}
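To see the deferred, short-circuiting behaviour for yourself, here is a small sketch that counts how many elements are actually read before FirstOrDefault stops. ConcatDemo, Counted and ElementsReadToFind are hypothetical helper names, not LINQ APIs:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ConcatDemo
{
    // Hypothetical helper: yields the items and counts how many were actually read.
    private static IEnumerable<int> Counted(IEnumerable<int> source, int[] counter)
    {
        foreach (int item in source)
        {
            counter[0]++;
            yield return item;
        }
    }

    // Returns how many elements were read before FirstOrDefault stopped looking for target.
    public static int ElementsReadToFind(List<int> first, List<int> second, int target)
    {
        int[] counter = { 0 };
        Counted(first, counter)
            .Concat(Counted(second, counter))
            .FirstOrDefault(x => x == target);
        return counter[0];
    }
}
```

When the target sits in the first list, the second list should never be touched; only a miss forces a full pass over both.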
While looking at the docs for Concat(), I stumbled across another alternative I didn't know about: SelectMany. Given a collection of collections, it allows you to work with the children of all parent collections at once, like this:
IEnumerable<string> concatenated = new[] { firstColl, secondColl }
.SelectMany(item => item);
you can do something like this:
var list1 = new List<int>{1,2,3,4,5,6,7};
var list2 = new List<int>{0,-3,-4,2};
int elementToPush = 4;//value to find among available lists
var exist = list1.Exists(i=>i==elementToPush) || list2.Exists(j=>j==elementToPush);
If the required element exists in at least one collection, the result is true; otherwise it's false.
One line, and no extra storage is created.
Hope this helps.
You could probably just create a List of lists and then use linq on that list. It is still creating a new List but it is a list of references rather than duplicating the contents of all the lists.
List<string> a = new List<string>{"apple", "aardvark"};
List<string> b = new List<string>{"banana", "bananananana", "bat"};
List<string> c = new List<string>{"cat", "canary"};
List<string> d = new List<string>{"dog", "decision"};
List<List<string>> super = new List<List<string>> {a,b,c,d};
super.Any(x=>x.Contains("apple"));
The Any call should return as soon as one list contains the item, so, as requested, later lists are not processed if the item is found in an earlier one.
Edit: Having written this I prefer the answers using Concat but I leave this here as an alternative if you want something that might be more aesthetically pleasing. ;-)

Selecting x number of elements from a System.Collections.Generic.List c#

What I need is a way to select the last 100 elements from a list, as a list.
public List<Model.PIP> GetPIPList()
{
if (Repository.PIPRepository.PIPList == null)
Repository.PIPRepository.Load();
return Repository.PIPRepository.PIPList.Take(100);
}
I get an error like this:
Cannot implicitly convert type 'System.Collections.Generic.IEnumerable<Model.PIP>' to 'System.Collections.Generic.List<Model.PIP>'. An explicit conversion exists (are you missing a cast?)
somelist.AsEnumerable().Reverse().Take(100).Reverse().ToList();
This would be much cheaper than ordering :) Also preserves the original ordering.
If your list is large, you'll get the best performance by rolling your own:
public static class ListExtensions
{
public static IEnumerable<T> LastItems<T>(this IList<T> list, int numberOfItems) //Can also handle arrays
{
for (int index = Math.Max(list.Count - numberOfItems, 0); index < list.Count; index++)
yield return list[index];
}
}
Why is this faster than using Skip()? If you have a list with 50,000 items, Skip() calls MoveNext() on the enumerator 49,900 times before it starts returning items.
Why is it faster than using Reverse()? Because Reverse allocates a new array large enough to hold the list's elements, and copies them into the array. This is especially good to avoid if the array is large enough to go on the large object heap.
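If what you need back really is a List<T>, another option is List<T>.GetRange, which copies a contiguous slice by index. A minimal sketch, where ListTail and LastN are hypothetical names chosen for illustration:

```csharp
using System;
using System.Collections.Generic;

public static class ListTail
{
    // Hypothetical helper: copies the last `count` items (or fewer if the list is shorter).
    public static List<T> LastN<T>(List<T> list, int count)
    {
        // Clamp so that negative counts and short lists do not throw.
        int take = Math.Min(Math.Max(count, 0), list.Count);
        return list.GetRange(list.Count - take, take);
    }
}
```

This only copies the requested tail rather than the whole list, and the original order is preserved.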
EDIT: I missed that you said you wanted the last 100 items, and weren't able to do that yet.
To get the last 100 items:
return Repository.PIPRepository.PIPList
.OrderByDescending(pip=>pip.??).Take(100)
.OrderBy(pip=>pip.??);
...and then change your method signature to return IEnumerable<Model.PIP>
?? signifies what ever property you would be sorting on.
Joel also gives a great solution, based on counting the number of items in the list, and skipping all but 100 of them. In many cases, that probably works better. I didn't want to post the same solution in my edit! :)
Try:
public List<Model.PIP> GetPIPList()
{
if (Repository.PIPRepository.PIPList == null)
Repository.PIPRepository.Load();
return Repository.PIPRepository.PIPList.Take(100).ToList();
}
The .Take() method returns an IEnumerable<T> rather than a List<T>. This is a good thing, and you should strongly consider altering your method and your work habits to use IEnumerable<T> rather than List<T> as much as is practical.
Aside from that, .Take(100) will also return the first 100 elements, rather than the last 100 elements. You want something like this:
public IEnumerable<Model.PIP> GetPIPs()
{
if (Repository.PIPRepository.PIPList == null)
Repository.PIPRepository.Load();
return Repository.PIPRepository.PIPList.Skip(Math.Max(0,Repository.PIPRepository.PIPList.Count - 100));
}
If you really need a list rather than an enumerable (hint: you probably don't), it's still better to build this method using an IEnumerable and use .ToList() at the place where you call this method.
At some point in the future you'll want to go back and update your Load() code to also use IEnumerable, as well as code later on in the process. The ultimate goal here is to get to the point where you are effectively streaming your objects to the browser, and only ever have one of them loaded into memory on your web server at a time. IEnumerable allows for this. List does not.

Get the last element in a dictionary?

My dictionary:
Dictionary<double, string> dic = new Dictionary<double, string>();
How can I return the last element in my dictionary?
What do you mean by Last? Do you mean Last value added?
The Dictionary<TKey,TValue> class is an unordered collection. Adding and removing items can change what is considered to be the first and last element. Hence there is no way to get the Last element added.
There is an ordered dictionary class available in the form of SortedDictionary<TKey,TValue>. But this will be ordered based on comparison of the keys and not the order in which values were added.
EDIT
Several people have mentioned using the following LINQ style approach
var last = dictionary.Values.Last();
Be very wary about using this method. It will return the last value in the Values collection. This may or may not be the last value you added to the Dictionary. It's probably as likely to not be as it is to be.
Dictionaries are unordered collections - as such, there is no concept of a first or last element. If you are looking for a class that behaves like a dictionary but maintains the insertion order of items, consider using OrderedDictionary.
If you are looking for a collection that sorts the items, consider using SortedDictionary<TKey,TValue>.
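For instance, a SortedDictionary enumerates its pairs in key order, so the last pair it yields is the one with the highest key. A minimal sketch (SortedLookup and ValueOfHighestKey are hypothetical names; as far as I know, LINQ's Last() still walks the whole collection here):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SortedLookup
{
    // Returns the value stored under the highest key.
    // SortedDictionary enumerates in key order, so the last pair has the largest key.
    public static string ValueOfHighestKey(SortedDictionary<double, string> sorted)
    {
        return sorted.Last().Value; // throws InvalidOperationException when empty
    }
}
```

Note that this is the highest key, not the most recently added item.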
If you have an existing dictionary, and you are looking for the 'last' element given some sort order, you could use linq to sort the collection, something like:
myDictionary.Values.OrderBy( x => x.Key ).Last();
Be wary of using Dictionary.Keys.Last(): the order of the key collection is unspecified, so the value you get may not be the value you expect.
I know this question is too old to get any upvotes, but I didn't like any of the answers so will post my own in the hopes of offering another option to future readers.
Assuming you want the highest key value in a dictionary, not the last inserted:
The following did not work for me on .NET 4.0:
myDictionary.Values.OrderBy( x => x.Key ).Last();
I suspect the problem is that the 'x' represents a value in the dictionary, and a value has no key (the dictionary stores the key, the dictionary values do not). I may also be making a mistake in my usage of the technique.
Either way, this solution would be slow for large dictionaries, probably O(n log n) for CS folks, because it is sorting the entire dictionary just to get one entry. That's like rearranging your entire DVD collection just to find one specific movie.
var lastDicVal = dic.Values.Last();
is well established as a bad idea. In practice, this solution may return the last value added to the dictionary (not the highest key value), but in software engineering terms that is meaningless and should not be relied upon. Even if it works every time for the rest of eternity, it represents a time bomb in your code that depends on library implementation detail.
My solution is as follows:
var lastValue = dic[dic.Keys.Max()];
Keys.Max() is much faster than sorting: O(n) instead of O(n log n).
If performance is important enough that even O(n) is too slow, the last inserted key can be tracked in a separate variable used to replace dic.Keys.Max(), which will make the entire lookup as fast as it can be, or O(1).
Note: Use of double or float as a key is not best practice and can yield surprising results which are beyond the scope of this post. Read about "epsilon" in the context of float/double values.
If you're using .NET 3.5, look at:
dic.Keys.Last()
If you want a predictable order, though, use:
IDictionary<int, string> dic = new SortedDictionary<int, string>();
Instead of using:
Dictionary<double, string>
...you could use:
List<KeyValuePair<double, string>>
This would allow you to use the indexer to access the element by order instead of by key.
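A short sketch of that, assuming you only ever append to the list. InsertionOrderDemo and LastAdded are hypothetical names:

```csharp
using System.Collections.Generic;

public static class InsertionOrderDemo
{
    // Returns the most recently added pair; a List keeps insertion order.
    public static KeyValuePair<double, string> LastAdded(List<KeyValuePair<double, string>> entries)
    {
        return entries[entries.Count - 1]; // throws ArgumentOutOfRangeException when empty
    }
}
```

The trade-off is that looking an entry up by key becomes a linear scan, and nothing stops the same key from being added twice.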
Consider creating a custom collection whose Add method records the item being added. It would set a private field containing the last added key or value (or both), depending on your requirements.
Then have a Last() method that returns this. Here's a proof of concept class to show what I mean (please don't knock the lack of interface implementation etc- it is sample code):
public class LastDictionary<TKey, TValue>
{
private Dictionary<TKey, TValue> dict;
public LastDictionary()
{
dict = new Dictionary<TKey, TValue>();
}
public void Add(TKey key, TValue value)
{
LastKey = key;
LastValue = value;
dict.Add(key, value);
}
public TKey LastKey
{
get; private set;
}
public TValue LastValue
{
get; private set;
}
}
From the docs:
For purposes of enumeration, each item in the dictionary is treated as a KeyValuePair<TKey, TValue> structure representing a value and its key. The order in which the items are returned is undefined.
So, I don't think you can rely on Dictionary to return the last element.
Use another collection. Maybe SortedDictionary ...
If you just want the value, this should work (assuming you can use LINQ):
dic.Values.Last()
You could use:
dic.Last()
But a dictionary doesn't really have a last element (the pairs inside aren't ordered in any particular way). The last item will always be the same, but it's not obvious which element it might be.
With .Net 3.5:
string lastItem = dic.Values.Last();
double lastKey = dic.Keys.Last();
...but keep in mind that a dictionary is not ordered, so you can't count on the fact that the values will remain in the same order.
A dictionary isn't meant to be accessed in order, so first, last have no meaning. Do you want the value indexed by the highest key?
Dictionary<double, string> dic = new Dictionary<double, string>();
double highest = double.MinValue;
string result = null;
foreach(double d in dic.Keys)
{
if(d > highest)
{
result = dic[d];
highest = d;
}
}
Instead of calling Last() like most of the other answers suggest, you can access the last element of an indexable collection (a list or an array) via its Count property (see ICollection.Count Property for more information).
A Dictionary is different: its indexer takes a key, not a position, so dic[dic.Count - 1] would look up the key Count - 1 rather than the last element. To go by position you need ElementAt, which follows the dictionary's enumeration order:
Dictionary<double, string> dic = new Dictionary<double, string>();
var lastElementIndex = dic.Count - 1;
var lastElement = dic.ElementAt(lastElementIndex).Value;
Keep in mind that this returns the last VALUE, not the key, and that it throws if the dictionary is empty.
