Finding duplicates within list of list - c#

Simple situation. I have a list of lists, almost table like, and I am trying to find out if any of the lists are duplicated.
Example:
List<List<int>> list = new List<List<int>>(){
new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
};
I would like to know that there are 4 total items, 2 of which are duplicates. I was thinking about doing something like a SQL checksum but I didn't know if there was a better/easier way.
I care about performance, and I care about ordering.
Additional Information That May Help
Things inserted into this list will never be removed
Not bound to any specific collection.
Dont care about function signature
They type is not restricted to int

Let's try to get best performace. if n is number of lists and m is length of lists then we can get O(nm + nlogn + n) plus some probability of hash codes to be equal for different lists.
Major steps:
Calculate hash codes*
Sort them
Go over list to find dupes
* this is important step. for simlicity you can calc hash as = ... ^ (list[i] << i) ^ (list[i + 1] << (i + 1))
Edit for those people that think that PLINQ can boost the thing, but not good algorythm. PLINQ can also be added here, because all the steps are easily parallelizable.
My code:
static public void Main()
{
List<List<int>> list = new List<List<int>>(){
new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
};
var hashList = list.Select((l, ind) =>
{
uint hash = 0;
for (int i = 0; i < l.Count; i++)
{
uint el = (uint)l[i];
hash ^= (el << i) | (el >> (32 - i));
}
return new {hash, ind};
}).OrderBy(l => l.hash).ToList();
//hashList.Sort();
uint prevHash = hashList[0].hash;
int firstInd = 0;
for (int i = 1; i <= hashList.Count; i++)
{
if (i == hashList.Count || hashList[i].hash != prevHash)
{
for (int n = firstInd; n < i; n++)
for (int m = n + 1; m < i; m++)
{
List<int> x = list[hashList[n].ind];
List<int> y = list[hashList[m].ind];
if (x.Count == y.Count && x.SequenceEqual(y))
Console.WriteLine("Dupes: {0} and {1}", hashList[n].ind, hashList[m].ind);
}
}
if (i == hashList.Count)
break;
if (hashList[i].hash != prevHash)
{
firstInd = i;
prevHash = hashList[i].hash;
}
}
}

Unless you're doing some seriously heavy lifting, perhaps the following simple code will work for you:
var lists = new List<List<int>>()
{
new List<int>() {0 ,1, 2, 3, 4, 5, 6 },
new List<int>() {0 ,1, 2, 3, 4, 5, 6 },
new List<int>() {0 ,1, 4, 2, 4, 5, 6 },
new List<int>() {0 ,3, 2, 5, 1, 6, 4 }
};
var duplicates = from list in lists
where lists.Except(new[] { list }).Any(l => l.SequenceEqual(list))
select list;
Obviously you could get better performance if you hand-tweak an algorithm such that you don't have to scan the lists each iteration, but there is something to be said for writing declarative, simpler code.
(Plus, thanks to the Awesomeness of LINQ®, by adding a .AsParallel() call to the above code, the algorithm will run on multiple cores, thus running potentially faster than the complex, hand-tweaked solutions mentioned in this thread.)

Something like this will give you the correct results:
List<List<int>> list = new List<List<int>>(){
new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
};
list.ToLookup(l => String.Join(",", l.Select(i => i.ToString()).ToArray()))
.Where(lk => lk.Count() > 1)
.SelectMany(group => group);

You will have to iterate through each index of each list at least once, but you can potentially speed up the process by creating a custom hash table, so that you can quickly reject non-duplicate lists without having to do comparisons per-item.
Algorithm:
Create a custom hashtable (dictionary: hash -> list of lists)
For each list
Take a hash of the list (one that takes order into account)
Search in hashtable
If you find matches for the hash
For each list in the hash entry, re-compare the tables
If you find a duplicate, return true
Else if you don't find matches for the hash
Create a temp list
Append the current list to our temp list
Add the temp list to the dictionary as a new hash entry
You didn't find any duplicates, so return false
If you have a strong enough hashing algorithm for your input data, you might not even have to do the sub-comparisons, since there wouldn't be any hash collisions.
I have some example code. The missing bits are:
An optimization so that we do the dictionary lookup only once per list (for search and insert). Might have to make your own Dictionary/Hash Table class to do this?
A better hashing algorithm that you find by profiling a bunch of them against your data
Here is the code:
public bool ContainsDuplicate(List<List<int>> input)
{
var encounteredLists = new Dictionary<int, List<EnumerableWrapper>>();
foreach (List<int> currentList in input)
{
var currentListWrapper = new EnumerableWrapper(currentList);
int hash = currentListWrapper.GetHashCode();
if (encounteredLists.ContainsKey(hash))
{
foreach (EnumerableWrapper currentEncounteredEntry in encounteredLists[hash])
{
if (currentListWrapper.Equals(currentEncounteredEntry))
return true;
}
}
else
{
var newEntry = new List<EnumerableWrapper>();
newEntry.Add(currentListWrapper);
encounteredLists[hash] = newEntry;
}
}
return false;
}
sealed class EnumerableWrapper
{
public EnumerableWrapper(IEnumerable<int> list)
{
if (list == null)
throw new ArgumentNullException("list");
this.List = list;
}
public IEnumerable<int> List { get; private set; }
public override bool Equals(object obj)
{
bool result = false;
var other = obj as EnumerableWrapper;
if (other != null)
result = Enumerable.SequenceEqual(this.List, other.List);
return result;
}
public override int GetHashCode()
{
// Todo: Implement your own hashing algorithm here
var sb = new StringBuilder();
foreach (int value in List)
sb.Append(value.ToString());
return sb.ToString().GetHashCode();
}
}

Here's a potential idea (this assumes that the values are numerical):
Implement a comparer that multiplies each member of each collection by its index, then sum the whole thing:
Value: 0 5 8 3 2 0 5 3 5 1
Index: 1 2 3 4 5 6 7 8 9 10
Multiple: 0 10 24 12 10 0 35 24 45 10
Member CheckSum: 170
So, the whole "row" has a number that changes with the members and ordering. Fast to compute and compare.

You could also try probabilistic algorithms if duplicates are either very rare or very common. e.g. a bloom filter

What about that writing your own list comparer:
class ListComparer:IEqualityComparer<List<int>>
{
public bool Equals(List<int> x, List<int> y)
{
if(x.Count != y.Count)
return false;
for(int i = 0; i < x.Count; i++)
if(x[i] != y[i])
return false;
return true;
}
public int GetHashCode(List<int> obj)
{
return base.GetHashCode();
}
}
and then just:
var nonDuplicatedList = list.Distinct(new ListComparer());
var distinctCount = nonDuplicatedList.Count();

if they are all single digit and have the same number of elements you can put them together so the first one is 123456 and check if the numbers are the same.
then you would have a list {123456, 123456, 142456, 325164}
which is easier to check for duplicates, if the individual members can be more than 10 you would have to modify this.
Edit: added sample code, can be optimize, just a quick example to explain what I meant.
for(int i = 0; i< list.length; i++)
{
List<int> tempList = list[i];
int temp = 0;
for(int j = tempList.length - 1;i > = 0; j--)
{
temp = temp * 10 + tempList[j];
}
combinded.add(temp);
}
for(int i =0; i< combined.length; i++)
{
for(int j = i; j < combined.length; j++)
{
if(combined[i] == combined[j])
{
return true;
}
}
}
return false;

There are a number of good solutions here already, but I believe this one will consistently run the fastest unless there is some structure of the data that you haven't yet told us about.
Create a map from integer key to List, and a map from key to List<List<int>>
For each List<int>, compute a hash using some simple function like (...((x0)*a + x1)*a + ...)*a + xN) which you can calculate recursively; a should be something like 1367130559 (i.e. some large prime that is randomly non-close to any interesting power of 2.
Add the hash and the list it comes from as a key-value pair, if it does not exist. If it does exist, look in the second map. If the second map has that key, append the new List<int> to the accumulating list. If not, take the List<int> you looked up from the first map and the List<int> you were testing, and add a new entry in the second map containing a list of those two items.
Repeat until you've passed through your entire first list. Now you have a hashmap with a list of potential collisions (the second map), and a hashmap with a list of keys (the first map).
Iterate through the second map. For each entry, take the List<List<int>> therein and sort it lexicographically. Now just walk through doing equality comparisons to count the number of different blocks.
Your total number of items is equal to the length of your original list.
Your number of distinct items is equal to the size of your first hashmap plus the sum of (number of blocks - 1) for each entry in your second hashmap.
Your number of duplicate items is the difference of those two numbers (and you can find out all sorts of other things if you want).
If you have N non-duplicate items, and M entries which are duplicates from a set of K items, then it will take you O(N+M+2K) to create the initial hash maps, at the very worst O(M log M) to do the sorting (and probably more like O(M log(M/K))), and O(M) to do the final equality test.

Check out C# 3.0: Need to return duplicates from a List<> it shows you how to return duplicates from the list.
Example from that page:
var duplicates = from car in cars
group car by car.Color into grouped
from car in grouped.Skip(1)
select car;

Related

Finding 2 matching sub-lists from list of list

got this little problem in my little C# hobby project that I can't quite work out. I have been stuck in lots of messy and complicated nested loops. Hope someone can give light.
I have a list of list of int, i.e. List<List<int>> . Assume each list of int contains unique items. The minimum size of the list is 5. I need to find exactly two lists of int (List A and List B) that share exactly three common items and another list of int (List X) that contains exactly one of these common items. Another condition must hold: none of the other lists contain any of these three items.
For example:
List<List<int>> allLists = new List<List<int>>();
allLists.Add(new List<int>() {1, 2, 3, 4});
allLists.Add(new List<int>() {1, 2});
allLists.Add(new List<int>() {3, 4});
allLists.Add(new List<int>() {3, 4, 5, 6, 7, 8, 9});
allLists.Add(new List<int>() {4, 6, 8});
allLists.Add(new List<int>() {5, 7, 9, 11});
allLists.Add(new List<int>() {6, 7, 8});
For the above example, I would hope to find a solution as:
ListA and ListB: [3, 5] // indices of allLists
ListX: 6 // index of allLists
The three shared items: [5, 7, 9]
The matching item in ListX: 7
Note: Depending on the content of lists, there may be multiple solutions. There may be also situations that no lists is found matching the above conditions.
I was stuck in some messy nested loops. I was thinking if anyone may come up with a simple and efficient solution (possibly with LINQ?)
Originally I had something stupid like the following:
for (var i = 0; i < allLists.Count - 1; i++)
{
if (allLists[i].Count > 2)
{
for (var j = i + 1; j < allLists.Count; j++)
{
List<int> sharedItems = allLists[i].Intersect(allLists[j]).ToList();
if (sharedItems.Count == 3)
{
foreach (var item in sharedItems)
{
int itemCount = 0;
int? possibleListXIndex = null;
for (var k = 0; k < allLists.Count; k++)
{
if (k != i && k != j && allLists[k].Contains(item))
{
// nested loops getting very ugly here... also not sure what to do....
}
}
}
}
}
}
}
Extended Problem
There is an extended version of this problem in my project. It is in the same fashion:
find exactly three lists of int (List A, List B and List C) that share exactly four common items
find another list of int (List X) that contains exactly one of the above common items
none of the other lists contain these four items.
I was thinking the original algorithm may become scalable to also cover the extended version without having to write another version of algorithm from scratch. With my nested loops above, I think I will have no choice but to add at least two deeper-level loops to cover four items and three lists.
I thank everyone for your contributions in advance! Truly appreciated.
Here's a solution that comes up with your answer. I wouldn't exactly called it efficient, but it's pretty simple to follow.
It breaks the work in two step. First it constructs a list of initial candidates where they have exactly three matches. The second step adds the ListX property and checks to see if the remaining criteria is met.
var matches = allLists.Take(allLists.Count - 1)
.SelectMany((x, xIdx) => allLists
.Skip(xIdx + 1)
.Select(y => new { ListA = x, ListB = y, Shared = x.Intersect(y) })
.Where(y => y.Shared.Count() == 3))
.SelectMany(x => allLists
.Where(y => y != x.ListA && y != x.ListB)
.Select(y => new
{
x.ListA,
x.ListB,
x.Shared,
ListX = y,
SingleShared = x.Shared.Intersect(y)
})
.Where(y => y.SingleShared.Count() == 1
&& !allLists.Any(z => z != y.ListA
&& z != y.ListB
&& z != y.ListX
&& z.Intersect(y.Shared).Any())));
You get the output below after running the following code.
ListA. 3: [3, 4, 5, 6, 7, 8, 9] ListB. 5: [5, 7, 9, 11] => [5, 7, 9], ListX. 6:[6, 7, 10] => 7
matches.ToList().ForEach(x => {
Console.WriteLine("ListA. {0}: [{1}] ListB. {2}: [{3}] => [{4}], ListX. {5}:[{6}] => {7}",
allLists.IndexOf(x.ListA),
string.Join(", ", x.ListA),
allLists.IndexOf(x.ListB),
string.Join(", ", x.ListB),
string.Join(", ", x.Shared),
allLists.IndexOf(x.ListX),
string.Join(", ", x.ListX),
string.Join(", ", x.SingleShared));
I will leave the exercise of further work such as which list matches which other one given your fairly generic requirement. So here, I find those that match 3 other values in a given array, processing all arrays - so there are duplicates here where an A matches a B and a B matches an A for example.
This should give you something you can work from:
using System;
using System.Collections.Generic;
using System.Linq;
public class Program
{
public static void Main()
{
Console.WriteLine("Hello World");
var s = new List<int>();
List<List<int>> allLists = new List<List<int>>();
allLists.Add(new List<int>()
{1, 2, 3, 4});
allLists.Add(new List<int>()
{1, 2});
allLists.Add(new List<int>()
{3, 4});
allLists.Add(new List<int>()
{3, 4, 5, 6, 7, 8, 9});
allLists.Add(new List<int>()
{4, 6, 8});
allLists.Add(new List<int>()
{5, 7, 9, 11});
allLists.Add(new List<int>()
{6, 7, 8});
/*
// To iterate over it.
foreach (List<int> subList in allLists)
{
foreach (int item in subList)
{
Console.WriteLine(item);
}
}
*/
var countMatch = 3;
/* iterate over our lists */
foreach (var sub in allLists)
{
/* not the sub list */
var ns = allLists.Where(g => g != sub);
Console.WriteLine("Check:{0}", ns.Count()); // 6 of the 7 lists so 6 to check against
//foreach (var glist in ns) - all of them, now refactor to filter them:
foreach (var glist in ns.Where(n=> n.Intersect(sub).Count() == countMatch))
{
var r = sub.Intersect(glist); // get all the matches of glist and sub
Console.WriteLine("Matches:{0} in {1}", r.Count(), glist.Count());
foreach (int item in r)
{
Console.WriteLine(item);
}
}
}
}
}
This will output this:
Hello World
Check:6
Check:6
Check:6
Check:6
Matches:3 in 3
4
6
8
Matches:3 in 4
5
7
9
Matches:3 in 3
6
7
8
Check:6
Matches:3 in 7
4
6
8
Check:6
Matches:3 in 7
5
7
9
Check:6
Matches:3 in 7
6
7
8
I think it would be better to break this functionality into several methods, then it will look easier to read.
var allLists = new List<List<int>>();
allLists.Add(new List<int>() {1, 2, 3, 4});
allLists.Add(new List<int>() {1, 2});
allLists.Add(new List<int>() {3, 4});
allLists.Add(new List<int>() {3, 4, 5, 6, 7, 8, 9});
allLists.Add(new List<int>() {4, 6, 8});
allLists.Add(new List<int>() {5, 7, 9, 11});
allLists.Add(new List<int>() {6, 7, 8});
var count = allLists.Count;
for (var i = 0; i < count - 1; i++)
{
var left = allLists[i];
if (left.Count > 2)
{
for (var j = i + 1; j < count; j++)
{
var right = allLists[j];
var sharedItems = left.Intersect(right).ToList();
if (sharedItems.Count == 3)
{
for (int k = 0; k < count; k++)
{
if (k == i || k == j)
continue;
var intersected = allLists[k].Intersect(sharedItems).ToList();
if (intersected.Count == 1)
{
Console.WriteLine($"Found index k:{k},i:{i},j:{j}, Intersected numbers:{string.Join(",",intersected)}");
}
}
}
}
}
}
I just thought if I try to find the item in List X first, I might need fewer loops in practice. Correct me if I am wrong.
public static void match(List<List<int>> allLists, int numberOfMainListsToExistIn, int numberOfCommonItems)
{
var possibilitiesToCheck = allLists.SelectMany(i => i).GroupBy(e => e).Where(e => (e.Count() == numberOfMainListsToExistIn + 1));
foreach (var pGroup in possibilitiesToCheck)
{
int p = pGroup.Key;
List<int> matchingListIndices = allLists.Select((l, i) => l.Contains(p) ? i : -1).Where(i => i > -1).ToList();
for (int i = 0; i < matchingListIndices.Count; i++)
{
int aIndex = matchingListIndices[i];
int bIndex = matchingListIndices[(i + 1) % matchingListIndices.Count];
int indexOfListXIndex = (i - 1 + matchingListIndices.Count) % matchingListIndices.Count;
int xIndex = matchingListIndices[indexOfListXIndex];
IEnumerable<int> shared = allLists[aIndex].Intersect(allLists[bIndex]).OrderBy(e => e);
IEnumerable<int> xSingle = shared.Intersect(allLists[xIndex]);
bool conditionsHold = false;
if (shared.Count() == numberOfCommonItems && xSingle.Count() == 1 && xSingle.Contains(p))
{
conditionsHold = true;
for (int j = 2; j < matchingListIndices.Count - 1; j++)
{
int cIndex = matchingListIndices[(i + j) % matchingListIndices.Count];
if (!Enumerable.SequenceEqual(shared, allLists[aIndex].Intersect(allLists[cIndex]).OrderBy(e => e)))
{
conditionsHold = false;
break;
}
}
if (conditionsHold)
{
List<int> theOtherListIndices = Enumerable.Range(0, allLists.Count - 1).Except(matchingListIndices).ToList();
if (theOtherListIndices.Any(x => shared.Intersect(allLists[x]).Count() > 0))
{
conditionsHold = false;
}
}
}
if (conditionsHold)
{
matchingListIndices.RemoveAt(indexOfListXIndex);
Console.Write("List A and B: {0}. ", String.Join(", ", matchingListIndices));
Console.Write("Common items: {0}. ", String.Join(", ", shared));
Console.Write("List X: {0}.", xIndex);
Console.WriteLine("Common item in list X: {0}. ", p);
}
}
}
}
For the above example, I will just call the method like this:
match(allLists, 2, 3);
This method will also work with the extended problem:
match(allLists, 3, 4);
... and even more if the problem is even further more extended to (4, 5) and so on...

The union of the intersects of the 2 set combinations of a sequence of sequences

How can I find the set of items that occur in 2 or more sequences in a sequence of sequences?
In other words, I want the distinct values that occur in at least 2 of the passed in sequences.
Note:
This is not the intersect of all sequences but rather, the union of the intersect of all pairs of sequences.
Note 2:
The does not include the pair, or 2 combination, of a sequence with itself. That would be silly.
I have made an attempt myself,
public static IEnumerable<T> UnionOfIntersects<T>(
this IEnumerable<IEnumerable<T>> source)
{
var pairs =
from s1 in source
from s2 in source
select new { s1 , s2 };
var intersects = pairs
.Where(p => p.s1 != p.s2)
.Select(p => p.s1.Intersect(p.s2));
return intersects.SelectMany(i => i).Distinct();
}
but I'm concerned that this might be sub-optimal, I think it includes intersects of pair A, B and pair B, A which seems inefficient. I also think there might be a more efficient way to compound the sets as they are iterated.
I include some example input and output below:
{ { 1, 1, 2, 3, 4, 5, 7 }, { 5, 6, 7 }, { 2, 6, 7, 9 } , { 4 } }
returns
{ 2, 4, 5, 6, 7 }
and
{ { 1, 2, 3} } or { {} } or { }
returns
{ }
I'm looking for the best combination of readability and potential performance.
EDIT
I've performed some initial testing of the current answers, my code is here. Output below.
Original valid:True
DoomerOneLine valid:True
DoomerSqlLike valid:True
Svinja valid:True
Adricadar valid:True
Schmelter valid:True
Original 100000 iterations in 82ms
DoomerOneLine 100000 iterations in 58ms
DoomerSqlLike 100000 iterations in 82ms
Svinja 100000 iterations in 1039ms
Adricadar 100000 iterations in 879ms
Schmelter 100000 iterations in 9ms
At the moment, it looks as if Tim Schmelter's answer performs better by at least an order of magnitude.
// init sequences
var sequences = new int[][]
{
new int[] { 1, 2, 3, 4, 5, 7 },
new int[] { 5, 6, 7 },
new int[] { 2, 6, 7, 9 },
new int[] { 4 }
};
One-line way:
var result = sequences
.SelectMany(e => e.Distinct())
.GroupBy(e => e)
.Where(e => e.Count() > 1)
.Select(e => e.Key);
// result is { 2 4 5 7 6 }
Sql-like way (with ordering):
var result = (
from e in sequences.SelectMany(e => e.Distinct())
group e by e into g
where g.Count() > 1
orderby g.Key
select g.Key);
// result is { 2 4 5 6 7 }
May be fastest code (but not readable), complexity O(N):
var dic = new Dictionary<int, int>();
var subHash = new HashSet<int>();
int length = array.Length;
for (int i = 0; i < length; i++)
{
subHash.Clear();
int subLength = array[i].Length;
for (int j = 0; j < subLength; j++)
{
int n = array[i][j];
if (!subHash.Contains(n))
{
int counter;
if (dic.TryGetValue(n, out counter))
{
// duplicate
dic[n] = counter + 1;
}
else
{
// first occurance
dic[n] = 1;
}
}
else
{
// exclude duplucate in sub array
subHash.Add(n);
}
}
}
This should be very close to optimal - how "readable" it is depends on your taste. In my opinion it is also the most readable solution.
var seenElements = new HashSet<T>();
var repeatedElements = new HashSet<T>();
foreach (var list in source)
{
foreach (var element in list.Distinct())
{
if (seenElements.Contains(element))
{
repeatedElements.Add(element);
}
else
{
seenElements.Add(element);
}
}
}
return repeatedElements;
You can skip already Intesected sequences, this way will be a little faster.
public static IEnumerable<T> UnionOfIntersects<T>(this IEnumerable<IEnumerable<T>> source)
{
var result = new List<T>();
var sequences = source.ToList();
for (int sequenceIdx = 0; sequenceIdx < sequences.Count(); sequenceIdx++)
{
var sequence = sequences[sequenceIdx];
for (int targetSequenceIdx = sequenceIdx + 1; targetSequenceIdx < sequences.Count; targetSequenceIdx++)
{
var targetSequence = sequences[targetSequenceIdx];
var intersections = sequence.Intersect(targetSequence);
result.AddRange(intersections);
}
}
return result.Distinct();
}
How it works?
Input: {/*0*/ { 1, 2, 3, 4, 5, 7 } ,/*1*/ { 5, 6, 7 },/*2*/ { 2, 6, 7, 9 } , /*3*/{ 4 } }
Step 0: Intersect 0 with 1..3
Step 1: Intersect 1 with 2..3 (0 with 1 already has been intersected)
Step 2: Intersect 2 with 3 (0 with 2 and 1 with 2 already has been intersected)
Return: Distinct elements.
Result: { 2, 4, 5, 6, 7 }
You can test it with the below code
var lists = new List<List<int>>
{
new List<int> {1, 2, 3, 4, 5, 7},
new List<int> {5, 6, 7},
new List<int> {2, 6, 7, 9},
new List<int> {4 }
};
var result = lists.UnionOfIntersects();
You can try this approach, it might be more efficient and also allows to specify the minimum intersection-count and the comparer used:
public static IEnumerable<T> UnionOfIntersects<T>(this IEnumerable<IEnumerable<T>> source
, int minIntersectionCount
, IEqualityComparer<T> comparer = null)
{
if (comparer == null) comparer = EqualityComparer<T>.Default;
foreach (T item in source.SelectMany(s => s).Distinct(comparer))
{
int containedInHowManySequences = 0;
foreach (IEnumerable<T> seq in source)
{
bool contained = seq.Contains(item, comparer);
if (contained) containedInHowManySequences++;
if (containedInHowManySequences == minIntersectionCount)
{
yield return item;
break;
}
}
}
}
Some explaining words:
It enumerates all unique items in all sequences. Since Distinct is using a set this should be pretty efficient. That can help to speed up in case of many duplicates in all sequences.
The inner loop just looks into every sequence if the unique item is contained. Thefore it uses Enumerable.Contains which stops execution as soon as one item was found(so duplicates are no issue).
If the intersection-count reaches the minum intersection count this item is yielded and the next (unique) item is checked.
That should nail it:
int[][] test = { new int[] { 1, 2, 3, 4, 5, 7 }, new int[] { 5, 6, 7 }, new int[] { 2, 6, 7, 9 }, new int[] { 4 } };
var result = test.SelectMany(a => a.Distinct()).GroupBy(x => x).Where(g => g.Count() > 1).Select(y => y.Key).ToList();
First you make sure, there are no duplicates in each sequence. Then you join all sequences to a single sequence and look for duplicates as e.g. here.

Merging arrays with common element

I want to merge arrays with common element. I have list of arrays like this:
List<int[]> arrList = new List<int[]>
{
new int[] { 1, 2 },
new int[] { 3, 4, 5 },
new int[] { 2, 7 },
new int[] { 8, 9 },
new int[] { 10, 11, 12 },
new int[] { 3, 9, 13 }
};
and I would like to merge these arrays like this:
List<int[]> arrList2 = new List<int[]>
{
new int[] { 1, 2, 7 },
new int[] { 10, 11, 12 },
new int[] { 3, 4, 5, 8, 9, 13 } //order of elements doesn't matter
};
How to do it?
Let each number be a vertex in the labelled graph. For each array connect vertices pointed by the numbers in the given array. E.g. given array (1, 5, 3) create two edges (1, 5) and (5, 3). Then find all the connected components in the graph (see: http://en.wikipedia.org/wiki/Connected_component_(graph_theory))
I'm pretty sure it is not the best and the fastest solution, but works.
static List<List<int>> Merge(List<List<int>> source)
{
var merged = 0;
do
{
merged = 0;
var results = new List<List<int>>();
foreach (var l in source)
{
var i = results.FirstOrDefault(x => x.Intersect(l).Any());
if (i != null)
{
i.AddRange(l);
merged++;
}
else
{
results.Add(l.ToList());
}
}
source = results.Select(x => x.Distinct().ToList()).ToList();
}
while (merged > 0);
return source;
}
I've used List<List<int>> instead of List<int[]> to get AddRange method available.
Usage:
var results = Merge(arrList.Select(x => x.ToList()).ToList());
// to get List<int[]> instead of List<List<int>>
var array = results.Select(x => x.ToArray()).ToList();
Use Disjoint-Set Forest data structure. The data structure supports three operations:
MakeSet(item) - creates a new set with a single item
Find(item) - Given an item, look up a set.
Union(item1, item2) - Given two items, connects together the sets to which they belong.
You can go through each array, and call Union on its first element and each element that you find after it. Once you are done with all arrays in the list, you will be able to retrieve the individual sets by going through all the numbers again, and calling Find(item) on them. Numbers the Find on which produce the same set should be put into the same array.
This approach finishes the merge in O(α(n)) amortized (α grows very slowly, so for all practical purposes it can be considered a small constant).

comparing two lists and removing missing numbers with C#

there are two lists:
List<int> list2 = new List<int>(new[] { 1, 2, 3, 5, 6 }); // missing: 0 and 4
List<int> list1 = new List<int>(new[] { 0, 1, 2, 3, 4, 5, 6 });
how do you compare two lists, find missing numbers in List1 and remove these numbers from List1? To be more precise, I need to find a way to specify starting and ending position for comparison.
I imagine that the proccess should be very similar to this:
Step 1.
int start_num = 3; // we know that comparisons starts at number 3
int start = list2.IndexOf(start_num); // we get index of Number (3)
int end = start + 2; // get ending position
int end_num = list2[end]; // get ending number (6)
now we've got positions of numbers (and numbers themselves) for comparison in List2 (3,5,6)
Step 2. To get positions of numbers in List1 for comparison - we can do the following:
int startlist1 = list1.IndexOf(start_num); // starting position
int endlist1 = list1.IndexOf(end_num); // ending position
the range is following: (3,4,5,6)
Step 3. Comparison. Tricky part starts here and I need a help with it
Basically now we need to compare list2 at (3,5,6) with list1 at (3,4,5,6). The missing number is "4".
// I have troubles with this step but the result will be:
int remove_it = 4; // or int []
Step 4. Odd number removal.
int remove_it = 4;
list1 = list1.Where(a => a != remove_it).ToList();
works great, but what will happen if we have 2 missing numbers? i.e.
int remove_it = 4 // becomes int[] remove_it = {4, 0}
Result As you have guessed the result is new List1, without number 4 in it.
richTextBox1.Text = "" + string.Join(",", list1.ToArray()); // output: 0,1,2,3,5,6
textBox1.Text = "" + start + " " + start_num; // output: 2 3
textBox3.Text = "" + end + " " + end_num; // output: 4 6
textBox2.Text = "" + startlist1; // output: 3
textBox4.Text = "" + endlist1; // output: 6
Can you guy help me out with Step 3 or point me out to the right direction?
Also, can you say what will happen if starting number(start_num) is the last number, but I need to get next two numbers? In example from above numbers were 3,5,6, but they should be no different than 5,6,0 or 6,0,1 or 0,1,2.
Just answering the first part:
var list3 = list1.Intersect(list2);
This will set list3 to { 0, 1, 2, 3, 4, 5, 6 } - { 0, 4 } = { 1, 2, 3, 5, 6 }
And a reaction to step 1:
int start_num = 3; // we know that comparisons starts at number 3
int start = list2.IndexOf(start_num); // we get index of Number (3)
int end = start + 2; // get ending position
From where do you get all those magic numbers (3, + 2 ) ?
I think you are over-thinking this, a lot.
var result = list1.Intersect(list2)
You can add a .ToList on the end if you really need the result to be a list.
List<int> list2 = new List<int>(new[] { 1, 2, 3, 5, 6 }); // missing: 0 and 4
List<int> list1 = new List<int>(new[] { 0, 1, 2, 3, 4, 5, 6 });
// find items in list 2 notin 1
var exceptions = list1.Except(list2);
// or are you really wanting to do a union? (unique numbers in both arrays)
var uniquenumberlist = list1.Union(list2);
// or are you wanting to find common numbers in both arrays
var commonnumberslist = list1.Intersect(list2);
maybe you should work with OrderedList instead of List...
Something like this:
list1.RemoveAll(l=> !list2.Contains(l));
To get the numbers that exist in list1 but not in list2, you use the Except extension method:
IEnumerable<int> missing = list1.Except(list2);
To loop through this result to remove them from list1, you have to realise the result, otherwise it will read from the list while you are changing it, and you get an exception:
List<int> missing = list1.Except(list2).ToList();
Now you can just remove them:
foreach (int number in missing) {
list1.Remove(number);
}
I'm not sure I understand your issue, and I hope the solution I give you to be good for you.
You have 2 lists:
List list2 = new List(new[] { 1, 2, 3, 5, 6 }); // missing: 0 and 4
List list1 = new List(new[] { 0, 1, 2, 3, 4, 5, 6 });
To remove from list1 all the missing numbers in list2 I suggest this solution:
Build a new list with missing numbers:
List diff = new List();
then put all the numbers you need to remove in this list. Now the remove process should be simple, just take all the elements you added in diff and remove from list2.
Did I understand correctly that algorithm is:
1) take first number in List 2 and find such number in List1,
2) then remove everything from list 1 until you find second number form list2 (5)
3) repeat step 2) for next number in list2.?
You can use Intersect in conjunction with Skip and Take to get the intersection logic combined with a range (here we ignore the fact 0 is missing as we skip it):
static void Main(string[] args)
{
var list1 = new List<int> { 1, 2, 3, 4, 5 };
var list2 = new List<int> { 0, 1, 2, 3, 5, 6 };
foreach (var i in list2.Skip(3).Take(3).Intersect(list1))
Console.WriteLine(i); // Outputs 3 then 5.
Console.Read();
}
Though if I'm being really honest, I'm not sure what is being asked - the only thing I'm certain on is the intersect part:
var list1 = new List<int> { 1, 2, 3, 4, 5 };
var list2 = new List<int> { 0, 1, 2, 3, 5, 6 };
foreach (var i in list2.Intersect(list1))
Console.WriteLine(i); // Outputs 1, 2, 3, 5.
ok, seems like I hadn't explained the problem well enough, sorry about it. Anyone interested can understand what I meant by looking at this code:
List<int> list2 = new List<int>() { 1, 2, 3, 5, 6 }; // missing: 0 and 4
List<int> list1 = new List<int>() { 0, 1, 2, 3, 4, 5, 6 };
int number = 3; // starting position
int indexer = list2.BinarySearch(number);
if (indexer < 0)
{
list2.Insert(~index, number); // don't look at this part
}
// get indexes of "starting position"
int index1 = list1.Select((item, i) => new { Item = item, Index = i }).First(x => x.Item == number).Index;
int index2 = list2.Select((item, i) => new { Item = item, Index = i }).First(x => x.Item == number).Index;
// reorder lists starting at "starting position"
List<int> reorderedList1 = list1.Skip(index1).Concat(list1.Take(index1)).ToList(); //main big
List<int> reorderedList2 = list2.Skip(index2).Concat(list2.Take(index2)).ToList(); // main small
int end = 2; // get ending position: 2 numbers to the right
int end_num = reorderedList2[end]; // get ending number
int endlist1 = reorderedList1.IndexOf(end_num); // ending position
//get lists for comparison
reorderedList2 = reorderedList2.Take(end + 1).ToList();
reorderedList1 = reorderedList1.Take(endlist1 + 1).ToList();
//compare lists
var list3 = reorderedList1.Except(reorderedList2).ToList();
if (list3.Count != 0)
{
foreach (int item in list3)
{
list1 = list1.Where(x => x != item).ToList(); // remove from list
}
}
// list1 is the result that I wanted to see
if there are any ways to optimize this code please inform me. cheers.

Joining two lists together

If I have two lists of type string (or any other type), what is a quick way of joining the two lists?
The order should stay the same. Duplicates should be removed (though every item in both links are unique). I didn't find much on this when googling and didn't want to implement any .NET interfaces for speed of delivery.
You could try:
List<string> a = new List<string>();
List<string> b = new List<string>();
a.AddRange(b);
MSDN page for AddRange
This preserves the order of the lists, but it doesn't remove any duplicates (which Union would do).
This does change list a. If you wanted to preserve the original lists then you should use Concat (as pointed out in the other answers):
var newList = a.Concat(b);
This returns an IEnumerable as long as a is not null.
The way with the least space overhead is to use the Concat extension method.
var combined = list1.Concat(list2);
It creates an instance of IEnumerable<T> which will enumerate the elements of list1 and list2 in that order.
The Union method might address your needs. You didn't specify whether order or duplicates was important.
Take two IEnumerables and perform a union as seen here:
int[] ints1 = { 5, 3, 9, 7, 5, 9, 3, 7 };
int[] ints2 = { 8, 3, 6, 4, 4, 9, 1, 0 };
IEnumerable<int> union = ints1.Union(ints2);
// yields { 5, 3, 9, 7, 8, 6, 4, 1, 0 }
Something like this:
firstList.AddRange (secondList);
Or, you can use the 'Union' extension method that is defined in System.Linq.
With 'Union', you can also specify a comparer, which can be used to specify whether an item should be unioned or not.
Like this:
List<int> one = new List<int> { 1, 2, 3, 4, 5 };
List<int> second=new List<int> { 1, 2, 5, 6 };
var result = one.Union (second, new EqComparer ());
foreach( int x in result )
{
Console.WriteLine (x);
}
Console.ReadLine ();
#region IEqualityComparer<int> Members
public class EqComparer : IEqualityComparer<int>
{
public bool Equals( int x, int y )
{
return x == y;
}
public int GetHashCode( int obj )
{
return obj.GetHashCode ();
}
}
#endregion
targetList = list1.Concat(list2).ToList();
It's working fine I think so. As previously said, Concat returns a new sequence and while converting the result to List, it does the job perfectly. Implicit conversions may fail sometimes when using the AddRange method.
If some item(s) exist in both lists you may use
var all = list1.Concat(list2).Concat(list3) ... Concat(listN).Distinct().ToList();
As long as they are of the same type, it's very simple with AddRange:
list2.AddRange(list1);
var bigList = new List<int> { 1, 2, 3 }
.Concat(new List<int> { 4, 5, 6 })
.ToList(); /// yields { 1, 2, 3, 4, 5, 6 }
The AddRange method
aList.AddRange( anotherList );
List<string> list1 = new List<string>();
list1.Add("dot");
list1.Add("net");
List<string> list2 = new List<string>();
list2.Add("pearls");
list2.Add("!");
var result = list1.Concat(list2);
one way: List.AddRange() depending on the types?
One way, I haven't seen mentioned that can be a bit more robust, particularly if you wanted to alter each element in some way (e.g. you wanted to .Trim() all of the elements.
List<string> a = new List<string>();
List<string> b = new List<string>();
// ...
b.ForEach(x=>a.Add(x.Trim()));
See this link
public class ProductA
{
public string Name { get; set; }
public int Code { get; set; }
}
public class ProductComparer : IEqualityComparer<ProductA>
{
public bool Equals(ProductA x, ProductA y)
{
//Check whether the objects are the same object.
if (Object.ReferenceEquals(x, y)) return true;
//Check whether the products' properties are equal.
return x != null && y != null && x.Code.Equals(y.Code) && x.Name.Equals(y.Name);
}
public int GetHashCode(ProductA obj)
{
//Get hash code for the Name field if it is not null.
int hashProductName = obj.Name == null ? 0 : obj.Name.GetHashCode();
//Get hash code for the Code field.
int hashProductCode = obj.Code.GetHashCode();
//Calculate the hash code for the product.
return hashProductName ^ hashProductCode;
}
}
ProductA[] store1 = { new ProductA { Name = "apple", Code = 9 },
new ProductA { Name = "orange", Code = 4 } };
ProductA[] store2 = { new ProductA { Name = "apple", Code = 9 },
new ProductA { Name = "lemon", Code = 12 } };
//Get the products from the both arrays
//excluding duplicates.
IEnumerable<ProductA> union =
store1.Union(store2);
foreach (var product in union)
Console.WriteLine(product.Name + " " + product.Code);
/*
This code produces the following output:
apple 9
orange 4
lemon 12
*/
The two options I use are:
list1.AddRange(list2);
or
list1.Concat(list2);
However I noticed as I used that when using the AddRange method with a recursive function, that calls itself very often I got an SystemOutOfMemoryException because the maximum number of dimensions was reached.
(Message Google Translated)
The array dimensions exceeded the supported range.
Using Concat solved that issue.
I just wanted to test how Union works with the default comparer on overlapping collections of reference type objects.
My object is:
class MyInt
{
public int val;
public override string ToString()
{
return val.ToString();
}
}
My test code is:
MyInt[] myInts1 = new MyInt[10];
MyInt[] myInts2 = new MyInt[10];
int overlapFrom = 4;
Console.WriteLine("overlapFrom: {0}", overlapFrom);
Action<IEnumerable<MyInt>, string> printMyInts = (myInts, myIntsName) => Console.WriteLine("{2} ({0}): {1}", myInts.Count(), string.Join(" ", myInts), myIntsName);
for (int i = 0; i < myInts1.Length; i++)
myInts1[i] = new MyInt { val = i };
printMyInts(myInts1, nameof(myInts1));
int j = 0;
for (; j + overlapFrom < myInts1.Length; j++)
myInts2[j] = myInts1[j + overlapFrom];
for (; j < myInts2.Length; j++)
myInts2[j] = new MyInt { val = j + overlapFrom };
printMyInts(myInts2, nameof(myInts2));
IEnumerable<MyInt> myUnion = myInts1.Union(myInts2);
printMyInts(myUnion, nameof(myUnion));
for (int i = 0; i < myInts2.Length; i++)
myInts2[i].val += 10;
printMyInts(myInts2, nameof(myInts2));
printMyInts(myUnion, nameof(myUnion));
for (int i = 0; i < myInts1.Length; i++)
myInts1[i].val = i;
printMyInts(myInts1, nameof(myInts1));
printMyInts(myUnion, nameof(myUnion));
The output is:
overlapFrom: 4
myInts1 (10): 0 1 2 3 4 5 6 7 8 9
myInts2 (10): 4 5 6 7 8 9 10 11 12 13
myUnion (14): 0 1 2 3 4 5 6 7 8 9 10 11 12 13
myInts2 (10): 14 15 16 17 18 19 20 21 22 23
myUnion (14): 0 1 2 3 14 15 16 17 18 19 20 21 22 23
myInts1 (10): 0 1 2 3 4 5 6 7 8 9
myUnion (14): 0 1 2 3 4 5 6 7 8 9 20 21 22 23
So, everything works fine.

Categories