C# - fastest way of comparing a collection against itself to find duplicates

public class TestObject
{
    public string TestValue { get; set; }
    public bool IsDuplicate { get; set; }
}

List<TestObject> testList = new List<TestObject>
{
    new TestObject { TestValue = "Matt" },
    new TestObject { TestValue = "Bob" },
    new TestObject { TestValue = "Alice" },
    new TestObject { TestValue = "Matt" },
    new TestObject { TestValue = "Claire" },
    new TestObject { TestValue = "Matt" }
};
Imagine testList is actually millions of objects long.
What's the fastest way to ensure that two of those three TestObjects with a TestValue of Matt get their IsDuplicate set to true? No matter how many instances of a given value there are, only one should come out of the process with IsDuplicate of false.
I am not averse to doing this via threading. And the collection doesn't have to be a list if converting it to another collection type is faster.
I need to keep duplicates and mark them as such, not remove them from the collection.
To expand, this is (as you might imagine) a simple expression of a much more complex problem. The objects in question already have an ordinal which I can use to order them.
After matching initial duplicates on exact string equality, I'm going to have to go back through the collection again and re-try the remainder using some fuzzy matching logic. The collection that exists at the start of this process won't be changed during the deduplication, or afterwards.
Eventually the original collection is going to be written out to a file, with likely duplicates flagged.

As others have mentioned, the correct approach here is to use the HashSet<T> class.
var hashSet = new HashSet<string>();
foreach (var obj in testList)
{
    if (!hashSet.Add(obj.TestValue))
    {
        obj.IsDuplicate = true;
    }
}
When you add a value the first time, it is added successfully and HashSet.Add() returns true, so you don't make any changes to the item. When you try to add it a second time, HashSet.Add() returns false and you mark the item as a duplicate.
The list will be in the following state after the duplicate-marking method has run:
Matt
Bob
Alice
Matt DUPLICATE
Claire
Matt DUPLICATE

This is probably quite performant:
foreach (var dupe in testList.GroupBy(x => x.TestValue).SelectMany(g => g.Skip(1)))
    dupe.IsDuplicate = true;
[EDIT] This method turns out to be about a third of the speed of the accepted answer above, so that one should be used. This answer is merely of academic interest.

I would probably check for duplicates while building the collection, to avoid looping twice over millions of elements. If that scenario is possible, then I would use a Dictionary<string, List<TestObject>>:
Dictionary<string, List<TestObject>> myList = new Dictionary<string, List<TestObject>>();

while (NotEndOfData())
{
    TestObject obj = GetTestObject();
    if (myList.ContainsKey(obj.TestValue))
    {
        obj.IsDuplicate = true;
        myList[obj.TestValue].Add(obj);
    }
    else
    {
        obj.IsDuplicate = false;
        myList.Add(obj.TestValue, new List<TestObject>() { obj });
    }
}

SortedSet<string> sorted = new SortedSet<string>();
for (int i = 0; i < testList.Count; i++)
    testList[i].IsDuplicate = !sorted.Add(testList[i].TestValue);
As you have allowed in the question, I'd change testList to be an array instead of a list, to make the indexer faster.

You indicated that you have a property that keeps the ordinal of your items; we can use that property to restore the original sort order after marking the items as duplicates.
The code below is self-explanatory, but just let me know in case you need any further explanation.
I have assumed that the property name is SortOrder; modify the code accordingly.
void MarkDuplicates()
{
    testList = testList.OrderBy(f => f.TestValue).ThenBy(f => f.SortOrder).ToList();

    for (int i = 1; i < testList.Count; i++)
    {
        if (testList[i].TestValue == testList[i - 1].TestValue) testList[i].IsDuplicate = true;
    }

    testList = testList.OrderBy(f => f.SortOrder).ToList();
}
I'm not a performance expert, but you can time the various solutions provided here and check the performance for yourself.
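If you do want to time them, a rough Stopwatch sketch like the one below is enough to compare candidates on your own data. It reuses the HashSet and GroupBy approaches from the answers above and assumes the testList and TestObject from the question; treat it as a sketch, not a rigorous benchmark.
var candidates = new Dictionary<string, Action<List<TestObject>>>
{
    ["HashSet"] = list =>
    {
        var seen = new HashSet<string>();
        foreach (var obj in list)
            if (!seen.Add(obj.TestValue)) obj.IsDuplicate = true;
    },
    ["GroupBy"] = list =>
    {
        foreach (var dupe in list.GroupBy(x => x.TestValue).SelectMany(g => g.Skip(1)))
            dupe.IsDuplicate = true;
    }
};

foreach (var candidate in candidates)
{
    // Fresh copy per run so earlier runs don't skew later ones.
    var copy = testList.Select(t => new TestObject { TestValue = t.TestValue }).ToList();
    var sw = System.Diagnostics.Stopwatch.StartNew();
    candidate.Value(copy);
    sw.Stop();
    Console.WriteLine(candidate.Key + ": " + sw.ElapsedMilliseconds + " ms");
}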

Related

Best Way to compare 1 million List of object with another 1 million List of object in c#

I am diffing a list of one million objects against another list of one million objects.
I am using for and foreach loops, but it takes too much time to iterate those lists.
Can anyone suggest the best way to do this?
// assuming an element type with a string Name property; the original List<object> would not compile with s.Name
var SourceList = new List<Item>(); // one million
var TargetList = new List<Item>(); // one million

// getting data from database here
// SourceList filled with one million items
// TargetList filled with one million items

var DifferentList = new List<Item>();

// ForEach
SourceList.ForEach(m =>
{
    if (!TargetList.Any(s => s.Name == m.Name))
        DifferentList.Add(m);
});

// for
for (int i = 0; i < SourceList.Count; i++)
{
    if (!TargetList.Any(s => s.Name == SourceList[i].Name))
        DifferentList.Add(SourceList[i]);
}
I think it seems like a bad idea, but IEnumerable magic will help you.
For starters, simplify your expression. It looks like this:
var result = sourceList.Where(s => !targetList.Any(t => t.Equals(s)));
I recommend making a comparison in the Equals method:
public class CompareObject
{
    public string prop { get; set; }

    public override bool Equals(object o)
    {
        if (o.GetType() == typeof(CompareObject))
            return this.prop == ((CompareObject)o).prop;
        return this.GetHashCode() == o.GetHashCode();
    }
}
Next, add AsParallel(). This can both speed up and slow down your program; in your case, you can add it like this:
var result = sourceList.AsParallel().Where(s => !targetList.Any(t => t.Equals(s)));
The CPU is loaded to 100% if you try to materialize everything at once like this:
var cnt = result.Count();
But it's quite tolerable if you consume the results in small portions:
result.Skip(10000).Take(10000).ToList();
Full code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static Random random = new Random();

public class CompareObject
{
    public string prop { get; private set; }

    public CompareObject()
    {
        prop = random.Next(0, 100000).ToString();
    }

    public override bool Equals(object o)
    {
        if (o.GetType() == typeof(CompareObject))
            return this.prop == ((CompareObject)o).prop;
        return this.GetHashCode() == o.GetHashCode();
    }
}

void Main()
{
    var sourceList = new List<CompareObject>();
    var targetList = new List<CompareObject>();

    for (int i = 0; i < 10000000; i++)
    {
        sourceList.Add(new CompareObject());
        targetList.Add(new CompareObject());
    }

    var stopWatch = new Stopwatch();
    stopWatch.Start();

    var result = sourceList.AsParallel().Where(s => !targetList.Any(t => t.Equals(s)));
    var lr = result.Skip(10000).Take(10000).ToList();

    stopWatch.Stop();
    Console.WriteLine(stopWatch.Elapsed);
}
Update
I remembered that you can use a Hashtable. Take the unique values from targetList and from sourceList, then fill the result with the source values that are not in targetList.
Example:
using System;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;

static Random random = new Random();

public class CompareObject
{
    public string prop { get; private set; }

    public CompareObject()
    {
        prop = random.Next(0, 1000000).ToString();
    }

    public override int GetHashCode()
    {
        return prop.GetHashCode();
    }
}

void Main()
{
    var sourceList = new List<CompareObject>();
    var targetList = new List<CompareObject>();

    for (int i = 0; i < 10000000; i++)
    {
        sourceList.Add(new CompareObject());
        targetList.Add(new CompareObject());
    }

    var stopWatch = new Stopwatch();
    stopWatch.Start();

    var sourceHashtable = new Hashtable();
    var targetHashtable = new Hashtable();

    foreach (var element in targetList)
    {
        var hash = element.GetHashCode();
        if (!targetHashtable.ContainsKey(hash))
            targetHashtable.Add(hash, element);
    }

    var result = new List<CompareObject>();

    foreach (var element in sourceList)
    {
        var hash = element.GetHashCode();
        if (!sourceHashtable.ContainsKey(hash))
        {
            sourceHashtable.Add(hash, element);
            if (!targetHashtable.ContainsKey(hash))
            {
                result.Add(element);
            }
        }
    }

    stopWatch.Stop();
    Console.WriteLine(stopWatch.Elapsed);
}
Scanning the target list to match the name is an O(n) operation, thus your loop is O(n^2). If you build a HashSet<string> of all the distinct names in the target list, you can check whether a name exists in the set in O(1) time using the Contains method.
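A minimal sketch of that suggestion, assuming a hypothetical Item type with a string Name property (the question's List<object> would not expose Name):
public class Item { public string Name { get; set; } }

// Build the set once: O(n). Each Contains is then O(1) on average,
// so the whole diff drops from O(n^2) to O(n).
var targetNames = new HashSet<string>(TargetList.Select(t => t.Name));
var DifferentList = SourceList.Where(s => !targetNames.Contains(s.Name)).ToList();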
//getting data from database here
You are getting the data out of a system that specializes in matching, sorting and filtering data, and into RAM, which by default cannot do any of that. And then you try to sort, filter and match it yourself.
That will fail. No matter how hard you try, it is extremely unlikely that a single programmer writing a matching algorithm on a desktop machine will outperform that specialized piece of hardware called a database server at the one operation its software is supposed to be really good at, programmed by teams of experts and optimized for years.
You don't go into a fancy restaurant and ask them to give you huge bags of raw ingredients so you can throw them into a big bowl unpeeled and microwave them at home. No. You order a nice dish because it will be way better than anything you could do yourself.
The simple answer is: do not do that. Do not take the raw data and rummage around in it for hours. Leave that job to the database; it's the one thing it's supposed to be good at. Use its power. Write a query that will give you the result; don't fetch the raw data and then play database yourself.
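For illustration, a sketch of that idea with hypothetical table and column names (Source, Target, Name) and System.Data.SqlClient assumed as the provider; the NOT EXISTS query lets the server compute the difference:
using System.Collections.Generic;
using System.Data.SqlClient;

var differentNames = new List<string>();
var connectionString = "..."; // your connection string here

// The server returns only the source rows that have no match in Target.
const string sql = @"
    SELECT s.Name
    FROM   Source s
    WHERE  NOT EXISTS (SELECT 1 FROM Target t WHERE t.Name = s.Name);";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(sql, conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
        while (reader.Read())
            differentNames.Add(reader.GetString(0));
}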
foreach adds a little per-iteration overhead (an enumerator call rather than indexed access), so using a standard for loop will provide slightly better performance, though the difference is minor.
If it is taking too long, can you break the collection down into smaller sets and/or process them in parallel?
You could also look at PLINQ (Parallel LINQ) using .AsParallel().
Other areas to improve are the actual comparison logic you are using, and how the data is stored in memory; depending on your problem, you may not have to load the entire object into memory for every iteration.
Please provide a code example so that we can assist further; when such large amounts of data are involved, some performance degradation is to be expected.
Again, depending on the times we are talking about here, you could upload the data into a database and use that for the comparison rather than trying to do it natively in C#. This type of solution is better suited to data sets that are already in a database, or where the data changes much less frequently than you need to run the comparison.

Search a List of string array to find a value in matching element and return another element in same array

So I have:
List<string[]> listy = new List<string[]>();
listy.Add(new string[] { "a", "1", "blue" });
listy.Add(new string[] { "b", "2", "yellow" });
And I want to search through the whole list to find the array containing 'yellow' and return its first element value, in this case 'b'.
Is there a way to do this with built-in functions, or am I going to need to write my own search here?
I'm relatively new to C# and not aware of good practice or all the built-in functions. I'm OK with lists and arrays, but lists of arrays baffle me somewhat.
Thanks in advance.
As others have already suggested, the easiest way to do this involves a very powerful C# feature called LINQ ("Language INtegrated Query"). It gives you a SQL-like syntax for querying collections of objects (or databases, or XML documents, or JSON documents).
To make LINQ work, you will need to add this at the top of your source code file:
using System.Linq;
Then you can write:
IEnumerable<string> yellowThings =
    from stringArray in listy
    where stringArray.Contains("yellow")
    select stringArray[0];
Or equivalently:
IEnumerable<string> yellowThings =
    listy.Where(strings => strings.Contains("yellow"))
         .Select(strings => strings[0]);
At this point, yellowThings is an object containing a description of the query that you want to run. You can write other LINQ queries on top of it if you want, and it won't actually perform the search until you ask to see the results.
You now have several options...
Loop over the yellow things:
foreach (string thing in yellowThings)
{
    // do something with thing...
}
(Don't do this more than once, otherwise the query will be evaluated repeatedly.)
Get a list or array :
List<string> listOfYellowThings = yellowThings.ToList();
string[] arrayOfYellowThings = yellowThings.ToArray();
If you expect to have exactly one yellow thing:
string result = yellowThings.Single();
// Will throw an exception if the number of matches is zero or greater than 1
If you expect to have either zero or one yellow things:
string result = yellowThings.SingleOrDefault();
// result will be null if there are no matches.
// An exception will be thrown if there is more than one match.
If you expect to have one or more yellow things, but only want the first one:
string result = yellowThings.First();
// Will throw an exception if there are no yellow things
If you expect to have zero or more yellow things, but only want the first one if it exists:
string result = yellowThings.FirstOrDefault();
// result will be null if there are no yellow things.
Based on the problem explanation you provided, the following is the solution I can suggest.
List<string[]> listy = new List<string[]>();
listy.Add(new string[] { "a", "1", "blue" });
listy.Add(new string[] { "b", "2", "yellow" });

var target = listy.FirstOrDefault(item => item.Contains("yellow"));
if (target != null)
{
    Console.WriteLine(target[0]);
}
This should solve your issue. Let me know if I am missing any use case here.
You might consider changing the data structure. Have a class for your data as follows:
public class Myclas
{
    public string name { get; set; }
    public int id { get; set; }
    public string color { get; set; }
}
And then,
static void Main(string[] args)
{
    List<Myclas> listy = new List<Myclas>();
    listy.Add(new Myclas { name = "a", id = 1, color = "blue" });
    listy.Add(new Myclas { name = "b", id = 2, color = "yellow" });

    var result = listy.FirstOrDefault(t => t.color == "yellow");
}
Your current situation is
List<string[]> listy = new List<string[]>();
listy.Add(new string[]{"a","1","blue"});
listy.Add(new string[]{"b","2","yellow"});
Now, there are LINQ methods, so this is what you're trying to do:
var result = listy.FirstOrDefault(x => x.Contains("yellow"))?[0];

Test if all values in a list are unique

I have a small list of bytes and I want to test that they're all different values.
For instance, I have this:
List<byte> theList = new List<byte> { 1,4,3,6,1 };
What's the best way to check if all values are distinct or not?
bool isUnique = theList.Distinct().Count() == theList.Count();
Here's another approach which is more efficient than Enumerable.Distinct + Enumerable.Count (all the more if the sequence is not a collection type). It uses a HashSet<T> which eliminates duplicates, is very efficient in lookups and has a count-property:
var distinctBytes = new HashSet<byte>(theList);
bool allDifferent = distinctBytes.Count == theList.Count;
or another - more subtle and efficient - approach:
var diffChecker = new HashSet<byte>();
bool allDifferent = theList.All(diffChecker.Add);
HashSet<T>.Add returns false if the element could not be added since it was already in the HashSet. Enumerable.All stops on the first "false".
Okay, here is the most efficient method I can think of using standard .NET:
using System;
using System.Collections.Generic;
public static class Extension
{
    public static bool HasDuplicate<T>(
        this IEnumerable<T> source,
        out T firstDuplicate)
    {
        if (source == null)
        {
            throw new ArgumentNullException(nameof(source));
        }

        var checkBuffer = new HashSet<T>();
        foreach (var t in source)
        {
            if (checkBuffer.Add(t))
            {
                continue;
            }

            firstDuplicate = t;
            return true;
        }

        firstDuplicate = default(T);
        return false;
    }
}
Essentially, what is the point of enumerating the whole sequence twice if all you want to do is find the first duplicate?
I could optimise this more by special-casing empty and single-element sequences, but that would detract from readability/maintainability with minimal gain.
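Usage might then look like this, with the list from the question above:
var theList = new List<byte> { 1, 4, 3, 6, 1 };

if (theList.HasDuplicate(out var firstDupe))
    Console.WriteLine("Not unique; first duplicate: " + firstDupe); // prints 1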
The similar logic to Distinct using GroupBy:
var isUnique = theList.GroupBy(i => i).Count() == theList.Count;
I check whether an IEnumerable (array, list, etc.) is unique like this:
var isUnique = someObjectsEnum.GroupBy(o => o.SomeProperty).Max(g => g.Count()) == 1;
One can also use a HashSet:
var uniqueIds = new HashSet<long>(originalList.Select(item => item.Id));
if (uniqueIds.Count != originalList.Count)
{
    // the list contains duplicates
}
There are many solutions, and no doubt more beautiful ones with the usage of LINQ, as "juergen d" and "Tim Schmelter" mentioned.
But if you care about complexity and speed, the best solution is to implement it yourself.
One solution is to create an array of N size (for byte it's 256), loop over the values, and mark each value's slot as seen; if a slot is already marked, that value has occurred before, so the list isn't distinct.
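For bytes, a minimal sketch of that counting-array idea (assuming theList from the question):
// One slot per possible byte value; seen[b] flips to true on first occurrence.
var seen = new bool[256];
bool allDistinct = true;

foreach (byte b in theList)
{
    if (seen[b]) { allDistinct = false; break; } // second occurrence found
    seen[b] = true;
}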
And another solution, if you want to find the duplicated values:
var values = new[] { 9, 7, 2, 6, 7, 3, 8, 2 };

var sorted = values.ToList();
sorted.Sort();

for (var index = 1; index < sorted.Count; index++)
{
    var previous = sorted[index - 1];
    var current = sorted[index];

    if (current == previous)
        Console.WriteLine(string.Format("duplicated value: {0}", current));
}
Output:
duplicated value: 2
duplicated value: 7
http://rextester.com/SIDG48202

Map two string arrays without a switch

I have two arrays that need to be mapped. In code:
var result = "[placeholder2] Hello my name is [placeholder1]";
var placeholder = new[] { "[placeholder1]", "[placeholder2]", "[placeholder3]", "[placeholder4]" };
var placeholderValue = new[] { "placeholderValue1", "placeholderValue2", "placeholderValue3" };
Array.ForEach(placeholder, i => result = result.Replace(i, placeholderValue));
Given i, placeholderValue needs to be set in an intelligent way. I could implement a switch statement, but the cyclomatic complexity would be unacceptable with 30 or so elements. What is a good pattern, extension method, or other means to achieve my goal?
I skipped null checks for simplicity
// using System.Text.RegularExpressions;
string result = "[placeholder2] Hello my name is [placeholder1]";

var placeHolders = new Dictionary<string, string>() {
    { "placeholder1", "placeholderValue1" },
    { "placeholder2", "placeholderValue2" }
};

var newResult = Regex.Replace(result, @"\[(.+?)\]", m => placeHolders[m.Groups[1].Value]);
The smallest code change would be to just use a for loop rather than a ForEach taking a lambda. With a for loop you have the index of the matching value in the placeholderValue array.
The next improvement would be to make a single array of an object holding both a placeholder and its value, rather than two 'parallel' arrays that you need to keep in sync.
Even better than that, and also even simpler to implement, is to just have a Dictionary with the key being a placeholder and the value being the placeholder value. This essentially does the above suggestion for you through the use of the KeyValuePair class (so you don't need to make your own).
At that point the pseudocode becomes:
foreach(key in placeholderDictionary) replace key with placeholderDictionary[key]
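In C# that pseudocode could look like this (the placeholder names are the hypothetical ones from the question):
var placeholderDictionary = new Dictionary<string, string>
{
    ["[placeholder1]"] = "placeholderValue1",
    ["[placeholder2]"] = "placeholderValue2"
};

var result = "[placeholder2] Hello my name is [placeholder1]";

// Each pass replaces one placeholder wherever it occurs.
foreach (var pair in placeholderDictionary)
    result = result.Replace(pair.Key, pair.Value);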
I think you want to use Zip to combine the placeholders with their values.
var result = "[placeholder2] Hello my name is [placeholder1]";
var placeholder = new[] { "[placeholder1]", "[placeholder2]", "[placeholder3]", "[placeholder4]" };
var placeholderValue = new[] { "placeholderValue1", "placeholderValue2", "placeholderValue3", "placeholderValue4" };
var placeHolderPairs = placeholder.Zip(placeholderValue, Tuple.Create);
foreach (var pair in placeHolderPairs)
{
    result = result.Replace(pair.Item1, pair.Item2);
}

Removing duplicates from a list with "priority"

Given a collection of records like this:
string ID1;
string ID2;
string Data1;
string Data2;
// :
string DataN
Initially Data1..N are null, and can pretty much be ignored for this question. ID1 and ID2 both uniquely identify the record. All records will have an ID2; some will also have an ID1. Given an ID2, there is a (time-consuming) method to get its corresponding ID1. Given an ID1, there is a (time-consuming) method to get Data1..N for the record. Our ultimate goal is to fill in Data1..N for all records as quickly as possible.
Our immediate goal is to (as quickly as possible) eliminate all duplicates in the list, keeping the one with more information.
For example, if Rec1 == {ID1="ABC", ID2="XYZ"} and Rec2 == {ID1=null, ID2="XYZ"}, then these are duplicates, BUT we must specifically remove Rec2 and keep Rec1.
That last requirement eliminates the standard ways of removing Dups (e.g. HashSet), as they consider both sides of the "duplicate" to be interchangeable.
How about you split your original list into 3 - ones with all data, ones with ID1, and ones with just ID2.
Then do:
var unique = allData.Concat(id1Data.Except(allData))
.Concat(id2Data.Except(id1Data).Except(allData));
having defined equality just on the basis of ID2.
I suspect there are more efficient ways of expressing that, but the fundamental idea is sound as far as I can tell. Splitting the initial list into three is simply a matter of using GroupBy (and then calling ToList on each group to avoid repeated queries).
EDIT: Potentially nicer idea: split the data up as before, then do:
var result = new HashSet<...>(allData);
result.UnionWith(id1Data);
result.UnionWith(id2Data);
I believe that UnionWith keeps the existing elements rather than overwriting them with new but equal ones. On the other hand, that's not explicitly specified. It would be nice for it to be well-defined...
(Again, either make your type implement equality based on ID2, or create the hash set using an equality comparer which does so.)
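Such a comparer might look like this; Record here is a sketch of the type described in the question, and only ID2 participates in equality:
public class Record
{
    public string ID1;
    public string ID2;
    // Data1..N omitted
}

class Id2Comparer : IEqualityComparer<Record>
{
    public bool Equals(Record x, Record y) => x.ID2 == y.ID2;
    public int GetHashCode(Record obj) => obj.ID2.GetHashCode();
}

// "Duplicate" now means "same ID2", so the set keeps the first record seen:
var result = new HashSet<Record>(allData, new Id2Comparer());
result.UnionWith(id1Data);
result.UnionWith(id2Data);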
This may smell quite a bit, but I think a LINQ Distinct will still work for you if you ensure the two compared objects come out to be the same. The following comparer would do this:
private class Comp : IEqualityComparer<Item>
{
    public bool Equals(Item x, Item y)
    {
        var equalityOfB = x.ID2 == y.ID2;
        if (x.ID1 == y.ID1 && equalityOfB)
            return true;

        if (x.ID1 == null && equalityOfB)
        {
            x.ID1 = y.ID1;
            return true;
        }

        if (y.ID1 == null && equalityOfB)
        {
            y.ID1 = x.ID1;
            return true;
        }

        return false;
    }

    public int GetHashCode(Item obj)
    {
        return obj.ID2.GetHashCode();
    }
}
Then you could use it on a list as such...
var l = new[] {
    new Item { ID1 = "a", ID2 = "b" },
    new Item { ID1 = null, ID2 = "b" } };

var l2 = l.Distinct(new Comp()).ToArray();
I had a similar issue a couple of months ago.
Try something like this...
public static List<T> RemoveDuplicateSections<T>(List<T> sections) where T : INamedObject
{
    Dictionary<string, int> uniqueStore = new Dictionary<string, int>();
    List<T> finalList = new List<T>();

    foreach (T currValue in sections)
    {
        if (!uniqueStore.ContainsKey(currValue.Name))
        {
            uniqueStore.Add(currValue.Name, 0);
            finalList.Add(currValue);
        }
    }

    return finalList;
}
records.GroupBy(r => r, new RecordByIDsEqualityComparer())
       .Select(g => g.OrderByDescending(r => r, new RecordByFullnessComparer()).First())
or if you want to merge the records, then Aggregate instead of OrderByDescending/First.
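The two comparers aren't spelled out above; a plausible sketch, reusing the Record type sketched earlier, grouping on ID2 and treating a non-null ID1 as "fuller":
// Two records are "the same" when they share an ID2.
class RecordByIDsEqualityComparer : IEqualityComparer<Record>
{
    public bool Equals(Record x, Record y) => x.ID2 == y.ID2;
    public int GetHashCode(Record obj) => obj.ID2.GetHashCode();
}

// A record carrying an ID1 compares greater, so OrderByDescending/First
// picks the fuller record from each group.
class RecordByFullnessComparer : IComparer<Record>
{
    public int Compare(Record x, Record y) =>
        (x.ID1 != null).CompareTo(y.ID1 != null);
}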
