Removing duplicates from a list with "priority"

Removing duplicates from a list with "priority" - c#

Given a collection of records like this:
string ID1;
string ID2;
string Data1;
string Data2;
// :
string DataN
Initially Data1..N are null, and can pretty much be ignored for this question. ID1 & ID2 both uniquely identify the record. All records will have an ID2; some will also have an ID1. Given an ID2, there is a (time-consuming) method to get it's corresponding ID1. Given an ID1, there is a (time-consuming) method to get Data1..N for the record. Our ultimate goal is to fill in Data1..N for all records as quickly as possible.
Our immediate goal is to (as quickly as possible) eliminate all duplicates in the list, keeping the one with more information.
For example, if Rec1 == {ID1="ABC", ID2="XYZ"}, and Rec2 = {ID1=null, ID2="XYZ"}, then these are duplicates, --- BUT we must specifically remove Rec2 and keep Rec1.
That last requirement eliminates the standard ways of removing Dups (e.g. HashSet), as they consider both sides of the "duplicate" to be interchangeable.

How about you split your original list into 3 - ones with all data, ones with ID1, and ones with just ID2.
Then do:
var unique = allData.Concat(id1Data.Except(allData))
.Concat(id2Data.Except(id1Data).Except(allData));
having defined equality just on the basis of ID2.
I suspect there are more efficient ways of expressing that, but the fundamental idea is sound as far as I can tell. Splitting the initial list into three is simply a matter of using GroupBy (and then calling ToList on each group to avoid repeated queries).
EDIT: Potentially nicer idea: split the data up as before, then do:
var result = new HashSet<...>(allData);
result.UnionWith(id1Data);
result.UnionWith(id2Data);
I believe that UnionWith keeps the existing elements rather than overwriting them with new but equal ones. On the other hand, that's not explicitly specified. It would be nice for it to be well-defined...
(Again, either make your type implement equality based on ID2, or create the hash set using an equality comparer which does so.)

This may smell quite a bit, but I think a LINQ-distinct will still work for you if you ensure the two compared objects come out to be the same. The following comparer would do this:
private class Comp : IEqualityComparer<Item>
{
public bool Equals(Item x, Item y)
{
var equalityOfB = x.ID2 == y.ID2;
if (x.ID1 == y.ID1 && equalityOfB)
return true;
if (x.ID1 == null && equalityOfB)
{
x.ID1 = y.ID1;
return true;
}
if (y.ID1 == null && equalityOfB)
{
y.ID1 = x.ID1;
return true;
}
return false;
}
public int GetHashCode(Item obj)
{
return obj.ID2.GetHashCode();
}
}
Then you could use it on a list as such...
var l = new[] {
new Item { ID1 = "a", ID2 = "b" },
new Item { ID1 = null, ID2 = "b" } };
var l2 = l.Distinct(new Comp()).ToArray();

I had a similar issue a couple of months ago.
Try something like this...
public static List<T> RemoveDuplicateSections<T>(List<T> sections) where T:INamedObject
{
Dictionary<string, int> uniqueStore = new Dictionary<string, int>();
List<T> finalList = new List<T>();
int i = 0;
foreach (T currValue in sections)
{
if (!uniqueStore.ContainsKey(currValue.Name))
{
uniqueStore.Add(currValue.Name, 0);
finalList.Add(sections[i]);
}
i++;
}
return finalList;
}

records.GroupBy(r => r, new RecordByIDsEqualityComparer())
.Select(g => g.OrderByDescending(r => r, new RecordByFullnessComparer()).First())
or if you want to merge the records, then Aggregate instead of OrderByDescending/First.

Related

Search a List of string array to find a value in matching element and return another element in same array

So I have
List<string[]> listy = new List<string[]>();
listy.add('a','1','blue');
listy.add('b','2','yellow');
And i want to search through all of the list ti find the index where the array containing 'yellow' is, and return the first element value, in this case 'b'.
Is there a way to do this with built in functions or am i going to need to write my own search here?
Relatively new to c# and not aware of good practice or all the built in functions. Lists and arrays im ok with but lists of arrays baffles me somewhat.
Thanks in advance.

As others have already suggested, the easiest way to do this involves a very powerful C# feature called LINQ ("Language INtegrated Queries). It gives you a SQL-like syntax for querying collections of objects (or databases, or XML documents, or JSON documents).
To make LINQ work, you will need to add this at the top of your source code file:
using System.Linq;
Then you can write:
IEnumerable<string> yellowThings =
from stringArray in listy
where stringArray.Contains("yellow")
select stringArray[0];
Or equivalently:
IEnumerable<string> yellowThings =
listy.Where(strings => strings.Contains("yellow"))
.Select(strings => strings[0]);
At this point, yellowThings is an object containing a description of the query that you want to run. You can write other LINQ queries on top of it if you want, and it won't actually perform the search until you ask to see the results.
You now have several options...
Loop over the yellow things:
foreach(string thing in yellowThings)
{
// do something with thing...
}
(Don't do this more than once, otherwise the query will be evaluated repeatedly.)
Get a list or array :
List<string> listOfYellowThings = yellowThings.ToList();
string[] arrayOfYellowThings = yellowThings.ToArray();
If you expect to have exactly one yellow thing:
string result = yellowThings.Single();
// Will throw an exception if the number of matches is zero or greater than 1
If you expect to have either zero or one yellow things:
string result = yellowThings.SingleOrDefault();
// result will be null if there are no matches.
// An exception will be thrown if there is more than one match.
If you expect to have one or more yellow things, but only want the first one:
string result = yellowThings.First();
// Will throw an exception if there are no yellow things
If you expect to have zero or more yellow things, but only want the first one if it exists:
string result = yellowThings.FirstOrDefault();
// result will be null if there are no yellow things.

Based on the problem explanation provided by you following is the solution I can suggest.
List<string[]> listy = new List<string[]>();
listy.Add(new string[] { "a", "1", "blue"});
listy.Add(new string[] { "b", "2", "yellow"});
var target = listy.FirstOrDefault(item => item.Contains("yellow"));
if (target != null)
{
Console.WriteLine(target[0]);
}
This should solve your issue. Let me know if I am missing any use case here.

You might consider changing the data structure,
Have a class for your data as follows,
public class Myclas
{
public string name { get; set; }
public int id { get; set; }
public string color { get; set; }
}
And then,
static void Main(string[] args)
{
List<Myclas> listy = new List<Myclas>();
listy.Add(new Myclas { name = "a", id = 1, color = "blue" });
listy.Add(new Myclas { name = "b", id = 1, color = "yellow" });
var result = listy.FirstOrDefault(t => t.color == "yellow");
}

Your current situation is
List<string[]> listy = new List<string[]>();
listy.Add(new string[]{"a","1","blue"});
listy.Add(new string[]{"b","2","yellow"});
Now there are Linq methods, so this is what you're trying to do
var result = listy.FirstOrDefault(x => x.Contains("yellow"))?[0];

C# - fastest way of comparing a collection against itself to find duplicates

public class TestObject
{
string TestValue { get; set; }
bool IsDuplicate { get; set; }
}
List<TestObject> testList = new List<TestObject>
{
new TestObject { TestValue = "Matt" },
new TestObject { TestValue = "Bob" },
new TestObject { TestValue = "Alice" },
new TestObject { TestValue = "Matt" },
new TestObject { TestValue = "Claire" },
new TestObject { TestValue = "Matt" }
};
Imagine testList is actually millions of objects long.
What's the fastest way to ensure that two of those three TestObjects with TestValue of Matt gets its IsDuplicate set to true? No matter how may instances of a given value there are, only one should come out of the process with IsDuplicate of false.
I am not averse to doing this via threading. And the collection doesn't have to be a list if converting it to another collection type is faster.
I need to keep duplicates and mark them as such, not remove them from the collection.
To expand, this is (as you might imagine) a simple expression of a much more complex problem. The objects in question already have an ordinal which I can use to order them.
After matching initial duplicates on exact string equality, I'm going to have to go back through the collection again and re-try the remainder using some fuzzy matching logic. The collection that exists at the start of this process won't be changed during the deduplication, or afterwards.
Eventually the original collection is going to be written out to a file, with likely duplicates flagged.

As others mentioned, the correct approach here would be to use the HashSet class.
var hashSet = new HashSet<string>();
foreach (var obj in testList)
{
if (!hashSet.Add(obj.TestValue))
{
obj.IsDuplicate = true;
}
}
When you add a value first time to the HashSet, it adds successfully and HashSet.Add() method returns true so you don't make any changes to the item. When you're trying to add it second time, HashSet.Add() returns false and you mark your item as a duplicate.
The list will have the following state after finishing running our marking duplicates method:
Matt
Bob
Alice
Claire
Matt DUPLICATE

This is probably quite performant:
foreach (var dupe in testList.GroupBy(x => x.TestValue).SelectMany(g => g.Skip(1)))
dupe.IsDuplicate = true;
[EDIT] This method turns out to be about a third of the speed of the accepted answer above, so that one should be used. This answer is merely of academic interest.

Probably I would go to check for the duplicates while building the collection of the TestValue to avoid looping two times on millions of elements. If this scenario is possible then I would use a Dictionary<string, List<TestValue>>
Dictionary<string, List<TestValue>> myList = new Dictionary<string, List<TestValue>>();
while(NotEndOfData())
{
TestValue obj = GetTestValue();
if(myList.ContainsKey(obj.Name))
{
obj.IsDuplicate = true;
myList[obj.Name].Add(obj);
}
else
{
obj.IsDuplicate = false;
myList.Add(obj.Name, new List<TestValue>() { obj};
}
}

SortedSet<string> sorted = new SortedSet<string>();
for (int i = 0; i < testList.Count; i++)
testList[i].IsDuplicate = !sorted.Add(testList[i].TestValue);
As you have allowed in the question, I'd change testList to be an array instead of a list, to make indexer faster.

Since you indicated that you have a property that keeps the ordinal of your items. We can use that property to reset the sort order back to its original after marking our items as duplicates.
The code below is self-explainatory. But just let me know in case you need any further explaination.
I have assumed that the property name is SortOrder. Modify the code accordingly.
void MarkDuplicates()
{
testList = testList.OrderBy(f => f.TestValue).ThenBy(f => f.SortOrder).ToList();
for (int i = 1; i < testList.Count; i++)
{
if (testList[i].TestValue == testList[i - 1].TestValue) testList[i].IsDuplicate = true;
}
testList = testList.OrderBy(f => f.SortOrder).ToList();
}
I'm not a performance expert. But you can time the various solutions provided here and check the performance for yourself.

Test if all values in a list are unique

I have a small list of bytes and I want to test that they're all different values.
For instance, I have this:
List<byte> theList = new List<byte> { 1,4,3,6,1 };
What's the best way to check if all values are distinct or not?

bool isUnique = theList.Distinct().Count() == theList.Count();

Here's another approach which is more efficient than Enumerable.Distinct + Enumerable.Count (all the more if the sequence is not a collection type). It uses a HashSet<T> which eliminates duplicates, is very efficient in lookups and has a count-property:
var distinctBytes = new HashSet<byte>(theList);
bool allDifferent = distinctBytes.Count == theList.Count;
or another - more subtle and efficient - approach:
var diffChecker = new HashSet<byte>();
bool allDifferent = theList.All(diffChecker.Add);
HashSet<T>.Add returns false if the element could not be added since it was already in the HashSet. Enumerable.All stops on the first "false".

Okay, here is the most efficient method I can think of using standard .Net
using System;
using System.Collections.Generic;
public static class Extension
{
public static bool HasDuplicate<T>(
this IEnumerable<T> source,
out T firstDuplicate)
{
if (source == null)
{
throw new ArgumentNullException(nameof(source));
}
var checkBuffer = new HashSet<T>();
foreach (var t in source)
{
if (checkBuffer.Add(t))
{
continue;
}
firstDuplicate = t;
return true;
}
firstDuplicate = default(T);
return false;
}
}
essentially, what is the point of enumerating the whole sequence twice if all you want to do is find the first duplicate.
I could optimise this more by special casing an empty and single element sequences but that would depreciate from readability/maintainability with minimal gain.

The similar logic to Distinct using GroupBy:
var isUnique = theList.GroupBy(i => i).Count() == theList.Count;

I check if an IEnumerable (aray, list, etc ) is unique like this :
var isUnique = someObjectsEnum.GroupBy(o => o.SomeProperty).Max(g => g.Count()) == 1;

One can also do: Use Hashset
var uniqueIds = new HashSet<long>(originalList.Select(item => item.Id));
if (uniqueIds.Count != originalList.Count)
{
}

There are many solutions.
And no doubt more beautiful ones with the usage of LINQ as "juergen d" and "Tim Schmelter" mentioned.
But, if you bare "Complexity" and speed, the best solution will be to implement it by yourself.
One of the solution will be, to create an array of N size (for byte it's 256).
And loop the array, and on every iteration will test the matching number index if the value is 1 if it does, that means i already increment the array index and therefore the array isn't distinct otherwise i will increment the array cell and continue checking.

And another solution, if you want to find duplicated values.
var values = new [] { 9, 7, 2, 6, 7, 3, 8, 2 };
var sorted = values.ToList();
sorted.Sort();
for (var index = 1; index < sorted.Count; index++)
{
var previous = sorted[index - 1];
var current = sorted[index];
if (current == previous)
Console.WriteLine(string.Format("duplicated value: {0}", current));
}
Output:
duplicated value: 2
duplicated value: 7
http://rextester.com/SIDG48202

How to sort strings by a different value

I've tried looking for an existing question but wasn't sure how to phrase this and this retrieved no results anywhere :(
Anyway, I have a class of "Order Items" that has different properties. These order items are for clothing, so they will have a size (string).
Because I am OCD about these sorts of things, I would like to have the elements sorted not by the sizes as alphanumeric values, but by the sizes in a custom order.
I would also like to not have this custom order hard-coded if possible.
To break it down, if I have a list of these order items with a size in each one, like so:
2XL
S
5XL
M
With alphanumeric sorting it would be in this order:
2XL
5XL
M
S
But I would like to sort this list into this order (from smallest size to largest):
S
M
2XL
5XL
The only way I can think of to do this is to have a hard-coded array of the sizes and to sort by their index, then when I need to grab the size value I can grab the size order array[i] value. But, as I said, I would prefer this order not to be hard-coded.
The reason I would like the order to be dynamic is the order items are loaded from files on the hard disk at runtime, and also added/edited/deleted by the user at run-time, and they may contain a size that I haven't hard-coded, for example I could hard code all the way from 10XS to 10XL but if someone adds the size "110cm" (aka a Medium), it will turn up somewhere in the order that I don't want it to, assuming the program doesn't crash and burn.
I can't quite wrap my head around how to do this.

Also, you could create a Dictionary<int, string> and add Key as Ordering order below. Leaving some gaps between Keys to accomodate new sizes for the future. Ex: if you want to add L (Large), you could add a new item as {15, "L"} without breaking the current order.
Dictionary<int, string> mySizes = new Dictionary<int, string> {
{ 20, "2XL" }, { 1, "S" },
{ 30, "5XL" }, { 10, "M" }
};
var sizes = mySizes.OrderBy(s => s.Key)
.Select(s => new {Size = s.Value})
.ToList();

You can use OrderByDescending + ThenByDescending directly:
sizes.OrderByDescending(s => s == "S")
.ThenByDescending( s => s == "M")
.ThenByDescending( s => s == "2XL")
.ThenByDescending( s => s == "5XL")
.ThenBy(s => s);
I use ...Descending since a true is similar to 1 whereas a false is 0.

I would implement IComparer<string> into your own TShirtSizeComparer. You might have to do some regular expressions to get at the values you need.
IComparer<T> is a great interface for any sorting mechanism. A lot of built-in stuff in the .NET framework uses it. It makes the sorting reusable.
I would really suggest parsing the size string into a separate object that has the size number and the size size then sorting with that.

You need to implement the IComparer interface on your class. You can google how to do that as there are many examples out there

you'll have to make a simple parser for this. You can search inside the string for elements like XS XL and cm" if you then filter that out you have your unit. Then you can obtain the integer that is the value. If you have that you can indeed use an IComparer object but it doesn't have that much of an advantage.

I would make a class out of Size, it is likely that you will need to add more functionality to this in the future. I added the full name of the size, but you could also add variables like width and length, and converters for inches or cm.
private void LoadSizes()
{
List<Size> sizes = new List<Size>();
sizes.Add(new Size("2X-Large", "2XL", 3));
sizes.Add(new Size("Small", "S", 1));
sizes.Add(new Size("5X-Large", "5XL", 4));
sizes.Add(new Size("Medium", "M", 2));
List<string> sizesShortNameOrder = sizes.OrderBy(s => s.Order).Select(s => s.ShortName).ToList();
//If you want to use the size class:
//List<Size> sizesOrder = sizes.OrderBy(s => s.Order).ToList();
}
public class Size
{
private string _name;
private string _shortName;
private int _order;
public string Name
{
get { return _name; }
}
public string ShortName
{
get { return _shortName; }
}
public int Order
{
get { return _order; }
}
public Size(string name, string shortName, int order)
{
_name = name;
_shortName = shortName;
_order = order;
}
}

I implemented TShirtSizeComparer with base class Comparer<object>. Of course you have to adjust it to the sizes and objects you have available:
public class TShirtSizeComparer : Comparer<object>
{
// Compares TShirtSizes and orders them by size
public override int Compare(object x, object y)
{
var _sizesInOrder = new List<string> { "None", "XS", "S", "M", "L", "XL", "XXL", "XXXL", "110 cl", "120 cl", "130 cl", "140 cl", "150 cl" };
var indexX = -9999;
var indexY = -9999;
if (x is TShirt)
{
indexX = _sizesInOrder.IndexOf(((TShirt)x).Size);
indexY = _sizesInOrder.IndexOf(((TShirt)y).Size);
}
else if (x is TShirtListViewModel)
{
indexX = _sizesInOrder.IndexOf(((TShirtListViewModel)x).Size);
indexY = _sizesInOrder.IndexOf(((TShirtListViewModel)y).Size);
}
else if (x is MySelectItem)
{
indexX = _sizesInOrder.IndexOf(((MySelectItem)x).Value);
indexY = _sizesInOrder.IndexOf(((MySelectItem)y).Value);
}
if (indexX > -1 && indexY > -1)
{
return indexX.CompareTo(indexY);
}
else if (indexX > -1)
{
return -1;
}
else if (indexY > -1)
{
return 1;
}
else
{
return 0;
}
}
}
To use it you just have a List or whatever your object is and do:
tshirtList.Sort(new TShirtSizeComparer());
The order you have "hard-coded" is prioritized and the rest is put to the back.
I'm sure it can be done a bit smarter and more generalized to avoid hard-coding it all. You could e.g. look for sizes ending with an "S" and then check how many X's (e.g. XXS) or the number before X (e.g. 2XS) and sort by that, and then repeat for "L" and perhaps other "main sizes".

Using C# Dictionary to parse log file

I am trying to parse a rather long log file and creating a better more manageable listing of issues.
I am able to read and parse out the individual log line by line, but what I need to do is display only unique entries, as some errors occur more often than others and are always recorded with identical text.
What I was going to try to do was create a Dictionary object to hold each unique entry and as I work through the log file, search the Dictionary object to see if the same values are already in there.
Here is a crude sample of the code I have (a work in progress, I hope I have all syntax right) that does not work. For some reason this script never sees any distinct entries (if statement never passes):
string[] rowdta = new string[4];
Dictionary<string[], int> dict = new Dictionary<string[], int>();
int ctr = -1;
if (linectr == 1)
{
ctr++;
dict.Add(rowdta, ctr);
}
else
{
foreach (KeyValuePair<string[], int> pair in dict)
{
if ((pair.Key[1] != rowdta[1]) || (pair.Key[2] != rowdta[2])| (pair.Key[3] != rowdta[3]))
{
ctr++;
dict.Add(rowdta, ctr);
}
}
}
Some sample data:
First line
rowdta[0]="ErrorType";
rowdta[1]="Undefined offset: 0";
rowdta[2]="/url/routesDisplay2.svc.php";
rowdta[3]="Line Number 5";
2nd line
rowdta[0]="ErrorType";
rowdta[1]="Undefined offset: 0";
rowdta[2]="/url/routesDisplay2.svc.php";
rowdta[3]="Line Number 5";
3rd line
rowdta[0]="ErrorType";
rowdta[1]="Undefined variable: fvmsg";
rowdta[2]="/url/processes.svc.php";
rowdta[3]="Line Number 787";
So, with this, the Dictionary will have 2 items in it, first line and 3rd line.
I have also tried this with the following which nalso does not find any variations in the log file text.
if (!dict.ContainsKey(rowdta)) {}
Can someone please help me get this syntax right? I am just a newbie at C# but this should be relatively straightforward. As always, I am thinking that this should be enough information to get the conversation started. If you want/need more detail, please let me know.

Either create a wrapper for your strings which implements IEquatable.
public class LogFileEntry :IEquatable<LogFileEntry>
{
private readonly string[] _rows;
public LogFileEntry(string[] rows)
{
_rows = rows;
}
public override int GetHashCode()
{
return
_rows[0].GetHashCode() << 3 |
_rows[2].GetHashCode() << 2 |
_rows[1].GetHashCode() << 1 |
_rows[0].GetHashCode();
}
#region Implementation of IEquatable<LogFileEntry>
public override bool Equals(Object obj)
{
if (obj == null)
return base.Equals(obj);
return Equals(obj as LogFileEntry);
}
public bool Equals(LogFileEntry other)
{
if(other == null)
return false;
return _rows.SequenceEqual(other._rows);
}
#endregion
}
Then use that in your dictionary:
var d = new Dictionary<LogFileEntry, int>();
var entry = new LogFileEntry(rows);
if( d.ContainsKey(entry) )
{
d[entry] ++;
}
else
{
d[entry] = 1;
}
Or create a custom comparer similar to that proposed by #dasblinkenlight and use as follows
public class LogFileEntry
{
}
public class LogFileEntryComparer : IEqualityComparer<LogFileEntry>{ ... }
var d = new Dictionary<LogFileEntry, int>(new LogFileEntryComparer());
var entry = new LogFileEntry(rows);
if( d.ContainsKey(entry) )
{
d[entry] ++;
}
else
{
d[entry] = 1;
}

The reason that you see the problem is that an array of strings cannot be used as a key in a dictionary without supplying a custom IEqualityComparer<string[]> or writing a wrapper around it.
EDIT Here is a quick and dirty implementation of a custom comparer:
private class ArrayEq<T> : IEqualityComparer<T[]> {
public bool Equals(T[] x, T[] y) {
return x.SequenceEqual(y);
}
public int GetHashCode(T[] obj) {
return obj.Sum(o => o.GetHashCode());
}
}
Here is how you can use it:
var dd = new Dictionary<string[], int>(new ArrayEq<string>());
dd[new[] { "a", "b" }] = 0;
dd[new[] { "a", "b" }]++;
dd[new[] { "a", "b" }]++;
Console.WriteLine(dd[new[] { "a", "b" }]);

The problem is that array equality is reference equality. In other words, it does not depend on the values stored in the array, it depends only on the identity of the array.
Some solutions
use Tuple to hold the row data
use an anonymous type to hold the row data
create a custom type to hold the row data, and, if it is a class, override Equals and GetHashCode.
create a custom implementation of IEqualityComparer to compare the arrays according to their values, and pass that to the dictionary when you create it.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.