I am building a list of Users; each user has a FullName.
I'm comparing users on FullName.
I'm taking a DataTable with the users from the old DB, parsing each row into a 'User' object, and adding it to a List<Users>, which in the code is a List<Deelnemer>.
It goes like this:
List<Deelnemer> tempDeeln = new List<Deelnemer>();
bool dupes = false;
foreach (DataRow rij in deeln.Rows)
{
    Deelnemer dln = new Deelnemer();
    dln.Dln_Creatiedatum = DateTime.Now;
    dln.Dln_Email = rij["Ler_Email"].ToString();
    dln.Dln_Inst_ID = inst.Inst_ID;
    dln.Dln_Naam = rij["Ler_Naam"].ToString();
    dln.Dln_Username = rij["LerLog_Username"].ToString();
    dln.Dln_Voornaam = rij["Ler_Voornaam"].ToString();
    dln.Dln_Update = (DateTime)rij["Ler_Update"];

    if (!dupes && tempDeeln.Count(q => q.FullName.ToLower() == dln.FullName.ToLower()) > 0)
        dupes = true;

    tempDeeln.Add(dln);
}
Then, when the foreach is done, I check whether the bool is true, work out which entries are the duplicates, and remove the oldest ones.
Now, I think this part of the code is very bad:
if (!dupes && tempDeeln.Count(q => q.FullName.ToLower() == dln.FullName.ToLower()) > 0)
It runs for every user added, and it scans all the users created so far.
My question: how would I optimize this?
You can use a set such as a HashSet<T> to track the unique names observed so far. A hash set supports (amortized) constant-time insertion and lookup, so unlike your existing solution it does not need a full linear search for every new item.
var uniqueNames = new HashSet<string>(StringComparer.CurrentCultureIgnoreCase);
...
foreach (...)
{
    ...
    if (!dupes)
    {
        // Expression is true only if the set already contained the string.
        dupes = !uniqueNames.Add(dln.FullName);
    }
}
If you want to "remove" dupes (i.e. produce one representative element for each name) after you have assembled the list (without using a hash-set), you can do:
var distinctItems = tempDeeln
    .GroupBy(dln => dln.FullName, StringComparer.CurrentCultureIgnoreCase)
    .Select(g => g.First());
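Since you mentioned removing the oldest duplicates afterwards, the same GroupBy idea can keep just the newest record per name. A sketch, assuming Dln_Update is the field that indicates which record is the most recent:
var newestPerName = tempDeeln
    .GroupBy(d => d.FullName, StringComparer.CurrentCultureIgnoreCase)
    .Select(g => g.OrderByDescending(d => d.Dln_Update).First())
    .ToList();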
Try this out:
http://blogs.msdn.com/b/ericwhite/archive/2008/08/19/find-duplicates-using-linq.aspx
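The linked post covers finding duplicates with LINQ; the general shape of it is roughly this (a sketch, not the post's exact code):
var duplicateNames = tempDeeln
    .GroupBy(d => d.FullName.ToLower())
    .Where(g => g.Count() > 1)
    .Select(g => g.Key);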
Count will go through the whole set of items. Try Any instead; that way it stops at the first matching item.
if (!dupes && tempDeeln.Any(q => q.FullName.ToLower() == dln.FullName.ToLower()))
dupes = true;
if (gardenvlist.Count() == days)
{
    var g = gardenvlist;
}
if (oceanvlist.Count() == days)
{
    var o = oceanvlist;
}
if (cityvlist.Count() == days)
{
    var c = cityvlist;
}
var final = g.Union(o).Union(c);
if (final.Count() > 0)
{
    return new ObjectResult(final);
}
return NotFound();
So, what I have here is: I want to check whether gardenvlist is available within the period. If the list is available in the period, select it. After that, check oceanvlist, and so on. Next, check whether the final result contains any lists; if there are one or more, return those lists, else return NotFound.
Sorry if my explanation is not clear enough. I'm new to programming.
You do not need the Select in each if condition because you do not do anything with the projection (select). Also, declare your final variable at the beginning and add to it inside the if blocks.
var rateGroupIds = new List<int>();
if (gardenvlist.Count() == days)
{
    rateGroupIds.AddRange(gardenvlist.Select(x => x.RateGroupID));
}
if (oceanvlist.Count() == days)
{
    rateGroupIds.AddRange(oceanvlist.Select(x => x.RateGroupID));
}
if (cityvlist.Count() == days)
{
    rateGroupIds.AddRange(cityvlist.Select(x => x.RateGroupID));
}
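From there you can return the collected IDs the same way your original code does. A minimal sketch, assuming this sits in an ASP.NET Core controller action:
if (rateGroupIds.Count > 0)
{
    return new ObjectResult(rateGroupIds);
}
return NotFound();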
Try something like this:
var final = new List<T>();
// Add all the lists only if they match
UnionIfMatches(gardenvlist);
UnionIfMatches(oceanvlist);
UnionIfMatches(cityvlist);
if (final.Count() > 0)
return new ObjectResult(final);
return NotFound();
// ---- Local Functions ---- //
// Adds the list to final if it matches
void UnionIfMatches(List<T> list)
{
    if (ListMatches(list))
        final = final.Union(list).ToList(); // Union returns a new sequence, so reassign it to final
}

// Checks if the list matches (same test as in your original code)
bool ListMatches(List<T> list) => list.Count == days;
You used the same code three times, so I moved the match test into a new function, ListMatches(), to make it easier, and then made a second function that adds a list to final only if it matches.
I don't know what type you were using, because you used var, so I'm just guessing it was a List<T>. If it wasn't, just swap it out with the real class and it should still work.
These local functions are only usable and visible from within the scope of the function, which is really useful.
I couldn't test this, so let me know how this works.
I want to create a loop to check a list of titles for duplicates.
I currently have this:
var productTitles = SeleniumContext.Driver.FindElements(By.XPath(ComparisonTableElements.ProductTitle));
foreach (var x in productTitles)
{
    var title = x.Text;
    productTitles = SeleniumContext.Driver.FindElements(By.XPath(ComparisonTableElements.ProductTitle));
    foreach (var y in productTitles.Skip(productTitles.IndexOf(x) + 1))
    {
        if (title == y.Text)
        {
            Assert.Fail("Found duplicate product in the table");
        }
    }
}
But this takes the item I skip out of the array for the next loop, so item 2 is never checked against item 1; it moves straight to item 3.
I was under the impression that Skip just passed over the index you pass in rather than removing it from the list.
You can use GroupBy:
var anyDuplicates = SeleniumContext
.Driver
.FindElements(By.XPath(ComparisonTableElements.ProductTitle))
.GroupBy(p => p.Text, p => p)
.Any(g => g.Count() > 1);
Assert.That(anyDuplicates, Is.False);
or Distinct:
var productTitles = SeleniumContext
.Driver
.FindElements(By.XPath(ComparisonTableElements.ProductTitle))
.Select(p => p.Text)
.ToArray();
var distinctProductTitles = productTitles.Distinct().ToArray();
Assert.AreEqual(productTitles.Length, distinctProductTitles.Length);
Or, if it is enough to find a first duplicate without counting all of them it's better to use a HashSet<T>:
var titles = new HashSet<string>();
foreach (var title in SeleniumContext
    .Driver
    .FindElements(By.XPath(ComparisonTableElements.ProductTitle))
    .Select(p => p.Text))
{
    if (!titles.Add(title))
    {
        Assert.Fail("Found duplicate product in the table");
    }
}
All of these approaches have better computational complexity (O(n)) than what you propose (O(n²)).
You don't need the inner loop. Simply use the Where() function to find all elements with the same title; if there is more than one, they're duplicates:
var productTitles = SeleniumContext.Driver.FindElements(By.XPath(ComparisonTableElements.ProductTitle));
foreach (var x in productTitles)
{
    if (productTitles.Where(y => x.Text == y.Text).Count() > 1)
    {
        Assert.Fail("Found duplicate product in the table");
    }
}
I would try a slightly different way, since you only need to check for duplicates in a one-dimensional collection.
You only have to compare each element against the elements that come after it, so using LINQ to iterate through all of the items seems a bit unnecessary.
Here's a piece of code to better understand:
var productTitles = SeleniumContext.Driver.FindElements(By.XPath(ComparisonTableElements.ProductTitle));
for (int i = 0; i < productTitles.Count; i++)
{
    var currentTitle = productTitles[i].Text;
    for (int j = i + 1; j < productTitles.Count; j++)
    {
        if (currentTitle == productTitles[j].Text)
        {
            // here's your duplicate
        }
    }
}
Since you've already checked that the item at index 0 is not the same as the item at index 3, there's no need to check that again when you're at index 3; the items remain the same.
The Skip(n) method returns an IEnumerable that doesn't "contain" the first n elements of the sequence it's called on.
Also, I don't know what sort of behaviour could arise from this, but I wouldn't assign a new IEnumerable to the variable the foreach is iterating over.
Here's another possible solution with LINQ:
int i = 0;
foreach (var x in productTitles)
{
    var possibleDuplicate = productTitles.Skip(++i).FirstOrDefault(y => y.Text == x.Text);
    // if possibleDuplicate is not the default value of the type,
    // a duplicate was found - do stuff here
}
This goes without saying, but the best solution for you will depend on what you are trying to do. Also, I think the Skip call is more trouble than it's worth; I'm fairly sure it makes the search less efficient.
I am writing a small program that takes in a .csv file as input with about 45k rows. I am trying to compare the contents of this file with the contents of a table on a database (SQL Server through dynamics CRM using Xrm.Sdk if it makes a difference).
In my current program (which takes about 25 minutes to compare; the file and database are exactly the same here, both 45k rows with no differences), I have all existing records from the database in a DataCollection<Entity>, which inherits Collection<T> and IEnumerable<T>.
In my code below I am filtering using the Where method and then doing logic based on the count of matches. The Where seems to be the bottleneck here. Is there a more efficient approach? I am by no means a LINQ expert.
foreach (var record in inputDataLines)
{
    var fields = record.Split(',');
    var fund = fields[0];
    var bps = Convert.ToDecimal(fields[1]);
    var withdrawalPct = Convert.ToDecimal(fields[2]);
    var percentile = Convert.ToInt32(fields[3]);
    var age = Convert.ToInt32(fields[4]);
    var bombOutTerm = Convert.ToDecimal(fields[5]);

    var matchingRows = existingRecords.Entities.Where(r => r["field_1"].ToString() == fund
        && Convert.ToDecimal(r["field_2"]) == bps
        && Convert.ToDecimal(r["field_3"]) == withdrawalPct
        && Convert.ToDecimal(r["field_4"]) == percentile
        && Convert.ToDecimal(r["field_5"]) == age);

    entitiesFound.AddRange(matchingRows);

    if (matchingRows.Count() == 0)
    {
        rowsToAdd.Add(record);
    }
    else if (matchingRows.Count() == 1)
    {
        if (Convert.ToDecimal(matchingRows.First()["field_6"]) != bombOutTerm)
        {
            rowsToUpdate.Add(record);
            entitiesToUpdate.Add(matchingRows.First());
        }
    }
    else
    {
        entitiesToDelete.AddRange(matchingRows);
        rowsToAdd.Add(record);
    }
}
EDIT: I can confirm that all existingRecords are in memory before this code is executed. There is no IO or DB access in the above loop.
Himbrombeere is right, you should execute the query first and put the result into a collection before you use Any, Count, AddRange or whatever method will execute the query again. In your code it's possible that the query is executed 5 times in every loop iteration.
Watch out for the term deferred execution in the documentation. If a method is implemented that way, it means the method can be used to construct a LINQ query (so you can chain it with other methods and at the end you have a query). Only methods that don't use deferred execution, like Count, Any or ToList (or a plain foreach), will actually execute it. If you don't want the whole query to be executed every time, and you have to access it multiple times, it's better to store the result in a collection (e.g. with ToList).
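As a rough illustration of deferred execution (SomePredicate is just a hypothetical placeholder):
var matching = existingRecords.Entities.Where(r => SomePredicate(r)); // nothing runs yet
var count = matching.Count();   // first full iteration of the collection
var exists = matching.Any();    // iterates again (until the first match)
var cached = matching.ToList(); // iterate once, then reuse the list as often as needed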
However, you could use a different approach which should be much more efficient, a Lookup<TKey, TValue> which is similar to a dictionary and can be used with an anonymous type as key:
var lookup = existingRecords.Entities.ToLookup(r => new
{
    fund = r["field_1"].ToString(),
    bps = Convert.ToDecimal(r["field_2"]),
    withdrawalPct = Convert.ToDecimal(r["field_3"]),
    percentile = Convert.ToInt32(r["field_4"]), // parsed as int so the key type matches the key built in the loop
    age = Convert.ToInt32(r["field_5"])         // likewise int, matching Convert.ToInt32 below
});
Now you can access this lookup in the loop very efficiently.
foreach (var record in inputDataLines)
{
    var fields = record.Split(',');
    var fund = fields[0];
    var bps = Convert.ToDecimal(fields[1]);
    var withdrawalPct = Convert.ToDecimal(fields[2]);
    var percentile = Convert.ToInt32(fields[3]);
    var age = Convert.ToInt32(fields[4]);
    var bombOutTerm = Convert.ToDecimal(fields[5]);

    var matchingRows = lookup[new { fund, bps, withdrawalPct, percentile, age }].ToList();

    entitiesFound.AddRange(matchingRows);

    if (matchingRows.Count() == 0)
    {
        rowsToAdd.Add(record);
    }
    else if (matchingRows.Count() == 1)
    {
        if (Convert.ToDecimal(matchingRows.First()["field_6"]) != bombOutTerm)
        {
            rowsToUpdate.Add(record);
            entitiesToUpdate.Add(matchingRows.First());
        }
    }
    else
    {
        entitiesToDelete.AddRange(matchingRows);
        rowsToAdd.Add(record);
    }
}
Note that this will work even if the key does not exist(an empty list is returned).
Add a ToList after your Convert.ToDecimal(r["field_5"]) == age) line to force immediate execution of the query.
var matchingRows = existingRecords.Entities.Where(r => r["field_1"].ToString() == fund
&& Convert.ToDecimal(r["field_2"]) == bps
&& Convert.ToDecimal(r["field_3"]) == withdrawalPct
&& Convert.ToDecimal(r["field_4"]) == percentile
&& Convert.ToDecimal(r["field_5"]) == age)
.ToList();
The Where doesn't actually execute your query, it just prepares it. The actual execution happens later, in a deferred way. In your case that happens when you call Count, which iterates the entire collection. If the first Count check fails, the second one is evaluated, leading to a second full iteration of the collection, and you execute the query a third time when you call matchingRows.First().
When you force immediate execution, the query runs only once, so the collection is iterated only once, which will decrease your overall time.
Another option, which is basically along the same lines as the other answers, is to prepare your data first, so that you're not repeatedly calling things like r["field_2"] (which are relatively slow to look up).
This is a (1) clean your data, (2) query/join your data, (3) process your data approach.
Do this:
(1)
var inputs =
inputDataLines
.Select(record =>
{
var fields = record.Split(',');
return new
{
fund = fields[0],
bps = Convert.ToDecimal(fields[1]),
withdrawalPct = Convert.ToDecimal(fields[2]),
percentile = Convert.ToInt32(fields[3]),
age = Convert.ToInt32(fields[4]),
bombOutTerm = Convert.ToDecimal(fields[5]),
record
};
})
.ToArray();
var entities =
existingRecords
.Entities
.Select(entity => new
{
fund = entity["field_1"].ToString(),
bps = Convert.ToDecimal(entity["field_2"]),
withdrawalPct = Convert.ToDecimal(entity["field_3"]),
percentile = Convert.ToInt32(entity["field_4"]),
age = Convert.ToInt32(entity["field_5"]),
bombOutTerm = Convert.ToDecimal(entity["field_6"]),
entity
})
.ToArray()
.GroupBy(x => new
{
x.fund,
x.bps,
x.withdrawalPct,
x.percentile,
x.age
}, x => new
{
x.bombOutTerm,
x.entity,
});
(2)
var query =
    from i in inputs
    join e in entities on new { i.fund, i.bps, i.withdrawalPct, i.percentile, i.age } equals e.Key into matches
    // group join keeps inputs with no matching entities, so the Count() == 0 branch below still runs for them
    select new { input = i, matchingRows = matches.SelectMany(g => g) };
(3)
foreach (var x in query)
{
    entitiesFound.AddRange(x.matchingRows.Select(y => y.entity));

    if (x.matchingRows.Count() == 0)
    {
        rowsToAdd.Add(x.input.record);
    }
    else if (x.matchingRows.Count() == 1)
    {
        if (x.matchingRows.First().bombOutTerm != x.input.bombOutTerm)
        {
            rowsToUpdate.Add(x.input.record);
            entitiesToUpdate.Add(x.matchingRows.First().entity);
        }
    }
    else
    {
        entitiesToDelete.AddRange(x.matchingRows.Select(y => y.entity));
        rowsToAdd.Add(x.input.record);
    }
}
I would suspect that this will be among the fastest of the approaches presented.
I am very new to LINQ queries. I have a set of records in a CSV like the one below:
ProdID,Name,Color,Availability
P01,Product1,Red,Yes
P02,Product2,Blue,Yes
P03,Product1,Yellow,No
P01,Product1,Red,Yes
P04,Product1,Black,Yes
I need to check the Name of each product, and if it is not the same in all the records then I need to send an error message. I know the query below is used to find duplicates in the records, but I'm not sure how to modify it to check whether they all have the same value.
ProductsList.GroupBy(p => p.Name).Where(p => p.Count() > 1).SelectMany(x => x);
var first = myObjects.First();
bool allSame = myObjects.All(x=>x.Name == first.Name);
Enumerable.All() will return true if the lambda returns true for all elements of the collection. In this case we're checking that every object's Name property is equal to the first (and thus that they're all equal to each other; the transitive property is great, innit?). You can one-line this by inlining myObjects.First() but this will slow performance as First() will execute once for each object in the collection. You can also theoretically Skip() the first element as we know it's equal to itself.
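A sketch of that Skip() variant, still keeping First() hoisted out of the loop:
var first = myObjects.First();
bool allSame = myObjects.Skip(1).All(x => x.Name == first.Name);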
If I understand correctly, you want to check whether a product exists in the list:
using System.Linq;
private bool ItemExists(string nameOfProduct)
{
    return ProductsList.Any(p => p.Name == nameOfProduct);
}
UPD after author comment:
To get all the records that do not have the same name as the first record:
var firstName = ProductsList[0].Name;
var differentNames = ProductsList.Where(p => p.Name != firstName);
Another option (just to have all other names): ProductsList.Select(p => p.Name).Where(n => n != firstName).Distinct()
Old version
So, if there are at least two different names then you should return an error?
LINQ way: return ProductsList.Select(p => p.Name).Distinct().Count() <= 1
A more optimized way:
if (ProductsList.Count == 0)
    return true;

var name = ProductsList[0].Name;
for (var i = 1; i < ProductsList.Count; i++)
{
    if (ProductsList[i].Name != name)
        return false;
}
return true;
I have a list of details about a large number of files. This list contains the file ID, last modified date and the file path. The problem is there are duplicates of the files which are older versions and sometimes have different file paths. I want to only store the newest version of a file regardless of file path. So I created a loop that iterates through the ordered list, checks to see if the ID is unique and if it is, it gets stored in a new unique list.
var ordered = list.OrderBy(x => x.ID).ThenByDescending(x => x.LastModifiedDate);
List<Item> unique = new List<Item>();
string curAssetId = null;
foreach (Item result in ordered)
{
    if (!result.ID.Equals(curAssetId))
    {
        unique.Add(result);
        curAssetId = result.ID;
    }
}
However this is still allowing duplicates into the DB and I can't figure out why this code isn't working as expected. By duplicates I mean, the files have the same ID but different file paths, which like I said before shouldn't be an issue. I just want the latest version regardless of pathway. Can anyone else see what the issue is? Thanks
var ordered = listOfItems.OrderBy(x => x.AssetID).ThenByDescending(x => x.LastModifiedDate);
List<Item> uniqueItems = new List<Item>();
foreach (Item result in ordered)
{
    if (!uniqueItems.Any(x => x.AssetID.Equals(result.AssetID)))
    {
        uniqueItems.Add(result);
    }
}
This is what I have now, and it is still allowing duplicates.
This is because you are not searching the entire list to check whether the ID is unique.
List<Item> unique = new List<Item>();
string curAssetId = null; // here is the problem
foreach (Item result in ordered)
{
    if (!result.ID.Equals(curAssetId)) // here you only compare against the last value
    {
        unique.Add(result);
        curAssetId = result.ID; // you only assign the current ID value
    }
}
To solve this, change the following:
if (!result.ID.Equals(curAssetId)) // here you only compare against the last value
{
    unique.Add(result);
    curAssetId = result.ID; // you only assign the current ID value
}
to
if (!unique.Any(x => x.ID.Equals(result.ID)))
{
    unique.Add(result);
}
I don't know if this code is just simplified, but have you considered grouping on ID, sorting on LastModifiedDate, then just taking the first from each group?
Something like:
var unique = list.GroupBy(i => i.ID).Select(x => x.OrderByDescending(y => y.LastModifiedDate).First());
var ordered = list.OrderBy(x => x.ID).ThenByDescending(x => x.LastModifiedDate).Distinct() ??
For this purpose you have to create your own equality comparer (an IEqualityComparer<T>), after which you can use LINQ's Distinct method. See Enumerable.Distinct on MSDN.
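A minimal sketch of such a comparer, assuming Item exposes a string ID; LINQ to Objects' Distinct yields the first occurrence of each key it encounters, so ordering newest-first keeps the newest item per ID:
class ItemIdComparer : IEqualityComparer<Item>
{
    public bool Equals(Item x, Item y) => x?.ID == y?.ID;

    public int GetHashCode(Item obj) => obj.ID?.GetHashCode() ?? 0;
}

var unique = list
    .OrderByDescending(x => x.LastModifiedDate)
    .Distinct(new ItemIdComparer())
    .ToList();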
Also, I think you could stay with your current code, but you would have to modify it like this (as a sample):
var ordered = list.OrderByDescending(x => x.LastModifiedDate);
var unique = new List<Item>();
foreach (Item result in ordered)
{
    if (unique.Any(x => x.ID == result.ID))
        continue;

    unique.Add(result);
}
List<Item> p = new List<Item>();
var x = p.Select(c => new Item
{
    AssetID = c.AssetID,
    LastModifiedDate = c.LastModifiedDate.Date
}).OrderBy(y => y.AssetID).ThenByDescending(c => c.LastModifiedDate).Distinct();