Remove duplicates from array of objects - c#

I have a class called Customer that has several string properties like
firstName, lastName, email, etc.
I read in the customer information from a csv file that creates an array of the class:
Customer[] customers
I need to remove the duplicate customers having the same email address, leaving only 1 customer record for each particular email address.
I have done this using 2 loops but it takes nearly 5 minutes as there are usually 50,000+ customer records. Once I am done removing the duplicates, I need to write the customer information to another csv file (no help needed here).
If I did a Distinct in a loop how would I remove the other string variables that are a part of the class for that particular customer as well?
Thanks,
Andrew

With LINQ, you can do this in O(n) time (a single pass over the data) with GroupBy:
var uniquePersons = persons.GroupBy(p => p.Email)
.Select(grp => grp.First())
.ToArray();
Update
A bit on O(n) behavior of GroupBy.
GroupBy is implemented in LINQ (Enumerable.cs) so that the IEnumerable is iterated only once to create the grouping. A hash of the key provided (e.g. Email here) is used to find unique keys, and each element is added to the grouping that corresponds to its key.
See the GetGrouping code in Enumerable.cs, and these older posts for reference:
What's the asymptotic complexity of GroupBy operation?
What guarantees are there on the run-time complexity (Big-O) of LINQ methods?
Select is then also O(n), making the code above O(n) overall.
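To make the single-pass hashing idea concrete, here is a minimal sketch that does the same de-duplication without LINQ. The Customer shape, the CustomerDedup class, and the choice to compare emails case-insensitively are assumptions for illustration, not part of the original question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the question's Customer class.
public class Customer
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
}

public static class CustomerDedup
{
    // Keeps the first customer seen for each email, in one pass over the input.
    public static Customer[] DistinctByEmail(IEnumerable<Customer> customers)
    {
        // Assumption: emails that differ only by case count as duplicates.
        // Null or empty emails get no special handling in this sketch.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var unique = new List<Customer>();
        foreach (var customer in customers)
        {
            // HashSet<T>.Add returns false when the key is already present.
            if (seen.Add(customer.Email))
            {
                unique.Add(customer);
            }
        }
        return unique.ToArray();
    }
}
```

Calling CustomerDedup.DistinctByEmail(customers) returns the whole Customer objects, so the other properties travel with the email that was kept.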
Update 2
To handle empty/null values.
So, if there are instances where Email is null or empty, the simple GroupBy keeps just one object for the null key and one for the empty key.
One quick way to include all those objects with null/empty value is to use some unique keys at the run time for those objects, like
var tempEmailIndex = 0;
var uniqueNullAndEmpty = persons
.GroupBy(p => string.IsNullOrEmpty(p.Email)
? (++tempEmailIndex).ToString() : p.Email)
.Select(grp => grp.First())
.ToArray();

I'd do it like this:
public class Person {
public Person(string eMail, string Name) {
this.eMail = eMail;
this.Name = Name;
}
public string eMail { get; set; }
public string Name { get; set; }
}
public class eMailKeyedCollection : System.Collections.ObjectModel.KeyedCollection<string, Person> {
protected override string GetKeyForItem(Person item) {
return item.eMail;
}
}
public void testIt() {
var testArr = new Person[5];
testArr[0] = new Person("Jon@Mullen.com", "Jon Mullen");
testArr[1] = new Person("Jane@Cullen.com", "Jane Cullen");
testArr[2] = new Person("Jon@Cullen.com", "Jon Cullen");
testArr[3] = new Person("John@Mullen.com", "John Mullen");
testArr[4] = new Person("Jon@Mullen.com", "Test Other"); //same eMail as index 0...
var targetList = new eMailKeyedCollection();
foreach (var p in testArr) {
if (!targetList.Contains(p.eMail))
targetList.Add(p);
}
}
If the item is found in the collection, you could easily pick (and eventually modify) it with:
if (!targetList.Contains(p.eMail))
targetList.Add(p);
else {
var currentPerson=targetList[p.eMail];
//modify Name, Address whatever...
}

Related

Slow performance in getting model from list model using enumerable linq

I decided to load database records into a List<> of models and use Enumerable LINQ to get records from it. It has 141,856 records in it. What we found instead is that it is pretty slow.
So, any suggestion or recommendation on making it run very quickly?
public class Geography
{
public string Zipcode { get; set; }
public string City { get; set; }
public string State { get; set; }
}
var geography = new List<Geography>();
geography.Add(new Geography() { Zipcode = "32245", City = "Jacksonville", State = "Florida" });
geography.Add(new Geography() { Zipcode = "00001", City = "Atlanta", State = "Georgia" });
var result = geography.Where(x => (string.Equals(x.Zipcode, "32245", StringComparison.InvariantCultureIgnoreCase))).FirstOrDefault();
We have 86,000 vehicles in inventory and want to use parallel tasks to process them quickly, but it becomes very slow when geography is looked up.
await Task.WhenAll(vehicleInventoryRecords.Select(async inventory =>
{
var result = geography.Where(x => (string.Equals(x.Zipcode, inventory.Zipcode, StringComparison.InvariantCultureIgnoreCase))).FirstOrDefault();
}));
Use a Dictionary<string, Geography> to store the geography data. Looking up data in a dictionary by key is an O(1) operation, while for a list it is O(n).
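As a rough sketch of that suggestion: build the dictionary once, then look each ZIP code up by key. The GeographyIndex class and its method names are made up for illustration, and the sketch assumes that keeping the first record per ZIP code is acceptable.

```csharp
using System;
using System.Collections.Generic;

public class Geography
{
    public string Zipcode { get; set; }
    public string City { get; set; }
    public string State { get; set; }
}

// Hypothetical helper; the name and shape are assumptions, not from the question.
public static class GeographyIndex
{
    // Builds the index once. Assumption: if a ZIP code repeats, the first record wins.
    public static Dictionary<string, Geography> Build(IEnumerable<Geography> records)
    {
        var byZip = new Dictionary<string, Geography>(StringComparer.OrdinalIgnoreCase);
        foreach (var record in records)
        {
            if (!byZip.ContainsKey(record.Zipcode))
            {
                byZip.Add(record.Zipcode, record);
            }
        }
        return byZip;
    }

    // Returns the matching record, or null when the ZIP code is unknown.
    public static Geography Find(Dictionary<string, Geography> byZip, string zipcode)
    {
        return byZip.TryGetValue(zipcode, out var match) ? match : null;
    }
}
```

A Dictionary supports multiple concurrent readers as long as nothing modifies it, so an index built up front like this can be read from the parallel inventory loop.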
You haven't mentioned if your ZIP codes are unique, so I'll assume they aren't. If they are - look at Giorgi's answer and skip to part 2 of my answer.
1. Use lookups
Since you're looking up your geography list multiple times by the same property, you should group the values by Zipcode. You can do this easily by using ToLookup, which creates a Lookup object. It is similar to a Dictionary, except that it can hold multiple values per key. Passing StringComparer.InvariantCultureIgnoreCase as the second parameter to ToLookup makes it case-insensitive.
var geography = new List<Geography>();
geography.Add(new Geography { Zipcode = "32245", City = "Jacksonville", State = "Florida" });
geography.Add(new Geography { Zipcode = "00001", City = "Atlanta", State = "Georgia" });
var geographyLookup = geography.ToLookup(x => x.Zipcode, StringComparer.InvariantCultureIgnoreCase);
var result = geographyLookup["32245"].FirstOrDefault();
This should increase your performance considerably.
2. Parallelize with PLINQ
The way you parallelize your lookups is questionable. Luckily, .NET has PLINQ. You can use AsParallel and a parallel Select to iterate over your vehicleInventoryRecords in parallel like this:
var results = vehicleInventoryRecords.AsParallel().Select(x => geographyLookup[x.Zipcode].FirstOrDefault());
Using Parallel.ForEach is another good option.

Moving an item where there are multiple matches to the top of a list

I have a list which can contain multiple records of people:
result =people.Select(p => new PersonDetail
{
Involvement = p._Discriminator,
FullName = p.BuildFullName(),
DateOfBirth = p.DateOfBirth != null ? p.DateOfBirth.Value.ToShortDateString() : string.Empty,
Race = MappingExtensions.MergeCodeAndText(p.RaceCode, p.RaceText),
Sex = MappingExtensions.MergeCodeAndText(p.SexCode, p.SexText),
Height = p.Height,
Weight = p.Weight != null ? p.Weight.ToString() : string.Empty,
EyeColor = MappingExtensions.MergeCodeAndText(p.EyeColorCode, p.EyeColorText),
HairColor = MappingExtensions.MergeCodeAndText(p.HairColor1Code, p.HairColor1Text),
//...
}).ToList();
I want to order this list by Involvement type (victim, suspect, witness).
I've tried the following using Remove and Insert:
foreach (var i in result.Where(i => i.Involvement.ToLower() =="suspect"))
{
result.Remove(i);
result.Insert(0, i);
}
return result;
On the first loop it works as I would expect however on the second loop I get an exception thrown. I suspect there is some recursion going on or the exception is thrown because it keeps finding the record I promoted on the first pass and can't get by it.
I wanted to perform a loop vs just one pass as there might be multiple records that are marked as suspect. I need to promote all of these to the top above witness or victims. The other involvements are not relevant in ordering.
example:
Bob "suspect"
Jane"suspect"
Kenny "witness"
Joe "victim"
Any suggestions on how to select multiple records and ensure they are placed at the top of the list above others?
Thanks for any ideas or suggestions
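First of all, you can't modify the result collection while iterating over it with foreach: removing or inserting items invalidates the enumerator, and the next iteration throws an InvalidOperationException. See the MSDN documentation on foreach for an explanation.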
You can use OrderBy to reorder your collection:
result = result.OrderBy(r => r.Involvement.ToLower() =="suspect" ? 0 : 1).ToList();
Expression in OrderBy will promote "suspect" items to the top of the result.
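A small self-contained sketch of this, using a simplified row type (PersonRow and SuspectsFirst are hypothetical names, not the question's PersonDetail). It relies on OrderBy being a stable sort, so rows with the same key keep their original relative order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified, hypothetical stand-in for PersonDetail.
public class PersonRow
{
    public string FullName { get; set; }
    public string Involvement { get; set; }
}

public static class SuspectsFirst
{
    // Suspects get key 0, everyone else key 1; the stable sort keeps
    // the original order inside each of the two groups.
    public static List<PersonRow> Promote(IEnumerable<PersonRow> rows)
    {
        return rows
            .OrderBy(r => string.Equals(r.Involvement, "suspect", StringComparison.OrdinalIgnoreCase) ? 0 : 1)
            .ToList();
    }
}
```

Because this builds a new list instead of mutating the one being enumerated, it avoids the exception from the Remove/Insert loop.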
The currently accepted answer only succeeds if the only precedence the ordering cares about is Suspect being at the top of the list, e.g. for this input:
Kenny, Witness
Bob, Suspect
Joe, Victim
Jane, Suspect
In other words, if the ordering precedence also includes Witness and Victim, the accepted answer is correct here only because Witness already comes before Victim in the input; with the accepted answer, the result will correctly be:
Bob, Suspect
Jane, Suspect
Kenny, Witness
Joe, Victim
However, if the ordering precedence must include strings other than "Suspect", then the accepted answer will fail, i.e. if the list comes in as,
Jill, Victim
Kenny, Witness
Bob, Suspect
Joe, Victim
Jane, Suspect
Then the result will be:
Bob, Suspect
Jane, Suspect
Jill, Victim
Kenny, Witness
Joe, Victim
But, the correct result should be (assuming Witness takes precedence over Victim):
Bob, Suspect
Jane, Suspect
Kenny, Witness
Jill, Victim
Joe, Victim
To custom sort based on a non-alpha sort, you'll need to provide some type of IComparer or similar:
// Sample class with a name and involvement:
public class Detail
{
public string Name { get; set; }
public string Involvement { get; set; }
public Detail( string name, string involvement )
{
Name = name;
Involvement = involvement;
}
}
// implementation of IComparer that uses a custom alpha sort:
public class DetailComparer : IComparer<Detail>
{
static readonly List<string> Ordered = new List<string> { "suspect", "witness", "victim" };
public int Compare( Detail x, Detail y )
{
int i = Ordered.FindIndex( str => str.Equals( x.Involvement, StringComparison.OrdinalIgnoreCase ) );
int j = Ordered.FindIndex( str => str.Equals( y.Involvement, StringComparison.OrdinalIgnoreCase ) );
if( i > j ) return 1;
if( i < j ) return -1;
return 0;
}
}
The list can then be sorted by providing the comparer:
var details = new List<Detail>
{
new Detail("Jill", "Victim"),
new Detail("Kenny", "Witness"),
new Detail("Bob", "Suspect"),
new Detail("Joe", "Victim"),
new Detail("Jane", "Suspect"),
};
details.Sort( new DetailComparer() );
By providing the custom IComparer, any precedence can be declared for every "Involvement".
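One caveat: List<T>.Sort is documented as an unstable sort, so two entries with the same involvement (Jill and Joe above) are not guaranteed to keep their input order. If that matters, a stable alternative is OrderBy with a numeric rank. The following is a sketch under that assumption; InvolvementOrder and its rank table are hypothetical names, and unknown involvement values are arbitrarily placed last.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class InvolvementOrder
{
    // Lower rank sorts first; the lookup is case-insensitive.
    static readonly Dictionary<string, int> Rank =
        new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
        {
            ["suspect"] = 0,
            ["witness"] = 1,
            ["victim"] = 2
        };

    // Assumption: null or unknown involvement values sort after all known ones.
    static int RankOf(string involvement)
    {
        return involvement != null && Rank.TryGetValue(involvement, out var rank)
            ? rank
            : int.MaxValue;
    }

    // OrderBy is stable, so items with equal rank keep their input order.
    public static List<T> Sort<T>(IEnumerable<T> items, Func<T, string> involvementOf)
    {
        return items.OrderBy(item => RankOf(involvementOf(item))).ToList();
    }
}
```

With the Detail class above, this would be called as InvolvementOrder.Sort(details, d => d.Involvement).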

How to quickly match names (fname, lname) in different order with full name c#

I have this linq query I am trying to optimize. I want to replace this query with a fast constant (preferably) retrieval of the value. I thought about a twin key dictionary but I have no idea which order the fname or lname will come first. I wanted to ask here if there is a fast way to do this.
I want to take a list of names, search through it for fname-lname (the - is the delimiter), and return all entries that match the full name being searched. The list of people could be moderately large.
var nameList = from p in listOfPeople
where ((p.lname+"-"+p.fname == fullName)
|| (p.fname+"-"+p.lname == fullName))
select p;
Edit: listOfPeople can be any datatype, not necessarily a list.
Here's how you can create your dictionary.
var nameLookup = new Dictionary<Tuple<string,string>, List<Person>>();
foreach(var person in listOfPeople)
{
List<Person> people = null;
var firstLast = Tuple.Create(person.fname, person.lname);
if(nameLookup.TryGetValue(firstLast, out people))
{
people.Add(person);
}
else
{
nameLookup.Add(firstLast, new List<Person> { person });
}
// If the person's first and last name are the same we don't want to add them twice.
if(person.fname == person.lname)
{
continue;
}
var lastFirst = Tuple.Create(person.lname, person.fname);
if(nameLookup.TryGetValue(lastFirst, out people))
{
people.Add(person);
}
else
{
nameLookup.Add(lastFirst, new List<Person> { person });
}
}
Then your lookup would be
// split by the delimiter to get these if needed
var fullName = Tuple.Create(firstName, lastName);
List<Person> nameList = null;
if(!nameLookup.TryGetValue(fullName, out nameList))
{
nameList = new List<Person>();
}
It's important to keep the first and last names separate, or you have to pick a delimiter that will not show up in the first or last name. A hyphen "-" could be part of a first or last name. If the delimiter is guaranteed not to be part of the first or last name, you can just replace Tuple.Create(x,y) with x + delimiter + y and change the dictionary to Dictionary<string, List<Person>>.
Additionally the reason for having a List<Person> as the value of the dictionary is to handle cases like "Gary William" and "William Gary" being two different people.
In the type behind p, which I guess is a "People" type, I would add a FullName property to compare against:
public string FullName {get {return fname + "-" + lname;}}
And modify your LINQ with:
where string.Equals(p.FullName, fullName)
If you REALLY want to use this with ANY datatype, which would include just string or even DataTable, I really don't see any better way than the way you did it.
I tested with Stopwatch and this appears to be a little more effective
var nameList = from n in(
from p in listOfPeople
select new{FullName = p.fname +"-"+ p.lname}
)
where n.FullName==fullName
select n;

How can I take objects from the second set of objects which don't exist in the first set of objects in fast way?

I have records in two databases. That is the entity in the first database:
public class PersonInDatabaseOne
{
public string Name { get; set; }
public string Surname { get; set; }
}
That is the entity in the second database:
public class PersonInDatabaseTwo
{
public string FirstName { get; set; }
public string LastName { get; set; }
}
How can I get the records from the second database which don't exist in the first database (the first name and the last name must be different from those in the first database)? Right now I have something like this, but it is VERY SLOW, too slow:
List<PersonInDatabaseOne> peopleInDatabaseOne = new List<PersonInDatabaseOne>();
// Here I generate the objects, but in reality I take them from the database:
for (int i = 0; i < 100000; i++)
{
peopleInDatabaseOne.Add(new PersonInDatabaseOne { Name = "aaa" + i, Surname = "aaa" + i });
}
List<PersonInDatabaseTwo> peopleInDatabaseTwo = new List<PersonInDatabaseTwo>();
// Here I generate the objects, but in reality I take them from the database:
for (int i = 0; i < 10000; i++)
{
peopleInDatabaseTwo.Add(new PersonInDatabaseTwo { FirstName = "aaa" + i, LastName = "aaa" + i });
}
for (int i = 0; i < 10000; i++)
{
peopleInDatabaseTwo.Add(new PersonInDatabaseTwo { FirstName = "bbb" + i, LastName = "bbb" + i });
}
List<PersonInDatabaseTwo> peopleInDatabaseTwoWhichNotExistInDatabaseOne = new List<PersonInDatabaseTwo>();
// BELOW CODE IS VERY SLOW:
foreach (PersonInDatabaseTwo personInDatabaseTwo in peopleInDatabaseTwo)
{
if (!peopleInDatabaseOne.Any(x => x.Name == personInDatabaseTwo.FirstName && x.Surname == personInDatabaseTwo.LastName))
{
peopleInDatabaseTwoWhichNotExistInDatabaseOne.Add(personInDatabaseTwo);
}
};
The fastest way is dependent on the number of entities, and what indexes you already have.
If there are only a few entities, what you already have performs better, because multiple scans of a small set take less time than creating HashSet objects.
If all of your entities fit in memory, the best way is to build a HashSet out of them and use Except, which is detailed nicely by @alex.feigin.
If you can't afford to load all entities into memory, you need to divide them into batches based on the comparison key, load each batch into memory, and apply the HashSet method repeatedly. Note that the batches can't be based on the number of records, only on the comparison key. For example, load all entities with names starting with 'A', then 'B', and so on.
If you already have an index on the database on the comparison key (like, in your case, FirstName and LastName) in one of the databases, you can retrieve a sorted list from the database. This will help you do binary search (http://en.wikipedia.org/wiki/Binary_search_algorithm) on the sorted list for comparison. See https://msdn.microsoft.com/en-us/library/w4e7fxsh(v=vs.110).aspx
If you already have an index on the database on the comparison key on both databases, you can get to do this in O(n), and in a scalable way (any number of records). You need to loop through both lists and find the differences only once. See https://stackoverflow.com/a/161535/187996 for more details.
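For the in-memory case, the HashSet idea from the list above can be sketched as follows. PersonDiff and OnlyInSecond are hypothetical names; the entity classes mirror the ones in the question. Tuple<string, string> is used as the key because it compares by value, so two tuples with the same names hash and compare as equal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class PersonInDatabaseOne
{
    public string Name { get; set; }
    public string Surname { get; set; }
}

public class PersonInDatabaseTwo
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

public static class PersonDiff
{
    // Returns the people from the second set whose (first, last) name pair
    // does not occur in the first set.
    public static List<PersonInDatabaseTwo> OnlyInSecond(
        IEnumerable<PersonInDatabaseOne> first,
        IEnumerable<PersonInDatabaseTwo> second)
    {
        // One pass over the first set to build the hash set of known name pairs.
        var known = new HashSet<Tuple<string, string>>(
            first.Select(p => Tuple.Create(p.Name, p.Surname)));

        // One pass over the second set; each Contains call is a hash lookup.
        return second
            .Where(p => !known.Contains(Tuple.Create(p.FirstName, p.LastName)))
            .ToList();
    }
}
```

This replaces the nested scan (one Any over the whole first list per record) with a single hash lookup per record.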
Edit: with respect to the comments - using a real model and a dictionary instead of a simple set:
Try hashing your list into a Dictionary to hold your people objects; as the key, try a Tuple instead of comparing name1==name2 && lname1==lname2.
This will potentially then look like this:
// Some people1 and people2 lists of models already exist:
var sw = Stopwatch.StartNew();
var removeThese = people1.Select(x=>Tuple.Create(x.Name,x.Surname));
var dic2 = people2.ToDictionary(x=>Tuple.Create(x.FirstName,x.LastName),x=>x);
var result = dic2.Keys.Except(removeThese).Select(x=>dic2[x]).ToList();
Console.WriteLine(sw.Elapsed);
I hope this helps.

How do i get the difference in two lists in C#?

Ok so I have two lists in C#
List<Attribute> attributes = new List<Attribute>();
List<string> songs = new List<string>();
One is a list of strings and the other is a list of an Attribute object that I created. It's very simple:
class Attribute
{
public string size { get; set; }
public string link { get; set; }
public string name { get; set; }
public Attribute(){}
public Attribute(string s, string l, string n)
{
size = s;
link = l;
name = n;
}
}
I now have to compare to see what songs are not in the attributes name so for example
songs.Add("something");
songs.Add("another");
songs.Add("yet another");
Attribute a = new Attribute("500", "http://google.com", "something" );
attributes.Add(a);
I want a way to return "another" and "yet another" because they are not in the attributes list name
so for pseudocode
difference = songs - attributes.names
var difference = songs.Except(attributes.Select(s=>s.name)).ToList();
edit
Added ToList() to make it a list
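One detail to be aware of: Except is a set operation, so a song that appears twice in songs is returned only once. If repeated entries should be kept, a filter against a HashSet of the attribute names is a possible alternative. This is a sketch; SongDiff and NotIn are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SongDiff
{
    // Returns every song whose name is not among the attribute names,
    // keeping the input order and any repeated songs.
    public static List<string> NotIn(IEnumerable<string> songs, IEnumerable<string> attributeNames)
    {
        var names = new HashSet<string>(attributeNames);
        return songs.Where(song => !names.Contains(song)).ToList();
    }
}
```

With the lists from the question, it would be called as SongDiff.NotIn(songs, attributes.Select(a => a.name)).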
It's worth pointing out that the answers posted here will return a list of songs not present in attributes.names, but it won't give you a list of attributes.names not present in songs.
While this is what the OP wanted, the title may be a little misleading, especially if (like me) you came here looking for a way to check whether the contents of two lists differ. If this is what you want, you can use the following:-
var differences = new HashSet<string>(songs);
differences.SymmetricExceptWith(attributes.Select(a => a.name));
if (differences.Any())
{
// The lists differ.
}
This is the way to find all the songs which aren't included in attributes names:
var result = songs
.Where(song => !attributes.Select(a => a.name).ToList().Contains(song));
The answer using Except is also perfect and probably more efficient.
EDIT: This syntax has one advantage if you're using it in LINQ to SQL: it translates into a NOT IN SQL predicate. Except is not translated to anything in SQL, so in that context all the records would be retrieved from the database and excepted on the app side, which is much less efficient.
var diff = songs.Except(attributes.Select(a => a.name)).ToList();
