Slow performance getting a model from a List using LINQ to Objects - C#

I decided to load the database records into a List<> model and use LINQ to Objects to get records from it. The list has 141,856 records in it. What we found is that it is pretty slow.
So, any suggestions or recommendations on making it run quickly?
public class Geography
{
public string Zipcode { get; set; }
public string City { get; set; }
public string State { get; set; }
}
var geography = new List<Geography>();
geography.Add(new Geography() { Zipcode = "32245", City = "Jacksonville", State = "Florida" });
geography.Add(new Geography() { Zipcode = "00001", City = "Atlanta", State = "Georgia" });
var result = geography.Where(x => string.Equals(x.Zipcode, "32245", StringComparison.InvariantCultureIgnoreCase)).FirstOrDefault();
We have 86,000 vehicles in inventory and want to use parallel tasks to get the work done quickly, but it becomes very slow when the geography lookup runs.
await Task.WhenAll(vehicleInventoryRecords.Select(async inventory =>
{
var result = geography.Where(x => string.Equals(x.Zipcode, inventory.Zipcode, StringComparison.InvariantCultureIgnoreCase)).FirstOrDefault();
}));

Use a Dictionary<string, Geography> to store the geography data. Looking up data in a dictionary by key is an O(1) operation, while for a list it is O(n).
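A minimal sketch of that approach, assuming the Zipcode values are unique (ToDictionary throws on duplicate keys) and reusing the Geography list from the question:
var geographyByZip = geography.ToDictionary(
    x => x.Zipcode,
    StringComparer.InvariantCultureIgnoreCase);
// O(1) lookup by key instead of scanning all 141,856 records:
geographyByZip.TryGetValue("32245", out Geography match);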

You haven't mentioned if your ZIP codes are unique, so I'll assume they aren't. If they are - look at Giorgi's answer and skip to part 2 of my answer.
1. Use lookups
Since you're looking up your geography list multiple times by the same property, you should group the values by Zipcode. You can do this easily with ToLookup - this creates a Lookup object. It is similar to a Dictionary, except that it can hold multiple values per key. Passing StringComparer.InvariantCultureIgnoreCase as the second parameter to ToLookup makes it case-insensitive.
var geography = new List<Geography>();
geography.Add(new Geography { Zipcode = "32245", City = "Jacksonville", State = "Florida" });
geography.Add(new Geography { Zipcode = "00001", City = "Atlanta", State = "Georgia" });
var geographyLookup = geography.ToLookup(x => x.Zipcode, StringComparer.InvariantCultureIgnoreCase);
var result = geographyLookup["32245"].FirstOrDefault();
This should increase your performance considerably.
2. Parallelize with PLINQ
The way you parallelize your lookups is questionable. Luckily, .NET has PLINQ. You can use AsParallel and a parallel Select to iterate over your vehicleInventoryRecords in parallel, like this:
var results = vehicleInventoryRecords.AsParallel().Select(x => geographyLookup[x.Zipcode].FirstOrDefault());
Using Parallel.ForEach is another good option.
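For completeness, a rough sketch of the Parallel.ForEach variant (assuming the geographyLookup from part 1, and that whatever the loop body writes to is thread-safe):
Parallel.ForEach(vehicleInventoryRecords, inventory =>
{
    // The lookup is only read here, which is safe from multiple threads.
    var match = geographyLookup[inventory.Zipcode].FirstOrDefault();
    // do something with match, e.g. assign it to the inventory record
});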

Related

Grouping and sum

I have a list that contains the following POCO class.
public class BoxReportView
{
public DateTime ProductionPlanWeekStarting { get; set; }
public DateTime ProductionPlanWeekEnding { get; set; }
public string BatchNumber { get; set; }
public string BoxRef { get; set; }
public string BoxName { get; set; }
public decimal Qty { get; set; }
public FUEL_KitItem KitItem { get; set; }
public decimal Multiplier { get; set; }
}
I want to group the report by BoxName and also sum the Qty, so I tried the following:
var results = from line in kitItemsToGroup
group line by line.BoxName into g
select new BoxReportView
{
BoxRef = g.First().BoxRef,
BoxName = g.First().BoxName,
Qty = g.Count()
};
In my old report I was just doing this:
var multiplier = finishedItem.SOPOrderReturnLine.LineQuantity -
finishedItem.SOPOrderReturnLine.StockUnitDespatchReceiptQuantity;
foreach (KitItem kItem in kitItems.Cast<KitItem>().Where(z => z.IsBox == true).ToList())
{
kitItemsToGroup.Add(new BoxReportView() {
BatchNumber = _batchNumber,
ProductionPlanWeekEnding = _weekEndDate,
ProductionPlanWeekStarting = _weekStartDate,
BoxRef = kItem.StockCode,
KitItem = kItem,
Multiplier = multiplier,
Qty = kItem.Qty });
}
Then I was just returning
return kitItemsToGroup;
But since I am using var for the grouped result, I am not sure what the best way is to handle the grouping and the sum by BoxName and Qty.
Whether it is the best way depends upon your priorities. Is processing speed important, or is it more important that the code is easy to understand, easy to test, easy to change and easy to debug?
One of the advantages of LINQ is that it tries to avoid enumerating the source more often than necessary.
Are you sure that the users of this code will always need the complete collection? Could it be that, now or in the near future, someone only wants the first element? Or decides to stop enumerating after fetching the 20th element and seeing that there is nothing of interest?
When using LINQ, try to return IEnumerable<...> for as long as possible. Let the end user who interprets your LINQed data decide whether to take only FirstOrDefault(), Count() everything, put it in a Dictionary, or whatever. It is a waste of processing power to create a List if it is not going to be used as a List.
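For example, a method that leaves that decision to the caller might look like this (a sketch; the method name is illustrative):
// Returns a lazy sequence; nothing is enumerated until the caller asks for it.
public IEnumerable<IGrouping<string, BoxReportView>> GroupByBoxName(
    IEnumerable<BoxReportView> kitItemsToGroup)
{
    return kitItemsToGroup.GroupBy(line => line.BoxName);
}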
Your LINQ code and your foreach do completely different things. Alas, it is quite common here on Stack Overflow for people to ask for LINQ statements without really specifying their requirements, so I'll have to guess at something in between your LINQ statement and your foreach.
Requirement: group the input sequence of kitItems, which are expected to be Fuel_KitItems, into groups with the same BoxName, and select several properties from every Fuel_KitItem in each group.
var kitItemGroups = kitItems
.Cast<Fuel_KitItem>() // only needed if kitItems is not IEnumerable<Fuel_KitItem>
// make groups of Fuel_KitItems with same BoxName:
.GroupBy(fuelKitItem => fuelKitItem.BoxName,
// ResultSelector, take the BoxName and all fuelKitItems with this BoxName:
(boxName, fuelKitItemsWithThisBoxName) => new
{
// Select only the properties you plan to use:
BoxName = boxName,
FuelKitItems = fuelKitItemsWithThisBoxName.Select(fuelKitItem => new
{
// Only Select the properties that you plan to use
BatchNumber = fuelKitItem.BatchNumber,
Qty = fuelKitItem.Qty,
...
// Not needed, they are all equal to boxName:
// BoxName = fuelKitItem.BoxName
})
// only do ToList if you are certain that the user of the result
// will need the complete list of fuelKitItems in this group
.ToList(),
});
Usage:
var kitItemGroups = ...
// I only need the KitItemGroups with a BoxName starting with "A"
var result1 = kitItemGroups.Where(group => group.BoxName.StartsWith("A"))
.ToList();
// Or I only want the first three after sorting by group size
var result2 = kitItemGroups.OrderBy(group => group.FuelKitItems.Count())
.Take(3)
.ToList();
Efficiency improvements: as long as you don't know how your LINQ result will be used, don't make it a List. If you know that chances are high that the Count of group.FuelKitItems will be needed, do a ToList there.
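If what the question ultimately needs is just the total Qty per BoxName, a short sketch along those lines (using Sum instead of Count; property names taken from the BoxReportView class above):
var totalsPerBox = kitItemsToGroup
    .GroupBy(line => line.BoxName)
    .Select(g => new
    {
        BoxName = g.Key,
        BoxRef = g.First().BoxRef,
        TotalQty = g.Sum(line => line.Qty)
    });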

Using Rx to merge multiple sources by key

I'm kinda new to the Reactive Extensions, but since I have a very data-flow heavy problem, I'm assuming it could massively simplify my implementation. But it seems my problem is a bit more exotic than I anticipated.
Problem
I have multiple data sources which all emit part of the data for the same entity, e.g. datasource1 emits the first name of a person and datasource2 emits the last name of a person. The arrival of this data is completely unpredictable.
What I need to do now is to observe both of those sources and use some kind of operator or subject which allows me to await both source observables. I only want to continue once both data sources have returned their specific part. Both my sources also pass a key for the data, so it's possible to link them together at a later point.
Is there a construct built into Rx which allows me to do that? Or is Rx simply the wrong toolset for my problem?
I can't judge whether Rx or async/await or TPL-Dataflow is a better solution, since that would probably depend on your larger application. Some reproducible code would really help.
Anyhow, here's an Rx solution. I'm assuming for now datasource1 and datasource2 are observables of different types, or easily convertible to observables of different types. If they were observables of the same type, this solution would also work, but you would have other options as well:
var firstNameSource = new Subject<FirstNameMessage>();
var lastNameSource = new Subject<LastNameMessage>();
var timeout = TimeSpan.FromSeconds(1); //Set to length of time willing to wait
var join = firstNameSource.Join(lastNameSource,
fnm => Observable.Timer(timeout),
lnm => Observable.Timer(timeout),
(fnm, lnm) => new { FirstNameMessage = fnm, LastNameMessage = lnm }
)
.Where(a => a.FirstNameMessage.Id == a.LastNameMessage.Id)
.Select(a => Tuple.Create(a.FirstNameMessage.Name, a.LastNameMessage.Name))
.Timeout(timeout)
.Catch(Observable.Empty<Tuple<string, string>>());
Using these sample classes:
public class FirstNameMessage
{
public int Id { get; set; }
public string Name { get; set; }
}
public class LastNameMessage
{
public int Id { get; set; }
public string Name { get; set; }
}
Here's some sample subscription/execution code:
join.Subscribe(t => Console.WriteLine($"{t.Item1} {t.Item2}"), () => Console.WriteLine("No more names!"));
firstNameSource.OnNext(new FirstNameMessage{Id = 1, Name = "John" });
lastNameSource.OnNext(new LastNameMessage{Id = 1, Name = "Smith" });
lastNameSource.OnNext(new LastNameMessage { Id = 2, Name = "Jones" });
await Task.Delay(TimeSpan.FromMilliseconds(500));
firstNameSource.OnNext(new FirstNameMessage { Id = 2, Name = "Paul" });
firstNameSource.OnNext(new FirstNameMessage { Id = 3, Name = "Larry" });
await Task.Delay(TimeSpan.FromMilliseconds(1500));
lastNameSource.OnNext(new LastNameMessage { Id = 3, Name = "Fail" });
firstNameSource.OnNext(new FirstNameMessage { Id = 4, Name = "Won't Work" });
lastNameSource.OnNext(new LastNameMessage { Id = 4, Name = "Subscription terminated" });
Explanation:
The crucial part of this solution is the Join operator. Whereas a standard DB/LINQ join combines things by key, Rx's Join combines by time window: the Join above pairs up any FirstNameMessage and LastNameMessage that arrive within timeout of each other. Since we also want to join by key, the Where clause then keeps only the pairs with matching Ids.
The Timeout and Catch calls at the end are possibly superfluous: they just serve to terminate the subscription. It sounds like your solution may only be waiting for one pair of values, not many, so terminating like this may be what you want.
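If the streams should instead keep producing joined pairs indefinitely, a variant without those terminating operators would look roughly like this (a sketch; the join window is still bounded by the Timer durations):
var openEndedJoin = firstNameSource.Join(lastNameSource,
    fnm => Observable.Timer(timeout),
    lnm => Observable.Timer(timeout),
    (fnm, lnm) => new { FirstNameMessage = fnm, LastNameMessage = lnm })
    .Where(a => a.FirstNameMessage.Id == a.LastNameMessage.Id)
    .Select(a => Tuple.Create(a.FirstNameMessage.Name, a.LastNameMessage.Name));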

Remove duplicates from array of objects

I have a class called Customer that has several string properties like
firstName, lastName, email, etc.
I read in the customer information from a csv file that creates an array of the class:
Customer[] customers
I need to remove the duplicate customers having the same email address, leaving only 1 customer record for each particular email address.
I have done this using 2 loops but it takes nearly 5 minutes as there are usually 50,000+ customer records. Once I am done removing the duplicates, I need to write the customer information to another csv file (no help needed here).
If I did a Distinct in a loop, how would I handle the other string properties that are part of the class for that particular customer as well?
Thanks,
Andrew
With LINQ, you can do this in O(n) time (a single pass) with GroupBy:
var uniquePersons = persons.GroupBy(p => p.Email)
.Select(grp => grp.First())
.ToArray();
Update
A bit on O(n) behavior of GroupBy.
GroupBy is implemented in LINQ (Enumerable.cs) so that the IEnumerable is iterated only once to create the grouping. A hash of the provided key (e.g. Email here) is used to find unique keys, and each element is added to the Grouping corresponding to its key.
Please see this GetGrouping code. And some old posts for reference.
What's the asymptotic complexity of GroupBy operation?
What guarantees are there on the run-time complexity (Big-O) of LINQ methods?
The Select is obviously O(n) as well, making the above code O(n) overall.
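To illustrate the idea (a sketch of hash-based de-duplication, not the actual Enumerable.cs source; it assumes the same persons collection and Email property used above):
var seenEmails = new HashSet<string>();
var uniqueCustomers = new List<Customer>();
foreach (var person in persons)
{
    // HashSet<T>.Add returns false when the email has been seen before,
    // so each email contributes exactly one record - a single O(n) pass.
    if (seenEmails.Add(person.Email))
        uniqueCustomers.Add(person);
}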
Update 2
To handle empty/null values:
If there are instances where the value of Email is null or empty, the simple GroupBy will keep just one of the objects with a null Email and one with an empty Email. One quick way to include all of the objects with a null/empty value is to generate unique keys for them at run time, like
var tempEmailIndex = 0;
var uniqueNullAndEmpty = persons
.GroupBy(p => string.IsNullOrEmpty(p.Email)
? (++tempEmailIndex).ToString() : p.Email)
.Select(grp => grp.First())
.ToArray();
I'd do it like this:
public class Person {
public Person(string eMail, string Name) {
this.eMail = eMail;
this.Name = Name;
}
public string eMail { get; set; }
public string Name { get; set; }
}
public class eMailKeyedCollection : System.Collections.ObjectModel.KeyedCollection<string, Person> {
protected override string GetKeyForItem(Person item) {
return item.eMail;
}
}
public void testIt() {
var testArr = new Person[5];
testArr[0] = new Person("Jon#Mullen.com", "Jon Mullen");
testArr[1] = new Person("Jane#Cullen.com", "Jane Cullen");
testArr[2] = new Person("Jon#Cullen.com", "Jon Cullen");
testArr[3] = new Person("John#Mullen.com", "John Mullen");
testArr[4] = new Person("Jon#Mullen.com", "Test Other"); //same eMail as index 0...
var targetList = new eMailKeyedCollection();
foreach (var p in testArr) {
if (!targetList.Contains(p.eMail))
targetList.Add(p);
}
}
If the item is found in the collection, you can easily retrieve (and, if needed, modify) it with:
if (!targetList.Contains(p.eMail))
targetList.Add(p);
else {
var currentPerson=targetList[p.eMail];
//modify Name, Address whatever...
}

How can I take the objects from the second set which don't exist in the first set, in a fast way?

I have records in two databases. This is the entity in the first database:
public class PersonInDatabaseOne
{
public string Name { get; set; }
public string Surname { get; set; }
}
This is the entity in the second database:
public class PersonInDatabaseTwo
{
public string FirstName { get; set; }
public string LastName { get; set; }
}
How can I get the records from the second database which don't exist in the first database (i.e. where the first name and last name combination does not appear in the first database)? Right now I have something like this, but it is VERY SLOW, too slow:
List<PersonInDatabaseOne> peopleInDatabaseOne = new List<PersonInDatabaseOne>();
// Here I generate objects, but in reality I fetch them from the database:
for (int i = 0; i < 100000; i++)
{
peopleInDatabaseOne.Add(new PersonInDatabaseOne { Name = "aaa" + i, Surname = "aaa" + i });
}
List<PersonInDatabaseTwo> peopleInDatabaseTwo = new List<PersonInDatabaseTwo>();
// Here I generate objects, but in reality I fetch them from the database:
for (int i = 0; i < 10000; i++)
{
peopleInDatabaseTwo.Add(new PersonInDatabaseTwo { FirstName = "aaa" + i, LastName = "aaa" + i });
}
for (int i = 0; i < 10000; i++)
{
peopleInDatabaseTwo.Add(new PersonInDatabaseTwo { FirstName = "bbb" + i, LastName = "bbb" + i });
}
List<PersonInDatabaseTwo> peopleInDatabaseTwoWhichNotExistInDatabaseOne = new List<PersonInDatabaseTwo>();
// BELOW CODE IS VERY SLOW:
foreach (PersonInDatabaseTwo personInDatabaseTwo in peopleInDatabaseTwo)
{
if (!peopleInDatabaseOne.Any(x => x.Name == personInDatabaseTwo.FirstName && x.Surname == personInDatabaseTwo.LastName))
{
peopleInDatabaseTwoWhichNotExistInDatabaseOne.Add(personInDatabaseTwo);
}
};
The fastest way depends on the number of entities and on what indexes you already have.
If there are only a few entities, what you already have performs better, because a few scans of a small set cost less than building HashSet objects.
If all of your entities fit in memory, the best way is to build a HashSet out of them and use Except, which is detailed nicely by alex.feigin; a minimal sketch of that idea also follows below.
If you can't afford to load all entities into memory, you need to divide them into batches based on the comparison key, load each batch into memory, and apply the HashSet method repeatedly. Note that batches can't be based on the number of records; they must be based on the comparison key. For example, load all entities with names starting with 'A', then 'B', and so on.
If one of the databases already has an index on the comparison key (in your case, FirstName and LastName), you can retrieve a sorted list from that database and use binary search (http://en.wikipedia.org/wiki/Binary_search_algorithm) on it for the comparison. See https://msdn.microsoft.com/en-us/library/w4e7fxsh(v=vs.110).aspx
If both databases have an index on the comparison key, you can do this in O(n) and in a scalable way (any number of records): you loop through both sorted lists and collect the differences in a single pass. See https://stackoverflow.com/a/161535/187996 for more details.
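Since the referenced answer isn't shown here, a minimal sketch of that in-memory HashSet idea, using the classes and lists from the question (value tuples assume C# 7+; Tuple.Create works the same way on older compilers):
// Build a set of (Name, Surname) pairs from database one - O(n).
var knownPeople = new HashSet<(string, string)>(
    peopleInDatabaseOne.Select(p => (p.Name, p.Surname)));
// Single pass over database two with O(1) membership tests.
var peopleOnlyInDatabaseTwo = peopleInDatabaseTwo
    .Where(p => !knownPeople.Contains((p.FirstName, p.LastName)))
    .ToList();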
Edit: with respect to the comments - using a real model and a dictionary instead of a simple set:
Try hashing your list into a Dictionary that holds your people objects, using a Tuple of the two names as the key instead of comparing name1 == name2 && lname1 == lname2.
This will potentially then look like this:
// Some people1 and people2 lists of models already exist:
var sw = Stopwatch.StartNew();
var removeThese = people1.Select(x => Tuple.Create(x.Name, x.Surname));
var dic2 = people2.ToDictionary(x => Tuple.Create(x.FirstName, x.LastName), x => x);
var result = dic2.Keys.Except(removeThese).Select(x=>dic2[x]).ToList();
Console.WriteLine(sw.Elapsed);
I hope this helps.

How do I get the difference between two lists in C#?

Ok so I have two lists in C#
List<Attribute> attributes = new List<Attribute>();
List<string> songs = new List<string>();
one is of strings and one is of an Attribute object that I created - very simple
class Attribute
{
public string size { get; set; }
public string link { get; set; }
public string name { get; set; }
public Attribute(){}
public Attribute(string s, string l, string n)
{
size = s;
link = l;
name = n;
}
}
I now have to compare them to see which songs are not among the attribute names, so for example
songs.Add("something");
songs.Add("another");
songs.Add("yet another");
Attribute a = new Attribute("500", "http://google.com", "something" );
attributes.Add(a);
I want a way to return "another" and "yet another" because they are not among the names in the attributes list
so for pseudocode
difference = songs - attributes.names
var difference = songs.Except(attributes.Select(s=>s.name)).ToList();
edit
Added ToList() to make it a list
It's worth pointing out that the answers posted here will return a list of songs not present in the attribute names, but they won't give you a list of attribute names not present in songs.
While this is what the OP wanted, the title may be a little misleading, especially if (like me) you came here looking for a way to check whether the contents of two lists differ. If that is what you want, you can use the following:
var differences = new HashSet<string>(songs);
differences.SymmetricExceptWith(attributes.Select(a => a.name));
if (differences.Any())
{
// The lists differ.
}
This is the way to find all the songs which aren't included in the attribute names:
var result = songs
.Where(song => !attributes.Select(a => a.name).ToList().Contains(song));
The answer using Except is also perfect and probably more efficient.
EDIT: This syntax has one advantage if you're using it in LINQ to SQL: it translates into a NOT IN SQL predicate. Except is not translated to anything in SQL, so in that context all the records would be retrieved from the database and excepted on the app side, which is much less efficient.
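In a LINQ to SQL context that would look roughly like this (a sketch; the DataContext, table, and column names are hypothetical):
using (var db = new MusicDataContext()) // hypothetical DataContext
{
    var missingSongs = db.Songs
        .Where(song => !db.Attributes
            .Select(a => a.name)
            .Contains(song.Title)) // translated into a NOT IN predicate on the server
        .ToList();
}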
var diff = songs.Except(attributes.Select(a => a.name)).ToList();
