LINQ to Objects Performance - Huge dataset for long-running process - C#

I have a detail list (200,000 records) pulled from the database, and I need to find the locations for each detail. The code below loops through the detail list and assigns locations to each item. This loop takes more than 15 minutes to execute, but if I don't populate the Locations property it takes less than a minute.
How can I optimize this code?
class Program
{
    static void Main(string[] args)
    {
        List<Details> databaseDetailList = GetDetailsFromdatabase();
        List<Location1> databaseLocation1List = GetLocations1Fromdatabase();
        List<Location2> databaseLocation2List = GetLocations2Fromdatabase();
        List<Details> detailList = new List<Details>();

        foreach (var x in databaseDetailList)
        {
            detailList.Add(new Details
            {
                DetailId = x.DetailId,
                Code = x.Code,
                // If I comment out the Locations then it works faster
                Locations = new LocationIfo
                {
                    Locations1 = databaseLocation1List
                        .Where(l => l.DetailId == x.DetailId && l.Code == x.Code).ToList(),
                    Locations2 = databaseLocation2List
                        .Where(l => l.DetailId == x.DetailId && l.Code == x.Code).ToList()
                }
            });
        }
    }

    private static List<Details> GetDetailsFromdatabase()
    {
        // This returns 200,000 records from the database
        return new List<Details>();
    }

    private static List<Location1> GetLocations1Fromdatabase()
    {
        // This returns 100,000 records from the database
        return new List<Location1>();
    }

    private static List<Location2> GetLocations2Fromdatabase()
    {
        // This returns 100,000 records from the database
        return new List<Location2>();
    }
}
public class Details
{
    public string DetailId { get; set; }
    public string Code { get; set; }
    public LocationIfo Locations { get; set; }
}

public class LocationIfo
{
    public List<Location1> Locations1 { get; set; }
    public List<Location2> Locations2 { get; set; }
}

public class Location1
{
    public int LocationId { get; set; }
    public string DetailId { get; set; }
    public string Code { get; set; }
    public string OtherProperty { get; set; }
}

public class Location2
{
    public int LocationId { get; set; }
    public string DetailId { get; set; }
    public string Code { get; set; }
    public string OtherProperty { get; set; }
}

What you're doing here is conceptually a Join. Using the proper operations will ensure that it executes much more efficiently. Ideally you would even do the Join on the database side, rather than after pulling all of the data down into lists; but even if you do pull all of the data down, joining it in memory using Join will be much more efficient than scanning both location lists once per detail.
var query = from detail in databaseDetailList
            join location1 in databaseLocation1List
                on new { detail.DetailId, detail.Code }
                equals new { location1.DetailId, location1.Code }
                into locations1
            join location2 in databaseLocation2List
                on new { detail.DetailId, detail.Code }
                equals new { location2.DetailId, location2.Code }
                into locations2
            select new Details
            {
                Code = detail.Code,
                DetailId = detail.DetailId,
                Locations = new LocationIfo
                {
                    Locations1 = locations1.ToList(),
                    Locations2 = locations2.ToList(),
                }
            };
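As a self-contained illustration of the same group-join shape (using made-up sample data and a trimmed-down record type, not the question's classes), including materializing the lazily evaluated query with ToList():

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Location1/Location2 rows in the question.
record Loc(string DetailId, string Code, int LocationId);

static class JoinDemo
{
    public static void Main()
    {
        var details = new[] { (DetailId: "D1", Code: "A"), (DetailId: "D2", Code: "B") };
        var locations = new List<Loc>
        {
            new("D1", "A", 1), new("D1", "A", 2), new("D2", "B", 3)
        };

        // Group-join on the composite key; each detail gets the group of its matching locations.
        var query = from d in details
                    join l in locations
                        on new { d.DetailId, d.Code }
                        equals new { l.DetailId, l.Code }
                        into locs
                    select new { d.DetailId, Count = locs.Count() };

        // The query above is lazy; ToList() runs it once.
        foreach (var row in query.ToList())
            Console.WriteLine($"{row.DetailId}: {row.Count}");
    }
}
```

Internally GroupJoin builds a hash table over the right-hand side once, so the cost is roughly linear in the sizes of the two lists instead of their product.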

This is very similar to a problem I faced a couple of days ago (in Ruby).
The best solution I found was to convert the List (or Array) into a Dictionary (or Hash).
Create a dictionary keyed on your search fields concatenated (use a separator so that "1" + "23" and "12" + "3" don't collide), with the matching list items as the value. Since several locations can share the same DetailId/Code pair, group them into lists:
var dict1 = databaseLocation1List
    .GroupBy(x => x.DetailId + "|" + x.Code)
    .ToDictionary(g => g.Key, g => g.ToList());
And then the search in the Dictionary by key will be very fast:
Locations1 = dict1[x.DetailId + "|" + x.Code]
(or use TryGetValue if a detail might have no matching locations).

This is a bit hacky, but you could use lookups.
If it's the look-up portion of your code that is taking the longest, and you can't change the database code as suggested with a Join, you could consider indexing your Location return types in an in-memory Lookup.
Build a lookup key that combines your two values; the combination must uniquely identify the pair, depending on your requirements:
DetailId-Code - "123123-343"
private static ILookup<string, Location1> GetLocations1Fromdatabase()
{
    // This returns 100,000 records from the database
    return new List<Location1>()
        .ToLookup(l => l.DetailId + "-" + l.Code);
}
Then:
Locations1 = databaseLocation1List[x.DetailId + "-" + x.Code].ToList()
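To make the pattern concrete, here is a small self-contained sketch (with made-up data, not the question's database types) of building an ILookup on a composite string key and probing it:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class LookupDemo
{
    // Hypothetical location record for illustration only.
    class Loc { public string DetailId; public string Code; public int Id; }

    public static void Main()
    {
        var locations = new List<Loc>
        {
            new Loc { DetailId = "123123", Code = "343", Id = 1 },
            new Loc { DetailId = "123123", Code = "343", Id = 2 },
            new Loc { DetailId = "999",    Code = "1",   Id = 3 },
        };

        // Build the index once: O(n). Each subsequent probe is O(1).
        ILookup<string, Loc> byKey = locations.ToLookup(l => l.DetailId + "-" + l.Code);

        Console.WriteLine(byKey["123123-343"].Count());
        // Unlike a Dictionary, a missing key yields an empty sequence, not an exception.
        Console.WriteLine(byKey["nope-0"].Count());
    }
}
```

That missing-key behavior is exactly what you want here, since a detail with no locations should simply get an empty list.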

Related

LINQ query grouping without duplication of data

I want to display output grouped by two fields: SubsidiaryCode and AssetCreatedDate. My problem is that the grouping values are displayed redundantly.
I suspect the duplication comes from my Details class.
What I want is:
But it displays like this:
LINQ query:
public DateTime FromDate { get; set; }
public DateTime ToDate { get; set; }
public IList<AssetListTemplate> List = new List<AssetListTemplate>();

public IList<AssetListTemplate> GetList()
{
    using (var ctx = LinqExtensions.GetDataContext<NXpert.FixedAsset.DataAccess.FixedAssetDataContext>("AccountingDB"))
    {
        var list = (from x in ctx.DataContext.AssetRegistryEntities
                    where x.SubsidiaryCode2 != "" && x.SubsidiaryCode2.ToUpper().Contains("y-")
                        && x.AssetCreatedDate >= FromDate && x.AssetCreatedDate <= ToDate
                    group new { x.SubsidiaryCode2, x.AssetCreatedDate, x.AssetCategoryID } by x into groupedList
                    select new AssetListTemplate
                    {
                        IsSelected = false,
                        SubsidiaryCode = groupedList.Key.SubsidiaryCode2,
                        AssetCreatedDate = groupedList.Key.AssetCreatedDate,
                        AssetCategory = groupedList.Key.AssetCategoryID
                    }
                   ).OrderBy(x => x.SubsidiaryCode).ThenBy(y => y.AssetCreatedDate).ToList();
        List = list;
        foreach (var item in List)
        {
            var details = (from x in ctx.DataContext.AssetRegistryEntities
                           join y in ctx.DataContext.AssetCategoryEntities on x.AssetCategoryID equals y.AssetCategoryID
                           join z in ctx.DataContext.FixedAssetOtherInfoEntities on x.AssetCode equals z.AssetCode
                           where x.SubsidiaryCode2 == item.SubsidiaryCode
                           select new Details
                           {
                               AssetCode = x.AssetCode,
                               AssetCodeDesc = y.AssetCategoryDesc,
                               AssetDesc = x.AssetCodeDesc,
                               DepInCharge = z.DepartmentInCharge,
                               SerialNo = x.SerialNumber,
                               ModelNo = x.ModelNumber
                           }).ToList();
            item.Details = details;
        }
        return List;
    }
}
}
public class AssetListTemplate
{
    public bool IsSelected { get; set; }
    public string SubsidiaryCode { get; set; }
    public DateTime? AssetCreatedDate { get; set; }
    public string AssetCategory { get; set; }
    public List<Details> Details { get; set; }
}

public class Details
{
    public string AssetCode { get; set; }
    public string AssetCodeDesc { get; set; }
    public string AssetDesc { get; set; }
    public string DepInCharge { get; set; }
    public string SerialNo { get; set; }
    public string ModelNo { get; set; }
}
SQL Query:
SELECT Are_SubsidiaryCode2[SubsidiaryCode],Are_AssetCreatedDate[AssetCreatedDate],Are_AssetCategoryID[AssetCategory]
FROM E_AssetRegistry
WHERE Are_SubsidiaryCode2<>''
AND Are_SubsidiaryCode2 LIKE '%Y-%'
GROUP BY Are_SubsidiaryCode2
,Are_AssetCreatedDate
,Are_AssetCategoryID
ORDER BY AssetCreatedDate ASC
You don't seem to be using the grouping for any aggregate function, so you could make life simpler by just using Distinct:
var list = (from x in ctx.DataContext.AssetRegistryEntities
            where x.SubsidiaryCode2.Contains("y-") && x.AssetCreatedDate >= FromDate && x.AssetCreatedDate <= ToDate
            select new AssetListTemplate
            {
                IsSelected = false,
                SubsidiaryCode = x.SubsidiaryCode2,
                AssetCreatedDate = x.AssetCreatedDate.Value.Date,
                AssetCategory = x.AssetCategoryID
            }
           ).Distinct().OrderBy(x => x.SubsidiaryCode).ThenBy(y => y.AssetCreatedDate).ToList();
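One caveat: Distinct() only removes duplicates when the items compare equal. When the query is translated to SQL by LINQ to SQL, Distinct becomes SQL DISTINCT and compares values server-side; but if the projection runs in memory, Distinct on a class like AssetListTemplate compares references unless Equals/GetHashCode are overridden, while anonymous types compare by value. A small sketch of the in-memory difference (stand-in type, not the real entities):

```csharp
using System;
using System.Linq;

static class DistinctDemo
{
    // Stand-in class with no Equals/GetHashCode override.
    class Template { public string Code; }

    public static void Main()
    {
        var items = new[] { "A", "A", "B" };

        // Classes compare by reference: every 'new' is distinct, nothing collapses.
        var asClass = items.Select(s => new Template { Code = s }).Distinct().Count();
        Console.WriteLine(asClass);

        // Anonymous types compare by value: the duplicate "A" collapses.
        var asAnon = items.Select(s => new { Code = s }).Distinct().Count();
        Console.WriteLine(asAnon);
    }
}
```

So if any part of this query ever runs client-side, either override equality on AssetListTemplate or do the Distinct on an anonymous projection first.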
Side note: you don't need to assign the list to a class variable and also return it; I'd recommend just returning it. If you're looking to cache the results, make the class-level variable private, assign and return it the first time, and just return it on subsequent calls (use the null-ness of the class-level variable to determine whether the query has been run).
Expanding on the comment:
You don't need to store your data in a public property and also return it. Don't do this:
public class Whatever
{
    public string Name { get; set; }

    public string GetName()
    {
        var name = "John";
        Name = name;
        return name;
    }
}
Typically we would either return it:
public class Whatever
{
    public string GetName()
    {
        var name = MySlowDatabaseCallToCalculateAName();
        return name;
    }
}

// use it like:
var w = new Whatever();
var name = w.GetName();
Or we would store it:
public class Whatever
{
    public string Name { get; set; }

    public void PopulateName()
    {
        Name = MySlowDatabaseCallToCalculateAName();
    }
}

// use it like:
var w = new Whatever();
w.PopulateName();
var name = w.Name;
We might have a mix of the two if we were providing some sort of cache, for example if the query is really slow and the data doesn't change often but is used a lot:
public class Whatever
{
    private string _name;
    private DateTime _nameGeneratedAt = DateTime.MinValue;

    public string GetName()
    {
        // if it was more than a day since we generated the name, generate a new one
        if (_nameGeneratedAt < DateTime.UtcNow.AddDays(-1))
        {
            _name = MySlowDatabaseCallToCalculateAName();
            _nameGeneratedAt = DateTime.UtcNow; // and don't do it again for at least a day
        }
        return _name;
    }
}
This would mean we only have to do the slow thing once a day. But in a method like "get me a name/asset list/whatever" we generally wouldn't set a public property as well as return the thing; it's confusing for callers, who want the name but can't tell whether to read the Name property or call GetName. If the property were called MostRecentlyGeneratedName and the method GenerateLatestName, it would make more sense: the caller knows to call Generate..() first and can then reuse MostRecently.. as a form of caching, deciding whether to fetch the latest value or reuse a recently generated one (though this introduces the small headache of what happens if some other operation regenerates the value in the middle of the first operation using the property).
More likely, though, we'd just provide the Generate..() method; if the caller wants to cache the result and reuse it, it can.
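For the simpler "compute once, then reuse" case with no expiry, .NET's Lazy<T> expresses the cache without hand-rolled timestamp checks. A sketch, with a counter standing in for the slow database call so the "runs only once" behavior is visible:

```csharp
using System;

class Whatever
{
    // Thread-safe by default: the factory runs at most once, on first access.
    private readonly Lazy<string> _name =
        new Lazy<string>(() => SlowNameCalculation());

    public string GetName() => _name.Value;

    // Stand-in for the slow database call in the text.
    private static int _calls;
    private static string SlowNameCalculation()
    {
        _calls++;
        return "John";
    }

    public static int Calls => _calls;
}

static class LazyDemo
{
    public static void Main()
    {
        var w = new Whatever();
        Console.WriteLine(w.GetName()); // first access runs the factory
        Console.WriteLine(w.GetName()); // cached; factory does not run again
        Console.WriteLine(Whatever.Calls);
    }
}
```

Unlike the once-a-day version above, Lazy<T> never refreshes, so it only fits data that is stable for the object's lifetime.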

How to improve performance while processing data from a huge list in C#?

I have a class Test1 which calls a Test2 class method.
public class Test1
{
    public void Testmethod1(List<InputData> request)
    {
        // get data from SQL: huge input list, around 150K+ items
        var inputs = new List<InputData>();
        var output = Test2.Testmethod2(inputs);
    }
}
The Test2 class has the processing method below:
public class Test2
{
    // request list count: 200K
    public static List<OutputData> Testmethod2(List<InputData> request)
    {
        object sync = new Object();
        var output = new List<OutputData>();
        var output1 = new List<OutputData>();

        // data count: 20K
        var data = request.Select(x => x.Input2).Distinct().ToList();

        // method call using foreach: processing time 4 hours
        foreach (var n in data)
        {
            output.AddRange(ProcessData(request.Where(x => x.Input1 == n)));
        }

        // method call using Parallel.ForEach: processing time 4 hours
        var options = new ParallelOptions { MaxDegreeOfParallelism = 3 };
        Parallel.ForEach(data, options, n =>
        {
            lock (sync)
            {
                output1.AddRange(ProcessData(request.Where(x => x.Input1 == n)));
            }
        });

        return output;
    }

    public static List<OutputData> ProcessData(IEnumerable<InputData> inputData)
    {
        var result = new List<OutputData>();
        // processing on the input data
        return result;
    }
}
public class InputData
{
    public int Input1 { get; set; }
    public int Input2 { get; set; }
    public int Input3 { get; set; }
    public DateTime Input4 { get; set; }
    public int Input5 { get; set; }
    public int Input6 { get; set; }
    public string Input7 { get; set; }
    public int Input8 { get; set; }
    public int Input9 { get; set; }
}

public class OutputData
{
    public int Ouputt1 { get; set; }
    public int Output2 { get; set; }
    public int Output3 { get; set; }
    public int output4 { get; set; }
}
It's taking quite a long time to process the data, around 4 hours. Even Parallel.ForEach performs like the sequential version.
I am thinking of using a Dictionary to store the input data; however, the data is not unique and doesn't have a unique row.
Is there a better approach to optimize this?
Thanks!
Right now, the code uses brute force: a full scan of the 200K-item request list for each of the 20K distinct keys, which amounts to billions of element comparisons.
I suspect performance will improve dramatically simply by creating a dictionary or a lookup (since there can be multiple items per key), e.g.:
var myIndex = request.ToLookup(x => x.Input1);
var output = new List<OutputData>(20000);

foreach (var n in data)
{
    output.AddRange(ProcessData(myIndex[n]));
}
I specify a capacity in the list constructor to reduce reallocations each time the list's internal buffer gets full. There's no need for a precise value.
If the code is still slow, one approach would be to use Parallel.ForEach or PLINQ. Note that selecting ProcessData(dt) directly would produce a list of lists, so flatten as you go, e.g.:
var output = (from n in data.AsParallel().WithDegreeOfParallelism(3)
              let dt = myIndex[n]
              from result in ProcessData(dt)
              select result
             ).ToList();
(from n in request
 // Group items in request by unique values of Input2
 group n by n.Input2)
    .AsParallel()
    .WithDegreeOfParallelism(4)
    .Select(data => Test2.ProcessData(
        // Filter inputs
        data.Where(x => x.Input1 == data.Key)
    ))
    .Cast<IEnumerable<OutputData>>()
    // Combine the output
    .Aggregate(Enumerable.Concat)
    // Generate the final list
    .ToList();
The idea is to group request by InputData.Input2 values, process the batches in parallel and collect all the results.
Conceptually, this is a variation of Panagiotis Kanavos's answer.
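The same "group once, process each batch, flatten" idea can be written more directly with GroupBy and SelectMany, avoiding the Cast/Aggregate steps. A minimal self-contained sketch with stand-in types (ProcessData here is a placeholder returning one value per input row, not the question's implementation):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class GroupDemo
{
    // Trimmed-down stand-in for the question's InputData.
    class Input { public int Input1; public int Input2; }

    // Placeholder for ProcessData: one output value per input row.
    static IEnumerable<int> ProcessData(IEnumerable<Input> batch) =>
        batch.Select(x => x.Input1);

    public static void Main()
    {
        var request = new List<Input>
        {
            new Input { Input1 = 1, Input2 = 10 },
            new Input { Input1 = 2, Input2 = 10 },
            new Input { Input1 = 3, Input2 = 20 },
        };

        // Group once (one pass over request), process each batch in parallel, flatten.
        var output = request
            .GroupBy(x => x.Input2)
            .AsParallel()
            .SelectMany(g => ProcessData(g))
            .ToList();

        Console.WriteLine(output.Count);
    }
}
```

Note that with AsParallel() the order of results across batches is not guaranteed; add AsOrdered() if ordering matters.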

Filter list 1 from list 2

I'm having a performance problem here.
I have two lists: a List<WholeSaleEntry> with 50k items and a List<ProductList> with 120k items.
WholeSaleEntry is:
public class WholeSaleEntry
{
    public string SKU { get; set; }
    public string Stock { get; set; }
    public string Price { get; set; }
    public string EAN { get; set; }
}
and ProductList:
public class ProductList
{
    public string SKU { get; set; }
    public string Price { get; set; }
    public string FinalPrice { get; set; }
    public string AlternateID { get; set; }
}
I need to filter WholeSaleEntry by its EAN and SKU: keep an entry when its EAN or SKU appears in ProductList.AlternateID.
I wrote this code, which works, but the performance is really slow:
List<WholeSaleEntry> filterWholeSale(List<WholeSaleEntry> wholeSaleEntry, List<ProductList> productList)
{
    List<WholeSaleEntry> list = new List<WholeSaleEntry>();
    foreach (WholeSaleEntry item in wholeSaleEntry)
    {
        try
        {
            string productSku = item.SKU;
            string productEan = item.EAN;
            var filteredCollection = productList
                .Where(itemx => itemx.AlternateID == productEan || itemx.AlternateID == productSku)
                .ToList();
            if (filteredCollection.Count > 0)
            {
                list.Add(item);
            }
        }
        catch (Exception)
        {
        }
    }
    return list;
}
Is there any better filtering system or something that can filter it in bulk?
The use of .Where(...).ToList() finds every matching item, when in the end you only need to know whether there is a matching item. That can be fixed using Any(...), which stops as soon as a match is found:
var hasAny = productList.Any(itemx => itemx.AlternateID == productEan || itemx.AlternateID == productSku);
if (hasAny)
{
    list.Add(item);
}
Update: the algorithm can be simplified further. First collect the unique alternate IDs in a HashSet, which stores duplicate items just once and is extremely fast for lookups. Then keep the WholeSale items that match. There are not many faster strategies, the code is small, and I think it is easy to understand:
var uniqueAlternateIDs = new HashSet<string>(productList.Select(w => w.AlternateID));
var list = wholeSaleEntry
    .Where(w => uniqueAlternateIDs.Contains(w.SKU)
             || uniqueAlternateIDs.Contains(w.EAN))
    .ToList();
return list;
A quick test revealed that for 50k + 50k items, using HashSet<string> took 28ms to get the answer. Using a List<string> filled using .Distinct().ToList() took 48 seconds, with all other code the same.
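A rough sketch of how one might reproduce that kind of comparison (smaller sizes here so it runs quickly; absolute timings are machine-dependent, the point is the O(1) vs O(n) per-probe cost):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class BenchDemo
{
    public static void Main()
    {
        // Synthetic IDs; half of the probes overlap the stored set.
        var ids = Enumerable.Range(0, 5_000).Select(i => "id" + i).ToList();
        var probes = Enumerable.Range(2_500, 5_000).Select(i => "id" + i).ToList();

        var asList = ids.Distinct().ToList();
        var asSet = new HashSet<string>(ids);

        var timer = Stopwatch.StartNew();
        int hitsSet = probes.Count(p => asSet.Contains(p));   // O(1) per probe
        timer.Stop();
        Console.WriteLine($"HashSet: {hitsSet} hits in {timer.ElapsedMilliseconds} ms");

        timer.Restart();
        int hitsList = probes.Count(p => asList.Contains(p)); // O(n) scan per probe
        timer.Stop();
        Console.WriteLine($"List: {hitsList} hits in {timer.ElapsedMilliseconds} ms");
    }
}
```

Both paths find the same 2,500 matches; only the cost per lookup differs, and the gap widens quadratically as the lists grow.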
If you do not want a separate method and want to avoid building the list manually, just do this:
var filteredWholeSale = wholeSaleEntry.Where(x => productList.Any(itemx => itemx.AlternateID == x.EAN || itemx.AlternateID == x.SKU));

Inner join not working when using equals with %; what is an alternative way to get LIKE '%...%' behavior?

I have a Medals class. I call a service that returns all Medals fields except two, ArabicName and ISOCode, which I have to bring in from another table class, CountrysTBLs. I wrote this join code:
The Medal class:
public class Medals
{
    public int bronze_count { get; set; }
    public string country_name { get; set; }
    public int gold_count { get; set; }
    public string id { get; set; }
    public int place { get; set; }
    public int silver_count { get; set; }
    public int total_count { get; set; }
    public string ArabicName { get; set; } // this field is not returned by the service
    public string ISOCode { get; set; }   // this field is not returned by the service
}
The code:
var cntrs = db.CountrysTBLs.ToList();
List<Medals> objs = call_Service_Of_Medals_Without_ISOCode();

IEnumerable<Medals> query = from obj in objs
                            join cntry in cntrs
                                on obj.country_name equals '%' + cntry.CommonName + '%'
                            select new Medals
                            {
                                ArabicName = cntry.ArabicName,
                                ISOCode = cntry.ISOCode,
                                country_name = obj.country_name,
                                place = obj.place,
                                gold_count = obj.gold_count,
                                silver_count = obj.silver_count,
                                bronze_count = obj.bronze_count,
                                total_count = obj.total_count
                            };
I get no results.
How can I fix this? Is there any way to bring in the two fields (ISOCode, ArabicName) without using the inner join, and with the best performance?
You want something like this to achieve LIKE functionality:
List<Medals> medals = new List<Medals>();
var list = medals.Select(x => new
{
    obj = x,
    country = countries.FirstOrDefault(c => c.CommonName.Contains(x.country_name))
});
or something like this (if you want to just enrich each medal)
foreach (var medal in medals)
{
    var country = countries.FirstOrDefault(c => c.CommonName.Contains(medal.country_name));
    // FirstOrDefault returns null when nothing matches; guard if a match isn't guaranteed.
    medal.ISOCode = country.ISOCode;
    medal.ArabicName = country.ArabicName;
}
Do note that this is not as performant as a Dictionary<string, Country> of countries keyed by country name; but since you need a LIKE comparison, you would need a custom data structure such as a Lucene index for fast lookups. Check first, though: if the lists are small enough, it probably won't be a problem. Otherwise, why not make Medals.country_name and Country.CommonName identical, so you can do quick Dictionary (hash table) lookups?
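If the names can be made to match exactly, as suggested above, the per-medal scan collapses to a hash lookup. A sketch with hypothetical country data (the case-insensitive comparer is one possible choice, not something the question requires):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CountryDemo
{
    // Hypothetical stand-in for the CountrysTBLs row.
    class Country { public string CommonName; public string ISOCode; }

    public static void Main()
    {
        var countries = new List<Country>
        {
            new Country { CommonName = "Egypt",  ISOCode = "EG" },
            new Country { CommonName = "Jordan", ISOCode = "JO" },
        };

        // Build the index once; each medal lookup is then O(1) instead of a list scan.
        var byName = countries.ToDictionary(c => c.CommonName, StringComparer.OrdinalIgnoreCase);

        // TryGetValue avoids an exception when a name has no match.
        if (byName.TryGetValue("egypt", out var match))
            Console.WriteLine(match.ISOCode);
    }
}
```

ToDictionary throws on duplicate keys, so this assumes country names are unique; use ToLookup if they might not be.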

Optimizing simple usage of Linq in C#

I have replicated a stripped-down version of my code, which has recently been rewritten to use LINQ to access the database.
However, in my opinion, the LINQ is really simple and could probably be optimized quite a bit, especially around line 90 where there is a LINQ statement inside a foreach loop.
It'd be really helpful to see how someone else would go about writing this simple task using linq. Thanks in advance! Below is a snippet of my source code.
// Model objects - these are to be populated from the database,
// which has identical fields and table names.
public class Element
{
    public Element()
    {
        Translations = new Collection<Translation>();
    }

    public int Id { get; set; }
    public string Name { get; set; }
    public Collection<Translation> Translations { get; set; }

    public class Translation
    {
        public int Id { get; set; }
        public string Title { get; set; }
        public string Content { get; set; }
        public Language Lang { get; set; }
    }
}

public class Language
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Code { get; set; }
}
// Stripped-down functions for adding and loading Element
// objects to/from the database:
public static class DatabaseLoader
{
    // Add method isn't too bulky, but I'm sure it could be optimised somewhere.
    public static void Add(string name, Collection<Translation> translations)
    {
        using (var db = DataContextFactory.Create<ElementContext>())
        {
            var dbElement = new Database.Element()
            {
                Name = name
            };
            db.Elements.InsertOnSubmit(dbElement);
            // Must submit so the dbElement gets its Id set.
            db.SubmitChanges();

            foreach (var translation in translations)
            {
                db.Translations.InsertOnSubmit(
                    new Database.Translation()
                    {
                        FK_Element_Id = dbElement.Id,
                        FK_Language_Id = translation.Lang.Id,
                        Title = translation.Title,
                        Content = translation.Content
                    });
            }
            // Submit all the changes outside the loop.
            db.SubmitChanges();
        }
    }

    // This method is really bulky, and I'd like to see the many separate LINQ
    // calls merged into one clever statement if possible (?).
    public static Element Load(int id)
    {
        using (var db = DataContextFactory.Create<ElementContext>())
        {
            // Get the database object of the relevant element.
            var dbElement =
                (from e in db.Elements
                 where e.Id == id
                 select e).Single();

            // Find all the translations for the current element.
            var dbTranslations =
                from t in db.Translations
                where t.Fk_Element_Id == id
                select t;

            // This object will be used to build the model object.
            var trans = new Collection<Translation>();
            foreach (var translation in dbTranslations)
            {
                // Build up the 'trans' variable for passing to the model object.
                // Here there is a LINQ statement for EVERY iteration of the
                // foreach loop... not good (?).
                var dbLanguage =
                    (from l in db.Languages
                     where l.Id == translation.FK_Language_Id
                     select l).Single();

                trans.Add(new Translation()
                {
                    Id = translation.Id,
                    Title = translation.Title,
                    Content = translation.Content,
                    Language = new Language()
                    {
                        Id = dbLanguage.Id,
                        Name = dbLanguage.Name,
                        Code = dbLanguage.Code
                    }
                });
            }

            // The model object is now built up from the database (finally).
            return new Element()
            {
                Id = id,
                Name = dbElement.Name,
                Translations = trans
            };
        }
    }
}
Using some made-up constructors to oversimplify:
public static Element Load(int id)
{
    using (var db = DataContextFactory.Create<ElementContext>())
    {
        var translations = from t in db.Translations
                           where t.Fk_Element_Id == id
                           join l in db.Languages on t.FK_Language_Id equals l.Id
                           select new Translation(t, l);
        return new Element(db.Elements.Where(x => x.Id == id).Single(), translations);
    }
}
The first thing I don't like here is all the new Translation() { bla = bla } blocks; they're big chunks of code. I would move them into a method that takes the objects and returns the new instance for you:
Translations.InsertOnSubmit(CreateDatabaseTranslation(dbElement, translation));
and
trans.Add(CreateTranslationWithLanguage(translation, dbLanguage));
etc. Wherever you have code like this, it just muddles the readability of what you're doing.
