EDIT 01: I seem to have found a solution that works for me (see my answer below), going from an hour to mere seconds by pre-computing and then applying the .Except() extension method. I'm leaving this open in case anyone else encounters this problem or finds a better solution.
ORIGINAL QUESTION
I have the following set of queries for different kinds of objects I'm staging from a source system, so I can keep it in sync and make a delta stamp myself, as the source system doesn't provide one, nor can we build or touch it.
I get all the data in memory and then, for example, perform this query, where I look for objects that no longer exist in the source system but are present in the staging database - and thus have to be marked "deleted". The bottleneck is the first part of the LINQ query, the .Contains(). How can I improve its performance - maybe with .Except() and a custom comparer?
Or would I be better off putting them in a hash set and then performing the compare?
The catch is that I still need the staged objects afterwards to do some property transforms on them. This seemed the simplest solution, but unfortunately it's very slow on 20k objects:
stagedSystemObjects
    .Where(stagedSystemObject =>
        !sourceSystemObjects
            .Select(sourceSystemObject => sourceSystemObject.Code)
            .Contains(stagedSystemObject.Code))
    .Select(x =>
    {
        x.ActiveStatus = ActiveStatuses.Disabled;
        x.ChangeReason = ChangeReasons.Edited;
        return x;
    })
    .ToList();
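For reference, the custom-comparer idea I mention above could look roughly like this - just a sketch, assuming both collections share a type (called SystemObject here) with a string Code property; if the staged and source types differ, projecting to codes (as in the answers below) is the simpler route:

// Hypothetical comparer that treats two objects as equal when their Code matches.
class CodeComparer : IEqualityComparer<SystemObject>
{
    public bool Equals(SystemObject x, SystemObject y) => x?.Code == y?.Code;
    public int GetHashCode(SystemObject obj) => obj.Code?.GetHashCode() ?? 0;
}

// Except builds a hash set internally, so this avoids the repeated linear .Contains() scan.
var deleted = stagedSystemObjects
    .Except(sourceSystemObjects, new CodeComparer())
    .Select(x =>
    {
        x.ActiveStatus = ActiveStatuses.Disabled;
        x.ChangeReason = ChangeReasons.Edited;
        return x;
    })
    .ToList();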
Based on Yves Schelpe's answer, I made a few tweaks to make it faster.
The basic idea is to drop the first two ToList calls and use PLINQ. See if this helps:
var stagedSystemCodes = stagedSystemObjects.Select(x => x.Code);
var sourceSystemCodes = sourceSystemObjects.Select(x => x.Code);
var codesThatNoLongerExistInSourceSystem = stagedSystemCodes.Except(sourceSystemCodes).ToArray();

var y = stagedSystemObjects.AsParallel()
    .Where(stagedSystemObject =>
        codesThatNoLongerExistInSourceSystem.Contains(stagedSystemObject.Code))
    .Select(x =>
    {
        x.ActiveStatus = ActiveStatuses.Disabled;
        x.ChangeReason = ChangeReasons.Edited;
        return x;
    })
    .ToArray();
Note that PLINQ only tends to pay off for compute-bound work on a multi-core CPU. It can make things worse in other scenarios.
I have found a solution to this problem - it brought the time down to mere seconds instead of an hour for 200k objects.
It's done by pre-computing and then applying the .Except() extension method.
So no more chaining LINQ queries or calling .Contains inside the Where clause: by first projecting both sides to a list of codes (strings), the inner calculation doesn't have to happen over and over again as it did in the original question's example code.
Here is my solution, which is satisfactory for now. However, I'm leaving this open in case anyone comes up with a refined/better solution!
var stagedSystemCodes = stagedSystemObjects.Select(x => x.Code).ToList();
var sourceSystemCodes = sourceSystemObjects.Select(x => x.Code).ToList();
var codesThatNoLongerExistInSourceSystem = stagedSystemCodes.Except(sourceSystemCodes).ToList();

return stagedSystemObjects
    .Where(stagedSystemObject =>
        codesThatNoLongerExistInSourceSystem.Contains(stagedSystemObject.Code))
    .Select(x =>
    {
        x.ActiveStatus = ActiveStatuses.Disabled;
        x.ChangeReason = ChangeReasons.Edited;
        return x;
    })
    .ToList();
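A small extra refinement in case it helps anyone: Contains on a List<string> is still a linear scan per staged object, so wrapping the pre-computed codes in a HashSet<string> makes each lookup effectively O(1). A sketch using the same names as above:

var stagedSystemCodes = stagedSystemObjects.Select(x => x.Code).ToList();
var sourceSystemCodes = sourceSystemObjects.Select(x => x.Code).ToList();

// HashSet gives O(1) membership checks instead of scanning a list for every staged object.
var codesThatNoLongerExistInSourceSystem =
    new HashSet<string>(stagedSystemCodes.Except(sourceSystemCodes));

return stagedSystemObjects
    .Where(stagedSystemObject =>
        codesThatNoLongerExistInSourceSystem.Contains(stagedSystemObject.Code))
    .Select(x =>
    {
        x.ActiveStatus = ActiveStatuses.Disabled;
        x.ChangeReason = ChangeReasons.Edited;
        return x;
    })
    .ToList();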
Related
Say I needed to do a whole bunch of queries from various tables like so
var weights = db.weights.Where(x => ids.Contains(x.ItemId)).Select(x => x.weight).ToList();
var heights = db.heights.Where(x => ids.Contains(x.ItemId)).Select(x => x.height).ToList();
var lengths = db.lengths.Where(x => ids.Contains(x.ItemId)).Select(x => x.length).ToList();
var widths = db.widths.Where( x => ids.Contains(x.ItemId)).Select(x => x.width ).ToList();
Okay, it's really not that stupid in reality, but it's just to illustrate the question. Basically, that array "ids" gets sent to the database 4 times in this example. I was thinking I could save some bandwidth by sending it just once. Is it possible to do that? Sorta like:
db.SetTempVariable("ids", ids);
var weights = db.weights.Where(x => db.TempVariable["ids"].Contains(x.ItemId)).Select(x => x.weight).ToList();
var heights = db.heights.Where(x => db.TempVariable["ids"].Contains(x.ItemId)).Select(x => x.height).ToList();
var lengths = db.lengths.Where(x => db.TempVariable["ids"].Contains(x.ItemId)).Select(x => x.length).ToList();
var widths = db.widths.Where( x => db.TempVariable["ids"].Contains(x.ItemId)).Select(x => x.width ).ToList();
db.DeleteTempVariable("ids");
I'm just imagining the possible syntax here. In essence, SetTempVariable would send the data to the database, and db.TempVariable["ids"] would just be a dummy object to use in expressions that really only contains a reference to the previously sent data; the database would magically understand this and reuse the list of ids I sent it instead of me sending it again and again.
So, how can I do that?
Well, this is more a database design problem than anything. A properly designed database would have one table that contains weights, heights, lengths and widths for every item (or "id" as you call it), so one query on the item returns everything at once.
I'm reluctant to suggest bandaid fixes for the broken database design you're using, because you really should just fix that, but you'll find a large improvement in performance if you open a transaction first and run all 4 of your queries in it. Or just join the tables on id (they seem the same?), and then your queries become one query.
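For example, the joined version might look roughly like this (a sketch that assumes all four tables share an ItemId column, as the question implies; the ids list is then sent only once):

// One round trip: join the four tables on ItemId and filter once on the ids list.
var dimensions = (from w in db.weights
                  join h in db.heights on w.ItemId equals h.ItemId
                  join l in db.lengths on w.ItemId equals l.ItemId
                  join wd in db.widths on w.ItemId equals wd.ItemId
                  where ids.Contains(w.ItemId)
                  select new { w.ItemId, w.weight, h.height, l.length, wd.width })
                 .ToList();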
To answer your actual question, and again you're barking up the wrong tree here, that's what temp tables are. You can upload your data to a temp table and then join it against your other table(s).
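A rough sketch of that temp-table route with plain ADO.NET against SQL Server (using System.Data and System.Data.SqlClient; table and column names are borrowed from the question, so treat this as an illustration rather than the one true way):

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    // A #temp table lives only for the lifetime of this connection/session.
    using (var createCmd = new SqlCommand("CREATE TABLE #Ids (ItemId INT PRIMARY KEY);", conn))
        createCmd.ExecuteNonQuery();

    // Bulk-copy the id list into the temp table in a single round trip.
    var idTable = new DataTable();
    idTable.Columns.Add("ItemId", typeof(int));
    foreach (var id in ids) idTable.Rows.Add(id);

    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#Ids" })
        bulk.WriteToServer(idTable);

    // Now join against #Ids as often as needed without resending the ids.
    using (var queryCmd = new SqlCommand(
        "SELECT w.ItemId, w.weight FROM weights w JOIN #Ids i ON i.ItemId = w.ItemId;", conn))
    using (var reader = queryCmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // map reader columns to your objects here
        }
    }
}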
I am extremely stuck on getting the right information from the DB. Basically, the problem is that I need to add a Where clause to my statement so that it only retrieves the data that is actually needed.
public async Task<IEnumerable<Post>> GetAllPosts(int userId, int pageNumber)
{
    var followersIds = _dataContext.Followees.Where(f => f.CaUserId == userId).AsQueryable();
    pageNumber *= 15;
    var posts = await _dataContext.Posts
        .Include(p => p.CaUser)
        .Include(p => p.CaUser.Photos)
        .Include(c => c.Comments)
        .Where(u => u.CaUserId == followersIds.Id) // <=== ERROR
        .Include(l => l.LikeDet)
        .ToListAsync();
    return posts.OrderByDescending(p => p.Created).Take(pageNumber);
}
As you can see, followersIds contains all the required IDs that I need to check against in the posts query. However, I have tried with a foreach and nothing seems to work here. Can somebody help me with this issue?
The short version is that you can change the error line marked above to something like .Where(u => followersIds.Contains(u.CaUserId)), which will return all entities with a CaUserId contained in the followersIds variable. However, this still has the potential to return a much larger dataset than you actually need, and it's quite a large query. (You also might need to check the syntax a bit; I'm shooting from memory without an IDE open.) You are including a lot of linked entities in the query above, so you may be better off using a Select projection rather than just a Where, which would load only the properties you need from each entity.
Take a look at this article from Jon Smith, who wrote the book "Entity Framework Core In Action", where he talks about using Select queries and DTOs to get only what you need. Chances are you don't need every property of every entity you are asking for in the query above. (Maybe you do, what do I know :p) Using this approach might get you something much more efficient for just the dataset you need: more lines of code in the query, but potentially better performance on the back end and a lighter memory footprint.
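As a rough illustration of that Select/DTO idea - PostDto and several property names below are made up, and it assumes followersIds has first been projected to a plain list of the followed users' ids; the point is that with a projection EF Core only loads the columns you ask for, so the Includes become unnecessary:

// Hypothetical DTO - give it whatever shape the page actually needs.
public class PostDto
{
    public int Id { get; set; }
    public DateTime Created { get; set; }
    public List<string> PhotoUrls { get; set; }
    public int CommentCount { get; set; }
}

// followerIds is assumed to be a List<int> of the ids of the users being followed.
var posts = await _dataContext.Posts
    .Where(p => followerIds.Contains(p.CaUserId))
    .OrderByDescending(p => p.Created)
    .Take(pageNumber)
    .Select(p => new PostDto
    {
        Id = p.Id,                                                 // assumed property
        Created = p.Created,
        PhotoUrls = p.CaUser.Photos.Select(ph => ph.Url).ToList(), // assumed property
        CommentCount = p.Comments.Count
    })
    .ToListAsync();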
I am working on some DataTable-related operations on data. What would be the most efficient way to use LINQ on a DataTable?
var list = dataSet.Tables[0]
.AsEnumerable()
.Where(p => p.Field<String>("EmployeeName") == "Jams");
OR
var listobj = dataSet.Tables[0].Rows
    .Cast<DataRow>()
    .Where(dr => dr["EmployeeName"].ToString() == "Jams");
.AsEnumerable() internally uses .Rows.Cast<DataRow>(), at least in the reference implementation. It does a few other bits as well but nothing that would appreciably affect performance.
.AsEnumerable() and .Field do a lot of extra work that is not needed in most cases.
Also, field lookup by index is faster than lookup by name:
int columnIndex = dataTable.Columns["EmployeeName"].Ordinal;
var list = dataTable.Rows.Cast<DataRow>().Where(dr => "Jams".Equals(dr[columnIndex]));
For multiple names, the lookup is faster if the results are cached in a Dictionary or Lookup:
int colIndex = dataTable.Columns["EmployeeName"].Ordinal;
var lookup = dataTable.Rows.Cast<DataRow>().ToLookup(dr => dr[colIndex]?.ToString());
// .. and later when the result is needed:
var list = lookup["Jams"];
Define "efficient".
From a performance standpoint, I doubt there are any significant differences between these two options: the overall run time will be dominated by the time required to do network I/O, not the time required to do the casting.
From a pure code-style point of view, the second one looks too inelegant to me. If you can get away with an all-LINQ solution, go with it, as it's generally (IMO, at least) more readable by virtue of being declarative.
Interestingly enough, AsEnumerable() returns an EnumerableRowCollection<DataRow>, and if you look into the code for it, you will see the following:
this._enumerableRows = Enumerable.Cast<TRow>((IEnumerable) table.Rows);
So I would say that they are basically equivalent!
I am essentially trying to see whether entities exist in a local context and sort them accordingly. This function seems to be faster than others we have tried; it runs in about 50 seconds for 1000 items, but I am wondering if there is something I can do to improve its efficiency. I believe the Find here is slowing it down significantly, as a simple foreach iteration over 1000 items takes milliseconds, and benchmarking shows the bottleneck is there. Any ideas would be helpful. Thank you.
Sample code:
foreach (var entity in entities)
{
    var localItem = db.Set<T>().Find(Key);
    if (localItem != null)
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
If this is a database (which, from the comments, I gather it is...) you would be better off doing fewer queries:
list1.AddRange(db.Set<T>().Where(x => x.Key == Key));
list2.AddRange(db.Set<T>().Where(x => x.Key != Key));
This would be 2 queries instead of 1000+.
Also be aware of the fact that by adding each one to a List<T>, you're keeping 2 large arrays. So if 1000+ turns into 10000000, you're going to have interesting memory issues.
See this post on my blog for more information: http://www.artisansoftware.blogspot.com/2014/01/synopsis-creating-large-collection-by.html
If I understand correctly, the database seems to be the bottleneck? If you want to (effectively) select data from a database relation whose attribute x should match an ==-criterion, you should consider creating a secondary access path for that attribute (an index structure); a short sketch follows after the list below. Depending on your database system and the value distribution in your table, this might be a hash index (especially good for == checks), a B+-tree (an all-rounder), or whatever else your system offers.
However, this only works if:
- you don't just fetch the full data set once and then have to live with it in your application,
- adding (another) index to the relation is not out of the question (e.g. it may not be worth having one for a single need), and
- an index would actually be effective - it won't be, for example, if the attribute you are querying on has very few unique values.
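If this happens to be EF Core code-first, the index can be declared on the model and created through a migration; a minimal sketch, with placeholder entity and property names:

// Inside your DbContext; "Item" and "KeyValue" are placeholders.
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Item>()
        .HasIndex(i => i.KeyValue); // secondary index on the attribute used in the == check
}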
I found your answers very helpful, but here is ultimately how I solved the problem. It seems .Find was the bottleneck:
var tableDictionary = db.Set<T>().ToDictionary(x => x.KeyValue, x => x);

foreach (var entity in entities)
{
    if (tableDictionary.ContainsKey(entity.KeyValue))
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
This ran with 900+ rows in about a tenth of a second, which for our purposes was efficient enough.
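A possible further trim, since the dictionary values are never used: a HashSet of just the key column keeps the memory footprint down and lets the database send only that one column. A sketch reusing the names above (KeyType stands in for whatever type the key actually is):

// Only the key column crosses the wire, and membership checks stay O(1).
var existingKeys = new HashSet<KeyType>(db.Set<T>().Select(x => x.KeyValue));

foreach (var entity in entities)
{
    if (existingKeys.Contains(entity.KeyValue))
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}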
Rather than querying the DB for each item, you can just do one query, get all of the data (since you want all of the data from the DB eventually) and you can then group it in memory, which can be done (in this case) about as efficiently as in the database. By creating a lookup of whether or not the key is equal, we can easily get the two groups:
var lookup = db.Set<T>().ToLookup(item => item.Key == Key);
var list1 = lookup[true].ToList();
var list2 = lookup[false].ToList();
(You can use AddRange instead if the lists have previous values that should also be in them.)
Not really sure if this is a proper question or not, but I figure that I'll give it a go and see what kind of answers pop up.
We're at the point in our development where we are moving on to User Acceptance Testing, and one of the things the users have found lacking is the speed at which tabs load after a search result is selected. I've implemented logging and identified a few culprit methods and data retrieval/manipulation steps that are causing the perceived slowness. The one below is the biggest issue. The purpose of the method is to select all payments received towards a policy or any sub-policies, group them together by both due date and paid date, and then return a GroupedClass that sums the amounts paid towards the whole policy. I'm wondering if there's any way this can be made more efficient. I've noticed that, working with this old UniVerse data, things tend to break if they aren't cast with .AsEnumerable() before being utilized:
var mc = new ModelContext();
var policy = mc.Polmasts.Find("N345348");

var payments = mc.Paymnts
    .Where(p => p.POLICY.Contains(policy.ID))
    .GroupBy(p => new { p.PAYDUE_, p.PAYPD_ })
    .Select(grp => new GroupedPayments
    {
        PAYPD_ = grp.Key.PAYPD_,
        PAYDUE_ = grp.Key.PAYDUE_,
        AMOUNT = grp.Sum(a => a.AMOUNT),
        SUSP = grp.Sum(a => a.SUSP)
    })
    .AsEnumerable()
    .OrderByDescending(g => g.PAYDUE_)
    .Take(3);
I've noticed that, working with this old UniVerse data, things tend to break if they aren't cast with .AsEnumerable() before being utilized
This goes to the root of your problems. By saying AsEnumerable, you are forcing all records in the sequence at that point to be brought down, before you sort and take the first three. Obviously, this will get slower and slower for more data.
Fixing this could be difficult, given what you say. In general, LINQ providers provide varying amounts of functionality in terms of what can be evaluated on the server and what can't. From your above comment, it sounds like LINQ-to-UniVerse doesn't do particularly well at doing things on the server.
For example, I would expect any good database LINQ provider to be able to do (using made-up definitions)
context.Products.Where(p => p.Type == 4).OrderBy(p => p.Name)
on the server; however, your code above is more taxing. Try splitting it into smaller pieces and establishing whether it's possible to get the server to do the sort and the Take(3). It might be that the best approach is one query (which can be done on the server) to get the three most recent PAYDUE_ values, then another to actually get the amounts for those dates, pulling only the relevant records down to the client.
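To make that concrete, the split might look something like this - a sketch only, since whether each piece actually executes on the server depends on what the UniVerse LINQ provider can translate:

// Step 1: ask the server (if it can) for the three most recent due dates on the policy.
var lastThreeDueDates = mc.Paymnts
    .Where(p => p.POLICY.Contains(policy.ID))
    .Select(p => p.PAYDUE_)
    .Distinct()
    .OrderByDescending(d => d)
    .Take(3)
    .ToList();

// Step 2: pull down only the payments for those dates, then group and sum on the client.
var payments = mc.Paymnts
    .Where(p => p.POLICY.Contains(policy.ID) && lastThreeDueDates.Contains(p.PAYDUE_))
    .AsEnumerable()
    .GroupBy(p => new { p.PAYDUE_, p.PAYPD_ })
    .Select(grp => new GroupedPayments
    {
        PAYPD_ = grp.Key.PAYPD_,
        PAYDUE_ = grp.Key.PAYDUE_,
        AMOUNT = grp.Sum(a => a.AMOUNT),
        SUSP = grp.Sum(a => a.SUSP)
    })
    .OrderByDescending(g => g.PAYDUE_)
    .ToList();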
Assuming you're running against SQL Server, I would enable profiling; LINQ has a habit of not producing the SQL you'd like it to. It's much more likely that the slowdown comes from bad SQL than from in-memory operations.