System.OutOfMemoryException when trying to use ToDictionary on a LINQ Query - c#

I am trying to populate a dictionary object from a sales table that has 6 million records, because the original LINQ query times out. I am joining 3 or 4 tables to the sales table and it is just too much for SQL Server, so it times out. I thought I could pull the sales table into memory to prevent the timeout.
My code is:
var sales = dc.Sales
    .Where(c => c.Active == true)
    .Select(s => s)
    .ToDictionary(s => s.Id, s => s);
Does anyone know how to get my dictionary object to populate without raising an out of memory exception?

This question hints at a serious misunderstanding of how database querying works. Usually, when a query times out, you have to limit the data you are pulling back. If you really have to pull that much data (which I suspect is an error in the business requirements), then you should pull it in batches.
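As a rough sketch of batching, built from the code in the question (it assumes Id is a numeric, increasing key and Sale is the entity type behind dc.Sales; both are assumptions, not from the original post):

const int batchSize = 50000;
var sales = new Dictionary<int, Sale>();
int lastId = 0;

while (true)
{
    // Keyset paging: each round trip pulls the next slice, ordered by Id.
    var batch = dc.Sales
        .Where(s => s.Active && s.Id > lastId)
        .OrderBy(s => s.Id)
        .Take(batchSize)
        .ToList();

    if (batch.Count == 0)
        break;

    foreach (var s in batch)
        sales[s.Id] = s;

    lastId = batch[batch.Count - 1].Id;
}

Each query stays small enough to avoid the timeout, at the cost of multiple round trips.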
we normally use linq to sql
LINQ-to-SQL has been deprecated for years now. It is notoriously poor at generating usable SQL queries, which is probably why you are getting the timeouts. I would consider LINQ-to-SQL not only a bad choice, but an outright risk, a bomb in your software that is ready to explode. You should either move to Entity Framework with LINQ-to-Entities or use hand-crafted SQL. This hints at a second problem:
we load data into dictionaries to speed up the application and just join the dictionaries together and extract the data
I have never heard of anyone using this approach or considering it acceptable. SQL does this easily and much faster. If your team really thinks this is the way to do things, then some serious study of SQL is required.

Thanks everyone for the feedback. I found out what was causing the out of memory exception. I decided not to use a dictionary because it was just not going to work. I went to SQL Server, replicated my LINQ query, ran it, and only got 30,000 records back, so I knew the LINQ query itself was not causing the out of memory. My original query was just a simple lookup against the database with a few .Where() clauses. It turns out I was calling .Distinct() on the collection populated with those 30,000 records, and that was causing the out of memory exception. I removed the .Distinct() call from the LINQ query and it fixed it. The exception makes sense because comparing and de-duplicating 30,000 records in that collection was simply too much.

I would suggest using ctx.Configuration.LazyLoadingEnabled = true; to enable lazy loading, so that when data is required the related entities are loaded on their own. The loaded entities then remain in the context and can be reused as long as you do not close the context and open it again.
I have tried this with 100,000 entities, and with lazy loading the required objects open almost instantly.
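A minimal sketch of that setup (EF6-style DbContext shown; Customer is a hypothetical navigation property, and navigation properties must be virtual for the lazy-load proxies to work):

ctx.Configuration.LazyLoadingEnabled = true;
ctx.Configuration.ProxyCreationEnabled = true;   // proxies are what perform the lazy loads

var sale = ctx.Sales.First(s => s.Active);

// The related entity is only queried the first time it is touched,
// then stays attached to the context for reuse while the context is open.
var customerName = sale.Customer.Name;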

Related

In Entity Framework Core, how can I query data on joined tables with a condition at DB level (not local LINQ)?

As part of a small .NET Core 3 project, I'm trying to use a data model based on Entity Framework, but I'm having some trouble with queries on joined tables.
When looking for data matching a condition in a single table, the model is easy to understand
List<Element> listOfElements = context.Elements.Where(predicate).ToList();
However, when this query requires joined tables, I'm not sure how to do it efficiently. After some investigation, it seems that the Include (and ThenInclude) methods are the way to go, but I have the impression that the Where clause after the Include is not executed at DB level but only after all the data has been retrieved. This might work with small datasets, but I don't think it's a good idea for a production system with millions of rows.
List<Element> listOfElements = context.Elements
    .Include(x => x.SubElement)
    .Where(predicate)
    .ToList();
I've seen some examples using EF+ library, but I'm looking for a solution using the nominal EF Core. Is there any clean/elegant way to do it?
Thank you.
There are a few scenarios in which data from the DB gets populated:
Deferred query execution: this is when you try to access your query results, for example in a foreach statement.
Immediate query execution: this is when you call ToList() (or a conversion to another collection, like ToArray()).
I think the answer to your question:
... but I have the impression that the Where clause after the include is not executed at DB level but after all the data has been retrieved.
is that your assumption is wrong, because you are calling ToList() at the end, not before the Where method.
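A small sketch of the two execution modes using the question's entities (the filter property x.Name is hypothetical; any translatable predicate behaves the same way):

IQueryable<Element> query = context.Elements
    .Include(x => x.SubElement)
    .Where(x => x.Name == "sensor");   // composed into the SQL WHERE, not run in memory

// Immediate execution: ToList() issues one SELECT ... JOIN ... WHERE statement now.
List<Element> listOfElements = query.ToList();

// Deferred execution: the same SQL is only issued when enumeration starts.
foreach (var element in query)
{
    // process element
}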
For more information please also check here.
One more suggestion: to be sure about exactly what is executed at the DB level, run SQL Server Profiler while executing your query.
Hope this will help ))

Best way to improve performance in WPF application

I'm currently working on a WPF application which was built using Entity Framework (database first) to access data in a SQL Server database.
In the past, the database was on an internal server and I did not notice any performance problems, even though the database is very badly implemented (only tables; no views, indexes or stored procedures). I'm the one who created it, but it was my first job and I was not very good with databases, so I felt Entity Framework was the best approach so I could focus mainly on code.
However, the database is now on another server which is much slower. As you guessed, the application now has big performance issues (more than 10 seconds to load a dozen rows, the same to insert new rows, ...).
Should I stay with Entity Framework but try to improve performance by altering the database, adding views and stored procedures?
Should I get rid of Entity Framework and use only "basic" code (and improve the database at the same time)?
Is there a simple ORM I could use instead of EF?
Time is not an issue here; I can use all the time I want to improve the application, but I can't seem to decide on the best way to make my application evolve.
The database is quite simple (around 10 tables); the only thing that could complicate things is that I store files in there, so I'm not sure I can really use whatever I want. And I don't know if it's important, but I need to display quite a few calculated fields. Any advice?
Feel free to ask any relevant questions.
For performance profiling, the first place I recommend looking is an SQL profiler. This can capture the exact SQL statements that EF is running and help identify possible performance culprits. I cover a few of these here. The schema issues are probably the most relevant place to start. The title targets MVC, but most of the items apply to WPF and any other application.
A good, simple profiler that I use for SQL Server is ExpressProfiler. (https://github.com/OleksiiKovalov/expressprofiler)
With the move to a new server, now sending data over the wire rather than pulling from a local database, the performance issues you're noticing most likely fall under the category of "loading too much, too often". Now you won't only be waiting for the database to load the data, but also for it to be packaged up and sent over the wire. Also, does the new database hold the same data volume and serve only a single client, or does it now serve multiple clients? Another catch for developers is "works on my machine", where local testing databases are smaller and not dealing with concurrent queries from the server (where locks and such can impact performance).
From here, run a copy of the application with an isolated database server (no other clients hitting it to reduce "noise") with the profiler running against it. The things to look out for:
Lazy Loading - These are cases where you have queries to load data, but then see lots (dozens to hundreds) of additional queries being spun off. Your code may say "run this query and populate this data", which you expect to be 1 SQL query, but touching lazy-loaded properties can spin off a great many other queries.
The solution to lazy loading: if you need the extra data, eager load it with .Include(). If you only need some of the data, look into using .Select() to project view models / DTOs of the data you need rather than relying on complete entities. This eliminates lazy load scenarios, but may require some significant changes to your code to work with view models/DTOs. Tools like AutoMapper can help greatly here. Read up on .ProjectTo() to see how AutoMapper can work with IQueryable to eliminate lazy load hits.
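A hedged sketch of such a projection (the Order/OrderSummary names and columns are illustrative, not from the original application):

public class OrderSummary
{
    public int Id { get; set; }
    public DateTime OrderDate { get; set; }
    public int ItemCount { get; set; }
}

// EF translates the Select into the SQL column list, so only these three values
// come back and there is nothing left to lazy-load afterwards.
var summaries = context.Orders
    .Where(o => o.CustomerId == customerId)
    .Select(o => new OrderSummary
    {
        Id = o.Id,
        OrderDate = o.OrderDate,
        ItemCount = o.OrderItems.Count()   // evaluated in SQL as a COUNT subquery
    })
    .ToList();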
Reading too much - Loading entities can be expensive, especially if you don't need all of that data. Culprits for performance include excessive use of .ToList(), which materializes entire entity sets where only a subset of the data is needed, or where a simple exists check or count would suffice. For example, I've seen code that does stuff like this:
var data = context.MyObjects.SingleOrDefault(x => x.IsActive && x.Id == someId);
return (data != null);
This should be:
var isData = context.MyObjects.Where(x => x.IsActive && x.Id == someId).Any();
return isData;
The difference between the two is that in the first example, EF will effectively do a SELECT * operation, so in the case where data is present it will return back all columns into an entity, only to later check if the entity was present. The second statement will run a faster query to simply return back whether a row exists or not.
var myDtos = context.MyObjects.Where(x => x.IsActive && x.ParentId == parentId)
    .ToList()
    .Select(x => new ObjectDto
    {
        Id = x.Id,
        Name = x.FirstName + " " + x.LastName,
        Balance = calculateBalance(x.OrderItems.ToList()),
        Children = x.Children.ToList()
            .Select(c => new ChildDto
            {
                Id = c.Id,
                Name = c.Name
            }).ToList()
    }).ToList();
Statements like this can go on and get rather complex, but the real problem is the .ToList() before the .Select(). Often these creep in because devs try to do something EF doesn't understand, like calling a method (i.e. calculateBalance()), and it "works" only after adding .ToList(). The problem is that you are materializing the entire entity at that point and switching to LINQ-to-Objects. This means that any "touches" on related data, such as .Children, will now trigger lazy loads, and further .ToList() calls can pull yet more data into memory that could otherwise have been reduced in the query. The culprit to look out for is .ToList() calls: try removing them, select simpler values before calling .ToList(), and feed that data into view models where the view models can calculate the resulting data.
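As a hedged illustration of that advice, here is the same query with the premature .ToList() removed. OrderItemAmounts is a hypothetical DTO property I'm adding for the example; the idea is to project only the raw values the balance needs and compute it after the query runs rather than calling calculateBalance() inside the expression tree:

var myDtos = context.MyObjects
    .Where(x => x.IsActive && x.ParentId == parentId)
    .Select(x => new ObjectDto
    {
        Id = x.Id,
        Name = x.FirstName + " " + x.LastName,
        // Raw values only; the balance is calculated in the DTO or a mapping step.
        OrderItemAmounts = x.OrderItems.Select(i => i.Amount),
        Children = x.Children.Select(c => new ChildDto
        {
            Id = c.Id,
            Name = c.Name
        })
    })
    .ToList();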
The worst culprit like this I've seen was due to a developer wanting to use a function in a Where clause:
var data = context.MyObjects.ToList().Where(x => calculateBalance(x) > 0).ToList();
That first ToList() call will attempt to materialize the whole table as entities in memory. A big performance impact, beyond the time/memory/bandwidth needed to load all of this data, is simply the number of locks the database must take to reliably read/write data. The fewer rows you "touch" and the shorter you touch them for, the better your queries will play with concurrent operations from multiple clients. These problems magnify greatly as systems transition to being used by more users.
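One hedged way out of that pattern is to let SQL do the cheap filtering first and project just the fields the balance calculation needs, so only a slim result set is materialized before the in-memory filter runs (Price and Quantity are illustrative column names):

var slim = context.MyObjects
    .Where(x => x.IsActive)                 // translated to SQL
    .Select(x => new
    {
        x.Id,
        Items = x.OrderItems.Select(i => new { i.Price, i.Quantity })
    })
    .ToList();                              // materializes a projection, not whole entities

var withBalance = slim
    .Where(x => x.Items.Sum(i => i.Price * i.Quantity) > 0)   // in-memory equivalent of calculateBalance
    .ToList();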
Provided you've eliminated extra lazy loads and unnecessary queries, the next thing to look at is query performance. For operations that seem slow, copy the SQL statement out of the profiler and run that in the database while reviewing the execution plan. This can provide hints about indexes you can add to speed up queries. Again, using .Select() can greatly increase query performance by using indexes more efficiently and reducing the amount of data the server needs to pull back.
For file storage: Are these stored as columns in a relevant table, or in a separate table that is linked to the relevant record? What I mean by this, is if you have an Invoice record, and also have a copy of an invoice file saved in the database, is it:
Invoices
    InvoiceId
    InvoiceNumber
    ...
    InvoiceFileData

or

Invoices
    InvoiceId
    InvoiceNumber
    ...

InvoiceFile
    InvoiceId
    InvoiceFileData
It is a better structure to keep large, seldom used data in separate tables rather than combined with commonly used data. This keeps queries to load entities small and fast, where that expensive data can be pulled up on-demand when needed.
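A hedged sketch of what that separation can look like as EF entities (the names are illustrative); the blob sits behind its own entity sharing the invoice's primary key, so routine invoice queries never drag it back:

public class Invoice
{
    public int InvoiceId { get; set; }
    public string InvoiceNumber { get; set; }
    public virtual InvoiceFile File { get; set; }   // only loaded when explicitly requested
}

public class InvoiceFile
{
    public int InvoiceId { get; set; }              // shared primary key with Invoice
    public byte[] InvoiceFileData { get; set; }
}

// Pull the expensive file data on demand:
// var file = context.InvoiceFiles.SingleOrDefault(f => f.InvoiceId == invoiceId);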
If you are using GUIDs for keys (as opposed to ints/longs), are you leveraging newsequentialid()? (assuming SQL Server) Keys set with newid(), or Guid.NewGuid() in code, will lead to index fragmentation and poor performance. If you populate the IDs via database defaults, switch them over to newsequentialid() to help reduce the fragmentation. If you populate IDs via code, have a look at writing a GUID generator that mimics newsequentialid() (SQL Server) or a pattern suited to your database. SQL Server and Oracle store/index GUID values differently, so having the "static-like" part of the UUID bytes in the higher-order vs. lower-order bytes of the data will aid indexing performance. Also consider index maintenance and other database maintenance jobs to help keep the database server running efficiently.
When it comes to index tuning, database server reports are your friends. After you've eliminated most, or at least some, serious performance offenders from your code, the next thing is to look at real-world use of your system. The best guide for where to target your code/index investigations is the set of most-used and problem queries that the database server identifies. Where these are EF queries, you can usually reverse-engineer which EF query is responsible from the tables being hit. Grab these queries and feed them through the execution plan to see if there is an index that might help. Indexing is something developers either forget or get prematurely concerned about: too many indexes can be just as bad as too few. I find it's best to monitor real-world usage before deciding on what indexes to add.
This should hopefully give you a start on things to look for and kick the speed of that system up a notch. :)
First, you need to run a performance profiler and find out what the bottleneck is; it could be the database, the Entity Framework configuration, the Entity Framework queries, and so on.
In my experience, Entity Framework is a good option for this kind of application, but you need to understand how it works.
Also, which version of Entity Framework are you using? The latest version is 6.2 and has some performance improvements that older ones do not have, so if you are using an old one I suggest updating it.
Based on the comments I am going to hazard a guess that it is mostly a bandwidth issue.
You had an application that was working fine when it was co-located, perhaps a single switch, gigabit ethernet and 200m of cabling.
Now that application is trying to send or retrieve data to/from a remote server, probably over the public internet through an unknown number of internal proxies in contention with who knows what other traffic, and it doesn't perform as well.
You also mention that you store files in the database, and your schema has fields like Attachment.data and Doc.file_content. This suggests that you could be trying to transmit large quantities (perhaps megabytes) of data for a simple query and that is where you are falling down.
Some general pointers:
Add indexes anywhere you are joining tables or filtering on values you commonly query.
Be aware of the difference between lazy and eager loading in Entity Framework. There is no right or wrong answer, but you should know which approach you are using and why.
Split any file content into its own table, with the same primary key as the main table, or play with different EF classes to make sure you only retrieve files when you need to use them.

.Include() vs .Load() performance in EntityFramework

When querying a large table where you need to access the navigation properties later in code (I explicitly don't want to use lazy loading), which will perform better: .Include() or .Load()? And why use one over the other?
In this example the included tables all only have about 10 entries and Employees has about 200 entries, and it can happen that most of those will be loaded anyway with Include because they match the Where clause.
Context.Measurements.Include(m => m.Product)
    .Include(m => m.ProductVersion)
    .Include(m => m.Line)
    .Include(m => m.MeasureEmployee)
    .Include(m => m.MeasurementType)
    .Where(m => m.MeasurementTime >= DateTime.Now.AddDays(-1))
    .ToList();
or
Context.Products.Load();
Context.ProductVersions.Load();
Context.Lines.Load();
Context.Employees.Load();
Context.MeasurementType.Load();
Context.Measurements.Where(m => m.MeasurementTime >= DateTime.Now.AddDays(-1))
.ToList();
It depends, try both
When using Include(), you get the benefit of loading all of your data in a single call to the underlying data store. If this is a remote SQL Server, for example, that can be a major performance boost.
The downside is that Include() queries tend to get really complicated, especially if you have any filters (Where() calls, for example) or try to do any grouping. EF will generate very heavily nested queries using sub-SELECT and APPLY statements to get the data you want. It is also much less efficient -- you get back a single row of data with every possible child-object column in it, so data for your top level objects will be repeated a lot of times. (For example, a single parent object with 10 children will produce 10 rows, each with the same data for the parent-object's columns.) I've had single EF queries get so complex they caused deadlocks when running at the same time as EF update logic.
The Load() method is much simpler. Each query is a single, easy, straightforward SELECT statement against a single table. These are much easier in every possible way, except you have to do many of them (possibly many times more). If you have nested collections of collections, you may even need to loop through your top level objects and Load their sub-objects. It can get out of hand.
Quick rule-of-thumb
Try to avoid having any more than three Include calls in a single query. I find that EF's queries get too ugly to recognize beyond that; it also matches my rule-of-thumb for SQL Server queries, that up to four JOIN statements in a single query works very well, but after that it's time to consider refactoring.
However, all of that is only a starting point.
It depends on your schema, your environment, your data, and many other factors.
In the end, you will just need to try it out each way.
Pick a reasonable "default" pattern to use, see if it's good enough, and if not, optimize to taste.
Include() will be written to SQL as JOIN: one database roundtrip.
Each Load()-instruction is "explicitly loading" the requested entities, so one database roundtrip per call.
Thus Include() will most probably be the more sensible choice in this case, but it depends on the database layout, how often this code is called and how long your DbContext lives. Why don't you try both ways and profile the queries and compare the timings?
See Loading Related Entities.
I agree with @MichaelEdenfield in his answer, but I did want to comment on the nested collections scenario. You can get around having to do nested loops (and the many resulting calls to the database) by turning the query inside out.
Rather than looping down through a Customer's Orders collection and then performing another nested loop through each Order's OrderItems collection, say, you can query the OrderItems directly with a filter such as the following.
context.OrderItems.Where(x => x.Order.CustomerId == customerId);
You will get the same resulting data as the Loads within nested loops but with just a single call to the database.
Also, there is a special case that should be considered with Includes. If the relationship between the parent and the child is one to one then the problem with the parent data being returned multiple times would not be an issue.
I am not sure what the effect would be if the majority case was where no child existed - lots of nulls? Sparse children in a one to one relationship might be better suited to the direct query technique that I outlined above.
Include is an example of eager loading, where you load not only the entities you are querying for but also all related entities.
Load is a manual override for when lazy loading is disabled: even with LazyLoadingEnabled set to false, you can still load the related entities you need on demand by calling .Load() explicitly.
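A minimal sketch of that explicit-loading pattern against the question's model (EF6-style configuration shown; EF Core exposes the same Entry/Reference/Load calls):

Context.Configuration.LazyLoadingEnabled = false;   // no lazy loading

var measurement = Context.Measurements
    .First(m => m.MeasurementTime >= DateTime.Now.AddDays(-1));

// The navigation property stays null until it is loaded explicitly:
Context.Entry(measurement).Reference(m => m.Product).Load();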
It's always hard to decide whether to go with Eager, Explicit or even Lazy Loading.
What I would recommend anyway is always to perform some profiling. That's the only way to be sure your request will be performant or not.
There're a lot of tools that will help you out. Have a look at this article from Julie Lerman where she lists several different ways to do profiling. One simple solution is to start profiling in your SQL Server Management Studio.
Do not hesitate to talk with a DBA (if you have one near you) who can help you understand the execution plan.
You could also have a look a this presentation where I wrote a section about loading data and performance.
One more thing to add to this thread: it depends on what database you use. If you are working with SQL Server it's fine to use eager loading, but with SQLite you will have to use .Load() to avoid cross-loading exceptions, because SQLite cannot handle some Include statements that go deeper than one dependency level.
Updated answer: As of EF Core 5.0 you can use AsSplitQuery().
This is particularly useful and I personally use it all the time when I have many joins
which will result in a possible cartesian explosion, or will just take more time to complete.
As the name implies, EF will execute separate queries for each Entity instead of using joins.
So, where you would use Explicit loading, you can now use Eager loading with split queries to achieve the same result, and it's definitely more readable imo.
See https://learn.microsoft.com/en-us/ef/core/querying/single-split-queries
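A hedged sketch of what that looks like with the entities from the question (EF Core 5 or later):

var measurements = Context.Measurements
    .Include(m => m.Product)
    .Include(m => m.MeasureEmployee)
    .Where(m => m.MeasurementTime >= DateTime.Now.AddDays(-1))
    .AsSplitQuery()    // one SELECT per included navigation instead of a single large join
    .ToList();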

Entity Framework - Excessive data loads with Views

We are having a problem navigating between entities one of which is based on a view.
The problem is when we go
TableEntity.ViewEntity.Where(x => x.Id == Id).FirstOrDefault()
In the background it is loading all records in the view which is not what we want or expect.
However when we go
_objectContext.TableEntityView
    .Where(x => x.TableObjectId == TableObjectId && x.Id == Id)
Then it just loads up the one row which is what we are expecting
In short using the navigation properties causes a massive data load – it’s like the query is being realised early.
We are using EF 4 with a SQL Server 2005 database. The views are used to provide aggregated information which EF couldn't easily produce without big data loads (ironically). We have manually constructed 1:many associations between the views.
Why then do we get the large data load in the first instance but not the second?
Many thanks for all/any help
That's how navigation collections work in EF: accessing the collection loads all entities, and any linq queries you run thereafter simply query against the objects in memory. I don't think there's anything you can do about it short of a custom query like you've already done.
FWIW I'm told NHibernate supports more fine-grained navigation loads, but that feature has yet to make its way into Entity Framework.
EDIT
This answer from Ladislav Mrnka shows a possible solution to your problem from the CTP days. Not sure if anything has changed since then. It uses the DbContext, so you still won't be able to just plow through the navigation property, but it's probably as close as you're going to get.
int count = context.Entry(myAccount)
    .Collection(a => a.Orders)
    .Query()
    .Count();
or for your case, I'm guessing it would be
TableEntityView obj = context.Entry(TableEntity)
    .Collection(a => a.ViewEntity)
    .Query()
    .FirstOrDefault(x => x.Id == Id);
I've had some issues with the way that Entity Framework generates SQL, so first of all I suggest that you use LINQPad and one or more of the following: Entity Framework Profiler (paid software), SQL Profiler (assuming you are using SQL Server), and/or the EFTracingProvider.
I've had issues where the same table was inner joined several times on some queries so having those tools generally helps find out what's causing the issue.
The things that I've tried that have often made some queries run faster:
Writing a full Linq Query rather than using Lambda expressions: they are often easier to read and they look a lot more like sql so it's easier to see the relationship between your code and the generated sql
and
EntitySet.Include(x=>x.Property)
This tells Linq2Entities to include the property in the query

A Better DataTable

I have an application that uses DataTables to perform grouping, filtering and aggregation of data. I want to replace the DataTables with my own data structures so we don't have the unnecessary overhead that comes with DataTables. So my question is: can LINQ be used to perform the grouping, filtering and aggregation of my data, and if it can, is the performance comparable to DataTables, or should I just hunker down and write my own algorithms to do it?
Thanks
Dan R.
Unless you go for simple classes (POCO etc), your own implementation is likely to have nearly as much overhead as DataTable. Personally, I'd look more at using tools like LINQ-to-SQL, Entity Framework, etc. Then you can use either LINQ-to-Objects against local data, or the provider-specific implementation for complex database queries without pulling all the data to the client.
LINQ-to-Objects can do all the things you mention, but it involves having all the data in memory. If you have non-trivial data, a database is recommended. SQL Server Express Edition would be a good starting point if you look at LINQ-to-SQL or Entity Framework.
Edited re comment:
Regular TSQL commands are fine and dandy, but you ask about the difference... the biggest being that LINQ-to-SQL will provide the entire DAL for you, which is a huge time saver, as well as making it possible to get a lot more compile-time safety. But it also allows you to use the same approach to look at your local objects and your database - for example, the following is valid C# 3.0 (except for [someDataSource], see below):
var qry = from row in [someDataSource]
          group row by row.Category into grp
          select new { Category = grp.Key, Count = grp.Count(),
                       TotalValue = grp.Sum(x => x.Value) };

foreach (var x in qry) {
    Console.WriteLine("{0}, {1}, {2}", x.Category, x.Count, x.TotalValue);
}
If [someDataSource] is local data, such as a List<T>, this will execute locally; but if this is from your LINQ-to-SQL data-context, it can build the appropriate TSQL at the database server. This makes it possible to use a single query mechanism in your code (within the bounds of LOLA, of course).
You'd be better off letting your database handle grouping, filtering and aggregation. DataTables are actually relatively good at this sort of thing (their bad reputation seems to come primarily from inappropriate usage), but not as good as an actual database. Moreover, without a lot of work on your part, I would put my money on the DataTable's having better performance than your homegrown data structure.
Why not use a local database like SQL Server CE or embedded Firebird? (or even MS Access! :)) Store the data in the local database, do the processing using simple SQL queries, and pull the data back. Much simpler and likely less overhead, plus you don't have to write all the logic for grouping/aggregates etc., as the database systems already have that logic built in, debugged and working.
Yes, you can use LINQ to do all those things using your custom objects.
And I've noticed a lot of people suggest that you do this type of stuff in the database... but you never indicated where the data was coming from.
If the data is coming from the database then at the very least the filtering should probably happen there, unless you are doing something specialized (e.g. working from a cached set of data). And even then, if you are working with a significant amount of cached data, you might do well to put that data into an embedded database like SQLite, as someone else has already mentioned.
