I'm paging through 150k rows in pages of 10,000. That is, I fetch 10k rows from the database, iterate over them, then fetch the next 10k and do the same until there are no more rows. The problem is that as I fetch more rows I can see the memory graph climbing in the Performance tab of Task Manager, and each iteration that fetches the next 10k rows takes longer and longer, until an OutOfMemoryException is thrown.
The query is a join of 6 tables. I load the results into a list using EF 4.
At the end of each iteration I clear the list, set it to null and call GC.Collect(), but this has no effect.
What can I do to free the memory of the rows I have already processed?
I faced a very similar issue when attempting to return large datasets from a database.
Executing the query from a BackgroundWorker removed the load from the UI thread, which reduced the complexity of the task and eliminated this issue.
Before using this technique for my query I watched the application go from a few hundred MB to top out at about 1.2GB before throwing the OutOfMemoryException. After implementing the BackgroundWorker, the increase was only about 10-15MB, and memory dropped back down once the work had completed. The query also seemed to execute slightly faster.
Yes, I realise this doesn't sound like it would work, but it actually did and I would recommend it.
Additionally, if you are running on a 64-bit OS (and targeting a 64-bit architecture) you can raise the memory limit from about 1.2GB to 4GB. This option can be found in the Project Properties under the Build tab...
Depending on the DB, you may be able to shift some of the work onto it by creating a view and querying that to gather your result set. (Update)
I'm pretty sure the problem is caused by holding the same context instance to fetch all the pages. That way, even though you no longer need the previously fetched entities, the context still keeps track of all of them.
You should create a new DbContext instance before fetching each new batch of records.
You could also try using the AsNoTracking() method.
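A minimal sketch of both suggestions combined is below; MyContext, Orders, Order and Process() are placeholders for your own types, and AsNoTracking() assumes the DbContext API (EF 4.1+). With a plain EF 4 ObjectContext the equivalent is MergeOption.NoTracking.

    using System.Collections.Generic;
    using System.Data.Entity;
    using System.Linq;

    static void ProcessAllPages()
    {
        const int pageSize = 10000;
        int pageIndex = 0;
        List<Order> batch;

        do
        {
            // A fresh, short-lived context per page keeps the change tracker
            // from accumulating all 150k entities over the run.
            using (var context = new MyContext())
            {
                batch = context.Orders
                               .AsNoTracking()          // read-only: no change tracking
                               .OrderBy(o => o.Id)      // a stable order is required for Skip/Take
                               .Skip(pageIndex * pageSize)
                               .Take(pageSize)
                               .ToList();
            }

            foreach (var order in batch)
            {
                Process(order);                          // your per-row work
            }

            pageIndex++;
        } while (batch.Count == pageSize);
    }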
I have an SSIS data flow task which loads XML data into a SQL database - more than 60,000 XML files. The first few thousand XML files load into the table quickly, but as time progresses the loading speed drops drastically.
The first 10k files load in roughly 10 minutes, the next 10k take 25 minutes, and then performance slowly degrades further. By the time all 60k+ files have been loaded, it has taken around 4 hours.
Is there any way to keep the performance in check and load the later files at the same speed as the initial ones?
I have tried bulk copy in C# too, but the issue exists there as well. Is there any workaround to improve the performance?
Parts of your code would make it easier for us to give you tips and ideas!
I believe this issue is memory related. Are you reading all of the files into memory before putting them in the SQL database?
Check Task Manager! If the memory usage keeps growing and growing, you have a potential memory issue.
I don't know how the files are stored or named, but if you could, why not work with around 1-5,000 at a time, move them, and then take the next batch?
Try doing it with multiple DFTs instead of a single DFT, limiting each one to around 5k/10k files. That should hopefully bring the overall time down.
Also, the difference in time might be due to the indexing on the table. Remove the indexing, load the records, then reapply the indexing once the loading is done. Querying record sets on an indexed table is fast, but performing inserts on an indexed table, and 60k records' worth at that, is a time-consuming process.
1. Execute SQL Task (drop the index before loading)
2. For Loop (multiple control flows for the XML file load)
3. Execute SQL Task (recreate the index)
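Since the question mentions having tried bulk copy in C#, here is a rough sketch of the same pattern (drop the index, bulk load in batches, recreate the index) using SqlBulkCopy; the table, index and column names are only placeholders:

    using System.Data;
    using System.Data.SqlClient;

    static void BulkLoad(DataTable rows, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Drop the nonclustered index so the inserts don't pay for index maintenance.
            using (var drop = new SqlCommand("DROP INDEX IX_XmlData_Key ON dbo.XmlData", connection))
                drop.ExecuteNonQuery();

            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "dbo.XmlData";
                bulk.BatchSize = 5000;        // commit in chunks rather than one huge batch
                bulk.BulkCopyTimeout = 0;     // no timeout for a long-running load
                bulk.WriteToServer(rows);
            }

            // Recreate the index once all the data is in.
            using (var create = new SqlCommand(
                "CREATE NONCLUSTERED INDEX IX_XmlData_Key ON dbo.XmlData (XmlKey)", connection))
                create.ExecuteNonQuery();
        }
    }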
I need to load multiple SQL statements from SQL Server into DataTables. Most of the statements return some 10,000 to 100,000 records and each takes up to a few seconds to load.
My guess is that this is simply due to the amount of data that needs to be shoved around. The statements themselves don't take much time to process.
So I tried using Parallel.For() to load the data in parallel, hoping the overall processing time would decrease. I do get a 10% performance increase, but that is not enough. One reason might be that my machine is only a dual core, which limits the benefit here. The server on which the program will be deployed has 16 cores, though.
My question is how I could improve the performance further. Would Asynchronous Data Service Queries (BeginExecute, etc.) be a better solution than PLINQ? Or maybe some other approach?
The SQL Server is running on the same machine. This is also the case on the deployment server.
EDIT:
I've run some tests using a DataReader instead of a DataTable. This already decreased the load times by about 50%. Great! Still, I am wondering whether parallel processing with BeginExecute would improve the overall load time on a multiprocessor machine. Does anybody have experience with this? Thanks for any help!
UPDATE:
I found that about half of the loading time was consumed by processing the SQL statements. In SQL Server Management Studio the statements took only a fraction of the time, but somehow they take much longer through ADO.NET. So by using DataReaders instead of loading DataTables, and by adapting the SQL statements, I've come down to about 25% of the initial loading time. Loading the DataReaders in parallel threads with Parallel.For() does not make an improvement here. So for now I am happy with the result and will leave it at that. Maybe when we update to .NET 4.5 I'll give asynchronous DataReader loading a try.
My guess is that this is simply due to the amount of data that needs to be shoved around.
No, it is due to using a SLOW framework. I am pulling nearly a million rows into a dictionary in less than 5 seconds in one of my apps. DataTables are SLOW.
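For reference, a minimal sketch of the DataReader-into-a-dictionary approach described above; the query and column names are placeholders:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    static Dictionary<int, string> LoadLookup(string connectionString)
    {
        var result = new Dictionary<int, string>();

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("SELECT Id, Name FROM dbo.Items", connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Typed, ordinal-based reads avoid the per-row bookkeeping a DataTable does.
                    result.Add(reader.GetInt32(0), reader.GetString(1));
                }
            }
        }

        return result;
    }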
You have to change the nature of the problem. Let's be honest, who needs to view 10,000 to 100,000 records per request? I think no one.
You need to consider handling paging, and in your case the paging should be done on the SQL Server side. To make this clear, let's say you have a stored procedure named "GetRecords". Modify it to accept a page parameter and return only the data relevant to that specific page (say, 100 records) along with the total page count. Inside the app, just show those 100 records (they will fly) and handle the selected page index.
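As an illustration of server-side paging, here is a rough sketch that fetches one page per call. It assumes SQL Server 2012+ for OFFSET/FETCH (on older versions the stored procedure can use ROW_NUMBER() instead), and the table and column names are placeholders:

    using System.Data;
    using System.Data.SqlClient;

    static DataTable GetPage(string connectionString, int pageIndex, int pageSize)
    {
        const string sql = @"
            SELECT Id, Name, CreatedOn
            FROM   dbo.Records
            ORDER  BY Id
            OFFSET @Offset ROWS FETCH NEXT @PageSize ROWS ONLY;";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.Add("@Offset", SqlDbType.Int).Value = pageIndex * pageSize;
            command.Parameters.Add("@PageSize", SqlDbType.Int).Value = pageSize;

            var page = new DataTable();
            using (var adapter = new SqlDataAdapter(command))
            {
                adapter.Fill(page);   // only one page of rows crosses the wire per call
            }
            return page;
        }
    }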
Hope this helps, best regards!
Do you often have to load these requests? If so, why not use a distributed cache?
I am working on a batch process which dumps ~800,000 records from a slow legacy database (1.4-2ms per record fetch time... it adds up) into MySQL, which can perform a little faster. To optimize this, I have been loading all of the MySQL records into memory, which puts usage at about 200MB. Then I start dumping from the legacy database and updating the records.
Originally, when the updates completed I would call SaveContext, which would make my memory jump from ~500-800MB to 1.5GB. Very soon I would get out-of-memory exceptions (the virtual machine this runs on has 2GB of RAM), and even if I gave it more RAM, 1.5-2GB is still a little excessive; that would just be putting a band-aid on the problem. To remedy this, I started calling SaveContext every 10,000 records, which helped things along a bit, and since I was using delegates to fetch the data from the legacy database and update it in MySQL, I didn't take too bad a performance hit: after the 5-second or so wait while it saved, it would run through the in-memory updates for the 3,000 or so records that had backed up. However, the memory usage still keeps going up.
Here are my potential issues:
The data comes out of the legacy database in no particular order, so I can't chunk the updates and periodically release the ObjectContext.
If I don't grab all of the data out of MySQL beforehand and instead look it up record by record during the update process, it is incredibly slow. Instead, I grab it all beforehand, cast it into a dictionary indexed by the primary key, and as I update the data I remove the records from the dictionary.
One possible solution I thought of is to somehow free the memory used by entities that I know I will never touch again, since they have already been updated (like clearing the cache, but only for a specific item), but I don't know if that is even possible with Entity Framework.
Does anyone have any thoughts?
You can call the Detach method on the context, passing it the object you no longer need:
http://msdn.microsoft.com/en-us/library/system.data.objects.objectcontext.detach%28v=vs.90%29.aspx
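A rough sketch of how that can look in a batched update loop; MyEntity is a placeholder, and processedBatch is assumed to hold entities that have already been saved and will not be touched again:

    using System.Collections.Generic;
    using System.Data.Objects;

    static void FlushBatch(ObjectContext context, List<MyEntity> processedBatch)
    {
        context.SaveChanges();

        foreach (var entity in processedBatch)
        {
            // Detach releases the ObjectStateManager's reference so the entity can be collected.
            context.Detach(entity);
        }

        processedBatch.Clear();
    }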
I'm wondering if your best bet isn't another tool, as previously suggested, or simply forgoing the use of Entity Framework. If you instead write the code without an ORM, you can:
Tune the SQL statements to improve performance
Easily control, and change, the scope of transactions to get the best performance.
Batch the updates so that multiple updates are sent to the server in a single call instead of being performed one at a time.
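As a rough sketch of the batching point (table and column names are placeholders), many updates can be sent in one command and one transaction. Note that SQL Server caps a command at 2100 parameters, so very large sets would still need to be split into chunks:

    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.Text;

    static void BatchUpdate(string connectionString, IList<KeyValuePair<int, decimal>> updates)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var transaction = connection.BeginTransaction())
            using (var command = connection.CreateCommand())
            {
                command.Transaction = transaction;

                // Build one multi-statement command; keep below the 2100-parameter limit per call.
                var sql = new StringBuilder();
                for (int i = 0; i < updates.Count; i++)
                {
                    sql.AppendFormat("UPDATE dbo.Accounts SET Balance = @b{0} WHERE Id = @id{0};\n", i);
                    command.Parameters.AddWithValue("@b" + i, updates[i].Value);
                    command.Parameters.AddWithValue("@id" + i, updates[i].Key);
                }

                command.CommandText = sql.ToString();
                command.ExecuteNonQuery();   // all updates reach the server in one round trip
                transaction.Commit();
            }
        }
    }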
I am reading a text file into a database through EF4. This file has over 600,000 rows in it, so speed is important.
If I call SaveChanges after creating each new entity object, then this process takes about 15 mins.
If I call SaveChanges after creating 1024 objects, then it is down to 4 mins.
1024 was an arbitrary number I picked; it has no reference point.
However, I wondered if there WAS an optimum number of objects to load into my Entity Set before calling SaveChanges?
And if so, how do you work it out (other than trial and error)?
This is actually a really interesting issue: EF becomes much slower as the context gets very large. You can combat this and make drastic performance improvements by disabling AutoDetectChanges for the duration of your batch insert. In general, however, the more items you can include in a single SQL transaction, the better.
Take a look at my post on EF performance here, http://blog.staticvoid.co.nz/2012/03/entity-framework-comparative.html, and my post on how disabling AutoDetectChanges improves this here, http://blog.staticvoid.co.nz/2012/05/entityframework-performance-and.html. These will also give you a good idea of how batch size affects performance.
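A minimal sketch of what that batching can look like; MyContext, Records and ParseLine() are placeholders, and AutoDetectChangesEnabled assumes the DbContext API (EF 4.1+):

    using System.IO;

    static void ImportFile(string path)
    {
        const int batchSize = 1024;   // tune this by measuring, as noted above
        int pending = 0;

        var context = new MyContext();
        context.Configuration.AutoDetectChangesEnabled = false;   // skip the per-Add change scan
        try
        {
            foreach (var line in File.ReadLines(path))
            {
                context.Records.Add(ParseLine(line));

                if (++pending == batchSize)
                {
                    context.SaveChanges();
                    context.Dispose();                  // a fresh context keeps the tracker small
                    context = new MyContext();
                    context.Configuration.AutoDetectChangesEnabled = false;
                    pending = 0;
                }
            }

            context.SaveChanges();                      // flush the final partial batch
        }
        finally
        {
            context.Dispose();
        }
    }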
Profile the application and see what's taking the time at the two batch sizes you've chosen. That should give you some good numbers to extrapolate from.
Personally, I have to question why you are using EF for loading a text file at all - it seems like gross overkill for something that should be easy to fire into a DB using BCP or straight SQL commands.
I have written an ETL process that needs to handle more than 100+ million rows overall, covering 2 years' worth of records. To avoid out-of-memory issues, we chunk the data loading down to 7 days at a time. For each chunk, the process loads all of the required reference data, then opens a SQL connection and loads the source data row by row, transforms it, and writes it to the data warehouse.
The drawback of processing the data in chunks is that it is slow.
This process has been working fine for most of the tables, but there is one table where I still run out of memory because the process loads too much reference data. I would like to avoid chunking the data down to 3 days so that it keeps a decent level of performance.
Are there any other strategies I can use to avoid the OutOfMemoryException?
For example: a local database, writing the reference data to files, spawning another .NET process to hold more memory in Windows, using a CLR stored procedure to do the ETL...
Environment: Windows 7 32-bit OS, 4 GB of RAM, SQL Server Standard Edition.
One solution would be to use a stored procedure and let SQL Server handle the ETL. However, I am trying to avoid that because the program needs to support Oracle as well.
Other performance improvements I have tried include adding indexes to speed up the loading queries and creating a custom data access class that loads only the necessary columns instead of the entire row into memory.
Thanks
Without knowing exactly how you process the data it is hard to say, but a naive solution that can be implemented in any case is to use a 64-bit OS and compile your application as 64-bit. In 32-bit mode the .NET heap will only grow to about 1.5GB, which might be what is limiting you.
I know it's an old post, but this is for people searching for better ways to write data operations in their own code.
I am not sure whether you have considered studying how ETL tools perform their data loading operations and replicating a similar strategy in your code.
One such suggestion is parallel data pipes. Each pipe performs the ETL on a single chunk, based on a partitioning of the source data. For example, you could consider spawning processes for different weeks of data in parallel. This still does not solve the memory issues within a single process, but it can be used if you hit a limit on heap allocation within one process, and it is also useful for reading the data in parallel with random access. It does, however, require a master process to coordinate and complete the work as a single ETL operation.
I assume your transformation performs a lot of lookup operations before finally writing the data to the database. Assuming the master transaction table is huge and the reference data is small, you need to focus on the data structure operations and the algorithm. A few tips follow; consider the characteristics of your data before choosing what suits best when writing the algorithm.
Generally, lookup data (reference data) is stored in a cache. Choose a simple data structure that is efficient for read and search operations (say, an array list). If possible, sort this array by the key you will join on, so your search algorithm can be efficient.
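For example, a small sketch of a sorted in-memory lookup; RefRow is a placeholder for your reference-data type:

    using System.Collections.Generic;

    class RefRow
    {
        public int Key;
        public string Value;
    }

    class RefKeyComparer : IComparer<RefRow>
    {
        public int Compare(RefRow a, RefRow b) { return a.Key.CompareTo(b.Key); }
    }

    // Sort the cached reference data once by the join key, then binary-search it for
    // each incoming row instead of scanning the whole list every time.
    static RefRow Lookup(List<RefRow> sortedByKey, int key)
    {
        int index = sortedByKey.BinarySearch(new RefRow { Key = key }, new RefKeyComparer());
        return index >= 0 ? sortedByKey[index] : null;
    }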
There are different strategies for the lookup operations in your transformation tasks; in the database world these are called join operations.
Merge Join algorithm:
Ideal when the source is already sorted on the join attribute key. The key idea of the sort-merge algorithm is to first sort the relations by the join attribute, so that interleaved linear scans encounter the matching sets at the same time. For sample code, see https://en.wikipedia.org/wiki/Sort-merge_join.
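A minimal C# sketch of the same idea, assuming both inputs are already sorted by an integer key and the keys are unique on both sides (duplicate keys would need an inner loop over each matching run):

    using System;
    using System.Collections.Generic;

    static IEnumerable<Tuple<TLeft, TRight>> MergeJoin<TLeft, TRight>(
        IList<TLeft> left, IList<TRight> right,
        Func<TLeft, int> leftKey, Func<TRight, int> rightKey)
    {
        int i = 0, j = 0;
        while (i < left.Count && j < right.Count)
        {
            int comparison = leftKey(left[i]).CompareTo(rightKey(right[j]));
            if (comparison < 0) i++;        // left key is behind: advance the left cursor
            else if (comparison > 0) j++;   // right key is behind: advance the right cursor
            else
            {
                yield return Tuple.Create(left[i], right[j]);   // keys match: emit the pair
                i++;
                j++;
            }
        }
    }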
Nested Join:
This works like a nested loop, where each value of the index of the outer loop is taken as a limit (or starting point, or whatever is applicable) for the index of the inner loop, and the corresponding actions are performed on the statement(s) following the inner loop. So basically, if the outer loop executes R times and for each such execution the inner loop executes S times, then the total cost, or time complexity, of the nested loop is O(RS).
Nested-loop joins provide efficient access when tables are indexed on the join columns. Furthermore, in many small transactions, such as those affecting only a small set of rows, index nested-loop joins are far superior to both sort-merge joins and hash joins.
I am only describing two methods that could be used for your lookup operation. The main idea in ETL is to look up and retrieve the tuples (as a set) for further operations: the search is based on the key, and the resulting transaction keys extract all the relevant records (projection). Take these and load the rows from the file in a single read operation. This is more of a suggestion in case you don't need all of the records for the transformation operations.
Another very costly operation is writing back to the database. There can be a tendency to process the extraction, transformation and loading one row at a time. Think about operations that can be vectorized, where you perform them in bulk with a data structure operation: for example, a lambda operation over a multidimensional vector rather than looping over every row and performing transformations across all columns one row at a time. The vector can then be written to a file or the database in one go. This avoids memory pressure.
This was a very old question, and it is more of a design question; I am sure there are many solutions to it, short of getting into more specific details.
Ultimately, I wrote a SQL stored procedure using MERGE to handle the ETL process for the data type that took too long to process through the C# application. In addition, the business requirements changed so that we dropped Oracle support and only support 64-bit servers, which reduced maintenance cost and avoided the ETL out-of-memory issue.
In addition, we added many indexes whenever we saw an opportunity to improve query performance.
Instead of chunking only by a date range, the ETL process also chunks the data by count (5,000) and commits each transaction; this reduced the transaction log file size, and if the ETL fails the process only needs to roll back a subset of the data.
Lastly, we implemented (key, value) caches so that frequently referenced data within the ETL date range is loaded in memory to reduce database querying.