I'm working with SQL Server and Entity Framework. I need to fetch a large number of records and process them.
I'm afraid of "out-of-memory" or other performance problems, so I want to implement the fetching and processing in batches. The problem is that the underlying data in the database might change between fetches, so elements can be skipped: if a record is removed, the next offset is applied to data that has 'moved to the left', and the first item of the next page is omitted.
This is important because the purpose of the select is not just presentation but further processing.
I thought about setting the transaction's isolation level to repeatable read to avoid changes coming from other users, but the processing takes a lot of time and that would block changes to the table for the whole time.
I also thought about paginating using the key rather than the offset (SQL where/limit). This way no data that was initially in the db could be omitted; only changes that happen after the pagination has started might or might not be picked up. However, the user is informed about the number of items to be processed at the beginning of the process. This number might turn out to be incorrect, but we only learn that at the end of the long process.
What would your advice be? Am I missing something?
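A minimal sketch of the key-based pagination idea described above, written in Python against an in-memory SQLite database purely so it can be run as-is. The table name `records`, the helper `iter_batches`, and the `max_id` upper bound are illustrative assumptions, not part of the original setup. The upper bound freezes the key range when the run starts, which assumes monotonically increasing keys; rows deleted during the run would still make the initial count too high.

```python
import sqlite3

def iter_batches(conn, batch_size, max_id):
    """Yield rows in primary-key order, seeking past the last key seen."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM records "
            "WHERE id > ? AND id <= ? ORDER BY id LIMIT ?",
            (last_id, max_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # the next page starts after this key

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO records (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 11)],
)

# Hypothetical snapshot of the key range taken when the run starts.
max_id = conn.execute("SELECT MAX(id) FROM records").fetchone()[0]

seen = []
for batch in iter_batches(conn, 4, max_id):
    seen.extend(row[0] for row in batch)
    # Simulate a concurrent delete of a row that was already read.
    conn.execute("DELETE FROM records WHERE id = ?", (batch[0][0],))
```

With offset paging, each of those deletes would shift the following page by one row; seeking on the key is unaffected by them.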
Related
I am working with an AWS SQS queue. The queue may hold a massive number of messages, i.e. if I do not process them, there will be more than a million messages per hour.
I am processing all the messages and putting them into a MySQL table (InnoDB, 22 columns) using INSERT ... ON DUPLICATE KEY UPDATE. The table has a primary key and a unique key.
I am working in C#, where I run 80 threads to pull messages from SQS.
Each thread runs the "insert on duplicate key update" query inside a transaction.
At the same time I am using a C# lock so that only a single thread can update the table. If I do not use the C# lock, MySQL throws a deadlock exception.
The problem is that I can see a lot of threads waiting on the C# lock, and the waiting time is gradually increasing. Can anybody suggest the best way to do this?
Note: I have 8 GB of RAM, an Intel Xeon at 2.53 GHz, and a 1 GbE network connection. Please advise.
If I were to do it, the C# program would primarily be creating a CSV file to empty your SQS queue, or at least a significant chunk of it. The file would then be used for a bulk insert into an empty worktable that has no indexes of any kind. I would steer toward a non-temporary table, but whatever. I see no reason to add temporary tables to the mix when this is recurring and the worktable is truncated when done anyway.
The bulk insert would be achieved through LOAD DATA INFILE fired off from the C# program. Alternatively, a new row could be written to some other table with an incrementing value saying file2 is ready, file3 is ready, and the LOAD happens in an event triggered, say, every n minutes. That event would be put together with MySQL's CREATE EVENT. Six of one, half a dozen of the other.
But the benefits of a sentinel, a mutex, might be of value, as this whole thing happens in batches, and the next batch(es) to be processed need to be suspended while this occurs. Let's call this concept The Blocker, and the one being worked on is row N.
Ok, now your data is in the worktable. And it is safe from being stomped on until processed. Let's say you have 250k rows. Other batches shortly to follow. If you have special processing to have happen, you may wish to create indexes. But at this moment there are none.
You perform a normal insert on duplicate key update (IODKU) into the REAL table using this worktable. That IODKU would follow a normal INSERT INTO ... SELECT pattern, where the SELECT part comes from the worktable.
At the end of that statement, the worktable is truncated, any indexes dropped, row N has its status set to complete, and The Blocker is free to work on row N+1 when it appears.
The indexes are dropped to facilitate the next round of bulk insert, where maintaining indexes is of least importance. And indexes on the worktable may very well be overhead baggage unnecessary during IODKU.
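The worktable-then-IODKU flow described above can be sketched as follows. This is a rough illustration in Python with SQLite standing in for MySQL so that it runs as-is: SQLite's `INSERT ... ON CONFLICT DO UPDATE` (available from SQLite 3.24) plays the role of MySQL's INSERT ... SELECT ... ON DUPLICATE KEY UPDATE, and plain inserts stand in for the bulk load. The names `real_table`, `worktable`, `msg_id`, `body` and `hits` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE real_table (msg_id TEXT PRIMARY KEY, body TEXT, hits INTEGER);
    CREATE TABLE worktable (msg_id TEXT, body TEXT);  -- no keys, no indexes
""")
conn.execute("INSERT INTO real_table VALUES ('a', 'old', 1)")

# Step 1: load the batch into the unindexed worktable
# (stand-in for the bulk load from a file).
batch = [("a", "new"), ("b", "first"), ("c", "first")]
conn.executemany("INSERT INTO worktable VALUES (?, ?)", batch)

# Step 2: one set-based upsert from the worktable, then empty it.
with conn:
    conn.execute("""
        INSERT INTO real_table (msg_id, body, hits)
        SELECT msg_id, body, 1 FROM worktable WHERE true
        ON CONFLICT(msg_id) DO UPDATE
            SET body = excluded.body, hits = hits + 1
    """)
    conn.execute("DELETE FROM worktable")

rows = conn.execute(
    "SELECT msg_id, body, hits FROM real_table ORDER BY msg_id"
).fetchall()
```

The `WHERE true` is only there because SQLite needs it to parse an upsert whose rows come from a SELECT. The point of the shape is that a single writer issues one statement per batch, so there is nothing for 80 threads to deadlock on.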
In this manner, you get the best of both worlds:
LOAD DATA INFILE
IODKU
And the focus is taken off of multi-threading, a good thing to take one's focus off of.
Here is a nice article on performance and strategies titled Testing the Fastest Way to Import a Table into MySQL. Don't let the mysql version of the title or inside the article scare you away. Jumping to the bottom and picking up some conclusions:
The fastest way you can import a table into MySQL without using raw
files is the LOAD DATA syntax. Use parallelization for InnoDB for
better results, and remember to tune basic parameters like your
transaction log size and buffer pool. Careful programming and
importing can make a >2-hour problem became a 2-minute process. You
can disable temporarily some security features for extra performance
I would separate the C# routine entirely from the actual LOAD DATA and IODKU update effort and leave that to the event mentioned with Create Event for several reasons. Mainly better design. As such the C# program is only dealing with SQS and writing out files with incrementing file #'s.
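As a rough sketch of the "writer only produces numbered files" half of that split, here is a Python illustration (the real program in this thread is C#). The file naming scheme, the helper `write_batch`, and the write-to-temp-then-rename step are assumptions added for the example; the rename is there so that whatever loads the files never picks up a half-written one.

```python
import csv
import tempfile
from pathlib import Path

def write_batch(out_dir, file_no, messages):
    """Write one batch to batch_<n>.csv via a temporary name, then rename it."""
    final_path = Path(out_dir) / f"batch_{file_no:06d}.csv"
    tmp_path = final_path.with_suffix(".tmp")
    with open(tmp_path, "w", newline="") as fh:
        csv.writer(fh).writerows(messages)
    tmp_path.rename(final_path)  # the loader only ever sees complete files
    return final_path

out_dir = tempfile.mkdtemp()
paths = [
    write_batch(out_dir, n, [(f"id-{n}-{i}", "payload") for i in range(3)])
    for n in (1, 2)
]
```

The loader side can then process files in name order and record the highest file number it has finished.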
I have roughly 30M rows to insert or update in SQL Server per day. What are my options?
If I use SqlBulkCopy, does it handle not inserting data that already exists?
In my scenario I need to be able to run this over and over with the same data without duplicating data.
At the moment I have a stored procedure with an update statement and an insert statement which read data from a DataTable.
What should I be looking for to get better performance?
The usual way to do something like this is to maintain a permanent work table (or tables) that have no constraints on them. Often these might live in a separate work database on the same server.
To load the data, you empty the work tables, blast the data in via BCP/bulk copy. Once the data is loaded, you do whatever cleanup and/or transforms are necessary to prep the newly loaded data. Once that's done, as a final step, you migrate the data to the real tables by performing the update/delete/insert operations necessary to implement the delta between the old data and the new, or by simply truncating the real tables and reloading them.
Another option, if you've got something resembling a steady stream of data flowing in, might be to set up a daemon to monitor for the arrival of data and then do the inserts. For instance, if your data arrives as flat files dropped into a directory via FTP or the like, the daemon can monitor the directory for changes and do the necessary work (as above) when stuff arrives.
One thing to consider, if this is a production system, is that doing massive insert/delete/update statements is likely to cause blocking while the transaction is in-flight. Also, a gigantic transaction failing and rolling back has its own disadvantages:
The rollback can take quite a while to process.
Locks are held for the duration of the rollback, so more opportunity for blocking and other contention in the database.
Worst, after all that happens, you've achieved no forward motion, so to speak: a lot of time and effort and you're right back where you started.
So, depending on your circumstances, you might be better off doing your insert/update/deletes in smaller batches so as to guarantee that you achieve forward progress. 30 million rows over 24 hours works out to be c. 350 per second.
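A minimal sketch of the smaller-batches idea, assuming a hypothetical `target` table and using Python with SQLite only so that the example runs as-is. `INSERT OR REPLACE` is SQLite-specific and merely stands in for whatever insert/update logic applies. Each chunk commits on its own, so a failure rolls back at most one chunk.

```python
import sqlite3

def apply_in_batches(conn, rows, batch_size):
    """Commit every batch_size rows so a failure only rolls back one batch."""
    committed = 0
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        with conn:  # one short transaction per chunk
            conn.executemany(
                "INSERT OR REPLACE INTO target (id, val) VALUES (?, ?)", chunk
            )
        committed += len(chunk)
    return committed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")
rows = [(i, f"v{i}") for i in range(2500)]
done = apply_in_batches(conn, rows, 1000)

# The throughput figure quoted above: 30 million rows spread over one day.
rows_per_second = 30_000_000 / (24 * 60 * 60)  # roughly 347
```

The batch size of 1000 is arbitrary; the right value depends on how long locks can be held before other users notice.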
Bulk insert into a holding table, then perform either a single MERGE statement or an UPDATE and an INSERT statement. Either way, you want to compare your source table to your holding table to see which action to perform.
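The UPDATE-plus-INSERT variant can be sketched like this. It is an illustration in Python with SQLite rather than SQL Server; the table names `target` and `holding`, the helper `sync_from_holding`, and the assumption that `id` is unique within the holding table are all made up for the example. Running it twice with the same data leaves the target unchanged, which is the re-runnable behaviour asked for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target  (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE holding (id INTEGER, val TEXT);
    INSERT INTO target VALUES (1, 'old'), (2, 'keep');
""")

def sync_from_holding(conn, rows):
    """Load rows into the holding table, then update matches and insert the rest."""
    with conn:
        conn.execute("DELETE FROM holding")
        conn.executemany("INSERT INTO holding VALUES (?, ?)", rows)
        # Rows present in both tables: take the new value.
        conn.execute("""
            UPDATE target
            SET val = (SELECT h.val FROM holding h WHERE h.id = target.id)
            WHERE EXISTS (SELECT 1 FROM holding h WHERE h.id = target.id)
        """)
        # Rows only in the holding table: insert them.
        conn.execute("""
            INSERT INTO target (id, val)
            SELECT h.id, h.val FROM holding h
            WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = h.id)
        """)

incoming = [(1, "new"), (3, "added")]
sync_from_holding(conn, incoming)
sync_from_holding(conn, incoming)  # a second run changes nothing
result = conn.execute("SELECT id, val FROM target ORDER BY id").fetchall()
```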
I have a c# process that works against a queue using TPL to process in parallel. After handling each record, I want to establish a physical record of each record ID processed so that if the process fails or is interrupted, I can be sure to not process that record a second time. It is imperative that records only be processed once.
I have tried serializing record IDs to a simple text file AND to a Sqlite table. In both cases, the time to save these small record IDs (Guid's) takes 50% of the total process time for the record itself. I've even tried using an open Sqlite connection and a parameterized insert query to do inserts so I'm not opening/closing the database file, and it's no better.
My question is, how can I maintain a list of Guid's (maybe 1000-2000 of them) in a persistent way such that if my process dies, I'll have them saved so I can pick up where I left off? I'm willing to try anything as long as it's fast and will still be there if the server reboots or the process is killed.
Any ideas?
Anything that is persistent enough to survive a reboot will have to be written to disk sooner or later (preferably sooner).
This means that you have pretty much enumerated your choices already.
The next questions that you have to ask are: what is the expense of verifying whether or not the record has already been processed, and what is the danger level of an end user inadvertently removing the tracking mechanism?
If you just write the information to a text file, it should be a fast write, but a slow read (unless you cache the information) and the likelihood that a user will remove the file is fairly high.
If you use a database of any kind, the write should still be reasonably fast and the retrieval should be faster than that of the text file and the likelihood that a user will remove the storage mechanism is much lower.
Based on these factors, I would strongly recommend a database of some sort. I would model (or research) a couple of different databases for performance to see which provides the best bang for the buck, which should include cost of implementation, deployment, and maintenance.
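As one possible shape for the database option, here is a small sketch in Python with an on-disk SQLite file. The class `ProcessedLog`, its method names, and the file name are hypothetical. It keeps one connection open, makes the record ID the primary key so the "already processed?" check is an index lookup, and commits each mark so it survives a restart.

```python
import os
import sqlite3
import tempfile
import uuid

class ProcessedLog:
    """Persistent set of processed record IDs (illustrative helper)."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS processed (record_id TEXT PRIMARY KEY)"
        )
        self.conn.commit()

    def already_done(self, record_id):
        row = self.conn.execute(
            "SELECT 1 FROM processed WHERE record_id = ?", (str(record_id),)
        ).fetchone()
        return row is not None

    def mark_done(self, record_id):
        with self.conn:  # commit so the mark is on disk before moving on
            self.conn.execute(
                "INSERT OR IGNORE INTO processed VALUES (?)", (str(record_id),)
            )

db_path = os.path.join(tempfile.mkdtemp(), "processed.db")
rid = uuid.uuid4()

log = ProcessedLog(db_path)
first_check = log.already_done(rid)
log.mark_done(rid)
log.conn.close()

# Reopen the file to simulate a process restart.
reopened = ProcessedLog(db_path)
survived = reopened.already_done(rid)
```

Note that a crash between finishing a record and calling `mark_done` would still let that record be picked up again, so the mark alone does not guarantee exactly-once processing.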
I have various large data modification operations in a project built on c# and Fluent NHibernate.
The DB is sqlite (on disk rather than in memory as I'm interested in performance)
I wanted to check performance of these so I created some tests to feed in large amounts of data and let the processes do their thing. The results from 2 of these processes have got me pretty confused.
The first is a fairly simple case of taking data supplied in an XML file doing some light processing and importing it. The XML contains around 172,000 rows and the process takes a total of around 60 seconds to run with the actual inserts taking around 40 seconds.
In the next process, I do some processing on the same set of data. So I have a DB with approx 172,000 rows in one table. The process then works through this data, doing some heavier processing and generating a whole bunch of DB updates (inserts and updates to the same table).
In total, this results in around 50,000 rows inserted and 80,000 updated.
In this case, the processing takes around 30 seconds, which is fine, but saving the changes to the DB takes over 30 minutes, and it crashes before it finishes with an sqlite 'disk or i/o error'.
So the question is: why are the inserts/updates in the second process so much slower? They are working on the same table of the same database with the same connection. In both cases, IStatelessSession is used and ado.batch_size is set to 1000.
In both cases, the code that does the update looks like this:
BulkDataInsert((IStatelessSession session) =>
{
    // Queue the new rows first, then the modified rows, on the stateless session.
    foreach (Transaction t in transToInsert) { session.Insert(t); }
    foreach (Transaction t in transToUpdate) { session.Update(t); }
});
(although the first process has no 'transToUpdate' line, as it's only inserts. Removing the update line and just doing the inserts still takes almost 10 minutes.)
The transTo* variables are List<Transaction> instances holding the objects to be updated/inserted.
BulkDataInsert creates the session and handles the DB transaction.
I didn't understand your second process. However, here are some things to consider:
Are there any clustered or non-clustered indexes on the table?
How many disk drives do you have?
How many threads are writing to the DB in the second test?
It seems that you are experiencing IO bottlenecks that can be resolved by having more disks, more threads, indexes, etc.
So, assuming a lot of things, here is what I "think" is happening:
In the first test your table probably has no indexes, and since you are just inserting data, it is a sequential insert in a single thread which can be pretty fast - especially if you are writing to one disk.
Now, in the second test, you are reading data and then updating data. Your SQL instance has to find the record that it needs to update. If you do not have any indexes this "find" action is basically a table scan, which will happen for each one of those 80,000 row updates. This will make your application really really slow.
The simplest thing you could probably do is add a clustered index on the table for a unique key, and the best option is to use the columns that you are using in the where clause to "update" those rows.
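To see the effect described above, the query plan for an UPDATE can be compared before and after adding an index on the column used in the WHERE clause. This is a sketch in Python with SQLite; the table `trans` and column `ref` are invented for the example, and it uses an ordinary unique index because SQLite has no clustered index in the SQL Server sense.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trans (id INTEGER PRIMARY KEY, ref TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO trans (ref, amount) VALUES (?, ?)",
    [(f"r{i}", 1.0) for i in range(1000)],
)

def plan(conn, sql, params=()):
    """Return the EXPLAIN QUERY PLAN detail text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

update_sql = "UPDATE trans SET amount = 2.0 WHERE ref = ?"

before = plan(conn, update_sql, ("r42",))  # full table scan per update
conn.execute("CREATE UNIQUE INDEX idx_trans_ref ON trans (ref)")
after = plan(conn, update_sql, ("r42",))   # index search per update

print(before)
print(after)
```

The first plan reports a scan of `trans`, the second a search using `idx_trans_ref`. Multiplied by 80,000 updates, that difference is what the answer is pointing at.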
Hope this helps.
DISCLAIMER: I made quite a few assumptions
The problem was due to my test setup.
As is pretty common with nhibernate based projects, I had been using in-memory sqlite databases for unit testing. These work great but one downside is that if you close the session, it destroys the database.
Consequently, my unit of work implementation contains a 'PreserveSession' property to keep the session alive and just create new transactions when needed.
My new performance tests are using on-disk databases but they still use the common code for setting up test databases and so have PreserveSession set to true.
It seems that having several sessions all left open (even though they're not doing anything) starts to cause problems after a while including the performance drop off and the disk IO error.
I re-ran the second test with PreserveSession set to false and immediately I'm down from over 30 minutes to under 2 minutes, which is more where I'd expect it to be.
I have a couple of tables in a SQL Server database, most of them are updated only rarely, i.e. they are mostly-read.
In order not to have to go to the database every time I read an entry, what we have done is, on startup we load all tables completely into memory of our .net process (the data is small enough), and at intervals of 10 seconds we reread the whole thing and replace our in-memory representation of the data.
This in-memory representation of the data is then used for reading, and we don't have to go synchronously to the DB, unless we want to update the data.
Suffice to say, this currently hand-coded process (for each table we have to write code that runs a SELECT * and handles the received rows) is tedious and bound to attract bugs during the maintenance cycle. In addition, it is obviously inefficient to always read the whole DB and reprocess all entries, even though nothing has changed.
I can think of a couple of meaningful optimizations to the above procedure, but my point is, I don't want to have to do manually what looks like a feature that could come out of the box: The replication of a set of tables into memory of a process to speed up read access.
I guess if I went ORM and used nhibernate etc., I could get something like that in addition to the ORM layer (by means of caching and eager loading).
Now if I don't want the ORM part, just the replication of the lower relational level, is there anything that I can just switch on?
You can look at the metadata and make something generic which can load any table into some kind of structure you like or simply use an ADO.NET DataSet.
Also, instead of reloading your data on a timer even when it hasn't changed, you can subscribe to changes using SqlDependency.
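The metadata-driven loader suggested above might look roughly like this. It is a sketch in Python with SQLite, where the table list comes from `sqlite_master`; on SQL Server the equivalent list would come from a catalog view such as INFORMATION_SCHEMA.TABLES. The function name `load_all_tables` and the sample tables are assumptions for the example, and the change-notification part is not covered here.

```python
import sqlite3

def load_all_tables(conn):
    """Return {table_name: [row_dict, ...]} for every user table."""
    names = [
        r[0]
        for r in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    cache = {}
    for name in names:
        cur = conn.execute(f'SELECT * FROM "{name}"')
        cols = [d[0] for d in cur.description]  # column names from metadata
        cache[name] = [dict(zip(cols, row)) for row in cur]
    return cache

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (code TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE currencies (iso TEXT PRIMARY KEY, symbol TEXT);
    INSERT INTO countries VALUES ('DE', 'Germany');
    INSERT INTO currencies VALUES ('EUR', 'E');
""")

cache = load_all_tables(conn)
```

Because nothing in the loader names a specific table or column, adding a table to the database needs no new code.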