I am working with an AWS SQS queue. The queue can carry a massive volume of messages: if I don't keep up with processing, more than a million messages arrive per hour.
I am processing all the messages and putting them into a MySQL table (InnoDB, 22 columns) using INSERT ... ON DUPLICATE KEY UPDATE. The table has a primary key and a unique key.
I am working in C#, where I run 80 threads to pull messages from SQS.
Each thread opens a transaction in C# and runs the INSERT ... ON DUPLICATE KEY UPDATE query.
At the same time I am using a C# lock so that only a single thread can update the table. If I do not use the C# lock, MySQL throws a deadlock exception.
The problem is that a lot of threads end up waiting on the C# lock, and the wait time is gradually increasing. Can anybody suggest the best way to do this?
Note: the machine has 8 GB RAM and an Intel Xeon 2.53 GHz CPU, with a 1 GbE network connection. Please advise.
If I were to do it, the C# program would primarily be creating a CSV file to empty your SQS queue, or at least a significant chunk of it. The file would then be bulk inserted into an empty worktable with no indexes whatsoever. I would go with a non-temporary table: I see no reason to add TEMPORARY to the mix when this is recurring, and when done the worktable is truncated anyway.
The bulk insert would be achieved through LOAD DATA INFILE fired off from the C# program. Alternatively, the C# side could write a new row into some other table with an incrementing counter saying file2 is ready, file3 is ready, and the LOAD happens in an event that fires, say, every n minutes, put together with MySQL CREATE EVENT. Six of one, half a dozen of the other.
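As a rough sketch of that first leg (assuming Connector/NET, a worktable simply named worktable, and a CSV path picked by the SQS-draining code; none of these names come from the question, and LOCAL loads may need AllowLoadLocalInfile enabled on the client and server):

using MySql.Data.MySqlClient;

using (var conn = new MySqlConnection(
    "Server=localhost;Database=feeds;Uid=loader;Pwd=secret;AllowLoadLocalInfile=true"))
{
    conn.Open();
    const string load = @"
        LOAD DATA LOCAL INFILE 'C:/feeds/file2.csv'
        INTO TABLE worktable
        FIELDS TERMINATED BY ',' ENCLOSED BY '""'
        LINES TERMINATED BY '\n'";
    using (var cmd = new MySqlCommand(load, conn))
    {
        cmd.CommandTimeout = 0;   // large files can blow past the default timeout
        cmd.ExecuteNonQuery();    // the whole file lands in one statement
    }
}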
But a sentinel, a mutex, might be of value here, as this whole thing happens in batches, and the next batch(es) to be processed need to be suspended while this occurs. Let's call this concept The Blocker, and say the batch being worked on is row N.
Ok, now your data is in the worktable, and it is safe from being stomped on until processed. Let's say you have 250k rows, with other batches shortly to follow. If you have special processing that needs to happen, you may wish to create indexes on the worktable. But at this moment there are none.
You perform a normal INSERT ... ON DUPLICATE KEY UPDATE (IODKU) into the REAL table using this worktable. That IODKU follows the normal INSERT INTO ... SELECT pattern, where the SELECT part comes from the worktable.
At the end of that statement, the worktable is truncated, any indexes dropped, row N has its status set to complete, and The Blocker is free to work on row N+1 when it appears.
The indexes are dropped to speed up the next round of bulk insert, where maintaining indexes matters least. Indexes on the worktable may well be unnecessary overhead during the IODKU anyway.
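The worktable-to-real-table step might look something like this, with purely illustrative table and column names; the TRUNCATE resets the worktable for the next LOAD:

// IODKU fed from the worktable: INSERT ... SELECT with ON DUPLICATE KEY UPDATE.
const string iodkuFromWorktable = @"
    INSERT INTO real_table (id, col_a, col_b)
    SELECT id, col_a, col_b
    FROM worktable
    ON DUPLICATE KEY UPDATE
        col_a = VALUES(col_a),
        col_b = VALUES(col_b);";

// Reset for the next batch; any indexes added for special processing get dropped here too.
const string resetWorktable = "TRUNCATE TABLE worktable;";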
In this manner, you get the best of both worlds:
LOAD DATA INFILE
IODKU
And the focus is taken off of multi-threading, a good thing to take one's focus off of.
Here is a nice article on performance and strategies titled Testing the Fastest Way to Import a Table into MySQL. Don't let the MySQL version mentioned in the title or inside the article scare you away. Jumping to the bottom and picking up some of its conclusions:
The fastest way you can import a table into MySQL without using raw files is the LOAD DATA syntax. Use parallelization for InnoDB for better results, and remember to tune basic parameters like your transaction log size and buffer pool. Careful programming and importing can make a >2-hour problem become a 2-minute process. You can temporarily disable some security features for extra performance.
I would separate the C# routine entirely from the actual LOAD DATA and IODKU effort and leave those to the event mentioned above (CREATE EVENT), for several reasons, mainly better design. That way the C# program only deals with SQS and with writing out files with incrementing file numbers.
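A sketch of that hand-off, with illustrative table, column, and procedure names: the C# side only records that a new file is ready, and the scheduled event on the MySQL side does the LOAD and IODKU.

using MySql.Data.MySqlClient;

static void MarkFileReady(string connectionString, string fileName)
{
    using (var conn = new MySqlConnection(connectionString))
    using (var cmd = new MySqlCommand(
        "INSERT INTO load_queue (file_name, status) VALUES (@f, 'ready')", conn))
    {
        cmd.Parameters.AddWithValue("@f", fileName);
        conn.Open();
        cmd.ExecuteNonQuery();
    }
}

// On the MySQL side, the event could be as simple as:
//   CREATE EVENT process_load_queue
//   ON SCHEDULE EVERY 5 MINUTE
//   DO CALL process_next_ready_file();   -- a stored procedure doing the LOAD + IODKU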
I'm building a program that takes push data from six different sources and inserts the data into a database. Each source has its own function to execute the inserts as soon as they come, but all sources write to the same table.
I would have the following questions:
If one source is currently writing to the table and another source begins to write at the same time is there any chance the inserts will block each other?
The table is also constantly being used to read the data via a view that joins some more tables to show the data; can this pose any problems?
Currently each source has its own DB connection to write data, would it be better to have only one connection, or have each use its own?
If one source is currently writing to the table and another source begins to write at the same time is there any chance the inserts will block each other?
It depends on the indexes. If the index keys have the same or contiguous values, you may see short-term blocking for the duration of the transaction.
The table is also constantly being used to read the data via a view that joins some more tables to show the data; can this pose any problems?
It depends on the isolation level. No blocking will occur if:
the SELECT queries run at the READ_COMMITTED isolation level and the READ_COMMITTED_SNAPSHOT database option is turned on (see the sketch below), or
the SELECT queries don't touch uncommitted data, or
the SELECT queries run at the READ_UNCOMMITTED isolation level.
Even if blocking does occur, it may be short-lived if the INSERT transactions are short.
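If the READ_COMMITTED_SNAPSHOT route is the one you want, turning it on is a one-time database-level setting; the database name below is a placeholder, and the switch needs other sessions out of the way:

// Readers at READ COMMITTED then see row versions instead of blocking on writers.
// WITH ROLLBACK IMMEDIATE closes other sessions so the switch can complete.
const string enableRcsi = @"
    ALTER DATABASE MyAppDb
    SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;";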
Currently each source has its own DB connection to write data, would it be better to have only one connection, or have each use its own?
It depends on the problem you are trying to solve. A single connection will ensure the inserts don't block/deadlock with each other, but blocking might not be an issue anyway.
Please find my answers inline below.
If one source is currently writing to the table and another source begins to write at the same time is there any chance the inserts will block each other?
In this case the other source will wait (the second insert stays in a waiting state until the first one finishes).
The table is also constantly being used to read the data via a view that joins some more tables to show the data; can this pose any problems?
No problem.
Currently each source has its own DB connection to write data, would it be better to have only one connection, or have each use its own?
It's better to have one DB connection.
Blocking "each other", i.e. a deadlock, is not possible here.
No problem. Only if the SELECT is too slow can it delay the next insert.
No problem with different connections.
I have roughly 30M rows to insert/update in SQL Server per day; what are my options?
If I use SqlBulkCopy, does it handle not inserting data that already exists?
In my scenario I need to be able to run this over and over with the same data without duplicating data.
At the moment I have a stored procedure with an update statement and an insert statement which read data from a DataTable.
What should I be looking for to get better performance?
The usual way to do something like this is to maintain a permanent work table (or tables) that have no constraints on them. Often these might live in a separate work database on the same server.
To load the data, you empty the work tables, blast the data in via BCP/bulk copy. Once the data is loaded, you do whatever cleanup and/or transforms are necessary to prep the newly loaded data. Once that's done, as a final step, you migrate the data to the real tables by performing the update/delete/insert operations necessary to implement the delta between the old data and the new, or by simply truncating the real tables and reloading them.
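A sketch of the "blast the data in" step using SqlBulkCopy against an unconstrained work table; the table name, batch size, and DataTable source are illustrative, not prescriptive:

using System.Data;
using System.Data.SqlClient;

static void LoadWorkTable(string connectionString, DataTable sourceDataTable)
{
    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "work.StagingRows";
        bulk.BatchSize = 10000;       // commit in chunks rather than one huge batch
        bulk.BulkCopyTimeout = 0;     // don't time out on large loads
        bulk.WriteToServer(sourceDataTable);
    }
}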
Another option, if you've got something resembling a steady stream of data flowing in, might be to set up a daemon to monitor for the arrival of data and then do the inserts. For instance, if your data arrives as flat files dropped into a directory via FTP or the like, the daemon can monitor the directory for changes and do the necessary work (as above) when something arrives.
One thing to consider, if this is a production system, is that doing massive insert/delete/update statements is likely to cause blocking while the transaction is in-flight. Also, a gigantic transaction failing and rolling back has its own disadvantages:
The rollback can take quite a while to process.
Locks are held for the duration of the rollback, so more opportunity for blocking and other contention in the database.
Worst of all, after all that happens, you've achieved no forward motion, so to speak: a lot of time and effort and you're right back where you started.
So, depending on your circumstances, you might be better off doing your insert/update/deletes in smaller batches so as to guarantee that you achieve forward progress. 30 million rows over 24 hours works out to be c. 350 per second.
Bulk insert into a holding table, then perform either a single MERGE statement or an UPDATE and an INSERT statement. Either way you want to compare your source table to your holding table to see which action to perform.
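The comparison can be a single MERGE along these lines; table, column, and key names are made up for illustration:

const string mergeFromHolding = @"
    MERGE dbo.Target AS t
    USING dbo.Holding AS s
        ON t.Id = s.Id
    WHEN MATCHED THEN
        UPDATE SET t.Qty = s.Qty, t.Price = s.Price
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Id, Qty, Price)
        VALUES (s.Id, s.Qty, s.Price);";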
I have a C# process that works against a queue, using TPL to process in parallel. After handling each record, I want to establish a physical record of each record ID processed so that if the process fails or is interrupted, I can be sure not to process that record a second time. It is imperative that records only be processed once.
I have tried serializing the record IDs to a simple text file AND to an SQLite table. In both cases, the time to save these small record IDs (GUIDs) takes 50% of the total processing time for the record itself. I've even tried keeping an open SQLite connection and using a parameterized insert query so I'm not opening/closing the database file, and it's no better.
My question is: how can I maintain a list of GUIDs (maybe 1000-2000 of them) in a persistent way such that if my process dies, I'll have them saved so I can pick up where I left off? I'm willing to try anything as long as it's fast and will still be there if the server reboots or the process is killed.
Any ideas?
Anything that is persistent enough to survive a reboot will have to be written to disk sooner or later (preferably sooner).
This means that you have pretty much enumerated your choices already.
The next questions you have to ask are: what is the expense of verifying whether or not a record has already been processed, AND what is the danger of an end user inadvertently removing the tracking mechanism?
If you just write the information to a text file, it should be a fast write, but a slow read (unless you cache the information) and the likelihood that a user will remove the file is fairly high.
If you use a database of any kind, the write should still be reasonably fast and the retrieval should be faster than that of the text file and the likelihood that a user will remove the storage mechanism is much lower.
Based on these factors, I would strongly recommend a database of some sort. I would model (or research) a couple of different databases for performance to see which provides the best bang for the buck, which should include cost of implementation, deployment, and maintenance.
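To make the recommendation concrete, here is a minimal sketch of the pattern: persist each processed ID, and cache them in memory so the "already done?" check on restart is cheap. The engine is whatever your research picks; SQLite via Microsoft.Data.Sqlite just keeps the sketch self-contained, and the file, table, and method names are assumptions.

using System;
using System.Collections.Generic;
using Microsoft.Data.Sqlite;

var conn = new SqliteConnection("Data Source=processed.db");
conn.Open();
new SqliteCommand("CREATE TABLE IF NOT EXISTS processed (id TEXT PRIMARY KEY)", conn).ExecuteNonQuery();

// On startup, load what was already done so those records can be skipped.
var done = new HashSet<Guid>();
using (var reader = new SqliteCommand("SELECT id FROM processed", conn).ExecuteReader())
{
    while (reader.Read())
        done.Add(Guid.Parse(reader.GetString(0)));
}

// After each record is handled, write its ID before acknowledging the record.
void MarkProcessed(Guid id)
{
    using (var cmd = new SqliteCommand("INSERT OR IGNORE INTO processed (id) VALUES (@id)", conn))
    {
        cmd.Parameters.AddWithValue("@id", id.ToString());
        cmd.ExecuteNonQuery();
    }
    done.Add(id);
}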
I have various large data modification operations in a project built on C# and Fluent NHibernate.
The DB is SQLite (on disk rather than in memory, as I'm interested in performance).
I wanted to check performance of these so I created some tests to feed in large amounts of data and let the processes do their thing. The results from 2 of these processes have got me pretty confused.
The first is a fairly simple case of taking data supplied in an XML file doing some light processing and importing it. The XML contains around 172,000 rows and the process takes a total of around 60 seconds to run with the actual inserts taking around 40 seconds.
In the next process, I do some processing on the same set of data. So I have a DB with approx 172,000 rows in one table. The process then works through this data, doing some heavier processing and generating a whole bunch of DB updates (inserts and updates to the same table).
In total, this results in around 50,000 rows inserted and 80,000 updated.
In this case, the processing takes around 30 seconds, which is fine, but saving the changes to the DB takes over 30 minutes, and it crashes before it finishes with an SQLite 'disk I/O error'.
So the question is: why are the inserts/updates in the second process so much slower? They are working on the same table of the same database with the same connection. In both cases, IStatelessSession is used and ado.batch_size is set to 1000.
In both cases, the code that does the update looks like this:
BulkDataInsert((IStatelessSession session) =>
{
    foreach (Transaction t in transToInsert) { session.Insert(t); }
    foreach (Transaction t in transToUpdate) { session.Update(t); }
});
(The first process has no 'transToUpdate' line as it only does inserts. Removing the update line from the second process and just doing the inserts still takes almost 10 minutes.)
The transTo* variables are List<Transaction> collections holding the objects to be inserted/updated.
BulkDataInsert creates the session and handles the DB transaction.
I didn't understand your second process. However, here are some things to consider:
Are there any clustered or non-clustered indexes on the table?
How many disk drives do you have?
How many threads are writing to the DB in the second test?
It seems that you are experiencing IO bottlenecks that can be resolved by having more disks, more threads, indexes, etc.
So, assuming a lot of things, here is what I "think" is happening:
In the first test your table probably has no indexes, and since you are just inserting data, it is a sequential insert in a single thread which can be pretty fast - especially if you are writing to one disk.
Now, in the second test, you are reading data and then updating data. Your SQL instance has to find the record that it needs to update. If you do not have any indexes this "find" action is basically a table scan, which will happen for each one of those 80,000 row updates. This will make your application really really slow.
The simplest thing you could probably do is add a clustered index on the table for a unique key, and the best option is to use the columns that you are using in the where clause to "update" those rows.
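A sketch along those lines for SQLite, assuming (hypothetically) that the updates locate rows by a TransactionId column; the index turns each update's lookup into a seek instead of a scan:

// Table and column names are assumptions; use whatever column the WHERE clause filters on.
const string addLookupIndex =
    "CREATE INDEX IF NOT EXISTS ix_transactions_lookup ON Transactions (TransactionId);";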
Hope this helps.
DISCLAIMER: I made quite a few assumptions
The problem was due to my test setup.
As is pretty common with NHibernate-based projects, I had been using in-memory SQLite databases for unit testing. These work great, but one downside is that if you close the session, it destroys the database.
Consequently, my unit of work implementation contains a 'PreserveSession' property to keep the session alive and just create new transactions when needed.
My new performance tests are using on-disk databases but they still use the common code for setting up test databases and so have PreserveSession set to true.
It seems that having several sessions all left open (even though they're not doing anything) starts to cause problems after a while including the performance drop off and the disk IO error.
I re-ran the second test with PreserveSession set to false and immediately went from over 30 minutes down to under 2 minutes, which is more like where I'd expect it to be.
I have one BIG table (90k rows, approx. 60 MB) which holds info about free room capacities for about 50 hotels. This table has very few updates/inserts per hour.
My application sends async requests against this table (and joined tables) at most 30 times per second.
When I start 30 threads at once (with the default AppPool class in .NET 3.5 C#, each issuing a random valid SQL query string), only a few (approx. 4) are processed asynchronously and the other threads wait. Why?
Is it because of SQL Server 2008 table locking, or because of the .NET side? Or something else?
If it is a SQL problem, would it help if I split this big table into one table per hotel?
My goal is to have at least 10 threads served at a time.
This table is tiny. It doesn't even qualify as a "medium sized" table. It's trivial.
You could be full-table-scanning it 30 times per second, or copying the whole thing in RAM, and no server would be the slightest bit bothered.
If your data fits in RAM, databases are fast. If that's not what you're seeing, you're doing something REALLY WRONG. Therefore I also think the problems are all on the client side.
It is more than likely on the .NET side. If it were table locking, more threads would be processing, but they would be waiting on their queries to return. If I remember correctly, there's a property for thread pools that controls how many actual threads they create at once. If there are more pending work items than that number, they get in line and wait for running threads to finish. Check that.
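A quick way to inspect that from code; whether the ceiling is the CLR ThreadPool or the ADO.NET connection pool is something to verify, so treat the numbers below as things to check rather than a diagnosis:

using System;
using System.Threading;

int maxWorkers, maxIo, freeWorkers, freeIo;
ThreadPool.GetMaxThreads(out maxWorkers, out maxIo);
ThreadPool.GetAvailableThreads(out freeWorkers, out freeIo);
Console.WriteLine("busy workers: " + (maxWorkers - freeWorkers) + " of " + maxWorkers);

// If the pool ramps up too slowly for a burst of 30 items, raising the minimum can help:
// ThreadPool.SetMinThreads(30, 30);

// The ADO.NET connection pool also has its own cap (Max Pool Size, default 100),
// which is changed in the connection string if it turns out to be the limit.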
Have you tried changing the transaction isolation level?
Even when reading from a table, SQL Server will take locks on it.
Try setting the isolation level to READ UNCOMMITTED and see if that improves the situation, but be advised that it is quite possible you will read 'dirty' data; make sure you understand the ramifications if this is the solution.
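One way to try that from the C# side (connection string, table, and column names are placeholders) is to run the read inside a ReadUncommitted transaction, which is the ADO.NET equivalent of SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED:

using System.Data;
using System.Data.SqlClient;

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (var tx = conn.BeginTransaction(IsolationLevel.ReadUncommitted))
    using (var cmd = new SqlCommand(
        "SELECT HotelId, FreeRooms FROM RoomCapacity", conn, tx))
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // consume the row; remember these reads can see uncommitted data
        }
    }
}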
Rather than ask, measure. Each SQL query that is actually submitted by your application creates a request on the server, and the sys.dm_exec_requests DMV shows the state of each request. When a request is blocked, the wait_type column shows a non-empty value. You can judge from this whether your requests are blocked or not; if they are blocked, you'll also know the reason why.
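A sketch of that measurement, which can be run from SSMS or from the application itself; the session_id filter is just a common way to skip system sessions:

const string showRequests = @"
    SELECT session_id, status, command, wait_type, wait_time,
           blocking_session_id, total_elapsed_time
    FROM sys.dm_exec_requests
    WHERE session_id > 50;";   // user sessions; blocked ones show a wait_type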