I've done a lot of searching on this and haven't had much luck. As a test, I've written a C# WinForms app where I spin up a configurable number of threads and have each thread write a configurable amount of data to a number of tables in a SQLite database created by the thread. So each thread creates its own SQLite database, and only that thread interacts with it.
What I'm seeing is that there's definitely some performance degradation happening as a result of the concurrency. For example, if I start each thread roughly simultaneously, performance writing to SQLite's tables PLUMMETS compared to when I put a random start delay in each thread to spread out their access.
SQLite starts out easily fast enough for my tests: I can insert 20,000 rows into a table in a third of a second. But once I start up 250 threads, those same 20,000 rows can take MINUTES to write to each of the databases.
I've tried a lot of things, including periodic commits, setting Synchronous=Off, using parameterized queries, etc., and those all help by shortening the amount of time each statement takes (and therefore reducing the chance of concurrent activity), but nothing has really solved it and I'm hoping someone can give some advice.
Thanks!
Andy
Too much write concurrency in any relational database does cause some slowdown. Depending on the scenario you are trying to optimize, you can do various things. A few I can think of are:
1) Create batches instead of concurrent writes. If you are expecting a large number of users writing simultaneously, collect their data and flush it in larger groups. Be warned, though: if the application goes down while your queue is still collecting, you will lose that data, so do this only for non-critical data such as logs.
2) If your threads need to do other work before inserting the data, you can keep your threads and add a semaphore (or something equivalent) around the part of the code where the insertion takes place. This limits the concurrency and speeds up the entire process.
3) If what you are trying to do is a bulk insert via a tool you are building, mention that in your question; a lot of MySQL DBAs will answer your question better than I can.
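As a rough illustration of option 2, here is a minimal sketch in Python (used only because its standard library ships SQLite; the original app is C#). The limit of 4 writers, the table layout, and the function names are all assumptions made for the example, not anything from the question.

```python
import os
import sqlite3
import tempfile
import threading

MAX_CONCURRENT_WRITERS = 4  # assumed limit; tune it for your disk
write_gate = threading.Semaphore(MAX_CONCURRENT_WRITERS)

def write_database(db_path, row_count):
    # Non-database work can happen here without holding the gate.
    payload = [(i, f"value-{i}") for i in range(row_count)]
    with write_gate:  # at most MAX_CONCURRENT_WRITERS threads are in here at once
        conn = sqlite3.connect(db_path)
        try:
            conn.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER, val TEXT)")
            with conn:  # all inserts are committed together in one transaction
                conn.executemany("INSERT INTO items VALUES (?, ?)", payload)
        finally:
            conn.close()

def run_writers(thread_count, row_count, directory):
    # One database file per thread, as in the question.
    paths = [os.path.join(directory, f"db_{n}.sqlite") for n in range(thread_count)]
    threads = [threading.Thread(target=write_database, args=(p, row_count)) for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return paths

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(len(run_writers(16, 1000, tmp)), "databases written")
```

The threads still exist and can do their other work in parallel; only the insert section is throttled.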
I have a .NET application which reads about a million records from a database table each time it runs (every 5 minutes), does some processing, and updates the table, marking the records as processed.
Currently the application runs in a single thread: it takes the top 4K records from the DB table, processes them, updates the records, and takes the next batch.
I'm using Dapper with stored procedures. I'm retrieving 4K records at a time to avoid DB table locks.
What would be the optimal way to retrieve records in multiple threads while ensuring that each thread gets a new set of 4K records?
My current idea is that I would first retrieve just the ids of the 1M records, sort the ids in ascending order, and split them into 4K batches, remembering the lowest and highest id in each batch.
Then in each thread I would call another stored procedure which would retrieve the full records between the lowest and highest ids of the batch, process them, and so on.
Is there any better pattern I'm not aware of?
I find this problem interesting partly because I'm attempting to do something similar in principle but also because I haven't seen a super intuitive industry standard solution to it. Yet.
What you are proposing to do would work if you write your SQL query correctly.
Using ROW_NUMBER / BETWEEN it should be achievable.
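The batching idea from the question can be sketched roughly like this, in Python for brevity. `process_range` is a hypothetical placeholder; a real worker would call the stored procedure with the two bounds. The batch size and worker count are assumptions as well.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_ranges(ids, batch_size):
    """Return (lowest_id, highest_id) pairs covering the sorted ids in batches."""
    ordered = sorted(ids)
    return [
        (ordered[i], ordered[min(i + batch_size, len(ordered)) - 1])
        for i in range(0, len(ordered), batch_size)
    ]

def process_range(bounds):
    low, high = bounds
    # Hypothetical placeholder: fetch rows WHERE id BETWEEN low AND high,
    # process them, and mark them as processed.
    return (low, high)

def run_batches(ids, batch_size, workers):
    ranges = split_into_ranges(ids, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands each range to exactly one worker and keeps the order.
        return list(pool.map(process_range, ranges))
```

Because the ranges never overlap, no two threads can pick up the same record, as long as ids are not reassigned while the run is in progress.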
I'll write and document some other alternatives here along with benefits / caveats.
Parallel processing
I understand that you want to do this in SQL Server, but just as a reference, Oracle implemented this as a keyword that lets you run queries in parallel.
Documentation: https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
SQL Server implements this differently: you have to turn it on explicitly through a more complex keyword, and you have to be on a certain version.
A nice article on this is here: https://www.mssqltips.com/sqlservertip/4939/how-to-force-a-parallel-execution-plan-in-sql-server-2016/
You can combine the parallel processing with SQL CLR integration, which would effectively do what you're trying to do in SQL while SQL manages the data chunks and not you in your threads.
SQL CLR integration
One nice feature that you might look into is executing .net code in SQL server. Documentation here: https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/sql/introduction-to-sql-server-clr-integration
This would basically allow you to run C# code in your SQL Server, saving you the read / process / write round trip. They have improved the continuous integration around this as well - documentation here: https://learn.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-2017
Unfortunately, reviewing the QoS / getting the logs when something goes wrong is not as easy as it is in a worker job.
Use a single thread (if you're reading from an external source)
Parallelism is only good for you if certain conditions are met. Below is from Oracle's documentation but it also applies to MSSQL: https://docs.oracle.com/cd/B19306_01/server.102/b14223/usingpe.htm#DWHSG024
Parallel execution improves processing for:
Queries requiring large table scans, joins, or partitioned index scans
Creation of large indexes
Creation of large tables (including materialized views)
Bulk inserts, updates, merges, and deletes
There are also setup / environment requirements:
Parallel execution benefits systems with all of the following characteristics:
Symmetric multiprocessors (SMPs), clusters, or massively parallel systems
Sufficient I/O bandwidth
Underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%)
Sufficient memory to support additional memory-intensive processes, such as sorts, hashing, and I/O buffers
There are other constraints. When you use multiple threads for the operation you propose and one of those threads gets killed, fails, or throws an exception, you will absolutely need to handle that - by keeping track of the last index you processed, so you can retry the rest of the records.
With a single thread that becomes way simpler.
Conclusion
Assuming that the DB is modeled correctly and can't be optimized any further, I'd say the simplest solution, a single thread, is the best one. It is easier to log and track errors and easier to implement retry logic, and I'd say those far outweigh the benefits you would see from parallel processing. You might look into parallel processing for the batch updates that you'll do to the DB, but unless you're going to have a CLR DLL in SQL Server whose methods you invoke in parallel, I don't see the benefits outweighing the costs. Your system will also have to behave a certain way at the times you run the parallel query for it to be more efficient.
You can of course design your worker role to be async so that it does not block on each record's processing. You'll still be multi-threaded, but your querying would happen in a single thread.
Edit to conclusion
After talking to my colleague about this today, it's worth adding that even with the single-thread approach you'd have to be able to recover from failure. So in principle, the requirement for recovery / graceful failure and remembering what you processed doesn't change between multiple threads and a single thread. How you recover would, though, given that you'd have to write more complex code to track your multiple threads and their states.
I have a program that performs a long-running process. It loops through thousands of records, one at a time, and calls a stored proc on each iteration. Would running two instances of a program like this, with one processing half the records and the other processing the other half, speed up the processing?
Here are the scenarios:
1) 1 instance of the program, running the long-running process.
2) 2 instances of the program on the same server, connecting to the same database, each responsible for processing half (50%) of the records.
3) 2 instances on different servers, connecting to the same database, each responsible for half (50%) of the records.
Would scenario 2 or 3 run twice as fast as 1? Would there be a difference between 2 and 3? The main bottleneck is the stored proc call that takes around half a second.
Thanks!
This depends on a lot of factors. Also note that threads may be more appropriate than processes. Or maybe not. Again: it depends. But: is this work CPU-bound? Network-bound? Or bound by what the database server can do? Adding concurrency helps with CPU-bound work, and when talking to multiple independent resources. Fighting over the same network connection or the same database server is unlikely to improve things - and can make things much worse.
Frankly, from the sound of it your best bet may be to re-work the sproc to work in batches (rather than individual records).
To answer this question properly you need to know what the resource utilization of the database server currently is: can it take extra load? Or simpler - just try it and see.
It really depends what the stored procedure is doing. If the stored procedure is going to be updating the records, and you have a single database instance then there is going to be contention when writing the data back.
The values at play here are:
The time it takes to read the data in to your application memory (and this is also dependent on whether you are using client-side or sql-server-side cursors).
The time it takes to process, or do your application logic.
The time it takes to write an updated item back (assuming the proc updates).
One solution (and this is by no means a perfect solution without knowing the exact requirements), is:
Have X servers read Y records, and process them.
Have those servers write the results back to a dedicated writing server in a serialized fashion to avoid the contention.
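To make the "many processors, one serialized writer" shape concrete, here is a small single-machine sketch in Python. Threads stand in for the X servers and a SQLite file stands in for the dedicated writing server; `process`, the table layout, and the doubling logic are hypothetical placeholders.

```python
import os
import queue
import sqlite3
import tempfile
import threading

_DONE = object()  # sentinel that tells the writer to stop

def process(record):
    # Hypothetical stand-in for the application logic.
    return (record, record * 2)

def process_chunk(records, results):
    for record in records:
        results.put(process(record))

def write_results(db_path, results):
    # The only code that touches the database, so writes never contend.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (id INTEGER, value INTEGER)")
    with conn:  # commit once, after the sentinel arrives
        while True:
            item = results.get()
            if item is _DONE:
                break
            conn.execute("INSERT INTO results VALUES (?, ?)", item)
    conn.close()

def run_pipeline(chunks, db_path):
    results = queue.Queue()
    writer = threading.Thread(target=write_results, args=(db_path, results))
    writer.start()
    workers = [threading.Thread(target=process_chunk, args=(c, results)) for c in chunks]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    results.put(_DONE)  # every worker has finished, so nothing follows the sentinel
    writer.join()

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        run_pipeline([[1, 2], [3, 4, 5]], os.path.join(tmp, "out.sqlite"))
```

Reading and processing scale out with the number of workers, while the write side stays sequential.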
I don't know if this is a commonly asked question, but if it is, please don't yell at me! :(
I have a Windows Form C# program that executes an UPDATE query every 2 seconds with the threading timer.
My question is: is this dangerous? Will this make my computer run much slower? Am I firing up the CPU usage? I'm a pretty concerned guy when it comes to constantly using something every second.
EDIT: It's UPDATE, not INSERT sorry!
This always depends a lot on the size of the operation that is done every 2 seconds; if the operation takes 1.5 seconds to pre-process, execute and post-process, then it will be a problem. If it takes 4ms, probably not. You also need to think about the server; even if we say it takes 4ms, that could be parallelised over 8 cores, so that is 32ms - and if you have 2000 users all doing that every 2 seconds, it starts to add up.
But by itself: fine.
And client-side, on a modern multi-core PC, this is probably not even enough to register as the tiniest blip on the graph.
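To put numbers on "it starts to add up", here is the arithmetic from the example above as a tiny Python helper. The function name is made up, and the model is deliberately crude: it assumes every call costs the same CPU time and the calls are spread evenly over the interval.

```python
def cores_kept_busy(users, cpu_ms_per_call, interval_s):
    """Average number of server cores kept fully busy by the polling load."""
    cpu_seconds_per_interval = users * cpu_ms_per_call / 1000
    return cpu_seconds_per_interval / interval_s

# 4 ms of wall time spread over 8 cores is 32 ms of CPU time per call.
print(cores_kept_busy(users=1, cpu_ms_per_call=32, interval_s=2))
print(cores_kept_busy(users=2000, cpu_ms_per_call=32, interval_s=2))
```

Under those assumptions a single user is a rounding error, while 2000 users would need the equivalent of 32 cores working full time (64 CPU-seconds of work arriving every 2 seconds).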
The answer completely depends on the amount of work the update statement is performing. If it is updating millions of rows every two seconds, then it will definitely impact the performance.
However, if you are only updating a handful of rows (up to say, 100,000) in an SQL Server database, then this frequency should be perfectly acceptable.
The manner in which the update is performed is also important: using cursors, linked servers, CLR functions, databases other than SQL Server (e.g. Access), and many, many other factors can all significantly impact the performance.
I need to load the results of multiple SQL statements from SQL Server into DataTables. Most of the statements return some 10,000 to 100,000 records and each takes up to a few seconds to load.
My guess is that this is simply due to the amount of data that needs to be shoved around. The statements themselves don't take much time to process.
So I tried to use Parallel.For() to load the data in parallel, hoping that the overall processing time would decrease. I do get a 10% performance increase, but that is not enough. A reason might be that my machine is only a dual core, thus limiting the benefit here. The server on which the program will be deployed has 16 cores though.
My question is, how could I improve the performance further? Would the use of Asynchronous Data Service Queries (BeginExecute, etc.) be a better solution than PLINQ? Or maybe some other approach?
The SQL Server is running on the same machine. This is also the case on the deployment server.
EDIT:
I've run some tests with using a DataReader instead of a DataTable. This already decreased the load times by about 50%. Great! Still I am wondering whether parallel processing with BeginExecute would improve the overall load time if a multiprocessor machine is used. Does anybody have experience with this? Thanks for any help on this!
UPDATE:
I found that about half of the loading time was consumed by processing the SQL statement. In SQL Server Management Studio the statements took only a fraction of the time, but somehow they take much longer through ADO.NET. So by using DataReaders instead of loading DataTables and adapting the SQL statements I've come down to about 25% of the initial loading time. Loading the DataReaders in parallel threads with Parallel.For() does not make an improvement here. So for now I am happy with the result and will leave it at that. Maybe when we update to .NET 4.5 I'll give the asynchronous DataReader loading a try.
"My guess is that this is simply due to the amount of data that needs to be shoved around."
No, it is due to using a SLOW framework. I am pulling nearly a million rows into a dictionary in less than 5 seconds in one of my apps. DataTables are SLOW.
You have to change the nature of the problem. Let's be honest, who needs to view 10,000 to 100,000 records per request? I think no one.
You need to consider paging, and in your case paging should be done on SQL Server. To make this clear, let's say you have a stored procedure named "GetRecords". Modify this stored procedure to accept a page parameter and return only the data relevant for that page (let's say 100 records only) plus the total page count. Inside the app, just show these 100 records (they will fly) and keep track of the selected page index.
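A minimal sketch of that paging contract, using Python and an in-memory SQLite table as a stand-in for the real stored procedure. The `records` table, its columns, and the demo data are assumptions; SQLite spells the clause LIMIT ... OFFSET, whereas SQL Server 2012 and later use OFFSET ... FETCH NEXT.

```python
import math
import sqlite3

def make_demo_db(row_count):
    # Hypothetical table used only for this example.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT)")
    with conn:
        conn.executemany(
            "INSERT INTO records VALUES (?, ?)",
            [(i, f"name-{i}") for i in range(1, row_count + 1)],
        )
    return conn

def get_records(conn, page, page_size=100):
    """Return (rows of the 1-based page, total page count)."""
    total = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    total_pages = math.ceil(total / page_size)
    rows = conn.execute(
        "SELECT id, name FROM records ORDER BY id LIMIT ? OFFSET ?",
        (page_size, (page - 1) * page_size),
    ).fetchall()
    return rows, total_pages
```

The stable ORDER BY matters: without it, consecutive pages are not guaranteed to be disjoint.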
Hope this helps, best regards!
Do you often have to load these requests? If so, why not use a distributed cache?
I am trying to use sqlite in my application as a sort of cache. I say sort of because items never expire from my cache and I am not storing anything. I simply need to use the cache to store all ids I processed before. I don't want to process anything twice.
I am entering items into the cache at 10,000 messages/sec for a total of 150 million messages. My table is pretty simple. It only has one text column which stores the id's. I was doing this all in memory using a dictionary, however, I am processing millions of messages and, although it is fast that way, I ran out of memory after some time.
I have researched sqlite and performance and I understand that configuration is key, however, I am still getting horrible performance on inserts (I haven't tried selects yet). I am not able to keep up with even 5000 inserts/sec. Maybe this is as good as it gets.
My connection string is as below:
Data Source=filename;Version=3;Count Changes=off;Journal Mode=off;
Pooling=true;Cache Size=10000;Page Size=4096;Synchronous=off
Thanks for any help you can provide!
If you are doing lots of inserts or updates at once, put them in a transaction.
Also, if you are executing essentially the same SQL each time, use a parameterized statement.
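Both suggestions together look roughly like this. The sketch is in Python because SQLite ships in its standard library; the same two ideas apply through the .NET provider. The `seen` table and the function names are made up for the example, and making the id column a primary key is an assumption added so duplicates can be skipped with INSERT OR IGNORE.

```python
import sqlite3

def open_cache(path=":memory:"):
    conn = sqlite3.connect(path)
    # The primary key gives an index for fast "have I seen this id?" lookups.
    conn.execute("CREATE TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY)")
    return conn

def add_ids(conn, ids):
    """Insert a whole batch of ids inside a single transaction."""
    with conn:  # one commit for the batch instead of one per row
        conn.executemany(
            "INSERT OR IGNORE INTO seen (id) VALUES (?)",  # parameterized statement
            ((i,) for i in ids),
        )

def is_seen(conn, message_id):
    row = conn.execute("SELECT 1 FROM seen WHERE id = ?", (message_id,)).fetchone()
    return row is not None
```

Buffering a few thousand ids and passing them to `add_ids` in one call amortizes the commit cost across the batch, which is usually where per-row inserts lose their time.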
Have you looked at the SQLite Optimization FAQ (a bit old)?
SQLite performance tuning and optimization on embedded systems
If you have many threads writing to the same database, then you're going to run into concurrency problems with that many transactions per second. SQLite always locks the whole database for writes so only one write transaction can be processed at a time.
An alternative is Oracle Berkeley DB with SQLite. The latest version of Berkeley DB includes a SQLite front end that has a page-level locking mechanism instead of database-level locking. This provides a much higher number of transactions per second when there is a high concurrency requirement.
http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html
It includes the same SQLite.NET provider and is supposed to be a drop-in replacement.
Since your requirements are so specific, you may be better off with something more dedicated, like memcached. This will provide a very high throughput caching implementation that will be a lot more memory efficient than a simple hashtable.
Is there a port of memcache to .Net?