Running multiple instances of a program to speed up processing? - C#

I have a program that performs a long-running process. It loops through thousands of records, one at a time, and calls a stored proc on each iteration. Would running two instances of a program like this, with one processing half the records and the other processing the other half, speed up the processing?
Here are the scenarios:
1 program, running the long-running process.
2 instances of the program on the same server, connecting to the same database, each responsible for processing half (50%) of the records.
2 instances on different servers, connecting to the same database, each responsible for half (50%) of the records.
Would scenario 2 or 3 run twice as fast as 1? Would there be a difference between 2 and 3? The main bottleneck is the stored proc call that takes around half a second.
Thanks!

This depends on a lot of factors. Also note that threads may be more appropriate than processes. Or maybe not. Again: it depends. But: is this work CPU-bound? Network-bound? Or bound by what the database server can do? Adding concurrency helps with CPU-bound work, and when talking to multiple independent resources. Fighting over the same network connection or the same database server is unlikely to improve things - and can make things much worse.
Frankly, from the sound of it your best bet may be to re-work the sproc to work in batches (rather than individual records).
To answer this question properly you need to know the current resource utilization of the database server: can it take extra load? Or simpler - just try it and see.
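The batching suggestion above can be sketched roughly. This is a minimal illustration in Python, with an in-memory SQLite table standing in for the real database and the stored proc; the table and column names are made up for the example, and the real win in SQL Server would come from a set-based proc or a table-valued parameter rather than `executemany`:

```python
import sqlite3

# Hypothetical stand-in for the real table the stored proc works on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, processed INTEGER DEFAULT 0)")
conn.executemany("INSERT INTO records (id) VALUES (?)", [(i,) for i in range(10_000)])

def process_one_at_a_time(ids):
    # One round trip per record -- the pattern the question describes.
    for i in ids:
        conn.execute("UPDATE records SET processed = 1 WHERE id = ?", (i,))

def process_in_batch(ids):
    # One call carries the whole batch; far fewer round trips.
    conn.executemany("UPDATE records SET processed = 1 WHERE id = ?",
                     [(i,) for i in ids])

process_one_at_a_time(range(100))
process_in_batch(range(100, 10_000))
processed = conn.execute("SELECT COUNT(*) FROM records WHERE processed = 1").fetchone()[0]
```

With a remote database the per-record version pays the half-second round trip thousands of times; the batched version pays it once per batch, which is why reworking the proc usually beats adding a second process.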

It really depends what the stored procedure is doing. If the stored procedure is going to be updating the records, and you have a single database instance then there is going to be contention when writing the data back.
The factors at play here are:
The time it takes to read the data into your application memory (this also depends on whether you are using client-side or SQL-Server-side cursors).
The time it takes to process, i.e. run your application logic.
The time it takes to write an updated item back (assuming the proc updates).
One solution (and this is by no means a perfect solution without knowing the exact requirements) is:
Have X servers read Y records, and process them.
Have those servers write the results back to a dedicated writing server in a serialized fashion to avoid the contention.
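The "dedicated writer" idea above can be sketched with a queue: workers process records concurrently and hand results to a single writer that owns the database connection, so writes are serialized and never contend. This is a minimal Python sketch (SQLite stands in for the database; all names are illustrative):

```python
import queue
import sqlite3
import threading

results = queue.Queue()
done = object()  # sentinel each worker posts when it finishes

def worker(records):
    # "X servers read Y records and process them" -- processing is a stub here.
    for r in records:
        results.put(r * 2)
    results.put(done)

def writer(num_workers):
    # The single writer owns the connection, so every write is serialized
    # and there is no write contention at all.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE output (value INTEGER)")
    finished = 0
    while finished < num_workers:
        item = results.get()
        if item is done:
            finished += 1
        else:
            conn.execute("INSERT INTO output (value) VALUES (?)", (item,))
    return conn.execute("SELECT COUNT(*) FROM output").fetchone()[0]

workers = [threading.Thread(target=worker, args=(range(100),)) for _ in range(4)]
for t in workers:
    t.start()
written = writer(4)
for t in workers:
    t.join()
```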

Related

Multithreaded application with database read - each thread unique records

I have a .NET application which reads about a million records from a database table each time it runs (every 5 minutes), does some processing, and updates the table, marking the records as processed.
Currently the application runs in a single thread, taking roughly the top 4K records from the table, processing them, updating them, and then taking the next batch.
I'm using Dapper with stored procedures. I retrieve 4K records at a time to avoid table locks.
What would be the most optimal way to retrieve records in multiple threads while ensuring that each thread gets a fresh batch of 4K records?
My current idea is that I would first retrieve just the ids of the 1M records, sort them ascending, and split them into 4K batches, remembering the lowest and highest id in each batch.
Then in each thread I would call another stored procedure which retrieves the full records by specifying the lowest and highest ids of the batch, process them, and so on.
Is there any better pattern I'm not aware of?
I find this problem interesting partly because I'm attempting to do something similar in principle but also because I haven't seen a super intuitive industry standard solution to it. Yet.
What you are proposing to do would work if you write your SQL query correctly.
Using ROW_NUMBER / BETWEEN it should be achievable.
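The proposed id-range scheme looks roughly like this. The sketch below uses Python with SQLite purely to show the shape of it (in SQL Server the range fetch would be a proc taking `@LowId`/`@HighId` parameters, and step 2 would run on the thread pool); table and column names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO records (id, payload) VALUES (?, 'x')",
                 [(i,) for i in range(1, 10_001)])

# Step 1: fetch only the ids, sorted ascending, and cut them into batches,
# remembering the lowest and highest id of each batch.
ids = [row[0] for row in conn.execute("SELECT id FROM records ORDER BY id")]
batch_size = 4000
ranges = [(chunk[0], chunk[-1])
          for chunk in (ids[i:i + batch_size] for i in range(0, len(ids), batch_size))]

# Step 2: each thread then fetches its full records by range --
# no two ranges overlap, so no thread ever sees another thread's rows.
def fetch_batch(lo, hi):
    return conn.execute(
        "SELECT id, payload FROM records WHERE id BETWEEN ? AND ?", (lo, hi)
    ).fetchall()

total = sum(len(fetch_batch(lo, hi)) for lo, hi in ranges)
```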
I'll write and document some other alternatives here along with benefits / caveats.
Parallel processing
I understand that you want to do this in SQL Server, but just as a reference, Oracle implemented this as a keyword with which you can run queries in parallel.
Documentation: https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
SQL Server implements this differently: you have to turn it on explicitly through a more complex hint, and you have to be on a certain version.
A nice article on this is here: https://www.mssqltips.com/sqlservertip/4939/how-to-force-a-parallel-execution-plan-in-sql-server-2016/
You can combine parallel processing with SQL CLR integration, which would effectively do what you're trying to do inside SQL Server, with SQL Server managing the data chunks rather than your threads.
SQL CLR integration
One nice feature that you might look into is executing .NET code in SQL Server. Documentation here: https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/sql/introduction-to-sql-server-clr-integration
This would basically allow you to run C# code in your SQL Server, saving you the read / process / write round trip. The tooling around this has improved as well (SQL Server Integration Services) - documentation here: https://learn.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-2017
Reviewing the QoS and getting the logs when something goes wrong is not as easy as it would be in a worker job, though, unfortunately.
Use a single thread (if you're reading from an external source)
Parallelism is only good for you if certain conditions are met. Below is from Oracle's documentation but it also applies to MSSQL: https://docs.oracle.com/cd/B19306_01/server.102/b14223/usingpe.htm#DWHSG024
Parallel execution improves processing for:
Queries requiring large table scans, joins, or partitioned index scans
Creation of large indexes
Creation of large tables (including materialized views)
Bulk inserts, updates, merges, and deletes
There are also setup / environment requirements:
Parallel execution benefits systems with all of the following characteristics:
Symmetric multiprocessors (SMPs), clusters, or massively parallel systems
Sufficient I/O bandwidth
Underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%)
Sufficient memory to support additional memory-intensive processes, such as sorts, hashing, and I/O buffers
There are other constraints too. When you use multiple threads for the operation you propose, if one of those threads gets killed, fails to do something, throws an exception, etc., you will absolutely need to handle that, in a way that records the last index you have processed so you can retry the remaining records.
With a single thread that becomes far simpler.
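The single-threaded checkpoint idea can be sketched as follows. This is a hypothetical Python/SQLite illustration (in the real system the checkpoint would live in its own table or the records' `processed` flag); the point is that one persisted "last id" is all the recovery state a single thread needs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE checkpoint (last_id INTEGER)")
conn.execute("INSERT INTO checkpoint VALUES (0)")
conn.executemany("INSERT INTO records (id) VALUES (?)", [(i,) for i in range(1, 101)])

def run(fail_at=None):
    # Resume from the last checkpoint; persist it after every record.
    last = conn.execute("SELECT last_id FROM checkpoint").fetchone()[0]
    rows = conn.execute("SELECT id FROM records WHERE id > ? ORDER BY id",
                        (last,)).fetchall()
    for (rid,) in rows:
        if fail_at is not None and rid == fail_at:
            raise RuntimeError("simulated crash")
        # ... real processing would happen here ...
        conn.execute("UPDATE checkpoint SET last_id = ?", (rid,))

try:
    run(fail_at=40)          # crash partway through the batch...
except RuntimeError:
    pass
run()                        # ...the restart picks up at the checkpoint
last = conn.execute("SELECT last_id FROM checkpoint").fetchone()[0]
```

With multiple threads you would need one such checkpoint per thread, plus logic to reassign a dead thread's range, which is exactly the extra complexity described above.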
Conclusion
Assuming that the DB is modeled correctly and couldn't be optimized further, I'd say the simplest solution, a single thread, is the best one. It is easier to log and track errors and easier to implement retry logic, and I'd say those advantages far outweigh the benefits you would see from parallel processing. You might look into parallel processing for the batch updates you'll make to the DB, but unless you're going to have a CLR DLL in SQL Server whose methods you invoke in parallel, I don't see the benefits winning out. Your system would also have to behave a certain way while you're running the parallel query for it to be more efficient.
You can of course design your worker role to be async and not block on each record's processing. You'd still be multi-threaded, but your querying would happen in a single thread.
Edit to conclusion
After talking to a colleague about this today, it's worth adding that even with the single-thread approach you'd have to be able to recover from failure. So in principle, multiple threads vs. a single thread doesn't change the requirement to recover / fail gracefully and remember what you have processed. How you recover would change, though, since with multiple threads you'd have to write more complex code to track the threads and their states.

Concurrency issues writing to multiple SQLite databases simultaneously on different app threads

I've done a lot of searching on this, and haven't had a lot of luck. As a test, I've written a C# WinForms app where I spin up a configurable number of threads and have each thread write a configurable amount of data to a number of tables in a SQLite database created by that thread. So each thread creates its own SQLite database, and only that thread interacts with it.
What I'm seeing is that there's definitely some performance degradation happening as a result of the concurrency. For example, if I start all the threads roughly simultaneously, performance writing to SQLite's tables PLUMMETS compared to putting a random start delay in each thread to spread out their access.
SQLite is easily fast enough for my tests - I can insert 20,000 rows into a table in a third of a second - but once I start up 250 threads, those same 20,000 rows can take MINUTES to write to each of the databases.
I've tried a lot of things, including periodic commits, setting Synchronous=Off, using parameterized queries, etc. Those all help by shortening the amount of time each statement takes (and therefore reducing the chance of concurrent activity), but nothing has really solved it and I'm hoping someone can give some advice.
Thanks!
Andy
Too much write concurrency in any relational database does cause some slowdown. Depending on the scenario you are trying to optimize, you can do various other things; a few I can think of are:
1) Create batches instead of concurrent writes. If you expect a large number of users writing simultaneously, collect their data and flush it in larger groups. Be warned, though, that while the queue is collecting, if the application goes down you lose that data, so do this only for non-critical data such as logs.
2) If your threads need to do other work before inserting the data, you can keep your threads but add a semaphore (or something equivalent) around the part of the code where the insertion takes place. This limits the concurrency and speeds up the entire process.
3) If what you are trying to do is a bulk insert via a tool you are building, then mention that in your question; a lot of MySQL DBAs will answer it better than I can.
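Option 2 can be sketched like this. It's a hypothetical Python illustration of the same shape as the C# test app: each thread owns its own in-memory SQLite database, but a semaphore caps how many threads are inside the insert section at once (the cap of 4 is arbitrary; in C# this would be a `SemaphoreSlim`):

```python
import sqlite3
import threading

MAX_CONCURRENT_WRITERS = 4            # arbitrary cap for the sketch
gate = threading.Semaphore(MAX_CONCURRENT_WRITERS)
totals = []
totals_lock = threading.Lock()

def thread_work(rows):
    # Each thread creates and uses its own database, as in the question.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (v INTEGER)")
    data = [(i,) for i in range(rows)]  # the thread's "other work" happens here
    with gate:                          # at most 4 threads insert at a time
        conn.executemany("INSERT INTO t (v) VALUES (?)", data)
        conn.commit()
    with totals_lock:
        totals.append(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])

threads = [threading.Thread(target=thread_work, args=(1000,)) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With 250 threads the gate keeps disk and page-cache pressure bounded instead of letting every thread fight for I/O simultaneously, which is the plummet described in the question.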

Constantly sending SQL queries

I don't know if this a common question asked, but if it is, please don't yell at me! :(
I have a Windows Forms C# program that executes an UPDATE query every 2 seconds using a threading timer.
My question is: is this dangerous? Will this make my computer run much slower? Am I firing up the CPU usage? I'm a pretty concerned guy when it comes to constantly using something every second.
EDIT: It's UPDATE, not INSERT sorry!
This always depends a lot on the size of the operation that is done every 2 seconds; if the operation takes 1.5 seconds to pre-process, execute and post-process, then it will be a problem. If it takes 4ms, probably not. You also need to think about the server; even if we say it takes 4ms, that could be parallelised over 8 cores, so that is 32ms - and if you have 2000 users all doing that every 2 seconds, it starts to add up.
But by itself: fine.
And client-side, on a modern multi-core PC, this is probably not even enough to register as the tiniest blip on the graph.
The answer completely depends on the amount of work the update statement is performing. If it is updating millions of rows every two seconds, then it will definitely impact the performance.
However, if you are only updating a handful of rows (up to say, 100,000) in an SQL Server database, then this frequency should be perfectly acceptable.
The manner in which the update is performed is also important: using cursors, linked servers, CLR functions, databases other than SQL Server (e.g. Access), and many, many other factors can all significantly impact performance.
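One practical safeguard implied by the answers above: make sure a slow update can never overlap the next tick. This is a hypothetical sketch in Python (the class and its fields are invented for illustration; in C# the equivalent would be a non-blocking `Monitor.TryEnter` or a flag around the timer callback):

```python
import threading
import time

class PeriodicUpdater:
    """Runs `work` on each tick, skipping a tick if the previous run
    is still in flight instead of letting runs pile up."""

    def __init__(self, work):
        self.work = work
        self._running = threading.Lock()
        self.ticks = 0      # completed updates
        self.skipped = 0    # ticks dropped because work was still running

    def tick(self):
        if not self._running.acquire(blocking=False):
            self.skipped += 1          # previous update still in flight
            return
        try:
            self.work()
            self.ticks += 1
        finally:
            self._running.release()

updater = PeriodicUpdater(work=lambda: time.sleep(0.001))
for _ in range(5):
    updater.tick()   # a real app would call this from a 2-second timer
```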

Loading multiple large ADO.NET DataTables/DataReaders - Performance improvements

I need to load multiple SQL statements from SQL Server into DataTables. Most of the statements return some 10.000 to 100.000 records, and each takes up to a few seconds to load.
My guess is that this is simply due to the amount of data that needs to be shoved around. The statements themselves don't take much time to process.
So I tried to use Parallel.For() to load the data in parallel, hoping that the overall processing time would decrease. I do get a 10% performance increase, but that is not enough. A reason might be that my machine is only a dual core, thus limiting the benefit here. The server on which the program will be deployed has 16 cores though.
My question is, how I could improve the performance more? Would the use of Asynchronous Data Service Queries be a better solution (BeginExecute, etc.) than PLINQ? Or maybe some other approach?
The SQL Server is running on the same machine. This is also the case on the deployment server.
EDIT:
I've run some tests with using a DataReader instead of a DataTable. This already decreased the load times by about 50%. Great! Still I am wondering whether parallel processing with BeginExecute would improve the overall load time if a multiprocessor machine is used. Does anybody have experience with this? Thanks for any help on this!
UPDATE:
I found that about half of the loading time was consumed by processing the sql statement. In SQL Server Management Studio the statements took only a fraction of the time, but somehow they take much longer through ADO.NET. So by using DataReaders instead of loading DataTables and adapting the sql statements I've come down to about 25% of the initial loading time. Loading the DataReaders in parallel threads with Parallel.For() does not make an improvement here. So for now I am happy with the result and leave it at that. Maybe when we update to .NET 4.5 I'll give the asynchronous DataReader loading a try.
My guess is that this is simply due to the amount of data that needs to be shoved around.
No, it is due to using a SLOW framework. I am pulling nearly a million rows into a dictionary in less than 5 seconds in one of my apps. DataTables are SLOW.
You have to change the nature of the problem. Let's be honest: who needs to view 10.000 to 100.000 records per request? I think no one.
You should consider paging, and in your case the paging should be done on SQL Server. To make this clear, let's say you have a stored procedure named "GetRecords". Modify it to accept a page parameter and return only the data relevant to that page (say, 100 records) plus the total page count. Inside the app, just show those 100 records (they will fly) and track the selected page index.
Hope this helps, best regards!
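The paging answer can be sketched like this. It's a minimal Python/SQLite illustration of a "GetRecords(page)" that returns one page plus the total page count (in SQL Server you'd use `ORDER BY ... OFFSET/FETCH` or `ROW_NUMBER` inside the proc rather than SQLite's `LIMIT`/`OFFSET`; names and page size are illustrative):

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO records (id) VALUES (?)", [(i,) for i in range(1, 1001)])

PAGE_SIZE = 100

def get_records(page):
    # Returns (rows for this page, total page count), like the modified
    # "GetRecords" proc described above.
    total = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    pages = math.ceil(total / PAGE_SIZE)
    rows = conn.execute(
        "SELECT id FROM records ORDER BY id LIMIT ? OFFSET ?",
        (PAGE_SIZE, (page - 1) * PAGE_SIZE),
    ).fetchall()
    return rows, pages

rows, pages = get_records(3)
```

Only one page ever crosses the wire, so the 10.000-to-100.000-row transfer cost disappears entirely from the request path.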
Do you often have to load these requests? If so, why not use a distributed cache?

Fetching records from database

In my C# 3.5 application, the code performs the following steps:
1. Loop through a collection [of length 10].
2. For each item in step 1, fetch records from an Oracle database by executing a stored proc [here the record count is typically 100].
3. Process the items fetched in step 2.
4. Go to the next item in step 1.
My question, with regard to performance: is it a good idea to fetch all the items in step 2 [i.e. 10 * 100 = 1000 records] in one shot, rather than connecting to the database on each iteration and retrieving 100 records at a time?
Thanks.
Yes, it's slightly better, because you will lose the overhead of connecting to the DB, but you will still have the overhead of 10 stored procedure calls. If you could find a way to pass all 10 items as parameters to the stored proc and execute just one call, I think you would get better performance.
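The one-call idea can be sketched like this. The Python/SQLite version below uses an `IN` list to fetch all 10 items' records at once; in Oracle the equivalent would be passing a collection type to the proc (names and data are made up for the sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (item_id INTEGER, value TEXT)")
# 10 items, 100 records each -- the shape described in the question.
conn.executemany("INSERT INTO records VALUES (?, 'v')",
                 [(i,) for i in range(10) for _ in range(100)])

items = list(range(10))

def fetch_per_item(items):
    # 10 separate calls, one per collection item (the current pattern).
    rows = []
    for i in items:
        rows += conn.execute("SELECT * FROM records WHERE item_id = ?",
                             (i,)).fetchall()
    return rows

def fetch_all_at_once(items):
    # A single call covering every item; a collection parameter to the
    # stored proc would play the role of this IN list.
    placeholders = ",".join("?" * len(items))
    return conn.execute(
        f"SELECT * FROM records WHERE item_id IN ({placeholders})", items
    ).fetchall()

combined = fetch_all_at_once(items)
```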
Depending on how intense the connection steps are, it might be better to fetch all the records at once. However, keep in mind that premature optimization is the root of all evil. :-)
Generally it is better to pull all the records from the database in one stored procedure call.
This is countered when the stored procedure call is long-running or otherwise extensive enough to cause contention on the table. In your case, however, with only 1000 records, I doubt that will be an issue.
Yes, it is an incredibly good idea. The key to database performance is to run as many operations in bulk as possible.
For example, consider just the interaction between PL/SQL and SQL. These two languages run on the same server and are very thoroughly integrated. Yet I routinely see an order of magnitude performance increase when I reduce or eliminate any interaction between the two. I'm sure the same thing applies to interaction between the application and the database.
Even though the number of records may be small, bulking your operations is an excellent habit to get into. It's not premature optimization, it's a best practice that will save you a lot of time and effort later.
