I have a .NET application which reads about a million records from a database table each time it runs (every 5 minutes), does some processing, and updates the table, marking the records as processed.
Currently the application runs in a single thread, taking roughly the top 4K records from the DB table, processing them, updating the records, and taking the next batch.
I'm using dapper with stored procedures. I'm using 4K records for retrieval to avoid DB table locks.
What would be the most optimal way for retrieving records in multiple threads and at the same time ensuring that each thread gets a new 4K records?
My current idea is that I would first retrieve just the ids of the 1M records, sort the ids ascending, and split them into 4K batches, remembering the lowest and highest id in each batch.
Then in each thread I would call another stored procedure which retrieves the full records by specifying the lowest and highest ids of the batch, process them, and so on.
Is there any better pattern I'm not aware of?
I find this problem interesting partly because I'm attempting to do something similar in principle but also because I haven't seen a super intuitive industry standard solution to it. Yet.
What you are proposing to do would work if you write your SQL query correctly.
Using ROW_NUMBER / BETWEEN it should be achievable.
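For concreteness, here's a minimal Dapper sketch of the approach the question describes: fetch the ids once, split them into 4K ranges, and let each worker pull its own range with BETWEEN. The table, column, and flag names (dbo.Records, Id, Processed) are placeholders for whatever your schema and stored procedures actually use; a ROW_NUMBER variant would do the same slicing inside the procedure instead of in C#.

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;
using System.Threading.Tasks;
using Dapper;

public class Record
{
    public long Id { get; set; }
    // ... other columns of the (placeholder) dbo.Records table ...
}

public static class RangeBatcher
{
    private const int BatchSize = 4000;

    public static async Task RunAsync(string connectionString)
    {
        List<long> ids;
        using (var conn = new SqlConnection(connectionString))
        {
            // One cheap query for the ids only, already sorted ascending.
            ids = (await conn.QueryAsync<long>(
                "SELECT Id FROM dbo.Records WHERE Processed = 0 ORDER BY Id")).AsList();
        }

        // Split the sorted ids into contiguous 4K ranges (lowest/highest per batch).
        var ranges = new List<(long Low, long High)>();
        for (var i = 0; i < ids.Count; i += BatchSize)
        {
            var last = Math.Min(i + BatchSize, ids.Count) - 1;
            ranges.Add((ids[i], ids[last]));
        }

        // Each task owns a disjoint id range, so no two workers fetch the same rows.
        await Task.WhenAll(ranges.Select(r => ProcessRangeAsync(connectionString, r.Low, r.High)));
    }

    private static async Task ProcessRangeAsync(string connectionString, long low, long high)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            var rows = await conn.QueryAsync<Record>(
                "SELECT * FROM dbo.Records WHERE Id BETWEEN @Low AND @High AND Processed = 0",
                new { Low = low, High = high });

            foreach (var row in rows)
            {
                // ... application processing ...
            }

            await conn.ExecuteAsync(
                "UPDATE dbo.Records SET Processed = 1 WHERE Id BETWEEN @Low AND @High",
                new { Low = low, High = high });
        }
    }
}
```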
I'll write and document some other alternatives here along with benefits / caveats.
Parallel processing
I understand that you want to do this in SQL Server, but just as a reference, Oracle implemented this as a keyword with which you can run queries in parallel.
Documentation: https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
SQL Server implements this differently; you have to explicitly turn it on through a more involved mechanism, and you have to be on a certain version:
A nice article on this is here: https://www.mssqltips.com/sqlservertip/4939/how-to-force-a-parallel-execution-plan-in-sql-server-2016/
You can combine the parallel processing with SQL CLR integration, which would effectively do what you're trying to do inside SQL Server, with SQL Server managing the data chunks instead of you managing them in your threads.
SQL CLR integration
One nice feature that you might look into is executing .net code in SQL server. Documentation here: https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/sql/introduction-to-sql-server-clr-integration
This would basically allow you to run C# code inside your SQL Server, saving you the read / process / write round trip. The integration tooling around this has improved as well - documentation here: https://learn.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-2017
Reviewing the QoS and getting the logs when something goes wrong is not as easy as it would be in a worker job, though, unfortunately.
Use a single thread (if you're reading from an external source)
Parallelism is only good for you if certain conditions are met. Below is from Oracle's documentation but it also applies to MSSQL: https://docs.oracle.com/cd/B19306_01/server.102/b14223/usingpe.htm#DWHSG024
Parallel execution improves processing for:
Queries requiring large table scans, joins, or partitioned index scans
Creation of large indexes
Creation of large tables (including materialized views)
Bulk inserts, updates, merges, and deletes
There are also setup / environment requirements
Parallel execution benefits systems with all of the following characteristics:
Symmetric multiprocessors (SMPs), clusters, or massively parallel systems
Sufficient I/O bandwidth
Underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%)
Sufficient memory to support additional memory-intensive processes, such as sorts, hashing, and I/O buffers
There are other constraints. When you are using multiple threads to do the operation that you propose, if one of those threads gets killed, fails to do something, throws an exception, etc., you will absolutely need to handle that - in a way that keeps track of the last index you've processed - so you can retry the rest of the records.
With a single thread that becomes way simpler.
Conclusion
Assuming that the DB is modeled correctly and couldn't be optimized any further, I'd say the simplest solution, a single thread, is the best one. It is easier to log and track errors and easier to implement retry logic, and I'd say those far outweigh the benefits you would see from the parallel processing. You might look into the parallel-processing bit for the batch updates that you'll do to the DB, but unless you're going to have a CLR DLL in SQL Server whose methods you invoke in a parallel fashion, I don't see the benefits outweighing the costs. Your system will also have to behave a certain way at the times you're running the parallel query for it to be more efficient.
You can of course design your worker role to be async and not block on each record's processing. So you'll still be multi-threaded, but your querying would happen on a single thread.
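A rough sketch of that shape, assuming a hypothetical stored procedure name and reusing the Record class from the earlier sketch: all querying stays serialized on one logical thread, and only the CPU-bound per-record work fans out.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;
using System.Threading.Tasks;
using Dapper;

public static class SingleQueryThreadWorker
{
    public static async Task RunAsync(string connectionString)
    {
        while (true)
        {
            List<Record> batch;   // Record as defined in the earlier sketch
            using (var conn = new SqlConnection(connectionString))
            {
                // All querying happens here, one batch at a time.
                batch = (await conn.QueryAsync<Record>(
                    "dbo.GetNextUnprocessedBatch",          // hypothetical proc name
                    new { BatchSize = 4000 },
                    commandType: CommandType.StoredProcedure)).AsList();
            }

            if (batch.Count == 0)
                break;

            // Only the CPU-bound per-record work fans out across threads.
            Parallel.ForEach(batch, record => Process(record));

            using (var conn = new SqlConnection(connectionString))
            {
                // Dapper runs this statement once per element of the sequence.
                await conn.ExecuteAsync(
                    "UPDATE dbo.Records SET Processed = 1 WHERE Id = @Id",
                    batch.Select(r => new { r.Id }));
            }
        }
    }

    private static void Process(Record record)
    {
        // ... application logic ...
    }
}
```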
Edit to conclusion
After talking to my colleague about this today, it's worth adding that even with the single-thread approach you'd have to be able to recover from failure, so in principle the requirement of recovery / graceful failure and remembering what you processed doesn't change between multiple threads and a single thread. How you recover would, though, given that you'd have to write more complex code to track your multiple threads and their states.
Related
I've done a lot of searching on this, and haven't had a lot of luck. As a test, I've written a C# WinForms app where I spin up a configurable number of threads and have each thread write a configurable amount of data to a number of tables in a SQLite database created by the thread. So each thread creates its own SQLite database, and only that thread interacts with it.
What I'm seeing is that there's definitely some performance degradation happening as a result of the concurrency. For example, if I start each thread roughly simultaneously, performance writing to SQLite's tables PLUMMETS compared to putting a random start delay in each thread to spread out their access.
SQLite is easily fast enough for my tests - I can insert 20,000 rows into a table in a third of a second - but once I start up 250 threads, those same 20,000 rows can take MINUTES to write to each of the databases.
I've tried a lot of things, including periodic commits, setting Synchronous=Off, using parameterized queries, etc., and those all help by shortening the amount of time each statement takes (and therefore reducing the chance of concurrent activity), but nothing's really solved it and I'm hoping someone can give some advice.
Thanks!
Andy
Too much write concurrency in any relational database does cause some slowdown. Depending upon the scenario that you are trying to optimize you can do various other things; a few I can think of are:
1) Create batches instead of concurrent writes. This means that if you are expecting a large number of users writing simultaneously, you collect their data and flush it down in larger groups. Be warned, though, that while your queue is collecting, if the application goes down you would lose the data, so only do this for non-critical data such as logs.
2) If your threads need to do other work as well before inserting the data, you can keep your threads and add a semaphore or something equivalent around the part of the code where insertion takes place; this will limit the concurrency and speed up the entire process (see the sketch after this list).
3) If what you are trying to do is a bulk insert via a tool you are building, then mention that in your question; a lot of MySQL DBAs will answer your question better than me.
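For point 2, a minimal sketch of the throttling idea using SemaphoreSlim. The limit of 4 and the WriteBatch placeholder are arbitrary assumptions; the point is just that only a few threads are ever inside the insert section at once while the rest keep doing their other work.

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledWrites
{
    // At most 4 workers are allowed into the insert section at once; the
    // rest keep doing their non-database work until the gate frees up.
    private static readonly SemaphoreSlim WriteGate = new SemaphoreSlim(4);

    public static async Task SaveAsync(string[] rows)
    {
        // ... per-thread work that doesn't touch the database ...

        await WriteGate.WaitAsync();
        try
        {
            WriteBatch(rows);   // the actual SQLite insert goes here
        }
        finally
        {
            WriteGate.Release();
        }
    }

    private static void WriteBatch(string[] rows)
    {
        // placeholder: open connection, insert rows inside a transaction, commit
    }
}
```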
I have a program that performs a long running process. Loops through thousands of records, one at a time, and calls a stored proc each iteration. Would running two instances of a program like this with one processing half the records and the other processing the other half speed up the processing?
Here are the scenarios:
1 program, running long running process
2 instances of the program on the same server, connecting to the same database, each responsible for processing half (50%) of the records.
2 instances on different servers, connecting to the same database, each responsible for half (50%) of the records.
Would scenario 2 or 3 run twice as fast as 1? Would there be a difference between 2 and 3? The main bottleneck is the stored proc call that takes around half a second.
Thanks!
This depends on a lot of factors. Also note that threads may be more appropriate than processes. Or maybe not. Again: it depends. But: is this work CPU-bound? Network-bound? Or bound by what the database server can do? Adding concurrency helps with CPU-bound, and when talking to multiple independent resources. Fighting over the same network connection or the same database server is unlikely to improve things - and can make things much worse.
Frankly, from the sound of it your best bet may be to re-work the sproc to work in batches (rather than individual records).
To answer this question properly you need to know what the resource utilization of the database server currently is: can it take extra load? Or simpler - just try it and see.
It really depends what the stored procedure is doing. If the stored procedure is going to be updating the records, and you have a single database instance then there is going to be contention when writing the data back.
The values at play here, are:
The time it takes to read the data in to your application memory (and this is also dependent on whether you are using client-side or sql-server-side cursors).
The time it takes to process, or do your application logic.
The time it takes to write an updated item back (assuming the proc updates).
One solution (and this is by no means a perfect solution without knowing the exact requirements) is:
Have X servers read Y records, and process them.
Have those servers write the results back to a dedicated writing server in a serialized fashion to avoid the contention.
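Within a single process, the same shape can be sketched with a bounded queue and exactly one writer. The table and column names, and the ProcessedItem type, are placeholders; the idea is simply that many workers produce results but only one connection ever writes them back, so the writes never contend with each other.

```csharp
using System.Collections.Concurrent;
using System.Data.SqlClient;
using System.Linq;
using System.Threading.Tasks;
using Dapper;

public class ProcessedItem
{
    public int Id { get; set; }
    public string Result { get; set; }
}

public static class SerializedWriter
{
    public static void Run(string connectionString, int[][] partitions)
    {
        using (var results = new BlockingCollection<ProcessedItem>(boundedCapacity: 10000))
        {
            // Many readers/processors, one per partition of the records...
            var producers = Task.WhenAll(partitions.Select(ids => Task.Run(() =>
            {
                foreach (var id in ids)
                    results.Add(Process(id));
            })));

            // ...but exactly one writer, so the updates are serialized.
            var writer = Task.Run(() =>
            {
                using (var conn = new SqlConnection(connectionString))
                {
                    foreach (var item in results.GetConsumingEnumerable())
                        conn.Execute(
                            "UPDATE dbo.Records SET Result = @Result WHERE Id = @Id", item);
                }
            });

            producers.Wait();
            results.CompleteAdding();
            writer.Wait();
        }
    }

    private static ProcessedItem Process(int id) =>
        new ProcessedItem { Id = id, Result = "..." };   // placeholder processing
}
```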
What is the most common and easiest-to-implement solution to improve speed for an SQL Server 2008 R2 database & a .NET 3.5 application?
We have an application with the following attributes:
- small number of simultaneous clients (~200 at MOST).
- complex math operations on SQL server side
- we are imitating something similar to Oracle's row-level security (thus using TVFs and stored procs instead of directly querying tables)
- The main problem is that users perform a high number of updates/inserts/deletes/calculations, and they freak out because they need to wait for pages to reload while those actions are done.
The questions I need clarification on are as follows:
What is faster: returning the whole dataset from SQL Server and performing the math functions on the C# side, or performing the calculation functions on the SQL side (thus not returning extra columns)? Or is it only hardware dependent?
Will caching improve performance (for example, if we add a Redis cache)? Or are caching solutions only feasible for a large number of clients?
Is it bad practice to pre-calculate some of the data and store it somewhere in the database (so when the user requests it, it is already calculated)? Or is this what caching is supposed to do? If this is not bad practice, how do you configure SQL Server to do the calculations when there are available resources?
How can caching improve performance if it still needs to go to the database to see whether any records were updated?
General suggestions and comments are also welcome.
Let's separate the answer into two parts: the performance of your query execution, and caching to improve that performance.
I believe you should start by addressing the load on your SQL Server and trying to optimize the processes running on it to the maximum; this should resolve most of the need to implement any caching.
From your question it appears that you have a system that is used both for transactional processing and for aggregations/calculations; this will often result in conflicts when these two tasks lock each other's resources. A long query performing math operations may lock or hold an object required by the UI.
Optimizing these systems to work side-by-side and improving the query efficiency is the key for having increased performance.
To start, I'll follow your questions. What is faster? It depends on the actual aggregation you are performing: if you're dealing with set operations, i.e. SUM/AVG of a column, keep it in SQL; on the other hand, if you find yourself using a cursor in the procedure, move that logic to C#. Cursors will kill your performance!
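As a tiny illustration of that rule of thumb (dbo.Orders and its columns are made up for the example): the set-based version returns one scalar, while the row-by-row version drags every row across the wire just to add it up in C#.

```csharp
using System.Data.SqlClient;
using System.Linq;
using Dapper;

public static class AggregationExamples
{
    // Set-based: the server scans and aggregates, one scalar comes back.
    public static decimal TotalInSql(SqlConnection conn) =>
        conn.ExecuteScalar<decimal?>(
            "SELECT SUM(Amount) FROM dbo.Orders WHERE Status = 'Open'") ?? 0m;

    // Row-by-row: every row crosses the wire just to be summed in C#.
    // Fine for complex per-row logic, wasteful for a plain SUM/AVG.
    public static decimal TotalInCSharp(SqlConnection conn) =>
        conn.Query<decimal>(
            "SELECT Amount FROM dbo.Orders WHERE Status = 'Open'").Sum();
}
```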
You asked if it's bad practice to aggregate data on the side and later query that repository - this is actually best practice :). You'll end up with one database catering to the transactional, high-paced clients and another database storing the aggregated info, which will be quickly and easily available for your other needs. Taking it to the next step will leave you with a data warehouse, so this is definitely where you want to be heading when you have a lot of information and calculations.
Lastly, caching: this is tricky and really depends on the specific nature of your needs. I'd say take the above approach, spend the time improving the processes, and I expect the end result will make caching redundant.
One of your best friends for the task is SQL Profiler: run a trace on stmt:completed to see which statements have the highest duration/IO/CPU and pick them off first.
Good luck!
I am trying to use sqlite in my application as a sort of cache. I say sort of because items never expire from my cache and I am not storing anything. I simply need to use the cache to store all ids I processed before. I don't want to process anything twice.
I am entering items into the cache at 10,000 messages/sec for a total of 150 million messages. My table is pretty simple. It only has one text column which stores the ids. I was doing this all in memory using a dictionary; however, I am processing millions of messages and, although it is fast that way, I ran out of memory after some time.
I have researched sqlite and performance and I understand that configuration is key, however, I am still getting horrible performance on inserts (I haven't tried selects yet). I am not able to keep up with even 5000 inserts/sec. Maybe this is as good as it gets.
My connection string is as below:
Data Source=filename;Version=3;Count Changes=off;Journal Mode=off;
Pooling=true;Cache Size=10000;Page Size=4096;Synchronous=off
Thanks for any help you can provide!
If you are doing lots of inserts or updates at once, put them in a transaction.
Also, if you are executing essentially the same SQL each time, use a parameterized statement.
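Roughly, with System.Data.SQLite that looks like the sketch below (ProcessedIds is a stand-in for your one-column table): the command is parameterized and reused for every row, and the commit happens once per batch rather than once per insert.

```csharp
using System.Collections.Generic;
using System.Data.SQLite;   // System.Data.SQLite NuGet package

public static class SqliteBulkInsert
{
    public static void InsertIds(string connectionString, IEnumerable<string> ids)
    {
        using (var conn = new SQLiteConnection(connectionString))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            using (var cmd = new SQLiteCommand(
                "INSERT INTO ProcessedIds (Id) VALUES (@id)", conn, tx))   // placeholder table
            {
                // One parameterized command, reused for every row.
                var p = cmd.Parameters.Add("@id", System.Data.DbType.String);
                foreach (var id in ids)
                {
                    p.Value = id;
                    cmd.ExecuteNonQuery();
                }
                // One commit per batch instead of one per insert.
                tx.Commit();
            }
        }
    }
}
```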
Have you looked at the SQLite Optimization FAQ (a bit old)?
SQLite performance tuning and optimization on embedded systems
If you have many threads writing to the same database, then you're going to run into concurrency problems with that many transactions per second. SQLite always locks the whole database for writes so only one write transaction can be processed at a time.
An alternative is Oracle Berkeley DB with SQLite. The latest version of Berkeley DB includes a SQLite front end that has a page-level locking mechanism instead of database-level. This provides much higher numbers of transactions per second when there is a high concurrency requirement.
http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html
It includes the same SQLite.NET provider and is supposed to be a drop-in replacement.
Since your requirements are so specific you may be better off with something more dedicated, like memcached. This will provide a very high throughput caching implementation that will be a lot more memory efficient than a simple hashtable.
Is there a port of memcache to .Net?
I have written an ETL process. The ETL process needs to process more than 100 million rows overall, covering 2 years' worth of records. To avoid out-of-memory issues, we chunk the data loading down to 7-day windows. For each chunk, the process loads all the required reference data, then opens a SQL connection and loads the source data row by row, transforms it, and writes it to the data warehouse.
The drawback of processing the data in chunks is that it is slow.
This process has been working fine for most of the tables, but there is one table where I still run into out-of-memory errors. The process loads too much reference data. I would like to avoid chunking the data down to 3 days so that it keeps decent performance.
Are there any other strategies I can use to avoid an OutOfMemoryException?
For example: a local database, writing the reference data to files, spawning another .NET process to get more memory on Windows, using a CLR stored procedure to do the ETL...
Environment: Windows 7 32 bit OS. 4 GB of RAM. SQL Server Standard Edition.
The obvious solution is to use a stored procedure and let SQL Server handle the ETL. However, I am trying to avoid it because the program needs to support Oracle as well.
Other performance improvements I have tried include adding indexes to improve the loading queries and creating a custom data access class to load only the necessary columns, instead of loading the entire row into memory.
Thanks
Without knowing exactly how you process the data it is hard to say, but a naive solution that can be implemented in any case is to use a 64-bit OS and compile your application as 64-bit. In 32-bit mode the .NET heap will only grow to about 1.5 GB, which might be limiting you.
I know it's an old post, but this is for people searching for better ways to write data operations in their own code.
I am not sure if you have considered studying how ETL tools perform their data loading operations and replicating a similar strategy in your code.
One such suggestion is parallel data pipes. Here each pipe performs the ETL on a single chunk, based on a partitioning of the data from the source. For example, you could consider spawning processes for different weeks' data in parallel. This still will not solve your memory issues within a single process, but it can be used if you reach a limit on memory allocation within the heap of a single process. It is also useful for reading the data in parallel with random access, though it will require a master process to coordinate and complete the whole thing as a single ETL operation.
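A rough in-process sketch of that partitioning idea, using threads rather than separate processes purely to show the shape; RunPipe and the degree of parallelism are placeholders, and a real version would also collect failures so the whole run can be treated as one ETL operation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelPipes
{
    public static void Run(DateTime from, DateTime to)
    {
        // Partition the extract into 7-day windows...
        var windows = new List<(DateTime Start, DateTime End)>();
        for (var start = from; start < to; start = start.AddDays(7))
            windows.Add((start, start.AddDays(7) > to ? to : start.AddDays(7)));

        // ...and run a bounded number of pipes at once. Each pipe does its
        // own extract/transform/load for its window.
        Parallel.ForEach(
            windows,
            new ParallelOptions { MaxDegreeOfParallelism = 4 },
            window => RunPipe(window.Start, window.End));
    }

    private static void RunPipe(DateTime start, DateTime end)
    {
        // placeholder: extract rows for [start, end), transform, load
    }
}
```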
I assume your transformation performs a lot of lookup operations before finally writing the data to the database. Assuming the master transaction table is huge and the reference data is small, you need to focus on data structure operations and algorithms. There are a few tips below. Consider the characteristics of your data before choosing what suits best when writing the algorithm.
Generally, lookup data (reference data) is stored in a cache. Choose a simple data structure that is efficient for read and search operations (say, an array list). If possible, sort this array by the key you will join on, so your search algorithm can be efficient.
There are different strategies for lookup operations in your transformation tasks. In the database world you would call them join operations.
Merge join algorithm:
Ideal when the source is already sorted on the join attribute. The key idea of the sort-merge algorithm is to first sort the relations by the join attribute, so that interleaved linear scans will encounter matching keys at the same time. For sample code, see https://en.wikipedia.org/wiki/Sort-merge_join
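An in-memory version of the same idea, assuming both sides are already sorted by the join key. This is a simplified 1:1 sketch, not a full merge join with duplicate handling; the key and record types are placeholders for your transaction and reference rows.

```csharp
using System;
using System.Collections.Generic;

public static class MergeJoin
{
    // Joins two lists that are already sorted by key, in a single
    // interleaved pass (O(n + m) once sorted).
    public static IEnumerable<(TLeft, TRight)> Join<TLeft, TRight, TKey>(
        IReadOnlyList<TLeft> left, Func<TLeft, TKey> leftKey,
        IReadOnlyList<TRight> right, Func<TRight, TKey> rightKey)
        where TKey : IComparable<TKey>
    {
        int i = 0, j = 0;
        while (i < left.Count && j < right.Count)
        {
            var cmp = leftKey(left[i]).CompareTo(rightKey(right[j]));
            if (cmp < 0) i++;            // left key is behind: advance left
            else if (cmp > 0) j++;       // right key is behind: advance right
            else
            {
                yield return (left[i], right[j]);
                i++;                     // 1:1 match assumed; duplicates need extra handling
            }
        }
    }
}
```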
Nested-loop join:
Works like a nested loop, where each value of the index of the outer loop is taken as a limit (or starting point, or whatever is applicable) for the index of the inner loop, and the corresponding actions are performed on the statement(s) following the inner loop. So basically, if the outer loop executes R times and for each such execution the inner loop executes S times, then the total cost or time complexity of the nested loop is O(RS).
Nested-loop joins provide efficient access when tables are indexed on join columns. Furthermore, in many small transactions, such as those affecting only a small set of rows, index nested-loop joins are far superior to both sort-merge joins and hash joins.
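A sketch of that lookup shape in application code: the "index" on the reference side is an in-memory dictionary built once, so each transaction row costs one probe instead of a rescan of the reference data (a plain nested loop would be two nested foreach loops at O(RS); strictly speaking the dictionary probe makes this a hash lookup, but it plays the same role as the index in an index nested-loop join). Transaction and Reference are placeholder types.

```csharp
using System.Collections.Generic;

public class Transaction { public long ReferenceKey { get; set; } /* ... */ }
public class Reference   { public long Key { get; set; }          /* ... */ }

public static class IndexedLookupJoin
{
    public static IEnumerable<(Transaction Tx, Reference Ref)> Join(
        IEnumerable<Transaction> transactions,
        IEnumerable<Reference> references)
    {
        // Build the "index" over the small reference side once.
        var index = new Dictionary<long, Reference>();
        foreach (var r in references)
            index[r.Key] = r;                       // last one wins; assumes unique keys

        // Outer loop over the big transaction side, one O(1) probe per row.
        foreach (var tx in transactions)
            if (index.TryGetValue(tx.ReferenceKey, out var match))
                yield return (tx, match);
    }
}
```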
I am only describing two methods that can be considered for your lookup operation. The main idea to remember in ETL is that it is all about looking up and retrieving the tuples (as a set) for further operations. The search will be based on a key, and the resulting transaction keys will extract all the records (a projection). Take these and load the rows from the file in one read operation. This is more of a suggestion in case you don't need all the records for the transformation operations.
Another very costly operation is writing back to the database. There might be a tendency to process the extraction, transformation, and loading one row at a time. Think of operations that can be vectorized, where you can perform them together as a bulk data structure operation - for example, a lambda operation over a multi-dimensional collection, rather than looping over every row one at a time and performing the transformation and operations across all columns for a given row. You can then write this collection to a file or to the database in bulk. This will avoid memory pressure.
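A sketch of that bulk shape using SqlBulkCopy: the whole batch is transformed with one set-oriented LINQ pass and pushed to the warehouse in a single bulk copy instead of row-by-row INSERTs. Note this is SQL Server specific (Oracle would need its own bulk API), and dbo.FactSales plus the SourceRow columns are made up for the example.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;

public class SourceRow
{
    public long Id { get; set; }
    public int Quantity { get; set; }
    public decimal UnitPrice { get; set; }
}

public static class BulkWrite
{
    public static void LoadBatch(string connectionString, IEnumerable<SourceRow> batch)
    {
        // Transform the whole batch in one pass into the target shape.
        var table = new DataTable();
        table.Columns.Add("Id", typeof(long));
        table.Columns.Add("Total", typeof(decimal));

        foreach (var row in batch.Select(r => new { r.Id, Total = r.Quantity * r.UnitPrice }))
            table.Rows.Add(row.Id, row.Total);

        // One bulk write instead of thousands of single-row INSERTs.
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.FactSales";   // hypothetical target table
            bulk.BatchSize = 5000;
            bulk.WriteToServer(table);
        }
    }
}
```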
This was a very old question, and it is more of a design question; I am sure there are many solutions to it, unless I get into more specific details.
Ultimately, I wrote a SQL stored procedure using MERGE to handle the ETL process for the data type that took too long to process through the C# application. In addition, the business requirements changed such that we dropped Oracle support and only support 64-bit servers, which reduced maintenance cost and avoided the ETL out-of-memory issue.
In addition, we added many indexes whenever we saw an opportunity to improve querying performance.
Instead of chunking only by a date range, the ETL process also chunks the data by count (5,000) and commits on every transaction; this reduced the transaction log file size, and if the ETL fails, the process only needs to roll back a subset of the data.
Lastly, we implemented (key, value) caches so that frequently referenced data within the ETL date range is loaded into memory to reduce database querying.