Efficiency of transactions in code vs. DB - C#

What is more efficient -- having an IDbTransaction in the .net code or handling it in the database? Why?
What are the possible scenarios in which either should be used?

When it comes to connection-based transactions (IDbTransaction), the overall performance should be pretty similar - but by handling it in the .NET code you make it possible to conveniently span multiple database operations on the same connection. If you are doing transaction management inside TSQL you should really limit it to a single TSQL query. There may well be an extra round-trip for the begin/end, but that isn't likely to hurt you.
It is pretty rare (these days) that I'd manually write TSQL-based transactions - maybe if I was writing something called directly by the server via an agent (rather than from my own application code).
The bigger difference is between IDbTransaction and TransactionScope (see Transactions in .net for more), but the short version is that TransactionScope is slightly slower (depending on the scenario) but can span multiple connections / databases (or other resources).
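For example, a minimal sketch of spanning two commands in one connection-based transaction; the table, column and parameter names are made up for illustration:

    // Minimal sketch (System.Data.SqlClient): one IDbTransaction spanning two
    // commands on the same connection. Table and parameter names are made up.
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var transaction = connection.BeginTransaction())
        {
            try
            {
                using (var debit = new SqlCommand(
                    "UPDATE Accounts SET Balance = Balance - @amount WHERE Id = @from",
                    connection, transaction))
                {
                    debit.Parameters.AddWithValue("@amount", 100m);
                    debit.Parameters.AddWithValue("@from", 1);
                    debit.ExecuteNonQuery();
                }

                using (var credit = new SqlCommand(
                    "UPDATE Accounts SET Balance = Balance + @amount WHERE Id = @to",
                    connection, transaction))
                {
                    credit.Parameters.AddWithValue("@amount", 100m);
                    credit.Parameters.AddWithValue("@to", 2);
                    credit.ExecuteNonQuery();
                }

                transaction.Commit();   // both updates commit together
            }
            catch
            {
                transaction.Rollback(); // neither update is applied
                throw;
            }
        }
    }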


Multithreaded application with database read - each thread unique records

I have a .net application which basically reads about a million of records from database table each time (every 5 minutes), does some processing and updates the table marking the records as processed.
Currently the application runs in a single thread, taking roughly the top 4K records from the DB table, processing them, updating the records, and then taking the next batch.
I'm using dapper with stored procedures. I'm using 4K records for retrieval to avoid DB table locks.
What would be the most optimal way for retrieving records in multiple threads and at the same time ensuring that each thread gets a new 4K records?
My current idea is that I would first retrieve just the ids of the 1M records, sort the ids in ascending order, and split them into 4K batches, remembering the lowest and highest id in each batch.
Then in each thread I would call another stored procedure which retrieves the full records by specifying the lowest and highest ids of the batch, process those, and so on.
Is there any better pattern I'm not aware of?
I find this problem interesting partly because I'm attempting to do something similar in principle but also because I haven't seen a super intuitive industry standard solution to it. Yet.
What you are proposing to do would work if you write your SQL query correctly.
Using ROW_NUMBER / BETWEEN it should be achievable.
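As a rough sketch of that id-range batching with Dapper (the table, stored procedure, column and parameter names here are hypothetical):

    // Rough sketch of the id-range batching described above, using Dapper.
    // Table, procedure, column and parameter names are hypothetical.
    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using System.Linq;
    using System.Threading.Tasks;
    using Dapper;

    public static class BatchProcessor
    {
        public static void ProcessInBatches(string connectionString)
        {
            List<long> ids;
            using (var connection = new SqlConnection(connectionString))
            {
                ids = connection.Query<long>(
                    "SELECT Id FROM dbo.Messages WHERE Processed = 0 ORDER BY Id").ToList();
            }

            // Split the sorted ids into 4K ranges, remembering lowest/highest per batch.
            var batches = new List<(long Min, long Max)>();
            for (int i = 0; i < ids.Count; i += 4000)
            {
                var chunk = ids.Skip(i).Take(4000).ToList();
                batches.Add((chunk.First(), chunk.Last()));
            }

            Parallel.ForEach(batches, batch =>
            {
                using (var connection = new SqlConnection(connectionString))
                {
                    // Hypothetical stored procedure taking the id range for this batch.
                    var records = connection.Query<dynamic>(
                        "dbo.GetMessagesByIdRange",
                        new { MinId = batch.Min, MaxId = batch.Max },
                        commandType: CommandType.StoredProcedure);

                    // ... process the records and mark them as processed ...
                }
            });
        }
    }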
I'll write and document some other alternatives here along with benefits / caveats.
Parallel processing
I understand that you want to do this in SQL Server, but just as a reference, Oracle implemented this as a keyword with which you can run queries in parallel.
Documentation: https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
SQL Server implements this differently: you have to turn it on explicitly through a more complex keyword, and you have to be on a certain version.
A nice article on this is here: https://www.mssqltips.com/sqlservertip/4939/how-to-force-a-parallel-execution-plan-in-sql-server-2016/
You can combine parallel processing with SQL CLR integration, which would effectively do what you're trying to do inside SQL Server, with SQL Server managing the data chunks rather than you managing them in your own threads.
SQL CLR integration
One nice feature that you might look into is executing .net code in SQL server. Documentation here: https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/sql/introduction-to-sql-server-clr-integration
This would basically allow you to run C# code in your SQL Server, saving you the read / process / write round-trip. SQL Server Integration Services has also improved in this area - documentation here: https://learn.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-2017
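As a very rough illustration, a CLR stored procedure has roughly this shape; the query and the per-record processing are placeholders for whatever your work actually is:

    // Minimal sketch of a SQL CLR stored procedure; the query and the
    // processing logic are placeholders.
    using System.Data.SqlClient;
    using Microsoft.SqlServer.Server;

    public class Procedures
    {
        [SqlProcedure]
        public static void ProcessPendingRecords()
        {
            // "context connection=true" runs against the hosting SQL Server instance.
            using (var connection = new SqlConnection("context connection=true"))
            {
                connection.Open();
                using (var command = new SqlCommand(
                    "SELECT TOP (4000) Id FROM dbo.Messages WHERE Processed = 0", connection))
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // ... process the record in .NET, then mark it as processed ...
                    }
                }
            }
        }
    }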
Reviewing the QoS / getting at the logs when something goes wrong is unfortunately not as easy as it is when you handle this in a worker job, though.
Use a single thread (if you're reading from an external source)
Parallelism is only good for you if certain conditions are met. Below is from Oracle's documentation but it also applies to MSSQL: https://docs.oracle.com/cd/B19306_01/server.102/b14223/usingpe.htm#DWHSG024
Parallel execution improves processing for:
Queries requiring large table scans, joins, or partitioned index scans
Creation of large indexes
Creation of large tables (including materialized views)
Bulk inserts, updates, merges, and deletes
There are also setup / environment requirements.
Parallel execution benefits systems with all of the following characteristics:
Symmetric multiprocessors (SMPs), clusters, or massively parallel systems
Sufficient I/O bandwidth
Underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%)
Sufficient memory to support additional memory-intensive processes, such as sorts, hashing, and I/O buffers
There are other constraints too. When you use multiple threads to do the operation you propose, if one of those threads gets killed / fails to do something / throws an exception, you will absolutely need to handle that - in a way that keeps track of the last index you've processed, so you can retry the rest of the records.
With a single thread that becomes way simpler.
Conclusion
Assuming that the DB is modeled correctly and can't be optimized much further, I'd say the simplest solution, a single thread, is the best one. It is easier to log and track errors and easier to implement retry logic, and I'd say those benefits far outweigh what you would gain from parallel processing. You might look into parallel processing for the batch updates you'll do to the DB, but unless you're going to have a CLR DLL in SQL Server whose methods you invoke in parallel, I don't see a compelling benefit. Your system will also have to behave a certain way at the times you run the parallel query for it to be more efficient.
You can of course design your worker role to be async and not block on each record's processing, so you'll still be multi-threaded but your querying will happen in a single thread, roughly as sketched below.
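A minimal sketch of that shape, with a single reading thread feeding a bounded queue and a few worker tasks doing the processing; the batch source and the processing delegate are placeholders:

    // Sketch: the querying stays single-threaded (the producer), while several
    // consumers process batches concurrently. The bounded capacity gives back-pressure.
    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;

    public static class BatchPipeline
    {
        public static void Run<T>(IEnumerable<IReadOnlyList<T>> batches,
                                  Action<IReadOnlyList<T>> process,
                                  int consumerCount = 4)
        {
            using (var queue = new BlockingCollection<IReadOnlyList<T>>(boundedCapacity: 8))
            {
                var consumers = Enumerable.Range(0, consumerCount)
                    .Select(_ => Task.Run(() =>
                    {
                        foreach (var batch in queue.GetConsumingEnumerable())
                            process(batch);            // processing only, no DB reads here
                    }))
                    .ToArray();

                foreach (var batch in batches)          // a single thread does all the querying
                    queue.Add(batch);

                queue.CompleteAdding();                 // let consumers drain and finish
                Task.WaitAll(consumers);
            }
        }
    }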
Edit to conclusion
After talking to my colleague about this today, it's worth adding that even with the single-thread approach you'd have to be able to recover from failure, so in principle the requirement of recovery / graceful failure and remembering what you processed doesn't change between multiple threads and a single thread. How you recover does change, though, given that you'd have to write more complex code to track your multiple threads and their states.

Is it feasible to improve performance of SQL server with caching?

What is the most common and easiest-to-implement solution to improve speed for a SQL Server 2008 R2 database & .NET 3.5 application?
We have an application with the following attributes:
- small number of simultaneous clients (~200 at MOST).
- complex math operations on SQL server side
- we are imitating something akin to Oracle's row-level security (thus using TVFs and stored procs instead of querying tables directly)
- The main problem is that users perform a high volume of updates/inserts/deletes/calculations, and they freak out because they need to wait for pages to reload while those actions are done.
The questions I need clarification on are as follows:
What is faster: returning the whole dataset from SQL Server and performing the math on the C# side, or performing the calculations on the SQL side (thus not returning extra columns)? Or is it only hardware dependent?
Will caching improve performance (for example, if we add a Redis cache)? Or are caching solutions only feasible for a large number of clients?
Is it bad practice to pre-calculate some of the data and store it somewhere in the database (so that when the user requests it, it is already calculated)? Or is this what caching is supposed to do? If this is not bad practice, how do you configure SQL Server to do the calculations when resources are available?
How can caching improve performance if it still needs to go to the database to see whether any records were updated?
General suggestions and comments are also welcome.
Let's separate the answer into two parts: the performance of your query execution, and caching to improve that performance.
I believe you should start by addressing the load on your SQL Server and optimizing the processes running on it as far as possible; this should remove most of the need to implement any caching.
From your question it appears that you have a system that is used both for transactional processing and for aggregations/calculations; this will often result in conflicts when the two tasks lock each other's resources. A long query performing math operations may lock/hold an object required by the UI.
Optimizing these systems to work side-by-side and improving the query efficiency is the key for having increased performance.
To start, I'll use your questions. What is faster? It depends on the actual aggregation you are performing: if you're dealing with set operations, i.e. SUM/AVG of a column, keep it in SQL; on the other hand, if you find yourself using a cursor in the procedure, move it to C#. Cursors will kill your performance!
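To make the distinction concrete, a set-based aggregate like this belongs in a single SQL round-trip rather than in a client-side loop over rows; the table and column names are made up:

    // Set-based work stays in SQL: one round-trip, one scalar comes back.
    // Table and column names are made up for illustration.
    decimal total;
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "SELECT SUM(Amount) FROM dbo.Orders WHERE CustomerId = @customerId", connection))
    {
        command.Parameters.AddWithValue("@customerId", customerId);
        connection.Open();
        total = (decimal)command.ExecuteScalar();
    }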
You asked if it's bad practice to aggregate data off to the side and later query that repository - this is best practice :). You'll end up with one database catering to the transactional, high-paced clients and another database storing the aggregated info, which will be quickly and easily available for your other needs. Taking it to the next step will result in you having a data warehouse, so this is definitely where you want to be heading when you have a lot of information and calculations.
Lastly, caching: this is tricky and really depends on the specific nature of your needs. I'd say take the above approach, spend the time improving the processes, and I expect the end result will make caching redundant.
One of your best friends for this task is SQL Profiler: run a trace on stmt:completed to see which statements have the highest duration/IO/CPU and go after those first.
Good luck!

Is MSDTC a big resource drain?

So, is MSDTC a big resource drain on a system (server & application)?
In the past, I have written several large scale web applications that rely on MSDTC for database transactions. I have never investigated how much of a drain that would be on the server.
Currently, I am working with the CSLA framework, and CSLA doesn't rely on MSDTC for transactions. It basically reuses the same connection object and executes all database commands within a TransactionScope object.
I guess I am looking for some arguments either way as to using MSDTC or not.
MSDTC is used for distributed transactions. To simplify, using a TransactionScope can implicitly use MSDTC if the transaction needs to be distributed, i.e. if the TransactionScope surrounds a piece of code that involves more than one resource. This is called escalation, and most of the time it happens automatically.
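A sketch of the kind of code that typically triggers escalation - two different connections enlisted in one scope; the connection strings, tables and SQL are placeholders:

    // Sketch (System.Transactions + System.Data.SqlClient): opening two separate
    // connections inside one TransactionScope typically escalates to MSDTC.
    // Connection strings, tables and SQL are placeholders.
    using (var scope = new TransactionScope())
    {
        using (var orders = new SqlConnection(ordersConnectionString))
        {
            orders.Open();
            new SqlCommand("UPDATE dbo.Orders SET Status = 'Shipped' WHERE Id = 1", orders)
                .ExecuteNonQuery();
        }

        using (var audit = new SqlConnection(auditConnectionString))
        {
            audit.Open();   // second resource manager enlists; the transaction escalates
            new SqlCommand("INSERT INTO dbo.Audit (Note) VALUES ('shipped order 1')", audit)
                .ExecuteNonQuery();
        }

        scope.Complete();   // both commands commit together, coordinated by MSDTC
    }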
So, yes, it takes some resources, but if you do need ACID transactions across multiple systems ("resource managers" such as SQL Server, Oracle, or MSMQ) on a Windows OS, you don't have much choice but to use MSDTC.
One thing that can be done for performance when configuring MSDTC is to ensure there is only one coordinator for a pool of distributed resources, avoiding MSDTC-to-MSDTC communication. Configuration is usually the biggest problem you'll face with MSDTC. Example here: http://yrushka.com/index.php/security/configure-msdtc-for-distributed-transactions/
Compared to your application, probably not. I use it on my current project and I have never noticed it affecting CPU resources. What you do have to be careful about is latency: if there are multiple servers involved in your transaction, that will be a much bigger problem than CPU.
Another way to look at it: it's not going to be CPU bound; execution will be bound by IO. Of course this assumes you don't do a lot of computation in your transaction, but that's not DTC's fault, now is it?
I have used MSDTC for transactions enlisting multiple partners (one or more DB servers and one or more servers using MSMQ). The drain of using MSDTC in terms of performance vs. using transactions in general isn't a big deal. We were processing more than 40 million messages a day through MSMQ, and every single one had a DB action as well (some were just cached reads, though not many).
The biggest problem is that MSDTC is a huge pain in the butt when you are crossing zones (e.g. DMZ to intranet). Getting things to enlist when they are in different zones is possible but takes tweaking. DTC also has a large number of configuration options if you are interested.

SQLite .Net Performance

I am trying to use SQLite in my application as a sort of cache. I say sort of because items never expire from my cache and I am not storing anything else. I simply need to use the cache to store all the ids I have processed before; I don't want to process anything twice.
I am entering items into the cache at 10,000 messages/sec for a total of 150 million messages. My table is pretty simple. It only has one text column, which stores the ids. I was doing this all in memory using a dictionary; however, I am processing millions of messages and, although it is fast that way, I ran out of memory after some time.
I have researched SQLite and performance, and I understand that configuration is key; however, I am still getting horrible performance on inserts (I haven't tried selects yet). I am not able to keep up with even 5,000 inserts/sec. Maybe this is as good as it gets.
My connection string is as below:
Data Source=filename;Version=3;Count Changes=off;Journal Mode=off;
Pooling=true;Cache Size=10000;Page Size=4096;Synchronous=off
Thanks for any help you can provide!
If you are doing lots of inserts or updates at once, put them in a transaction.
Also, if you are executing essentially the same SQL each time, use a parameterized statement.
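A sketch of both points combined, using System.Data.SQLite; the table and column names are just examples:

    // Sketch (System.Data.SQLite): one transaction around the whole batch and a
    // single reusable parameterized command. Table/column names are examples.
    using (var connection = new SQLiteConnection(connectionString))
    {
        connection.Open();
        using (var transaction = connection.BeginTransaction())
        using (var command = connection.CreateCommand())
        {
            command.Transaction = transaction;
            command.CommandText = "INSERT INTO ProcessedIds (Id) VALUES (@id)";
            var idParameter = command.Parameters.Add("@id", DbType.Text);

            foreach (var id in ids)
            {
                idParameter.Value = id;   // reuse the prepared statement, new value
                command.ExecuteNonQuery();
            }

            transaction.Commit();         // one commit for the whole batch, not one per row
        }
    }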
Have you looked at the SQLite Optimization FAQ? (It's a bit old.)
SQLite performance tuning and optimization on embedded systems
If you have many threads writing to the same database, then you're going to run into concurrency problems with that many transactions per second. SQLite always locks the whole database for writes so only one write transaction can be processed at a time.
An alternative is Oracle Berkeley DB with SQLite. The latest version of Berkeley DB includes a SQLite front end that has page-level locking instead of database-level locking. This provides a much higher number of transactions per second when there is a high concurrency requirement.
http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html
It includes the same SQLite.NET provider and is supposed to be a drop-in replacement.
Since your requirements are so specific, you may be better off with something more dedicated, like memcached. This will provide a very high-throughput caching implementation that will be a lot more memory efficient than a simple hashtable.
Is there a port of memcache to .Net?

Linq to SQL connections

I'm using Linq to SQL for a fairly complicated site, and after go live we've had a number of database timeouts. The first thing I noticed was there are a fairly large number of connections to the database.
Coming from an ADO.net background we used to code it so that any site would only use one or two pooled connections, and this resulted in acceptable performance even with a fair few concurrent users.
So my question is: was this old way of doing it flawed, or is there a way to do it with LINQ? It seems our performance issues are being caused by so many connections to the DB, but if this were an issue I'd have thought it would be mentioned in all the LINQ tutorials.
Any suggestions?
I'm guessing that you are keeping DataContexts around and not calling Dispose on them when done (or leaving them around, at least).
Rather, you should initialize your DataContext, perform your operation, and then dispose of it when done. You shouldn't hold a reference to it between operations.
Preferably, you would use the using statement to handle the call to Dispose (via IDisposable) for you.
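Something along these lines; the DataContext type, table and query are illustrative names:

    // Sketch: one DataContext per unit of work, disposed by "using".
    // MyDataContext, Customers and IsActive are illustrative names.
    List<Customer> activeCustomers;
    using (var context = new MyDataContext(connectionString))
    {
        activeCustomers = context.Customers
                                 .Where(c => c.IsActive)
                                 .ToList();   // materialize before the context is disposed
    }
    // The underlying connection is released back to the pool here;
    // the list can be used freely afterwards.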
Regarding connection pooling, the SqlClient pools connections by default, so unless you explicitly turn it off, you should be taking advantage of it already. Of course, if you aren't releasing connections you are using, then pooling will only take you so far.
