Is it feasible to improve performance of SQL server with caching? - c#

What is the most common and easy-to-implement way to improve the speed of a SQL Server 2008 R2 database with a .NET 3.5 application?
We have an application with the following attributes:
- small number of simultaneous clients (~200 at MOST).
- complex math operations on SQL server side
- we imitate something similar to Oracle's row-level security (thus using TVFs and stored procedures instead of querying tables directly)
- The main problem is that users perform a high number of updates/inserts/deletes/calculations, and they freak out because they have to wait for pages to reload while those actions complete.
The questions I need clarification on are as follows:
What is faster: returning the whole dataset from SQL Server and performing the math functions on the C# side, or performing the calculations on the SQL side (thus not returning extra columns)? Or is it only hardware dependent?
Will caching improve performance (for example, if we add a Redis cache)? Or are caching solutions only feasible for a large number of clients?
Is it bad practice to pre-calculate some of the data and store it somewhere in the database (so that when a user requests it, it is already calculated)? Or is this what caching is supposed to do? If it is not bad practice, how do you configure SQL Server to do the calculations when resources are available?
How can caching improve performance if it still needs to go to the database to see whether any records were updated?
General suggestions and comments are also welcome.

Let's separate the answer into two parts: the performance of your query execution, and caching to improve that performance.
I believe you should start by addressing the load on your SQL Server and optimizing the processes running on it as far as possible; this should remove most of the need for caching.
From your question it appears that your system is used both for transactional processing and for aggregations/calculations. This often results in conflicts when the two workloads lock each other's resources: a long query performing math operations may lock or hold an object required by the UI.
Optimizing these workloads to run side by side and improving query efficiency is the key to better performance.
To start, I'll use your questions. What is faster? It depends on the actual aggregation you are performing. If you're dealing with set operations, i.e. SUM/AVG of a column, keep them in SQL. On the other hand, if you find yourself using a cursor in the procedure, move it to C#. Cursors will kill your performance!
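To make the set-based versus row-by-row distinction concrete, here is a minimal sketch. It uses Python with an in-memory SQLite database purely for brevity; the `orders` table and its columns are hypothetical, and the same contrast applies to T-SQL and a C# client.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (1, 15.0), (2, 7.5)],
)

# Set-based: the engine aggregates and returns one row per customer.
set_based = dict(conn.execute(
    "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
))

# Row-by-row (what a cursor does): every row is fetched and handled one at a time.
row_by_row = {}
for customer_id, amount in conn.execute("SELECT customer_id, amount FROM orders"):
    row_by_row[customer_id] = row_by_row.get(customer_id, 0.0) + amount
```

Both variants produce the same totals, but the set-based one returns two rows instead of three and leaves the work to the engine. The gap grows with the row count.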
You asked if it's bad practice to aggregate data on the side and later query that repository: this is the best practice :). You'll end up with one database serving the transactional, fast-paced clients and another database storing the aggregated information, which is then quickly and easily available for your other needs. Taking it a step further gives you a data warehouse, so this is definitely where you want to be heading when you have a lot of information and calculations.
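A minimal sketch of the pre-aggregation idea, assuming a hypothetical `sales` table and a `daily_totals` summary table (both names are made up for illustration). It is written in Python against in-memory SQLite only to keep it self-contained. On SQL Server the refresh would more likely be a stored procedure run on a schedule, for example by a SQL Server Agent job during quiet hours.

```python
import sqlite3

def refresh_daily_totals(conn):
    """Rebuild the summary table from the transactional table in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM daily_totals")
        conn.execute(
            "INSERT INTO daily_totals (day, total) "
            "SELECT day, SUM(amount) FROM sales GROUP BY day"
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")
conn.execute("CREATE TABLE daily_totals (day TEXT PRIMARY KEY, total REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01-01", 5.0), ("2024-01-01", 7.0), ("2024-01-02", 1.0)],
)

refresh_daily_totals(conn)
# Readers query the small summary table instead of re-aggregating sales.
totals = dict(conn.execute("SELECT day, total FROM daily_totals"))
```

The UI then reads `daily_totals` and never waits for the aggregation itself. The trade-off is that the numbers are only as fresh as the last refresh.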
Lastly, caching. This is tricky and really depends on the specific nature of your needs. I'd say take the above approach and spend the time improving the processes; I expect the end result will make caching redundant.
One of your best friends for this task is SQL Profiler: run a trace on the statement-completed events (SQL:StmtCompleted, SP:StmtCompleted) to see which statements have the highest duration/IO/CPU, and pick on them first.
Good luck!

Related

Multithreaded application with database read - each thread unique records

I have a .NET application which reads about a million records from a database table each time (every 5 minutes), does some processing, and updates the table, marking the records as processed.
Currently the application runs in a single thread: it takes the top 4K records from the DB table, processes them, updates the records, and takes the next batch.
I'm using dapper with stored procedures. I'm using 4K records for retrieval to avoid DB table locks.
What would be the best way to retrieve records in multiple threads while ensuring that each thread gets a new set of 4K records?
My current idea is that I would first retrieve just the ids of the 1M records, sort the ids in ascending order, and split them into 4K batches, remembering the lowest and highest id in each batch.
Then in each thread I would call another stored procedure which retrieves the full records for the given lowest and highest ids, process them, and so on.
Is there any better pattern I'm not aware of?
I find this problem interesting partly because I'm attempting to do something similar in principle but also because I haven't seen a super intuitive industry standard solution to it. Yet.
What you are proposing to do would work if you write your SQL query correctly.
Using ROW_NUMBER / BETWEEN it should be achievable.
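The batching idea from the question can be sketched as follows. This is an illustrative helper in Python, not the asker's code: `id_ranges` is a hypothetical name, and the sketch assumes the ids are already sorted in ascending order. Each worker would then run a query along the lines of `WHERE id BETWEEN @low AND @high` for its own range.

```python
def id_ranges(sorted_ids, batch_size):
    """Yield (lowest_id, highest_id) for each consecutive batch of sorted ids."""
    for start in range(0, len(sorted_ids), batch_size):
        batch = sorted_ids[start:start + batch_size]
        yield batch[0], batch[-1]

# Ids may have gaps, so ranges are derived from the actual ids, not from arithmetic.
ranges = list(id_ranges([1, 2, 5, 8, 9, 12, 20], batch_size=3))
# ranges == [(1, 5), (8, 12), (20, 20)]
```

Because the ranges do not overlap, no two workers can pick up the same record, provided rows inserted after the id snapshot are left for the next run.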
I'll write and document some other alternatives here along with benefits / caveats.
Parallel processing
I understand that you want to do this in SQL Server, but just as a reference, Oracle implements this as a keyword that lets you run queries in parallel.
Documentation: https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
SQL Server implements this differently: you have to turn it on explicitly through a more complex keyword, and you have to be on a certain version.
A nice article on this is here: https://www.mssqltips.com/sqlservertip/4939/how-to-force-a-parallel-execution-plan-in-sql-server-2016/
You can combine parallel processing with SQL CLR integration, which would effectively do what you're trying to do inside SQL Server, with SQL Server managing the data chunks rather than your own threads.
SQL CLR integration
One nice feature that you might look into is executing .NET code in SQL Server. Documentation here: https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/sql/introduction-to-sql-server-clr-integration
This would basically allow you to run C# code in your SQL Server, saving you the read / process / write round trip. They have improved the continuous integration around this as well; documentation here: https://learn.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-2017
Unfortunately, reviewing the QoS or getting at the logs when something goes wrong is not as easy as it is in a worker job.
Use a single thread (if you're reading from an external source)
Parallelism is only good for you if certain conditions are met. Below is from Oracle's documentation but it also applies to MSSQL: https://docs.oracle.com/cd/B19306_01/server.102/b14223/usingpe.htm#DWHSG024
Parallel execution improves processing for:
- Queries requiring large table scans, joins, or partitioned index scans
- Creation of large indexes
- Creation of large tables (including materialized views)
- Bulk inserts, updates, merges, and deletes
There are also setup / environment requirements
Parallel execution benefits systems with all of the following characteristics:
- Symmetric multiprocessors (SMPs), clusters, or massively parallel systems
- Sufficient I/O bandwidth
- Underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%)
- Sufficient memory to support additional memory-intensive processes, such as sorts, hashing, and I/O buffers
There are other constraints. When you use multiple threads for the operation you propose, if one of those threads gets killed, fails, or throws an exception, you will absolutely need to handle that by keeping track of the last index you processed, so that you can retry the rest of the records.
With a single thread that becomes way simpler.
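A minimal single-threaded sketch of "remember the last index you processed", written in Python for brevity. The names `process_in_batches`, `handle_batch`, and the `checkpoint` dictionary are hypothetical. A real worker would persist the checkpoint (for example in a small table) rather than keep it in memory.

```python
def process_in_batches(ids, batch_size, handle_batch, checkpoint):
    """Process ascending ids in batches, recording the last id finished after each batch."""
    last_done = checkpoint.get("last_id")
    # Skip everything that a previous (possibly failed) run already finished.
    pending = [i for i in ids if last_done is None or i > last_done]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        handle_batch(batch)  # may raise; the checkpoint then stays at the previous batch
        checkpoint["last_id"] = batch[-1]

checkpoint = {}
processed = []
process_in_batches([1, 2, 3, 4, 5], 2, processed.extend, checkpoint)
```

If `handle_batch` throws, calling `process_in_batches` again with the same checkpoint resumes at the failed batch. With several threads you would need one such checkpoint per thread, or per id range, which is where the extra complexity comes from.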
Conclusion
Assuming that the DB is modeled correctly and cannot be optimized any further, I'd say the simplest solution, a single thread, is the best one. It is easier to log and track errors and easier to implement retry logic, and I'd say those advantages far outweigh the benefits you would see from parallel processing. You might look into parallel processing for the batch updates you make to the DB, but unless you are going to have a CLR DLL in SQL Server whose methods you invoke in parallel, I don't see benefits that outweigh the costs. Your system would also have to behave in a certain way at the times you run the parallel query for it to be more efficient.
You can of course design your worker role to be async so that processing one record does not block the others. You would still be multi-threaded, but your querying would happen in a single thread.
Edit to conclusion
After talking to a colleague about this today, it's worth adding that even with the single-thread approach you have to be able to recover from failure. So in principle, the requirement for recovery, graceful failure, and remembering what you processed is the same for multiple threads and a single thread. How you recover does change, though, since you have to write more complex code to track multiple threads and their states.

Loading multiple large ADO.NET DataTables/DataReaders - Performance improvements

I need to load the results of multiple SQL statements from SQL Server into DataTables. Most of the statements return some 10.000 to 100.000 records and each takes up to a few seconds to load.
My guess is that this is simply due to the amount of data that needs to be shoved around. The statements themselves don't take much time to process.
So I tried to use Parallel.For() to load the data in parallel, hoping that the overall processing time would decrease. I do get a 10% performance increase, but that is not enough. A reason might be that my machine is only a dual core, thus limiting the benefit here. The server on which the program will be deployed has 16 cores though.
My question is, how could I improve the performance further? Would the use of Asynchronous Data Service Queries (BeginExecute, etc.) be a better solution than PLINQ? Or maybe some other approach?
The SQL Server is running on the same machine. This is also the case on the deployment server.
EDIT:
I've run some tests with using a DataReader instead of a DataTable. This already decreased the load times by about 50%. Great! Still I am wondering whether parallel processing with BeginExecute would improve the overall load time if a multiprocessor machine is used. Does anybody have experience with this? Thanks for any help on this!
UPDATE:
I found that about half of the loading time was consumed by processing the SQL statement. In SQL Server Management Studio the statements took only a fraction of the time, but somehow they take much longer through ADO.NET. So by using DataReaders instead of loading DataTables and adapting the SQL statements I've come down to about 25% of the initial loading time. Loading the DataReaders in parallel threads with Parallel.For() does not make an improvement here. So for now I am happy with the result and will leave it at that. Maybe when we update to .NET 4.5 I'll give asynchronous DataReader loading a try.
"My guess is that this is simply due to the amount of data that needs to be shoved around."
No, it is due to using a SLOW framework. I am pulling nearly a million rows into a dictionary in less than 5 seconds in one of my apps. DataTables are SLOW.
You have to change the nature of the problem. Let's be honest, who needs to view 10.000 to 100.000 records per request? I think no one.
You need to consider paging, and in your case paging should be done on SQL Server. To make this clear, let's say you have a stored procedure named "GetRecords". Modify this stored procedure to accept a page parameter and return only the data relevant for that page (let's say 100 records) plus the total page count. Inside the app, just show these 100 records (they will fly) and keep track of the selected page index.
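The paging suggestion above could look roughly like this. It is a sketch in Python against in-memory SQLite, with a hypothetical `records` table and a `get_records` function standing in for the "GetRecords" stored procedure. SQLite's `LIMIT ... OFFSET` is used here; on SQL Server the equivalent would be `ROW_NUMBER()` or, from SQL Server 2012 on, `OFFSET ... FETCH`.

```python
import math
import sqlite3

def get_records(conn, page, page_size=100):
    """Return (rows for the 1-based page, total page count)."""
    total = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    rows = conn.execute(
        "SELECT id, name FROM records ORDER BY id LIMIT ? OFFSET ?",
        (page_size, (page - 1) * page_size),
    ).fetchall()
    return rows, math.ceil(total / page_size)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO records (name) VALUES (?)",
    [(f"row{i}",) for i in range(250)],
)

rows, page_count = get_records(conn, page=3)  # last, partially filled page
```

Only one page of rows ever crosses the wire, so the load time no longer depends on the size of the full result set.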
Hope this helps, best regards!
Do you often have to load these requests? If so, why not use a distributed cache?

SQLite .Net Performance

I am trying to use sqlite in my application as a sort of cache. I say sort of because items never expire from my cache and I am not storing anything. I simply need to use the cache to store all ids I processed before. I don't want to process anything twice.
I am entering items into the cache at 10,000 messages/sec for a total of 150 million messages. My table is pretty simple. It only has one text column which stores the id's. I was doing this all in memory using a dictionary, however, I am processing millions of messages and, although it is fast that way, I ran out of memory after some time.
I have researched sqlite and performance and I understand that configuration is key, however, I am still getting horrible performance on inserts (I haven't tried selects yet). I am not able to keep up with even 5000 inserts/sec. Maybe this is as good as it gets.
My connection string is as below:
Data Source=filename;Version=3;Count Changes=off;Journal Mode=off;
Pooling=true;Cache Size=10000;Page Size=4096;Synchronous=off
Thanks for any help you can provide!
If you are doing lots of inserts or updates at once, put them in a transaction.
Also, if you are executing essentially the same SQL each time, use a parameterized statement.
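Both suggestions can be combined in a few lines. The sketch below uses Python's built-in `sqlite3` module rather than the SQLite .NET provider, and the `processed` table, its `PRIMARY KEY`, and the `INSERT OR IGNORE` are assumptions added so that duplicate ids are skipped. The point is the single transaction around the batch and the single parameterized statement reused for every row.

```python
import sqlite3

def insert_ids(conn, ids):
    """Insert a batch of ids in one transaction using one parameterized statement."""
    with conn:  # one commit for the whole batch; rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO processed (id) VALUES (?)",
            ((i,) for i in ids),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (id TEXT PRIMARY KEY)")

insert_ids(conn, ["a", "b", "a"])  # the duplicate "a" is ignored
count = conn.execute("SELECT COUNT(*) FROM processed").fetchone()[0]
```

Without the surrounding transaction, every insert is committed separately, and that per-insert commit is usually what limits the insert rate.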
Have you looked at the SQLite Optimization FAQ (a bit old)?
SQLite performance tuning and optimization on embedded systems
If you have many threads writing to the same database, then you're going to run into concurrency problems with that many transactions per second. SQLite always locks the whole database for writes so only one write transaction can be processed at a time.
An alternative is Oracle Berkeley DB with SQLite. The latest version of Berkeley DB includes a SQLite front end that has a page-level locking mechanism instead of database-level locking. This provides a much higher number of transactions per second when there is a high concurrency requirement.
http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html
It includes the same SQLite.NET provider and is supposed to be a drop-in replacement.
Since your requirements are so specific, you may be better off with something more dedicated, like memcached. This will provide a very high throughput caching implementation that will be a lot more memory efficient than a simple hashtable.
Is there a port of memcache to .Net?

How many SQL queries per HTTP request is optimal?

I know the answer to this question for the most part is "It Depends", however I wanted to see if anyone had some pointers.
We execute queries on each request in ASP.NET MVC. On each request we need to get user rights information and various data for the views that we are displaying. How many is too many? I know I should be conscious of the number of queries I am executing. I would assume that if they are small, optimized queries, half a dozen should be okay. Am I right?
What do you think?
Premature optimization is the root of all evil :)
First create your application. If it is sluggish, determine the cause and optimize that part. Sure, reducing the number of queries will save you time, but so will optimizing the queries that you have to run.
You could spend a whole day shaving 50% off a query that only took 2 milliseconds to begin with, or spend 2 hours removing some INNER JOINs that made another query take 10 seconds. Analyse what's wrong before you start optimising.
The optimal amount would be zero.
Given that this is most likely not achievable, the only reasonable thing to say about it is: "As few as possible".
- Simplify your site design until it's as simple as possible while still meeting your client's requirements.
- Cache information that can be cached.
- Pre-load information into the cache outside the request, where you can.
- Ask only for the information that you need in that request.
- If you need to make a lot of independent queries for a single request, parallelise the loading as much as possible.
What you're left with is the 'optimal' amount for that site.
If that's too slow, you need to review the above again.
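As an illustration of parallelising independent loads for a single request, here is a small Python sketch. `load_all` is a hypothetical helper and the lambdas are stand-ins for real query functions. In ASP.NET the same shape would be expressed with tasks or async calls.

```python
from concurrent.futures import ThreadPoolExecutor

def load_all(loaders):
    """Run independent loader callables concurrently; return results keyed by name."""
    with ThreadPoolExecutor(max_workers=len(loaders)) as pool:
        futures = {name: pool.submit(fn) for name, fn in loaders.items()}
        # result() re-raises any exception thrown inside a loader.
        return {name: future.result() for name, future in futures.items()}

results = load_all({
    "rights": lambda: ["read", "write"],  # stand-in for a user-rights query
    "menu": lambda: ["home", "reports"],  # stand-in for a view-data query
})
```

The request then takes roughly as long as its slowest query rather than the sum of all of them. This only holds while the queries are truly independent and the database has capacity to spare.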
User rights information may be able to be cached, as may other common information you display everywhere.
You can probably get away with caching more than the requirements necessitate. For instance - you can probably cache 'live' information such as product stock levels, and the user's shopping cart. Use SQL Change Notifications to allow you to expire and repopulate the cache in the background.
As few as possible.
Use caching for lookups. Also store some light-weight data (such as permissions) in the session.
Q: Do you have a performance problem related to database queries?
Yes? A: Fewer than you have now.
No? A: The exact same number you have now.
If it ain't broke, don't fix it.
While refactoring and optimizing to save a few milliseconds is a fun and intellectually rewarding way for programmers to spend time, it is often a waste of time.
Also, changing your code to combine database requests could come at the cost of simplicity and maintainability in your code. That is, while it may be technically possible to combine several queries into one, that could require removing the conceptual isolation of business objects in your code, which is bad.
You can make as many queries as you want, until your site gets too slow.
As many as necessary, but no more.
In other words, the performance bottlenecks will not come from the number of queries, but what you do in the queries and how you deal with the data (e.g. caching a huge yet static resultset might help).
Along with all the other recommendations of making fewer trips, it also depends on how much data is retrieved on each round trip. If it is just a few bytes, then it can probably be chatty and performance will not suffer. However, if each trip returns hundreds of KB, then performance will suffer much sooner.
You have answered your own question "It depends".
That said, trying to pin down an optimal number of queries per HTTP request is not a sensible exercise. If your SQL Server has really good hardware behind it, then you can run a good number of queries in less time and still have a really low turnaround time for the HTTP request. So basically, "it depends", as you rightly said.
As the comments above indicate, some caching is likely appropriate for your situation. And like your question suggests, the real answer is "it depends." Generally, the fewer the queries, the better since each query has a cost associated with it. You should examine your data model and your application's requirements to determine what is appropriate.
For example, if a user's rights are likely to be static during the user's session, it makes sense to cache the rights data so fewer queries are required. If aspects of the data displayed in your View are also static for a user's session, these could also be cached.
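A minimal sketch of caching per-user rights with a time-to-live, in Python. `TtlCache` and `load_rights` are hypothetical names, and the 300-second lifetime is an arbitrary assumption. In ASP.NET you would more likely use the built-in cache or session rather than hand-rolling this.

```python
import time

class TtlCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no database hit
        value = loader(key)  # missing or expired: load and remember
        self._entries[key] = (now + self._ttl, value)
        return value

calls = []

def load_rights(user_id):
    calls.append(user_id)  # records each simulated database hit
    return {"can_edit": user_id == 1}

cache = TtlCache(ttl_seconds=300)
first = cache.get_or_load(1, load_rights)
second = cache.get_or_load(1, load_rights)  # served from the cache
```

The second lookup does not call `load_rights` again. The cost is that a change to a user's rights can take up to the TTL to become visible, unless the entry is evicted explicitly.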

Does normalization really hurt performance in high traffic sites?

I am designing a database and I would like to normalize it. In one query I will be joining about 30-40 tables. Will this hurt the website performance if it ever becomes extremely popular? This will be the main query and it will be called 50% of the time. The other queries will join about two tables.
I have a choice right now to normalize or not to normalize but if the normalization becomes a problem in the future I may have to rewrite 40% of the software and it may take me a long time. Does normalization really hurt in this case? Should I denormalize now while I have the time?
I quote: "normalize for correctness, denormalize for speed - and only when necessary"
I refer you to: In terms of databases, is "Normalize for correctness, denormalize for performance" a right mantra?
HTH.
When performance is a concern, there are usually better alternatives than denormalization:
- Creating appropriate indexes and statistics on the involved tables
- Caching
- Materialized views (Indexed views in MS SQL Server)
- Having a denormalized copy of your tables (used exclusively for the queries that need them), in addition to the normalized tables that are used in most cases (requires writing synchronization code, that could run either as a trigger or a scheduled job depending on the data accuracy you need)
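The last option, a denormalized copy kept in sync by a trigger, can be sketched like this. The example uses Python with in-memory SQLite for brevity, and all table and trigger names are hypothetical. It only handles inserts; a real setup would need matching triggers for updates and deletes, or a scheduled job instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);

-- Denormalized read copy: the author name is stored next to each post.
CREATE TABLE posts_flat (post_id INTEGER PRIMARY KEY, title TEXT, author_name TEXT);

-- Synchronization code: copy each new post into the flat table.
CREATE TRIGGER posts_after_insert AFTER INSERT ON posts
BEGIN
    INSERT INTO posts_flat (post_id, title, author_name)
    SELECT NEW.id, NEW.title, a.name FROM authors a WHERE a.id = NEW.author_id;
END;
""")

conn.execute("INSERT INTO authors VALUES (1, 'Ann')")
conn.execute("INSERT INTO posts VALUES (10, 1, 'Hello')")

# The hot query reads the flat table and needs no join.
flat = conn.execute("SELECT post_id, title, author_name FROM posts_flat").fetchall()
```

Writes still go to the normalized tables, so correctness stays there. The flat table exists only to serve the join-heavy query.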
Normalization can hurt performance. However this is no reason to denormalize prematurely.
Start with full normalization and then you'll see if you have any performance problems. At the rate you are describing (1000 updates/inserts per day) I don't think you'll run into problems unless the tables are huge.
And even if you do, there are tons of database optimization options (indexes, prepared stored procedures, materialized views, ...) that you can use.
Maybe I am missing something here. But if your architecture requires you to join 30 to 40 tables in a single query, and that query is the main use of your site, then you have larger problems.
I agree with the others: don't prematurely optimize your site. However, you should optimize your architecture to account for your main use case. A 40-table join for a query run over 50% of the time is not optimized, IMO.
Don't make early optimizations. Denormalization isn't the only way to speed up a website. Your caching strategy is also quite important and if that query of 30-40 tables is of fairly static data, caching the results may prove to be a better optimization.
Also, take into account the number of writes to the number of reads. If you are doing approximately 10 reads for every insert or update, you could say that data is fairly static, hence you should cache it for some period of time.
If you end up denormalizing your schema, your writes will also become more expensive and potentially slow things down as well.
Really analyze your problem before making too many optimizations, and wait to see where the bottlenecks in your system really are; you might be surprised by what you should optimize in the first place.
