Loading multiple large ADO.NET DataTables/DataReaders - Performance improvements - c#

I need to load the results of multiple SQL statements from SQL Server into DataTables. Most of the statements return some 10,000 to 100,000 records, and each takes up to a few seconds to load.
My guess is that this is simply due to the amount of data that needs to be shoved around. The statements themselves don't take much time to process.
So I tried to use Parallel.For() to load the data in parallel, hoping that the overall processing time would decrease. I do get a 10% performance increase, but that is not enough. A reason might be that my machine is only a dual core, thus limiting the benefit here. The server on which the program will be deployed has 16 cores though.
My question is: how could I improve the performance further? Would the use of Asynchronous Data Service Queries (BeginExecute, etc.) be a better solution than PLINQ? Or maybe some other approach?
The SQL Server is running on the same machine. This is also the case on the deployment server.
EDIT:
I've run some tests using a DataReader instead of a DataTable. This already decreased the load times by about 50%. Great! Still, I am wondering whether parallel processing with BeginExecute would improve the overall load time on a multiprocessor machine. Does anybody have experience with this? Thanks for any help on this!
UPDATE:
I found that about half of the loading time was consumed by processing the SQL statement. In SQL Server Management Studio the statements took only a fraction of the time, but somehow they take much longer through ADO.NET. So by using DataReaders instead of loading DataTables and adapting the SQL statements, I've come down to about 25% of the initial loading time. Loading the DataReaders in parallel threads with Parallel.For() does not make an improvement here. So for now I am happy with the result and will leave it at that. Maybe when we update to .NET 4.5 I'll give asynchronous DataReader loading a try.

"My guess is that this is simply due to the amount of data that needs to be shoved around."
No, it is due to using a SLOW framework. I am pulling nearly a million rows into a dictionary in less than 5 seconds in one of my apps. DataTables are SLOW.
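As a rough illustration of the reader-into-dictionary approach mentioned above, here is a minimal sketch. It assumes a two-column result set (an integer key and a string value, both hypothetical), and the class and method names are made up for this example. It is written against `IDataReader`, so the `SqlDataReader` returned by `SqlCommand.ExecuteReader()` could be passed in directly.

```csharp
using System.Collections.Generic;
using System.Data;

public static class ReaderLoader
{
    // Reads (int key, string value) rows from any IDataReader into a dictionary.
    // Assumption: column 0 is a non-null int key, column 1 is a string that may be NULL.
    // If a key occurs more than once, the last row wins.
    public static Dictionary<int, string> LoadIntoDictionary(IDataReader reader)
    {
        var result = new Dictionary<int, string>();
        while (reader.Read())
        {
            int key = reader.GetInt32(0);
            string value = reader.IsDBNull(1) ? null : reader.GetString(1);
            result[key] = value;
        }
        return result;
    }
}
```

Reading by ordinal avoids a column-name lookup per row; whether that matters in practice would need to be measured on the real data.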

You have to change the nature of the problem. Let's be honest, who needs to view 10,000 to 100,000 records per request? I think no one.
You should consider paging, and in your case the paging should be done on SQL Server. To make this clear, let's say you have a stored procedure named "GetRecords". Modify this stored procedure to accept a page parameter and return only the data relevant for that specific page (let's say 100 records only) plus the total page count. Inside the app, just show these 100 records (they will fly) and keep track of the selected page index.
Hope this helps, best regards!
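The paging idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the table `dbo.Records`, its columns `Id` and `Name`, the parameter names, and the helper class are all hypothetical, and the query uses `ROW_NUMBER()` (available from SQL Server 2005 onward) in place of the stored procedure from the answer.

```csharp
public static class Paging
{
    // Hypothetical paging query; dbo.Records, Id and Name are placeholder names.
    public const string PageSql = @"
SELECT Id, Name
FROM (
    SELECT Id, Name, ROW_NUMBER() OVER (ORDER BY Id) AS RowNum
    FROM dbo.Records
) AS numbered
WHERE RowNum BETWEEN @FirstRow AND @LastRow
ORDER BY RowNum;";

    // 1-based number of the first row on the given 0-based page.
    public static int FirstRow(int pageIndex, int pageSize)
    {
        return pageIndex * pageSize + 1;
    }

    // 1-based number of the last row on the given 0-based page.
    public static int LastRow(int pageIndex, int pageSize)
    {
        return (pageIndex + 1) * pageSize;
    }

    // Number of pages needed to show totalRows rows (rounds up).
    public static int PageCount(int totalRows, int pageSize)
    {
        return (totalRows + pageSize - 1) / pageSize;
    }
}
```

The two row numbers would be bound as the `@FirstRow` and `@LastRow` parameters of a `SqlCommand`, so only one page of rows crosses the connection per request.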

Do you often have to load these requests? If so, why not use a distributed cache?

Related

Running multiple instances of a program to speed up process?

I have a program that performs a long-running process. It loops through thousands of records, one at a time, and calls a stored proc on each iteration. Would running two instances of a program like this, with one processing half the records and the other processing the other half, speed up the processing?
Here are the scenarios:
1. One instance of the program, running the long-running process.
2. Two instances of the program on the same server, connecting to the same database, each responsible for processing half (50%) of the records.
3. Two instances on different servers, connecting to the same database, each responsible for half (50%) of the records.
Would scenario 2 or 3 run twice as fast as 1? Would there be a difference between 2 and 3? The main bottleneck is the stored proc call that takes around half a second.
Thanks!
This depends on a lot of factors. Also note that threads may be more appropriate than processes. Or maybe not. Again: it depends. But: is this work CPU-bound? Network-bound? Or bound by what the database server can do? Adding concurrency helps with CPU-bound, and when talking to multiple independent resources. Fighting over the same network connection or the same database server is unlikely to improve things - and can make things much worse.
Frankly, from the sound of it your best bet may be to re-work the sproc to work in batches (rather than individual records).
To answer this question properly you need to know what the resource utilization of the database server currently is: can it take extra load? Or simpler: just try it and see.
It really depends what the stored procedure is doing. If the stored procedure is going to be updating the records, and you have a single database instance then there is going to be contention when writing the data back.
The values at play here are:
The time it takes to read the data into your application memory (this also depends on whether you are using client-side or SQL Server-side cursors).
The time it takes to process, or do your application logic.
The time it takes to write an updated item back (assuming the proc updates).
One solution (and this is by no means a perfect solution without knowing the exact requirements), is:
Have X servers read Y records, and process them.
Have those servers write the results back to a dedicated writing server in a serialized fashion to avoid the contention.

Constantly sending SQL queries

I don't know if this is a commonly asked question, but if it is, please don't yell at me! :(
I have a Windows Forms C# program that executes an UPDATE query every 2 seconds using a threading timer.
My question is: is this dangerous? Will this make my computer run much slower? Am I firing up the CPU usage? I'm a pretty concerned guy when it comes to constantly using something every second.
EDIT: It's UPDATE, not INSERT sorry!
This always depends a lot on the size of the operation that is done every 2 seconds; if the operation takes 1.5 seconds to pre-process, execute and post-process, then it will be a problem. If it takes 4ms, probably not. You also need to think about the server; even if we say it takes 4ms, that could be parallelised over 8 cores, so that is 32ms of CPU time, and if you have 2000 users all doing that every 2 seconds, it starts to add up.
But by itself: fine.
And client-side, on a modern multi-core PC, this is probably not even enough to register as the tiniest blip on the graph.
The answer completely depends on the amount of work the update statement is performing. If it is updating millions of rows every two seconds, then it will definitely impact the performance.
However, if you are only updating a handful of rows (up to say, 100,000) in an SQL Server database, then this frequency should be perfectly acceptable.
The manner in which the update is performed is also important: using cursors, linked servers, CLR functions, databases other than SQL Server (e.g. Access), and many, many other factors can all significantly impact the performance.

Query from database or from memory? Which is faster?

I am trying to improve the performance of a Windows Service, developed in C# and .NET 2.0, that processes a great amount of files. I want to process more files per second.
In its process, for each file, the service does a database query to retrieve some parameters of the system.
Those parameters change annually, and I am thinking that I would gain some performance if I loaded those parameters as a singleton and refreshed this singleton periodically. Instead of making a database query for each file being processed, I would get the parameters from memory.
To complete the scenario: I am using Windows Server 2008 R2 64-bit, SQL Server 2008 as the database, and C# with .NET 2.0 as already mentioned.
Am I right in my approach? What would you do?
Thanks!
"Those parameters change annually"
Yes, do cache them in memory. Especially if they are large or complex.
You should take care to invalidate them at the right time once a year, depending how accurate that has to be.
Simply caching them for an hour or even for a few minutes might be a good compromise.
RAM access is definitely faster than any other data access, except for CPU-level memory such as registers and the CPU cache.
Caching would be faster even if you refreshed it every minute, so yes, caching that query result is much faster.
Crossing a network or going to disk is always orders of magnitude slower than in-memory access.
Databases can cache data in memory, so if you can achieve that and you're not crossing a network, the database might be faster, since its data access patterns, indexes, etc. may be faster than your code. But that's the best case; if you need it faster, in-memory caches help.
But be aware that in-memory caches can add complexity and bugs. You have to determine the lifetime of the cached data and how to refresh it, and the more complex it is, the more weird edge-case state bugs you will have. Even though the parameters change annually, you have to handle that cusp.
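A minimal sketch of the time-based caching suggested above, assuming a single cached value that is reloaded once a time-to-live has elapsed. The class name, the nested delegate, and the choice to reload under a lock are all illustrative assumptions; the delegate would wrap whatever database query loads the parameters. Only language features available on .NET 2.0 are used in the class itself.

```csharp
using System;

// Caches one value and reloads it after the time-to-live (TTL) has elapsed.
public sealed class TimedCache<T>
{
    // Hypothetical loader signature; it would wrap the real database query.
    public delegate T Loader();

    private readonly object _sync = new object();
    private readonly Loader _load;
    private readonly TimeSpan _ttl;
    private T _value;
    private DateTime _expiresUtc = DateTime.MinValue;

    public TimedCache(Loader load, TimeSpan ttl)
    {
        if (load == null) throw new ArgumentNullException("load");
        _load = load;
        _ttl = ttl;
    }

    // Returns the cached value, reloading it first if it has expired.
    public T Get()
    {
        lock (_sync)
        {
            if (DateTime.UtcNow >= _expiresUtc)
            {
                _value = _load();
                _expiresUtc = DateTime.UtcNow + _ttl;
            }
            return _value;
        }
    }
}
```

Because the reload happens inside the lock, other threads wait while the value is refreshed. That keeps the sketch simple and avoids duplicate loads, at the cost of a brief stall once per TTL period.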

SQLite .Net Performance

I am trying to use sqlite in my application as a sort of cache. I say sort of because items never expire from my cache and I am not storing anything. I simply need to use the cache to store all ids I processed before. I don't want to process anything twice.
I am entering items into the cache at 10,000 messages/sec for a total of 150 million messages. My table is pretty simple. It only has one text column which stores the id's. I was doing this all in memory using a dictionary, however, I am processing millions of messages and, although it is fast that way, I ran out of memory after some time.
I have researched sqlite and performance and I understand that configuration is key, however, I am still getting horrible performance on inserts (I haven't tried selects yet). I am not able to keep up with even 5000 inserts/sec. Maybe this is as good as it gets.
My connection string is as below:
Data Source=filename;Version=3;Count Changes=off;Journal Mode=off;
Pooling=true;Cache Size=10000;Page Size=4096;Synchronous=off
Thanks for any help you can provide!
If you are doing lots of inserts or updates at once, put them in a transaction.
Also, if you are executing essentially the same SQL each time, use a parameterized statement.
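The two suggestions above (wrap the inserts in a transaction, reuse one parameterized statement) could look roughly like this. It is a sketch under stated assumptions, not a tuned implementation: the table `seen_ids(id TEXT PRIMARY KEY)`, the class and method names, and the batch size are hypothetical. The code uses only the provider-neutral `System.Data` interfaces, so an already opened SQLite connection from whichever ADO.NET provider is in use would be passed in.

```csharp
using System.Collections.Generic;
using System.Data;

public static class BulkInsert
{
    // Splits a sequence into consecutive lists of at most batchSize items.
    // Assumes batchSize > 0.
    public static IEnumerable<List<string>> Batches(IEnumerable<string> ids, int batchSize)
    {
        var batch = new List<string>(batchSize);
        foreach (string id in ids)
        {
            batch.Add(id);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<string>(batchSize);
            }
        }
        if (batch.Count > 0) yield return batch;
    }

    // Inserts ids with one transaction per batch and one reused, parameterized command.
    // Assumes an open connection and a hypothetical table seen_ids(id TEXT PRIMARY KEY).
    public static void InsertAll(IDbConnection connection, IEnumerable<string> ids, int batchSize)
    {
        using (IDbCommand command = connection.CreateCommand())
        {
            command.CommandText = "INSERT OR IGNORE INTO seen_ids (id) VALUES (@id)";
            IDbDataParameter idParam = command.CreateParameter();
            idParam.ParameterName = "@id";
            idParam.DbType = DbType.String;
            command.Parameters.Add(idParam);

            foreach (List<string> batch in Batches(ids, batchSize))
            {
                using (IDbTransaction transaction = connection.BeginTransaction())
                {
                    command.Transaction = transaction;
                    foreach (string id in batch)
                    {
                        idParam.Value = id;   // only the value changes per row
                        command.ExecuteNonQuery();
                    }
                    transaction.Commit();
                }
            }
        }
    }
}
```

Committing every few thousand rows instead of after every row means SQLite finishes one transaction per batch rather than one per insert, which is where most of the per-insert overhead tends to come from.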
Have you looked at the SQLite Optimization FAQ (a bit old)?
SQLite performance tuning and optimization on embedded systems
If you have many threads writing to the same database, then you're going to run into concurrency problems with that many transactions per second. SQLite always locks the whole database for writes so only one write transaction can be processed at a time.
An alternative is Oracle Berkeley DB with SQLite. The latest version of Berkeley DB includes a SQLite front end that has a page-level locking mechanism instead of database-level locking. This provides a much higher number of transactions per second when there is a high concurrency requirement.
http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html
It includes the same SQLite.NET provider and is supposed to be a drop-in replacement.
Since your requirements are so specific, you may be better off with something more dedicated, like memcached. This will provide a very high-throughput caching implementation that will be a lot more memory-efficient than a simple hashtable.
Is there a port of memcache to .Net?

Fetching records from database

In my C# 3.5 application, the code performs the following steps:
1. Loop through a collection [of length 10].
2. For each item in step 1, fetch records from an Oracle database by executing a stored proc [here, the record count is typically 100].
3. Process the items fetched in step 2.
4. Go to the next item in step 1.
My question here, with regard to performance: is it a good idea to fetch all items in step #2 [i.e. 10 * 100 = 1000 records] in one shot, rather than connecting to the database in each step and retrieving 100 records?
Thanks.
Yes, it's slightly better because you will lose the overhead of connecting to the DB, but you will still have the overhead of 10 stored procedure calls. If you could find a way to pass all 10 items as a parameter to the stored proc and execute just one stored proc call, I think you would get better performance.
Depending on how intense the connection steps are, it might be better to fetch all the records at once. However, keep in mind that premature optimization is the root of all evil. :-)
Generally it is better to pull all the records from the database in one stored procedure call.
This is countered when the stored procedure call is long-running or otherwise extensive enough to cause contention on the table. In your case, however, with only 1,000 records, I doubt that will be an issue.
Yes, it is an incredibly good idea. The key to database performance is to run as many operations in bulk as possible.
For example, consider just the interaction between PL/SQL and SQL. These two languages run on the same server and are very thoroughly integrated. Yet I routinely see an order of magnitude performance increase when I reduce or eliminate any interaction between the two. I'm sure the same thing applies to interaction between the application and the database.
Even though the number of records may be small, bulking your operations is an excellent habit to get into. It's not premature optimization, it's a best practice that will save you a lot of time and effort later.
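To make the single-round-trip idea above concrete, here is a minimal sketch of the client side. It assumes a single (hypothetical) stored procedure call that returns all roughly 1,000 rows at once, with the parent item's key in the first column and a payload in the second. The class and method names are illustrative, and `Convert` is used because providers differ in the .NET type they return for numeric columns.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class BulkFetch
{
    // Groups the rows of one combined result set by the item key in column 0.
    // Assumption: column 0 holds the item key, column 1 a string payload.
    public static Dictionary<int, List<string>> GroupByItem(IDataReader reader)
    {
        var groups = new Dictionary<int, List<string>>();
        while (reader.Read())
        {
            int itemKey = Convert.ToInt32(reader.GetValue(0));
            List<string> rows;
            if (!groups.TryGetValue(itemKey, out rows))
            {
                rows = new List<string>();
                groups.Add(itemKey, rows);
            }
            rows.Add(Convert.ToString(reader.GetValue(1)));
        }
        return groups;
    }
}
```

The existing loop over the 10 items can then look up each item's rows in the dictionary instead of going back to the database for every item.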
