We're working on an online system right now, and I'm confused about when to use in-memory search and when to use database search. Can someone please help me figure out the factors to be considered when it comes to searching records?
One factor is that if you need to go through the same results over and over, be sure to cache them in memory. This becomes an issue when you're using linq-to-sql or Entity FrameworkâORMs that support deferred execution.
So if you have an IQueryable<SomeType> that you need to go through multiple times, make sure you materialize it with a ToList() before firing up multiple foreach loops.
It depends on the situation, though I generally prefer in memory search when possible.
However depends on the context, for example if records can get updated between one search and another , and you need the most updated record at the time of the search, obviously you need database search.
If the size of the recordset (data table) that you need to store in memory is huge, maybe is better another search directly on the database.
However keep present that if you can and if performance are important loading data into a datatable and searching, filtering with LINQ for example can increase performance of the search itself.
Another thing to keep in mind is performance of database server and performance of application server : if the database server if fast enough on the search query, maybe you don't need to caching in memory on the application and so you can avoid one step. Keep in mind that caching for in memory search move computational request from database to the application server...
An absolute response is not possible for your question, it is relative on your context ...
It depends on the number of records. If the number of records is small then it's better to keep that in memory, i.e cache the records. Also, if the records get queried frequently then go for the memory option.
But if the record number or record size is too large than it's better to go for the database search option.
Basically it depends on how much memory you have on your server...
Related
I need to load multiple sql statements from SQL Server into DataTables. Most of the statements return some 10.000 to 100.000 records and each take up to a few seconds to load.
My guess is that this is simply due to the amount of data that needs to be shoved around. The statements themselves don't take much time to process.
So I tried to use Parallel.For() to load the data in parallel, hoping that the overall processing time would decrease. I do get a 10% performance increase, but that is not enough. A reason might be that my machine is only a dual core, thus limiting the benefit here. The server on which the program will be deployed has 16 cores though.
My question is, how I could improve the performance more? Would the use of Asynchronous Data Service Queries be a better solution (BeginExecute, etc.) than PLINQ? Or maybe some other approach?
The SQl Server is running on the same machine. This is also the case on the deployment server.
EDIT:
I've run some tests with using a DataReader instead of a DataTable. This already decreased the load times by about 50%. Great! Still I am wondering whether parallel processing with BeginExecute would improve the overall load time if a multiprocessor machine is used. Does anybody have experience with this? Thanks for any help on this!
UPDATE:
I found that about half of the loading time was consumed by processing the sql statement. In SQL Server Management Studio the statements took only a fraction of the time, but somehow they take much longer through ADO.NET. So by using DataReaders instead of loading DataTables and adapting the sql statements I've come down to about 25% of the initial loading time. Loading the DataReaders in parallel threads with Parallel.For() does not make an improvement here. So for now I am happy with the result and leave it at that. Maybe when we update to .NET 4.5 I'll give the asnchronous DataReader loading a try.
My guess is that this is simply due to the amount of data that needs to be shoved around.
No, it is due to using a SLOW framework. I am pulling nearly a million rows into a dictionary in less than 5 seconds in one of my apps. DataTables are SLOW.
You have to change the nature of the problem. Let's be honest, who needs to view 10.000 to 100.000 records per request? I think no one.
You need to consider to handle paging and in your case, paging should be done on sql server. To make this clear, lets say you have stored procedure named "GetRecords". Modify this stored procedure to accept page parameter and return only data relevant for specific page (let's say 100 records only) and total page count. Inside app just show this 100 records (they will fly) and handle selected page index.
Hope this helps, best regards!
Do you often have to load these requests? If so, why not use a distributed cache?
To Query everytime on the Database and use 'WHERE' operator?
SELECT * FROM tblProduct WHERE productID = #productID
OR
To Filter the Products List that are put into Cache?
DataTable dtProducts = new DataTable();
dtProducts = HttpContext.Current.Cache["CachedProductList"] as DataTable;
DataView dvProduct = new DataView();
dvProduct = dtProducts.DefaultView;
dvProduct.RowFilter = String.Format("[productID] = {0}", iProductID);
Please share your opinion. Thanks in advance.
Performance is very subjective to your data and how you use it. The method to know what works for sure is to benchmark.
Decide to cache only when your db performance does not meet the performance you require.
When you cache data, you add a lot of overhead in making sure it is up-to-date.
Sql server does not read from disk every time you fire a query, it caches results of frequent queries. Before you decide to cache, know the caching mechanisms used by your database. Using a stored procedure would allow you to cache the query plan too.
Caching data, especially through an in-memory mechanism like HttpContext.Current.Cache is (almost) always going to be faster than going back to the database. Going to the database requires establishing network connections, then the database has to do I/O, etc., whereas using the cache you just use objects in memory. That said, there are a number of things you have to take into account:
The ASP.NET runtime cache is not distributed. If you will be running this code on multiple nodes, you have to decide if you're okay with different nodes potentially having different version of the cached data.
Caches can be told to hold onto data for as long as you want them to, as short as just a few minutes and as long as forever. You have to take into consideration how long the data is going to remain unchanged when deciding how long to cache it. Product data probably doesn't change more often than once a day, so it's a very viable candidate for caching.
Be aware though that the cache time limits you set are not absolutes; objects can be evicted from the cache because of memory limits or when a process/app pool recycles.
As pointed out above, DataTable is not a good object to cache; it's very bulky and expensive to serialize. A list of custom classes is a much better choice from a performance standpoint.
I would say as a general rule of thumb, if you need a set of data more frequently than a few times an hour and it changes less frequently than every few hours, it would be better to pull the list from the database, cache it for a reasonable amount of time, and retrieve it by a filter in code. But that's a general rule; this is the kind of thing that's worth experimenting with in your particular environment.
200,000 objects is a lot of data to put into a cache, but it's also a lot of work for the database if you have to retrieve it frequently. Perhaps there's some subset of it that would be better to cache, and a different, less frequently used subset that could be retrieved every time it's needed. As I said, experiment!
I would prefer the first method. Having 20000 rows in cache does not sound good to me.
I have a database that contains more than 100 million records. I am running a query that contains more than 10 million records. This process takes too much time so i need to shorten this time. I want to save my obtained record list as a csv file. How can I do it as quickly and optimum as possible? Looking forward your suggestions. Thanks.
I'm assuming that your query is already constrained to the rows/columns you need, and makes good use of indexing.
At that scale, the only critical thing is that you don't try to load it all into memory at once; so forget about things like DataTable, and most full-fat ORMs (which typically try to associate rows with an identity-manager and/or change-manager). You would have to use either the raw IDataReader (from DbCommand.ExecuteReader), or any API that builds a non-buffered iterator on top of that (there are several; I'm biased towards dapper). For the purposes of writing CSV, the raw data-reader is probably fine.
Beyond that: you can't make it go much faster, since you are bandwidth constrained. The only way you can get it faster is to create the CSV file at the database server, so that there is no network overhead.
Chances are pretty slim you need to do this in C#. This is the domain of bulk data loading/exporting (commonly used in Data Warehousing scenarios).
Many (free) tools (I imagine even Toad by Quest Software) will do this more robustly and more efficiently than you can write it in any platform.
I have a hunch that you don't actually need this for an end-user (the simple observation is that the department secretary doesn't actually need to mail out copies of that; it is too large to be useful in that way).
I suggest using the right tool for the job. And whatever you do,
donot roll your own datatype conversions
use CSV with quoted literals and think of escaping the double quotes inside these
think of regional options (IOW: always use InvariantCulture for export/import!)
"This process takes too much time so i need to shorten this time. "
This process consists of three sub-processes:
Retrieving > 10m records
Writing records to file
Transferring records across the network (my presumption is you are working with a local client against a remote database)
Any or all of those issues could be a bottleneck. So, if you want to reduce the total elapsed time you need to figure out where the time is spent. You will probably need to instrument your C# code to get the metrics.
If it turns out the query is the problem then you will need to tune it. Indexes won't help here as you're retrieving a large chunk of the table (> 10%), so increasing the performance of a full table scan will help. For instance increasing the memory to avoid disk sorts. Parallel query could be useful (if you have Enterprise Edition and you have sufficient CPUs). Also check that the problem isn't a hardware issue (spindle contention, dodgy interconnects, etc).
Can writing to a file be the problem? Perhaps your disk is slow for some reason (e.g. fragmentation) or perhaps you're contending with other processes writing to the same directory.
Transferring large amounts of data across a network is obviously a potential bottleneck. Are you certain you're only sending relevenat data to the client?
An alternative architecture: use PL/SQL to write the records to a file on the dataserver, using bulk collect to retrieve manageable batches of records, and then transfer the file to where you need it at the end, via FTP, perhaps compressing it first.
The real question is why you need to read so many rows from the database (and such a large proportion of the underlying dataset). There are lots of approaches which should make this scenario avoidable, obvious ones being synchronous processing, message queueing and pre-consolidation.
Leaving that aside for now...if you're consolidating the data or sifting it, then implementing the bulk of the logic in PL/SQL saves having to haul the data across the network (even if it's just to localhost, there's still a big overhead). Again if you just want to dump it out into a flat file, implementing this in C# isn't doing you any favours.
I am trying to use sqlite in my application as a sort of cache. I say sort of because items never expire from my cache and I am not storing anything. I simply need to use the cache to store all ids I processed before. I don't want to process anything twice.
I am entering items into the cache at 10,000 messages/sec for a total of 150 million messages. My table is pretty simple. It only has one text column which stores the id's. I was doing this all in memory using a dictionary, however, I am processing millions of messages and, although it is fast that way, I ran out of memory after some time.
I have researched sqlite and performance and I understand that configuration is key, however, I am still getting horrible performance on inserts (I haven't tried selects yet). I am not able to keep up with even 5000 inserts/sec. Maybe this is as good as it gets.
My connection string is as below:
Data Source=filename;Version=3;Count Changes=off;Journal Mode=off;
Pooling=true;Cache Size=10000;Page Size=4096;Synchronous=off
Thanks for any help you can provide!
If you are doing lots of inserts or updates at once, put them in a transaction.
Also, if you are executing essentially the same SQL each time, use a parameterized statement.
Have you looked at the SQLite Optimization FAQ (bit old).
SQLite performance tuning and optimization on embedded systems
If you have many threads writing to the same database, then you're going to run into concurrency problems with that many transactions per second. SQLite always locks the whole database for writes so only one write transaction can be processed at a time.
An alternative is Oracle Berkley DB with SQLite. This latest version of Berkley DB includes a SQLite front end that has a page-level locking mechanism instead of database level. This provides much higher numbers of transactions per second when there is a high concurrency requirement.
http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html
It includes the same SQLite.NET provider and is supposed to be a drop-in replacement.
Since you're requirements are so specific you may be better off with something more dedicated, like memcached. This will provide a very high throughput caching implementation that will be a lot more memory efficient than a simple hashtable.
Is there a port of memcache to .Net?
In my C# 3.5 application,code performs following steps:
1.Loop through a collection[of length 10]
2.For each item in step 1, fetch records from oracle database by executing a stored proc[here,record count is typically 100]
3.Process items fetched in step 2.
4.Go to next item in step 1.
My question here, with regard to performance, is it a good idea to fetch all items in step #2[ie. 10 * 100=1000 records] in one shot rather than connecting to database in each step and retrieving the 10 records?
Thanks.
Yes it's slightly better because you will lose the overhead of connecting to the DB, but you will still have the overhead of 10 stored procedure calls. If you could find a way to pass all 10 items as parameter to the stored proc and execute just one stored proc call, I think you would get a better performance.
Depending on how intense the connection steps are, it might be better to fetch all the records at once. However, keep in mind that premature optimization is the root of all evil. :-)
Generally it is better to pull all the records from the database in one stored procedure call.
This is countered when the stored procedure call is long running or otherwise extensive enough to cause contention on the table. In your case however with only a 1000 records, I doubt that will be an issue.
Yes, it is an incredibly good idea. The key to database performance is to run as many operations in bulk as possible.
For example, consider just the interaction between PL/SQL and SQL. These two languages run on the same server and are very thoroughly integrated. Yet I routinely see an order of magnitude performance increase when I reduce or eliminate any interaction between the two. I'm sure the same thing applies to interaction between the application and the database.
Even though the number of records may be small, bulking your operations is an excellent habit to get into. It's not premature optimization, it's a best practice that will save you a lot of time and effort later.