Performance: should I make one or multiple queries to database - c#

I have a C# program that has to process about 100,000 items, with ADO.NET access to a SQL Server database, and in one part of the program I have to make a performance decision.
During processing, I have to read data from the database for each item.
Should I query the database once for every item, or should I query once at the beginning for all items, keep those 100,000 rows (about 10 columns of int and string) in a C# object in memory, and retrieve the required data from it?

If you have a reasonably static data set and enough memory to read everything upfront and keep the results cached without starving the rest of your system of memory, the answer is very easy: you should do it.
There are two major components to the cost of any DB operation: the cost of the data transfer and the cost of a round trip. In your case, the cost of the data transfer is fixed, because the total number of bytes does not change based on whether you retrieve them all at once or get them one chunk at a time.
The cost of a round trip includes the time it takes the RDBMS to figure out what data you need from the SQL statement, locate that data, and do all the locking required to ensure that the data it serves you is consistent. A single round trip is not expensive, but when you do it 100,000 times, the costs may very well become prohibitive. That is why it is best to read the data at once if your memory configuration allows it.
Another concern is how dynamic your data is. If the chances of your data changing in the time it takes you to process the entire set are high, you may need to take additional precautions to see whether anything must be re-processed once your computations are done.
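As a rough sketch of the "read once, look up in memory" approach described above: the reader is consumed a single time and the rows are indexed by key, so each of the 100,000 items costs a dictionary lookup instead of a round trip. The `ItemRow` type and the column names `Id`, `Name` and `Price` are assumptions made for illustration, and `SampleReader` is only an in-memory stand-in for a real query result.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical row type; the column names Id, Name and Price are assumptions.
public sealed class ItemRow
{
    public int Id;
    public string Name;
    public decimal Price;
}

public static class ItemLookup
{
    // Reads every row from an open reader once and indexes the rows by Id,
    // so later per-item lookups are dictionary hits instead of round trips.
    public static Dictionary<int, ItemRow> Load(IDataReader reader)
    {
        var byId = new Dictionary<int, ItemRow>();
        int idCol = reader.GetOrdinal("Id");
        int nameCol = reader.GetOrdinal("Name");
        int priceCol = reader.GetOrdinal("Price");
        while (reader.Read())
        {
            var row = new ItemRow
            {
                Id = reader.GetInt32(idCol),
                Name = reader.GetString(nameCol),
                Price = reader.GetDecimal(priceCol)
            };
            byId[row.Id] = row;
        }
        return byId;
    }

    // Stand-in for a real query result: an in-memory DataTable exposed as a reader.
    public static IDataReader SampleReader()
    {
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Name", typeof(string));
        table.Columns.Add("Price", typeof(decimal));
        table.Rows.Add(1, "widget", 9.50m);
        table.Rows.Add(2, "gadget", 12.00m);
        return table.CreateDataReader();
    }
}
```

Against SQL Server, the same `Load` method could presumably be fed the reader returned by `SqlCommand.ExecuteReader()` for a single `SELECT` covering all items.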

I am not sure what you mean by processing, but often, if the processing can be done on the DB server, triggering a stored procedure and passing arguments to it is the preferred option. You then avoid the round trips altogether. You have to decide whether you want to bring the data to the processing (from DB to application) or bring the processing to the data (processing code in a stored procedure).

Related

What is more advisable for an e-commerce website when it comes to displaying a specific product?

To query the database every time and use a WHERE clause?
SELECT * FROM tblProduct WHERE productID = @productID
OR
To filter the product list that has been put into the cache?
DataTable dtProducts = HttpContext.Current.Cache["CachedProductList"] as DataTable;
DataView dvProduct = dtProducts.DefaultView;
dvProduct.RowFilter = String.Format("[productID] = {0}", iProductID);
Please share your opinion. Thanks in advance.
Performance depends heavily on your data and how you use it. The only way to know for sure what works is to benchmark.
Decide to cache only when your database performance does not meet your requirements.
When you cache data, you add a lot of overhead in making sure it stays up to date.
SQL Server does not read from disk every time you fire a query; it keeps frequently used data pages in memory. Before you decide to cache, learn the caching mechanisms your database already uses. Using a stored procedure also lets the server cache and reuse the query plan.
Caching data, especially through an in-memory mechanism like HttpContext.Current.Cache is (almost) always going to be faster than going back to the database. Going to the database requires establishing network connections, then the database has to do I/O, etc., whereas using the cache you just use objects in memory. That said, there are a number of things you have to take into account:
The ASP.NET runtime cache is not distributed. If you will be running this code on multiple nodes, you have to decide whether you're okay with different nodes potentially having different versions of the cached data.
Caches can be told to hold onto data for as long as you want them to, as short as just a few minutes and as long as forever. You have to take into consideration how long the data is going to remain unchanged when deciding how long to cache it. Product data probably doesn't change more often than once a day, so it's a very viable candidate for caching.
Be aware though that the cache time limits you set are not absolutes; objects can be evicted from the cache because of memory limits or when a process/app pool recycles.
As pointed out above, DataTable is not a good object to cache; it's very bulky and expensive to serialize. A list of custom classes is a much better choice from a performance standpoint.
I would say as a general rule of thumb, if you need a set of data more frequently than a few times an hour and it changes less frequently than every few hours, it would be better to pull the list from the database, cache it for a reasonable amount of time, and retrieve it by a filter in code. But that's a general rule; this is the kind of thing that's worth experimenting with in your particular environment.
200,000 objects is a lot of data to put into a cache, but it's also a lot of work for the database if you have to retrieve it frequently. Perhaps there's some subset of it that would be better to cache, and a different, less frequently used subset that could be retrieved every time it's needed. As I said, experiment!
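To make the "list of custom classes, cached for a reasonable amount of time" idea concrete, here is a minimal sketch of an expiring in-process cache keyed by product ID. It is not the ASP.NET cache API; `Product`, `ProductCache` and the injected `load` and `now` delegates are hypothetical names chosen for illustration, and a real application would have to decide on its own expiry and eviction policy.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical product type standing in for a row of tblProduct.
public sealed class Product
{
    public int ProductId;
    public string Name;
}

// Minimal expiring cache: reloads the whole product list once it is older than ttl.
public sealed class ProductCache
{
    private readonly Func<IEnumerable<Product>> _load; // e.g. one database query
    private readonly TimeSpan _ttl;
    private readonly Func<DateTime> _now;               // injectable clock, handy for tests
    private readonly object _gate = new object();
    private Dictionary<int, Product> _byId;
    private DateTime _loadedAt;

    public ProductCache(Func<IEnumerable<Product>> load, TimeSpan ttl, Func<DateTime> now)
    {
        _load = load;
        _ttl = ttl;
        _now = now;
    }

    // Returns the product, or null when the id is unknown.
    public Product Find(int productId)
    {
        lock (_gate)
        {
            if (_byId == null || _now() - _loadedAt >= _ttl)
            {
                var fresh = new Dictionary<int, Product>();
                foreach (var p in _load()) fresh[p.ProductId] = p;
                _byId = fresh;
                _loadedAt = _now();
            }
            Product found;
            return _byId.TryGetValue(productId, out found) ? found : null;
        }
    }
}
```

Indexing by ID in a dictionary replaces the row filter with a direct lookup, which is one possible way to avoid scanning the cached list on every request.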
I would prefer the first method. Having 20000 rows in cache does not sound good to me.

which is better... in-memory search or database access?

We're working on an online system right now, and I'm confused about when to use in-memory search and when to use database search. Can someone please help me figure out the factors to be considered when it comes to searching records?
One factor is that if you need to go through the same results over and over, be sure to cache them in memory. This becomes an issue when you're using LINQ to SQL or Entity Framework, ORMs that support deferred execution.
So if you have an IQueryable<SomeType> that you need to go through multiple times, make sure you materialize it with a ToList() before firing up multiple foreach loops.
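The effect is easy to reproduce without a database. In this small sketch an iterator method plays the role of the deferred query (an assumption for illustration; a real `IQueryable` would hit the database instead), and a counter shows how often the source is actually read.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    public static int SourceReads; // counts how often the "query" actually runs

    // Stand-in for a deferred query: each enumeration re-runs the body.
    public static IEnumerable<int> Query()
    {
        SourceReads++;
        yield return 1;
        yield return 2;
        yield return 3;
    }

    public static int SumTwiceDeferred()
    {
        IEnumerable<int> q = Query().Where(x => x > 1);
        return q.Sum() + q.Sum();   // enumerates the source twice
    }

    public static int SumTwiceMaterialized()
    {
        List<int> list = Query().Where(x => x > 1).ToList(); // source runs once here
        return list.Sum() + list.Sum();                      // both sums use the list
    }
}
```

Both methods return the same total, but the deferred version reads the source once per pass, while the materialized version reads it once overall.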
It depends on the situation, though I generally prefer in-memory search when possible.
However, it depends on the context. For example, if records can be updated between one search and another and you need the most recent version at the time of the search, you obviously need a database search.
If the recordset (data table) that you would need to keep in memory is huge, it may be better to search directly in the database.
However, keep in mind that, if you can afford it and performance is important, loading the data into a DataTable and searching or filtering it with LINQ, for example, can speed up the search itself.
Another thing to keep in mind is the performance of the database server versus the application server: if the database server is fast enough on the search query, you may not need to cache in memory in the application, and you can avoid that extra step. Remember that caching for in-memory search moves the computational load from the database to the application server.
There is no absolute answer to your question; it depends on your context.
It depends on the number of records. If the number of records is small, then it's better to keep them in memory, i.e. cache the records. Also, if the records get queried frequently, then go for the memory option.
But if the number or size of the records is too large, then it's better to go for the database search option.
Basically, it depends on how much memory you have on your server.

Listing more than 10 million records from Oracle With C#

I have a database that contains more than 100 million records. I am running a query that returns more than 10 million records. This process takes too much time so I need to shorten this time. I want to save the resulting record list as a CSV file. How can I do this as quickly and efficiently as possible? I look forward to your suggestions. Thanks.
I'm assuming that your query is already constrained to the rows/columns you need, and makes good use of indexing.
At that scale, the only critical thing is that you don't try to load it all into memory at once; so forget about things like DataTable, and most full-fat ORMs (which typically try to associate rows with an identity-manager and/or change-manager). You would have to use either the raw IDataReader (from DbCommand.ExecuteReader), or any API that builds a non-buffered iterator on top of that (there are several; I'm biased towards dapper). For the purposes of writing CSV, the raw data-reader is probably fine.
Beyond that: you can't make it go much faster, since you are bandwidth constrained. The only way you can get it faster is to create the CSV file at the database server, so that there is no network overhead.
Chances are pretty slim you need to do this in C#. This is the domain of bulk data loading/exporting (commonly used in Data Warehousing scenarios).
Many (free) tools (I imagine even Toad by Quest Software) will do this more robustly and more efficiently than you can write it in any platform.
I have a hunch that you don't actually need this for an end-user (the simple observation is that the department secretary doesn't actually need to mail out copies of that; it is too large to be useful in that way).
I suggest using the right tool for the job. And whatever you do:
do not roll your own datatype conversions
use CSV with quoted literals, and remember to escape the double quotes inside them
think about regional options (in other words: always use InvariantCulture for export/import!)
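If the export does end up in C#, the points above (streaming from a data reader, quoting, escaping, invariant culture) could look roughly like the sketch below. The `CsvExport` class is a hypothetical helper, not a library API. It quotes a field only when the field contains a quote, comma or line break, and formats dates in the round-trip "o" format; both are choices made for this example.

```csharp
using System;
using System.Data;
using System.Globalization;
using System.IO;

public static class CsvExport
{
    // Formats one value culture-independently and quotes it when needed.
    public static string Field(object value)
    {
        if (value == null || value is DBNull) return "";
        string text = (value is DateTime)
            ? ((DateTime)value).ToString("o", CultureInfo.InvariantCulture)
            : Convert.ToString(value, CultureInfo.InvariantCulture);
        bool mustQuote = text.IndexOfAny(new[] { '"', ',', '\r', '\n' }) >= 0;
        // Embedded double quotes are escaped by doubling them.
        return mustQuote ? "\"" + text.Replace("\"", "\"\"") + "\"" : text;
    }

    // Streams rows from the reader to the writer one at a time;
    // no row is kept in memory after it has been written.
    public static long Write(IDataReader reader, TextWriter output)
    {
        long rows = 0;
        var values = new object[reader.FieldCount];
        while (reader.Read())
        {
            reader.GetValues(values);
            for (int i = 0; i < values.Length; i++)
            {
                if (i > 0) output.Write(',');
                output.Write(Field(values[i]));
            }
            output.Write("\r\n");
            rows++;
        }
        return rows;
    }
}
```

With a real database, `Write` would presumably receive the reader from `DbCommand.ExecuteReader()` and a `StreamWriter` over the target file.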
"This process takes too much time so i need to shorten this time. "
This process consists of three sub-processes:
Retrieving > 10m records
Writing records to file
Transferring records across the network (my presumption is you are working with a local client against a remote database)
Any or all of those issues could be a bottleneck. So, if you want to reduce the total elapsed time you need to figure out where the time is spent. You will probably need to instrument your C# code to get the metrics.
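One simple way to get such metrics is to wrap each phase in a `Stopwatch` and accumulate the totals per phase. The `PhaseTimer` class below is a hypothetical helper written for this illustration, and the phase names are up to you.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Accumulates elapsed wall-clock time per named phase (e.g. "query", "write").
public sealed class PhaseTimer
{
    private readonly Dictionary<string, TimeSpan> _totals =
        new Dictionary<string, TimeSpan>();

    // Runs the work, adds its duration to the phase total, and returns its result.
    public T Measure<T>(string phase, Func<T> work)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            return work();
        }
        finally
        {
            sw.Stop();
            TimeSpan soFar;
            _totals.TryGetValue(phase, out soFar); // stays TimeSpan.Zero if absent
            _totals[phase] = soFar + sw.Elapsed;
        }
    }

    public IDictionary<string, TimeSpan> Totals
    {
        get { return _totals; }
    }
}
```

Calling `Measure("query", ...)` around each read and `Measure("write", ...)` around each file write should show which phase dominates; the network share is harder to separate from the query time on the client side.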
If it turns out the query is the problem, then you will need to tune it. Indexes won't help here, as you're retrieving a large chunk of the table (> 10%), so what will help is making the full table scan faster, for instance by increasing memory to avoid disk sorts. Parallel query could be useful (if you have Enterprise Edition and sufficient CPUs). Also check that the problem isn't a hardware issue (spindle contention, dodgy interconnects, etc.).
Can writing to a file be the problem? Perhaps your disk is slow for some reason (e.g. fragmentation) or perhaps you're contending with other processes writing to the same directory.
Transferring large amounts of data across a network is obviously a potential bottleneck. Are you certain you're only sending relevant data to the client?
An alternative architecture: use PL/SQL to write the records to a file on the dataserver, using bulk collect to retrieve manageable batches of records, and then transfer the file to where you need it at the end, via FTP, perhaps compressing it first.
The real question is why you need to read so many rows from the database (and such a large proportion of the underlying dataset). There are lots of approaches which should make this scenario avoidable, obvious ones being synchronous processing, message queueing and pre-consolidation.
Leaving that aside for now...if you're consolidating the data or sifting it, then implementing the bulk of the logic in PL/SQL saves having to haul the data across the network (even if it's just to localhost, there's still a big overhead). Again if you just want to dump it out into a flat file, implementing this in C# isn't doing you any favours.

Strategy to avoid OutOfMemoryException during ETL in .NET

I have written an ETL process. It needs to process more than 100 million rows overall, covering 2 years' worth of records. To avoid out-of-memory issues, we chunk the data loading into 7-day ranges. For each chunk, the process loads all the required reference data, then opens a SQL connection, loads the source data row by row, transforms it, and writes it to the data warehouse.
The drawback of processing the data in chunks is that it is slow.
This process has been working fine for most of the tables, but there is one table for which I still run out of memory because the process loads too much reference data. I would like to avoid chunking the data down to 3 days so that performance stays decent.
Are there any other strategies I can use to avoid OutOfMemoryException?
For example: a local database, writing the reference data to files, spawning another .NET process to hold more memory in Windows, using a CLR stored procedure to do the ETL...
Environment: Windows 7 32-bit OS, 4 GB of RAM, SQL Server Standard Edition.
The only solution I can see is to use a stored procedure and let SQL Server handle the ETL. However, I am trying to avoid that because the program needs to support Oracle as well.
Other performance improvements I have tried: adding indexes to speed up the loading queries, and creating a custom data access class that loads only the necessary columns instead of the entire row.
Thanks
Without knowing exactly how you process the data it is hard to say, but a naive solution that can be applied in any case is to use a 64-bit OS and compile your application as 64-bit. In 32-bit mode the .NET heap will only grow to about 1.5 GB, which might be what is limiting you.
I know this is an old post, but this is for people searching for advice on writing data operations in a programming language.
I am not sure whether you have considered studying how ETL tools perform their data loading operations and replicating a similar strategy in your code.
One such suggestion is parallel data pipes. Each pipe performs the ETL on a single chunk, based on a partitioning of the source data. For example, you could spawn processes for different weeks of data in parallel. This will not solve memory issues within a single process, but it can help when you hit the heap allocation limit of a single process. It is also useful for reading the data in parallel with random access. It does require a master process to coordinate the pipes and complete the work as a single ETL operation.
I assume your transformation performs a lot of lookup operations before finally writing the data to the database, and that the master transaction table is huge while the reference data is small. You need to focus on data structures and algorithms. A few tips follow. Consider the characteristics of your data before choosing what suits it best when writing the algorithm.
Generally, lookup data (reference data) is stored in a cache. Choose a simple data structure that is efficient for read and search operations (say, an array list). If possible, sort this array by the key you will join on, so that your search algorithm is efficient.
There are different strategies for lookup operations in your transformation tasks. In the database world you would call them join operations.
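A minimal sketch of the sorted-array idea: the reference keys are sorted once, and each lookup is then a binary search. `SortedLookup` is a hypothetical helper, and `int` keys with `string` values are assumptions for illustration. A `Dictionary` would be the more common choice; parallel arrays are shown here because they carry very little per-entry overhead.

```csharp
using System;

// Reference data held in two parallel arrays sorted by key.
public sealed class SortedLookup
{
    private readonly int[] _keys;
    private readonly string[] _values;

    // Copies the inputs and sorts both arrays by key so binary search can be used.
    public SortedLookup(int[] keys, string[] values)
    {
        _keys = (int[])keys.Clone();
        _values = (string[])values.Clone();
        Array.Sort(_keys, _values);
    }

    // O(log n) search; returns false when the key is not present.
    public bool TryFind(int key, out string value)
    {
        int index = Array.BinarySearch(_keys, key);
        if (index < 0)
        {
            value = null;
            return false;
        }
        value = _values[index];
        return true;
    }
}
```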
Merge join algorithm:
Ideal when the source is already sorted on the join attribute. The key idea of the sort-merge algorithm is to first sort the relations by the join attribute, so that interleaved linear scans will encounter these sets at the same time. For sample code, see https://en.wikipedia.org/wiki/Sort-merge_join
Nested join:
Works like a nested loop, where each value of the index of the outer loop is taken as a limit (or starting point, or whatever is applicable) for the index of the inner loop, and the corresponding actions are performed on the statement(s) following the inner loop. So basically, if the outer loop executes R times and for each such execution the inner loop executes S times, then the total cost or time complexity of the nested loop is O(RS).
Nested-loop joins provide efficient access when tables are indexed on the join columns. Furthermore, in many small transactions, such as those affecting only a small set of rows, index nested-loop joins are far superior to both sort-merge joins and hash joins.
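To compare the two strategies side by side, here is a small sketch that joins two lists of integer keys and returns the matching index pairs. It assumes plain `int` keys, and for the merge variant that both inputs are already sorted in ascending order; the `Joins` class is hypothetical and leaves out the indexed nested-loop variant mentioned above.

```csharp
using System;
using System.Collections.Generic;

public static class Joins
{
    // Nested-loop join: compares every left key with every right key, O(R*S).
    public static List<Tuple<int, int>> NestedLoop(IList<int> left, IList<int> right)
    {
        var result = new List<Tuple<int, int>>();
        for (int i = 0; i < left.Count; i++)
            for (int j = 0; j < right.Count; j++)
                if (left[i] == right[j]) result.Add(Tuple.Create(i, j));
        return result;
    }

    // Sort-merge join: both inputs must already be sorted ascending.
    // The two cursors only move forward, apart from re-scanning a run of equal keys.
    public static List<Tuple<int, int>> Merge(IList<int> left, IList<int> right)
    {
        var result = new List<Tuple<int, int>>();
        int i = 0, j = 0;
        while (i < left.Count && j < right.Count)
        {
            if (left[i] < right[j]) i++;
            else if (left[i] > right[j]) j++;
            else
            {
                int key = left[i];
                int runStart = j;
                // Pair each left row having this key with each right row having it.
                while (i < left.Count && left[i] == key)
                {
                    j = runStart;
                    while (j < right.Count && right[j] == key)
                    {
                        result.Add(Tuple.Create(i, j));
                        j++;
                    }
                    i++;
                }
            }
        }
        return result;
    }
}
```

On sorted inputs both methods produce the same pairs in the same order; the merge version simply avoids comparing keys that cannot match.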
I am only describing two methods that you can consider for your lookup operation. The main idea to remember is that ETL is all about looking up and retrieving tuples (as a set) for further operations. The search is based on a key, and the resulting transaction keys are used to extract all the records (projection). Take these and load the rows from the file in one read operation. This is more of a suggestion for the case where you don't need all the records for the transformation.
Another very costly operation is writing back to the database. There may be a tendency to perform the extraction, transformation and loading one row at a time. Think of operations that can be vectorized, where you can perform them in bulk together with a data structure operation. For example, apply a lambda operation to a multi-dimensional vector rather than looping over every row one at a time and performing transformations across all columns of a given row. The vector can then be written to a file or to the database. This will avoid memory pressure.
This was a very old question, and it is more of a design question; I am sure there are many solutions to it, which I cannot cover without getting into more specific details.
Ultimately, I wrote a SQL stored procedure using MERGE to handle the ETL process for the data type that took too long to process through the C# application. In addition, the business requirements changed such that we dropped Oracle support and only support 64-bit servers, which reduced maintenance cost and avoided the ETL out-of-memory issue.
In addition, we added indexes wherever we saw an opportunity to improve query performance.
Instead of chunking only by day range, the ETL process also chunks the data by count (5000) and commits each chunk in its own transaction. This reduced the transaction log file size, and if the ETL fails, the process only needs to roll back a subset of the data.
Lastly, we implemented (key, value) caches so that frequently referenced data within the ETL date range is loaded in memory to reduce database querying.
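Chunking by count can be sketched as a lazy batching helper like the one below. This is an illustration rather than the code actually used; `Batching.InBatches` is a hypothetical name, and the batch size of 5000 from the text would simply be passed as `batchSize`.

```csharp
using System;
using System.Collections.Generic;

public static class Batching
{
    // Lazily splits a stream of rows into lists of at most batchSize items,
    // so only one batch needs to be held in memory at a time.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> rows, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        var batch = new List<T>(batchSize);
        foreach (T row in rows)
        {
            batch.Add(row);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }
        // The last batch may be smaller than batchSize.
        if (batch.Count > 0) yield return batch;
    }
}
```

Each yielded batch would then be written and committed inside its own transaction, which is what keeps both the transaction log and the rollback scope small.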

Fetching records from database

In my C# 3.5 application, the code performs the following steps:
1. Loop through a collection (of length 10).
2. For each item in step 1, fetch records from an Oracle database by executing a stored procedure (the record count is typically 100).
3. Process the items fetched in step 2.
4. Go to the next item in step 1.
My question, with regard to performance: is it a good idea to fetch all the items of step 2 (i.e. 10 * 100 = 1000 records) in one shot, rather than connecting to the database in each step and retrieving 100 records at a time?
Thanks.
Yes, it's slightly better because you will lose the overhead of connecting to the DB, but you will still have the overhead of 10 stored procedure calls. If you could find a way to pass all 10 items as a parameter to the stored proc and execute just one call, I think you would get better performance.
Depending on how intense the connection steps are, it might be better to fetch all the records at once. However, keep in mind that premature optimization is the root of all evil. :-)
Generally it is better to pull all the records from the database in one stored procedure call.
This is countered when the stored procedure call is long-running or otherwise extensive enough to cause contention on the table. In your case, however, with only 1000 records, I doubt that will be an issue.
Yes, it is an incredibly good idea. The key to database performance is to run as many operations in bulk as possible.
For example, consider just the interaction between PL/SQL and SQL. These two languages run on the same server and are very thoroughly integrated. Yet I routinely see an order of magnitude performance increase when I reduce or eliminate any interaction between the two. I'm sure the same thing applies to interaction between the application and the database.
Even though the number of records may be small, bulking your operations is an excellent habit to get into. It's not premature optimization, it's a best practice that will save you a lot of time and effort later.
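Once all records arrive from a single call, they still have to be handed to the per-item processing. One possible sketch groups them by the owning item with `ToLookup`; the `Record` type, its `ParentId` field, and the `fetchAll` delegate (standing in for the one stored procedure call) are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record shape: ParentId is the collection item the record belongs to.
public sealed class Record
{
    public int ParentId;
    public string Payload;
}

public static class BulkFetch
{
    // fetchAll stands in for a single database call returning the records
    // for every requested id. Returns the number of records seen per item.
    public static Dictionary<int, int> ProcessAll(
        IList<int> itemIds, Func<IList<int>, IEnumerable<Record>> fetchAll)
    {
        ILookup<int, Record> byParent = fetchAll(itemIds).ToLookup(r => r.ParentId);
        var counts = new Dictionary<int, int>();
        foreach (int id in itemIds)
        {
            // An id without records yields an empty sequence rather than throwing.
            counts[id] = byParent[id].Count(); // placeholder for the real per-item processing
        }
        return counts;
    }
}
```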
