Is there anything faster than SqlDataReader in .NET? - c#

I need to load one column of strings from a table on SQL Server into an array in memory using C#.
Is there a faster way than opening a SqlDataReader and looping through it?
The table is large and time is critical.
EDIT
I am trying to build a .dll and use it on the server for some operations on the database, but it is too slow at the moment. If this is the fastest approach, then I have to redesign the database. I thought there might be some way to speed things up.

Data Reader
About the fastest access you will get to SQL is with the SqlDataReader.
Profile it
It's worth actually profiling where your performance issue is. Usually, the place you think the problem is turns out to be completely wrong once you've profiled it.
For example it could be:
The time... the query takes to run
The time... the data takes to copy across the network/process boundary
The time... .Net takes to load the data into memory
The time... your code takes to do something with it
Profiling each of these in isolation will give you a better idea of where your bottleneck is. For profiling your code, there is a great article from Microsoft
Cache it
The thing to look at to improve performance is to work out if you need to load all that data every time. Can the list (or part of it) be cached? Take a look at the new System.Runtime.Caching namespace.
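As a rough illustration, here is a minimal caching sketch using MemoryCache from that namespace (it needs a reference to the System.Runtime.Caching assembly); the cache key, the sliding expiration, and the loader delegate standing in for your existing SqlDataReader loop are all assumptions to tune to your situation.

    using System;
    using System.Runtime.Caching;

    // Minimal caching sketch: the loader delegate stands in for whatever
    // SqlDataReader loop currently fills the array.
    static string[] GetNames(Func<string[]> loadFromDatabase)
    {
        var cache = MemoryCache.Default;
        var names = cache.Get("names") as string[];
        if (names == null)
        {
            names = loadFromDatabase();                   // only hit SQL Server on a cache miss
            cache.Set("names", names, new CacheItemPolicy
            {
                SlidingExpiration = TimeSpan.FromMinutes(10)   // tune to how stale the list may be
            });
        }
        return names;
    }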
Rewrite as T-SQL
If you're doing purely data operations (as your question suggests), you could rewrite the code that uses the data as T-SQL and run it natively on SQL Server. This has the potential to be much faster, as you will be working with the data directly and not shifting it about.
If your code has a lot of necessary procedural logic, you can try mixing T-SQL with CLR Integration, giving you the benefits of both worlds.
This very much comes down to the complexity (or more procedural nature) of your logic.
If all else fails
If all areas are optimal (or as near as makes no difference) and your design is without fault, I wouldn't even get into micro-optimisation; I'd just throw hardware at it.
What hardware? Use the Reliability and Performance Monitor to find out where the bottleneck is. For the problem you describe, the most likely culprits are the HDD or RAM.

If SqlDataReader isn't fast enough, perhaps you should store your stuff somewhere else, such as an (in-memory) cache.

No. It is actually not only the fastest way - it is the ONLY (!) way. All other mechanisms INTERNALLY use a DataReader anyway.

I suspect that SqlDataReader is about as good as you're going to get.

SqlDataReader is the fastest way. Make sure you use the get-by-ordinal methods rather than get-by-column-name, e.g. GetString(1).
Also worthwhile is experimenting with MinPoolSize in the connection string so that there are always some connections in the pool.
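For illustration, a minimal sketch of both suggestions; the table/column names are placeholders, and the Min Pool Size value is just an example to experiment with.

    using System.Collections.Generic;
    using System.Data.SqlClient;

    // Connection string example (placeholder): "Data Source=...;Min Pool Size=5"
    static List<string> LoadColumn(string connectionString)
    {
        var results = new List<string>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT Name FROM dbo.MyTable", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    results.Add(reader.GetString(0));   // ordinal access: no per-row column-name lookup
            }
        }
        return results;
    }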

The SqlDataReader will be the fastest way.
Optimize your use of it by calling the appropriate GetXxx method, which takes an ordinal as a parameter.
If it is not fast enough, see if you can tweak your query. Put a covering index on the column(s) that you want to retrieve. By doing so, SQL Server only has to read the index and does not have to go to the table itself to retrieve all the required info.

What about transforming one column of rows to one row of columns, and having only one row to read? SqlDataReader has an optimization for reading a single row (System.Data.CommandBehavior.SingleRow argument of ExecuteReader), so maybe it can improve the speed a bit.
I see several advantages:
Single row improvement,
No need to access an array on each iteration (reader[0]),
Copying the whole row out of the reader into an array in one call may be faster than looping through the elements and adding each one to a new array.
On the other hand, it has the disadvantage of forcing the SQL database to do more work.
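A rough sketch of what the reading side might look like, assuming the pivot has already been done server-side (dbo.PivotedView is a placeholder for whatever query returns the data as one wide row):

    using System.Data;
    using System.Data.SqlClient;

    static object[] ReadSingleRow(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT * FROM dbo.PivotedView", conn))  // hypothetical pivoted source
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader(CommandBehavior.SingleRow))
            {
                reader.Read();
                var values = new object[reader.FieldCount];
                reader.GetValues(values);   // copies every column of the single row in one call
                return values;
            }
        }
    }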

"Provides a way of reading a forward-only stream of rows from a SQL Server database" This is the use of SqlDataReader from MSDN . The Data structure behind SqlDataReder only allow read forward, it's optimized for reading data in one direction. In my opinion, I want to use SqlDataReader than DataSet for simple data reading.

You have 4 sets of overheads:
- Disk access
- .NET code (CPU)
- SQL Server code (CPU)
- Time to switch between managed and unmanaged code (CPU)
Firstly, is
select column from table where column = 'junk'
fast enough for you? If not, the only solution is to make the disk faster (you can pull data out of SQL Server faster than it can read it off disk).
You may be able to define a SQL Server function in C# and then run the function over the column; sorry, I don't know how to do that. It may be faster than a data reader.
If you have more than one CPU, and you know a value in the middle of the table, you could try using more than one thread.
You may be able to write some TSQL that combines all the strings into a single string using a separator you know is safe. Then split the string up again in C#. This will reduce the number of round trips between managed and unmanaged code.
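A sketch of that idea, assuming SQL Server 2017+ for STRING_AGG (older versions can build the same combined string with FOR XML PATH) and assuming the chosen separator character never appears in the data; the table/column names are placeholders.

    using System.Data.SqlClient;

    static string[] LoadColumnInOneValue(string connectionString)
    {
        // '|' is assumed to be a safe separator that never occurs in the data.
        const string sql =
            "SELECT STRING_AGG(CAST(Name AS nvarchar(max)), '|') FROM dbo.MyTable";
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            var combined = (string)cmd.ExecuteScalar();   // a single value crosses the boundary
            return combined.Split('|');                   // split back into an array in C#
        }
    }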

Some surface-level things to consider that may affect speed (besides a data-reader):
Database Query Optimization
OrderBy is expensive
Distinct is expensive
RowCount is expensive
GroupBy is expensive
etc. Sometimes you can't live without these things, but if you can handle some of these things in your C# code instead, it may be faster.
Database Table indexing (for starters, are the fields in your WHERE clause indexed?)
Database Table DataTypes (are you using the smallest possible, given the data?)
Why are you converting the datareader to an array?
e.g., would it serve just as well to create an adapter/datatable that you then would not need to convert to an array?
Have you looked into Entity Framework? (might be slower...but if you're out of options, might be worthwhile to look into just to make sure)
Just random thoughts. Not sure what might help in your situation.

If responsiveness is an issue when loading a great deal of data, look at using the asynchronous methods - SqlCommand.BeginExecuteReader.
I use this all the time for populating large GUI elements in the background while the app continues to be responsive.
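A sketch of the asynchronous approach; this uses the Task-based ExecuteReaderAsync/ReadAsync form (the successor to the BeginExecuteReader/EndExecuteReader pattern), and the table/column names are placeholders.

    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.Threading.Tasks;

    static async Task<List<string>> LoadNamesAsync(string connectionString)
    {
        var results = new List<string>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT Name FROM dbo.MyTable", conn))
        {
            await conn.OpenAsync();
            using (var reader = await cmd.ExecuteReaderAsync())
            {
                while (await reader.ReadAsync())
                    results.Add(reader.GetString(0));   // the UI thread stays free while rows stream in
            }
        }
        return results;
    }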
You haven't said exactly how large this data is, or why you are loading it all into an array.
Often times, for large amounts of data, you may want to leave it in the database or let the database do the heavy lifting. But we'd need to know what kind of processing you are doing that needs it all in an array at one time.

Related

Performance: Lots of queries or lots of processing?

Currently I am creating a C# application which has to read a lot of data (over 2,000,000 records) from an existing database and compare it with a lot of other data (also about 2,000,000 records) which do not exist in the database. These comparisons will mostly be String comparisons. The amount of data will grow much bigger and therefore I need to know which solution will result in the best performance.
I have already searched the internet and came up with two solutions:
Solution 1
The application will execute a single query (SELECT column_name FROM table_name, for example) and store all the data in a DataTable. The application will then compare all the stored data with the input, and if there is a comparison it will be written to the database.
Pros:
The query will only be executed once. After that, I can use the stored data multiple times for all incoming records.
Cons:
As the database grows bigger, so will my RAM usage. Currently I have to work with 1GB (I know, tough life) and I'm afraid it won't fit if I'd practically download the whole content of the database in it.
Processing all the data will take lots and lots of time.
Solution 2
The application will execute a specific query for every record, for example
SELECT column_name FROM table_name WHERE value_name = value
and will then check whether the DataTable has any records, something like
if (datatable.Rows.Count > 0) { /* etc */ }
If it has records, I can conclude there are matching records and I can write to the database.
Pros:
Probably a lot less usage of RAM since I will only get specific data.
Processing goes a lot faster.
Cons:
I will have to execute a lot of queries. If you are interested in numbers, it will probably be around 5 queries per record. With 2,000,000 records, that would be 10,000,000 queries.
My question is, what would be the smartest option, given that I have limited RAM?
Any other suggestions are welcome as well, of course.
If you have SQL Server available to you, this seems like a job directly suited to SQL Server Integration Services. You might consider using that tool instead of building your own. It depends on your exact business needs, but in general data merging like this would be a batch/unattended or tool-based operation.
You might be able to code it to run faster than SSIS, but I'd give it a try just to see if it's acceptable to you, and save yourself the cost of the custom development.
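If you do go the Solution 1 route yourself, one sketch worth considering (assuming the comparison is on a single column, as in the example query) is to load only that column into a HashSet<string> rather than a full DataTable; it keeps the memory footprint close to the strings themselves and gives O(1) membership checks. The table and column names below come from the example query, and the comparer choice is an assumption.

    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;

    static HashSet<string> LoadExistingValues(string connectionString)
    {
        var existing = new HashSet<string>(StringComparer.Ordinal);   // pick the comparer your matching rules need
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT column_name FROM table_name", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    existing.Add(reader.GetString(0));
            }
        }
        return existing;
    }

    // usage: if (existing.Contains(incomingValue)) { /* write the match to the database */ }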

Is there a design pattern that would help to compare a big amount of data?

I need to compare particular content of 2 SQL tables located on different servers: Table1 and Table2.
I want to compare each row from Table1 against the whole content of Table2.
The comparison logic is kind of complicated, so I want to apply a logical operator that I will write in C#; I don't want to do the comparison in the SQL query itself.
My concern is that the size of the data I will work on will be around 200 MB.
I was thinking of loading the data into a DataTable using ADO.NET and doing the comparison in memory.
What would you recommend? Is there already a pattern like approach to compare massive data?
200 MB should not be a problem. A .NET application can handle much more than that at once.
But even so, I would probably use a forward-only data reader for Table 1, just because there's no good reason not to, and that should reduce the amount of memory required. You can keep table 2 in memory with whatever structure you are accustomed to.
You can use two SqlDataReaders. They only hold one row in memory at a time, are forward-only, and are extremely efficient. After getting a row back from a reader you can then compare the values; there is an example of this on MSDN.
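A sketch combining the two answers above: Table 2 (assumed to have an int Id key and a string Payload column, both placeholders) is loaded once into a Dictionary, then Table 1 is streamed with a forward-only reader and each row is compared in memory; the compare delegate stands in for the complicated C# comparison logic.

    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;

    static void CompareTables(string connString1, string connString2,
                              Action<string, string> compare)
    {
        // Load Table2 into memory once, keyed by its (assumed) Id column.
        var table2 = new Dictionary<int, string>();
        using (var conn = new SqlConnection(connString2))
        using (var cmd = new SqlCommand("SELECT Id, Payload FROM Table2", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    table2[reader.GetInt32(0)] = reader.GetString(1);
        }

        // Stream Table1 forward-only: one row in memory at a time.
        using (var conn = new SqlConnection(connString1))
        using (var cmd = new SqlCommand("SELECT Id, Payload FROM Table1", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                {
                    string other;
                    if (table2.TryGetValue(reader.GetInt32(0), out other))
                        compare(reader.GetString(1), other);
                }
        }
    }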
The most scalable solution is to create SQLCLR functions to execute the comparisons you want.
You should probably avoid a row-by-row comparison at all costs. The network latency and delays due to round-tripping will result in extremely slow execution.
A quick&dirty solution is to extract the data to local files then do the comparison as you will pay the network tax only once. Unfortunately, you lose the speedup provided by database indexes and query optimizations.
A similar solution is to load all the data once in memory and then use indexing structures like dictionaries to provide additional speedup. This is probably doable as your data can fit in memory. You still pay the network tax only once but gain from faster execution.
The most scalable solution is to create SQLCLR code to create one or more functions that will perform the comparisons you want. This way you avoid the network tax altogether, avoid creating and optimizing your own structures in memory and can take advantage of indexes and optimizations.
These approaches may not be applicable, depending on the actual logic of the comparisons you are doing. The first two rely on sorting the data correctly.
1) Binary search. You can find the matching row in table 2 without scanning through all of table 2 by using a binary search; this will significantly reduce the number of comparisons.
2) If you are looking for overlaps/matches/missing rows between the two tables, you can sort both tables in the same order. Then you can loop through the two tables simultaneously, keeping a pointer to the current row of each table. If table 1 is "ahead" of table 2, you only increment the table 2 pointer until they are either equal or table 2 is ahead. Then, once table 2 is ahead, you start incrementing table 1 until it is ahead, and so on. In this way you only have to loop through each record of each table once, and you are guaranteed not to miss any matches (see the sketch after this list).
Whenever the two pointers match, that is a match; while table 1 is ahead, every table 2 row you pass over is "missing" from table 1, and vice versa.
This solution would also work if you only need to take some action if the rows are in a certain range of each other or something.
3) If you actually have to do some action for every row in table 2 for every row in table 1, then it's just two nested loops, and there is not much you can do to optimize that other than making the comparison/work as efficient as possible. You could possibly multi-thread it, though, depending on what the work is and where your bottleneck lies.
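A sketch of option 2, reduced to integer keys for brevity: both inputs must already be sorted ascending, and Console.WriteLine stands in for whatever action you take on a match or a missing row.

    using System;
    using System.Collections.Generic;

    static void MergeCompare(IEnumerator<int> t1, IEnumerator<int> t2)
    {
        bool has1 = t1.MoveNext(), has2 = t2.MoveNext();
        while (has1 && has2)
        {
            if (t1.Current == t2.Current)
            {
                Console.WriteLine("match: " + t1.Current);
                has1 = t1.MoveNext();
                has2 = t2.MoveNext();
            }
            else if (t1.Current < t2.Current)     // table 1 is behind: this row is missing from table 2
            {
                Console.WriteLine("missing from table 2: " + t1.Current);
                has1 = t1.MoveNext();
            }
            else                                  // table 2 is behind: this row is missing from table 1
            {
                Console.WriteLine("missing from table 1: " + t2.Current);
                has2 = t2.MoveNext();
            }
        }
        while (has1) { Console.WriteLine("missing from table 2: " + t1.Current); has1 = t1.MoveNext(); }
        while (has2) { Console.WriteLine("missing from table 1: " + t2.Current); has2 = t2.MoveNext(); }
    }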
Can you stage the data to the same database using a quick ETL/SSIS job? That would let you do set operations, which might be easier to deal with. If not, I would agree with the recommendations for a forward-only data reader with one table held in memory.
A couple years ago I wrote a db table comparison tool, which is now an open-source project called Data Comparisons.
You can check out the source code if you want. There is a massive optimization you can make when the two tables you're comparing are on the same physical server, because you can write a SQL query to take care of this. I called this the "Quick compare" method in Data Comparisons, and it's available whenever you're sharing the same connection string for both sides of the comparison.
When they're on two different servers, however, you have no choice but to pull the data into memory and compare the rows there. Using SqlDataReaders would work. However, it's complicated when you must know exactly what's different (what rows are missing from table A or table B, what rows are different, etc). For that reason my method was to use DataTables, which are slower but at least they provide you with the necessary functionality.
Building this tool was a learning process for me. There are probably opportunities for optimization with the in-memory comparison. For example, loading the data into a Dictionary and doing your comparisons off of primary keys with Linq would probably be faster. You could even try Parallel Linq and see if that helps. And as Jeffrey L Whitledge mentioned, you might as well use a SqlDataReader for one of the tables while the other is stored in memory.

LINQ to SQL - How to make this works with database faster

I have a problem. My LINQ to SQL queries are pushing data to the database at ~1000 rows per second, but this is much too slow for me. The objects are not complicated. CPU usage is <10% and bandwidth is not the bottleneck either.
The 10% is on the client; on the server it is 0%, or 1% at most, generally doing no work at all, not traversing indexes, etc.
Why is 1000/s slow? I need something around 20,000/s - 200,000/s to solve my problem; otherwise I will receive more data than I can process.
I don't use a transaction myself, but LINQ does: when I add, for example, a million new objects to the DataContext and run SubmitChanges(), the inserts run inside LINQ's internal transaction.
I don't use Parallel LINQ and I don't have many selects; in this scenario I'm mostly inserting objects, and I want to use all the resources I have, not only 5% of the CPU and 10 KB/s of network!
when I post for example a million objects
Forget it. LINQ to SQL is not intended for such large batch updates/inserts.
The problem is that LINQ to SQL will execute a separate insert (or update) statement for each insert (update). This kind of behaviour is not suitable for such large numbers.
For inserts you should look into SqlBulkCopy, because it is a lot faster (really, orders of magnitude faster).
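A minimal SqlBulkCopy sketch; the destination table name, the batch size, and the assumption that the DataTable's columns line up with the target schema are all placeholders to adapt.

    using System.Data;
    using System.Data.SqlClient;

    static void BulkInsert(string connectionString, DataTable rows)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.MyTable";   // must match the target table/columns
            bulk.BatchSize = 10000;                      // rows sent per round trip; tune as needed
            bulk.WriteToServer(rows);
        }
    }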
Some performance optimization can be achieved with LINQ to SQL by, first off, using precompiled queries. A large part of the cost is compiling the actual query.
http://www.albahari.com/nutshell/speedinguplinqtosql.aspx
http://msdn.microsoft.com/en-us/library/bb399335.aspx
Also, you can disable object tracking, which may shave off a few more milliseconds. This is done on the DataContext right after you instantiate it.
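A sketch of both tweaks; MyDataContext, Person, and the query shape are placeholders for whatever your designer-generated LINQ to SQL classes look like.

    using System;
    using System.Data.Linq;
    using System.Linq;

    // Precompiled query: the expression tree is translated to SQL once, not on every call.
    static readonly Func<MyDataContext, string, IQueryable<Person>> ByName =
        CompiledQuery.Compile((MyDataContext db, string name) =>
            db.Persons.Where(p => p.Name == name));

    static void Example(string connectionString)
    {
        using (var db = new MyDataContext(connectionString))
        {
            db.ObjectTrackingEnabled = false;   // read-only use: skips change-tracking overhead
            var matches = ByName(db, "Smith").ToList();
        }
    }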
I also encountered this problem before. The solution I used is Entity Framework; there is a tutorial here. One traditional way is to use LINQ to Entities, which has similar syntax and seamless integration with C# objects; in my experience this gave roughly a 10x speedup. But a more efficient way (by an order of magnitude) is to write the SQL statement yourself and then use the ExecuteStoreQuery function to fetch the results. It requires you to write SQL rather than LINQ statements, but the returned results can still be read by C# easily.

How can I use a very large dictionary in C#?

I want to use a lookup map or dictionary in a C# application, but it is expected to store 1-2 GB of data.
Can someone please tell me if I will still be able to use the Dictionary class, or if I need to use something else?
EDIT: We have an existing application which uses an Oracle database to query or look up object details. It is, however, too slow, since the same objects are repeatedly queried. I felt it might be ideal to use a lookup map for this scenario to improve the response time. However, I am worried the size will make it a problem.
Short Answer
Yes. If your machine has enough memory for the structure (and the overhead of the rest of the program and system including operating system).
Long Answer
Are you sure you want to? Without knowing more about your application, it's difficult to know what to suggest.
Where is the data coming from? A file? Files? A database? Services?
Is this a caching mechanism? If so, can you expire items out of the cache once they haven't been accessed for a while? This way, you don't have to hold everything in memory all the time.
As others have suggested, if you're just trying to store lots of data, can you just use a database? That way you don't have to have all of the information in memory at once. With indexing, most databases are excellent at performing fast retrieves. You could combine this approach with a cache.
Is the data that will be in memory read only, or will it have to be persisted back to some storage when something changes?
Scalability - do you expect the amount of data stored in this dictionary to increase over time? If so, you're going to hit a point where it's very expensive to buy machines that can handle this amount of data. In that case you might want to look at a distributed caching system (AppFabric comes to mind) so you can scale out horizontally (more machines) instead of vertically (one really big, expensive point of failure).
UPDATE
In light of the poster's edit, it sounds like caching would go a long way here. There are many ways to do this:
Simple dictionary caching - just cache stuff as it's requested (see the sketch after this list).
Memcache
Caching Application Block I'm not a huge fan of this implementation, but others have had success.
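For option 1, a bare-bones sketch (not thread-safe; wrap it or use ConcurrentDictionary if several threads share it). The loader delegate stands in for the existing Oracle lookup.

    using System;
    using System.Collections.Generic;

    class LookupCache<TKey, TValue>
    {
        private readonly Dictionary<TKey, TValue> _cache = new Dictionary<TKey, TValue>();
        private readonly Func<TKey, TValue> _load;

        public LookupCache(Func<TKey, TValue> load) { _load = load; }   // e.g. the slow Oracle query

        public TValue Get(TKey key)
        {
            TValue value;
            if (!_cache.TryGetValue(key, out value))
            {
                value = _load(key);      // only hits the database on a cache miss
                _cache[key] = value;
            }
            return value;
        }
    }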
As long as you're on a 64GB machine, yes, you should be able to use a dictionary that large. However, if you have THAT much data, a database may be more appropriate (Cassandra is really nothing but a gigantic dictionary, and there's always MySQL).
When you say 1-2GB of data, I assume that you mean the items are complex objects that cumulatively contain 1-2GB.
Unless they're structs (and they shouldn't be), the dictionary doesn't care how big the items are.
As long as you have less than about 2^24 items (I pulled that number out of a hat), you can store as much as you can fit in memory.
However, as everyone else has suggested, you should probably use a database instead.
You may want to use an in-memory database such as SQL CE.
You can, but for a dictionary as large as that you are better off using a database.
Use a database.
Make sure you've a good DB model, put correct indexes, and off you go.
You can use subdictionaries.
Dictionary<KeyA, Dictionary<KeyB ....
Where KeyA is some common part of KeyB.
For example, if you have a String dictionary you can use the First letter as KeyA.
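A sketch of that layout for string keys; the value type is left generic since the question doesn't say what is being stored.

    using System.Collections.Generic;

    class BucketedDictionary<TValue>
    {
        // Outer key: first character of the string key. Inner: full key -> value.
        private readonly Dictionary<char, Dictionary<string, TValue>> _buckets =
            new Dictionary<char, Dictionary<string, TValue>>();

        public void Add(string key, TValue value)
        {
            Dictionary<string, TValue> inner;
            if (!_buckets.TryGetValue(key[0], out inner))
                _buckets[key[0]] = inner = new Dictionary<string, TValue>();
            inner[key] = value;
        }

        public bool TryGet(string key, out TValue value)
        {
            Dictionary<string, TValue> inner;
            if (_buckets.TryGetValue(key[0], out inner))
                return inner.TryGetValue(key, out value);
            value = default(TValue);
            return false;
        }
    }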

Comparing i4o vs. PLINQ for larger collections

I have a question for anyone who has experience with i4o or PLINQ. I have a big object collection (about 400K items) that I need to query. The logic is very simple and straightforward. For example, there is a collection of Person objects, and I need to find the persons that match on firstName, lastName, date of birth, or the first initial of firstName/lastName, etc. It is just a time-consuming process using LINQ to Objects.
I am wondering if i4o (http://www.codeplex.com/i4o)
or PLINQ can help improve the query performance. Which one is better? And is there any other approach out there?
Thanks!
With 400k objects, I wonder whether a database (either in-process or out-of-process) wouldn't be a more appropriate answer. This then abstracts the index creation process. In particular, any database will support multiple different indexes over different column(s), making the queries cited all very supportable without having to code specifically for each (just let the query optimizer worry about it).
Working with it in-memory may be valid, but you might (with vanilla .NET) have to do a lot more manual index management. By the sounds of it, i4o would certainly be worth investigating, but I don't have any existing comparison data.
i4o: is meant to speed up querying with LINQ by using indexes, like in the old relational database days.
PLINQ: is meant to use extra CPU cores to process the query in parallel.
If performance is your target, then depending on your hardware I say go with i4o; it will make a hell of an improvement.
I haven't used i4o but I have used PLINQ.
Without knowing the specifics of the query you're trying to improve, it's hard to say which (if any) will help.
PLINQ allows for multiprocessing of queries, where applicable. There are times, however, when parallel processing won't help.
i4o looks like it helps with indexing, which will speed up some calls, but not others.
Bottom line is, it depends on the query being run.
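For the PLINQ side only (I won't guess at i4o's indexing API here), a sketch of the kind of query in the question; the Person shape mirrors the fields mentioned there.

    using System.Collections.Generic;
    using System.Linq;

    class Person
    {
        public string FirstName;
        public string LastName;
    }

    static List<Person> FindMatches(IEnumerable<Person> people, string first, string last)
    {
        return people
            .AsParallel()                                          // spread the scan across CPU cores
            .Where(p => p.FirstName == first && p.LastName == last)
            .ToList();
    }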
