I know there have been many discussions about what is fastest when it comes to fetching data from a database. Still, my problem is that I need to load a huge amount of data (between 20,000 and 5 million rows, depending on the conditions the user runs) in order to export it all from a DataGridView into Excel worksheets.
I know I could paginate the query behind the DataGridView, and I have already tested that (it works nicely and fast, by the way), but the underlying problem of exporting to Excel still exists, so I need to fetch everything. What I currently do is load the data on a separate thread, so the UI doesn't freeze while fetching.
The query in the database is as optimized as it gets - unfortunately it's not a simple one, since it even uses DB links with joins to produce the final result set. From the database side there is nothing more we can do.
I have already tested everything I could think of - working with the DataAdapter.Fill and DataReader.Load methods, with DataSets, DataTable and even List<T>. I did all of this to see the difference in performance (speed vs. memory usage).
The results actually surprised me a bit - given all the talk about DataReader.Load being faster than DataAdapter.Fill, and filling a List<T> being faster than filling a DataTable, I came to the conclusion that in my case those claims don't quite hold:
1.) Filling a List<T> was slower than filling a DataTable, but with lower memory consumption;
2.) DataReader.Load vs. DataAdapter.Fill makes almost no difference at all - sometimes one wins, sometimes the other, and the differences are a couple of seconds.
Anyway, nothing I tried produced the desired results, since loading about 2 million rows takes around 5 minutes in the best case.
Is there anything I have missed that beats all these methods when it comes to loading data quickly? Maybe some caching?
Thanks in advance for any advice!
P.S.: I would be satisfied with a plain answer on where to start; that's why I didn't include any code.
Related
This may be a dumb question, but I wanted to be sure. I am creating a WinForms app and using a C# OleDbConnection to connect to an MS Access database. Right now I am using "SELECT * FROM table_name" and looping through each row to see if it is the row with the criteria I want, then breaking out of the loop if it is. I wonder whether performance would improve if I used something like "SELECT * FROM table_name WHERE id=something" - so basically a "WHERE" clause instead of looping through every row?
The best way to validate the performance of anything is to test it. Otherwise you are making assumptions about what is best rather than measuring actual performance.
With that said, using a WHERE clause will be better 100% of the time than retrieving the data and then filtering via a loop. There are a few reasons for this, but ultimately you are filtering the rows on a column before retrieving them, versus retrieving every row and then throwing most of them away. Relational data should be dealt with using set logic, which is how a WHERE clause works: it operates on the data set. The loop is not set logic; it compares each individual row, expensively, discarding those that don't meet the criteria.
Don’t take my word for it though. Try it out. Especially try it out when your app has a lot of data in the table.
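For illustration, here is a minimal sketch of pushing the filter into a parameterized WHERE clause with OleDbCommand; the connection string, table name, column name and id value are placeholders rather than anything from the question:

using System.Data.OleDb;

static void FindRow(string connectionString, int someId)
{
    using (var connection = new OleDbConnection(connectionString))
    using (var command = new OleDbCommand(
        "SELECT * FROM table_name WHERE id = ?", connection))
    {
        // OleDb uses positional (?) parameters rather than named ones
        command.Parameters.AddWithValue("?", someId);
        connection.Open();
        using (OleDbDataReader reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                // Only the matching row(s) come back; no client-side loop
                // over the entire table is needed.
            }
        }
    }
}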
Yes, of course.
Say you have an Access database file shared in a folder, and you deploy your .NET desktop application to each workstation.
And furthermore, say the table has 1 million rows.
If you do this:
SELECT * from tblInvoice WHERE InvoiceNumber = 123245
Then ONLY one row is pulled down the network pipe - and this holds true EVEN if the table has 1 million rows. To traverse and pull 1 million rows is going to take a HUGE amount of time, but if you add criteria to your select, then it would be in this case about 1 million times faster to pull one row as opposed to the whole table.
And what if this is multi-user? Then again, even over a network, ONLY the one record that meets your criteria will be pulled. The only requirement for this "one row pull" over the network is that the Access data engine has a usable index on the criteria column. By default the PK column (ID) always has such an index, so no worries there. But if, as above, we are pulling invoice numbers from a table, then an index on that column (InvoiceNumber) is required for the data engine to pull only one row. If no index can be used, then behind the scenes all rows are scanned until a match occurs - and over a network that means a significant amount of data will be pulled (or, if local, read from the file on disk).
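If the column does not have an index yet, a one-time DDL statement takes care of it. A hedged sketch is below, reusing the tblInvoice/InvoiceNumber names from the example above; the connection string and index name are placeholders:

using System.Data.OleDb;

static void EnsureInvoiceNumberIndex(string connectionString)
{
    using (var connection = new OleDbConnection(connectionString))
    using (var command = new OleDbCommand(
        "CREATE INDEX idxInvoiceNumber ON tblInvoice (InvoiceNumber)", connection))
    {
        connection.Open();
        command.ExecuteNonQuery();   // throws if an index with this name already exists
    }
}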
I have a Windows application written in C# that needs to load 250,000 rows from the database and provide a "search as you type" feature, which means that as soon as the user types something in a text box, the application needs to search all 250,000 records (which are, by the way, a single column with 1,000 characters per row) using a LIKE-style search and display the matching records.
The approach I followed was:
1- The application loads all the records into a typed List<EmployeesData>
while (objSQLReader.Read())
{
lstEmployees.Add(new EmployeesData(
Convert.ToInt32(objSQLReader.GetString(0)),
objSQLReader.GetString(1),
objSQLReader.GetString(2)));
}
2- In the TextChanged event, using LINQ (combined with a regular expression), I search the list and attach the resulting IEnumerable<EmployeesData> to a ListView which is in Virtual Mode.
String strPattern = "(?=.*wood*)(?=.*james*)";
IEnumerable<EmployeesData> lstFoundItems = from objEmployee in lstEmployees
where Regex.IsMatch(objEmployee.SearchStr, strPattern, RegexOptions.IgnoreCase)
select objEmployee;
lstFoundEmployees = lstFoundItems;
3- The RetrieveVirtualItem event is handled to supply the items the ListView displays.
e.Item = new ListViewItem(new String[] {
lstFoundEmployees.ElementAt(e.ItemIndex).DateProjectTaskClient,
e.ItemIndex.ToString() });
Though lstEmployees loads relatively fast from SQL Server (1.5 seconds), the LINQ search on TextChanged takes more than 7 minutes. Searching directly in SQL Server with a LIKE query takes less than 7 seconds.
What am I doing wrong here? How can I make this search faster (no more than 2 seconds)? This is a requirement from my client, so any help is highly appreciated. Please help...
Does the database column that stores the text data have an index on it? If so, something similar to the trie structure that Nicholas described is already in use. Indexes in SQL Server are implemented using B+ trees, which have an average search time on the order of log base 2 of n, where n is the number of records (the height of the tree grows with that logarithm). This means that if you have 250,000 records in the table, the number of operations required to search is log base 2 of 250,000, or approximately 18 operations.
When you load all of the information from a data reader and then use a LINQ expression, it's a linear operation, O(n), where n is the length of the list. So worst case it's going to be 250,000 operations. If you use a DataView there will be indexes that can be used to help with searching, which will drastically improve performance.
At the end of the day, if there will not be too many requests submitted against the database server, leverage the query optimizer to do this. As long as the LIKE operation isn't performed with a wildcard at the front of the string (i.e. LIKE '%some_string', which negates the use of an index) and there is an index on the column, you will have really fast performance. If there are just too many requests that will be submitted to the database server, either put all of the information into a DataView so an index can be used, or use a dictionary as Tim suggested above, which has a search time of O(1) (on the order of one), assuming the dictionary is implemented using a hash table.
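As a rough sketch of the dictionary idea (not the poster's code): the EmployeesData type, lstEmployees and the SearchStr property come from the question, while splitting the search string into whitespace-separated tokens as keys is purely an assumption for illustration:

using System;
using System.Collections.Generic;

// Build the index once, after loading lstEmployees.
var index = new Dictionary<string, List<EmployeesData>>(StringComparer.OrdinalIgnoreCase);
foreach (EmployeesData employee in lstEmployees)
{
    string[] tokens = employee.SearchStr.Split(
        new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (string token in tokens)
    {
        List<EmployeesData> bucket;
        if (!index.TryGetValue(token, out bucket))
        {
            bucket = new List<EmployeesData>();
            index[token] = bucket;
        }
        bucket.Add(employee);
    }
}

// Lookup on each keystroke: one hash probe per typed word
// instead of a regex scan over 250,000 rows.
List<EmployeesData> matches;
if (!index.TryGetValue("wood", out matches))
    matches = new List<EmployeesData>();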
You'd be wanting to preload things and build yourself a data structure called a trie.
It's memory-intensive, but it's what the doctor ordered in this case.
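A minimal prefix-trie sketch along those lines is below. The node layout and the choice of storing record ids on every prefix node are illustrative, not from any particular library, and they are exactly what makes the structure memory-hungry:

using System.Collections.Generic;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public List<int> RecordIds = new List<int>();   // records reachable through this prefix
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Insert(string word, int recordId)
    {
        TrieNode node = _root;
        foreach (char c in word.ToLowerInvariant())
        {
            TrieNode child;
            if (!node.Children.TryGetValue(c, out child))
            {
                child = new TrieNode();
                node.Children[c] = child;
            }
            node = child;
            node.RecordIds.Add(recordId);   // every prefix points back at the record
        }
    }

    // Returns the ids of records whose indexed words start with the given prefix.
    public List<int> Search(string prefix)
    {
        TrieNode node = _root;
        foreach (char c in prefix.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(c, out node))
                return new List<int>();
        }
        return node.RecordIds;
    }
}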
See my answer to this question. If you need instant response (i.e. as fast as a user types), loading the data into memory can be a very attractive option. It may use a bit of memory, but it is very fast.
Even though there are many characters (250K records * 1000), how many unique values are there? An in-memory structure based off of keys with pointers to records matching those keys really doesn't have to be that big, even accounting for permutations of those keys.
If the data truly won't fit into memory or changes frequently, keep it in the database and use SQL Server Full Text Indexing, which will handle searches like this much better than a LIKE. This assumes a fast connection from the application to the database.
Full Text Indexing offers a powerful set of operators/expressions which can be used to make searches more intelligent. It's available with the free SQL Server Express Edition, which will handle up to 10GB of data.
If the records can be sorted, you may want to go with a binary search, which is much, much faster for large data sets. There are several implementations in .NET collections, like List<T> and Array.
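As a hedged sketch of that last idea, reusing the lstEmployees/SearchStr names from the question: sort the keys once, then use List<T>.BinarySearch to find the insertion point for what the user has typed and walk forward while the prefix still matches. Note this only helps when the match is anchored at the start of the sorted key, not for arbitrary substrings:

using System;
using System.Collections.Generic;

var keys = new List<string>();
foreach (EmployeesData e in lstEmployees)
    keys.Add(e.SearchStr);
keys.Sort(StringComparer.OrdinalIgnoreCase);   // one-time cost after loading

string typed = "wood";   // whatever the user has typed so far
int pos = keys.BinarySearch(typed, StringComparer.OrdinalIgnoreCase);
if (pos < 0)
    pos = ~pos;   // BinarySearch returns the bitwise complement of the insertion point

var matches = new List<string>();
while (pos < keys.Count &&
       keys[pos].StartsWith(typed, StringComparison.OrdinalIgnoreCase))
{
    matches.Add(keys[pos]);
    pos++;
}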
I have an app that runs a SQL query and loads a set of data into a DataTable. As part of the processing there are 6 or 7 DataTable.Select() calls to filter some data. Each item that needs processing takes 300 ms, and there are 5,000 items to process, so it takes 25 minutes. This is unacceptable.
Would creating POCOs and loading them into a List and then using LINQ to query the list be quicker than using DataTable.Select?
Thanks
UPDATE: I have dug in a bit more. There are 2 DataTables, each with around 15,000 records. The 2 queries used to populate them take a second each. It then takes 25 minutes to loop over 5,000 items in a Dictionary's Values property and do 5 DataTable.Select calls, e.g.:
foreach (OutputRecord Mailpiece in DictionaryMailpieces.Values)
{
try
{
DataRow[] R = DataTable1.Select("MAILPIECE = " + Mailpiece.MailpieceSetSequenceNumber + " AND (STATUS = 4034 OR STATUS = 4037)", "DAL_DATE desc");
if (R != null && R.Length > 0)
{
}
}
catch
{
}
}
Funny that there is no "SQL" tag associated with your question. I suggest you learn how to use the SQL language and its benefits. From what you say, it is likely that your code is creating a lot of Cartesian products instead of leveraging relational database facilities (joins, indexes, etc.).
Using cross joins of DataTables or Lists or anything similar will always lead to heavy performance degradation, whatever language or platform is used.
That said, you could use LINQ because it's capable of producing smart SQL (dynamically), but you still want to avoid ToList(), ToArray() and similar extension methods on IEnumerable<T> that pull in all the underlying data (keep it enumerable from end to end and leverage "object streaming" whenever possible). If you really understand what a relational database is and how to use it efficiently, you will be a better LINQ developer.
Almost anything will be faster than manipulating an ADO.NET DataTable -- they are not designed for fast retrieval in any sense. You should also put the objects into an appropriate data structure; a DataTable is a red-black binary tree of rows, so if you don't want that, you shouldn't use one.
If you're just using the DataTable as a sequential collection of rows with fields, then you'll probably see a factor of 2 or more speedup just by replacing the DataTable with a List<T> and replacing your Select calls with Where calls, although it depends on what you're doing with it.
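For example, here is a hedged sketch of what that replacement could look like for the Select call in the question. The MailpieceRow POCO and its property names are guesses based on the column names shown, not the poster's actual types:

using System;
using System.Collections.Generic;
using System.Linq;

// Assumed POCO shape, inferred from the columns used in the Select expression.
class MailpieceRow
{
    public int Mailpiece { get; set; }
    public int Status { get; set; }
    public DateTime DalDate { get; set; }
}

// Rough equivalent of DataTable1.Select("MAILPIECE = ... AND (STATUS = 4034 OR STATUS = 4037)", "DAL_DATE desc")
// run against a pre-loaded List<MailpieceRow>:
static MailpieceRow FindNewestMatch(List<MailpieceRow> rows, int mailpieceSetSequenceNumber)
{
    return rows
        .Where(r => r.Mailpiece == mailpieceSetSequenceNumber
                 && (r.Status == 4034 || r.Status == 4037))
        .OrderByDescending(r => r.DalDate)
        .FirstOrDefault();   // newest matching row, or null if none
}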
EDIT: Actually, I changed my mind. Nothing you could be doing sorting-or-filtering-wise with 5000 items in a DataTable implies a cost of anywhere close to 300ms, so the bottleneck is probably unrelated.
Using LINQ will most likely not provide a huge speed improvement, in and of itself. That being said, you could potentially use PLINQ to simplify the parallelization of the processing, which could allow this to scale better on multicore systems. This tends to be much simpler when using POCOs instead of DataTable, as DataTable is not thread safe, and has concurrency issues.
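A hedged sketch of that parallelization is below; it assumes the per-item work is independent and thread-safe, and that .NET 4 or later is available for PLINQ. DictionaryMailpieces comes from the question, while ProcessMailpiece is a hypothetical stand-in for the 300 ms of work done per item:

using System.Linq;

var results = DictionaryMailpieces.Values
    .AsParallel()
    .Select(mailpiece => ProcessMailpiece(mailpiece))   // the existing per-item work, unchanged
    .ToList();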
That being said - I suspect that profiling this process would, overall, give you a much better potential improvement, as it would allow you to find and correct any bottlenecks. If there are no specific bottlenecks, and the process just requires that amount of raw processing, caching may also help. In addition, it's possible that leaving the data on the database and using some form of ORM may help, as well, as the "6 or 7" filter operations could be run on a scalable server instead of locally. All of this is highly dependent on the nature of your data and algorithm, however, so it would require some careful consideration to determine whether it would be beneficial or detrimental overall.
Would creating POCOs and loading them into a List and then using LINQ to query the list be quicker than using DataTable.Select?
We have no idea, you didn't give us enough information. We have no idea how your method is coded (maybe you have an errant Thread.Sleep(300) buried in your code; we can't tell).
More importantly, we need to know where the bottleneck is. To figure that out, you need a profiler. Get one, and once you know what the bottleneck is, we can probably help you eke out some extra performance.
That said, switching to LINQ probably isn't going to single-handedly be the solution to your performance problems. Something else is wrong, and whether it is coded using DataTables and LINQ is mostly irrelevant. The performance gains are going to come from having the right plan of attack to your problem; DataTables and LINQ are just ways of implementing that plan of attack.
I've got some text data that I'm loading into a SQL Server 2005 database using Linq-to-SQL with this method (pseudo-code):
Create a DataContext
While (new data exists)
{
Read a record from the text file
Create a new Record
Populate the record
dataContext.InsertOnSubmit(record);
}
dataContext.SubmitChanges();
The code is a little C# console application. This works fine so far, but I'm about to do an import of the real data (rather than a test subset) and this contains about 2 million rows instead of the 1000 I've tested. Am I going to have to do some clever batching or something similar to avoid the code falling over or performing woefully, or should Linq-to-SQL handle this gracefully?
It looks like this would work; however, the changes (and thus memory) kept by the DataContext are going to grow with each InsertOnSubmit. Maybe it's advisable to perform a SubmitChanges every 100 records?
I would also take a look at SqlBulkCopy to see if it doesn't fit your use case better.
If you need to do bulk inserts, you should check out SqlBulkCopy.
Linq-to-SQL is not really suited for doing large-scale bulk inserts.
You would want to call SubmitChanges() every 1000 records or so to flush the changes so far otherwise you'll run out of memory.
If you want performance, you might want to bypass Linq-To-SQL and go for System.Data.SqlClient.SqlBulkCopy instead.
Just for the record, I did as marc_s and Peter suggested and chunked the data. It's not especially fast (it took about an hour and a half in the Debug configuration, with the debugger attached and quite a lot of console progress output), but it's perfectly adequate for our needs:
Create a DataContext
numRows = 0;
While (new data exists)
{
Read a record from the text file
Create a new Record
Populate the record
dataContext.InsertOnSubmit(record)
// Submit the changes in thousand row batches
if (numRows % 1000 == 999)
dataContext.SubmitChanges()
numRows++
}
dataContext.SubmitChanges()
Nightly, I need to fill a SQL Server 2005 table from an ODBC source with over 8 million records. Currently I am using an INSERT statement from a linked server with a SELECT similar to this:
INSERT INTO SQLStagingTable SELECT * FROM OPENQUERY(ODBCSource, 'SELECT * FROM SourceTable')
This is really inefficient and takes hours to run. I'm in the middle of coding a solution using SqlBulkCopy code similar to the code found in this question.
The code in that question first populates a DataTable in memory and then passes that DataTable to SqlBulkCopy's WriteToServer method.
What should I do if the populated datatable uses more memory than is available on the machine it is running (a server with 16GB of memory in my case)?
I've thought about using the overloaded OdbcDataAdapter.Fill method, which allows you to fill only the records from x to n (where x is the start index and n is the number of records to fill). However, that could turn out to be an even slower solution than what I currently have, since it would mean re-running the SELECT statement on the source a number of times.
What should I do? Just populate the whole thing at once and let the OS manage the memory? Should I populate it in chunks? Is there another solution I haven't thought of?
The easiest way would be to use ExecuteReader() against your ODBC data source and pass the IDataReader to the WriteToServer(IDataReader) overload.
Most data reader implementations will only keep a very small portion of the total results in memory.
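A minimal sketch of that streaming approach is below; the connection strings, table names and batch size are placeholders rather than anything from the original post:

using System.Data;
using System.Data.Odbc;
using System.Data.SqlClient;

static void BulkCopyFromOdbc(string odbcConnectionString, string sqlConnectionString)
{
    using (var source = new OdbcConnection(odbcConnectionString))
    using (var command = new OdbcCommand("SELECT * FROM SourceTable", source))
    {
        source.Open();
        using (IDataReader reader = command.ExecuteReader())
        using (var bulkCopy = new SqlBulkCopy(sqlConnectionString))
        {
            bulkCopy.DestinationTableName = "dbo.SQLStagingTable";
            bulkCopy.BatchSize = 10000;       // commit in chunks rather than one huge transaction
            bulkCopy.BulkCopyTimeout = 0;     // no timeout for a long-running load
            bulkCopy.WriteToServer(reader);   // streams rows; the full result set never sits in memory
        }
    }
}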
SSIS performs well and is very tweakable. In my experience 8 million rows is not out of its league. One of my larger ETLs pulls in 24 million rows a day and does major conversions and dimensional data warehouse manipulations.
If you have indexes on the destination table, you might consider disabling them until the records have been inserted, then rebuilding them afterwards?
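A hedged sketch of that, assuming SQL Server 2005 or later and that only nonclustered indexes are disabled (disabling the clustered index would make the table unusable); the index and table names are placeholders:

using System.Data.SqlClient;

static void RunDdl(string sqlConnectionString, string ddl)
{
    using (var connection = new SqlConnection(sqlConnectionString))
    using (var command = new SqlCommand(ddl, connection))
    {
        connection.Open();
        command.CommandTimeout = 0;   // rebuilding indexes on a large table can take a while
        command.ExecuteNonQuery();
    }
}

// Before the bulk load:
//   RunDdl(conn, "ALTER INDEX IX_SQLStagingTable_SomeColumn ON dbo.SQLStagingTable DISABLE");
// After the bulk load:
//   RunDdl(conn, "ALTER INDEX ALL ON dbo.SQLStagingTable REBUILD");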