Use linq to iterate through large DB tables

Use linq to iterate through large DB tables - c#

I have two tables: Foo and Bar. For each row in Foo, I now want to add a row in Bar which references the respective Foo record. Foo will likely contain several millions of records.
Normally this answer would have been perfect: linq to sql - loop through table data and set value. But as it says on the tin, using the following line is not particularly ideal for large tables.
List<User> users = dc.Users.ToList();
Since caching the entire table in a List<> is not going to work, what other options do I have? Is there an elegant way to "page through" the records, for instance? Since I am quite sure that this is a relatively common problem, I think it's likely that there is a best practice for this too. I have not been able to find it, however.

Your talking about several million rows of data, then Linq is not your friend.
Consider using a stored procedure or, if you like, DbContext.ExecuteCommand.
Both will result in a huge performance gain.

You can work with predefined batches using .Skip() and .Take() methods. Another thing to consider is using a trigger so that you don't need to worry about the second table at all.

Related

How to manage a million records?

I really need an expert's help to answer my query.
Here is the scenario:
Im using an sql select query to retrieve a million records.
I need to perform sorting and grouping on the resultant records which im storing in a datatable( in one execution)
and looping through it for grouping and sorting it.
I know this is so childish and not the right way to process it.
How can i manage the million records effectively and apply the grouping and sorting to it?
Really need help out here. Heard of executing the select query batch wise but how to implement the grouping and sorting while we dont have the entire data in hand?
I cannot go for sql order by and group by directly and that's against my requirement.
Here is what i'm doing right now:
I have the following objects, i.e the column names for grouping and Sorting
List<Group> groupList;
List<Sort> sortList;
DataTable reportData; // Here im having the entire records from db
Im looping through the 'reportData' row by row and matches the current and previous row for the custom grouping and sorting. Would like to know how the same can be done when we are using a batchwise execution or any alternative solution is there?

I need to perform sorting and grouping on the resultant records which
im storing in a datatable( in one execution) and looping through it
for grouping and sorting it.
What for?
Seriously.
Do not pull then try plaing smart with a stupid object model behind (and datasets are not particularly smart, sorry).
Group and sort in your select statement, pull the data lready grouped and joined and be done with it.
A million records was a small amount of data for sql server when the original version was release (4.2 it was, a port of sysase sql server) 17 years of so ago. These days it is something that fits likely into the processor thiird level cache and is nothing a proper sql server even realizes it has just processed.
SQL is particulaly good ad doing projects and ever since they indoruced MARS you can even run multiple queries over one connection, which comes in handy here.
So, go back - throw away the dataset and "I try to program a sort algo" and create proper SQL statements to pull the data as you need it.

Sounds like you should implement Partition Pruning. Partitioning will allow for a separation of content like you are requesting in order to have faster queries.

If I understood correctly, in your case, I would create a temporary database table with the structure I want especially to cover my grouping.
Then I would select the records from main tables and insert them to the temporary one appying all modifications including grouping.
A specific index on how you want them sorted should be also applied.
After that, just select from this table, do what you have to do, and finally if the data are not needed any more, delete the temporary table.
I would choose the above solution because a million of records in memory smells trouble to me...

For example:
1. Lets assume that you would like to group them by their DocumentTypeID
var groupByType = reportData.GroupBy(g=>g.DocumentTypeID);
2. Sorting Alphabetically
var sortAlphabetically = reportData.OrderBy(g=>g.DocumentName);
3. Grouping and Sorting
var groupAndSort = reportData.GroupBy(g=>g.DocumentTypeID)
.OrderBy(g=>g.DocumentName);
4. Sort and Group
var groupAndSort = reportData.OrderBy(g=>g.DocumentName)
.GroupBy(g=>g.DocumentTypeID);
5. Multiple Grouping and sorting
var multipleGroupAndSort = reportData.GroupBy(g=>g.DocumentTypeID)
.GroupBy(g=>g.CreatedOnDate.Month)
.OrderBy(g=>g.DocumentName);
so on and so forth...
But I would still discourage bringing million rows to application. It will cost memory. There are of course ways to manage it through stored procedures etc.

DataTable Select vs List<T> LINQ Performance

I have an app that does an SQL and loads a set of data into a datatable. As part of the processing there are 6 or 7 DataTable.Select() to filter some data. Each item that needs processing takes 300ms. There are 5000 items to process so takes 25 mins. This is unacceptable.
Would creating POCO's and loading them into a List and then using LINQ to query the list be quicker than using DataTable.Select?
Thanks
UPDATE: I have delved in a bit more and there are 2 datatables each with around 15000 records. The 2 queries used to populate the datatables take a second each. It then takes 25mins to loop over 5000 items in a Dictionary's values property and do 5 DataTable.Select's
eg/
foreach (OutputRecord Mailpiece in DictionaryMailpieces.Values)
{
try
{
DataRow[] R = DataTable1.Select("MAILPIECE = " + Mailpiece.MailpieceSetSequenceNumber + " AND (STATUS = 4034 OR STATUS = 4037)", "DAL_DATE desc");
if (R != null && R.Length > 0)
{
}
}
catch
{
}
}

Funny there is no "SQL" tag associated with your question. I suggest, you learn how to use the SQL language and its benefits. From what you say, it is likely you are, with your code, creating a lot of Cartesian products, instead of leveraging the Relational Database facilities (joins, indexes, etc.)
Using cross joins of DataTables or Lists or anything similar will always lead to heavy performance degradation, whatever language or platform is used.
That said, you could use LINQ because it's capable of producing smart SQL (dynamically), but you still want to avoid all ToList(), ToArray() and similar extension methods on IEnumerable(T) that summon all the underlying data (keep it enumerable from end to end and leverage "object streaming" whenever possible). If you understand really what is a Relational Database and how to use it efficiently, you will be a better LINQ developer.

Almost anything will be faster than manipulating an ADO.NET DataTable -- they are not designed for fast retrieval in any sense. You should also put the objects into an appropriate data structure; a DataTable is a red-black binary tree of rows, so if you don't want that, you shouldn't use one.
If you're just using the DataTable as a sequential collection of rows with fields, then you'll probably see a factor of 2 or more speedup just by replacing the DataTable with a List<T> and replacing your Select calls with Where calls, although it depends on what you're doing with it.
EDIT: Actually, I changed my mind. Nothing you could be doing sorting-or-filtering-wise with 5000 items in a DataTable implies a cost of anywhere close to 300ms, so the bottleneck is probably unrelated.

Using LINQ will most likely not provide a huge speed improvement, in and of itself. That being said, you could potentially use PLINQ to simplify the parallelization of the processing, which could allow this to scale better on multicore systems. This tends to be much simpler when using POCOs instead of DataTable, as DataTable is not thread safe, and has concurrency issues.
That being said - I suspect that profiling this process would, overall, give you a much better potential improvement, as it would allow you to find and correct any bottlenecks. If there are no specific bottlenecks, and the process just requires that amount of raw processing, caching may also help. In addition, it's possible that leaving the data on the database and using some form of ORM may help, as well, as the "6 or 7" filter operations could be run on a scalable server instead of locally. All of this is highly dependent on the nature of your data and algorithm, however, so it would require some careful consideration to determine whether it would be beneficial or detrimental overall.

Would creating POCO's and loading them into a List and then using LINQ to query the list be quicker than using DataTable.Select?
We have no idea, you didn't give us enough information. We have no idea how your method is coded (maybe you have an errant Thread.Sleep(300) buried in your code; we can't tell).
More importantly, we need to know where the bottleneck is. To figure that out, you need a profiler. Get one and then once you know what the bottleneck is, we can probably help you eek out some extra performance.
That said, switching to LINQ probably isn't going to single-handedly be the solution to your performance problems. Something else is wrong, and whether it is coded using DataTables and LINQ is mostly irrelevant. The performance gains are going to come from having the right plan of attack to your problem; DataTables and LINQ are just ways of implementing that plan of attack.

Where to do pagination/filtering? In the database or in the code?

I have to write the code for the following method:
public IEnumerable<Product> GetProducts(int pageNumber, int pageSize, string sortKey, string sortDirection, string locale, string filterKey, string filterValue)
The method will be used by a web UI and must support pagination, sorting and filtering. The database (SQL Server 2008) has ~250,000 products. My question is the following: where do I implement the pagination, sorting and filtering logic? Should I do it in a T-SQL stored procedure or in the C# code?
I think that it is better if I do it in T-SQL but I will end up with a very complex query. On the other hand, doing that in C# implies that I have to load the entire list of products, which is also bad...
Any idea what is the best option here? Am I missing an option?

You would definitely want to have the DB do this for you. Moving ~250K records up from the database for each request will be a huge overhead. If you are using LINQ-to-SQL, the Skip and Take methods will do this (here is an example), but I don't know exactly how efficient they are.

I think other (and potentionaly best) option is to use some higher level framework that shield you from complexity of query writing. EntityFramework, NHibernate and LINQ(toSQL) help you a lot. That said database is typically best place to do it in your case.

today itself I implement pagination for my website. I have done with stored procedure though I am using Entity-Framework. I found that executing a complex query is better then fetching all records and doing pagination with code. So do it with stored procedure.
And I see your code line, which you have attached, I have implemented in same way only.

I would definatly do it in a stored procedure something along the lines of :
SELECT * FROM (
SELECT
ROW_NUMBER() OVER (ORDER BY Quantity) AS row, *
FROM Products
) AS a WHERE row BETWEEN 11 AND 20
If you are using linq then the Take and Skip methods will take care of this for you.

Definitely in the DB for preference, if at all possible.
Sometimes you can mix things up a bit such as if you have the results returned from a database function (not a stored procedure, functions can be parts of larger queries in ways that stored procedures cannot), then you can have another function order and paginate, or perhaps have Linq2SQL or similar call for a page of results from said function, producing the correct SQL as needed.
If you can at least get the ordering done in the database, and will usually only want the first few pages (quite often happens in real use), then you can at least have reasonable performance for those cases, as only enough rows to skip to, and then take, the wanted rows need be loaded from the db. You of course still need to test that performance is reasonable in those rare cases where someone really does look for page 1,2312!
Still, that's only a compromise for cases where paging is very difficult indeed, as a rule always page in the DB unless it's either extremely difficult for some reason, or the total number of rows is guaranteed to be low.

Avoiding SQL Not IN by using Replace and Length check

I have a situation where I have to dynamically create my SQL strings and I'm trying to use paramaters and sp_executesql where possible so I can reuse query plans. In doing lots of reading online and personal experience I have found "NOT IN"s and "INNER/LEFT JOIN"s to be slow performers and expensive when the base (left-most) table is large (1.5M rows with like 50 columns). I also have read that using any type of function should be avoided as it slows down queries, so I'm wondering which is worse?
I have used this workaround in the past, although I'm not sure it's the best thing to do, to avoid using a "NOT IN" with a list of items when, for example I'm passing in a list of 3 character strings with, for example a pipe delimiter (only between elements):
LEN(#param1) = LEN(REPLACE(#param1, [col], ''))
instead of:
[col] NOT IN('ABD', 'RDF', 'TRM', 'HYP', 'UOE')
...imagine the list of strings being 1 to about 80 possible values long, and this method doesn't lend it self to paraterization either.
In this example I can use "=" for a NOT IN and I would use a traditional list technique for my IN, or != if that is a faster although I doubt it. Is this faster than using the NOT IN?
As a possible third alternative, what if I knew all the other possibilities (the IN possabilities, which could potentially be 80-95x longer list) and pass those instead; this would be done in the application's Business Layer as to take the workload off of the SQL Server. Not a very good possability for query plan reuse but if it shaves a sec or two off a big nasty query, why the hell not.
I'm also adept at SQL CLR function creation. Since the above is string manipulation would a CLR function be best?
Thoughts?
Thanks in advance for any and all help/advice/etc.

As Donald Knuth is often (mis)quoted, "premature optimization is the root of all evil".
So, first of all, are you sure that if you write your code in the most clear and simple way (to both write and read), it performs slowly? If not, check it, before starting to use any "clever" optimization tricks.
If the code is slow, check the query plans thouroughly. Most of the time query execution takes much longer than query compilation, so usually you do not have to worry about query plan reuse. Hence, building optimal indexes and/or table structures usually gives significantly better results than tweaking the ways the query is built.
For instance, I seriously doubt that your query with LEN and REPLACE has better performance than NOT IN - in either case all the rows will be scanned and checked for a match. For a long enough list MSSQL optimizer would automatically create a temp table to optimize equality comparison.
Even more, tricks like this tend to introduce bugs: say, your example would work incorrectly if [col] = 'AB'.
IN queries are often faster then NOT IN, because for IN queries only part of the rows is enough to be checked. The efficiency of the method depends on whether you can get a correct list for IN quickly enough.
Speaking of passing a variable-length list to the server, there're many discussions here on SO and elsewhere. Generally, your options are:
table-valued parameters (MSSQL 2008+ only),
dynamically constructed SQL (error prone and/or unsafe),
temp tables (good for long lists, probably too much overhead in writing and execution time for short ones),
delimited strings (good for short lists of 'well-behaved' values - like a handful of integers),
XML parameters (somewhat complex, but works well - if you use a good XML library and do not construct complex XML text 'by hand').
Here is an article with a good overview of these techniques and a few more.

I have found "NOT IN"s and "INNER/LEFT JOIN"s to be slow performers and expensive when the base (left-most) table is large
It shouldn't be slow if you indexed your table correctly. Something that can make the query slow is if you have a dependent subquery. That is, the query must be re-evaluated for each row in the table because the subquery references values from the outer query.
I also have read that using any type of function should be avoided as it slows down queries
It depends. SELECT function(x) FROM ... probably won't make a huge difference to the performance. The problems are when you use function of a column in other places in the query such as JOIN conditions, WHERE clause, or ORDER BY as it may mean that an index cannot be used. A function of a constant value is not a problem though.
Regarding your query, I'd try using [col] NOT IN ('ABD', 'RDF', 'TRM', 'HYP', 'UOE') first. If this is slow, make sure that you have indexed the table appropriately.

First off, since you are only filtering out a small percentage of the records, chances are the index on col isn't being used at all so SARG-ability is moot.
So that leaves query plan reuse.
If you are on SQL Server 2008, replace #param1 with a table-valued parameter, and have your application pass that instead of a delimited list. This solves your problem completely.
If you are on SQL Server 2005, I don't think it matters. You could split the delimited list and use NOT IN/NOT EXISTS against the table, but what's the point if you won't get an index seek on col?
Can anyone speak to the last point? Would splitting the list to a table var and then anti-joining it save enough CPU cycles to offset the setup cost?
EDIT, third method for SQL Server 2005 using XML, inspired by OMG Ponies' link:
DECLARE #not_in_xml XML
SET #not_in_xml = N'<values><value>ABD</value><value>RDF</value></values>'
SELECT * FROM Table1
WHERE #not_in_xml.exist('/values/value[text()=sql:column("col")]') = 0
I have no idea how well this performs compared to a delimited list or TVP.

How do I compare two datasets for equality

I have two datasets each with one data table pulled from different sources and I need to know if there are any differences in the data contained in the data tables. I'm trying to avoid looping and comparing each individual record or column, although there may be no other way. All I need to know is if there is a difference in the data, I do not need to know the details of any difference.
I have tried the below code, but it appears that dataset.Merge does not update rowstatus so dataset.HasChanges() always returns false. Any help is appreciated:
var currentDataSet = GetSomeData();
var historicalDataSet = GetSomeHistoricalData();
historicalDataSet.Merge(currentDataSet);
if (historicalDataSet.HasChanges()) DoSomeStuff();

I don't know of any built-in support for this and I wouldn't expect it either. So you'll have to do this by yourself in some way.
The most obvious way would be a brute force, table by table and row by row approach.
If you can rely on certain factors to be the same, ie exactly the same naming, ordering of records etc then you could test if saving both as XML and comparing the results might be an efficient trick.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.