DataSet usage when data source is very large? - c#

I have read several MS articles about when to use DataSets in conjunction with a database from within a WinForms application. I certainly like the ease of use DataSets offer, but have a few concerns when using them with a large data source. I want to use a SQLite database to locally store processed web log information. Potentially this could result in tens of thousands of rows of data.
When a DataSet is filled via a database table, does it end up containing ALL the data from the database, or does it contain only a portion of data from the database?
Could I use a DataSet to add rows to the database (perform an Update, for example), somehow 'clear' what the DataSet is holding in memory, and then add more rows?
So is it possible to essentially manage what a DataSet is currently holding in memory? If a DataSet represents a table that contains 100,000 rows, does that mean all 100,000 rows need to be loaded from the database into memory before it is even usable?
Thanks.

These are important points. They were raised at the very beginning of .NET, when we suddenly moved to the disconnected model that ADO.NET introduced.
The answer to your problem is paging. You need to manually code your grid or other display control so that it queries the database in chunks. For example, say you have a control (not a grid) with fields and a scroll bar, and the user scrolls 201 times. For the first 200 clicks it scrolls through the 200 records already loaded; on click #201 it queries the database for 200 more. You might also add logic to drop the oldest 200 records once the number of them in the DataSet reaches, say, 1,000. This is just an example.
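As a rough sketch of that chunked loading (assuming the System.Data.SQLite provider and a hypothetical LogEntries table), each page can be filled into its own small DataTable instead of pulling the whole table:

    using System.Data;
    using System.Data.SQLite; // System.Data.SQLite provider (assumption)

    // Fills one "page" of rows into a DataTable instead of the whole table.
    // Table and column names are hypothetical.
    static DataTable LoadPage(string connectionString, int pageIndex, int pageSize)
    {
        var table = new DataTable("LogEntries");
        using (var connection = new SQLiteConnection(connectionString))
        using (var command = new SQLiteCommand(
            "SELECT Id, Url, UserAgent, LoggedAt FROM LogEntries " +
            "ORDER BY Id LIMIT @PageSize OFFSET @Offset", connection))
        {
            command.Parameters.AddWithValue("@PageSize", pageSize);
            command.Parameters.AddWithValue("@Offset", pageIndex * pageSize);

            using (var adapter = new SQLiteDataAdapter(command))
            {
                adapter.Fill(table); // only pageSize rows end up in memory
            }
        }
        return table;
    }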
To save data you can add it to this same DataSet/DataTable. There are a few ways of doing it: a DataSet/DataTable can track which rows are new or edited, maintain relationships, and so on. In larger systems, entity lists typically encapsulate DataTables and provide customizations.
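For the add/save/clear cycle you describe, a minimal sketch (again assuming System.Data.SQLite and placeholder column names) might look like this; the key point is that adapter.Update pushes the new rows to the database and table.Clear() then frees them from memory:

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SQLite;

    // Adds parsed log lines in batches, writing each batch to SQLite and then
    // clearing the DataTable so it never grows unbounded.
    static void SaveInBatches(string connectionString, IEnumerable<string[][]> batches)
    {
        using (var connection = new SQLiteConnection(connectionString))
        using (var adapter = new SQLiteDataAdapter(
            "SELECT Url, UserAgent FROM LogEntries WHERE 1 = 0", connection))
        using (var builder = new SQLiteCommandBuilder(adapter)) // generates the INSERT command
        {
            var table = new DataTable("LogEntries");
            adapter.Fill(table); // loads only the schema (WHERE 1 = 0 returns no rows)

            foreach (string[][] batch in batches)
            {
                foreach (string[] fields in batch)
                {
                    var row = table.NewRow();
                    row["Url"] = fields[0];
                    row["UserAgent"] = fields[1];
                    table.Rows.Add(row);
                }

                adapter.Update(table); // writes only the new rows to SQLite
                table.Clear();         // frees the rows from memory before the next batch
            }
        }
    }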
You may also want to look into Entity Framework's capabilities, although I am not sure whether this kind of functionality is included there.
Basically, for a simple application with a small amount of data it is OK to use ADO.NET out of the box. But in a serious system there is normally a lot of groundwork with ADO.NET to provide a solid data access layer, plus additional work to create a good user experience. In this case that means loading data in chunks, because if you load 100K records the user first has to wait for the load and then has to scroll through all of them.
In the end, you need to look at what your application is, what it is for, and what will or will not be acceptable to the user.

Related

How to best add HUGE amount of data to DataGridView

I am reading a long text file (log files) with roughly 900,000 lines at the moment. I am then filling a DataTable object with the data, and up to that point everything is fine. But when assigning the huge DataTable object to DataGridView.DataSource, it takes ~10 minutes until the application is responsive again and the DataGridView shows the data. The same happens if I load the data into the DataGridView directly without a DataTable object. Is there a better way to work with this huge amount of data and a DataGridView?
Yes; you need to enable "virtual mode". This isn't entirely trivial, as you need to provide the code to provide cell values on-demand (rather than filling everything in advance), but it isn't horrendous either. Here's a complete walkthrough in the Microsoft docs.
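In outline it looks like the snippet below; logCache is a hypothetical structure of your own that keeps only a window of parsed lines in memory rather than all 900,000 rows:

    // Enable virtual mode and supply cell values on demand.
    dataGridView1.VirtualMode = true;
    dataGridView1.ColumnCount = 3;            // however many columns the log has
    dataGridView1.RowCount = totalLineCount;  // known up front, e.g. ~900,000

    dataGridView1.CellValueNeeded += (sender, e) =>
    {
        // logCache is a placeholder for your own row cache that reads
        // a window of lines around e.RowIndex from the file as needed.
        e.Value = logCache.GetValue(e.RowIndex, e.ColumnIndex);
    };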
However, from a UX perspective, maybe a better solution is to make it such that you don't need to display nearly a million lines in a grid. That isn't a useful user experience, in most cases.

SSIS - Truncate data or update data

Here is my situation.
I have a hierarchical data set that refreshes every night at 1AM. The set itself is fairly small (200K rows).
I've come up with two approaches:
1. Load the data, compare it to the existing table data, and update the rows accordingly. I've run into a small issue here, though: if the source data has fewer rows than the destination data, the destination rows are not deleted to match the refreshed source data.
2. Truncate the destination data and then replace it with the refreshed source data.
Number 2 is the simplest, but for some reason I feel it is bad practice.
Does anyone have advice on how to properly deal with this situation?
Approach #2 is fine as long as it doesn't cause problems that affect your users.
Approach #1 is also fine, and is especially recommended for really large tables. You would simply need to adjust your code to delete the destination rows that are missing from the incoming source rows.
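For example, if the refreshed source data is landed in a staging table, that cleanup can be a single DELETE statement; here is a rough sketch with placeholder table and key names, which could be run from an Execute SQL Task or from code:

    using System.Data.SqlClient;

    // Removes destination rows whose keys no longer exist in the freshly
    // loaded staging table. Table and column names are placeholders.
    static void DeleteMissingRows(string connectionString)
    {
        const string sql = @"
            DELETE d
            FROM dbo.DestinationTable AS d
            WHERE NOT EXISTS (
                SELECT 1
                FROM dbo.StagingTable AS s
                WHERE s.BusinessKey = d.BusinessKey)";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            command.ExecuteNonQuery();
        }
    }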

Ways to speed up queries SQL Server 2008 R2 without SqlDataSource object

I'm trying to build a product catalog application in ASP.NET and C# that will allow a user to select product attributes from a series of drop-down menus, with a list of relevant products appearing in a gridview.
On page load, the options for each of the drop-downs are queried from the database, as well as the entire product catalog for the gridview. Currently this catalog stands at over 6000 items, but we're looking at perhaps five or six times that when the application goes live.
The query that pulls this catalog runs in less than a second when executed in SQL Server Management Studio, but takes upwards of ten seconds to render on the web page. We've refined the query as much as we know how: pulling only the columns that will show in our gridview (as opposed to saying select * from ...) and adding the WITH (NOLOCK) hint to the query to pull data without waiting for updates, but it's still too slow.
I've looked into SqlCacheDependency, but all the directions I can find assume I'm using a SqlDataSource object. I can't do this because every time the user makes a selection from the menu, a new query is constructed and sent to the database to refine the list of displayed products.
I'm out of my depth here, so I'm hoping someone can offer some insight. Please let me know if you need further information, and I'll update as I can.
EDIT: FYI, paging is not an option here. The people I'm building this for are standing firm on that point. The best I can do is wrap the gridview in a div with overflow: auto set in the CSS.
The tables I'm dealing with aren't going to update more than once every few months, if that; is there any way to cache this information client-side and work with it that way?
Most of your solution will come in a few forms (none of which have to do with a Gridview):
Good indexes. Create good indexes for the tables that pull this data; good indexes are defined as:
Indexes that store as little information as actually needed to display the product. The smaller the amount of data stored, the greater amount of data can be stored per 8K page in SQL Server.
Covering indexes: Your SQL Query should match exactly what you need (not SELECT *) and your index should be built to cover that query (hence why it's called a 'covering index')
Good table structure: this goes along with the index. The fewer joins needed to pull the information, the faster you can pull it.
Paging. You shouldn't ever pull all 6000+ objects at once -- what user can view 6000 objects at once? Even if a theoretical superhuman could process that much data, that's never going to be your median use case. Pull 50 or so at a time (if you really even need that many), or structure your site so that you're always pulling what's relevant to the user instead of everything (keep in mind this is not a trivial problem to solve).
The beautiful part of paging is that your clients don't even need to know you've implemented paging. One such technique is called "Infinite Scrolling". With it, you can go ahead and fetch the next N rows while the customer is scrolling to them.
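Either way, on SQL Server 2008 R2 (which predates OFFSET/FETCH) the paged query itself is usually written with ROW_NUMBER(); a rough sketch with placeholder table and column names:

    using System.Data.SqlClient;

    // Builds a command that returns one page of catalog rows.
    // Table and column names are placeholders.
    static SqlCommand BuildPageQuery(SqlConnection connection, int pageIndex, int pageSize)
    {
        const string sql = @"
            WITH Numbered AS (
                SELECT ProductId, Name, Price,
                       ROW_NUMBER() OVER (ORDER BY Name) AS RowNum
                FROM dbo.Products)
            SELECT ProductId, Name, Price
            FROM Numbered
            WHERE RowNum BETWEEN @First AND @Last";

        var command = new SqlCommand(sql, connection);
        command.Parameters.AddWithValue("@First", pageIndex * pageSize + 1);
        command.Parameters.AddWithValue("@Last", (pageIndex + 1) * pageSize);
        return command;
    }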
If, as you say, paging really is not an option (although I really doubt it; please explain why you think so, and I'm pretty sure someone will find a solution), there's really no way to speed up this kind of operation.
As you noticed, it's not the query that's taking long, it's the data transfer. Copying the data from one memory space (sql) to another (your application) is not that fast, and displaying this data is orders of magnitude slower.
Edit: why are your clients "firm on that point"? Why do they think it's not possible otherwise? Why do they think it's the best solution?
There are many options for showing a large set of data in a grid, but most of them involve third-party software.
Try jQuery/JavaScript grids with AJAX calls. They will help you render a large number of rows on the client, and you can even use caching so you don't query the database repeatedly.
These are good grids that will help you show thousands of rows in a web browser:
http://www.trirand.com/blog/
https://github.com/mleibman/SlickGrid
http://demos.telerik.com/aspnet-ajax/grid/examples/overview/defaultcs.aspx
http://w2ui.com/web/blog/7/JavaScript-Grid-with-One-Million-Records
I hope it helps.
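On the server side, one possible shape for this (just a sketch; ProductRepository.GetProductPage is a placeholder for your own data access code) is a generic handler that returns one page of rows as JSON for the grid's AJAX calls:

    using System.Web;
    using System.Web.Script.Serialization;

    // Returns one page of products as JSON for a JavaScript grid.
    public class ProductsHandler : IHttpHandler
    {
        public void ProcessRequest(HttpContext context)
        {
            int page = int.Parse(context.Request["page"] ?? "0");
            int pageSize = int.Parse(context.Request["pageSize"] ?? "50");

            // Placeholder for your own query/caching code.
            var rows = ProductRepository.GetProductPage(page, pageSize);

            context.Response.ContentType = "application/json";
            context.Response.Write(new JavaScriptSerializer().Serialize(rows));
        }

        public bool IsReusable { get { return true; } }
    }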
You can load all the rows into a DataTable with a background thread when the application (web page) starts, and then use only that DataTable to populate your grids, so you do not have to hit SQL again until you need to read or write different data. (All the other answers cover the other options.)
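A rough sketch of that idea, caching the DataTable in HttpRuntime.Cache and loading it on a background thread (LoadCatalogTable is a placeholder for your own query code):

    using System.Data;
    using System.Threading.Tasks;
    using System.Web;

    // Loads the catalog once in the background and keeps it in the ASP.NET cache,
    // so later requests bind the grid from memory instead of hitting SQL again.
    public static class CatalogCache
    {
        private const string CacheKey = "CatalogTable";

        public static void WarmUp()   // call from Application_Start
        {
            Task.Factory.StartNew(() =>
            {
                DataTable catalog = LoadCatalogTable();
                HttpRuntime.Cache.Insert(CacheKey, catalog);
            });
        }

        public static DataTable Current
        {
            get { return (DataTable)HttpRuntime.Cache[CacheKey]; }
        }

        private static DataTable LoadCatalogTable()
        {
            // Placeholder: fill and return the DataTable from SQL Server here.
            return new DataTable();
        }
    }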

How bad is it to 'delete and insert' as against using DataAdapter.Update in ADO.NET

I have developed a C# windows form with a DataGridView selecting data from a table with IDENTITY columns in SQL Server.
When the user clicks 'Save' after making changes to the DataGridView, I am deleting all rows that were fetched and inserting all rows from the DataGridView into the database. This creates a new set of IDENTITY values. I may end up having 100,000 rows in this table. My worry is that at some point in the future the IDENTITY column may exceed its limit.
On the other hand, I find the additional code for DataAdapter.Update a bit daunting. That is why I preferred the above approach.
How bad a choice is it ?
In my opinion, it's very bad, for the reasons outlined below.
As you pointed out, you have a limited number of these faux "updates" to your table, which will cause a catastrophic failure at some undefined point in time. You could solve this by reseeding the IDENTITY column, but that's messy.
Your network traffic, CPU consumption, and memory consumption are a lot larger than they need to be, which may force you to over-provision your servers for what isn't really a bulk-update workload.
Deleting rows from SQL fragments any indexes on the table, so over time they become less and less useful. You could work around this by reindexing frequently.
In short, take the time now to build a robust solution. You won't regret it.
You could also look into an ORM, such as EntityFramework, which handles Insert/Update/Delete actions automatically.
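For comparison, the DataAdapter.Update path needs less code than it may appear; here is a minimal sketch (table and column names are placeholders, dataGridView1 is the grid on the form) that lets SqlCommandBuilder generate the INSERT/UPDATE/DELETE commands:

    using System.Data;
    using System.Data.SqlClient;

    // Form-level fields: fill once, bind to the grid, and on Save push only
    // the changed rows back. IDENTITY values of untouched rows are preserved.
    SqlDataAdapter adapter;
    DataTable table;

    void LoadGrid(string connectionString)
    {
        adapter = new SqlDataAdapter("SELECT Id, Name, Quantity FROM dbo.Items",
                                     connectionString);
        new SqlCommandBuilder(adapter); // generates Insert/Update/DeleteCommand
        table = new DataTable();
        adapter.Fill(table);
        dataGridView1.DataSource = table;
    }

    void Save()
    {
        adapter.Update(table); // issues INSERT/UPDATE/DELETE only for changed rows
    }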

C# GUI, have to display a huge table and make it sortable

I am making a small C# GUI application that reads table-like (cells, rows, columns) data from a binary file and displays it to the end user. Some files are very small (5 columns / 10 rows), but some are very big (245 columns and almost 50,000 rows).
The one and only method I found to easily display a MsExcel-like table was DataGridView. I was really happy with it when I tried the small files, but as soon as I tried with the huge one it went OOM before it even finished loading (and I had more than 4 GB of free memory).
After that, though, I found its VirtualMode, and it was really fast. Unfortunately the columns were no longer sortable, and that is a must.
So, what can I do to obtain performance similar to DataGridView's virtual mode but have it sortable as well? (If you have another control in mind it's okay, I don't have to necessarily use DataGridView)
Also, please note that:
The binary files were not designed or produced by me. I have no control over their format.
The data is not in a database.
My end users shouldn't have to install a database and import the data there.
You can handle the sorting yourself and sort the data source of the DataGridView with one of the "standard" sorting algorithms.
If you use a List, you can call its Sort() method, but any collection can be sorted by your own code.
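A sketch of that idea combined with VirtualMode (the backing list and the string comparison are placeholders, and the two handlers are assumed to be wired to the grid's events):

    // Keep the parsed rows in a List and sort that list yourself;
    // the grid just re-reads values through CellValueNeeded.
    List<string[]> rows;          // filled from the binary file (placeholder)
    int sortColumn = -1;
    bool ascending = true;

    void dataGridView1_ColumnHeaderMouseClick(object sender,
        DataGridViewCellMouseEventArgs e)
    {
        ascending = (e.ColumnIndex == sortColumn) ? !ascending : true;
        sortColumn = e.ColumnIndex;

        rows.Sort((a, b) =>
        {
            int result = string.Compare(a[sortColumn], b[sortColumn],
                                        StringComparison.Ordinal);
            return ascending ? result : -result;
        });

        dataGridView1.Invalidate();   // forces CellValueNeeded to run again
    }

    void dataGridView1_CellValueNeeded(object sender, DataGridViewCellValueEventArgs e)
    {
        e.Value = rows[e.RowIndex][e.ColumnIndex];
    }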
Look at some third-party controls. I've used Janus (www.janusys.com) and DevExpress (www.devexpress.com) for grids and they work great.
