Long-time watcher and this is my first post, so please go easy on me. I've looked around, but my query is quite specific and I can't find it answered elsewhere.
I have an application which consists of 2 controls.
A control on the left contains a tree view which displays searches mapped from an XML file. Each Search has an associated GUID.
A control on the right displays a data grid, whose contents I obtain from the tree view through a Dictionary (Guid <-> DataSet).
When the user clicks a node in the tree view, my application works out which GUID the search is linked to, flattens the associated DataSet and displays it in the data grid.
I save (serialise) and load (deserialise) the Dictionary when exiting and starting the application, respectively.
I did some memory profiling recently, and for larger searches the memory footprint of the Dictionary can be quite large (around 200 MB), which is too much for the limited capacity of the users' machines, so I really need to sort it out.
I was wondering if anyone had any idea how to achieve this.
I was thinking about splitting the serialisable Dictionary into its constituent DataSets and storing each one individually on the hard drive (maybe with the GUID as the filename). The saved DataSets would then be deserialised on each Node.Click event and displayed in the grid view. My worry about this is the potential pause between click events while the old search is saved and the new one is loaded.
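Roughly, the file-per-GUID idea would look something like this; the class and method names here are just illustrative, and I'm assuming the DataSet's own XML serialisation:

    using System;
    using System.Data;
    using System.IO;

    // Each search's DataSet lives in its own file, named after its GUID,
    // and is only loaded when the corresponding tree node is clicked.
    public class SearchStore
    {
        private readonly string _folder;

        public SearchStore(string folder)
        {
            _folder = folder;
            Directory.CreateDirectory(folder);
        }

        private string PathFor(Guid id)
        {
            return Path.Combine(_folder, id.ToString("N") + ".xml");
        }

        // Persist one search's DataSet under its GUID (e.g. when leaving a node).
        public void Save(Guid id, DataSet data)
        {
            data.WriteXml(PathFor(id), XmlWriteMode.WriteSchema);
        }

        // Load a single search on demand, e.g. from the TreeView's click handler.
        public DataSet Load(Guid id)
        {
            DataSet ds = new DataSet();
            ds.ReadXml(PathFor(id), XmlReadMode.ReadSchema);
            return ds;
        }
    }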
Simple answer: toss the dictionary and the datasets into a SQLite file (or another database). It will be a little slower, but I expect still faster than any code you could hand-write in a reasonable amount of time. Used correctly, a database layer will do memory buffering and caching for you and keep the UI responsive.
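Something along these lines, assuming the System.Data.SQLite provider (the table layout and the XML payload are just one way to do it; any serialisation of the DataSet would work):

    using System;
    using System.Data;
    using System.Data.SQLite;
    using System.IO;

    // A GUID -> DataSet store backed by a single SQLite file.
    public class SqliteSearchStore
    {
        private readonly string _connectionString;

        public SqliteSearchStore(string dbFile)
        {
            _connectionString = "Data Source=" + dbFile;
            using (SQLiteConnection conn = new SQLiteConnection(_connectionString))
            using (SQLiteCommand cmd = conn.CreateCommand())
            {
                conn.Open();
                cmd.CommandText =
                    "CREATE TABLE IF NOT EXISTS Searches (Id TEXT PRIMARY KEY, Payload TEXT)";
                cmd.ExecuteNonQuery();
            }
        }

        public void Save(Guid id, DataSet data)
        {
            StringWriter xml = new StringWriter();
            data.WriteXml(xml, XmlWriteMode.WriteSchema);

            using (SQLiteConnection conn = new SQLiteConnection(_connectionString))
            using (SQLiteCommand cmd = conn.CreateCommand())
            {
                conn.Open();
                cmd.CommandText =
                    "INSERT OR REPLACE INTO Searches (Id, Payload) VALUES (@id, @payload)";
                cmd.Parameters.AddWithValue("@id", id.ToString());
                cmd.Parameters.AddWithValue("@payload", xml.ToString());
                cmd.ExecuteNonQuery();
            }
        }

        public DataSet Load(Guid id)
        {
            using (SQLiteConnection conn = new SQLiteConnection(_connectionString))
            using (SQLiteCommand cmd = conn.CreateCommand())
            {
                conn.Open();
                cmd.CommandText = "SELECT Payload FROM Searches WHERE Id = @id";
                cmd.Parameters.AddWithValue("@id", id.ToString());
                string payload = (string)cmd.ExecuteScalar();

                DataSet ds = new DataSet();
                ds.ReadXml(new StringReader(payload), XmlReadMode.ReadSchema);
                return ds;
            }
        }
    }

Only the searches the user actually opens ever get loaded into memory, and SQLite takes care of caching the rest on disk.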
I have a C# tool that parses a collection of CSV files to construct a List<MyObject>. This collection can be small, limited to 20 files, or as large as 10,000+ files. MyObject itself has about 20 properties, most of them strings. Each file can sometimes create up to 4 items in the list and sometimes as many as 300.
After the parsing is done, I first save the list to a CSV file so I don't have to reparse the data again later. I then summarize the data by one pivot of the dataset, and there are multiple other pivots the user can choose. The data is presented in WPF, and the user acts on the data and annotates it with some additional information that then gets added to the MyObject. Finally, the user can save all of this information to another CSV file.
I ran into OOMs when the number of files got large and have optimized some of my code. First I realized I was storing one parameter, i.e. the path to the CSV file, which was sometimes close to 255 characters; I changed it to save only the filename and things improved slightly. I then discovered a suggestion to compile for x64, which would give me 4 GB of memory instead of 2 GB.
Even with this, I obviously still hit OOMs as more and more files are added to the data set.
Some of the options I've considered are:
When parsing the files, write to the intermediate.csv file after each file is parsed rather than keeping the whole list in memory. This would at least avoid hitting an OOM before I even get to save the intermediate.csv file.
The problem with this approach is that I still have to load the intermediate file back into memory once the parsing is all done.
Some of the properties on MyObject are the same for a collection of files, so I've considered refactoring the single object into multiple objects, which would reduce the number of items in the list. Essentially, refactoring to a List<MyTopLevelDetailsObject>, with each MyTopLevelDetailsObject containing a List<MyObject>. The memory footprint should reduce, theoretically. I can then output this to CSV by doing some translation to make it appear like a single object.
Move the data to a database like MongoDB internally and push the summarization logic down to the database.
Use DataTables instead.
Options 2 and 3 would be a significant redesign, with 3 also requiring me to learn MongoDB. :)
I'm looking for some guidance and helpful tips on how large data sets have been handled.
Regards,
LW
If, after optimizations, the data can't fit in memory, almost by definition you need it to hit the disk.
Rather than reinvent the wheel and create a custom data format, it's generally best to use one of the well-vetted solutions. MongoDB is a good choice here, as are other database solutions. I'm fond of SQLite, which, despite the name, can handle large amounts of data and doesn't require a local server.
If you ever get to the point where fitting the data on a local disk is a problem, you might consider moving on to large data solutions like Hadoop. That's a bigger topic, though.
Options two and four probably can't help you because (as I see it) they won't reduce the total amount of information in memory.
Also consider the option of loading data dynamically. I mean, the user probably can't see all of the data at one moment in time, so you can load a part of the .csv into memory and show it to the user; then, if the user makes some annotations/edits, you can save that chunk of data to a separate file. If the user scrolls through the data, you load it on the fly. When the user wants to save the final .csv, you combine it from the original one and your little saved chunks.
This is a common practice when creating C# desktop applications that access large amounts of data. For example, I adopted loading data in chunks on the fly when I needed to create a WinForms application that worked with a huge database (tables with more than 10m rows, which can't fit into a mediocre office PC's memory).
And yes, it's too much work to do this with .csv manually. It's easier to use a database to handle the saving/loading of edited parts and the composition of the final output.
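As a rough illustration of the chunked approach (the comma split is naive and the record layout is a placeholder; a real CSV parser should handle quoting):

    using System.Collections.Generic;
    using System.IO;

    public static class CsvWindowReader
    {
        // Reads 'count' records starting at zero-based record index 'offset'.
        // Only the requested window is ever materialized in memory.
        public static List<string[]> ReadWindow(string path, int offset, int count)
        {
            List<string[]> window = new List<string[]>(count);
            using (StreamReader reader = new StreamReader(path))
            {
                string line;
                int index = 0;
                while ((line = reader.ReadLine()) != null)
                {
                    if (index >= offset)
                    {
                        window.Add(line.Split(','));
                        if (window.Count == count)
                            break;
                    }
                    index++;
                }
            }
            return window;
        }
    }

A virtualized WPF list could call ReadWindow as the user scrolls, and edited rows could be written to a separate annotations file keyed by record index, as described above.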
I'm trying to build a product catalog application in ASP.NET and C# that will allow a user to select product attributes from a series of drop-down menus, with a list of relevant products appearing in a gridview.
On page load, the options for each of the drop-downs are queried from the database, as well as the entire product catalog for the gridview. Currently this catalog stands at over 6000 items, but we're looking at perhaps five or six times that when the application goes live.
The query that pulls this catalog runs in less than a second when executed in SQL Server Management Studio, but takes upwards of ten seconds to render on the web page. We've refined the query as much as we know how: pulling only the columns that will show in our gridview (as opposed to SELECT * FROM ...) and adding the WITH (NOLOCK) hint to the query to read data without waiting on updates, but it's still too slow.
I've looked into SqlCacheDependency, but all the directions I can find assume I'm using a SqlDataSource object. I can't do this because every time the user makes a selection from the menu, a new query is constructed and sent to the database to refine the list of displayed products.
I'm out of my depth here, so I'm hoping someone can offer some insight. Please let me know if you need further information, and I'll update as I can.
EDIT: FYI, paging is not an option here. The people I'm building this for are standing firm on that point. The best I can do is wrap the gridview in a div with overflow: auto set in the CSS.
The tables I'm dealing with aren't going to update more than once every few months, if that; is there any way to cache this information client-side and work with it that way?
Most of your solution will come in a few forms (none of which have to do with a Gridview):
Good indexes. Create good indexes for the tables that pull this data; good indexes are defined as:
Indexes that store as little information as is actually needed to display the product. The less data stored per row, the more rows fit into each 8K page in SQL Server.
Covering indexes: Your SQL Query should match exactly what you need (not SELECT *) and your index should be built to cover that query (hence why it's called a 'covering index')
Good table structure: this goes along with the index. The fewer joins needed to pull the information, the faster you can pull it.
Paging. You shouldn't ever pull all 6000+ objects at once -- what user can view 6000 objects at once? Even if a theoretical superhuman could process that much data, that's never going to be your median use case. Pull 50 or so at a time (if you really even need that many), or structure your site so that you're always pulling what's relevant to the user instead of everything (keep in mind this is not a trivial problem to solve).
The beautiful part of paging is that your clients don't even need to know you've implemented paging. One such technique is called "Infinite Scrolling". With it, you can go ahead and fetch the next N rows while the customer is scrolling to them.
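As a sketch of what server-side paging looks like from ADO.NET, assuming SQL Server 2012+ for OFFSET/FETCH (older versions would use ROW_NUMBER()); the table and column names are invented:

    using System.Data;
    using System.Data.SqlClient;

    public static class ProductPager
    {
        public static DataTable GetPage(string connectionString, int pageIndex, int pageSize)
        {
            const string sql = @"
                SELECT ProductId, Name, Price
                FROM dbo.Products
                ORDER BY Name
                OFFSET @Offset ROWS FETCH NEXT @PageSize ROWS ONLY;";

            using (SqlConnection conn = new SqlConnection(connectionString))
            using (SqlCommand cmd = new SqlCommand(sql, conn))
            using (SqlDataAdapter adapter = new SqlDataAdapter(cmd))
            {
                cmd.Parameters.AddWithValue("@Offset", pageIndex * pageSize);
                cmd.Parameters.AddWithValue("@PageSize", pageSize);

                DataTable page = new DataTable();
                adapter.Fill(page);   // Fill opens and closes the connection itself
                return page;
            }
        }
    }

Whether the grid does classic paging or infinite scrolling, each page of rows comes from the same kind of query.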
If, as you say, paging really is not an option (although I really doubt it; please explain why you think it isn't, and I'm pretty sure someone will find a solution), there's really no way to speed up this kind of operation.
As you noticed, it's not the query that's taking long, it's the data transfer. Copying the data from one memory space (SQL Server's) to another (your application's) is not that fast, and displaying that data is orders of magnitude slower.
Edit: why are your clients "firm on that point"? Why do they think it's not possible otherwise? Why do they think it's the best solution?
There are many options for showing a large set of data on a grid, but they mostly involve third-party software.
Try using jQuery/JavaScript grids with AJAX calls. They will help you render a large number of rows on the client, and you can even use caching so you don't query the database repeatedly.
These are good grids that will help you show thousands of rows in a web browser:
http://www.trirand.com/blog/
https://github.com/mleibman/SlickGrid
http://demos.telerik.com/aspnet-ajax/grid/examples/overview/defaultcs.aspx
http://w2ui.com/web/blog/7/JavaScript-Grid-with-One-Million-Records
I hope it helps.
You can load all the rows into a DataTable using a background thread when the application (web page) starts, then use only that DataTable to populate your grids etc., so you do not have to hit SQL again until you need to read/write different data. (All the other answers cover the other options.)
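A rough sketch of that idea using HttpRuntime.Cache rather than a hand-rolled static (the query, cache key and expiration here are placeholders); a background warm-up could simply call this from Application_Start:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Web;
    using System.Web.Caching;

    public static class CatalogCache
    {
        private const string CacheKey = "ProductCatalog";

        public static DataTable GetCatalog(string connectionString)
        {
            DataTable cached = HttpRuntime.Cache[CacheKey] as DataTable;
            if (cached != null)
                return cached;

            DataTable catalog = new DataTable();
            using (SqlDataAdapter adapter = new SqlDataAdapter(
                "SELECT ProductId, Name, Price FROM dbo.Products", connectionString))
            {
                adapter.Fill(catalog);
            }

            // The underlying tables change only every few months, so a generous
            // absolute expiration is fine here.
            HttpRuntime.Cache.Insert(CacheKey, catalog, null,
                DateTime.UtcNow.AddHours(4), Cache.NoSlidingExpiration);
            return catalog;
        }
    }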
Currently I'm trying to performance-optimize a code in a Windows Forms .NET 2.0 application that does copy operations on hierarchical database objects.
Here is an example structure:
Each object in a tree is represented by a database table row. In addition, for each object there are also several "side-objects" associated. E.g. a test case object also has
1..n permissions
1..n attributes
1..n attachments
...
These "side-objects" are stored in separate database tables.
Copying trees
The user of the applications can select a tree element, right click and select "copy" then paste it later at another position in the tree.
This operation copies all child objects and all "side-objects" to the new location.
From a database perspective, this can be several hundred or even thousand SELECT and INSERT statements, depending on the size of the child tree to copy.
From a user perspective, a progress dialog is shown to keep the UI responsive. Besides that, most users complain that it takes far too long to do "...a simple copy and paste..." operation.
Optimizing performance
So my goal is to speed up things.
The current algorithm goes something like this:
Read an object from DB.
Store this object as a new entry to DB.
Do the same for all "side-objects" for the object.
Recursively do the same for all child objects of the object.
As you can imagine, for a large number of objects this quickly sums up to a large number of database operations.
Since I have had no clue so far how to optimize this (batch operations? but then how, and for which objects?), my question is as follows.
My question
Can you give me any hints/pattern/best practices on how to clone a large number of hierarchically related objects as described above?
(ideally in a database-agnostic way, although the backend is a Microsoft SQL Server in most cases)
I'm assuming you're doing all the cloning in your .NET app, which is causing a lot of round trips to the database server?
First make sure you're doing all your cloning on the database and avoiding these roundtrips. It should be possible to do exactly what you are doing, but using a recursive stored procedure, just by rewriting your current C# algorithm in SQL. You should see a big performance boost.
Someone cleverer than me might say it can be done using a single CTE query, but this post seems to suggest that's not possible. Certainly, I can't see how you'd do that and preserve the relationships between the new IDs.
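As a sketch of what the client side shrinks to once the work is pushed into the database (the stored procedure name and parameters are hypothetical; the procedure itself would do the recursive or set-based copying server-side):

    using System.Data;
    using System.Data.SqlClient;

    public static class TreeCloner
    {
        // One round trip: the server clones the subtree and all "side-objects"
        // and hands back the id of the new root.
        public static int CloneSubtree(string connectionString, int sourceId, int newParentId)
        {
            using (SqlConnection conn = new SqlConnection(connectionString))
            using (SqlCommand cmd = new SqlCommand("dbo.usp_CloneSubtree", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.CommandTimeout = 300;   // large trees can take a while even server-side
                cmd.Parameters.AddWithValue("@SourceId", sourceId);
                cmd.Parameters.AddWithValue("@NewParentId", newParentId);

                SqlParameter newRootId = cmd.Parameters.Add("@NewRootId", SqlDbType.Int);
                newRootId.Direction = ParameterDirection.Output;

                conn.Open();
                cmd.ExecuteNonQuery();
                return (int)newRootId.Value;
            }
        }
    }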
I have 5 types of objects: place info (14 properties), owner company info (5 properties), picture, ratings (stores multiple vote results), and comments.
All five of those objects are gathered into one object (Place), which will have all the properties and information about the place: its info, pictures, comments, etc.
What I'm trying to achieve is to have a page that displays the Place object and all its properties. Another issue: if I want to display the owner companies' profiles, I'll have an object for each owner company (but I'll add a sixth property, which is a list of all the places they own).
I've been practicing for a while, but I have never had real implementation and performance experience, and I sense that this is too much!
What do you think?
You have to examine the use case scenarios for your solution. Do you need to always show all of the data, or are you starting off with displaying only a portion of it? Are users likely to expand any collapsed items as part of regular usage or is this information only used in less common usages?
Depending on your answers it may be best to fetch and populate the entire page with all of the data at once, or it may be the case that only some data is needed to render the initial screen and the rest can be fetched on-demand.
In most cases the best solution is likely to involve fetching only the required data and to update the page dynamically using ajax queries as needed.
As for optimizing data access, you need to strike a balance between the number of database requests and the complexity of each individual request. Because of network latency it is often important to fetch as much as possible using as few queries as possible, even if this means you'll sometimes be fetching data that you do not always need. But if you include too much data in a single query, then computing all the joins may also be costly. It is quite rare to see a solution in which it is better to first fetch all root objects and then for every element go fetch some additional objects associated with that element. As such, design your solution to fetch all data at once, but include only what you really need and try to keep the number of involved tables to a minimum.
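For illustration, here is one round trip that brings back a place and its side-objects as separate result sets (the table and column names are invented):

    using System.Data;
    using System.Data.SqlClient;

    public static class PlacePageLoader
    {
        public static DataSet LoadPlacePage(string connectionString, int placeId)
        {
            const string sql = @"
                SELECT PlaceId, Name, Address    FROM dbo.Places   WHERE PlaceId = @PlaceId;
                SELECT PictureId, Url            FROM dbo.Pictures WHERE PlaceId = @PlaceId;
                SELECT RatingId, Score           FROM dbo.Ratings  WHERE PlaceId = @PlaceId;
                SELECT CommentId, Body, PostedAt FROM dbo.Comments WHERE PlaceId = @PlaceId;";

            using (SqlConnection conn = new SqlConnection(connectionString))
            using (SqlDataAdapter adapter = new SqlDataAdapter(sql, conn))
            {
                adapter.SelectCommand.Parameters.AddWithValue("@PlaceId", placeId);

                // Map the four result sets onto named tables instead of Table, Table1, ...
                adapter.TableMappings.Add("Table",  "Place");
                adapter.TableMappings.Add("Table1", "Pictures");
                adapter.TableMappings.Add("Table2", "Ratings");
                adapter.TableMappings.Add("Table3", "Comments");

                DataSet page = new DataSet();
                adapter.Fill(page);
                return page;
            }
        }
    }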
You have 3 issues to deal with really, and they are often split into the DAL, BLL and UI.
Your objects obviously belong in the BLL, and if you're considering performance then you need to consider how your objects will be created and how they interface with the DAL. I have many objects with 50-200 properties, so 14 properties is really no issue.
The UI side of it is separate, and if you're considering the performance of displaying a lot of information on a single page, you'll consider tabbed content, grids, etc.
Tackle it one thing at a time and see where your problems lie.
I have 3 forms in my project.
When I open form 3, I fill a combo box with data (from the database),
and it takes time...
How can I fill this combo box only once, when the program opens
(in the first form, form1)?
Thanks in advance.
There are a million ways to do this, and your question is pretty vague. Is it the same data in all three combo boxes? Regardless, you want to load the data and store the lists in memory when your application first initializes. There are a lot of good, and a lot of bad, ways to do this. Then when each form comes up, check whether the list in memory is filled; if it is, bind to that list. (If not, fill the list from the database and then bind to it.)
The overall concept is to preload the data, and then always check your memory persistence before going to the database.
Edit
To quickly list a good and a bad way of storing these values in memory before I turn in for the night; I'll try to expand on this in the morning.
The best way would be to create a memory repository layer in your application and have your business objects poll it before heading to the database, but there is some complexity in using this sort of model (mainly dealing with concurrency issues).
The worst way would be to just declare some global collections of data somewhere, and pull them directly into your UI.
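To make the "check memory first" idea concrete, here is a small sketch of a lazily loaded, shared lookup list (the query and types are placeholders; a fuller repository layer would also handle refreshing and the concurrency concerns mentioned above):

    using System.Collections.Generic;
    using System.Data.SqlClient;

    public static class LookupCache
    {
        private static readonly object _sync = new object();
        private static List<string> _items;   // e.g. the combo box entries

        public static List<string> GetItems(string connectionString)
        {
            lock (_sync)
            {
                if (_items == null)   // only hit the database the first time
                {
                    List<string> items = new List<string>();
                    using (SqlConnection conn = new SqlConnection(connectionString))
                    using (SqlCommand cmd = new SqlCommand(
                        "SELECT Name FROM dbo.LookupValues ORDER BY Name", conn))
                    {
                        conn.Open();
                        using (SqlDataReader reader = cmd.ExecuteReader())
                        {
                            while (reader.Read())
                                items.Add(reader.GetString(0));
                        }
                    }
                    _items = items;
                }
                return _items;
            }
        }
    }

Form1 can call GetItems once at startup, and form 3 simply binds its combo box to whatever GetItems returns.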