Performant way to clone rather large subtree in database? - c#

Currently I'm trying to performance-optimize a code in a Windows Forms .NET 2.0 application that does copy operations on hierarchical database objects.
Here is an example structure:
Each object in a tree is represented by a database table row. In addition, for each object there are also several "side-objects" associated. E.g. a test case object also has
1..n permissions
1..n attributes
1..n attachments
...
These "side-objects" are stored in separate database tables.
Copying trees
The user of the applications can select a tree element, right click and select "copy" then paste it later at another position in the tree.
This operation copies all child objects and all "side-objects" to the new location.
From a database perspective, this can be several hundred or even thousand SELECT and INSERT statements, depending on the size of the child tree to copy.
From a user perspective, a progress dialog is being shown to keep the UI responsive. Beside that, most users complain that it takes way too long to do "...a simple copy and paste..." operation.
Optimizing performance
So my goal is to speed up things.
The current algorithm goes something like this:
Read an object from DB.
Store this object as a new entry to DB.
Do the same for all "side-objects" for the object.
Recursively do the same for all child objects of the object.
As you can imagine, for a large number of objects this quickly sums up to a large number of database operations.
Since I have had no clue so far on how to optimize (batch operations? but then how and for which object?), my question is as following.
My question
Can you give me any hints/pattern/best practices on how to clone a large number of hierarchically related objects as described above?
(ideally in a database-agnostic way, although the backend is a Microsoft SQL Server in most cases)

I'm assuming you're doing all the cloning in your .NET app which is causing a lot of roundtrips to the database server?
First make sure you're doing all your cloning on the database and avoiding these roundtrips. It should be possible to do exactly what you are doing, but using a recursive stored procedure, just by rewriting your current C# algorithm in SQL. You should see a big performance boost.
Someone cleverer than me might say it can be done using a single CTE query but this post seems to suggest that's not possible. Certainly, I can't see how you'd do this and preserve the relationships between the new ids

Related

Query DB with the little things, or store some "bigger chunks" of results and filter them in code?

I'm working on an application that imports video files and lets the user browse them and filter them based on various conditions. By importing I mean creating instances of my VideoFile model class and storing them in a DB table. Once hundreds of files are there, the user wants to browse them.
Now, the first choice they have in the UI is to select a DateRecorded, which calls a GetFilesByDate(Date date) method on my data access class. This method will query the SQL database, asking only for files with the given date.
On top of that, I need to filter files by, let's say, FrameRate, Resolution or UserRating. This would place additional criteria on the files already filtered by their date. I'm deciding which road to take:
Only query the DB for a new set of files when the desired DateRecorded changes. Handle all subsequent filtering manually in C# code, by iterating over the stored collection of _filesForSelectedDay and testing them against current additional rules.
Query the DB each time any little filter changes, asking for a smaller and very specific set of files more often.
Which one would you choose, or even better, any thoughts on pros and cons of either of those?
Some additional points:
A query in GetFilesByDate is expected to return tens of items, so it's not very expensive to store the result in a collection always sitting in memory.
Later down the road I might want to select files not just for a specific day, but let's say for the entire month. This may give hundreds or thousands of items. This actually makes me lean towards option two.
The data access layer is not yet implemented. I just have a dummy class implementing the required interface, but storing the data in a in-memory collection instead of working with any kind of DB.
Once I'm there, I'll almost certainly use SQLite and store the database in a local file.
Personally I'd always go the DB every time until it proves impractical. If it's a small amount of data then the overhead should also be small. When it gets larger then the DB comes into its own. It's unlikely you will be able to write code better than the DB although the round trip can cost. Using the DB your data will always be consistent and up to date.
If you find you are hitting the BD too hard then you can try caching your data and working out if you already have some or all of the data being requested to save time. However then you have aging and consistency problems to deal with. You also then have servers with memory stuffed full of data that could be used for other things!
Basically, until it becomes an issue, just use the DB and use your energy on the actual problems you encounter, not the maybes.
If you've already gotten a bunch of data to begin with, there's no need to query the db again for a subset of that set. Just store it in an object which you can query on refinement of the search query by the user.

List to Database

I might be way off here, and this question probably bordering subjective, but here goes anyway.
Currently I use IList<T> to cache information from the database in memory so I can use LINQ to query information from them. I have a ORM'ish layer I've written with the help of some questions here on SO, to easily query the information I need from the DB. For example:
IList<Customer> customers = DB.GetDataTable("Select * FROM Customers").ToList<Customer>();
Its been working fine. I also have extension methods to do CRUD updates on single items within these lists:
DB.Update<Customer>(customers(0));
Again working quite well.
Now in the GUI layer of my app, specifically when binding DataGridView's for the user to edit the data, i find myself bypassing this DAL layer and directly using TableAdapters within the forms which kind of breaks the layered architecture which smells a bit to me. I've also found the fact that I'm using TableAdapters here and ILists there, there are differing standards followed throughout my code which I would like to consolidate into one.
Ideally, I would like to be able to bind to these lists and then have the DAL update the list's 'dirty' data for me. To me, this process would involve the following:
Traversing the list for any 'dirty' items
For each of these, see if there is already an item with the PK in the DB
If (2), then update, else insert
Finally, perform a Delete FROM * WHERE ID NOT IN('all ids in list') query
I'm not entirely sure how this is handled in a TableAdapter, but I can see the performance of this method dropping significantly and quite quickly with increasing items in the list.
So my question is this:
Is there an easier way of committing List to a database? Note the word commit, as it may be an insert/update or delete.
Should I maybe convert to DataTable? e.g. here
I'm sure some of the more advanced ORM's will perform this type of thing, however is there any mini-orm (e.g. dapper/Petapoco/Simple.data etc) that can do this for me? I want to keep it simple (as is with my current DAL) and flexible (I don't mind writing the SQL if its gets me exactly what I need).
Currently I use IList to cache information from the database in memory so I can use LINQ to query information from them.
Linq also has a department called Linq-to-Datasets so this is not a compelling reason.
Better decide what you really want/need:
a full ORM like Entity Framework
use DataSets with DataDapters
use basic ADO.NET (DataReader and List<>) and implement your own change-tracking.
You can mix them to some extent but like you noted it's better to pick one.

How to deal with large objects?

I have 5 types of objects: place info (14 properties),owner company info (5 properties), picture, ratings (stores multiple vote results), comments.
All those 5 objects will gather to make one object (Place) which will have all the properties and information about all the Place's info, pictures, comments, etc
What I'm trying to achieve is to have a page that displays the place object and all it's properties. another issue, if I want to display the Owner Companies' profiles I'll have object for each owner company (but I'll add a sixth property which is a list of all the places they own)
I've been practicing for a while, but I never got into implementing and performance experience, but I sensed that it was too much!
What do you think ?
You have to examine the use case scenarios for your solution. Do you need to always show all of the data, or are you starting off with displaying only a portion of it? Are users likely to expand any collapsed items as part of regular usage or is this information only used in less common usages?
Depending on your answers it may be best to fetch and populate the entire page with all of the data at once, or it may be the case that only some data is needed to render the initial screen and the rest can be fetched on-demand.
In most cases the best solution is likely to involve fetching only the required data and to update the page dynamically using ajax queries as needed.
As for optimizing data access, you need to strike a balance between the number of database requests and the complexity of each individual request. Because of network latency it is often important to fetch as much as possible using as few queries as possible, even if this means you'll sometimes be fetching data that you do not always need. But if you include too much data in a single query, then computing all the joins may also be costly. It is quite rare to see a solution in which it is better to first fetch all root objects and then for every element go fetch some additional objects associated with that element. As such, design your solution to fetch all data at once, but include only what you really need and try to keep the number of involved tables to a minimum.
You have 3 issues to deal with really, and they are often split into DAL, BLL and UI
Your objects obviously belong in the BLL and if you're considering performance then you need to consider how your objects will be created and how they interface to the DAL. I have many objects with 50-200 properties so 14 properties is really no issue.
The UI side of it is seperate, and if you're considering the performance of displaying a lot of information onto a single page you'll consider tabbed content, grids etc.
Tackle it one thing at a time and see where your problems lie.

C#: Very fast object search & retrieval using any persistence model

I am developing an application with Fluent nHibernat/nHibernate 3/Sqlite. I have run into a very specific problem for which I need help with.
I have a product database and a batch database. Products are around 100k but batches run in around 11 million+ mark as of now. When provided with a product, I need to fill a Combobox with batches. As I do not want to load all the batches at once because of memory constraints, I am loading them, when the product is provided, directly from the database. But the problem is that sqlite (or maybe the combination of sqlite & nh) for this, is a little slow. It normally takes around 3+ seconds to retrieve the batches for a particular product. Although it might not seem like a slow scenario, I want to know that can I improve this time? I need sub second results to make order entry a smooth experience.
The details:
New products and batches are imported periodically (bi-monthly).
Nothing in the already persisted products or batchs ever changes (No Update).
Storing products is not an issue. Batches are the main culprit.
Product Ids are long
Batch Ids are string
Batches contain 3 fields, rate, mrp (both decimal) & expiry (DateTime).
The requirements:
The data has to be stored in a file based solution. I cannot use a client-server approach.
Storage time is not important. Search & retrieval time is.
I am open to storing the batch database using any other persistence model.
I am open to using anything like Lucene, or a nosql database (like redis), or a oodb, provided they are based on single storage file implementation.
Please suggest what I can use for fast object retrieval.
Thanks.
You need to profile or narrow down to find out where those 3+ seconds are.
Is it the database fetching?
Try running the same queries in Sqlite browser. Does the queries take 3+ seconds there too? Then you might need to do something with the database, like adding some good indexes.
Is it the filling of the combobox?
What if you only fill the first value in the combobox and throw away the others? Does that speed up the performance? Then you might try BeginUpdate and EndUpdate.
Are the 3+ seconds else where? If so, find out where.
This may seem like a silly question, but figured I'd double-check before proceeding to alternatives or other optimizations, but is there an index (or hopefully a primary key) on the Batch Id column in your Batch table. Without indexes those kinds of searches will be painfully slow.
For fast object retrieval, a key/value store is definitely a viable alternative. I'm not sure I would necessarily recommend redis in this situation since your Batches database may be a little too large to fit into memory, and although it also stores to a disk it's generally better when suited with a dataset that strictly fits into memory.
My personal favourite would be mongodb - but overall the best thing to do would be to take your batches data, load it into a couple of different nosql dbs and see what kind of read performance you're getting and pick the one that suits the data best. Mongo's quite fast and easy to work with - and you could probably ditch the nhibernate layer for such a simple data structure.
There is a daemon that needs to run locally, but depending on the size of the db it will be single file (or a few files if it has to allocate more space). Again, ensure there is an index on your batch id column to ensure quick lookups.
3 seconds to load ~100 records from the database? That is slow. You should examine the generated sql and create an index that will improve the query's performance.
In particular, the ProductId column in the Batches table should be indexed.

Best way to process a large database?

Background:
I have one Access database (.mdb) file, with half a dozen tables in it. This file is ~300MB large, so not huge, but big enough that I want to be efficient. In it, there is one major table, a client table. The other tables store data like consultations made, a few extra many-to-one to one fields, that sort of thing.
Task:
I have to write a program to convert this Access database to a set of XML files, one per client. This is a database conversion application.
Options:
(As I see it)
Load the entire Access database into memory in the form of List's of immutable objects, then use Linq to do lookups in these lists for associated data I need.
Benefits:
Easy parallelised. Startup a ThreadPool thread for each client. Because all the objects are immutable, they can be freely shared between the threads, which means all threads have access to all data at all times, and it is all loaded exactly once.
(Possible) Cons:
May use extra memory, loading orphaned items, items that aren't needed anymore, etc.
Use Jet to run queries on the database to extract data as needed.
Benefits:
Potentially lighter weight. Only loads data that is needed, and as it is needed.
(Possible) Cons:
Potentially heavier! May load items more than once and hence use more memory.
Possibly hard to paralellise, unless Jet/OleDb supports concurrent queries (can someone confirm/deny this?)
Some other idea?
What are StackOverflows thoughts on the best way to approach this problem?
Generate XML parts from SQL. Store each fetched record in the file as you fetch it.
Sample:
SELECT '<NODE><Column1>' + Column1 + '</Column1><Column2>' + Column2 + '</Column2></Node>' from MyTable
If your objective is to convert your database to xml files, you can then:
connect to your database through an ADO/OLEDB connection
successively open each of your tables as ADO recordsets
Save each of your recordset as a XML file:
myRecordset.save myXMLFile, adPersistXML
If you are working from the Access file, use the currentProject.accessConnection as your ADO connection
From the sounds of this, it would be a one-time operation. I strongly discourage the actual process of loading the entire setup into memory, that just does not seem like an efficient method of doing this at all.
Also, depending on your needs, you might be able to extract directly from Access -> XML if that is your true end game.
Regardless, with a database that small, doing them one at a time, with a few specifically written queries in my opinion would be easier to manage, faster to write, and less error prone.
I would lean towards jet, since you can be more specific in what data you want to pull.
Also I noticed the large filesize, this is a problem i have recently come across at work. Is this an access 95 or 97 db? If so converting the DB to 2000 or 2003 and then back to 97 will reduce this size, it seems to be a bug in some cases. The DB I was dealing with claimed to be 70meg after i converted it to 2000 and back again it was 8 meg.

Categories