Here is my situation.
I have a hierarchical data set that refreshes every night at 1AM. The set itself is fairly small (200K rows).
I'm considering two approaches:
1. Load the data, compare it to the existing table data, and update the rows accordingly. I've run into a small issue here, though: if the source data has fewer rows than the destination data, the extra destination rows are not deleted to match the refreshed source data.
2. Truncate the destination data and then replace it with the refreshed source data.
Number 2 is the simplest, but for some reason I feel it's bad practice.
Does anyone have advice on how to properly deal with this situation?
Approach #2 is fine as long as it doesn't cause problems that affect your users.
Approach #1 is also fine, and is especially recommended for really large tables. You would simply need to adjust your code to delete the destination rows that are missing from the incoming source rows.
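If the destination lives in SQL Server, one way to express approach #1 is a single MERGE from a staging copy of the source data; the WHEN NOT MATCHED BY SOURCE branch takes care of the rows you currently aren't deleting. This is only a minimal sketch with placeholder table and column names (DestTable, SourceStaging, Id, Value), not your actual schema:

```csharp
using System.Data.SqlClient;

class NightlyRefresh
{
    // Synchronizes DestTable with the freshly loaded SourceStaging table in one
    // pass. WHEN NOT MATCHED BY SOURCE covers the case in the question:
    // destination rows that no longer exist in the source get deleted.
    // Table and column names are placeholders.
    private const string MergeSql = @"
MERGE dbo.DestTable AS target
USING dbo.SourceStaging AS source
    ON target.Id = source.Id
WHEN MATCHED AND target.Value <> source.Value THEN
    UPDATE SET target.Value = source.Value
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Value) VALUES (source.Id, source.Value)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;";

    public static void Refresh(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(MergeSql, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```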
I'm currently working with an import file that has 460,000 rows of data in it. Each row consists of an ID and a quantity (e.g. "1,120"). This information is read from the file and should then be used to update each corresponding row in a database (e.g. UPDATE item SET quantity = QTY WHERE id = 1).
The problem I'm having, though, is actually being able to run the query efficiently. If I run an individual query for each line, it's really not going to work (as I've found out the hard way).
I'm not an experienced SQL user and I'm still learning, but from what I've seen, the web doesn't seem to have any useful answers on this.
I was wondering if anybody had experience with updating such a large dataset, and if so, would they be willing to share the methods that they used to achieve this?
460k rows isn't a lot, so you should be okay there.
I'd recommend importing the entire dataset into a temporary table or table variable. While you're getting the solution working, start with an actual physical table that you can DROP or TRUNCATE between runs.
Create the table, then import all the data into it. Then, do your table update based on a join to this import table.
Discard the import table when appropriate. Once this is all working how you want it to, you can do the entire thing using a stored procedure, and use a temporary table to handle the imported data while you are working with it.
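A minimal sketch of that flow, assuming the id/quantity file from the question, a temporary staging table #ItemStaging, and the item table from the example UPDATE (adjust names and parsing to your actual schema and file format):

```csharp
using System.Data;
using System.Data.SqlClient;
using System.IO;

class QuantityImport
{
    public static void Run(string connectionString, string filePath)
    {
        // Parse the "id,quantity" lines into a DataTable.
        var table = new DataTable();
        table.Columns.Add("id", typeof(int));
        table.Columns.Add("quantity", typeof(int));
        foreach (var line in File.ReadLines(filePath))
        {
            var parts = line.Split(',');
            table.Rows.Add(int.Parse(parts[0]), int.Parse(parts[1]));
        }

        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // Staging table; a permanent table works just as well while testing.
            using (var create = new SqlCommand(
                "CREATE TABLE #ItemStaging (id INT PRIMARY KEY, quantity INT);", conn))
            {
                create.ExecuteNonQuery();
            }

            // Bulk copy the 460k parsed rows into the staging table.
            using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#ItemStaging" })
            {
                bulk.WriteToServer(table);
            }

            // One set-based UPDATE via a join, instead of 460k single-row UPDATEs.
            using (var update = new SqlCommand(@"
                UPDATE i
                SET    i.quantity = s.quantity
                FROM   dbo.item AS i
                JOIN   #ItemStaging AS s ON s.id = i.id;", conn))
            {
                update.ExecuteNonQuery();
            }
        }
    }
}
```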
460000 rows is a small dataset. Really small.
Bulk insert into a temporary table, then use a single UPDATE command to apply the changes to the original data in one pass.
I have read several MS articles about when to use DataSets in conjunction with a database from within a WinForms application. I certainly like the ease of use DataSets offer, but I have a few concerns about using them with a large data source. I want to use a SQLite database to locally store processed web log information. Potentially this could result in tens of thousands of rows of data.
When a DataSet is filled via a database table, does it end up containing ALL the data from the database, or does it contain only a portion of data from the database?
Could I use a DataSet to add rows to the database, perform an Update for example, somehow 'clear' what the DataSet is holding in memory, then perform additional row adding?
So is it possible to essentially manage what a DataSet is currently holding in memory? If a DataSet represents a table that contains 100,000 rows, does that mean all 100,000 rows need to be loaded from the database into memory before it is even usable?
Thanks.
You raise very important points here. These concerns came up at the very beginning of .NET, when we suddenly moved to the disconnected model it introduced.
The answer to your problem is paging. You need to manually code your grid or other display control so that it queries the database in chunks. For example, suppose you have a control (not a grid) with fields and a scrollbar, and the user scrolls 201 times. For the first 200 scrolls it moves through the 200 records already in memory; on scroll #201, it queries the database for 200 more. You might also add some logic to drop the oldest 200 records once the number of them in the DataSet reaches 1000. This is just an example.
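A rough sketch of that chunked querying, assuming Microsoft.Data.Sqlite and a made-up LogEntry table with Id, Url and Hits columns; the point is only that each call pulls one page into memory rather than the whole table:

```csharp
using System.Data;
using Microsoft.Data.Sqlite;

class LogPager
{
    private readonly string _connectionString;
    public LogPager(string connectionString) => _connectionString = connectionString;

    // Load only one page (e.g. 200 rows) at a time, instead of filling a
    // DataSet with the whole 100,000-row table up front.
    // Table and column names are made up for the sketch.
    public DataTable LoadPage(int pageIndex, int pageSize = 200)
    {
        var page = new DataTable("LogEntry");
        using (var conn = new SqliteConnection(_connectionString))
        {
            conn.Open();
            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText =
                    "SELECT Id, Url, Hits FROM LogEntry ORDER BY Id LIMIT @take OFFSET @skip;";
                cmd.Parameters.AddWithValue("@take", pageSize);
                cmd.Parameters.AddWithValue("@skip", pageIndex * pageSize);
                using (var reader = cmd.ExecuteReader())
                {
                    page.Load(reader);   // only this page ends up in memory
                }
            }
        }
        return page;
    }
}
```

The scroll handler described above would simply request the next page index as the user scrolls, and merge or discard rows as needed.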
To save data you can add it to this same DataSet/DataTable. There are a few ways of doing it: DataSet/DataTable have built-in capabilities to identify new or edited rows, relationships, etc. In serious systems, entity lists typically encapsulate DataTables and provide customizations.
You may also want to look into Entity Framework; I am not sure whether this kind of functionality is included there.
Basically, for a simple application with small data it is OK to use out-of-the-box ADO.NET. But in a serious system there is normally a lot of groundwork with ADO.NET to provide a solid data access layer, and more work on top of that to create a good user experience. In this case that means loading data in chunks, because if you load 100K records the user will have to wait for the load to finish and will then find it hard to scroll through all of them.
In the end, you need to look at what your application is, and what it is for, and what will be satisfactory or not satisfactory for the user.
I'm receiving and parsing a large text file.
In that file I have a numerical ID identifying a row in a table, and another field that I need to update.
ID   Current Location
=========================
1    Boston
2    Cambridge
3    Idaho
I was thinking of composing a single SQL command string and firing it off using ADO.NET, but some of the files I'm going to receive will have thousands of lines. Is this doable, or is there a limit I'm not seeing?
If you may have thousands of lines, then composing a SQL statement is definitely NOT the way to go. Better code-based alternatives include:
Use SqlBulkCopy to insert the change data into a staging table and then UPDATE your target table using the staging table as the source. It also has excellent batching options (unlike the other choices).
Write a stored procedure to do the Update that accepts an XML parameter that contains the UPDATE data.
Write a stored procedure to do the Update that accepts a table-valued parameter that contains the UPDATE data.
I have not compared them myself but it is my understanding that #3 is generally the fastest (though #1 is plenty fast for almost any need).
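For what it's worth, here's roughly what the C# side of option #3 looks like. The table type dbo.LocationUpdateType, procedure dbo.UpdateLocations, and target table name are made-up; the one-time T-SQL setup is shown in the comment, and the incoming DataTable's columns must match the table type's columns:

```csharp
using System.Data;
using System.Data.SqlClient;

class TvpUpdate
{
    // One-time setup on the SQL Server side (illustrative names only):
    //
    //   CREATE TYPE dbo.LocationUpdateType AS TABLE
    //       (Id INT PRIMARY KEY, CurrentLocation NVARCHAR(100));
    //
    //   CREATE PROCEDURE dbo.UpdateLocations @rows dbo.LocationUpdateType READONLY
    //   AS
    //       UPDATE t SET t.CurrentLocation = r.CurrentLocation
    //       FROM dbo.YourTargetTable AS t
    //       JOIN @rows AS r ON r.Id = t.Id;
    //
    public static void Run(string connectionString, DataTable rows)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("dbo.UpdateLocations", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;

            // The whole change set goes over in a single table-valued parameter.
            var p = cmd.Parameters.AddWithValue("@rows", rows);
            p.SqlDbType = SqlDbType.Structured;
            p.TypeName = "dbo.LocationUpdateType";

            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```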
Writing one huge INSERT statement will be very slow. You also don't want to parse the whole massive file at once. What you need to do is something along these lines:
Figure out a good chunk size. Let's call it chunk_size. This will be the number of records you'll read from the file at a time.
Load chunk_size records from the file into a DataTable.
Use SqlBulkCopy to insert the DataTable into the DB.
Repeat 2 & 3 until the file is done.
You'll have to experiment to find an optimal size for chunk_size so start small and work your way up.
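A rough sketch of that loop, assuming the two-column file from the question, a pre-existing staging table dbo.LocationStaging, and a simple comma-delimited format (all assumptions to adjust to your actual setup):

```csharp
using System.Data;
using System.Data.SqlClient;
using System.IO;

class ChunkedImport
{
    public static void Run(string connectionString, string filePath, int chunkSize = 5000)
    {
        // Buffer holding one chunk of parsed rows at a time.
        var chunk = new DataTable();
        chunk.Columns.Add("Id", typeof(int));
        chunk.Columns.Add("CurrentLocation", typeof(string));

        using (var conn = new SqlConnection(connectionString))
        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.LocationStaging" })
        {
            conn.Open();
            foreach (var line in File.ReadLines(filePath))
            {
                // Adjust the parsing to the real file format.
                var parts = line.Split(',');
                chunk.Rows.Add(int.Parse(parts[0]), parts[1].Trim());

                if (chunk.Rows.Count >= chunkSize)
                {
                    bulk.WriteToServer(chunk);  // step 3: push this chunk to the DB
                    chunk.Clear();              // back to step 2 for the next chunk
                }
            }
            if (chunk.Rows.Count > 0)
                bulk.WriteToServer(chunk);      // flush the final partial chunk
        }
    }
}
```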
I'm not sure of an actual limit, if one exists, but why not take "bite sized" chunks of the file that you feel comfortable with and break it into several commands? You can always wrap it in a single transaction if it's important that they all fail or succeed.
Say grab 250 lines at a time, or whatever.
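If you go that route, the transaction wrapping might look something like this sketch; chunkSqlBatches is a stand-in for however you turn each group of ~250 lines into one batch of UPDATE statements:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;

class BatchedUpdate
{
    // Run the per-chunk commands inside one transaction so the whole
    // import either succeeds or fails together.
    public static void Run(string connectionString, IEnumerable<string> chunkSqlBatches)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            {
                try
                {
                    foreach (var batchSql in chunkSqlBatches)
                    {
                        using (var cmd = new SqlCommand(batchSql, conn, tx))
                            cmd.ExecuteNonQuery();
                    }
                    tx.Commit();   // all chunks succeeded
                }
                catch
                {
                    tx.Rollback(); // any failure undoes every chunk
                    throw;
                }
            }
        }
    }
}
```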
I'm currently working on a project where we have a large data warehouse which imports several GB of data on a daily basis from a number of different sources. We have a lot of files with different formats and structures all being imported into a couple of base tables which we then transpose/pivot through stored procs. This part works fine. The initial import however, is awfully slow.
We can't use SSIS File Connection Managers as the columns can be totally different from file to file so we have a custom object model in C# which transposes rows and columns of data into two base tables; one for column names, and another for the actual data in each cell, which is related to a record in the attribute table.
The SQL insert is performed currently by looping through all the data rows and appending the values to a SQL string. This constructs a large dynamic string which is then executed at the end via SqlCommand.
The problem is that even a 1MB file takes about a minute to run in, so when it comes to large files (200MB etc.) it takes hours to process a single file. I'm looking for suggestions on other ways to approach the insert that will improve performance and speed up the process.
There are a few things I can do with the structure of the loop to cut down on the string size and number of SQL commands present in the string but ideally I'm looking for a cleaner, more robust approach. Apologies if I haven't explained myself well, I'll try and provide more detail if required.
Any ideas on how to speed up this process?
The dynamic string is going to be SLOW. Each SQLCommand is a separate call to the database. You are much better off streaming the output as a bulk insertion operation.
I understand that all your files are different formats, so you are having to parse and unpivot in code to get it into your EAV database form.
However, because the output is in a consistent schema, you would be better off either using separate connection managers and the built-in Unpivot transformation, or using a Script Component that adds multiple rows to the data flow's common output (just as you are currently doing when building your SQL INSERT...INSERT...INSERT for each input row) and then letting it all stream into a destination.
i.e. read your data, and in the script source assign the FileID, RowId, AttributeName and Value to multiple output rows (so this still does the unpivot in code, but instead of generating a varying number of INSERTs, you are just adding a varying number of rows to the data flow for each input row).
Then pass that through a lookup to get from AttributeName to AttributeID (erroring the rows with invalid attributes).
Stream straight into an OLEDB destination, and it should be a lot quicker.
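If you go the Script Component route, the unpivot-in-code part looks roughly like the fragment below. This is only a sketch of the method body that lives inside the SSIS-generated ScriptMain class: the Output0Buffer column properties (FileID, RowId, AttributeName, AttributeValue) are generated from whatever output columns you define, and ParseAttributes stands in for the parsing logic you already have in your C# object model.

```csharp
// Inside an SSIS Script Component (transformation) with an asynchronous
// output "Output0". Property names below match hypothetical output columns.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // One wide input row in, one output row per attribute/value pair out.
    foreach (var attribute in ParseAttributes(Row))  // your existing parsing code
    {
        Output0Buffer.AddRow();
        Output0Buffer.FileID = Row.FileID;
        Output0Buffer.RowId = Row.RowId;
        Output0Buffer.AttributeName = attribute.Name;
        Output0Buffer.AttributeValue = attribute.Value;
    }
}
```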
One thought - are you repeatedly going back to the database to look up the appropriate attribute value? If so, switching those repeated queries to lookups against a recordset that you keep on the client side will speed things up enormously.
This is something I have done before - 4 reference tables involved. Creating a local recordset and filtering that as appropriate caused a speed up of a process from 2.5 hours to about 3 minutes.
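For example, the reference lookup can be pulled down once and kept in memory for the rest of the run. A minimal sketch, assuming a hypothetical dbo.Attribute table with AttributeName/AttributeID columns:

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

class AttributeLookupCache
{
    private readonly Dictionary<string, int> _attributeIds =
        new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

    // Pull the reference table down once, instead of issuing one
    // SELECT per row while processing the file.
    public AttributeLookupCache(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT AttributeName, AttributeID FROM dbo.Attribute;", conn)) // placeholder names
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    _attributeIds[reader.GetString(0)] = reader.GetInt32(1);
            }
        }
    }

    public bool TryGetId(string attributeName, out int attributeId) =>
        _attributeIds.TryGetValue(attributeName, out attributeId);
}
```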
Why not store whatever reference tables are needed within each database and perform all lookups on the database end? Or it may even be better to pass a table type into each database where keys are needed, store all reference data in one central database and then perform your lookups there.
I am doing some reading, and came across advice to avoid an internalStore if my application does not need to massage the data before it is sent to SQL. What is a data massage?
Manipulate, process, alter, recalculate. In short, if you are just moving the raw data straight in, there is no need for an internalStore, but if you're doing anything to it prior to storage, then you might want one.
Sometimes the whole process of moving data is referred to as "ETL" meaning "Extract, Transform, Load". Massaging the data is the "transform" step, but it implies ad-hoc fixes that you have to do to smooth out problems that you have encountered (like a massage does to your muscles) rather than transformations between well-known formats.
Things that you might do to "massage" data include (a couple of these are sketched in code after the list):
Change formats from what the source system emits to what the target system expects, e.g. change date format from d/m/y to m/d/y.
replace missing values with defaults, e.g. Supply "0" when a quantity is not given.
Filter out records that are not needed in the target system.
Check validity of records, and ignore or report on rows that would cause an error if you tried to insert them.
Normalise data to remove variations that should be the same, e.g. replace upper case with lower case, replace "01" with "1".
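As a small illustration (not tied to any particular system), here's what a few of those fixes might look like in code; the column choices (a date, a quantity, a code) and the d/M/yyyy source format are just assumptions for the sketch:

```csharp
using System;
using System.Globalization;

static class Massage
{
    // Illustrative only: apply the kinds of fixes listed above to one raw row.
    // Returns null when the row should be rejected/reported instead of inserted.
    public static (DateTime Date, int Quantity, string Code)? CleanRow(
        string rawDate, string rawQuantity, string rawCode)
    {
        // Change formats: source emits d/M/yyyy, target expects a real DateTime.
        if (!DateTime.TryParseExact(rawDate, "d/M/yyyy", CultureInfo.InvariantCulture,
                                    DateTimeStyles.None, out var date))
            return null;   // check validity: skip rows that would error on insert

        // Replace missing or unparsable values with a default of 0.
        int quantity = int.TryParse(rawQuantity, out var q) ? q : 0;

        // Normalise: lower-case and strip leading zeros ("01" -> "1").
        string code = rawCode.Trim().ToLowerInvariant().TrimStart('0');
        if (code.Length == 0) code = "0";

        return (date, quantity, code);
    }
}
```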
Clean up, normalization, filtering, ... Just changing the data somehow from the original input into a form that is better suited to your use.
And finally, there is the less savory practice of massaging the data by throwing out data (or adjusting the numbers) when they don't give you the answer you want. Unfortunately, people doing statistical analysis often massage the data to get rid of those pesky outliers which disprove their theory. Because of this practice, referring to data cleaning as massaging the data is inappropriate. Cleaning the data to make it something that can go into your system (getting rid of meaningless dates like 02/30/2009 because someone else stored them in varchar instead of as dates, separating first and last names into separate fields, fixing all-uppercase data, adding default values for fields that require data when the supplied data isn't given, etc.) is one thing; massaging the data implies adjusting the data inappropriately.
Also, to comment on the idea that it is bad to have an internal store if you are not changing any data: I strongly disagree with this (and I have loaded thousands of files from hundreds of sources through the years). In the first place, there is virtually no data that doesn't need to at least be examined for cleaning. And the fact that it was OK on the first run doesn't guarantee that a year later it won't be putting garbage into your system. Loading any file without first putting it into a staging table and cleaning it is simply irresponsible.
Also, we find it easier to research data issues if we can easily see the contents of the file we loaded in a staging table. Then we can pinpoint exactly which file/source gave us the data in question, and that resolves many issues where the customer thinks we loaded bad information that they actually sent us to load. In fact, we always use two staging tables: one for the raw data as it came in from the file, and one for the data after cleaning but before loading to the production tables. As a result I can resolve issues in seconds or minutes that would otherwise take hours of searching through the original files. Because one thing you can guarantee is that if you are importing data, there will be times when the content of that data is questioned.