I am currently working on a project to parse an Excel sheet and insert any values into a database which were not inserted previously. The sheet contains roughly 80 date-value series for different names, averaging about 1,500 rows per series.
Each name has 5 date-value pairs entered manually at the end of the week. Over the weekend, my process will parse the Excel file and insert any values that are not currently in the database.
My question is, given the large amount of total data and the small amount added each week, how would you determine easily which values need to be inserted? I have considered adding another table to store the last date inserted for each name and taking any rows after that.
Simplest solution: I would bring it all into a staging table and do the compare on the server. Alternatively, SSIS with an appropriate sort and lookup could determine the differences and insert them.
120,000 rows is not significant to compare in the database using SQL, but 120,000 individual calls to the database to verify whether each row is already there could take a while from the client side.
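For illustration, a minimal sketch of that staging compare, assuming a target table dbo.Readings(Name, ReadingDate, Value) and a staging table dbo.Readings_Staging that the weekend job bulk-loads from the spreadsheet (all names here are assumptions):
-- Copy across only the staged rows that are not already in the target table.
INSERT INTO dbo.Readings (Name, ReadingDate, Value)
SELECT s.Name, s.ReadingDate, s.Value
FROM dbo.Readings_Staging AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.Readings AS r
    WHERE r.Name = s.Name
      AND r.ReadingDate = s.ReadingDate
);

TRUNCATE TABLE dbo.Readings_Staging;  -- clear staging for the next run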
Option 1 would be to create a "lastdate" table that is automatically stamped at the end of your weekend import. The next week, your program could query the last record in that table and then only read rows from the Excel file after that date. This is probably your best bet.
Option 2 would be to find a unique field in the data and check, row by row, whether that key exists in the database. If it doesn't exist, you add it; if it does, you skip it. This would be my second choice if Option 1 didn't work the way you expect.
It all depends how bulletproof your solution needs to be. If you trust the users that the spreadsheet will not be tweaked in any way that would make it inconsistent, then your solution would be fine.
If you want to be on the safe side (e.g. if some old values could potentially change), you would need to compare the whole thing with the database. To be honest, the amount of data you are talking about here doesn't seem very big, especially when your process will run on a weekend. And you can still optimize by writing "batch"-style stored procs for the database.
Thanks for the answers all.
I have decided, rather than creating a new table that stores the last date, I will just select the max date for each name, then insert values after that date into the table.
This assumes that the data prior to the last date remains consistent, which should be fine for this problem.
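For reference, the query that approach boils down to can be as simple as this (table and column names are assumed):
SELECT Name, MAX(ReadingDate) AS LastDate
FROM dbo.Readings
GROUP BY Name;
The import then only inserts spreadsheet rows for each name dated after its LastDate.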
I have developed an application in C# (ASP.NET 4.0) using SQL Server 2014 as the database.
I have a question about rolling up and summing data. The data comes in via CSV over FTP and I import the raw rows into a table. The data is logged every minute by the customer and is identified by customer ID.
I have now been asked to take that data and sum the time series data into 15-minute chunks from the hour.
They then want that data rolled up into days (midnight to midnight), then into weeks (Monday to Sunday). They also want the day data rolled up into calendar months (midnight to midnight), and the month data rolled up into a year.
The idea is that the raw time series data is grouped into its constituent periods, such as day, week, month so they can see the total for that time period.
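For concreteness, a set-based 15-minute rollup along those lines might look like the sketch below, assuming a raw table dbo.RawReadings(CustomerId, LogTime, Value); days, weeks, months, and years follow the same GROUP BY pattern with coarser date expressions:
-- Floor each timestamp to its 15-minute boundary and sum per customer per bucket.
SELECT
    CustomerId,
    DATEADD(MINUTE, (DATEDIFF(MINUTE, 0, LogTime) / 15) * 15, 0) AS PeriodStart,
    SUM(Value) AS TotalValue
FROM dbo.RawReadings
GROUP BY
    CustomerId,
    DATEADD(MINUTE, (DATEDIFF(MINUTE, 0, LogTime) / 15) * 15, 0);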
I have looked at cursors and loops in SQL and I have been told the overhead will be too great as I already have 300 million rows and counting to process.
I don't know whether I should develop a service in C# that does it all on the server or use the database. The research I have done contradicts itself slightly in each case.
Any hints on where to look and what to try would be great.
I believe you are looking more for a design than a solution here.
I would suggest you create a table (call it table1) that will hold the data from the FTP load along with a batch ID (a unique identifier).
Create another table (table2) that records each batch ID with a status column; always insert a row here once you are done with the FTP load into table1, with the status set to 'N'.
Now create a polling script from C# (or, if you are experienced with Service Broker in SQL Server, use that) to poll table2 for batch IDs with status 'N'. This polling script should call the stored procedure below.
Now create another stored procedure which will sum up all the records for this batch ID only, and add the values to the daily counts appropriately.
The same will be done for the weekly counts and so on.
Once all this is done, remove the information from table1 for the batch ID we processed; if you need this info for future purposes you can store it in a different table.
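A rough sketch of what that batch procedure might look like (the FtpLoad, BatchStatus, and DailySummary names and all columns are assumptions for illustration):
CREATE PROCEDURE dbo.ProcessBatch
    @BatchId uniqueidentifier
AS
BEGIN
    SET NOCOUNT ON;

    -- Roll the raw minute-level rows for this batch up to one row per customer per day.
    INSERT INTO dbo.DailySummary (CustomerId, SummaryDate, TotalValue)
    SELECT CustomerId, CAST(LogTime AS date), SUM(Value)
    FROM dbo.FtpLoad
    WHERE BatchId = @BatchId
    GROUP BY CustomerId, CAST(LogTime AS date);

    -- Mark the batch as processed so the poller does not pick it up again.
    UPDATE dbo.BatchStatus
    SET Status = 'Y'
    WHERE BatchId = @BatchId;
END;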
To keep control of the data and stay ready for any future change in business rules, you need to add some control columns to the table.
These control columns manage the period/hour/day/month/year, or whatever else you need in the future.
When you add a row, fill the corresponding control fields, one at a time, with the corresponding value:
period 1..4
hour 1..24
day 1..366
week 1..55
month 1..12
year (if needed)
You can define a set of SQL functions to fill these columns at once (during data loading from the file).
Create indexes on these columns.
Once you do this, you can do the summing up dynamically, in C# code or SQL, for any period/hour/day and so on.
You can take advantage of Analysis Services, window functions, or pivots to work your magic :) on the data for any interval.
This approach gives you the power of keeping the data (no deletion, except for archiving purposes) and of managing changes in the future.
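A minimal sketch of those control columns, using the same assumed raw table as above and plain columns filled at load time (names and types are illustrative):
ALTER TABLE dbo.RawReadings ADD
    PeriodOfHour tinyint NULL,   -- 1..4 (15-minute slot within the hour)
    HourOfDay    tinyint NULL,   -- 1..24
    DayOfYear    smallint NULL,  -- 1..366
    WeekOfYear   tinyint NULL,
    MonthOfYear  tinyint NULL,   -- 1..12
    YearNumber   smallint NULL;

-- Fill the control columns for newly loaded rows
-- (for hundreds of millions of existing rows you would do this in batches).
UPDATE dbo.RawReadings
SET PeriodOfHour = DATEPART(MINUTE, LogTime) / 15 + 1,
    HourOfDay    = DATEPART(HOUR, LogTime) + 1,
    DayOfYear    = DATEPART(DAYOFYEAR, LogTime),
    WeekOfYear   = DATEPART(WEEK, LogTime),
    MonthOfYear  = DATEPART(MONTH, LogTime),
    YearNumber   = DATEPART(YEAR, LogTime)
WHERE YearNumber IS NULL;

CREATE INDEX IX_RawReadings_Rollup
    ON dbo.RawReadings (YearNumber, MonthOfYear, DayOfYear, HourOfDay, PeriodOfHour);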
So I'm upgrading an old parser right now. It's written in C# and uses SQL to insert records into a database.
Currently it reads and parses a few thousand lines of data from a file, then inserts the new data into a database containing over a million records.
Sometimes it can take over 10 minutes just to add a few thousand lines.
I've come to the conclusion that the performance bottleneck is a SQL command that uses an IF NOT EXISTS check to determine whether the row being inserted already exists, and inserts the record only if it doesn't.
I believe the problem is that it simply takes too long to run IF NOT EXISTS for every single row of new data.
Is there a faster way to determine whether data exists already or not?
I was thinking of inserting all of the records first using the SqlBulkCopy class, then running a stored procedure to remove the duplicates.
Does anyone else have any suggestions or methods to do this as efficiently and quickly as possible? Anything would be appreciated.
EDIT: To clarify, I'd run a stored procedure (on the large table) after copying the new data into the large table
large table = 1,000,000+ rows
1. Create an IDataReader to loop over your source data.
2. Place the values into a strongly typed DataSet.
3. Every N rows, send the DataSet (via .GetXml) to a stored procedure. Let's say 1000 for the heck of it.
4. Have the stored procedure shred the XML.
5. Do your INSERT/UPDATE based on this shredded XML.
6. Return from the procedure and keep looping until you're done.
Here is an older example:
http://granadacoder.wordpress.com/2009/01/27/bulk-insert-example-using-an-idatareader-to-strong-dataset-to-sql-server-xml/
The key is that you are doing "bulk" operations instead of going row by row, and you can pick a sweet-spot batch size (1000, for example) that gives you the best performance.
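For what it's worth, a rough sketch of the shredding procedure, assuming the DataSet's table is named Row with Key and Value columns (so .GetXml produces <NewDataSet><Row><Key>...</Key><Value>...</Value></Row>...) and an illustrative target table:
CREATE PROCEDURE dbo.ImportBatch
    @Batch xml
AS
BEGIN
    SET NOCOUNT ON;

    -- Shred the xml into a rowset once, then insert only the keys that are new.
    ;WITH incoming AS (
        SELECT
            r.value('(Key)[1]',   'varchar(50)')  AS [Key],
            r.value('(Value)[1]', 'varchar(200)') AS [Value]
        FROM @Batch.nodes('/NewDataSet/Row') AS t(r)
    )
    INSERT INTO dbo.TargetTable ([Key], [Value])
    SELECT i.[Key], i.[Value]
    FROM incoming AS i
    WHERE NOT EXISTS (SELECT 1 FROM dbo.TargetTable AS x WHERE x.[Key] = i.[Key]);
END;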
I have about 1.5 million files I need to insert records for in the database.
Each record is inserted with a key that includes the name of the file.
The catch: The files are not uniquely identified currently.
So, what we'd like to do is, for each file:
Insert a record. One of the fields in the record should be an Amazon S3 key that includes the ID of the newly inserted record.
Rename the file to include the ID so that it matches the format of the key.
The best thing I can think to do is:
Run an individual insert command that returns the ID of the added row.
Add that back as a property to the individual business object I'm looping through.
Generate an update statement that updates the S3 key to include the ID
Output the file, concatenating the ID onto the end of the file name.
As far as I can tell, that looks to be:
1.5 million insert statements
1.5 million individual SqlCommand executions and reads (because we need the ID back)
1.5 million times setting a property on an object
1.5 million update statements generated and executed
Perhaps I could make this one giant concatenated update statement to do them all at once; not sure if that helps
1.5 million file copies.
I can't get around the actual file part, but for the rest, is there a better strategy I'm not seeing?
If you make the client application generate the IDs you can use a straight-forward SqlBulkCopy to insert all rows at once. It will be done in seconds.
If you want to keep the IDENTITY property of the column, you can run DBCC CHECKIDENT with RESEED to advance the identity counter by 1.5 million, giving you a guaranteed gap that you can insert into. If the number of rows is not statically known, you can perform the inserts in smaller chunks of maybe 100k until you are done.
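The reseed itself is a one-liner; a sketch (table name and numbers are only illustrative):
DBCC CHECKIDENT ('dbo.Files', NORESEED);         -- reports the current identity value, say 1000000
DBCC CHECKIDENT ('dbo.Files', RESEED, 2500000);  -- move the seed 1.5 million higher to reserve a gap
The client then assigns IDs 1,000,001 through 2,500,000 itself and bulk-copies the rows with SqlBulkCopy using the KeepIdentity option.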
You will cut the number of SQL statements in half by not relying on the database to generate your ID for each row. Do everything locally (including the assignment of an ID) and then do a single batch of inserts at the end, with IDENTITY_INSERT on.
This will cause SQL Server to use your IDs for this batch of records.
If this is still too slow (and 1.5 million inserts might be), the next step would be to output your data to a text file (XML, comma delimited, or whatever) and then do a bulk import operation on the file.
That's as fast as you will be able to make it, I think.
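And if you do go down the flat-file route, the server-side half of that bulk import can be as simple as this (the path, table name, and options are assumptions):
BULK INSERT dbo.Files
FROM 'C:\import\files.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    TABLOCK,            -- allows a minimally logged load in the right recovery model
    BATCHSIZE = 100000
);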
I'm migrating data from one system to another and will be receiving a CSV file with the data to import. The file could contain up to a million records to import. I need to get each line in the file, validate it and put the data into the relevant tables. For example, the CSV would be like:
Mr,Bob,Smith,1 high street,London,ec1,012345789,work(this needs to be looked up in another table to get the ID)
There's a lot more data than this example in the real files.
So, the SQL would be something like this:
Declare @UserID int
Insert into [User]
Values ('Mr', 'Bob', 'Smith', '0123456789')
Set @UserID = @@Identity
Insert into Address
Values ('1 high street', 'London', 'ec1', (select ID from AddressType where AddressTypeName = 'work'))
I was thinking of iterating over each row and calling a stored procedure, with the parameters from the file, that contains the SQL above. Would this be the best way of tackling this? It's not time critical, as this will just be run once when updating a site.
I'm using C# and SQL Server 2008 R2.
What about loading it into a temporary table (note that this may be logically temporary, not necessarily technically) as staging, then processing it from there? This is standard ETL behavior (and a million rows is tiny for ETL): you first stage the data, then clean it, then put it in its final place.
When performing tasks of this nature, you do not think in terms of rotating through each record individually, as that will be a huge performance problem. In this case you bulk insert the records into a staging table, or use the import wizard to load a staging table (watch out for the default 50-character column width, especially in the address field). Then you write set-based code to do any cleanup you need: removing bad telephone numbers, zip codes, or email addresses, fixing states, dropping records missing data in fields that are required in your database, or transforming data using lookup tables.
Suppose you have a table with certain required values; those are likely not the same values you will find in this file, so you need to convert them. We use doctor specialties a lot, so our system might store them as "GP" but the file might give us a value of "General Practitioner". You need to look at all the non-matching values for the field and then determine whether you can map them to existing values, whether you need to throw the record out, or whether you need to add more values to your lookup table.
Once you have gotten rid of the records you don't want and cleaned up those you can in your staging table, you import to the prod tables. Inserts should be written using the SELECT version of INSERT, not the VALUES clause, when you are writing more than one or two records.
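As a rough illustration of that last point, assuming the CSV has been bulk-loaded into a staging table dbo.ImportStaging, the address-type lookup from the question becomes a join and the load is a single INSERT...SELECT (the staging and target column names here are guesses):
INSERT INTO dbo.Address (Street, City, Postcode, AddressTypeID)
SELECT s.Street, s.City, s.Postcode, atype.ID
FROM dbo.ImportStaging AS s
JOIN dbo.AddressType AS atype
    ON atype.AddressTypeName = s.AddressTypeName;  -- rows with unknown types drop out of an inner join; check for them first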
I'm currently working on a project where we have a large data warehouse which imports several GB of data on a daily basis from a number of different sources. We have a lot of files with different formats and structures all being imported into a couple of base tables which we then transpose/pivot through stored procs. This part works fine. The initial import however, is awfully slow.
We can't use SSIS File Connection Managers as the columns can be totally different from file to file so we have a custom object model in C# which transposes rows and columns of data into two base tables; one for column names, and another for the actual data in each cell, which is related to a record in the attribute table.
(Example data files and the corresponding DB tables were shown here.)
The SQL insert is performed currently by looping through all the data rows and appending the values to a SQL string. This constructs a large dynamic string which is then executed at the end via SqlCommand.
The problem is that even a 1 MB file takes about a minute to run, so large files (200 MB, etc.) take hours to process. I'm looking for suggestions on other ways to approach the insert that would improve performance and speed up the process.
There are a few things I can do with the structure of the loop to cut down on the string size and the number of SQL commands in the string, but ideally I'm looking for a cleaner, more robust approach. Apologies if I haven't explained myself well; I'll try to provide more detail if required.
Any ideas on how to speed up this process?
The dynamic string is going to be SLOW. Each SQLCommand is a separate call to the database. You are much better off streaming the output as a bulk insertion operation.
I understand that all your files are different formats, so you are having to parse and unpivot in code to get it into your EAV database form.
However, because the output is in a consistent schema, you would be better off either using separate connection managers with the built-in Unpivot transform, or using a script task that adds multiple rows to the data flow for the common output (just like you are currently doing when building your SQL INSERT...INSERT...INSERT for each input row) and then letting it all stream into a destination.
That is, read your data and, in the script source, assign the FileID, RowId, AttributeName and Value to multiple output rows. This does the unpivot in code, but instead of generating a varying number of INSERT statements, you are just adding a varying number of rows to the data flow for each input row.
Then pass that through a lookup to get from AttributeName to AttributeID (erroring the rows with invalid attributes).
Stream straight into an OLEDB destination, and it should be a lot quicker.
One thought - are you repeatedly going back to the database to find the appropriate attribute value? If so, switching those repeated queries to lookups against a recordset that you keep on the client side will speed things up enormously.
This is something I have done before - four reference tables were involved. Creating a local recordset and filtering it as appropriate sped a process up from 2.5 hours to about 3 minutes.
Why not store whatever reference tables are needed within each database and perform all lookups on the database end? Or it may even be better to pass a table type into each database where keys are needed, store all reference data in one central database and then perform your lookups there.
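If you go the table-type route, a minimal sketch might look like this (the type, procedure, and Attribute table names are assumptions): the client sends every attribute name from the file in one call and gets the IDs back in one result set.
CREATE TYPE dbo.AttributeNameList AS TABLE (AttributeName varchar(100) NOT NULL);
GO
CREATE PROCEDURE dbo.ResolveAttributeIds
    @Names dbo.AttributeNameList READONLY
AS
BEGIN
    SET NOCOUNT ON;

    -- Resolve all names to IDs in a single set-based lookup.
    SELECT n.AttributeName, a.AttributeID
    FROM @Names AS n
    JOIN dbo.Attribute AS a
        ON a.AttributeName = n.AttributeName;
END;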