I use the C# SqlBulkCopy class to load a big XML file into SQL Server. I've implemented an IDataReader that loops through the XML and gets the values.
The file contains a lot of tables, so I have to call the SqlBulkCopy.WriteToServer method once for every table in the source XML file. Each time, the DataReader loops through the whole file, which takes a lot of time.
How can I improve the performance of my app? Is there a better way to do what I want?
Here is the plan of my program:
1. Loop through the source file to determine the tables and their columns (and data types).
2. Create the tables on SQL Server.
3. Load the data into SQL Server by looping through the source file and getting the values for each table I've determined, one by one.
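Roughly, step 3 looks like the sketch below. XmlTableReader and TableInfo are simplified stand-ins for my IDataReader implementation and for the table metadata collected in step 1.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static void LoadAllTables(string connectionString, string xmlPath, IEnumerable<TableInfo> tables)
{
    foreach (TableInfo table in tables)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var bulkCopy = new SqlBulkCopy(connection))
            using (IDataReader reader = new XmlTableReader(xmlPath, table)) // my custom IDataReader
            {
                bulkCopy.DestinationTableName = table.Name;
                // Each call scans the entire source XML again to pull out
                // just this table's rows - this is where all the time goes.
                bulkCopy.WriteToServer(reader);
            }
        }
    }
}
```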
So I'm upgrading an old parser right now. It's written in C# and uses SQL to insert records into a database.
Currently it reads and parses a few thousand lines of data from a file, then inserts the new data into a database containing over a million records.
Sometimes it can take over 10 minutes just to add a few thousand lines.
I've come to the conclusion that this performance bottleneck is a SQL command that uses an IF NOT EXISTS check to determine whether the row being inserted already exists, and inserts the record only if it doesn't.
I believe the problem is that it simply takes way too long to run that IF NOT EXISTS check on every single row in the new data.
Is there a faster way to determine whether data exists already or not?
I was thinking of inserting all of the records first anyway using the SqlBulkCopy class, then running a stored procedure to remove the duplicates.
Does anyone else have any suggestions or methods to do this as efficiently and quickly as possible? Anything would be appreciated.
EDIT: To clarify, I'd run a stored procedure (on the large table) after copying the new data into the large table.
large table = 1,000,000+ rows
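Roughly, what I had in mind is the sketch below. The table, column and key names are placeholders, and the DELETE would really live in the stored procedure.

```csharp
using System.Data;
using System.Data.SqlClient;

static void BulkInsertThenDedupe(string connectionString, DataTable newRows)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();

        // 1. Dump all the new records into the big table with SqlBulkCopy.
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "dbo.BigTable"; // placeholder name
            bulkCopy.WriteToServer(newRows);
        }

        // 2. Remove the duplicates afterwards (this would be the stored procedure).
        //    One way to write it: keep the lowest Id per duplicate key.
        const string dedupeSql = @"
            WITH Ranked AS (
                SELECT Id,
                       ROW_NUMBER() OVER (PARTITION BY KeyColumn ORDER BY Id) AS rn
                FROM dbo.BigTable
            )
            DELETE FROM Ranked WHERE rn > 1;";

        using (var command = new SqlCommand(dedupeSql, connection))
        {
            command.CommandTimeout = 0; // the table has 1,000,000+ rows
            command.ExecuteNonQuery();
        }
    }
}
```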
1. Create an IDataReader to loop over your source data.
2. Place the values into a strong dataset.
3. Every N rows, send the dataset (.GetXml) to a stored procedure. Let's say 1,000 for the heck of it.
4. Have the stored procedure shred the XML.
5. Do your INSERT/UPDATE based on this shredded XML.
6. Return from the procedure, keep looping until you're done.
Here is an older example:
http://granadacoder.wordpress.com/2009/01/27/bulk-insert-example-using-an-idatareader-to-strong-dataset-to-sql-server-xml/
The key is that you are doing "bulk" operations instead of row-by-row ones. And you can pick a sweet-spot batch size (1,000, for example) that gives you the best performance.
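Here is a rough sketch of that loop; the typed dataset, its column accessors and the procedure name are placeholders (the linked post has a complete version).

```csharp
using System.Data;
using System.Data.SqlClient;

static void BatchToProcedure(IDataReader source, string connectionString)
{
    const int batchSize = 1000;              // the "sweet spot" number
    var ds = new MyStrongDataSet();          // strongly-typed dataset (placeholder)
    int rowsInBatch = 0;

    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();

        while (source.Read())
        {
            // Copy the current row from the IDataReader into the typed dataset.
            ds.MyTable.AddMyTableRow(source.GetInt32(0), source.GetString(1));

            if (++rowsInBatch == batchSize)
            {
                SendBatch(connection, ds.GetXml());  // the proc shreds the XML and does INSERT/UPDATE
                ds.Clear();
                rowsInBatch = 0;
            }
        }

        if (rowsInBatch > 0)
            SendBatch(connection, ds.GetXml());      // last partial batch
    }
}

static void SendBatch(SqlConnection connection, string xml)
{
    using (var command = new SqlCommand("dbo.uspImportBatch", connection)) // placeholder proc name
    {
        command.CommandType = CommandType.StoredProcedure;
        command.Parameters.Add("@xmlDoc", SqlDbType.Xml).Value = xml;
        command.ExecuteNonQuery();
    }
}
```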
I have an Excel file which I am required to parse, validate and then load into a SQL Server database using Interop. The application works: it reads a sheet, reads each line (row and columns), and adds that line to a List as an INSERT statement. When I reach the end of the worksheet, I execute all of the INSERT statements as one batch.
The problem I have is that it uses a lot of RAM when the worksheet is big (1,000+ rows). Is there a better or more efficient strategy for larger data? Should I be committing more often and clearing the List?
I don't think there is much you can do on the parsing side (unless you are coding it all yourself), but I'd INSERT the data as soon as a row is available. There is no need to store it in a list. In your solution, you are basically storing all the data twice (once in "Excel memory" and once in "database insert memory").
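Something along these lines, for example. The table, columns and Interop cell access below are simplified placeholders.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

static void ImportSheet(dynamic worksheet, int rowCount, string connectionString)
{
    const string insertSql =
        "INSERT INTO dbo.ImportTable (ColA, ColB) VALUES (@a, @b)"; // placeholder table/columns

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(insertSql, connection))
    {
        command.Parameters.Add("@a", SqlDbType.NVarChar, 100);
        command.Parameters.Add("@b", SqlDbType.NVarChar, 100);
        connection.Open();

        for (int row = 2; row <= rowCount; row++)    // row 1 assumed to be the header
        {
            // Read the cells via Interop and insert immediately -
            // nothing is accumulated in a List.
            command.Parameters["@a"].Value = Convert.ToString(worksheet.Cells[row, 1].Value2);
            command.Parameters["@b"].Value = Convert.ToString(worksheet.Cells[row, 2].Value2);
            command.ExecuteNonQuery();
        }
    }
}
```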
I'm receiving and parsing a large text file.
In that file I have a numerical ID identifying a row in a table, and another field that I need to update.
ID   Current Location
=========================
1    Boston
2    Cambridge
3    Idaho
I was thinking of creating a single SQL command string and firing it off using ADO.NET, but some of the files I'm going to receive have thousands of lines. Is this doable, or is there a limit I'm not seeing?
If you may have thousands of lines, then composing a SQL statement is definitely NOT the way to go. Better code-based alternatives include:
1. Use SqlBulkCopy to insert the change data into a staging table and then UPDATE your target table using the staging table as the source. It also has excellent batching options (unlike the other choices).
2. Write a stored procedure to do the UPDATE that accepts an XML parameter containing the UPDATE data.
3. Write a stored procedure to do the UPDATE that accepts a table-valued parameter containing the UPDATE data.
I have not compared them myself but it is my understanding that #3 is generally the fastest (though #1 is plenty fast for almost any need).
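As a rough sketch of #1 (the staging and target table names are invented; the columns follow your example):

```csharp
using System.Data;
using System.Data.SqlClient;

static void BulkUpdateViaStaging(DataTable changes, string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();

        // 1. Bulk copy the parsed (ID, CurrentLocation) pairs into a staging table.
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "dbo.LocationStaging"; // invented name
            bulkCopy.BatchSize = 5000;
            bulkCopy.WriteToServer(changes);
        }

        // 2. Update the target table from the staging table in one set-based statement.
        const string updateSql = @"
            UPDATE t
            SET    t.CurrentLocation = s.CurrentLocation
            FROM   dbo.Locations t
            JOIN   dbo.LocationStaging s ON s.ID = t.ID;

            TRUNCATE TABLE dbo.LocationStaging;";

        using (var command = new SqlCommand(updateSql, connection))
        {
            command.ExecuteNonQuery();
        }
    }
}
```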
Writing one huge INSERT statement will be very slow. You also don't want to parse the whole massive file at once. What you need to do is something along these lines:
1. Figure out a good chunk size. Let's call it chunk_size. This will be the number of records you'll read from the file at a time.
2. Load chunk_size records from the file into a DataTable.
3. Use SqlBulkCopy to insert the DataTable into the DB.
4. Repeat 2 & 3 until the file is done.
You'll have to experiment to find an optimal size for chunk_size so start small and work your way up.
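A sketch of steps 2-4, assuming a tab-delimited file with the two columns from your example (adjust the parsing and target table name to your real format):

```csharp
using System.Data;
using System.Data.SqlClient;
using System.IO;

static void LoadFileInChunks(string path, string connectionString, int chunkSize)
{
    DataTable chunk = CreateChunkTable();

    using (var reader = new StreamReader(path))
    using (var bulkCopy = new SqlBulkCopy(connectionString))
    {
        bulkCopy.DestinationTableName = "dbo.TargetTable"; // placeholder
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            string[] fields = line.Split('\t');      // assumes tab-delimited input
            chunk.Rows.Add(int.Parse(fields[0]), fields[1]);

            if (chunk.Rows.Count == chunkSize)
            {
                bulkCopy.WriteToServer(chunk);       // step 3
                chunk.Clear();                       // reuse the table for the next chunk
            }
        }

        if (chunk.Rows.Count > 0)
            bulkCopy.WriteToServer(chunk);           // final partial chunk
    }
}

static DataTable CreateChunkTable()
{
    var table = new DataTable();
    table.Columns.Add("ID", typeof(int));
    table.Columns.Add("CurrentLocation", typeof(string));
    return table;
}
```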
I'm not sure of an actual limit, if one exists, but why not take "bite-sized" chunks of the file that you feel comfortable with and break it into several commands? You can always wrap them in a single transaction if it's important that they all succeed or fail together.
Say, grab 250 lines at a time, or whatever.
How can I increase the performance of exporting a database with tables in one-to-many relationships (and in some cases many-to-many relationships) into a single Excel file?
Right now, I get all the data from the database and process it into a table using a few for loops, then I change the header of the HTML file to download it as an Excel file. But it takes a while for the number of records I have (about 300 records).
I was just wondering if there is a faster way to improve performance.
Thanks
It sounds like you're loading each table into memory with your C# code, and then building a flat table by looping through the data. A vastly simpler and faster way to do that would be to use a SQL query with a few JOINs in it:
http://www.w3schools.com/sql/sql_join.asp
http://en.wikipedia.org/wiki/Join_(SQL)
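For example, a single query along these lines replaces the nested loops (the table and column names are only illustrative):

```csharp
using System.Data;
using System.Data.SqlClient;

static DataTable LoadFlattenedReport(string connectionString)
{
    // One set-based query instead of nested for loops in C#.
    // The tables and columns below are only illustrative.
    const string sql = @"
        SELECT o.OrderId, o.OrderDate, c.CustomerName, i.ProductName, i.Quantity
        FROM   dbo.Orders o
        JOIN   dbo.Customers  c ON c.CustomerId = o.CustomerId
        JOIN   dbo.OrderItems i ON i.OrderId    = o.OrderId
        ORDER BY o.OrderId;";

    var result = new DataTable();
    using (var connection = new SqlConnection(connectionString))
    using (var adapter = new SqlDataAdapter(sql, connection))
    {
        adapter.Fill(result);   // Fill opens and closes the connection itself
    }
    return result;
}
```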
Also, I get the impression that you're rendering the resulting flat table to HTML and then saving that as an Excel file. There are several ways to create that Excel (or CSV) file directly, without having to turn it into an HTML table first.
I'm currently working on a project where we have a large data warehouse which imports several GB of data on a daily basis from a number of different sources. We have a lot of files with different formats and structures all being imported into a couple of base tables which we then transpose/pivot through stored procs. This part works fine. The initial import however, is awfully slow.
We can't use SSIS File Connection Managers as the columns can be totally different from file to file, so we have a custom object model in C# which transposes rows and columns of data into two base tables: one for column names, and another for the actual data in each cell, which is related to a record in the attribute table.
(Example data files and DB tables omitted here.)
The SQL insert is currently performed by looping through all the data rows and appending the values to a SQL string. This builds one large dynamic string, which is then executed at the end via SqlCommand.
The problem is that even a 1 MB file takes about a minute to run, so large files (200 MB etc.) take hours to process. I'm looking for suggestions on other ways to approach the insert that will improve performance and speed up the process.
There are a few things I can do to the structure of the loop to cut down the string size and the number of SQL commands in the string, but ideally I'm looking for a cleaner, more robust approach. Apologies if I haven't explained myself well; I'll try to provide more detail if required.
Any ideas on how to speed up this process?
The dynamic string is going to be SLOW. Each SqlCommand is a separate call to the database. You are much better off streaming the output as a bulk insert operation.
I understand that all your files are different formats, so you are having to parse and unpivot in code to get it into your EAV database form.
However, because the output is in a consistent schema, you would be better off either using separate connection managers with the built-in Unpivot transformation, or using a script component that adds multiple rows to the data flow for the common output (just like you are currently building your SQL INSERT...INSERT...INSERT for each input row), and then letting it all stream into a destination.
i.e. read your data in the script source and assign the FileID, RowId, AttributeName and Value to multiple output rows (so this is doing the unpivot in code, but instead of generating a varying number of INSERTs, you are just adding a varying number of rows to the data flow for each input row).
Then pass that through a lookup to get from AttributeName to AttributeID (erroring the rows with invalid attributes).
Stream straight into an OLEDB destination, and it should be a lot quicker.
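In the script source, the unpivot might look roughly like this; ParseSourceFile, ParsedRow and the output column names are assumptions based on your description.

```csharp
// Inside an SSIS Script Component configured as a source.
// Output0 is assumed to have the columns FileID, RowId, AttributeName and Value.
public override void CreateNewOutputRows()
{
    foreach (ParsedRow row in ParseSourceFile())        // your existing parsing code (assumed)
    {
        foreach (var attribute in row.Attributes)       // one output row per attribute
        {
            Output0Buffer.AddRow();
            Output0Buffer.FileID        = row.FileId;
            Output0Buffer.RowId         = row.RowId;
            Output0Buffer.AttributeName = attribute.Name;
            Output0Buffer.Value         = attribute.Value;
        }
    }
}
```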
One thought: are you repeatedly going back to the database to find the appropriate attribute value? If so, switching the repeated queries to a query against a recordset that you keep on the client side will speed things up enormously.
This is something I have done before, with 4 reference tables involved. Creating a local recordset and filtering it as appropriate sped a process up from 2.5 hours to about 3 minutes.
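For example, load the attribute reference table once into a dictionary and resolve AttributeName to AttributeID in memory (the table and column names are assumed):

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;

static Dictionary<string, int> LoadAttributeLookup(string connectionString)
{
    // One round trip to fetch the whole reference table; every subsequent
    // AttributeName -> AttributeID lookup is an in-memory dictionary hit.
    var lookup = new Dictionary<string, int>();

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "SELECT AttributeName, AttributeID FROM dbo.Attributes", connection)) // assumed table
    {
        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
                lookup[reader.GetString(0)] = reader.GetInt32(1);
        }
    }
    return lookup;
}

// Usage while processing a file:
//   int attributeId = attributeLookup[attributeName];  // no database call per row
```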
Why not store whatever reference tables are needed within each database and perform all lookups on the database end? Or it may even be better to pass a table type into each database where keys are needed, store all reference data in one central database and then perform your lookups there.
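A rough sketch of passing a table type from C# (the table type and procedure names are made up):

```csharp
using System.Data;
using System.Data.SqlClient;

static void ResolveKeysViaTableType(DataTable attributeNames, string connectionString)
{
    // attributeNames has a single column matching the user-defined table type, e.g.
    //   CREATE TYPE dbo.AttributeNameList AS TABLE (AttributeName NVARCHAR(100));
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand("dbo.uspResolveAttributeIds", connection)) // made-up proc
    {
        command.CommandType = CommandType.StoredProcedure;

        SqlParameter tvp = command.Parameters.AddWithValue("@names", attributeNames);
        tvp.SqlDbType = SqlDbType.Structured;
        tvp.TypeName = "dbo.AttributeNameList";          // made-up table type

        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                // AttributeName -> AttributeID pairs resolved on the server in one call.
            }
        }
    }
}
```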