I have a folder that contains more than 100 .txt files. These files contain a huge amount of data that needs to be consolidated and updated in a SQL Server table. Each line in a text file has only two columns that interest me; based on the first column's data (which is the PK in the SQL Server table), I want to update the second column's data in the DB table.
Please suggest the best way to do it in C# 3.0
Presently I am using a StringBuilder and appending the Update query.
I think you should first use bulk copy to import the data into SQL Server, and then write a stored procedure to apply the updates. Building the query string and then running it will make the whole process slow.
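A minimal sketch of that approach, assuming a staging table named Staging_Import and a target table TargetTable keyed on KeyCol (all names and types are placeholders, not from the question):

```sql
-- Hypothetical staging table matching the two columns of interest
CREATE TABLE Staging_Import (
    KeyCol   INT         NOT NULL,  -- matches the PK of the target table
    ValueCol VARCHAR(50) NOT NULL
);

-- After SqlBulkCopy has loaded the staging table from C#,
-- one set-based statement applies all the updates at once:
UPDATE t
SET    t.ValueCol = s.ValueCol
FROM   TargetTable t
JOIN   Staging_Import s ON s.KeyCol = t.KeyCol;
```

The point of the staging table is that a single joined UPDATE replaces thousands of individual UPDATE statements built in a StringBuilder.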
I need to insert a large amount of data into my SQL Server database each day. The data is inserted from a file whose rows represent data for the current year. Some of these rows may change, and new rows are added to the file each day, so I need to check whether a row in the file has changed and update the database accordingly, and always insert the new rows.
So, what approach do you recommend (clear the database and bulk insert, read line-by-line and insert in C#, SSIS, etc.)?
Based on your comments, I would just drop the table and reload the CSV every night using SSIS, then create fresh indexes as part of the nightly job.
If each CSV contains all of the relevant information, then this is the simplest way to go.
No reason to fool around with update/merge logic that I can see.
Plus, given your aversion to SSIS, a straight table load with index creation should be very easy to implement in a C# script.
SSIS Route:
First, build the first load as demonstrated here.
Next, right click on the table in SSMS and generate the CREATE script for that table.
Then, create an Execute SQL task in SSIS that runs before the load task. That SQL task will run two pieces of code: DROP TABLE <your table name>, followed by the CREATE TABLE script you copied earlier.
Finally (and optionally), create an Execute SQL task that runs after the data load task that will create any needed indexes. Since I know nothing about your data, I'd recommend a nonclustered index that includes all of the columns that you use for parameters in your report, such as CREATE NONCLUSTERED INDEX IX_SalesPerson_SalesQuota_SalesYTD ON Sales.SalesPerson (SalesQuota, SalesYTD);
While not a perfectly tuned solution, it should suffice for what you are trying to do and be easy to maintain.
I may be able to add screenshots later.
There are different approaches depending on how the data is used by the application; for the moment I would suggest the following steps:
Create a scheduled job to upload the data into temp table(s)
Bulk insert into the temp table
Insert new rows from the temp table into the main table
Update existing rows from the temp table into the main table
Or you can delete the existing rows by comparing the temp table to the main table, and then insert everything from the temp table into the main table.
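The insert-new/update-existing steps above can be sketched as two set-based statements, assuming a temp table TempTable and a main table MainTable keyed on Id (names and columns are hypothetical):

```sql
-- Insert rows that exist in the temp table but not yet in the main table
INSERT INTO MainTable (Id, Col1, Col2)
SELECT t.Id, t.Col1, t.Col2
FROM   TempTable t
WHERE  NOT EXISTS (SELECT 1 FROM MainTable m WHERE m.Id = t.Id);

-- Update rows that already exist in the main table
UPDATE m
SET    m.Col1 = t.Col1,
       m.Col2 = t.Col2
FROM   MainTable m
JOIN   TempTable t ON t.Id = m.Id;
```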
Sounds like you need an "upsert". SQL Server supports the MERGE statement, which achieves this. You could feed your CSV data into a temp table and then MERGE it into your destination table with that syntax. SSIS would probably let you set that up as a neat job.
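A minimal MERGE sketch, assuming the CSV has already been bulk-inserted into a temp table #Staging and the destination table Dest is keyed on Id (all names are placeholders):

```sql
MERGE Dest AS target
USING #Staging AS source
    ON target.Id = source.Id
WHEN MATCHED THEN
    UPDATE SET target.Value = source.Value   -- changed rows get updated
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Value) VALUES (source.Id, source.Value);  -- new rows get inserted
```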
I have a VERY large (50 million+ records) dataset that I am importing from an old Interbase database into a new SQL Server database.
My current approach is:
acquire csv files from the Interbase database (done, used a program called "FBExport" I found somewhere online)
The schema of the old database doesn't match the new one (not under my control), so I now need to mass-edit certain fields in order for them to work in the new database. This is the area I need help with.
after editing to the correct schema, I am using SqlBulkCopy to copy the newly edited data set into the SQL Server database.
Part 3 works very quickly, diagnostics shows that importing 10,000 records at once is done almost instantly.
My current (slow) approach to part 2 is to read the csv file line by line and look up the relevant information (e.g. the csv file has an ID of the form XXX########, whereas the new database has a separate column for each of XXX and ########; or the csv file references a model via a string, but the new database references it via an ID in the model table), then insert a new row into my local table, and run SqlBulkCopy once my local table gets large.
My question is: what would be the "best" approach (performance-wise) for this data-editing step? I figure there is very likely a LINQ-type approach to this; would that perform better, and how would I go about doing it if so?
If step #3’s importing is very quick, I would be tempted to create a temporary database whose schema exactly matches the old database and import the records into it. Then I’d look at adding additional columns to the temporary table where you need to split the XXX######## into XXX and ########. You could then use SQL to split the source column into the two separate ones. You could likewise use SQL to do whatever ID based lookups and updates you need to ensure the record relationships continue to be correct.
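The column-splitting and lookup steps described above could be done with plain string functions. This sketch assumes the combined ID is always three letters followed by digits, and uses hypothetical table and column names:

```sql
-- Add the new columns to the temporary table
ALTER TABLE TempImport ADD Prefix CHAR(3), Suffix VARCHAR(20);

-- Split XXX######## into its two parts
UPDATE TempImport
SET    Prefix = LEFT(OldId, 3),
       Suffix = SUBSTRING(OldId, 4, LEN(OldId) - 3);

-- Resolve the model string to its ID via a lookup join
UPDATE t
SET    t.ModelId = m.Id
FROM   TempImport t
JOIN   Model m ON m.Name = t.ModelName;
```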
Once the data has been massaged into a format which is acceptable, you can insert the records into the final tables using IDENTITY_INSERT ON, excluding all legacy columns/information.
In my mind, the primary advantage of doing it within the temporary SQL DB is that at any time you can write queries to ensure that record relationships using the old key(s) are still correctly related to records using the new database’s auto generated keys.
This is of course based on me being more comfortable doing data transformations/validation in SQL than in C#.
I'm migrating data from one system to another and will be receiving a CSV file with the data to import. The file could contain up to a million records to import. I need to get each line in the file, validate it and put the data into the relevant tables. For example, the CSV would be like:
Mr,Bob,Smith,1 high street,London,ec1,012345789,work(this needs to be looked up in another table to get the ID)
There's a lot more data than this example in the real files.
So, the SQL would be something like this:
Declare @UserID int
Insert into [User]
Values ('Mr', 'Bob', 'Smith', '0123456789')
Set @UserID = SCOPE_IDENTITY()
Insert into Address
Values ('1 high street', 'London', 'ec1', (select ID from AddressType where AddressTypeName = 'work'))
I was thinking of iterating over each row and calling an SP with the parameters from the file, where the SP contains the SQL above. Would this be the best way of tackling it? It's not time critical, as this will only be run once when updating a site.
I'm using C# and SQL Server 2008 R2.
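Wrapped in a stored procedure, the idea above might look like this; every table, column, and parameter name here is an assumption for illustration:

```sql
CREATE PROCEDURE dbo.ImportUserRow
    @Title           VARCHAR(10),
    @FirstName       VARCHAR(50),
    @LastName        VARCHAR(50),
    @Street          VARCHAR(100),
    @City            VARCHAR(50),
    @Postcode        VARCHAR(10),
    @Phone           VARCHAR(20),
    @AddressTypeName VARCHAR(20)
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @UserID INT;

    INSERT INTO [User] (Title, FirstName, LastName, Phone)
    VALUES (@Title, @FirstName, @LastName, @Phone);

    -- SCOPE_IDENTITY() avoids picking up identities from triggers,
    -- which @@IDENTITY can do
    SET @UserID = SCOPE_IDENTITY();

    INSERT INTO Address (UserID, Street, City, Postcode, AddressTypeID)
    SELECT @UserID, @Street, @City, @Postcode, ID
    FROM   AddressType
    WHERE  AddressTypeName = @AddressTypeName;
END
```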
What about loading it into a temporary table (note that this may be logically temporary, not necessarily technically) as staging, then processing it from there? This is standard ETL behaviour (and a million rows is tiny for ETL): you first stage the data, then clean it, then put it in its final place.
When performing tasks of this nature, you do not think in terms of rotating through each record individually, as that will be a huge performance problem. In this case you bulk insert the records into a staging table, or use the wizard to import into a staging table (watch out for the default 50 characters, especially in the address field).

Then you write set-based code to do any clean-up you need: removing bad telephone numbers, zip codes, or email addresses; rejecting records missing data in fields that are required in your database; or transforming data using lookup tables. Suppose you have a table of certain required values; those are likely not the same values that you will find in this file, so you need to convert them. We use doctor specialties a lot: our system might store them as GP, but the file might give us a value of General Practitioner. You need to look at all the non-matching values for the field and then determine whether you can map them to existing values, whether you need to throw the record out, or whether you need to add more values to your lookup table.

Once you have gotten rid of the records you don't want and cleaned up those you can in your staging table, then you import into the prod tables. Inserts should be written using the SELECT version of INSERT, not with the VALUES clause, when you are writing more than one or two records.
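The lookup-table conversion described above can also be done set-based. This sketch assumes a staging table and two lookup tables whose names and columns are made up for illustration:

```sql
-- Map file values (e.g. 'General Practitioner') to stored codes (e.g. 'GP')
UPDATE s
SET    s.Specialty = map.Code
FROM   Staging s
JOIN   SpecialtyMap map ON map.FileValue = s.Specialty;

-- List values that still don't match, to decide whether to map them,
-- reject the records, or extend the lookup table
SELECT DISTINCT s.Specialty
FROM   Staging s
LEFT JOIN SpecialtyLookup l ON l.Code = s.Specialty
WHERE  l.Code IS NULL;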
I need ideas for a problem I am working on:
I am writing a data synchronizer in C#.Net that will receive CSV files, one for each table in a SQL Server database.
Some of the rows in the csv files will reference existing rows in the database, requiring an update, and some will reference new rows, requiring an insert.
Since there might be a lot of files (20 or so) and potentially a lot of rows in each, how can I make this scalable? Reading one row at a time, connecting to the database to check whether a row with that ID already exists (to decide between update and insert), and then making another connection to do the actual update or insert seems wasteful.
Load everything in a temporary table (bulk insert)
Perform a merge update to the target table.
You should be using SQL Server Integration Services for this kind of work.
SSIS is a platform for data integration and workflow applications. It features a fast and flexible data warehousing tool used for data extraction, transformation, and loading (ETL).
Also, a good source to use as a reference could be CsvReader.
The best way would be to use SSIS. In SSIS there is a CSV reader component (Flat File Source) which handles all types of flat files (pipe or tab delimited, etc.). Using a Lookup you can check against the existing rows in the table, and then either update, insert, or delete using an OLE DB component.
If you don't want to use SSIS, there is another route using an XML stored procedure. Instead of hitting the database for every row, you can pass the data as XML and then manipulate it in the stored procedure.
Example: inserting data into a table using XML as a source:
CREATE PROCEDURE [dbo].[sp_Insert_XML]
    @XMLDATA xml
AS
BEGIN
    SET NOCOUNT ON;
    -- Insert statements for procedure here
    INSERT INTO RCMReport(
        ProjectName
        ,Category
        ,EndTime)
    SELECT
        XMLDATA.item.value('@ProjectName[1]', 'varchar(255)') AS ProjectName,
        XMLDATA.item.value('@Category[1]', 'varchar(200)') AS Category,
        XMLDATA.item.value('@EndTime[1]', 'datetime') AS EndTime
    FROM @XMLDATA.nodes('//RCMReport/InsertList') AS XMLDATA(item)
END
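The procedure might be invoked like this, assuming the parameter is @XMLDATA; the sample XML matches the attribute names the query reads, and the values are made up:

```sql
EXEC [dbo].[sp_Insert_XML]
    @XMLDATA = N'<RCMReport>
        <InsertList ProjectName="Demo"  Category="Nightly" EndTime="2012-01-01T00:00:00" />
        <InsertList ProjectName="Demo2" Category="Weekly"  EndTime="2012-01-02T00:00:00" />
    </RCMReport>';
```

Each InsertList element becomes one row in RCMReport, so a whole batch of rows costs a single round trip to the database.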
My Windows app reads a text file and inserts it into the database. The problem is that the text file is extremely big (at least for our low-end machines). It has 100 thousand rows and it takes time to write them into the database.
Can you guys suggest how I should read and write the data efficiently so that it does not hog the machine's memory?
FYI...
Column delimiter : '|'
Row delimiter : NewLine
It has approximately 10 columns (information about clients: first name, last name, address, phones, emails, etc.).
CONSIDER THAT...I AM RESTRICTED FROM USING BULK CMD.
You don't say what kind of database you're using, but if it is SQL Server, then you should look into the BULK INSERT command or the BCP utility.
Given that there is absolutely no chance of getting help from your security folks and using BULK commands, here is the approach I would take:
Make sure you read the entire text file first before inserting into the database, thus reducing the I/O.
Check what indexes you have on the destination table. Can you insert into a temporary table with no indexes or dependencies so that the individual inserts are fast?
Does this data need to be visible immediately after insert? If not then you can have a scheduled job to read from the temp table in step 2 above and insert into the destination table (that has indexes, foreign keys etc.).
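Step 3 above might look roughly like this, assuming a bare staging table Staging_Clients and a destination table Clients (names and columns are hypothetical); the scheduled job moves the rows in one set-based statement and then clears the staging table:

```sql
BEGIN TRANSACTION;

-- Set-based move from the index-free staging table into the real table
INSERT INTO Clients (FirstName, LastName, Address, Phone, Email)
SELECT FirstName, LastName, Address, Phone, Email
FROM   Staging_Clients;

-- Clear staging for the next run
TRUNCATE TABLE Staging_Clients;

COMMIT TRANSACTION;
```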
Is it possible for you to register a custom assembly in SQL Server? (I'm assuming it's SQL Server because you've already said you used bulk insert earlier.)
Then you can call your assembly to do (mostly) whatever you need, like getting a file from some service (or whatever your option is), parsing it, and inserting directly into the tables.
This is not an option I like, but it can be a lifesaver sometimes.