Join DataSet and SQL Server table - C#

I have a DataSet object in C# (inside an SSIS package) containing about 40,000 rows, and a SQL Server table containing about 50,000,000 rows. I want to join them on their IDs.
I can't load the SQL table into C# as it's too big, and I don't have permission on that server to create a table (to clone the object from C#).
Is there any way to join the in-memory object and the table?
Does C# or an SSIS package support this kind of solution?

It is possible to do this in SSIS.
Below are some scenarios for doing it. The key question is whether you have a 1-to-many or a many-to-many match.
Alternative 1 - you need to match all rows of the SQL table against the C# table (1 SQL table row matches 0 or 1 C# table rows).
High-level view of the approach:
Create a DataSet object with your data and store it in an SSIS Object-type variable. A Script Task will do it (a minimal sketch is shown right after this list).
In a Data Flow, use a Script Source to read rows from the variable and write them to a Cache Transform destination, persisting the cache to a file.
In the next Data Flow, read the SQL table with an OLE DB Source and perform the join with a Lookup transformation, where the Lookup uses the cache file created in step 2 as its reference. You can add columns from the cached table as you wish.
The destination of the last Data Flow is up to you.
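For step 1, here is a minimal sketch of what the Script Task could look like (this is not from the original answer; the variable name User::vResults, the connection string, the query, and the table name dtName are assumptions chosen to line up with the Script Source sample below):

// Sketch only: a Script Task that fills a DataSet and stores it in an SSIS
// Object variable. Assumes User::vResults is listed as a ReadWriteVariable
// and that the 40,000 rows come from an OLE DB query of your choosing.
using System.Data;
using System.Data.OleDb;
using Microsoft.SqlServer.Dts.Runtime;

public void Main()
{
    DataSet results = new DataSet();

    using (OleDbConnection conn = new OleDbConnection("{connection string}"))
    using (OleDbDataAdapter adapter = new OleDbDataAdapter(
        "SELECT PurchOrderID, RevisionNumber, CreateDate, TotalDue FROM ...", conn))
    {
        // Fill opens/closes the connection itself; "dtName" is the table name
        // the Script Source will look up later.
        adapter.Fill(results, "dtName");
    }

    // Hand the DataSet to the rest of the package via the Object variable.
    Dts.Variables["User::vResults"].Value = results;
    Dts.TaskResult = (int)ScriptResults.Success;   // ScriptResults comes from the Script Task template
}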
Comments and samples:
Before entering code in the Script Source, add an Output and specify the output columns with their names and data types.
Script code for reading data from the DataSet variable:
#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
#endregion

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    public override void CreateNewOutputRows()
    {
        // Pull the DataTable out of the DataSet stored in the SSIS Object variable
        DataTable dt = ((DataSet)Variables.vResults).Tables["dtName"];

        // Since we know the column metadata at design time, we simply iterate over
        // each row in the DataTable, creating a new row in the Data Flow buffer for each
        foreach (DataRow dr in dt.Rows)
        {
            // Create a new, empty row in the output buffer
            SalesOutputBuffer.AddRow();

            // Now populate the columns - these are sample names;
            // they have to be defined beforehand as columns of the Script Source output
            SalesOutputBuffer.PurchOrderID = int.Parse(dr["PurchOrderID"].ToString());
            SalesOutputBuffer.RevisionNumber = int.Parse(dr["RevisionNumber"].ToString());
            SalesOutputBuffer.CreateDate = DateTime.Parse(dr["CreateDate"].ToString());
            SalesOutputBuffer.TotalDue = decimal.Parse(dr["TotalDue"].ToString());
        }
    }
}
Alternative 2. You want to match all rows of the C# DataSet to the SQL table (1 C# table row matches 0 or 1 SQL table rows).
High-level view of the approach:
Create a DataSet object with your data and store it in an SSIS Object-type variable. A Script Task will do it.
In a Data Flow, use a Script Source to read rows from the variable.
Then create a Lookup with Partial Cache and define a SQL query against your table. You can create a No Cache Lookup if the IDs in the C# table are unique. Define the match condition and the columns needed from the SQL table.
Save the result to some destination.
Bad Alternative - 1-to-many match with row multiplication
Example: a row from the C# table can match several SQL table rows, and you have to output several rows in that case.
High-level view of the approach:
Create a DataSet object with your data and store it in an SSIS Object-type variable. A Script Task will do it.
In a Data Flow, use a Script Source to read rows from the variable. Sort them by ID.
Add another data source reading the SQL table, ordered by ID in the same direction.
Do an SSIS Merge Join.
Save the results to some destination.
The bad thing about this scenario is that it may require a lot of RAM for the Sort and Merge Join transformations.

Ferdipux's approach is a good one and less complicated than what lies below. The tradeoff between their solution and this one is performance versus complexity. In the approach outlined by Ferdipux, you'll have to pull all 50 million rows from your source table into the data flow just to identify whether you have a match.
The approach I propose is to:
Load your dataset into a temporary table. You might not be able to create a permanent table, but temporary objects should not be an issue.
Rewrite your source query to incorporate the temporary table.
Now the database engine can efficiently extract the source data with minimal impact.
Technical bits
Execute SQL Task (create temp table)
-> Data Flow Task (populate temp table)
-> Data Flow Task (extract from big table)
In your connection manager to Source1, change the property for RetainSameConnection to True. This will ensure our temporary table does not go out of scope during execution.
Execute SQL Task
Create a global temporary table when the package begins.
During development, you will need to open a connection and run the supplied code and KEEP THE CONNECTION OPEN. This is as simple as running the query in SSMS and not closing the application.
IF OBJECT_ID('tempdb..##SO_59281633') IS NOT NULL
BEGIN
DROP TABLE ##SO_59281633;
END
-- Create global temporary table
CREATE TABLE ##SO_59281633
(
SomeKey int NOT NULL
, AValue varchar(50) NOT NULL
);
Data Flow Task (populate temp table)
Use the approach outlined by Ferdipux, but you also need to set DelayValidation = True in the Data Flow's properties.
Validation happens when the package is opened for editing and when it begins execution. During normal execution, the temporary table won't exist until the previous task has executed so specifying Delay Validation means this task will not validate until it is time for it to start - as opposed to the package.
If you're comfortable with .net, you can replace this step by using a Script Task and then use an ADO.NET/OLE DB connection and command objects to load the temporary table.
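If you take the .NET route, a minimal sketch of that Script Task is below (not from the original answer). It assumes the DataSet from the earlier step sits in an Object variable named User::vResults with columns that line up with ##SO_59281633 (SomeKey, AValue), and that the connection string is a placeholder. A global ##-table created by the Execute SQL Task remains visible to this separate connection as long as the creating connection stays open.

// Sketch only (inside the Script Task's generated ScriptMain class):
// bulk-load the global temp table from the DataTable held in User::vResults.
using System.Data;
using System.Data.SqlClient;
using Microsoft.SqlServer.Dts.Runtime;

public void Main()
{
    // Assumes the first table of the DataSet has SomeKey / AValue columns.
    DataTable rows = ((DataSet)Dts.Variables["User::vResults"].Value).Tables[0];

    using (SqlConnection conn = new SqlConnection("{connection string}"))
    {
        conn.Open();
        using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
        {
            bulk.DestinationTableName = "##SO_59281633";
            bulk.ColumnMappings.Add("SomeKey", "SomeKey");   // assumed source column names
            bulk.ColumnMappings.Add("AValue", "AValue");
            bulk.WriteToServer(rows);
        }
    }

    Dts.TaskResult = (int)ScriptResults.Success;   // ScriptResults comes from the Script Task template
}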
Data Flow Task (extract from big table)
You will again need to specify DelayValidation = True here as the source query will rely on a temporary object.
Change your OLE DB Source from a table to a query (the SQL command data access mode), and then specify your query:
SELECT * FROM dbo.BigTable AS BT INNER JOIN ##SO_59281633 AS SO ON SO.SomeKey = BT.SomeKey;

Related

How can I insert 10 million records in the shortest time possible?

I have a file (which has 10 million records) like below:
line1
line2
line3
line4
.......
......
10 million lines
So basically I want to insert 10 million records into the database, so I read the file and upload it to SQL Server.
C# code
string line;
System.IO.StreamReader file =
    new System.IO.StreamReader(@"c:\test.txt");
while ((line = file.ReadLine()) != null)
{
    // insertion code goes here
    //DAL.ExecuteSql("insert into table1 values(" + line + ")");
}
file.Close();
but insertion will take a long time.
How can I insert 10 million records in the shortest time possible using C#?
Update 1:
Bulk INSERT:
BULK INSERT DBNAME.dbo.DATAs
FROM 'F:\dt10000000\dt10000000.txt'
WITH
(
ROWTERMINATOR =' \n'
);
My Table is like below:
DATAs
(
DatasField VARCHAR(MAX)
)
but I am getting following error:
Msg 4866, Level 16, State 1, Line 1
The bulk load failed. The column is too long in the data file for row 1, column 1. Verify that the field terminator and row terminator are specified correctly.
Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.
Msg 7330, Level 16, State 2, Line 1
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".
Below code worked:
BULK INSERT DBNAME.dbo.DATAs
FROM 'F:\dt10000000\dt10000000.txt'
WITH
(
FIELDTERMINATOR = '\t',
ROWTERMINATOR = '\n'
);
Please do not create a DataTable to load via BulkCopy. That is an ok solution for smaller sets of data, but there is absolutely no reason to load all 10 million rows into memory before calling the database.
Your best bet (outside of BCP / BULK INSERT / OPENROWSET(BULK...)) is to stream the contents from the file into the database via a Table-Valued Parameter (TVP). By using a TVP you can open the file, read a row & send a row until done, and then close the file. This method has a memory footprint of just a single row. I wrote an article, Streaming Data Into SQL Server 2008 From an Application, which has an example of this very scenario.
A simplistic overview of the structure is as follows. I am assuming the same import table and field name as shown in the question above.
Required database objects:
-- First: You need a User-Defined Table Type
CREATE TYPE ImportStructure AS TABLE (Field VARCHAR(MAX));
GO
-- Second: Use the UDTT as an input param to an import proc.
-- Hence "Tabled-Valued Parameter" (TVP)
CREATE PROCEDURE dbo.ImportData (
    @ImportTable dbo.ImportStructure READONLY
)
AS
SET NOCOUNT ON;

-- maybe clear out the table first?
TRUNCATE TABLE dbo.DATAs;

INSERT INTO dbo.DATAs (DatasField)
    SELECT Field
    FROM @ImportTable;
GO
C# app code to make use of the above SQL objects is below. Notice how, rather than filling up an object (e.g. DataTable) and then executing the Stored Procedure, in this method it is the executing of the Stored Procedure that initiates the reading of the file contents. The input parameter of the Stored Proc isn't a variable; it is the return value of a method, GetFileContents. That method is called when the SqlCommand calls ExecuteNonQuery, which opens the file, reads a row and sends the row to SQL Server via the IEnumerable<SqlDataRecord> and yield return constructs, and then closes the file. The Stored Procedure just sees a Table Variable, @ImportTable, that can be accessed as soon as the data starts coming over (note: the data does persist for a short time, even if not the full contents, in tempdb).
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using Microsoft.SqlServer.Server;

private static IEnumerable<SqlDataRecord> GetFileContents()
{
    SqlMetaData[] _TvpSchema = new SqlMetaData[] {
        new SqlMetaData("Field", SqlDbType.VarChar, SqlMetaData.Max)
    };
    SqlDataRecord _DataRecord = new SqlDataRecord(_TvpSchema);
    StreamReader _FileReader = null;

    try
    {
        _FileReader = new StreamReader("{filePath}");

        // read a row, send a row
        while (!_FileReader.EndOfStream)
        {
            // You shouldn't need to call "_DataRecord = new SqlDataRecord" as
            // SQL Server already received the row when "yield return" was called.
            // Unlike BCP and BULK INSERT, you have the option here to create a string,
            // call ReadLine() into the string, do manipulation(s) / validation(s) on
            // the string, then pass that string into SetString() or discard if invalid.
            _DataRecord.SetString(0, _FileReader.ReadLine());
            yield return _DataRecord;
        }
    }
    finally
    {
        _FileReader.Close();
    }
}
The GetFileContents method above is used as the input parameter value for the Stored Procedure as shown below:
public static void test()
{
    SqlConnection _Connection = new SqlConnection("{connection string}");
    SqlCommand _Command = new SqlCommand("ImportData", _Connection);
    _Command.CommandType = CommandType.StoredProcedure;

    SqlParameter _TVParam = new SqlParameter();
    _TVParam.ParameterName = "@ImportTable";
    _TVParam.TypeName = "dbo.ImportStructure";
    _TVParam.SqlDbType = SqlDbType.Structured;
    _TVParam.Value = GetFileContents(); // return value of the method is streamed data
    _Command.Parameters.Add(_TVParam);

    try
    {
        _Connection.Open();
        _Command.ExecuteNonQuery();
    }
    finally
    {
        _Connection.Close();
    }

    return;
}
Additional notes:
With some modification, the above C# code can be adapted to batch the data in.
With minor modification, the above C# code can be adapted to send in multiple fields (the example shown in the "Streaming Data..." article linked above passes in 2 fields); a sketch of that variant follows these notes.
You can also manipulate the value of each record in the SELECT statement in the proc.
You can also filter out rows by using a WHERE condition in the proc.
You can access the TVP Table Variable multiple times; it is READONLY but not "forward only".
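As a hedged illustration of the multiple-fields note (the column names, the tab delimiter, and the matching two-column UDTT are assumptions, not part of the original answer), the streaming method could be extended like this:

// Sketch: a two-field variant of GetFileContents. Assumes a tab-delimited file
// and a UDTT declared as (Field1 VARCHAR(MAX), Field2 INT); names are illustrative.
using System.Collections.Generic;
using System.Data;
using System.IO;
using Microsoft.SqlServer.Server;

private static IEnumerable<SqlDataRecord> GetFileContentsTwoFields(string filePath)
{
    SqlMetaData[] tvpSchema = new SqlMetaData[] {
        new SqlMetaData("Field1", SqlDbType.VarChar, SqlMetaData.Max),
        new SqlMetaData("Field2", SqlDbType.Int)
    };
    SqlDataRecord record = new SqlDataRecord(tvpSchema);

    using (StreamReader reader = new StreamReader(filePath))
    {
        while (!reader.EndOfStream)
        {
            // Split each line on the assumed delimiter and populate both columns.
            string[] parts = reader.ReadLine().Split('\t');
            record.SetString(0, parts[0]);
            record.SetInt32(1, int.Parse(parts[1]));
            yield return record;
        }
    }
}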
Advantages over SqlBulkCopy:
SqlBulkCopy is INSERT-only whereas using a TVP allows the data to be used in any fashion: you can call MERGE; you can DELETE based on some condition; you can split the data into multiple tables; and so on.
Due to a TVP not being INSERT-only, you don't need a separate staging table to dump the data into.
You can get data back from the database by calling ExecuteReader instead of ExecuteNonQuery. For example, if there was an IDENTITY field on the DATAs import table, you could add an OUTPUT clause to the INSERT to pass back INSERTED.[ID] (assuming ID is the name of the IDENTITY field). Or you can pass back the results of a completely different query, or both since multiple results sets can be sent and accessed via Reader.NextResult(). Getting info back from the database is not possible when using SqlBulkCopy yet there are several questions here on S.O. of people wanting to do exactly that (at least with regards to the newly created IDENTITY values).
For more info on why it is sometimes faster for the overall process, even if slightly slower on getting the data from disk into SQL Server, please see this whitepaper from the SQL Server Customer Advisory Team: Maximizing Throughput with TVP
In C#, the best solution is to let SqlBulkCopy read the file. To do this you need to pass an IDataReader directly to the SqlBulkCopy.WriteToServer method. Here is an example: http://www.codeproject.com/Articles/228332/IDataReader-implementation-plus-SqlBulkCopy
The best way is a mix between your 1st solution and your 2nd: create a DataTable, add rows to it in the loop, and then use BulkCopy to upload it to the DB in one connection. A hedged sketch of this is shown below.
One other thing to pay attention to: bulk copy is a very sensitive operation where almost any mismatch will void the copy; for example, if you declare a column name in the DataTable as "text" and in the DB it is "Text", it will throw an exception. Good luck.
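A rough sketch of that mix (the table name dbo.DATAs and column DatasField are taken from the question above; the 100,000-row chunk size and the method shape are assumptions):

// Sketch: read the file into a DataTable in chunks and push each chunk with
// SqlBulkCopy over a single connection, so memory stays bounded.
using System.Data;
using System.Data.SqlClient;
using System.IO;

static void BulkLoadFile(string filePath, string connectionString)
{
    DataTable batch = new DataTable();
    batch.Columns.Add("DatasField", typeof(string));

    using (SqlConnection conn = new SqlConnection(connectionString))
    {
        conn.Open();
        using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
        using (StreamReader reader = new StreamReader(filePath))
        {
            bulk.DestinationTableName = "dbo.DATAs";
            bulk.ColumnMappings.Add("DatasField", "DatasField");

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                batch.Rows.Add(line);
                if (batch.Rows.Count == 100000)   // flush a chunk, then reuse the table
                {
                    bulk.WriteToServer(batch);
                    batch.Clear();
                }
            }
            if (batch.Rows.Count > 0)
                bulk.WriteToServer(batch);        // flush the final partial chunk
        }
    }
}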
If you want to insert 10 million records in the shortest time directly, using a SQL query for testing purposes, you can use this query:
CREATE TABLE TestData(ID INT IDENTITY (1,1), CreatedDate DATETIME)
GO
INSERT INTO TestData(CreatedDate) SELECT GetDate()
GO 10000000

SQL Bulk copy - insert default value

I have a C#.NET application in which I need to insert the default value from the application into SQL Server by using SqlBulkCopy.
Example:
SqlColumnMapping("src_col1","dest_col1");
SqlColumnMapping("src_col2","dest_col2");
in "dest_col3", I would like to insert default value.
How could I map it in app and how the default value can be inserted in database?
Thanks
Hint: do not use SqlBulkCopy against the destination table - that thing has tons of problems, most of them around locking, and default values are also in the game.
Use it against a temporary table ;)
This is what I do:
Create a temp table with the proper field structure. You can make fields nullable here if they have a default value (INFORMATION_SCHEMA can help you find it). This step can be automated - 100% - and it is not that hard.
SqlBulkCopy into the temp table. No locking issues.
After that you can run updates for default values ;)
INSERT INTO the final table. (A sketch of this sequence is shown below.)
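A minimal sketch of that temp-table pattern (the staging and destination object names are placeholders, and the column names are borrowed from the question; dest_col3 is simply omitted so its table default applies):

// Sketch: 1) create a #temp table without the defaulted column, 2) SqlBulkCopy
// into it, 3) INSERT into the real table and let the default fire there.
using System.Data;
using System.Data.SqlClient;

static void LoadViaTempTable(DataTable rows, string connectionString)
{
    using (SqlConnection conn = new SqlConnection(connectionString))
    {
        conn.Open();   // the #temp table lives for the life of this connection

        using (SqlCommand create = new SqlCommand(
            "CREATE TABLE #staging (src_col1 INT NOT NULL, src_col2 VARCHAR(50) NOT NULL);", conn))
        {
            create.ExecuteNonQuery();
        }

        using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
        {
            bulk.DestinationTableName = "#staging";
            bulk.ColumnMappings.Add("src_col1", "src_col1");
            bulk.ColumnMappings.Add("src_col2", "src_col2");
            bulk.WriteToServer(rows);
        }

        // dest_col3 is not listed, so the table default is applied on the final insert.
        using (SqlCommand insert = new SqlCommand(
            "INSERT INTO dbo.DestTable (dest_col1, dest_col2) " +
            "SELECT src_col1, src_col2 FROM #staging;", conn))
        {
            insert.ExecuteNonQuery();
        }
    }
}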
Problems with SqlBulkCopy locking:
Locks the table. Exclusively.
It does not wait. It tries to get a lock immediately, and if that fails it retries. If the table is busy, it never gets the lock, as it never waits until it gets one - and every new request goes to the end of the queue.
We got hit badly by that in an ETL scenario some years back.
On top of that, as you found out, you cannot work with default values.
I actually have that stuff totally isolated now in a separate bulk loader class and am just in the process of allowing this to UPDATE rows (by merging from the temp table).
Here's how you do it. Create a DataTable object that has the same structure as your destination table, except remove the columns that have a default value. If you are using the DataSet Designer in Visual Studio, you can remove the columns that have default values from the TableAdapter.
Using an SqlConnection called "connection" and a DataTable object called "table", your code would look something like:
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
{
    foreach (System.Data.DataColumn c in table.Columns)
    {
        bulkCopy.ColumnMappings.Add(c.ColumnName, c.ColumnName);
    }
    bulkCopy.DestinationTableName = table.TableName;
    bulkCopy.WriteToServer(table);
}
Again, in order to use this method, you have to ensure that your DataTable object does not contain the columns that you would like to insert with default values.

How to retrieve server generated Identity values when using SqlBulkCopy

I know I can do a bulk insert into my table with an identity column by not specifying the SqlBulkCopyOptions.KeepIdentity as mentioned here.
What I would like to be able to do is get the identity values that the server generates and put them in my datatable, or even a list. I saw this post, but I want my code to be general, and I can't have a version column in all my tables. Any suggestions are much appreciated. Here is my code:
public void BulkInsert(DataTable dataTable, string DestinationTbl, int batchSize)
{
    // Get the DataTable
    DataTable dtInsertRows = dataTable;

    using (SqlBulkCopy sbc = new SqlBulkCopy(sConnectStr))
    {
        sbc.DestinationTableName = DestinationTbl;

        // Number of records to be processed in one go
        sbc.BatchSize = batchSize;

        // Add your column mappings here
        foreach (DataColumn dCol in dtInsertRows.Columns)
        {
            sbc.ColumnMappings.Add(dCol.ColumnName, dCol.ColumnName);
        }

        // Finally write to server
        sbc.WriteToServer(dtInsertRows);
    }
}
AFAIK, you can't.
The only way (that I know of) to get the values(s) of the identity field is by using either SCOPE_IDENTITY() when you insert row-by-row; or by using the OUTPUT approach when inserting an entire set.
The 'simplest' approach probably would be that you would SqlBulkCopy the records in the table and then fetch them back again later on. The problem might be that it could be hard to properly (and quickly) fetch those rows from the server again. (e.g. it would be rather ugly (and slow) to have a WHERE clause with IN (guid1, guid2, .., guid999998, guid999999) =)
I'm assuming performance is an issue here as you're already using SqlBulkCopy, so I'd suggest going for the OUTPUT approach, in which case you'll first need a staging table to SqlBulkCopy your records into. Said table should then include some kind of batch identifier (GUID?) to allow multiple threads to run side by side. You'll need a stored procedure to INSERT <table> OUTPUT inserted.* SELECT the data from the staging table into the actual destination table and also clean up the staging table again. The returned recordset from said procedure would then match 1:1 to the original dataset responsible for filling the staging table, but of course you should NOT rely on its order. In other words: your next challenge then will be matching the returned identity fields back to the original records in your application.
Thinking things over, I'd say that in all cases -- except the row-by-row & SCOPE_IDENTITY() approach, which is going to be dog-slow -- you'll need to have (or add) a 'key' to your data to link the generated ids back to the original data =/
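A rough sketch of that staging + OUTPUT idea (the staging table dbo.Staging, the destination dbo.Destination, and the BatchId / SomeKey / AValue / Id columns are all hypothetical names, and the SQL is inlined here instead of being wrapped in a stored procedure):

// Sketch: bulk copy into a staging table tagged with a batch GUID, then run an
// INSERT ... OUTPUT from staging into the real table and read back the identities.
using System;
using System.Data;
using System.Data.SqlClient;

static void InsertAndGetIds(DataTable rows, string connectionString)
{
    Guid batchId = Guid.NewGuid();
    if (!rows.Columns.Contains("BatchId"))
        rows.Columns.Add("BatchId", typeof(Guid));
    foreach (DataRow r in rows.Rows)
        r["BatchId"] = batchId;

    using (SqlConnection conn = new SqlConnection(connectionString))
    {
        conn.Open();

        using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
        {
            // Assumes the DataTable columns line up with dbo.Staging.
            bulk.DestinationTableName = "dbo.Staging";
            bulk.WriteToServer(rows);
        }

        // OUTPUT returns one row per inserted record: the identity plus enough
        // columns (here SomeKey) to match them back to the in-memory data.
        const string sql = @"
            INSERT INTO dbo.Destination (SomeKey, AValue)
            OUTPUT inserted.Id, inserted.SomeKey
            SELECT SomeKey, AValue
            FROM dbo.Staging
            WHERE BatchId = @BatchId;

            DELETE FROM dbo.Staging WHERE BatchId = @BatchId;";

        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.Add("@BatchId", SqlDbType.UniqueIdentifier).Value = batchId;
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    int generatedId = reader.GetInt32(0);
                    // match generatedId back to your in-memory rows via column 1 (SomeKey)
                }
            }
        }
    }
}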
You can take an approach similar to the one described above by deroby, but instead of retrieving the rows back via a WHERE IN (guid1, etc.), you can match them back up to the rows inserted in memory, based on their order.
So I would suggest adding a column to the table to match the rows to a SqlBulkCopy transaction, and then doing the following to match the generated ids back to the in-memory collection of rows you just inserted:
Create a new Guid and set this value on all the rows in the bulk copy mapping to the new column
Run the WriteToServer method of the BulkCopy object
Retrieve all the rows that have that same key
Iterate through this list, which will be in the order they were added; these will be in the same order as the in-memory collection of rows, so you will then know the generated id for each item.
This will give you better performance than giving each individual row a unique key. So after you bulk insert the data table you could do something like this (In my example I will have a list of objects from which I will create the data table and then map the generated ids back to them)
List<myObject> myCollection = new List<myObject>();
Guid identifierKey = Guid.NewGuid();

//Do your bulk insert where all the rows inserted have the identifierKey
//set on the new column. In this example you would create a data table based
//off the myCollection object.

//Identifier is a column specifically for matching a group of rows to a sql
//bulk copy command
var myAddedRows = myDbContext.DatastoreRows.AsNoTracking()
                  .Where(d => d.Identifier == identifierKey)
                  .ToList();

for (int i = 0; i < myAddedRows.Count; i++)
{
    var savedRow = myAddedRows[i];
    var inMemoryRow = myCollection[i];
    int generatedId = savedRow.Id;

    //Now you know the generatedId for the in memory object you could set a
    //property on it to store the value
    inMemoryRow.GeneratedId = generatedId;
}

Retrieving scalar data using .xsd dataset object?

Can someone suggest the best way to retrieve a scalar value when the site uses .xsd files for the data sets? I have such a site where, before I commit to an insert task, I need to check for duplicates.
Back in the day one would just instantiate a new connection and command object and run the query through the BLL/DAL - easy job. With this prepackaged .xsd file that Visual Studio creates for you, I have no idea how to do it.
Thanks,
Risho
First, I would recommend adding a unique index in your database to ensure that it's impossible to create duplicates.
To answer your question: you can add queries to the automatically created TableAdapters:
How to: Create TableAdapter queries
From MSDN
TableAdapter with multiple queries

Unlike standard data adapters, TableAdapters can contain multiple queries to fill their associated data tables. You can define as many queries for a TableAdapter as your application requires, as long as each query returns data that conforms to the same schema as its associated data table. This enables loading of data that satisfies differing criteria. For example, if your application contains a table of customers, you can create a query that fills the table with every customer whose name begins with a certain letter, and another query that fills the table with all customers located in the same state. To fill a Customers table with customers in a given state you can create a FillByState query that takes a parameter for the state value: SELECT * FROM Customers WHERE State = @State. You execute the query by calling the FillByState method and passing in the parameter value like this: CustomerTableAdapter.FillByState("WA").

In addition to queries that return data of the same schema as the TableAdapter's data table, you can add queries that return scalar (single) values. For example, creating a query that returns a count of customers (SELECT Count(*) From Customers) is valid for a CustomersTableAdapter even though the data returned does not conform to the table's schema.
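Once such a scalar query has been added in the designer, calling it for the duplicate check is straightforward. A minimal sketch (the MyDataSetTableAdapters namespace, the CustomersTableAdapter class, and the CountByEmail query name are placeholders for whatever the DataSet designer generated for you):

// Sketch: calling a designer-generated scalar query on a TableAdapter.
using System;

static bool IsDuplicate(string emailToCheck)
{
    using (var adapter = new MyDataSetTableAdapters.CustomersTableAdapter())
    {
        // Designer-generated scalar methods typically return object (null when no rows).
        object result = adapter.CountByEmail(emailToCheck);
        return Convert.ToInt32(result) > 0;
    }
}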

Bulkcopy the updated & newly inserted data between differenet databases using C#

This is my first post.. I have 2 SQL Server databases located on different servers..
Let's say SDT is the source data table in source database SDB, and DDT is the destination data table in database DDB.
I'm using C# for bulk copying from SDT to DDT..
My code is something like this:
sqlcommand = "Delete * from DDT where locID = #LocIDParam" // #LocIDParam is the parameter for a specific location //
then bulk copy "Select * from SDT where locID = #LocIDParam" // the steps are well known..
I just don't want to go for useless details..
However, my SDT has a huge data so that it causes high traffic for bulk copying the whole table
Is there anyway for bulk copying the only updated records from SDT to DDT as well as inserting the new ones???
Do you think using an SQL trigger for updated and newly inserted data is the best idea for this kind of scenarios? (trigger to insert the primary key value into a single column table for the new and update then deleting and inserting from/to DDT based on this )
PS. I don't want to use SQL replication for that since it has a lot of problems..
Thank you in advance
From the date I suppose you have already found your solution. In case not, here is how we deal with a somewhat similar situation.
On the source table we have a column that shows whether the data has to be sent to the destination. We use a boolean, but you can also have a datetime field that shows the last update date.
Then our pull process does the following:
Pull all the flagged data into a temporary table on the destination server
Update records that exist in both tables
Insert all records from the temporary table that don't exist in the destination table
Drop the temporary table
If you use SQL Server 2008, there is a MERGE option that I don't know well. Here is a link that explains it:
SQL 2008 MERGE command
Hope this helps you if you still need it.
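If you end up implementing the pull in C#, here is a rough sketch of the temp-table + MERGE variant (connection strings, the IsDirty flag column, and the Id / LocID / SomeValue columns are all placeholders; MERGE requires SQL Server 2008 or later):

// Sketch: pull flagged rows from the source server, bulk copy them into a temp
// table on the destination, then MERGE them into the destination table.
using System.Data;
using System.Data.SqlClient;

static void SyncLocation(string sourceConnStr, string destConnStr, int locId)
{
    // 1) Pull only the flagged rows for this location from the source server.
    DataTable changed = new DataTable();
    using (SqlConnection src = new SqlConnection(sourceConnStr))
    using (SqlCommand cmd = new SqlCommand(
        "SELECT Id, LocID, SomeValue FROM SDT WHERE LocID = @LocID AND IsDirty = 1;", src))
    {
        cmd.Parameters.Add("@LocID", SqlDbType.Int).Value = locId;
        src.Open();
        new SqlDataAdapter(cmd).Fill(changed);
    }

    using (SqlConnection dest = new SqlConnection(destConnStr))
    {
        dest.Open();

        // 2) Stage the rows in a temp table on the destination.
        using (SqlCommand create = new SqlCommand(
            "CREATE TABLE #stage (Id INT, LocID INT, SomeValue VARCHAR(100));", dest))
        {
            create.ExecuteNonQuery();
        }
        using (SqlBulkCopy bulk = new SqlBulkCopy(dest))
        {
            bulk.DestinationTableName = "#stage";
            bulk.WriteToServer(changed);
        }

        // 3) Update existing rows and insert new ones in a single statement.
        using (SqlCommand merge = new SqlCommand(@"
            MERGE DDT AS t
            USING #stage AS s ON s.Id = t.Id
            WHEN MATCHED THEN UPDATE SET t.SomeValue = s.SomeValue
            WHEN NOT MATCHED THEN INSERT (Id, LocID, SomeValue)
                                  VALUES (s.Id, s.LocID, s.SomeValue);", dest))
        {
            merge.ExecuteNonQuery();
        }
        // #stage is dropped automatically when this connection closes.
    }
}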
