I am receiving (streamed) data from an external source (over Lightstreamer) into my C# application.
My C# application receives data from the listener. The data from the listener are stored in a queue (ConcurrentQueue).
The queue is drained every 0.5 seconds with TryDequeue into a DataTable. The DataTable is then copied into a SQL database using SqlBulkCopy.
The SQL database processes the newly arrived data from the staging table into the final table.
I currently receive around 300'000 rows per day (this can increase strongly within the next weeks) and my goal is to stay under 1 second from the time I receive the data until it is available in the final SQL table.
Currently the maximum rate I have to process is around 50 rows per second.
Unfortunately, as I receive more and more data, my logic is getting slower (still far under 1 second, but I want to keep improving). The main bottleneck (so far) is the processing of the staging data (on the SQL database) into the final table.
In order to improve performance, I would like to switch the staging table to a memory-optimized table. The final table is already memory-optimized, so they will work well together.
My questions:
Is there a way to use SqlBulkCopy (from C#) with memory-optimized tables? (as far as I know there is no way yet)
Any suggestions for the fastest way to write the received data from my C# application into the memory-optimized staging table?
EDIT (with solution):
After the comments/answers and performance evaluations I decided to give up the bulk insert and use SqlCommand to hand over an IEnumerable<SqlDataRecord> with my data as a table-valued parameter to a natively compiled stored procedure, which stores the data directly in my memory-optimized final table (as well as a copy in the "staging" table, which now serves as an archive). Performance increased significantly (even though I have not yet parallelized the inserts; that will come at a later stage).
Here is part of the code:
Memory-optimized user-defined table type (to hand the data over from C# to the stored procedure):
CREATE TYPE [Staging].[CityIndexIntradayLivePrices] AS TABLE(
[CityIndexInstrumentID] [int] NOT NULL,
[CityIndexTimeStamp] [bigint] NOT NULL,
[BidPrice] [numeric](18, 8) NOT NULL,
[AskPrice] [numeric](18, 8) NOT NULL,
INDEX [IndexCityIndexIntradayLivePrices] NONCLUSTERED
(
[CityIndexInstrumentID] ASC,
[CityIndexTimeStamp] ASC,
[BidPrice] ASC,
[AskPrice] ASC
)
)
WITH ( MEMORY_OPTIMIZED = ON )
Native compiled stored procedures to insert the data into final table and staging (which serves as archive in this case):
create procedure [Staging].[spProcessCityIndexIntradayLivePricesStaging]
(
@ProcessingID int,
@CityIndexIntradayLivePrices Staging.CityIndexIntradayLivePrices readonly
)
with native_compilation, schemabinding, execute as owner
as
begin atomic
with (transaction isolation level=snapshot, language=N'us_english')
-- store prices
insert into TimeSeries.CityIndexIntradayLivePrices
(
ObjectID,
PerDateTime,
BidPrice,
AskPrice,
ProcessingID
)
select Objects.ObjectID,
CityIndexTimeStamp,
CityIndexIntradayLivePricesStaging.BidPrice,
CityIndexIntradayLivePricesStaging.AskPrice,
@ProcessingID
from @CityIndexIntradayLivePrices CityIndexIntradayLivePricesStaging,
Objects.Objects
where Objects.CityIndexInstrumentID = CityIndexIntradayLivePricesStaging.CityIndexInstrumentID
-- store data in staging table
insert into Staging.CityIndexIntradayLivePricesStaging
(
ImportProcessingID,
CityIndexInstrumentID,
CityIndexTimeStamp,
BidPrice,
AskPrice
)
select @ProcessingID,
CityIndexInstrumentID,
CityIndexTimeStamp,
BidPrice,
AskPrice
from @CityIndexIntradayLivePrices
end
IEnumerable filled with the data from the queue:
private static IEnumerable<SqlDataRecord> CreateSqlDataRecords()
{
// set columns (the order is important: it must match the column order of the table-valued parameter)
SqlMetaData MetaDataCol1;
SqlMetaData MetaDataCol2;
SqlMetaData MetaDataCol3;
SqlMetaData MetaDataCol4;
MetaDataCol1 = new SqlMetaData("CityIndexInstrumentID", SqlDbType.Int);
MetaDataCol2 = new SqlMetaData("CityIndexTimeStamp", SqlDbType.BigInt);
MetaDataCol3 = new SqlMetaData("BidPrice", SqlDbType.Decimal, 18, 8); // precision 18, scale 8
MetaDataCol4 = new SqlMetaData("AskPrice", SqlDbType.Decimal, 18, 8); // precision 18, scale 8
// define sql data record with the columns
SqlDataRecord DataRecord = new SqlDataRecord(new SqlMetaData[] { MetaDataCol1, MetaDataCol2, MetaDataCol3, MetaDataCol4 });
// remove each price row from queue and add it to the sql data record
LightstreamerAPI.PriceDTO PriceDTO = new LightstreamerAPI.PriceDTO();
while (IntradayQuotesQueue.TryDequeue(out PriceDTO))
{
DataRecord.SetInt32(0, PriceDTO.MarketID); // city index market id
DataRecord.SetInt64(1, Convert.ToInt64((PriceDTO.TickDate.Replace(@"\/Date(", "")).Replace(@")\/", ""))); // verbatim strings (@) avoid escape-sequence issues with the backslashes
DataRecord.SetDecimal(2, PriceDTO.Bid); // bid price
DataRecord.SetDecimal(3, PriceDTO.Offer); // ask price
yield return DataRecord;
}
}
Handling the data every 0.5 seconds:
public static void ChildThreadIntradayQuotesHandler(Int32 CityIndexInterfaceProcessingID)
{
try
{
// open new sql connection
using (SqlConnection TimeSeriesDatabaseSQLConnection = new SqlConnection("Data Source=XXX;Initial Catalog=XXX;Integrated Security=SSPI;MultipleActiveResultSets=false"))
{
// open connection
TimeSeriesDatabaseSQLConnection.Open();
// endless loop to keep thread alive
while(true)
{
// ensure queue has rows to process (otherwise no need to continue)
if(IntradayQuotesQueue.Count > 0)
{
// define stored procedure for sql command
SqlCommand InsertCommand = new SqlCommand("Staging.spProcessCityIndexIntradayLivePricesStaging", TimeSeriesDatabaseSQLConnection);
// set command type to stored procedure
InsertCommand.CommandType = CommandType.StoredProcedure;
// define sql parameters (table-value parameter gets data from CreateSqlDataRecords())
SqlParameter ParameterCityIndexIntradayLivePrices = InsertCommand.Parameters.AddWithValue("@CityIndexIntradayLivePrices", CreateSqlDataRecords()); // table-valued parameter
SqlParameter ParameterProcessingID = InsertCommand.Parameters.AddWithValue("@ProcessingID", CityIndexInterfaceProcessingID); // processing id parameter
// set sql db type to structured for the table-valued parameter (structured = special data type for specifying structured data contained in table-valued parameters)
ParameterCityIndexIntradayLivePrices.SqlDbType = SqlDbType.Structured;
// execute stored procedure
InsertCommand.ExecuteNonQuery();
}
// wait 0.5 seconds
Thread.Sleep(500);
}
}
}
catch (Exception e)
{
// handle error (standard error messages and update processing)
ThreadErrorHandling(CityIndexInterfaceProcessingID, "ChildThreadIntradayQuotesHandler (handler stopped now)", e);
};
}
Use SQL Server 2016 (it's not RTM yet, but it's already much better than 2014 when it comes to memory-optimized tables). Then use either a memory-optimized table variable or just blast a whole lot of native stored procedure calls in a transaction, each doing one insert, depending on what's faster in your scenario (this varies). A few things to watch out for:
Doing multiple inserts in one transaction is vital to save on network roundtrips. While in-memory operations are very fast, SQL Server still needs to confirm every operation.
Depending on how you're producing data, you may find that parallelizing the inserts can speed things up (don't overdo it; you'll quickly hit the saturation point). Don't try to be very clever yourself here; leverage async/await and/or Parallel.ForEach.
If you're passing a table-valued parameter, the easiest way of doing it is to pass a DataTable as the parameter value, but this is not the most efficient way of doing it -- that would be passing an IEnumerable<SqlDataRecord>. You can use an iterator method to generate the values, so only a constant amount of memory is allocated.
You'll have to experiment a bit to find the optimal way of passing through data; this depends a lot on the size of your data and how you're getting it.
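To illustrate the parallelization suggestion above, here is a minimal, hedged sketch (not the author's code): the drained queue is split into batches and each batch is sent over its own connection with a bounded degree of parallelism. The batch size, the degree of parallelism and the placeholder connection string are assumptions to tune for your workload, and each row must be its own SqlDataRecord instance (unlike the streaming iterator in the question, which reuses one record).

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.SqlServer.Server;

public static class ParallelInsertSketch
{
    // Hypothetical entry point: rows have already been drained from the queue,
    // one distinct SqlDataRecord per row.
    public static void InsertInParallel(IReadOnlyList<SqlDataRecord> rows, int processingId)
    {
        const int batchSize = 1000; // assumption: tune for your workload
        var batches = rows
            .Select((row, index) => new { row, index })
            .GroupBy(x => x.index / batchSize, x => x.row)
            .Select(g => g.ToList())
            .ToList();

        // Bounded parallelism: a handful of concurrent connections is usually enough.
        Parallel.ForEach(batches, new ParallelOptions { MaxDegreeOfParallelism = 4 }, batch =>
        {
            using (var connection = new SqlConnection("Data Source=XXX;Initial Catalog=XXX;Integrated Security=SSPI"))
            using (var command = new SqlCommand("Staging.spProcessCityIndexIntradayLivePricesStaging", connection))
            {
                command.CommandType = CommandType.StoredProcedure;
                var tvp = command.Parameters.AddWithValue("@CityIndexIntradayLivePrices", batch);
                tvp.SqlDbType = SqlDbType.Structured;
                command.Parameters.AddWithValue("@ProcessingID", processingId);

                connection.Open();
                command.ExecuteNonQuery(); // one round trip per batch
            }
        });
    }
}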
Batch the data from the staging table to the final table in row counts less than 5k, I typically use 4k, and do not insert them in a transaction. Instead, implement programmatic transactions if needed. Staying under 5k inserted rows keeps the number of row locks from escalating into a table lock, which has to wait until everyone else gets out of the table.
Are you sure it's your logic slowing down and not the actual transactions to the database? Entity Framework for example is "sensitive", for lack of a better term, when trying to insert a ton of rows and gets pretty slow.
There's a third party library, BulkInsert, on Codeplex which I've used and it's pretty nice to do bulk inserting of data: https://efbulkinsert.codeplex.com/
You could also write your own extension method on DbContext, if you use EF, that does this based on record count: anything under 5000 rows uses SaveChanges(), anything over that invokes your own bulk insert logic.
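A rough sketch of what such an extension method might look like, assuming EF6. The method name, the 5000-row threshold and the naive property-to-column mapping (entity property names must match the destination column names) are all assumptions, not a finished implementation.

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.Entity;
using System.Data.SqlClient;
using System.Linq;

public static class DbContextBulkExtensions
{
    // Hypothetical helper: below the threshold use normal EF change tracking,
    // above it build a DataTable and hand it to SqlBulkCopy.
    public static void SaveOrBulkInsert<T>(this DbContext context, IList<T> entities, string tableName)
        where T : class
    {
        const int threshold = 5000;

        if (entities.Count < threshold)
        {
            context.Set<T>().AddRange(entities);
            context.SaveChanges();
            return;
        }

        // Naive mapping: assumes entity property names match destination column names.
        var properties = typeof(T).GetProperties();
        var table = new DataTable();
        foreach (var property in properties)
            table.Columns.Add(property.Name, Nullable.GetUnderlyingType(property.PropertyType) ?? property.PropertyType);

        foreach (var entity in entities)
            table.Rows.Add(properties.Select(p => p.GetValue(entity) ?? DBNull.Value).ToArray());

        using (var bulkCopy = new SqlBulkCopy(context.Database.Connection.ConnectionString))
        {
            bulkCopy.DestinationTableName = tableName;
            foreach (DataColumn column in table.Columns)
                bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
            bulkCopy.WriteToServer(table);
        }
    }
}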
Related
I need to programmatically insert tens of millions of records into a Postgres database. Presently, I'm executing thousands of insert statements in a single query.
Is there a better way to do this, some bulk insert statement I do not know about?
PostgreSQL has a guide on how to best populate a database initially, and they suggest using the COPY command for bulk loading rows. The guide has some other good tips on how to speed up the process, like removing indexes and foreign keys before loading the data (and adding them back afterwards).
There is an alternative to using COPY, which is the multirow values syntax that Postgres supports. From the documentation:
INSERT INTO films (code, title, did, date_prod, kind) VALUES
('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');
The above code inserts two rows, but you can extend it arbitrarily, until you hit the maximum number of prepared statement tokens (it might be $999, but I'm not 100% sure about that). Sometimes one cannot use COPY, and this is a worthy replacement for those situations.
One way to speed things up is to explicitly perform multiple inserts or copy's within a transaction (say 1000). Postgres's default behavior is to commit after each statement, so by batching the commits, you can avoid some overhead. As the guide in Daniel's answer says, you may have to disable autocommit for this to work. Also note the comment at the bottom that suggests increasing the size of the wal_buffers to 16 MB may also help.
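Since the surrounding thread is C#, here is a hedged Npgsql sketch of the same idea: commit every 1000 inserts instead of once per statement. The films table and columns follow the example above; everything else (connection string, row shape) is a placeholder.

using System.Collections.Generic;
using Npgsql;

public static class BatchedInsertSketch
{
    public static void InsertBatched(string connectionString, IEnumerable<(string code, string title)> rows)
    {
        const int batchSize = 1000; // commit after this many inserts

        using (var connection = new NpgsqlConnection(connectionString))
        {
            connection.Open();
            var transaction = connection.BeginTransaction();
            var count = 0;

            foreach (var row in rows)
            {
                using (var command = new NpgsqlCommand(
                    "INSERT INTO films (code, title) VALUES (@code, @title)", connection, transaction))
                {
                    command.Parameters.AddWithValue("code", row.code);
                    command.Parameters.AddWithValue("title", row.title);
                    command.ExecuteNonQuery();
                }

                if (++count % batchSize == 0)
                {
                    transaction.Commit();                        // flush the batch
                    transaction = connection.BeginTransaction(); // start the next one
                }
            }

            transaction.Commit(); // commit any remaining partial batch
        }
    }
}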
The UNNEST function with arrays can be used along with the multirow VALUES syntax. I think this method is slower than using COPY, but it is useful to me when working with psycopg and Python (a Python list passed to cursor.execute becomes a pg ARRAY):
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
VALUES (
UNNEST(ARRAY[1, 2, 3]),
UNNEST(ARRAY[100, 200, 300]),
UNNEST(ARRAY['a', 'b', 'c'])
);
Without VALUES, using a subselect with an additional existence check:
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
SELECT * FROM (
SELECT UNNEST(ARRAY[1, 2, 3]),
UNNEST(ARRAY[100, 200, 300]),
UNNEST(ARRAY['a', 'b', 'c'])
) AS temptable
WHERE NOT EXISTS (
SELECT 1 FROM tablename tt
WHERE tt.fieldname1=temptable.fieldname1
);
The same syntax applies to bulk updates:
UPDATE tablename
SET fieldname1=temptable.data
FROM (
SELECT UNNEST(ARRAY[1,2]) AS id,
UNNEST(ARRAY['a', 'b']) AS data
) AS temptable
WHERE tablename.id=temptable.id;
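Mirroring the psycopg note above from C#: Npgsql maps .NET arrays to PostgreSQL arrays, so the same UNNEST pattern can be driven from application code. A hedged sketch follows; tablename and the field names come from the example above, the rest is assumed.

using Npgsql;

public static class UnnestInsertSketch
{
    public static void InsertWithUnnest(string connectionString, int[] ids, int[] amounts, string[] labels)
    {
        using (var connection = new NpgsqlConnection(connectionString))
        using (var command = new NpgsqlCommand(
            "INSERT INTO tablename (fieldname1, fieldname2, fieldname3) " +
            "SELECT UNNEST(@ids), UNNEST(@amounts), UNNEST(@labels)", connection))
        {
            // Each .NET array arrives server-side as a single PostgreSQL ARRAY value.
            command.Parameters.AddWithValue("ids", ids);
            command.Parameters.AddWithValue("amounts", amounts);
            command.Parameters.AddWithValue("labels", labels);

            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}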
An external file is the best and most typical source of bulk data
The term "bulk data" means "a lot of data", so it is natural to use the original raw data, with no need to transform it into SQL. Typical raw data file formats for "bulk insert" are CSV and JSON.
Bulk insert with some transformation
In ETL applications and ingestion processes, we need to change the data before inserting it. A temporary table consumes (a lot of) disk space, and it is not the fastest way to do it. The PostgreSQL foreign-data wrapper (FDW) is the best choice.
CSV example. Suppose a table tablename (x, y, z) in SQL and a CSV file like
fieldname1,fieldname2,fieldname3
etc,etc,etc
... million lines ...
You can use the classic SQL COPY to load the original data as-is into tmp_tablename, then insert filtered data into tablename... But, to avoid disk consumption, the best is to ingest it directly by
INSERT INTO tablename (x, y, z)
SELECT f1(fieldname1), f2(fieldname2), f3(fieldname3) -- the transforms
FROM tmp_tablename_fdw
-- WHERE conditions
;
You need to prepare the database for FDW, and instead of a static tmp_tablename_fdw you can use a function that generates it:
CREATE EXTENSION file_fdw;
CREATE SERVER import FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE tmp_tablename_fdw(
...
) SERVER import OPTIONS ( filename '/tmp/pg_io/file.csv', format 'csv');
JSON example. A set of two files, myRawData1.json and Ranger_Policies2.json can be ingested by:
INSERT INTO tablename (fname, metadata, content)
SELECT fname, meta, j -- do any data transformation here
FROM jsonb_read_files('myRawData%.json')
-- WHERE any_condition_here
;
where the function jsonb_read_files() reads all files of a folder, defined by a mask:
CREATE or replace FUNCTION jsonb_read_files(
p_flike text, p_fpath text DEFAULT '/tmp/pg_io/'
) RETURNS TABLE (fid int, fname text, fmeta jsonb, j jsonb) AS $f$
WITH t AS (
SELECT (row_number() OVER ())::int id,
f AS fname,
p_fpath ||'/'|| f AS f
FROM pg_ls_dir(p_fpath) t(f)
WHERE f LIKE p_flike
) SELECT id, fname,
to_jsonb( pg_stat_file(f) ) || jsonb_build_object('fpath', p_fpath),
pg_read_file(f)::jsonb
FROM t
$f$ LANGUAGE SQL IMMUTABLE;
Lack of gzip streaming
The most frequent method for "file ingestion" (mainly in Big Data) is preserving the original file in gzip format and transferring it with a streaming algorithm, anything that runs fast and without disk consumption in unix pipes:
gunzip -c remote_or_local_file.csv.gz | convert_to_sql | psql
So the ideal (future) would be a server-side option for the .csv.gz format.
Note after @CharlieClark's comment: currently (2022) there is nothing built in; the best alternative seems to be pgloader with STDIN:
gunzip -c file.csv.gz | pgloader --type csv ... - pgsql:///target?foo
You can use COPY table FROM ... WITH BINARY which is "somewhat faster than the text and CSV formats." Only do this if you have millions of rows to insert, and if you are comfortable with binary data.
Here is an example recipe in Python, using psycopg2 with binary input.
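On the C#/Npgsql side, binary COPY is exposed through BeginBinaryImport. A hedged sketch follows; the table, columns and types are placeholders and must match the destination table exactly, and the Complete() call assumes a recent Npgsql version.

using System.Collections.Generic;
using Npgsql;
using NpgsqlTypes;

public static class BinaryCopySketch
{
    public static void BinaryImport(string connectionString, IEnumerable<(int id, string name)> rows)
    {
        using (var connection = new NpgsqlConnection(connectionString))
        {
            connection.Open();
            using (var writer = connection.BeginBinaryImport(
                "COPY tablename (fieldname1, fieldname2) FROM STDIN (FORMAT BINARY)"))
            {
                foreach (var row in rows)
                {
                    writer.StartRow();
                    writer.Write(row.id, NpgsqlDbType.Integer);
                    writer.Write(row.name, NpgsqlDbType.Text);
                }
                writer.Complete(); // without Complete() the import is rolled back on dispose
            }
        }
    }
}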
It mostly depends on the (other) activity in the database. Operations like this effectively freeze the entire database for other sessions. Another consideration is the data model and the presence of constraints, triggers, etc.
My first approach is always: create a (temp) table with a structure similar to the target table (create table tmp AS select * from target where 1=0), and start by reading the file into the temp table.
Then I check what can be checked: duplicates, keys that already exist in the target, etc.
Then I just do an insert into target select * from tmp or similar.
If this fails, or takes too long, I abort it and consider other methods (temporarily dropping indexes/constraints, etc)
I just encountered this issue and would recommend csvsql (releases) for bulk imports to Postgres. To perform a bulk insert you'd simply createdb and then use csvsql, which connects to your database and creates individual tables for an entire folder of CSVs.
$ createdb test
$ csvsql --db postgresql:///test --insert examples/*.csv
I implemented a very fast PostgreSQL data loader with native libpq methods.
Try my package https://www.nuget.org/packages/NpgsqlBulkCopy/
Maybe I'm late already, but there is a Java library called pgbulkinsert by Bytefish. My team and I were able to bulk insert 1 million records in 15 seconds. Of course, there were some other operations that we performed, like reading 1M+ records from a file sitting on Minio, doing a couple of processing steps on top of those records, filtering out duplicates, and then finally inserting the 1M records into the Postgres database. And all these processes were completed within 15 seconds. I don't remember exactly how long the DB operation itself took, but I think it was less than 5 seconds. Find more details at https://www.bytefish.de/blog/pgbulkinsert_bulkprocessor.html
As others have noted, when importing data into Postgres, things will be slowed by the checks that Postgres is designed to do for you. Also, you often need to manipulate the data in one way or another so that it's suitable for use. Any of this that can be done outside of the Postgres process will mean that you can import using the COPY protocol.
For my use I regularly import data from the httparchive.org project using pgloader. As the source files are created by MySQL you need to be able to handle some MySQL oddities, such as the use of \N for an empty value, along with encoding problems. The files are also so large that, at least on my machine, using FDW runs out of memory. pgloader makes it easy to create a pipeline that lets you select the fields you want, cast to the relevant data types and do any additional work before it goes into your main database, so that index updates, etc. are minimal.
The query below creates a test table with a generate_series column which has 10000 rows. *I usually create such a test table to test query performance; you can also check generate_series():
CREATE TABLE test AS SELECT generate_series(1, 10000);
postgres=# SELECT count(*) FROM test;
count
-------
10000
(1 row)
postgres=# SELECT * FROM test;
generate_series
-----------------
1
2
3
4
5
6
-- More --
And, run the query below to insert 10000 rows if you already have a test table:
INSERT INTO test (generate_series) SELECT generate_series(1, 10000);
I have a strange performance issue with executing a simple merge SQL command on Entity Framework 6.
First my Entity Framework code:
var command = #"MERGE [StringData] AS TARGET
USING (VALUES (#DCStringID_Value, #TimeStamp_Value)) AS SOURCE ([DCStringID], [TimeStamp])
ON TARGET.[DCStringID] = SOURCE.[DCStringID] AND TARGET.[TimeStamp] = SOURCE.[TimeStamp]
WHEN MATCHED THEN
UPDATE
SET [DCVoltage] = #DCVoltage_Value,
[DCCurrent] = #DCCurrent_Value
WHEN NOT MATCHED THEN
INSERT ([DCStringID], [TimeStamp], [DCVoltage], [DCCurrent])
VALUES (#DCStringID_Value, #TimeStamp_Value, #DCVoltage_Value, #DCCurrent_Value);";
using (EntityModel context = new EntityModel())
{
for (int i = 0; i < 100; i++)
{
var entity = _buffer.Dequeue();
context.ContextAdapter.ObjectContext.ExecuteStoreCommand(command, new object[]
{
new SqlParameter("#DCStringID_Value", entity.DCStringID),
new SqlParameter("#TimeStamp_Value", entity.TimeStamp),
new SqlParameter("#DCVoltage_Value", entity.DCVoltage),
new SqlParameter("#DCCurrent_Value", entity.DCCurrent),
});
}
}
Execution time ~20 seconds.
This looks a little bit slow so I tried the same command to run direct in management studio (also 100 times in a row).
SQL Server Management Studio:
Execution time <1 second.
Ok that is strange!?
Some tests:
First I compared both execution plans (Entity Framework and SSMS), but they are absolutely identical.
Second, I tried using a transaction inside my code.
using (PowerdooModel context = PowerdooModel.CreateModel())
{
using (var dbContextTransaction = context.Database.BeginTransaction())
{
try
{
for (int i = 0; i < 100; i++)
{
context.ContextAdapter.ObjectContext.ExecuteStoreCommand(command, new object[]
{
new SqlParameter("#DCStringID_Value", entity.DCStringID),
new SqlParameter("#TimeStamp_Value", entity.TimeStamp),
new SqlParameter("#DCVoltage_Value", entity.DCVoltage),
new SqlParameter("#DCCurrent_Value", entity.DCCurrent),
});
}
dbContextTransaction.Commit();
}
catch (Exception)
{
dbContextTransaction.Rollback();
}
}
}
Third I added 'OPTION(recompile)' to avoid parameter sniffing.
Execution time is still ~10 seconds, which is still very poor performance.
Question: what am I doing wrong? Please give me a hint.
-------- Some more tests - edited 18.11.2016 ---------
If I execute the commands inside a transaction (like above), the following times comes up:
Duration complete: 00:00:06.5936006
Average Command: 00:00:00.0653457
Commit: 00:00:00.0590299
Is it not strange that the commit nearly takes no time and the average command takes nearly the same time?
In SSMS you are running one long batch that contains 100 separate MERGE statements. In C# you are running 100 separate batches. Obviously that takes longer.
Running 100 separate batches in 100 separate transactions is obviously slower than 100 batches in 1 transaction. Your measurements confirm that and show you how much longer it takes.
To make it efficient use a single MERGE statement that processes all 100 rows from a table-valued parameter in one go. See also Table-Valued Parameters for .NET Framework
Often a table-valued parameter is a parameter of a stored procedure, but you don't have to use a stored procedure. It could be a single statement, but instead of multiple simple scalar parameters you'd pass the whole table at once.
I have never used Entity Framework, so I can't show you a C# example of how to call it. I'm sure if you search for "how to pass table-valued parameter in entity framework" you'll find an example. I use the DataTable class to pass a table as a parameter.
I can show you an example of T-SQL stored procedure.
At first you define a table type that pretty much follows the definition of your StringData table:
CREATE TYPE dbo.StringDataTableType AS TABLE(
DCStringID int NOT NULL,
TimeStamp datetime2(0) NOT NULL,
DCVoltage float NOT NULL,
DCCurrent float NOT NULL
)
Then you use it as a type for the parameter:
CREATE PROCEDURE dbo.MergeStringData
@ParamRows dbo.StringDataTableType READONLY
AS
BEGIN
SET NOCOUNT ON;
SET XACT_ABORT ON;
BEGIN TRANSACTION;
BEGIN TRY
MERGE INTO dbo.StringData WITH (HOLDLOCK) as Dst
USING
(
SELECT
TT.DCStringID
,TT.TimeStamp
,TT.DCVoltage
,TT.DCCurrent
FROM
@ParamRows AS TT
) AS Src
ON
Dst.DCStringID = Src.DCStringID AND
Dst.TimeStamp = Src.TimeStamp
WHEN MATCHED THEN
UPDATE SET
Dst.DCVoltage = Src.DCVoltage
,Dst.DCCurrent = Src.DCCurrent
WHEN NOT MATCHED BY TARGET THEN
INSERT
(DCStringID
,TimeStamp
,DCVoltage
,DCCurrent)
VALUES
(Src.DCStringID
,Src.TimeStamp
,Src.DCVoltage
,Src.DCCurrent)
;
COMMIT TRANSACTION;
END TRY
BEGIN CATCH
-- TODO: handle the error
ROLLBACK TRANSACTION;
END CATCH;
END
Again, it doesn't have to be a stored procedure, it can be just a MERGE statement with one table-valued parameter.
I'm pretty sure that it would be much faster than your loop with 100 separate queries.
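Since the answer defers the C# side, here is a hedged ADO.NET sketch of calling dbo.MergeStringData with a DataTable as the table-valued parameter. The DTO class, its property types and the connection string are assumptions based on the question's columns; EF users can run the same command over their context's underlying connection.

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class MergeStringDataSketch
{
    public static void MergeBuffered(string connectionString, IEnumerable<StringDataRow> buffer)
    {
        // Build a DataTable whose columns match dbo.StringDataTableType.
        var table = new DataTable();
        table.Columns.Add("DCStringID", typeof(int));
        table.Columns.Add("TimeStamp", typeof(DateTime));
        table.Columns.Add("DCVoltage", typeof(double));
        table.Columns.Add("DCCurrent", typeof(double));

        foreach (var row in buffer)
            table.Rows.Add(row.DCStringID, row.TimeStamp, row.DCVoltage, row.DCCurrent);

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("dbo.MergeStringData", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            var parameter = command.Parameters.AddWithValue("@ParamRows", table);
            parameter.SqlDbType = SqlDbType.Structured;
            parameter.TypeName = "dbo.StringDataTableType";

            connection.Open();
            command.ExecuteNonQuery(); // one round trip for all rows
        }
    }

    // Hypothetical DTO standing in for the question's entity type.
    public class StringDataRow
    {
        public int DCStringID { get; set; }
        public DateTime TimeStamp { get; set; }
        public double DCVoltage { get; set; }
        public double DCCurrent { get; set; }
    }
}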
Details on why there should be HOLDLOCK hint with MERGE: “UPSERT” Race Condition With MERGE
A side note:
It is not strange that Commit in your last test is very fast. It doesn't do much, because everything is already written to the database. If you tried to do Rollback, that would take some time.
It looks like the difference is that you are using OPTION (RECOMPILE) in your EF code only, but when you run the code in SSMS there is no recompile.
Recompiling 100 times would certainly add some time to execution.
Related to the answer from @Vladimir Baranov, I took a look at how Entity Framework supports bulk operations (for me, bulk merge).
Short answer... no! Bulk delete, update and merge operations are not supported.
Then I found this sweet little library!
zzzproject Entity Framework Extensions
It provides a finished extension that works like @Vladimir Baranov described:
Create a temporary table in SQL Server.
Bulk Insert data with .NET SqlBulkCopy into the temporary table.
Perform a SQL statement between the temporary table and the destination table.
Drop the temporary table from the SQL Server.
You can find a short interview by Jonathan Allen about the functions and how it can dramatically improve Entity Framework performance with bulk operations.
For me it really kicks up the performance from 10 seconds to under 1 second. Nice!
To be clear, I'm not working for or related to them. This is not an advertisement. Yes, the library is commercial, but the basic version is free.
I have a file (which has 10 million records) like below:
line1
line2
line3
line4
.......
......
10 million lines
So basically I want to insert 10 million records into the database.
so I read the file and upload it to SQL Server.
C# code
string line;
System.IO.StreamReader file =
new System.IO.StreamReader(@"c:\test.txt");
while((line = file.ReadLine()) != null)
{
// insertion code goes here
//DAL.ExecuteSql("insert into table1 values("+line+")");
}
file.Close();
but insertion will take a long time.
How can I insert 10 million records in the shortest time possible using C#?
Update 1:
Bulk INSERT:
BULK INSERT DBNAME.dbo.DATAs
FROM 'F:\dt10000000\dt10000000.txt'
WITH
(
ROWTERMINATOR =' \n'
);
My Table is like below:
DATAs
(
DatasField VARCHAR(MAX)
)
but I am getting following error:
Msg 4866, Level 16, State 1, Line 1
The bulk load failed. The column is too long in the data file for row 1, column 1. Verify that the field terminator and row terminator are specified correctly.
Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.
Msg 7330, Level 16, State 2, Line 1
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".
Below code worked:
BULK INSERT DBNAME.dbo.DATAs
FROM 'F:\dt10000000\dt10000000.txt'
WITH
(
FIELDTERMINATOR = '\t',
ROWTERMINATOR = '\n'
);
Please do not create a DataTable to load via BulkCopy. That is an ok solution for smaller sets of data, but there is absolutely no reason to load all 10 million rows into memory before calling the database.
Your best bet (outside of BCP / BULK INSERT / OPENROWSET(BULK...)) is to stream the contents from the file into the database via a Table-Valued Parameter (TVP). By using a TVP you can open the file, read a row & send a row until done, and then close the file. This method has a memory footprint of just a single row. I wrote an article, Streaming Data Into SQL Server 2008 From an Application, which has an example of this very scenario.
A simplistic overview of the structure is as follows. I am assuming the same import table and field name as shown in the question above.
Required database objects:
-- First: You need a User-Defined Table Type
CREATE TYPE ImportStructure AS TABLE (Field VARCHAR(MAX));
GO
-- Second: Use the UDTT as an input param to an import proc.
-- Hence "Tabled-Valued Parameter" (TVP)
CREATE PROCEDURE dbo.ImportData (
@ImportTable dbo.ImportStructure READONLY
)
AS
SET NOCOUNT ON;
-- maybe clear out the table first?
TRUNCATE TABLE dbo.DATAs;
INSERT INTO dbo.DATAs (DatasField)
SELECT Field
FROM @ImportTable;
GO
C# app code to make use of the above SQL objects is below. Notice how, rather than filling up an object (e.g. DataTable) and then executing the Stored Procedure, in this method it is the executing of the Stored Procedure that initiates the reading of the file contents. The input parameter of the Stored Proc isn't a variable; it is the return value of a method, GetFileContents. That method is called when the SqlCommand calls ExecuteNonQuery, which opens the file, reads a row and sends the row to SQL Server via the IEnumerable<SqlDataRecord> and yield return constructs, and then closes the file. The Stored Procedure just sees a Table Variable, @ImportTable, that can be accessed as soon as the data starts coming over (note: the data does persist for a short time, even if not the full contents, in tempdb).
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using Microsoft.SqlServer.Server;
private static IEnumerable<SqlDataRecord> GetFileContents()
{
SqlMetaData[] _TvpSchema = new SqlMetaData[] {
new SqlMetaData("Field", SqlDbType.VarChar, SqlMetaData.Max)
};
SqlDataRecord _DataRecord = new SqlDataRecord(_TvpSchema);
StreamReader _FileReader = null;
try
{
_FileReader = new StreamReader("{filePath}");
// read a row, send a row
while (!_FileReader.EndOfStream)
{
// You shouldn't need to call "_DataRecord = new SqlDataRecord" as
// SQL Server already received the row when "yield return" was called.
// Unlike BCP and BULK INSERT, you have the option here to create a string
// call ReadLine() into the string, do manipulation(s) / validation(s) on
// the string, then pass that string into SetString() or discard if invalid.
_DataRecord.SetString(0, _FileReader.ReadLine());
yield return _DataRecord;
}
}
finally
{
_FileReader.Close();
}
}
The GetFileContents method above is used as the input parameter value for the Stored Procedure as shown below:
public static void test()
{
SqlConnection _Connection = new SqlConnection("{connection string}");
SqlCommand _Command = new SqlCommand("ImportData", _Connection);
_Command.CommandType = CommandType.StoredProcedure;
SqlParameter _TVParam = new SqlParameter();
_TVParam.ParameterName = "#ImportTable";
_TVParam.TypeName = "dbo.ImportStructure";
_TVParam.SqlDbType = SqlDbType.Structured;
_TVParam.Value = GetFileContents(); // return value of the method is streamed data
_Command.Parameters.Add(_TVParam);
try
{
_Connection.Open();
_Command.ExecuteNonQuery();
}
finally
{
_Connection.Close();
}
return;
}
Additional notes:
With some modification, the above C# code can be adapted to batch the data in (see the sketch after these notes).
With minor modification, the above C# code can be adapted to send in multiple fields (the example shown in the "Streaming Data..." article linked above passes in 2 fields).
You can also manipulate the value of each record in the SELECT statement in the proc.
You can also filter out rows by using a WHERE condition in the proc.
You can access the TVP Table Variable multiple times; it is READONLY but not "forward only".
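A hedged sketch of the batching note above: read the file in chunks of batchSize rows and execute the stored procedure once per chunk. This assumes the TRUNCATE in the example proc is removed (or handled once before the loop), otherwise each batch would wipe the previous one; the method names and batch size are assumptions.

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using Microsoft.SqlServer.Server;

private static IEnumerable<SqlDataRecord> ReadBatch(StreamReader reader, int batchSize)
{
    SqlMetaData[] schema = { new SqlMetaData("Field", SqlDbType.VarChar, SqlMetaData.Max) };
    SqlDataRecord record = new SqlDataRecord(schema);

    int rowsRead = 0;
    while (rowsRead < batchSize && !reader.EndOfStream)
    {
        record.SetString(0, reader.ReadLine());
        rowsRead++;
        yield return record; // streams this row, then pauses until the next one is pulled
    }
}

private static void ImportInBatches(string filePath, string connectionString, int batchSize)
{
    using (StreamReader reader = new StreamReader(filePath))
    using (SqlConnection connection = new SqlConnection(connectionString))
    {
        connection.Open();
        while (!reader.EndOfStream)
        {
            using (SqlCommand command = new SqlCommand("ImportData", connection))
            {
                command.CommandType = CommandType.StoredProcedure;

                SqlParameter tvp = new SqlParameter("@ImportTable", SqlDbType.Structured);
                tvp.TypeName = "dbo.ImportStructure";
                tvp.Value = ReadBatch(reader, batchSize); // at most batchSize rows per call
                command.Parameters.Add(tvp);

                command.ExecuteNonQuery();
            }
        }
    }
}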
Advantages over SqlBulkCopy:
SqlBulkCopy is INSERT-only whereas using a TVP allows the data to be used in any fashion: you can call MERGE; you can DELETE based on some condition; you can split the data into multiple tables; and so on.
Due to a TVP not being INSERT-only, you don't need a separate staging table to dump the data into.
You can get data back from the database by calling ExecuteReader instead of ExecuteNonQuery. For example, if there was an IDENTITY field on the DATAs import table, you could add an OUTPUT clause to the INSERT to pass back INSERTED.[ID] (assuming ID is the name of the IDENTITY field). Or you can pass back the results of a completely different query, or both since multiple results sets can be sent and accessed via Reader.NextResult(). Getting info back from the database is not possible when using SqlBulkCopy yet there are several questions here on S.O. of people wanting to do exactly that (at least with regards to the newly created IDENTITY values).
For more info on why it is sometimes faster for the overall process, even if slightly slower on getting the data from disk into SQL Server, please see this whitepaper from the SQL Server Customer Advisory Team: Maximizing Throughput with TVP
In C#, the best solution is to let SqlBulkCopy read the file. To do this you need to pass an IDataReader directly to the SqlBulkCopy.WriteToServer method. Here is an example: http://www.codeproject.com/Articles/228332/IDataReader-implementation-plus-SqlBulkCopy
The best way is a mix between your 1st and 2nd solutions: create a DataTable, add rows to it in the loop, then use SqlBulkCopy to upload it to the DB in one connection. One other thing to pay attention to: bulk copy is a very sensitive operation where almost any mistake will void the copy; for example, if you declare the column name in the DataTable as "text" and in the DB it is "Text", it will throw an exception. Good luck.
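A hedged sketch of that approach against the question's dbo.DATAs(DatasField) table: accumulate the file into a DataTable and push it with one SqlBulkCopy call, using explicit column mappings to sidestep the name-mismatch gotcha mentioned above. The batch size and connection handling are assumptions.

using System.Data;
using System.Data.SqlClient;
using System.IO;

public static class DataTableBulkCopySketch
{
    public static void LoadFile(string filePath, string connectionString)
    {
        var table = new DataTable();
        table.Columns.Add("DatasField", typeof(string));

        using (var reader = new StreamReader(filePath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                table.Rows.Add(line);
        }

        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "dbo.DATAs";
            bulkCopy.ColumnMappings.Add("DatasField", "DatasField"); // source name, destination name
            bulkCopy.BatchSize = 10000;                              // send in chunks rather than one huge batch
            bulkCopy.WriteToServer(table);
        }
    }
}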
If you want to insert 10 million records in the shortest time directly with a SQL query, for testing purposes, you can use this:
CREATE TABLE TestData(ID INT IDENTITY (1,1), CreatedDate DATETIME)
GO
INSERT INTO TestData(CreatedDate) SELECT GetDate()
GO 10000000
I am doing a loop insert as seen below (Method A); it seems that calling the database with every single loop iteration isn't a good idea. I found an alternative: loop over a comma-delimited string in my SProc instead to do the inserts, so there is only one call to the DB. Will there be any significant improvement in terms of performance?
Method A:
foreach (DataRow row in dt.Rows)
{
userBll = new UserBLL();
UserId = (Guid)row["UserId"];
// Call userBll method to insert into SQL Server with UserId as one of the parameter.
}
Method B:
string UserIds = "Tom, Jerry, 007"; // Assuming we already concatenate the strings. So no loops this time here.
userBll = new UserBLL();
// Call userBll method to insert into SQL Server with 'UserIds' as parameter.
Method B SProc / Perform a loop insert in the SProc.
if right(rtrim(@UserIds ), 1) <> ','
SELECT @string = @UserIds + ','
SELECT @pos = patindex('%,%' , @UserIds )
while @pos <> 0
begin
SELECT @piece = left(@v, (@pos-1))
-- Perform the insert here
SELECT @UserIds = stuff(@string, 1, @pos, '')
SELECT @pos = patindex('%,%' , @UserIds )
end
Fewer queries usually mean faster processing. That said, a co-worker of mine had some success with the .NET Framework's wrapper of the T-SQL BULK INSERT functionality, which is provided by the Framework as SqlBulkCopy.
This MSDN blog entry shows how to use it.
The main "API" sample is this (taken from the linked article as-is, it writes the contents of a DataTable to SQL):
private void WriteToDatabase()
{
// get your connection string
string connString = "";
// connect to SQL
using (SqlConnection connection =
new SqlConnection(connString))
{
// make sure to enable triggers
// more on triggers in next post
SqlBulkCopy bulkCopy =
new SqlBulkCopy
(
connection,
SqlBulkCopyOptions.TableLock |
SqlBulkCopyOptions.FireTriggers |
SqlBulkCopyOptions.UseInternalTransaction,
null
);
// set the destination table name
bulkCopy.DestinationTableName = this.tableName;
connection.Open();
// write the data in the "dataTable"
bulkCopy.WriteToServer(dataTable);
connection.Close();
}
// reset
this.dataTable.Clear();
this.recordCount = 0;
}
The linked article explains what needs to be done to leverage this mechanism.
In my experience, there are three things you don't want to have to do for each record:
Open/close a sql connection per row. This concern is handled by ADO.NET connection pooling. You shouldn't have to worry about it unless you have disabled the pooling.
Database roundtrip per row. This tends to be less about the network bandwidth or network latency and more about the client side thread sleeping. You want a substantial amount of work on the client side each time it wakes up or you are wasting your time slice.
Open/close the sql transaction log per row. Opening and closing the log is not free, but you don't want to hold it open too long either. Do many inserts in a single transaction, but not too many.
On any of these, you'll probably see a lot of improvement going from 1 row per request to 10 rows per request. You can achieve this by building up 10 insert statements on the client side before transmitting the batch.
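A hedged sketch of "10 rows per request": build one parameterized multi-row INSERT on the client and send it as a single command. The dbo.Users table/column and the assumption that UserId is a Guid (as in the question's DataRow) are placeholders.

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Text;

public static class MultiRowInsertSketch
{
    public static void InsertBatch(SqlConnection connection, IReadOnlyList<Guid> userIds)
    {
        // Build "INSERT ... VALUES (@UserId0), (@UserId1), ..." with one parameter per row.
        var sql = new StringBuilder("INSERT INTO dbo.Users (UserId) VALUES ");
        using (var command = new SqlCommand { Connection = connection })
        {
            for (int i = 0; i < userIds.Count; i++)
            {
                if (i > 0) sql.Append(", ");
                sql.Append("(@UserId").Append(i).Append(")");
                command.Parameters.AddWithValue("@UserId" + i, userIds[i]);
            }
            command.CommandText = sql.ToString();
            command.ExecuteNonQuery(); // one round trip for the whole batch
        }
    }
}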
Your approach of sending a list into a proc has been written about in extreme depth by Sommarskog.
If you are looking for better insert performance with multiple input values of a given type, I would recommend you look at table valued parameters.
And a sample can be found here, showing some example code that uses them.
You can use bulk insert functionality for this.
See this blog for details: http://blogs.msdn.com/b/nikhilsi/archive/2008/06/11/bulk-insert-into-sql-from-c-app.aspx
I'm looking for an efficient way of inserting records into SQL server for my C#/MVC application. Anyone know what the best method would be?
Normally I've just done a while loop and insert statement within, but then again I've not had quite so many records to deal with. I need to insert around half a million, and at 300 rows a minute with the while loop, I'll be here all day!
What I'm doing is looping through a large holding table, and using its rows to create records in a different table. I've set up some functions for lookup data which is necessary for the new table, and this is no doubt adding to the drain.
So here is the query I have. Extremely inefficient for large amounts of data!
Declare @HoldingID int
Set @HoldingID = (Select min(HoldingID) From HoldingTable)
While @JourneyHoldingID IS NOT NULL
Begin
Insert Into Journeys (DepartureID, ArrivalID, ProviderID, JourneyNumber, Active)
Select
dbo.GetHubIDFromName(StartHubName),
dbo.GetHubIDFromName(EndHubName),
dbo.GetBusIDFromName(CompanyName),
JourneyNo, 1
From Holding
Where HoldingID = @HoldingID
Set @HoldingID = (Select MIN(HoldingID) From Holding Where HoldingID > @HoldingID)
End
I've heard about set-based approaches - is there anything that might work for the above problem?
If you want to insert a lot of data into a MSSQL Server then you should use BULK INSERTs - there is a command line tool called the bcp utility for this, and also a C# wrapper for performing Bulk Copy Operations, but under the covers they are all using BULK INSERT.
Depending on your application you may want to insert your data into a staging table first, and then either MERGE or INSERT INTO SELECT... to transfer those rows from the staging table(s) to the target table(s) - if you have a lot of data then this will take some time, however will be a lot quicker than performing the inserts individually.
If you want to speed this up then there are various things that you can do, such as changing the recovery model or tweaking/removing triggers and indexes (depending on whether or not this is a live database). If it's still really slow then you should look into doing this process in batches (e.g. 1000 rows at a time).
This should be exactly what you are doing now.
Insert Into Journeys(DepartureID, ArrivalID, ProviderID, JourneyNumber, Active)
Select
dbo.GetHubIDFromName(StartHubName),
dbo.GetHubIDFromName(EndHubName),
dbo.GetBusIDFromName(CompanyName),
JourneyNo, 1
From Holding
ORDER BY HoldingID ASC
you (probably) are able to do it in one statement of the form
INSERT INTO JOURNEYS
SELECT * FROM HOLDING;
Without more information about your schema it is difficult to be absolutely sure.
SQL Server 2008 introduced table-valued parameters. These allow you to insert multiple rows in a single trip to the database (sent as one large blob), without using a temporary table. This article describes how it works (step four in the article):
http://www.altdevblogaday.com/2012/05/16/sql-server-high-performance-inserts/
It differs from bulk inserts in that you do not need special utilities and that all constraints and foreign keys are checked.
I quadrupled my throughput using this and parallelizing the inserts. Now at 15.000 inserts/second in the same table sustained. Regular table with indexes and over a billion rows.