I need to programmatically insert tens of millions of records into a Postgres database. Presently, I'm executing thousands of insert statements in a single query.
Is there a better way to do this, some bulk insert statement I do not know about?
PostgreSQL has a guide on how to best populate a database initially, and they suggest using the COPY command for bulk loading rows. The guide has some other good tips on how to speed up the process, like removing indexes and foreign keys before loading the data (and adding them back afterwards).
There is an alternative to using COPY, which is the multirow values syntax that Postgres supports. From the documentation:
INSERT INTO films (code, title, did, date_prod, kind) VALUES
('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');
The above code inserts two rows, but you can extend it arbitrarily, until you hit the maximum number of prepared statement tokens (it might be $999, but I'm not 100% sure about that). Sometimes one cannot use COPY, and this is a worthy replacement for those situations.
One way to speed things up is to explicitly perform multiple inserts or copy's within a transaction (say 1000). Postgres's default behavior is to commit after each statement, so by batching the commits, you can avoid some overhead. As the guide in Daniel's answer says, you may have to disable autocommit for this to work. Also note the comment at the bottom that suggests increasing the size of the wal_buffers to 16 MB may also help.
UNNEST function with arrays can be used along with multirow VALUES syntax. I'm think that this method is slower than using COPY but it is useful to me in work with psycopg and python (python list passed to cursor.execute becomes pg ARRAY):
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
VALUES (
UNNEST(ARRAY[1, 2, 3]),
UNNEST(ARRAY[100, 200, 300]),
UNNEST(ARRAY['a', 'b', 'c'])
);
without VALUES using subselect with additional existance check:
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
SELECT * FROM (
SELECT UNNEST(ARRAY[1, 2, 3]),
UNNEST(ARRAY[100, 200, 300]),
UNNEST(ARRAY['a', 'b', 'c'])
) AS temptable
WHERE NOT EXISTS (
SELECT 1 FROM tablename tt
WHERE tt.fieldname1=temptable.fieldname1
);
the same syntax to bulk updates:
UPDATE tablename
SET fieldname1=temptable.data
FROM (
SELECT UNNEST(ARRAY[1,2]) AS id,
UNNEST(ARRAY['a', 'b']) AS data
) AS temptable
WHERE tablename.id=temptable.id;
((this is a WIKI you can edit and enhance the answer!))
The external file is the best and typical bulk-data
The term "bulk data" is related to "a lot of data", so it is natural to use original raw data, with no need to transform it into SQL. Typical raw data files for "bulk insert" are CSV and JSON formats.
Bulk insert with some transformation
In ETL applications and ingestion processes, we need to change the data before inserting it. Temporary table consumes (a lot of) disk space, and it is not the faster way to do it. The PostgreSQL foreign-data wrapper (FDW) is the best choice.
CSV example. Suppose the tablename (x, y, z) on SQL and a CSV file like
fieldname1,fieldname2,fieldname3
etc,etc,etc
... million lines ...
You can use the classic SQL COPY to load (as is original data) into tmp_tablename, them insert filtered data into tablename... But, to avoid disk consumption, the best is to ingested directly by
INSERT INTO tablename (x, y, z)
SELECT f1(fieldname1), f2(fieldname2), f3(fieldname3) -- the transforms
FROM tmp_tablename_fdw
-- WHERE condictions
;
You need to prepare database for FDW, and instead static tmp_tablename_fdw you can use a function that generates it:
CREATE EXTENSION file_fdw;
CREATE SERVER import FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE tmp_tablename_fdw(
...
) SERVER import OPTIONS ( filename '/tmp/pg_io/file.csv', format 'csv');
JSON example. A set of two files, myRawData1.json and Ranger_Policies2.json can be ingested by:
INSERT INTO tablename (fname, metadata, content)
SELECT fname, meta, j -- do any data transformation here
FROM jsonb_read_files('myRawData%.json')
-- WHERE any_condiction_here
;
where the function jsonb_read_files() reads all files of a folder, defined by a mask:
CREATE or replace FUNCTION jsonb_read_files(
p_flike text, p_fpath text DEFAULT '/tmp/pg_io/'
) RETURNS TABLE (fid int, fname text, fmeta jsonb, j jsonb) AS $f$
WITH t AS (
SELECT (row_number() OVER ())::int id,
f AS fname,
p_fpath ||'/'|| f AS f
FROM pg_ls_dir(p_fpath) t(f)
WHERE f LIKE p_flike
) SELECT id, fname,
to_jsonb( pg_stat_file(f) ) || jsonb_build_object('fpath', p_fpath),
pg_read_file(f)::jsonb
FROM t
$f$ LANGUAGE SQL IMMUTABLE;
Lack of gzip streaming
The most frequent method for "file ingestion" (mainlly in Big Data) is preserving original file on gzip format and transfering it with streaming algorithm, anything that can runs fast and without disc consumption in unix pipes:
gunzip remote_or_local_file.csv.gz | convert_to_sql | psql
So ideal (future) is a server option for format .csv.gz.
Note after #CharlieClark comment: currently (2022) nothing to do, the best alternative seems pgloader STDIN:
gunzip -c file.csv.gz | pgloader --type csv ... - pgsql:///target?foo
You can use COPY table TO ... WITH BINARY which is "somewhat faster than the text and CSV formats." Only do this if you have millions of rows to insert, and if you are comfortable with binary data.
Here is an example recipe in Python, using psycopg2 with binary input.
It mostly depends on the (other) activity in the database. Operations like this effectively freeze the entire database for other sessions. Another consideration is the datamodel and the presence of constraints,triggers, etc.
My first approach is always: create a (temp) table with a structure similar to the target table (create table tmp AS select * from target where 1=0), and start by reading the file into the temp table.
Then I check what can be checked: duplicates, keys that already exist in the target, etc.
Then I just do a do insert into target select * from tmp or similar.
If this fails, or takes too long, I abort it and consider other methods (temporarily dropping indexes/constraints, etc)
I just encountered this issue and would recommend csvsql (releases) for bulk imports to Postgres. To perform a bulk insert you'd simply createdb and then use csvsql, which connects to your database and creates individual tables for an entire folder of CSVs.
$ createdb test
$ csvsql --db postgresql:///test --insert examples/*.csv
I implemented very fast Postgresq data loader with native libpq methods.
Try my package https://www.nuget.org/packages/NpgsqlBulkCopy/
May be I'm late already. But, there is a Java library called pgbulkinsert by Bytefish. Me and my team were able to bulk insert 1 Million records in 15 seconds. Of course, there were some other operations that we performed like, reading 1M+ records from a file sitting on Minio, do couple of processing on the top of 1M+ records, filter down records if duplicates, and then finally insert 1M records into the Postgres Database. And all these processes were completed within 15 seconds. I don't remember exactly how much time it took to do the DB operation, but I think it was around less then 5 seconds. Find more details from https://www.bytefish.de/blog/pgbulkinsert_bulkprocessor.html
As others have noted, when importing data into Postgres, things will be slowed by the checks that Postgres is designed to do for you. Also, you often need to manipulate the data in one way or another so that it's suitable for use. Any of this that can be done outside of the Postgres process will mean that you can import using the COPY protocol.
For my use I regularly import data from the httparchive.org project using pgloader. As the source files are created by MySQL you need to be able to handle some MySQL oddities such as the use of \N for an empty value and along with encoding problems. The files are also so large that, at least on my machine, using FDW runs out of memory. pgloader makes it easy to create a pipeline that lets you select the fields you want, cast to the relevant data types and any additional work before it goes into your main database so that index updates, etc. are minimal.
The query below can create test table with generate_series column which has 10000 rows. *I usually create such test table to test query performance and you can check generate_series():
CREATE TABLE test AS SELECT generate_series(1, 10000);
postgres=# SELECT count(*) FROM test;
count
-------
10000
(1 row)
postgres=# SELECT * FROM test;
generate_series
-----------------
1
2
3
4
5
6
-- More --
And, run the query below to insert 10000 rows if you've already had test table:
INSERT INTO test (generate_series) SELECT generate_series(1, 10000);
I may have a slight feeling as to what is going on, but I thought I'd ask to get confirmation, and potentially look at an alternative.
As a slight background, I've written a C# application that is providing a small front end to a stored procedure. The procedure contains a number of temporary table inserts from other stored procedures (and one table valued udf), and some xml processing.
In order to get a sense of how far along I am in the stored procedure, I'm subscribing to the InfoMessage (using an SqlInfoMessageEventHandler) from an SqlConnection. I've put some informative print statements in various places throughout the SP, so I can get a sense of the processing which has completed and update a status bar accordingly).
The rough structure of the SP is along these lines:
Print 'Beginning processing'
Create Temp Table
Insert into Temp Table from Table Valued UDF
Print 'Creating working Tables'
Create more Temp Tables
Insert into Temp Tables from SPs (each SP contains a print statement e.g 'Starting SP1').
All the messages are received and processed successfully, but there is a delay of a few seconds before any messages are returned from the server, then the first few arrive all at once (as though they were all processed but the output was held back for a while).
I had naively assumed (I still have a lot to learn about DB mechanics) that my initial print statement would be returned before any of the other processing instructions within the SP took place.
I assume that the server is doing something regarding fetching execution plans an/or potentially recalculating plans, or is it the case that query optimiser is doing some preprocessing before any results at all are returned?
Hopefully my question is reasonably understandable from that mess of text. Essentially, would the query optimiser make the server perform some of the selects/inserts before the procedure actually begins sequentially encountering my print statements?
I also tried doing some small temp table manipulation before my initial print statement, so some rows would be returned before the slow operations began but the result was much the same.
Thanks for any responses.
I can judge by MSSM, when you use PRINT statements they are late because of output happens once buffer is full. If you would like to output message straight away use:
RAISERROR('Message',0,0) WITH NOWAIT
It will work as Print but will be outputted straight away.
Hope this helps.
I would prefer to use SELECT statement instead of PRINT -
SELECT 'Beginning processing'
Create Temp Table
Insert into Temp Table from Table Valued UDF
SELECT 'Creating working Tables'
Create more Temp Tables
Insert into Temp Tables from SPs (each SP contains a print statement e.g 'Starting SP1').
I'm looking for an efficient way of inserting records into SQL server for my C#/MVC application. Anyone know what the best method would be?
Normally I've just done a while loop and insert statement within, but then again I've not had quite so many records to deal with. I need to insert around half a million, and at 300 rows a minute with the while loop, I'll be here all day!
What I'm doing is looping through a large holding table, and using it's rows to create records in a different table. I've set up some functions for lookup data which is necessary for the new table, and this is no doubt adding to the drain.
So here is the query I have. Extremely inefficient for large amounts of data!
Declare #HoldingID int
Set #HoldingID = (Select min(HoldingID) From HoldingTable)
While #JourneyHoldingID IS NOT NULL
Begin
Insert Into Journeys (DepartureID, ArrivalID, ProviderID, JourneyNumber, Active)
Select
dbo.GetHubIDFromName(StartHubName),
dbo.GetHubIDFromName(EndHubName),
dbo.GetBusIDFromName(CompanyName),
JourneyNo, 1
From Holding
Where HoldingID = #HoldingID
Set #HoldingID = (Select MIN(HoldingID) From Holding Where HoldingID > #HoldingID)
End
I've heard about set-based approaches - is there anything that might work for the above problem?
If you want to insert a lot of data into a MSSQL Server then you should use BULK INSERTs - there is a command line tool called the bcp utility for this, and also a C# wrapper for performing Bulk Copy Operations, but under the covers they are all using BULK INSERT.
Depending on your application you may want to insert your data into a staging table first, and then either MERGE or INSERT INTO SELECT... to transfer those rows from the staging table(s) to the target table(s) - if you have a lot of data then this will take some time, however will be a lot quicker than performing the inserts individually.
If you want to speed this up then are various things that you can do such as changing the recovery model or tweaking / removing triggers and indexes (depending on whether or not this is a live database or not). If its still really slow then you should look into doing this process in batches (e.g. 1000 rows at a time).
This should be exactly what you are doing now.
Insert Into Journeys(DepartureID, ArrivalID, ProviderID, JourneyNumber, Active)
Select
dbo.GetHubIDFromName(StartHubName),
dbo.GetHubIDFromName(EndHubName),
dbo.GetBusIDFromName(CompanyName),
JourneyNo, 1
From Holding
ORDER BY HoldingID ASC
you (probably) are able to do it in one statement of the form
INSERT INTO JOURNEYS
SELECT * FROM HOLDING;
Without more information about your schema it is difficult to be absolutely sure.
SQLServer 2008 introduced Table Parameters. These allow you to insert multiple rows in a single trip to the database (send it as a large blob). Without using a temporary table. This article describes how it works (step four in the article)
http://www.altdevblogaday.com/2012/05/16/sql-server-high-performance-inserts/
It differs from bulk inserts in that you do not need special utilities and that all constraints and foreign keys are checked.
I quadrupled my throughput using this and parallelizing the inserts. Now at 15.000 inserts/second in the same table sustained. Regular table with indexes and over a billion rows.
I have an account creation process and basically when the user signs up, I have to make entries in mutliple tables namely User, Profile, Addresses. There will be 1 entry in User table, 1 entry in Profile and 2-3 entries in Address table. So, at most there will be 5 entries. My question is should I pass a XML of this to my stored procedure and parse it in there or should I create a transaction object in my C# code, keep the connection open and insert addresses one by one in loop?
How do you approach this scenario? Can making multiple calls degrade the performance even though the connection is open?
No offence, but you're over thinking this.
Gather your information, when you have it all together, create a transaction and insert the new rows one at a time. There's no performance hit here, as the transaction will be short lived.
A problem would be if you create the transaction on the connection, insert the user row, then wait for the user to enter more profile information, insert that, then wait for them to add address information, then insert that, DO NOT DO THIS, this is a needlessly long running transaction, and will create problems.
However, your scenario (where you have all the data) is a correct use of a transaction, it ensures your data integrity and will not put any strain on your database, and will not - on it's own - create deadlocks.
Hope this helps.
P.S. The drawbacks with the Xml approach is the added complexity, your code needs to know the schema of the xml, your stored procedure needs to know the Xml schema too. The stored procedure has the added complexity of parsing the xml, then inserting the rows. I really don't see the advantage of the extra complexity for what is a simple short running transaction.
If you want to insert records in multiple table then using XML parameter is a complex method. Creating Xml in .net and extracting records from xml for three diffrent tables is complex in sql server.
Executing queries within a transaction is easy approach but some performance will degrade there to switch between .net code and sql server.
Best approach is to use table parameter in storedprocedure. Create three data table in .net code and pass them in stored procedure.
--Create Type TargetUDT1,TargetUDT2 and TargetUDT3 for each type of table with all fields which needs to insert
CREATE TYPE [TargetUDT1] AS TABLE
(
[FirstName] [varchar](100)NOT NULL,
[LastName] [varchar](100)NOT NULL,
[Email] [varchar](200) NOT NULL
)
--Now write down the sp in following manner.
CREATE PROCEDURE AddToTarget(
#TargetUDT1 TargetUDT1 READONLY,
#TargetUDT2 TargetUDT2 READONLY,
#TargetUDT3 TargetUDT3 READONLY)
AS
BEGIN
INSERT INTO [Target1]
SELECT * FROM #TargetUDT1
INSERT INTO [Target2]
SELECT * FROM #TargetUDT2
INSERT INTO [Target3]
SELECT * FROM #TargetUDT3
END
In .Net, Create three data table and fill the value, and call the sp normally.
For example assuming your xml as below
<StoredProcedure>
<User>
<UserName></UserName>
</User>
<Profile>
<FirstName></FirstName>
</Profile>
<Address>
<Data></Data>
<Data></Data>
<Data></Data>
</Address>
</StoredProcedure>
this would be your stored procedure
INSERT INTO Users (UserName) SELECT(UserName) FROM OPENXML(#idoc,'StoredProcedure/User',2)
WITH ( UserName NVARCHAR(256))
where this would provide idoc variable value and #doc is the input to the stored procedure
DECLARE #idoc INT
--Create an internal representation of the XML document.
EXEC sp_xml_preparedocument #idoc OUTPUT, #doc
using similar technique you would run 3 inserts in single stored procedure. Note that it is single call to database and multiple address elements will be inserted in single call to this stored procedure.
Update
just not to mislead you here is a complete stored procedure for you do understand what you are going to do
USE [DBNAME]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER OFF
GO
CREATE PROCEDURE [dbo].[procedure_name]
#doc [ntext]
WITH EXECUTE AS CALLER
AS
DECLARE #idoc INT
DECLARE #RowCount INT
SET #ErrorProfile = 0
--Create an internal representation of the XML document.
EXEC sp_xml_preparedocument #idoc OUTPUT, #doc
BEGIN TRANSACTION
INSERT INTO Users (UserName)
SELECT UserName FROM OPENXML(#idoc,'StoredProcedure/User',2)
WITH ( UserName NVARCHAR(256) )
-- Insert Address
-- Insert Profile
SELECT #ErrorProfile = ##Error
IF #ErrorProfile = 0
BEGIN
COMMIT TRAN
END
ELSE
BEGIN
ROLLBACK TRAN
END
EXEC sp_xml_removedocument #idoc
Have you noticed any performance problems, what you are trying to do is very straight forward and many applications do this day in day out. Be careful not to be drawn into any premature optimization.
Database inserts should be very cheep, as you have suggested create a new transaction scope, open you connection, run your inserts, commit the transaction and finally dispose everything.
using (var tran = new TransactionScope())
using (var conn = new SqlConnection(YourConnectionString))
using (var insetCommand1 = conn.CreateCommand())
using (var insetCommand2 = conn.CreateCommand())
{
insetCommand1.CommandText = \\SQL to insert
insetCommand2.CommandText = \\SQL to insert
insetCommand1.ExecuteNonQuery();
insetCommand2.ExecuteNonQuery();
tran.Complete();
}
Bundling all your logic into a stored procedure and using XML gives you added complications, you will need to have additional logic in your database, you now have to transform your entities into an XML blob and you code has become harder to unit test.
There are a number of things you can do to make the code easier to use. The first step would be to push your database logic into a reusable database layer and use the concept of a repository to read and write your objects from the database.
You could of course make your life a lot easier and have a look at any of the ORM (Object-relational mapping) libraries that are available. They take away the pain of talking to the database and handle that for you.
I have a DataTable in memory that I need to dump straight into a SQL Server temp table.
After the data has been inserted, I transform it a little bit, and then insert a subset of those records into a permanent table.
The most time consuming part of this operation is getting the data into the temp table.
Now, I have to use temp tables, because more than one copy of this app is running at once, and I need a layer of isolation until the actual insert into the permanent table happens.
What is the fastest way to do a bulk insert from a C# DataTable into a SQL Temp Table?
I can't use any 3rd party tools for this, since I am transforming the data in memory.
My current method is to create a parameterized SqlCommand:
INSERT INTO #table (col1, col2, ... col200) VALUES (#col1, #col2, ... #col200)
and then for each row, clear and set the parameters and execute.
There has to be a more efficient way. I'm able to read and write the records on disk in a matter of seconds...
SqlBulkCopy will get the data in very fast.
I blogged not that long ago how to maximise performance. Some stats and examples in there. I compared 2 techniques, 1 using an SqlDataAdapter and 1 using SqlBulkCopy - bottom line was for bulk inserting 100K records, the data adapter approach took ~25 seconds compared to only ~0.8s for SqlBulkCopy.
You should use the SqlBulkCopy class.