I need to programmatically insert tens of millions of records into a Postgres database. Presently, I'm executing thousands of insert statements in a single query.
Is there a better way to do this, some bulk insert statement I do not know about?
PostgreSQL has a guide on how to best populate a database initially, and they suggest using the COPY command for bulk loading rows. The guide has some other good tips on how to speed up the process, like removing indexes and foreign keys before loading the data (and adding them back afterwards).
There is an alternative to using COPY, which is the multirow values syntax that Postgres supports. From the documentation:
INSERT INTO films (code, title, did, date_prod, kind) VALUES
('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');
The above code inserts two rows, but you can extend it arbitrarily, until you hit the maximum number of prepared statement tokens (it might be $999, but I'm not 100% sure about that). Sometimes one cannot use COPY, and this is a worthy replacement for those situations.
One way to speed things up is to explicitly perform multiple inserts or copy's within a transaction (say 1000). Postgres's default behavior is to commit after each statement, so by batching the commits, you can avoid some overhead. As the guide in Daniel's answer says, you may have to disable autocommit for this to work. Also note the comment at the bottom that suggests increasing the size of the wal_buffers to 16 MB may also help.
UNNEST function with arrays can be used along with multirow VALUES syntax. I'm think that this method is slower than using COPY but it is useful to me in work with psycopg and python (python list passed to cursor.execute becomes pg ARRAY):
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
VALUES (
UNNEST(ARRAY[1, 2, 3]),
UNNEST(ARRAY[100, 200, 300]),
UNNEST(ARRAY['a', 'b', 'c'])
);
without VALUES using subselect with additional existance check:
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
SELECT * FROM (
SELECT UNNEST(ARRAY[1, 2, 3]),
UNNEST(ARRAY[100, 200, 300]),
UNNEST(ARRAY['a', 'b', 'c'])
) AS temptable
WHERE NOT EXISTS (
SELECT 1 FROM tablename tt
WHERE tt.fieldname1=temptable.fieldname1
);
the same syntax to bulk updates:
UPDATE tablename
SET fieldname1=temptable.data
FROM (
SELECT UNNEST(ARRAY[1,2]) AS id,
UNNEST(ARRAY['a', 'b']) AS data
) AS temptable
WHERE tablename.id=temptable.id;
((this is a WIKI you can edit and enhance the answer!))
The external file is the best and typical bulk-data
The term "bulk data" is related to "a lot of data", so it is natural to use original raw data, with no need to transform it into SQL. Typical raw data files for "bulk insert" are CSV and JSON formats.
Bulk insert with some transformation
In ETL applications and ingestion processes, we need to change the data before inserting it. Temporary table consumes (a lot of) disk space, and it is not the faster way to do it. The PostgreSQL foreign-data wrapper (FDW) is the best choice.
CSV example. Suppose the tablename (x, y, z) on SQL and a CSV file like
fieldname1,fieldname2,fieldname3
etc,etc,etc
... million lines ...
You can use the classic SQL COPY to load (as is original data) into tmp_tablename, them insert filtered data into tablename... But, to avoid disk consumption, the best is to ingested directly by
INSERT INTO tablename (x, y, z)
SELECT f1(fieldname1), f2(fieldname2), f3(fieldname3) -- the transforms
FROM tmp_tablename_fdw
-- WHERE condictions
;
You need to prepare database for FDW, and instead static tmp_tablename_fdw you can use a function that generates it:
CREATE EXTENSION file_fdw;
CREATE SERVER import FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE tmp_tablename_fdw(
...
) SERVER import OPTIONS ( filename '/tmp/pg_io/file.csv', format 'csv');
JSON example. A set of two files, myRawData1.json and Ranger_Policies2.json can be ingested by:
INSERT INTO tablename (fname, metadata, content)
SELECT fname, meta, j -- do any data transformation here
FROM jsonb_read_files('myRawData%.json')
-- WHERE any_condiction_here
;
where the function jsonb_read_files() reads all files of a folder, defined by a mask:
CREATE or replace FUNCTION jsonb_read_files(
p_flike text, p_fpath text DEFAULT '/tmp/pg_io/'
) RETURNS TABLE (fid int, fname text, fmeta jsonb, j jsonb) AS $f$
WITH t AS (
SELECT (row_number() OVER ())::int id,
f AS fname,
p_fpath ||'/'|| f AS f
FROM pg_ls_dir(p_fpath) t(f)
WHERE f LIKE p_flike
) SELECT id, fname,
to_jsonb( pg_stat_file(f) ) || jsonb_build_object('fpath', p_fpath),
pg_read_file(f)::jsonb
FROM t
$f$ LANGUAGE SQL IMMUTABLE;
Lack of gzip streaming
The most frequent method for "file ingestion" (mainlly in Big Data) is preserving original file on gzip format and transfering it with streaming algorithm, anything that can runs fast and without disc consumption in unix pipes:
gunzip remote_or_local_file.csv.gz | convert_to_sql | psql
So ideal (future) is a server option for format .csv.gz.
Note after #CharlieClark comment: currently (2022) nothing to do, the best alternative seems pgloader STDIN:
gunzip -c file.csv.gz | pgloader --type csv ... - pgsql:///target?foo
You can use COPY table TO ... WITH BINARY which is "somewhat faster than the text and CSV formats." Only do this if you have millions of rows to insert, and if you are comfortable with binary data.
Here is an example recipe in Python, using psycopg2 with binary input.
It mostly depends on the (other) activity in the database. Operations like this effectively freeze the entire database for other sessions. Another consideration is the datamodel and the presence of constraints,triggers, etc.
My first approach is always: create a (temp) table with a structure similar to the target table (create table tmp AS select * from target where 1=0), and start by reading the file into the temp table.
Then I check what can be checked: duplicates, keys that already exist in the target, etc.
Then I just do a do insert into target select * from tmp or similar.
If this fails, or takes too long, I abort it and consider other methods (temporarily dropping indexes/constraints, etc)
I just encountered this issue and would recommend csvsql (releases) for bulk imports to Postgres. To perform a bulk insert you'd simply createdb and then use csvsql, which connects to your database and creates individual tables for an entire folder of CSVs.
$ createdb test
$ csvsql --db postgresql:///test --insert examples/*.csv
I implemented very fast Postgresq data loader with native libpq methods.
Try my package https://www.nuget.org/packages/NpgsqlBulkCopy/
May be I'm late already. But, there is a Java library called pgbulkinsert by Bytefish. Me and my team were able to bulk insert 1 Million records in 15 seconds. Of course, there were some other operations that we performed like, reading 1M+ records from a file sitting on Minio, do couple of processing on the top of 1M+ records, filter down records if duplicates, and then finally insert 1M records into the Postgres Database. And all these processes were completed within 15 seconds. I don't remember exactly how much time it took to do the DB operation, but I think it was around less then 5 seconds. Find more details from https://www.bytefish.de/blog/pgbulkinsert_bulkprocessor.html
As others have noted, when importing data into Postgres, things will be slowed by the checks that Postgres is designed to do for you. Also, you often need to manipulate the data in one way or another so that it's suitable for use. Any of this that can be done outside of the Postgres process will mean that you can import using the COPY protocol.
For my use I regularly import data from the httparchive.org project using pgloader. As the source files are created by MySQL you need to be able to handle some MySQL oddities such as the use of \N for an empty value and along with encoding problems. The files are also so large that, at least on my machine, using FDW runs out of memory. pgloader makes it easy to create a pipeline that lets you select the fields you want, cast to the relevant data types and any additional work before it goes into your main database so that index updates, etc. are minimal.
The query below can create test table with generate_series column which has 10000 rows. *I usually create such test table to test query performance and you can check generate_series():
CREATE TABLE test AS SELECT generate_series(1, 10000);
postgres=# SELECT count(*) FROM test;
count
-------
10000
(1 row)
postgres=# SELECT * FROM test;
generate_series
-----------------
1
2
3
4
5
6
-- More --
And, run the query below to insert 10000 rows if you've already had test table:
INSERT INTO test (generate_series) SELECT generate_series(1, 10000);
I have a web application that is written in MVC.Net using C# and LINQ-to-SQL (SQL Server 2008 R2).
I'd like to query the database for some values, and also insert those values into another table for later use. Obviously, I could do a normal select, then take those results and do a normal insert, but that will result in my application sending the values back to the SQL server, which is a waste as the server is where the values came from.
Is there any way I can get the select results in my application and insert them into another table without the information making a roundtrip from the the SQL server to my application and back again?
It would be cool if this was in one query, but that's less important than avoiding the roundtrip.
Assume whatever basic schema you like, I'll be extrapolating your simple example to a much more complex query.
Can I Insert the Results of a Select Statement Into Another Table Without a Roundtrip?
From a "single-query" and/or "avoid the round-trip" perspective: Yes.
From a "doing that purely in Linq to SQL" perspective: Well...mostly ;-).
The three pieces required are:
The INSERT...SELECT construct:
By using this we get half of the goal in that we have selected data and inserted it. And this is the only way to keep the data entirely at the database server and avoid the round-trip. Unfortunately, this construct is not supported by Linq-to-SQL (or Entity Framework): Insert/Select with Linq-To-SQL
The T-SQL OUTPUT clause:
This allows for doing what is essentially the tee command in Unix shell scripting: save and display the incoming rows at the same time. The OUTPUT clause just takes the set of inserted rows and sends it back to the caller, providing the other half of the goal. Unfortunately, this is also not supported by Linq-to-SQL (or Entity Framework). Now, this type of operation can also be achieved across multiple queries when not using OUTPUT, but there is really nothing gained since you then either need to a) create a temp table to dump the initial results into that will be used to insert into the table and then selected back to the caller, or b) have some way of knowing which rows that were just inserted into the table are new so that they can be properly selected back to the caller.
The DataContext.ExecuteQuery<TResult> (String, Object[]) method:
This is needed due to the two required T-SQL pieces not being supported directly in Linq-to-SQL. And even if the clunky approach to avoiding the OUTPUT clause is done (assuming it could be done in pure Linq/Lambda expressions), there is still no way around the INSERT...SELECT construct that would not be a round-trip.
Hence, multiple queries that are all pure Linq/Lambda expressions equates to a round-trip.
The only way to truly avoid the round-trip should be something like:
var _MyStuff = db.ExecuteQuery<Stuffs>(#"
INSERT INTO dbo.Table1 (Col1, Col2, Col2)
OUTPUT INSERTED.*
SELECT Col1, Col2, Col3
FROM dbo.Table2 t2
WHERE t2.Col4 = {0};",
_SomeID);
And just in case it helps anyone (since I already spent the time looking it up :), the equivalent command for Entity Framework is: Database.SqlQuery<TElement> (String, Object[])
try this query according your requirement
insert into IndentProcessDetails (DemandId,DemandMasterId,DemandQty) ( select DemandId,DemandMasterId,DemandQty from DemandDetails)
I'm looking for an efficient way of inserting records into SQL server for my C#/MVC application. Anyone know what the best method would be?
Normally I've just done a while loop and insert statement within, but then again I've not had quite so many records to deal with. I need to insert around half a million, and at 300 rows a minute with the while loop, I'll be here all day!
What I'm doing is looping through a large holding table, and using it's rows to create records in a different table. I've set up some functions for lookup data which is necessary for the new table, and this is no doubt adding to the drain.
So here is the query I have. Extremely inefficient for large amounts of data!
Declare #HoldingID int
Set #HoldingID = (Select min(HoldingID) From HoldingTable)
While #JourneyHoldingID IS NOT NULL
Begin
Insert Into Journeys (DepartureID, ArrivalID, ProviderID, JourneyNumber, Active)
Select
dbo.GetHubIDFromName(StartHubName),
dbo.GetHubIDFromName(EndHubName),
dbo.GetBusIDFromName(CompanyName),
JourneyNo, 1
From Holding
Where HoldingID = #HoldingID
Set #HoldingID = (Select MIN(HoldingID) From Holding Where HoldingID > #HoldingID)
End
I've heard about set-based approaches - is there anything that might work for the above problem?
If you want to insert a lot of data into a MSSQL Server then you should use BULK INSERTs - there is a command line tool called the bcp utility for this, and also a C# wrapper for performing Bulk Copy Operations, but under the covers they are all using BULK INSERT.
Depending on your application you may want to insert your data into a staging table first, and then either MERGE or INSERT INTO SELECT... to transfer those rows from the staging table(s) to the target table(s) - if you have a lot of data then this will take some time, however will be a lot quicker than performing the inserts individually.
If you want to speed this up then are various things that you can do such as changing the recovery model or tweaking / removing triggers and indexes (depending on whether or not this is a live database or not). If its still really slow then you should look into doing this process in batches (e.g. 1000 rows at a time).
This should be exactly what you are doing now.
Insert Into Journeys(DepartureID, ArrivalID, ProviderID, JourneyNumber, Active)
Select
dbo.GetHubIDFromName(StartHubName),
dbo.GetHubIDFromName(EndHubName),
dbo.GetBusIDFromName(CompanyName),
JourneyNo, 1
From Holding
ORDER BY HoldingID ASC
you (probably) are able to do it in one statement of the form
INSERT INTO JOURNEYS
SELECT * FROM HOLDING;
Without more information about your schema it is difficult to be absolutely sure.
SQLServer 2008 introduced Table Parameters. These allow you to insert multiple rows in a single trip to the database (send it as a large blob). Without using a temporary table. This article describes how it works (step four in the article)
http://www.altdevblogaday.com/2012/05/16/sql-server-high-performance-inserts/
It differs from bulk inserts in that you do not need special utilities and that all constraints and foreign keys are checked.
I quadrupled my throughput using this and parallelizing the inserts. Now at 15.000 inserts/second in the same table sustained. Regular table with indexes and over a billion rows.
I've got a call to a stored procedure, that is basically an INSERT stored procedure. It inserts into Table A, then into Table B with the identity from Table A.
Now, i need to call this stored procedure N amount of times from my application code.
Is there any way i can batch this? At the moment it's doing N round trips to the DB, i would like it to be one.
The only approach i can think of is to pass a the entire list of items across the wire, via an User Defined Table Type.
But the problem with this approach is that i will need a CURSOR in the sproc to loop through each item in order to do the insert (because of the identity field).
Basically, can we batch DbCommand.ExecuteNonQuery() with EF 4.2?
Or can we do it with something like Dapper?
You can keep it like that and in the stored procedure just do a MERGE between your target table and the table parameter. Because you are always coming with new records, the MERGE will enter only on the INSERT branch.
In this case, using MERGE like this is an easy way of doing batch inserts without a cursor.
Also, another way which also avoids the use of a cursor is to use a INSERT from SELECT statement in the SP.
I have implemented an import functionality which takes data from a csv file in an Asp.Net appication. The file of the size can vary from a few kb's to a max of 10 MB.
However when an import occurs and if the file size is > 50000 it takes around 20 MINS .
Which is way too much of a time. I need to perform an import for around 300000 records within a timespan of 2-3 Mins .
I know that the import to a database also depends on the physical memory of the db server .I create insert scripts in bulk and execute . I also know using SqlBulkCopy would also be another option but in my case its just not the inserting of product's that take place but also update and delete that is a field called "FUNCTION CODE" which decides whether to Insert,Update Or Delete.
Any suggestions regarding as to how to go about this would be greatly appreciated.
One approach towards this would be to implement multiple threads which carry out processes simultaneosly ,but i have never implemented threading till date and hence am not aware of the complication i would incur by implementing the same.
Thanks & Regards,
Francis P.
SqlBulkCopy is definitely going to be fastest. I would approach this by inserting the data into a temporary table on the database. Once the data is in the temp table, you could use SQL to merge/insert/delete accordingly.
I guess you are using SQL Server...
If you are using 2005/2008 consider using SSIS to process the file. Technet
Importing huge amount of data within the asp.net process is not the best thing you can do. You might upload the file and start a process that is doing the magic for you.
If this is a repeated process and the file is uploaded via asp.net plus you are doing some decision making on the data to decide insert/update or delete, then try out http://www.codeproject.com/KB/database/CsvReader.aspx it is this fast csv reader. Its quite quick and economical with memory
You are doing all your database queries with 1 connection sequentially. So for every insert/update/delete you are sending the command through the wire, wait for the db to do it's thing, and then wake up again when something is sent back.
Databases are optimized for heavy parallel access. So there are 2 easy routes for a significant speedup:
Open X connections to the database (where you have to tweak X but just start with 5) and either: spin up 5 threads who each do a chunk of the same work you were doing.
or: use asynchronous calls and when a callback arrives shoot in the next query.
I suggest using the XML functionality in SQL Server 2005/2008, which will allow you to bulk insert and bulk update. I'd take the following approach:
Process the entire file into an in-memory data structure.
Create a single XML document from this structure to pass to a stored proc.
Create a stored proc to load data from the XML document into a temporary table, then perform the inserts and updates. See below for guidance on creating the stored proc.
There are numerous advantages to this approach:
The entire operation is completed in one database call, although if your dataset is really large you may want to batch it.
You can wrap all the database writes into a single transaction easily, and roll back if anything fails.
You are not using any dynamic SQL, which could have posed a security risk.
You can return the IDs of the inserted, updated and/or deleted records using the OUTPUT clause.
In terms of the stored proc you will need, something like the following should work:
CREATE PROCEDURE MyBulkUpdater
(
#p_XmlData VARCHAR(MAX)
)
AS
DECLARE #hDoc INT
EXEC sp_xml_preparedocument #hDoc OUTPUT, #p_XmlData
-- Temporary table, should contain the same schema as the table you want to update
CREATE TABLE #MyTempTable
(
-- ...
)
INSERT INTO #MyTempTable
(
[Field1],
[Field2]
)
SELECT
XMLData.Field1,
XMLData.Field2
FROM OPENXML (#hdoc, 'ROOT/MyRealTable', 1)
WITH
(
[Field1] int,
[Field2] varchar(50),
[__ORDERBY] int
) AS XMLData
EXEC sp_xml_removedocument #hDoc
Now you can simply insert, update and delete your real table from your temporary table as required eg
INSERT INTO MyRealTable (Field1, Field2)
SELECT Field1, Field2
FROM #MyTempTable
WHERE ...
UPDATE MyRealTable
SET rt.Field2 = tt.Field2
FROM MyRealTable rt
JOIN MyTempTable tt ON tt.Field1 = MyRealTable.Field1
WHERE ...
For an example of the XML you need to pass in, you can do:
SELECT TOP 1 *, 0 AS __ORDERBY FROM MyRealTable AS MyRealTable FOR XML AUTO, ROOT('ROOT')
For more info, see OPENXML, sp_xml_preparedocument and sp_xml_removedocument.