How to retrieve thousands of rows from a stored procedure efficiently - C#

I am using VSTS 2008 + C# + .NET 3.0 + ADO.NET + SQL Server 2008. From ADO.NET I am invoking a stored procedure on the SQL Server side. The stored procedure looks like this:
SELECT Table1.col2
FROM Table1
LEFT JOIN Table2 ON Table1.col1 = Table2.col1
WHERE Table2.col1 IS NULL
My question is: how do I retrieve the returned rows (Table1.col2 in my sample) efficiently? My result may return up to 5,000 rows, and the data type for Table1.col2 is nvarchar(4000).
thanks in advance,
George

You CANNOT - you can NEVER retrieve that much data efficiently....
The whole point of being efficient is to limit the data you retrieve - only those columns that you really need (no SELECT *, but SELECT (list of fields), which you already do), and only as many rows as you can handle easily.
For instance, you don't want to fill a drop down or listbox where the user needs to pick a single value with thousands of entries - that's just not feasible.
So I guess my point really is: if you really, truly need to return 5000 rows or more, it'll just take its time. There's not much you can do about that (if you transmit 5000 rows with 5000 bytes per row, that's 25'000'000 bytes or 25 megabytes - no magic to make that go fast).
It'll only go really fast if you find a way to limit the number of rows returned to 10, 20, 50 or so. Think: server-side paging!! :-)
Marc

You don't say what you want to do with the data. However, assuming you need to process the results in .NET, reading them with a SqlDataReader would be the most efficient way.
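For example, a minimal sketch of streaming the single nvarchar column with a SqlDataReader; the procedure name dbo.GetUnmatchedCol2 and the connection string are placeholders, not names from the question:

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Hedged sketch: stream rows as they arrive instead of buffering a whole DataSet.
List<string> values = new List<string>();
using (SqlConnection conn = new SqlConnection("Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI"))
using (SqlCommand cmd = new SqlCommand("dbo.GetUnmatchedCol2", conn)) // hypothetical proc wrapping the query above
{
    cmd.CommandType = CommandType.StoredProcedure;
    conn.Open();

    // SequentialAccess can help with wide nvarchar(4000) values, as long as columns are read in order.
    using (SqlDataReader reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
    {
        while (reader.Read())
        {
            values.Add(reader.GetString(0)); // Table1.col2
        }
    }
}
Console.WriteLine("Rows read: " + values.Count);

The reader only ever holds one row's buffer plus whatever you decide to keep, so memory stays flat even for a few thousand rows.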

I'd use exists for one.
SELECT Table1.col2
FROM Table1
WHERE NOT EXISTS (SELECT *
                  FROM Table2
                  WHERE Table2.col1 = Table1.col1)
The query can be efficient (assuming col1 is indexed, ideally with an index that also covers col2 - a very wide index, of course), but you still have to shovel a lot of data over the network.
It depends what you mean by performance. 5000 rows isn't much for a report, but it's a lot for a combo box.

Related

Bulk Insert in PostgreSQL using C# [duplicate]

I need to programmatically insert tens of millions of records into a Postgres database. Presently, I'm executing thousands of insert statements in a single query.
Is there a better way to do this, some bulk insert statement I do not know about?
PostgreSQL has a guide on how to best populate a database initially, and they suggest using the COPY command for bulk loading rows. The guide has some other good tips on how to speed up the process, like removing indexes and foreign keys before loading the data (and adding them back afterwards).
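Since the question is tagged C#, here is a hedged sketch of driving COPY from .NET with the Npgsql driver's binary import API. The table my_table(id integer, payload text), the connection string and the sample rows are assumptions for the example, not part of the question:

using System;
using Npgsql;

// Hedged sketch: COPY ... FROM STDIN in binary format through Npgsql.
var rowsToLoad = new[] { Tuple.Create(1, "alpha"), Tuple.Create(2, "beta") };

using (var conn = new NpgsqlConnection("Host=localhost;Database=mydb;Username=me;Password=secret"))
{
    conn.Open();
    using (var importer = conn.BeginBinaryImport(
        "COPY my_table (id, payload) FROM STDIN (FORMAT BINARY)"))
    {
        foreach (var row in rowsToLoad)
        {
            importer.StartRow();
            importer.Write(row.Item1);   // id
            importer.Write(row.Item2);   // payload
        }
        importer.Complete();             // commits the COPY; skipping this rolls the import back
    }
}

For tens of millions of rows this is usually the closest you can get from C# to the raw COPY performance the guide recommends.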
There is an alternative to using COPY, which is the multirow values syntax that Postgres supports. From the documentation:
INSERT INTO films (code, title, did, date_prod, kind) VALUES
('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');
The above code inserts two rows, but you can extend it arbitrarily, until you hit the maximum number of prepared statement tokens (it might be $999, but I'm not 100% sure about that). Sometimes one cannot use COPY, and this is a worthy replacement for those situations.
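If you are composing such a multirow statement from C#, a hedged sketch with Npgsql might look like the following; the column subset, batch contents and parameter naming are illustrative, and the total parameter count must stay below the limit mentioned above:

using System.Text;
using Npgsql;

// Hedged sketch: build one parameterized multirow INSERT for a small batch.
// Assumes the omitted films columns are nullable or have defaults.
var films = new[]
{
    new { Code = "B6717", Title = "Tampopo", Did = 110 },
    new { Code = "HG120", Title = "The Dinner Game", Did = 140 }
};

using (var conn = new NpgsqlConnection("Host=localhost;Database=mydb;Username=me;Password=secret"))
using (var cmd = new NpgsqlCommand())
{
    conn.Open();
    cmd.Connection = conn;

    var sql = new StringBuilder("INSERT INTO films (code, title, did) VALUES ");
    for (int i = 0; i < films.Length; i++)
    {
        if (i > 0) sql.Append(", ");
        sql.AppendFormat("(@code{0}, @title{0}, @did{0})", i);
        cmd.Parameters.AddWithValue("code" + i, films[i].Code);
        cmd.Parameters.AddWithValue("title" + i, films[i].Title);
        cmd.Parameters.AddWithValue("did" + i, films[i].Did);
    }
    cmd.CommandText = sql.ToString();
    cmd.ExecuteNonQuery(); // inserts the whole batch in a single round trip
}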
One way to speed things up is to explicitly perform multiple inserts or copies within a transaction (say 1000). Postgres's default behavior is to commit after each statement, so by batching the commits, you can avoid some overhead. As the guide in Daniel's answer says, you may have to disable autocommit for this to work. Also note the comment at the bottom that suggests increasing the size of wal_buffers to 16 MB may also help.
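A hedged sketch of that batching idea from C# with Npgsql, committing every 1000 rows; the table, columns, sample data and batch size are all illustrative:

using Npgsql;

// Hedged sketch: many single-row inserts, but committed in batches of 1000.
var sourceRows = new[]
{
    new { Code = "B6717", Title = "Tampopo" },
    new { Code = "HG120", Title = "The Dinner Game" }
};

using (var conn = new NpgsqlConnection("Host=localhost;Database=mydb;Username=me;Password=secret"))
{
    conn.Open();
    var tx = conn.BeginTransaction();
    int n = 0;
    foreach (var row in sourceRows)
    {
        using (var cmd = new NpgsqlCommand(
            "INSERT INTO films (code, title) VALUES (@code, @title)", conn, tx))
        {
            cmd.Parameters.AddWithValue("code", row.Code);
            cmd.Parameters.AddWithValue("title", row.Title);
            cmd.ExecuteNonQuery();
        }

        if (++n % 1000 == 0) // commit per batch instead of per statement
        {
            tx.Commit();
            tx = conn.BeginTransaction();
        }
    }
    tx.Commit(); // commit the final partial batch
}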
The UNNEST function with arrays can be used along with the multirow VALUES syntax. I think this method is slower than using COPY, but it is useful to me when working with psycopg and Python (a Python list passed to cursor.execute becomes a pg ARRAY):
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
VALUES (
UNNEST(ARRAY[1, 2, 3]),
UNNEST(ARRAY[100, 200, 300]),
UNNEST(ARRAY['a', 'b', 'c'])
);
Without VALUES, using a subselect with an additional existence check:
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
SELECT * FROM (
SELECT UNNEST(ARRAY[1, 2, 3]),
UNNEST(ARRAY[100, 200, 300]),
UNNEST(ARRAY['a', 'b', 'c'])
) AS temptable
WHERE NOT EXISTS (
SELECT 1 FROM tablename tt
WHERE tt.fieldname1=temptable.fieldname1
);
The same syntax works for bulk updates:
UPDATE tablename
SET fieldname1=temptable.data
FROM (
SELECT UNNEST(ARRAY[1,2]) AS id,
UNNEST(ARRAY['a', 'b']) AS data
) AS temptable
WHERE tablename.id=temptable.id;
An external file is the best and most typical source of bulk data
"Bulk data" means "a lot of data", so it is natural to use the original raw data, with no need to transform it into SQL. Typical raw data file formats for "bulk insert" are CSV and JSON.
Bulk insert with some transformation
In ETL applications and ingestion processes, we need to change the data before inserting it. A temporary table consumes (a lot of) disk space, and it is not the fastest way to do it. The PostgreSQL foreign-data wrapper (FDW) is the best choice.
CSV example. Suppose a table tablename (x, y, z) in SQL and a CSV file like
fieldname1,fieldname2,fieldname3
etc,etc,etc
... million lines ...
You can use the classic SQL COPY to load the raw data (as-is) into tmp_tablename, then insert the filtered data into tablename... But, to avoid disk consumption, the best approach is to ingest it directly via the FDW:
INSERT INTO tablename (x, y, z)
SELECT f1(fieldname1), f2(fieldname2), f3(fieldname3) -- the transforms
FROM tmp_tablename_fdw
-- WHERE conditions
;
You need to prepare the database for the FDW, and instead of a static tmp_tablename_fdw you can use a function that generates it:
CREATE EXTENSION file_fdw;
CREATE SERVER import FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE tmp_tablename_fdw(
...
) SERVER import OPTIONS ( filename '/tmp/pg_io/file.csv', format 'csv');
JSON example. A set of two files, myRawData1.json and Ranger_Policies2.json, can be ingested by:
INSERT INTO tablename (fname, metadata, content)
SELECT fname, meta, j -- do any data transformation here
FROM jsonb_read_files('myRawData%.json')
-- WHERE any_condition_here
;
where the function jsonb_read_files() reads all files of a folder, defined by a mask:
CREATE or replace FUNCTION jsonb_read_files(
  p_flike text, p_fpath text DEFAULT '/tmp/pg_io/'
) RETURNS TABLE (fid int, fname text, fmeta jsonb, j jsonb) AS $f$
  WITH t AS (
    SELECT (row_number() OVER ())::int id,
           f AS fname,
           p_fpath ||'/'|| f AS f
    FROM pg_ls_dir(p_fpath) t(f)
    WHERE f LIKE p_flike
  )
  SELECT id, fname,
         to_jsonb( pg_stat_file(f) ) || jsonb_build_object('fpath', p_fpath),
         pg_read_file(f)::jsonb
  FROM t
$f$ LANGUAGE SQL IMMUTABLE;
Lack of gzip streaming
The most frequent method for "file ingestion" (mainly in Big Data) is to preserve the original file in gzip format and transfer it with a streaming algorithm - anything that can run fast and without disk consumption in Unix pipes:
gunzip -c remote_or_local_file.csv.gz | convert_to_sql | psql
So the ideal (future) solution would be a server-side option for the .csv.gz format.
Note, after @CharlieClark's comment: currently (2022) there is no such option; the best alternative seems to be pgloader reading from STDIN:
gunzip -c file.csv.gz | pgloader --type csv ... - pgsql:///target?foo
You can use COPY table TO ... WITH BINARY which is "somewhat faster than the text and CSV formats." Only do this if you have millions of rows to insert, and if you are comfortable with binary data.
Here is an example recipe in Python, using psycopg2 with binary input.
It mostly depends on the (other) activity in the database. Operations like this effectively freeze the entire database for other sessions. Another consideration is the data model and the presence of constraints, triggers, etc.
My first approach is always: create a (temp) table with a structure similar to the target table (create table tmp AS select * from target where 1=0), and start by reading the file into the temp table.
Then I check what can be checked: duplicates, keys that already exist in the target, etc.
Then I just do an insert into target select * from tmp or similar.
If this fails, or takes too long, I abort it and consider other methods (temporarily dropping indexes/constraints, etc.).
I just encountered this issue and would recommend csvsql (releases) for bulk imports to Postgres. To perform a bulk insert you'd simply createdb and then use csvsql, which connects to your database and creates individual tables for an entire folder of CSVs.
$ createdb test
$ csvsql --db postgresql:///test --insert examples/*.csv
I implemented a very fast PostgreSQL data loader using native libpq methods.
Try my package: https://www.nuget.org/packages/NpgsqlBulkCopy/
Maybe I'm late already, but there is a Java library called pgbulkinsert by Bytefish. My team and I were able to bulk insert 1 million records in 15 seconds. Of course, there were some other operations that we performed, like reading 1M+ records from a file sitting on MinIO, doing a couple of processing steps on top of those records, filtering out duplicates, and then finally inserting the 1M records into the Postgres database. And all these processes were completed within 15 seconds. I don't remember exactly how much time it took to do the DB operation, but I think it was less than 5 seconds. Find more details at https://www.bytefish.de/blog/pgbulkinsert_bulkprocessor.html
As others have noted, when importing data into Postgres, things will be slowed by the checks that Postgres is designed to do for you. Also, you often need to manipulate the data in one way or another so that it's suitable for use. Any of this that can be done outside of the Postgres process will mean that you can import using the COPY protocol.
For my use I regularly import data from the httparchive.org project using pgloader. As the source files are created by MySQL, you need to be able to handle some MySQL oddities, such as the use of \N for an empty value, along with encoding problems. The files are also so large that, at least on my machine, using an FDW runs out of memory. pgloader makes it easy to create a pipeline that lets you select the fields you want, cast them to the relevant data types and do any additional work before the data goes into your main database, so that index updates, etc. are minimal.
The query below creates a test table with a generate_series column holding 10,000 rows. (I usually create such a test table to test query performance; see generate_series() for details.)
CREATE TABLE test AS SELECT generate_series(1, 10000);
postgres=# SELECT count(*) FROM test;
count
-------
10000
(1 row)
postgres=# SELECT * FROM test;
generate_series
-----------------
1
2
3
4
5
6
-- More --
And run the query below to insert 10,000 rows if you already have the test table:
INSERT INTO test (generate_series) SELECT generate_series(1, 10000);

Does anyone know of a way to paginate a call to GetSchema from C#?

I'm using the ADO.NET provider function "GetSchema" to fetch metadata out of a SQL Server database (and an Informix system as well) and want to know if there is any way to paginate the results. I ask because one of the systems has over 3,000 tables (yes, three thousand) and twice that many views, and let's not even talk about the stored procedures.
Needless to say, trying to bring down that list in one shot is too much for the VM I have running (a mere 4GB of memory). I'm already aware of the restrictions that can be applied; these are all tables in the "dbo" schema, so there isn't much else that I'm aware of for limiting the result set before it gets to my client.
Instead of using GetSchema, I suggest using the more flexible INFORMATION_SCHEMA system views. These views already divide the information about the Tables, StoredProcedures and Views, and you can write a specific query to retrieve your data in a paginated way.
For example, to retrieve the first 100 table names you could write a query like this:
SELECT *
FROM ( SELECT ROW_NUMBER() OVER (ORDER BY TABLE_NAME) AS RowNum, *
       FROM INFORMATION_SCHEMA.TABLES
     ) AS TableWithRowNum
WHERE RowNum >= 1
  AND RowNum <= 100
ORDER BY RowNum
Subsequent pages can easily be retrieved by changing the min and max values used by the query.
The same approach can be applied to stored procedures (using INFORMATION_SCHEMA.ROUTINES WHERE ROUTINE_TYPE = 'PROCEDURE') or to views (using INFORMATION_SCHEMA.VIEWS).
Note, if you are using SQL Server 2012 or later, the first query can be rewritten to use this syntax:
SELECT *
FROM INFORMATION_SCHEMA.TABLES
ORDER BY TABLE_NAME
OFFSET 0 ROWS FETCH NEXT 100 ROWS ONLY
And the C# code could also use parameters for the offset (0) and row count (100) values, as in the sketch below.
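A hedged sketch of that C# side with the 2012+ syntax; the connection string, parameter names and page size are arbitrary choices for the example:

using System;
using System.Data;
using System.Data.SqlClient;

// Hedged sketch: fetch one page of table names at a time instead of the whole catalogue.
int pageIndex = 0;   // zero-based page number
int pageSize = 100;

const string sql =
    "SELECT TABLE_SCHEMA, TABLE_NAME " +
    "FROM INFORMATION_SCHEMA.TABLES " +
    "ORDER BY TABLE_NAME " +
    "OFFSET @Offset ROWS FETCH NEXT @PageSize ROWS ONLY";

using (var conn = new SqlConnection("Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI"))
using (var cmd = new SqlCommand(sql, conn))
{
    cmd.Parameters.Add("@Offset", SqlDbType.Int).Value = pageIndex * pageSize;
    cmd.Parameters.Add("@PageSize", SqlDbType.Int).Value = pageSize;
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            Console.WriteLine("{0}.{1}", reader.GetString(0), reader.GetString(1));
        }
    }
}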

One big query or lots of smaller queries

I need a bit of advice when it comes to making efficient use of database resource.
At the moment, I'm building an ordering system that takes an uploaded file, and runs through that file adding each line to an order.
At the same time this is done, the app checks that the product code requested is available to sell to that customer.
Given that the file can contain upwards of 200 lines (and thus that many requests to the database to check), I'm eager to know whether it's more efficient to make a single request to the database for all the available product codes and then run the check against that list, even though there will be roughly 2,000 codes in that list.
So, either 200 sequential one-result requests, or a single 2,000-result request.
The site will be handling about 130 uploads within a 4-5 hour period, and must traverse a VPN from Azure to our database server.
This looks like another case of Premature Optimization (tam tam taaaaam).
You don't know that you have a problem, and yet you're trying to solve it. The first thing you should do is check whether there's a real performance problem here. My guess is - there isn't. You're going to read 2000 records and write 200 records once every few minutes. That's really not something to worry about.
But don't take my word for it, try it out. See how long it takes you to load those 2000 records and write those 200 records. If there's a problem, try to optimize.
By the way, optimizing this by breaking the request into 200 smaller requests is unlikely to work. Let's cross this bridge when you get there.
It will definitely be more efficient to make a single query that gets 2000 rows than to make 200 queries that each get a single row. For the single-row queries the actual data would be a minor part of the traffic; it would be mostly overhead.
Another alternative would be to put that check in the query that adds the line to the order, that way you don't need a separate query to check the product first. If the product can't be sold to the customer the query would simply not insert any record, and it can return the number of records added so that the calling code can determine if the line was added or not.
Example:
create procedure AddOrderLine
  @OrderId int,
  @ProductId int,
  @Quantity int
as
set nocount on

insert into OrderLines (OrderId, ProductId, Quantity)
select
  o.OrderId,
  @ProductId,
  @Quantity
from
  Orders o
  inner join AllowedProducts a on a.CustomerId = o.CustomerId and a.ProductId = @ProductId
where
  o.OrderId = @OrderId

return @@rowcount
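On the C# side, the procedure's return value tells you whether the line was added. A hedged sketch (connection string and the sample parameter values are placeholders):

using System.Data;
using System.Data.SqlClient;

// Hedged sketch: call AddOrderLine and read @@rowcount back through the return value.
int orderId = 1, productId = 42, quantity = 3; // sample values

using (var conn = new SqlConnection("Data Source=.;Initial Catalog=Orders;Integrated Security=SSPI"))
using (var cmd = new SqlCommand("dbo.AddOrderLine", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.Add("@OrderId", SqlDbType.Int).Value = orderId;
    cmd.Parameters.Add("@ProductId", SqlDbType.Int).Value = productId;
    cmd.Parameters.Add("@Quantity", SqlDbType.Int).Value = quantity;

    SqlParameter returnValue = cmd.Parameters.Add("@ReturnValue", SqlDbType.Int);
    returnValue.Direction = ParameterDirection.ReturnValue;

    conn.Open();
    cmd.ExecuteNonQuery();

    bool lineAdded = (int)returnValue.Value > 0; // 0 means the product is not allowed for this customer
}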

Best way to check if multiple records exist in database

I am creating an application that takes data from a text file which has sales data from the Amazon marketplace. The marketplace has items with different names compared to the data in our main database. The application accepts the text file as input and needs to check if each item exists in our database. If not present, I should offer an option to save the item to the Master table, or to the Sub item table and map it to a master item. My question is: if the text file has 100+ items, should I hit the database each time to check if the data exists there? Is there any better way of doing that so that we can minimize the database hits?
I have two options that I have used earlier:
Hit the database and check if it exists in the table.
Fill the data into a DataTable and use DataTable.Select to check if it exists.
Can someone tell me the best way to do this? I have to check two tables (master table, sub item table), maybe one at a time. Thanks.
Update:
@Downvoters, add a comment.
I am not asking what the way is to check if an item exists in a database. I just want to know the best way of doing that. Should I be hitting the database 1000 times if a file has 1000 items? That's my question.
The current query I use:
if exists (select * from [table] where itemname = [itemname])
    select 'True'
else
    select 'False'
return
(From Chat)
I would create a stored procedure which takes a table-valued parameter containing all the items that you want to check. You can then use a join (a couple of options here*) to return a result set of items and whether each one exists or not. You can use TVPs from ADO.NET like this.
It will certainly handle the 100 to 1000 row range mentioned in your post. To be honest, I haven't used it in the 1M+ range.
In newer versions of SQL Server, I would prefer TVPs over using an XML input parameter, as it is really quite cumbersome to pack the XML in your .NET code and then unpack it again in your sproc.
(*) Re joins: with the result set, you can either just inner join the TVP to your items/product table and check in .NET if a row doesn't exist, or you can do a left outer join with the TVP as the left table and, e.g., ISNULL() missing items to 0/'false', etc.
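A hedged sketch of the ADO.NET side of that approach; the table type dbo.ItemNameList, the procedure dbo.CheckItemsExist and the sample item names are hypothetical and would be things you define yourself:

using System.Data;
using System.Data.SqlClient;

// Hedged sketch: pass all item names from the file in one table-valued parameter.
// Assumes: CREATE TYPE dbo.ItemNameList AS TABLE (ItemName nvarchar(200));
// and a proc dbo.CheckItemsExist that joins the TVP against the master/sub-item tables.
string[] itemNamesFromFile = { "Item A", "Item B" }; // items parsed from the Amazon text file

var itemTable = new DataTable();
itemTable.Columns.Add("ItemName", typeof(string));
foreach (string name in itemNamesFromFile)
{
    itemTable.Rows.Add(name);
}

using (var conn = new SqlConnection("Data Source=.;Initial Catalog=Sales;Integrated Security=SSPI"))
using (var cmd = new SqlCommand("dbo.CheckItemsExist", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    SqlParameter p = cmd.Parameters.AddWithValue("@Items", itemTable);
    p.SqlDbType = SqlDbType.Structured;
    p.TypeName = "dbo.ItemNameList";

    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            string itemName = reader.GetString(0);
            bool exists = reader.GetBoolean(1); // e.g. a bit column produced by the join in the proc
        }
    }
}

One round trip covers the whole file, whether it has 100 items or 1000.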
Send it as batches of 100 items to the database. A stored procedure will probably help, since repetitive queries have to be fired. If the data does not change frequently, you can consider caching. I assume you will be making service calls from your .NET application, so ingest an XML from the back end, in batches. Consider increasing the batch size based on the file size.
If your entire application is local, the batch size may be very high, as there is no network overhead, but still don't make 100 calls to the database.
Try it like this:
SELECT EXISTS(SELECT * FROM table1 WHERE itemname= [itemname])
SELECT EXISTS(SELECT 1 FROM table1 WHERE itemname= [itemname])

Efficient insert statement

I'm looking for an efficient way of inserting records into SQL server for my C#/MVC application. Anyone know what the best method would be?
Normally I've just done a while loop with an insert statement inside, but then again I've not had quite so many records to deal with. I need to insert around half a million, and at 300 rows a minute with the while loop, I'll be here all day!
What I'm doing is looping through a large holding table and using its rows to create records in a different table. I've set up some functions for lookup data which is necessary for the new table, and this is no doubt adding to the drain.
So here is the query I have. Extremely inefficient for large amounts of data!
Declare @HoldingID int
Set @HoldingID = (Select min(HoldingID) From Holding)

While @HoldingID IS NOT NULL
Begin
    Insert Into Journeys (DepartureID, ArrivalID, ProviderID, JourneyNumber, Active)
    Select
        dbo.GetHubIDFromName(StartHubName),
        dbo.GetHubIDFromName(EndHubName),
        dbo.GetBusIDFromName(CompanyName),
        JourneyNo, 1
    From Holding
    Where HoldingID = @HoldingID

    Set @HoldingID = (Select MIN(HoldingID) From Holding Where HoldingID > @HoldingID)
End
I've heard about set-based approaches - is there anything that might work for the above problem?
If you want to insert a lot of data into MS SQL Server then you should use BULK INSERT - there is a command line tool for this (the bcp utility), and also a C# wrapper for performing bulk copy operations, but under the covers they all use BULK INSERT.
Depending on your application you may want to insert your data into a staging table first, and then either MERGE or INSERT INTO SELECT... to transfer those rows from the staging table(s) to the target table(s) - if you have a lot of data then this will take some time, however will be a lot quicker than performing the inserts individually.
If you want to speed this up then there are various things that you can do, such as changing the recovery model or tweaking/removing triggers and indexes (depending on whether or not this is a live database). If it's still really slow then you should look into doing this process in batches (e.g. 1000 rows at a time).
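The C# wrapper mentioned above is the SqlBulkCopy class. A hedged sketch of loading a staging table with it; the staging table dbo.HoldingStaging, the column names and the sample row are assumptions for this example:

using System.Data;
using System.Data.SqlClient;

// Hedged sketch: bulk copy the holding data into a staging table,
// then MERGE / INSERT...SELECT from there into Journeys.
DataTable sourceTable = new DataTable();
sourceTable.Columns.Add("StartHubName", typeof(string));
sourceTable.Columns.Add("EndHubName", typeof(string));
sourceTable.Columns.Add("CompanyName", typeof(string));
sourceTable.Columns.Add("JourneyNo", typeof(string));
sourceTable.Rows.Add("Hub A", "Hub B", "Acme Buses", "J100"); // sample row

using (var bulkCopy = new SqlBulkCopy("Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI"))
{
    bulkCopy.DestinationTableName = "dbo.HoldingStaging";
    bulkCopy.BatchSize = 5000;      // send rows in batches
    bulkCopy.BulkCopyTimeout = 0;   // no timeout for a long-running load

    bulkCopy.ColumnMappings.Add("StartHubName", "StartHubName");
    bulkCopy.ColumnMappings.Add("EndHubName", "EndHubName");
    bulkCopy.ColumnMappings.Add("CompanyName", "CompanyName");
    bulkCopy.ColumnMappings.Add("JourneyNo", "JourneyNo");

    bulkCopy.WriteToServer(sourceTable);
}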
This should be exactly what you are doing now.
Insert Into Journeys (DepartureID, ArrivalID, ProviderID, JourneyNumber, Active)
Select
    dbo.GetHubIDFromName(StartHubName),
    dbo.GetHubIDFromName(EndHubName),
    dbo.GetBusIDFromName(CompanyName),
    JourneyNo, 1
From Holding
ORDER BY HoldingID ASC
You can (probably) do it in one statement of the form:
INSERT INTO JOURNEYS
SELECT * FROM HOLDING;
Without more information about your schema it is difficult to be absolutely sure.
SQL Server 2008 introduced table-valued parameters. These allow you to insert multiple rows in a single trip to the database (sent as one large blob), without using a temporary table. This article describes how it works (step four in the article):
http://www.altdevblogaday.com/2012/05/16/sql-server-high-performance-inserts/
It differs from bulk inserts in that you do not need special utilities and that all constraints and foreign keys are checked.
I quadrupled my throughput using this and parallelizing the inserts. Now at 15,000 inserts/second sustained into the same table - a regular table with indexes and over a billion rows.
