I am using SqlDataReader for data migration. How can I increase the number of records inserted into the destination at a time?
I want to increase the number of records inserted into the destination at a time.
Then this is unrelated to SqlDataReader, and you'll need to look at whatever tool you're using for the insert. If you're using SqlBulkCopy, then this should be as simple as changing the .BatchSize. If you're using other mechanisms, you'll have to be specific. For example, if you're using an SP to do the inserts that only handles one row at a time, one option might be to use MARS and overlapping async operations; I have a utility method that I use for this type of sequential fixed-depth overlapping (which is very different to what Parallel.ForEach would do, for example, even with a fixed max-DOP). If you're using an insert that works via TDS-based table-parameters, then: just buffer that much data locally before calling the operation. If you're using an ORM such as EF: refer to the ORM's insert documentation.
But: to emphasize: the one thing that doesn't get a vote on this is: the data-reader.
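For mechanisms that accept a set of rows per call, the "buffer that much data locally" idea could be sketched roughly as below. `ReaderBatcher`, `ReadInBatches` and the batch size are hypothetical names chosen for illustration, not part of any library. The reader only supplies rows; the batch size is decided entirely by the consuming code.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class ReaderBatcher
{
    // Hypothetical helper: pulls rows from any IDataReader and yields them
    // in lists of at most batchSize rows. Each row is copied into its own
    // object[] so it stays valid after the reader moves on.
    public static IEnumerable<List<object[]>> ReadInBatches(IDataReader reader, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batch = new List<object[]>(batchSize);
        while (reader.Read())
        {
            var row = new object[reader.FieldCount];
            reader.GetValues(row);
            batch.Add(row);

            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<object[]>(batchSize);
            }
        }

        // Hand over the final, possibly smaller, batch.
        if (batch.Count > 0) yield return batch;
    }
}
```

Each yielded list can then be handed to whatever performs the insert. With SqlBulkCopy none of this is needed, since BatchSize already controls how many rows go to the server at a time.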
Related
I am working on a C# application, which loads data from a MS SQL 2008 or 2008 R2 database. The table looks something like this:
ID | binary_data | Timestamp
I need to get only the last entry and only the binary data. Entries are added to this table irregularly by another program, so I have no way of knowing whether there is a new entry.
Which version is better (performance etc.) and why?
//Always a query, which might not be needed
public void ProcessData()
{
byte[] data = "query code get latest binary data from db"
}
vs
//Always a smaller check-query, and sometimes two queries
public void ProcessData()
{
DateTime timestamp = "query code get latest timestamp from db"
if(timestamp > old_timestamp)
data = "query code get latest binary data from db"
}
The binary_data field size will be around 30kB. The function "ProcessData" will be called several times per minute, but can sometimes be called every 1-2 seconds. This is only a small part of a bigger program with lots of threading/database access, so I want the "lightest" solution. Thanks.
Luckily, you can have both:
SELECT TOP 1 binary_data
FROM myTable
WHERE Timestamp > @last_timestamp
ORDER BY Timestamp DESC
If there is no record newer than @last_timestamp, no row is returned and, thus, no binary data is transmitted (= fast). If there are new records, the binary data of the newest one is returned immediately (= no need for a second query).
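Called from C#, the single-query approach might look like the sketch below. It is written against the provider-neutral `System.Data` interfaces and assumes the Timestamp column is a datetime; `LatestDataPoller`, `FetchIfNewer` and `ReadFirstBlob` are hypothetical names, and the table and column names are the ones used in the query above.

```csharp
using System;
using System.Data;

public static class LatestDataPoller
{
    // Table and column names as in the query above; adjust to the real schema.
    private const string Sql =
        "SELECT TOP 1 binary_data FROM myTable " +
        "WHERE Timestamp > @last_timestamp ORDER BY Timestamp DESC";

    // Returns the newest binary_data, or null when nothing is newer than lastTimestamp.
    // The connection is assumed to be open already.
    public static byte[] FetchIfNewer(IDbConnection connection, DateTime lastTimestamp)
    {
        using (IDbCommand command = connection.CreateCommand())
        {
            command.CommandText = Sql;

            IDbDataParameter parameter = command.CreateParameter();
            parameter.ParameterName = "@last_timestamp";
            parameter.DbType = DbType.DateTime;
            parameter.Value = lastTimestamp;
            command.Parameters.Add(parameter);

            using (IDataReader reader = command.ExecuteReader())
            {
                return ReadFirstBlob(reader);
            }
        }
    }

    // Reads column 0 of the first row; null if there is no row or the value is NULL.
    public static byte[] ReadFirstBlob(IDataReader reader)
    {
        if (!reader.Read() || reader.IsDBNull(0)) return null;
        return (byte[])reader.GetValue(0);
    }
}
```

Since the query returns only binary_data, the caller still needs some way to advance its stored timestamp; selecting the Timestamp column alongside the data would be one option.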
I would suggest you perform tests using both methods as the answer would depend on your usages. Simulate some expected behaviour.
I would say, though, that you are probably okay to just do the first query. Do what works. Don't optimise prematurely; if the single query is too slow, try your second, two-query approach.
A two-step approach is more efficient in terms of the overall system workload:
Get informed that you need to query new data
Query new data
There are several ways to implement this approach. Here are a pair of them.
Using Query Notifications, a built-in SQL Server feature that is supported in .NET.
Using an implicit way of being informed of table updates, e.g. the one described in this article on the SQL Authority blog.
I think the better path is a stored procedure that keeps the logic inside the database: something with an output parameter carrying the required data and a return value such as TRUE/FALSE to signal the presence of new data.
I've got some text data that I'm loading into a SQL Server 2005 database using Linq-to-SQL, with this method (pseudo-code):
Create a DataContext
While (new data exists)
{
Read a record from the text file
Create a new Record
Populate the record
dataContext.InsertOnSubmit(record);
}
dataContext.SubmitChanges();
The code is a little C# console application. This works fine so far, but I'm about to do an import of the real data (rather than a test subset) and this contains about 2 million rows instead of the 1000 I've tested. Am I going to have to do some clever batching or something similar to avoid the code falling over or performing woefully, or should Linq-to-SQL handle this gracefully?
It looks like this would work; however, the changes (and thus the memory) kept by the DataContext are going to grow with each InsertOnSubmit. Maybe it's advisable to perform a SubmitChanges every 100 records?
I would also take a look at SqlBulkCopy to see whether it fits your use case better.
If you need to do bulk inserts, you should check out SqlBulkCopy.
Linq-to-SQL is not really suited for doing large-scale bulk inserts.
You would want to call SubmitChanges() every 1000 records or so to flush the changes so far; otherwise you'll run out of memory.
If you want performance, you might want to bypass Linq-To-SQL and go for System.Data.SqlClient.SqlBulkCopy instead.
Just for the record I did as marc_s and Peter suggested and chunked the data. It's not especially fast (it took about an hour and a half as Debug configuration, with the debugger attached and quite a lot of console progress output), but it's perfectly adequate for our needs:
Create a DataContext
numRows = 0;
While (new data exists)
{
Read a record from the text file
Create a new Record
Populate the record
dataContext.InsertOnSubmit(record)
// Submit the changes in thousand row batches
if (numRows % 1000 == 999)
dataContext.SubmitChanges()
numRows++
}
dataContext.SubmitChanges()
I have an SP that I want to execute, saving the whole result aside (in a class field).
Later on I want to acquire the values of some columns for some rows from this result.
What return types are possible? Which one is the most suitable for my goal?
I know there are DataSet, DataReader and ResultSet. What else?
What is the main difference between them?
If you want to store the results and use them later (as you have written), you may use the heavy data sets or fill the lightweight lists with custom container types via the data reader.
Or in case you want to consume the results immediately, go on with the data reader.
Result set is the old VB6 class AFAIK or the current Java interface.
The traditional way to get data is by using the classes in the System.Data.SqlClient namespace. You can use the DataReader, which is a read-only, forward-only type of cursor, fast and efficient when you just want to read a recordset. DataReader is bindable, but you read it one record at a time and therefore don't have the option of going back, for instance. If the recordset is very big, the reader is also a good choice because it keeps just one record at a time in memory.
You can use the DataAdapter to get a DataSet, and then you have complete control of all the data within the DataSet class. It is heavier on the system but very powerful when you need to work with the data in your application. You can also use DataSet if the query returns more than one recordset.
So it really depends on what you need to do with the data after getting it from the database. If you just need to read it into something else, use DataReader; otherwise, use DataSet.
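For the "store now, look up later" case from the question, a middle ground is to load the reader's rows into a single DataTable held in a field. The following is only a sketch under that assumption; `ResultCache`, `Fill` and `Get` are hypothetical names, and the column names depend on what the stored procedure actually returns.

```csharp
using System;
using System.Data;

public class ResultCache
{
    // The stored-procedure result, kept in a field for later lookups.
    private readonly DataTable _rows = new DataTable();

    // Copies every row from the reader into the in-memory table.
    public void Fill(IDataReader reader)
    {
        _rows.Load(reader);
    }

    public int RowCount
    {
        get { return _rows.Rows.Count; }
    }

    // Returns one column value of one row. The cast fails for DBNull,
    // so nullable columns need an explicit check first.
    public T Get<T>(int rowIndex, string columnName)
    {
        return (T)_rows.Rows[rowIndex][columnName];
    }
}
```

After `cache.Fill(command.ExecuteReader())`, the connection can be closed and values read at any later point, for example `cache.Get<string>(1, "Name")`.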
I have a query that always returns exactly one row, and I want to convert this row to a class object (let's say obi).
I have a feeling that using a DataTable for this kind of query is too much,
but I don't really know which other data object to use.
A data reader?
Is there a way to execute a SQL command into a DataRow?
DataReader is the best choice here - DataAdapters and DataSets may be overkill for a single row, although, that said, if performance is not critical then keeping-it-simple isn't a bad thing. You don't need to go from DataReader -> DataRow -> your object, just read the values off of the DataReader and you're done.
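Reading the values straight off the reader could look like the sketch below. `Customer`, `CustomerMapper` and the column names `Id` and `Name` are hypothetical stand-ins for the real class and query.

```csharp
using System;
using System.Data;

// Hypothetical target type; the column names below are assumptions.
public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class CustomerMapper
{
    // Reads at most one row and maps it; returns null when the query produced no row.
    public static Customer ReadSingle(IDataReader reader)
    {
        if (!reader.Read()) return null;

        int idOrdinal = reader.GetOrdinal("Id");
        int nameOrdinal = reader.GetOrdinal("Name");

        return new Customer
        {
            Id = reader.GetInt32(idOrdinal),
            // Map a database NULL to a null string instead of throwing.
            Name = reader.IsDBNull(nameOrdinal) ? null : reader.GetString(nameOrdinal)
        };
    }
}
```

A typical call would wrap it in `using (var reader = command.ExecuteReader(CommandBehavior.SingleRow))`, which tells the provider that only one row is expected.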
A datareader lets you query individual fields. If you want the row as a single object, I believe the DataTable/DataRowView family of objects is in fact the way to go.
You might seriously consider taking a look at Linq-to-Sql or Linq-to-Entities.
The appeal of these frameworks is they provide automatic serialization of your database data into objects, abstract away many of the mundane details of connection management, and have better compile-time support by providing strongly-typed properties which you can use without string keys or column ordinals.
When using Linq, the difference between retrieving a single row vs. retrieving multiple rows often only involves appending .Single() or .First() to your query.
At any rate, if you already use or are willing to learn one of these frameworks, you may see the bulk and difficulty of data access code reduce substantially.
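The difference between .Single() and .First() can be illustrated with LINQ to Objects on an in-memory list, where the operators have the same meaning. This is only a sketch: `Person` and `SingleVsFirst` are hypothetical, and with Linq-to-SQL the source would be a table on the DataContext rather than a list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity standing in for a mapped table row.
public class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class SingleVsFirst
{
    // Single() throws InvalidOperationException unless exactly one element matches.
    public static Person ById(IEnumerable<Person> people, int id)
    {
        return people.Where(p => p.Id == id).Single();
    }

    // First() returns the first element and ignores the rest;
    // it throws InvalidOperationException if the sequence is empty.
    public static Person FirstByName(IEnumerable<Person> people)
    {
        return people.OrderBy(p => p.Name).First();
    }
}
```

Single() is the natural fit for a query that must return exactly one row, because a second matching row surfaces as an exception rather than being silently dropped.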
With respect to DataReader vs. DataSet/DataTable, it is correct that it takes more cycles to allocate and populate a data table; however, I highly doubt you will notice the difference unless you are making an extremely high volume of database calls.
In case it is helpful, here are documentation examples of data access using data readers and data sets.
DataReader
DataSet
Nightly, I need to fill a SQL Server 2005 table from an ODBC source with over 8 million records. Currently I am using an INSERT statement that selects from a linked server, with syntax similar to this:
INSERT INTO SQLStagingTable SELECT * FROM OPENQUERY(ODBCSource, 'SELECT * FROM SourceTable')
This is really inefficient and takes hours to run. I'm in the middle of coding a solution using SqlBulkCopy, with code similar to the code found in this question.
The code in that question first populates a DataTable in memory and then passes that DataTable to SqlBulkCopy's WriteToServer method.
What should I do if the populated datatable uses more memory than is available on the machine it is running (a server with 16GB of memory in my case)?
I've thought about using the overloaded OdbcDataAdapter Fill method, which allows you to fill only the records from x to n (where x is the start index and n is the number of records to fill). However, that could turn out to be an even slower solution than what I currently have, since it would mean re-running the select statement on the source a number of times.
What should I do? Just populate the whole thing at once and let the OS manage the memory? Should I populate it in chunks? Is there another solution I haven't thought of?
The easiest way would be to use ExecuteReader() against your ODBC data source and pass the IDataReader to the WriteToServer(IDataReader) overload.
Most data reader implementations will only keep a very small portion of the total results in memory.
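A minimal sketch of that streaming approach follows. It assumes .NET Framework (or the System.Data.SqlClient and System.Data.Odbc packages on newer runtimes); `StagingLoader`, the connection strings and the batch size of 5000 are illustrative assumptions, and the table names are the ones from the question.

```csharp
using System.Data;
using System.Data.Odbc;
using System.Data.SqlClient;

public static class StagingLoader
{
    // Builds a SqlBulkCopy configured for the staging table.
    public static SqlBulkCopy CreateBulkCopy(string sqlConnectionString, string destinationTable, int batchSize)
    {
        return new SqlBulkCopy(sqlConnectionString)
        {
            DestinationTableName = destinationTable,
            BatchSize = batchSize,   // rows sent to the server per batch
            BulkCopyTimeout = 0      // 0 = no time limit; the default is 30 seconds
        };
    }

    // Streams rows from the ODBC source to SQL Server without
    // materialising them in a DataTable first.
    public static void Copy(string odbcConnectionString, string sqlConnectionString)
    {
        using (var source = new OdbcConnection(odbcConnectionString))
        using (var command = new OdbcCommand("SELECT * FROM SourceTable", source))
        {
            source.Open();
            using (IDataReader reader = command.ExecuteReader())
            using (SqlBulkCopy bulkCopy = CreateBulkCopy(sqlConnectionString, "SQLStagingTable", 5000))
            {
                bulkCopy.WriteToServer(reader);
            }
        }
    }
}
```

Memory use then depends on the reader's buffering and the batch size rather than on the total row count, so the 8 million rows never need to fit in RAM at once.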
SSIS performs well and is very tweakable. In my experience 8 million rows is not out of its league. One of my larger ETLs pulls in 24 million rows a day and does major conversions and dimensional data warehouse manipulations.
If you have indexes on the destination table, you might consider disabling them until the records have been inserted.