I'm trying to write a program to convert a large amount of data from a legacy SQL Express system to a newer system based on SQL CE. Here's a quick snapshot of what's going on:
Most of the tables in the SQL Express install are small (< 10K records)
One table is extremely large, and is well over 1 million records
For the smaller tables I can use LINQ just fine -- but the large table gives me problems. The standard way of:
foreach (var dataRow in ...)
{
    table.InsertOnSubmit(dataRow);
}
database.SubmitChanges();
is painfully slow and takes several hours to complete. I've even tried doing some simple "bulk" operations to try to avoid one giant insertion at the end of the loop, i.e.:
foreach (var dataRow in ...)
{
    if (count == BULK_LIMIT)
    {
        count = 0;
        database.SubmitChanges();
    }

    count++;
    table.InsertOnSubmit(dataRow);
}
// Final submit, to catch the last BULK_LIMIT item block
database.SubmitChanges();
I've tried a variety of bulk sizes, from relatively small values like 1K-5K to larger sizes up to 300K.
Ultimately I'm stuck and the process takes roughly the same amount of time (several hours) regardless of the bulk size.
So - does anyone know of a way to crank up the speed? The typical solution would be to use SqlBulkCopy, but that isn't compatible with SQL CE.
A couple of notes:
Yes, I really do want all the records in SQL CE, and yes, I've set up the connection to allow the database to max out at 4 GB.
Yes, I really do need every last one of the 1M+ records.
The data in each row is all primitive types, a mix of strings and timestamps.
The size of the legacy SQL Express database is ~400 MB.
Thanks in advance - all help is appreciated!
-- Dan
Use a parameterised INSERT statement: Prepare a command, set the parameter values in a loop and reuse the same command for each INSERT.
Remove any indexes and re-apply after you have performed all INSERTs.
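For example, a minimal sketch of the reuse-one-command idea against SQL CE (the table name, column names and the sourceRows sequence below are placeholders, not from the question):

// Requires System.Data and System.Data.SqlServerCe.
using (var conn = new SqlCeConnection(ceConnectionString))
using (var cmd = new SqlCeCommand(
    "INSERT INTO LargeTable (SomeText, SomeStamp) VALUES (@text, @stamp)", conn))
{
    conn.Open();

    // Declare the parameters once and prepare the command once...
    cmd.Parameters.Add("@text", SqlDbType.NVarChar, 200);
    cmd.Parameters.Add("@stamp", SqlDbType.DateTime);
    cmd.Prepare();

    // ...then just swap the values in for every row.
    foreach (var dataRow in sourceRows)
    {
        cmd.Parameters["@text"].Value = dataRow.SomeText;
        cmd.Parameters["@stamp"].Value = dataRow.SomeStamp;
        cmd.ExecuteNonQuery();
    }
}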
Update: Chris Tacke has the fastest solution here using SqlCeResultSet: Bulk Insert In SQLCE
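For reference, the SqlCeResultSet / TableDirect approach from that link looks roughly like this (the table name and column ordinals are placeholders):

// Requires System.Data and System.Data.SqlServerCe.
using (var conn = new SqlCeConnection(ceConnectionString))
using (var cmd = conn.CreateCommand())
{
    conn.Open();
    cmd.CommandType = CommandType.TableDirect;   // open the table directly, no SQL parsing
    cmd.CommandText = "LargeTable";

    using (SqlCeResultSet rs = cmd.ExecuteResultSet(ResultSetOptions.Updatable))
    {
        foreach (var dataRow in sourceRows)
        {
            SqlCeUpdatableRecord record = rs.CreateRecord();
            record.SetString(0, dataRow.SomeText);      // ordinal 0: a string column
            record.SetDateTime(1, dataRow.SomeStamp);   // ordinal 1: a timestamp column
            rs.Insert(record);
        }
    }
}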
I'm developing an ASP.NET app that analyzes Excel files uploaded by user. The files contain various data about customers (one row = one customer), the key field is CustomerCode. Basically the data comes in form of DataTable object.
At some point I need to get information about the specified customers from SQL and compare it to what user uploaded. I'm doing it the following way:
Make a comma-separated list of customers from CustomerCode column: 'Customer1','Customer2',...'CustomerN'.
Pass this string to SQL query IN (...) clause and execute it.
This was working okay until I ran into a "The query processor ran out of internal resources and could not produce a query plan" exception when trying to pass ~40000 items inside the IN (...) clause.
The trivial workaround seems to be:
Replace IN (...) with = 'SomeCustomerCode' in query template.
Execute this query 40000 times for each CustomerCode.
Do DataTable.Merge 40000 times.
Is there any better way to work around this problem?
Note: I can't do IN (SELECT CustomerCode FROM ... WHERE SomeConditions) because the data comes from Excel files and thus cannot be queried from DB.
"Table valued parameters" would be worth investigating, which let you pass in (usually via a DataTable on the C# side) multiple rows - the downside is that you need to formally declare and name the data shape on the SQL server first.
Alternatively, though: you could use SqlBulkCopy to throw the rows into a staging table, and then just JOIN to that table. If you have parallel callers, you will need some kind of session identifier on the row to distinguish between concurrent uses (and: don't forget to remove your session's data afterwards).
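A minimal sketch of that, assuming a staging table dbo.CustomerCodeStaging with SessionId and CustomerCode columns (again, every name here is a placeholder):

var sessionId = Guid.NewGuid();               // distinguishes this caller's rows

var staging = new DataTable();
staging.Columns.Add("SessionId", typeof(Guid));
staging.Columns.Add("CustomerCode", typeof(string));
foreach (DataRow row in uploadedTable.Rows)
    staging.Rows.Add(sessionId, row["CustomerCode"]);

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    // Throw the uploaded codes into the staging table in one shot
    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.CustomerCodeStaging" })
    {
        bulk.ColumnMappings.Add("SessionId", "SessionId");
        bulk.ColumnMappings.Add("CustomerCode", "CustomerCode");
        bulk.WriteToServer(staging);
    }

    // JOIN to the staging rows instead of building a huge IN (...) list
    using (var cmd = new SqlCommand(
        "SELECT c.* FROM dbo.Customers c " +
        "JOIN dbo.CustomerCodeStaging s ON s.CustomerCode = c.CustomerCode " +
        "WHERE s.SessionId = @session", conn))
    {
        cmd.Parameters.AddWithValue("@session", sessionId);
        using (var reader = cmd.ExecuteReader())
        {
            // compare the returned customers with the uploaded rows here
        }
    }

    // Remove this session's staging data afterwards
    using (var cleanup = new SqlCommand(
        "DELETE FROM dbo.CustomerCodeStaging WHERE SessionId = @session", conn))
    {
        cleanup.Parameters.AddWithValue("@session", sessionId);
        cleanup.ExecuteNonQuery();
    }
}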
You shouldn't process too many records at once: besides the error you mentioned, such a big batch takes too long to run and you can't do anything in parallel. You shouldn't process only one record at a time either, because then the overhead of the SQL Server round trips becomes too big. Choose something in the middle and process, e.g., 10000 records at a time. You can even parallelize the processing: start running the SQL for the next 10000 in the background while you are processing the previous batch of 10000.
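Something along these lines, where LoadCustomers stands in for whatever method currently runs the IN (...) query for a list of codes (it is a hypothetical helper, not an existing API):

// Requires System.Data, System.Data.DataSetExtensions and System.Linq.
const int BatchSize = 10000;
var allCodes = uploadedTable.AsEnumerable()
                            .Select(r => r.Field<string>("CustomerCode"))
                            .ToList();

var merged = new DataTable();
for (int i = 0; i < allCodes.Count; i += BatchSize)
{
    var batch = allCodes.Skip(i).Take(BatchSize).ToList();
    DataTable fromDb = LoadCustomers(batch);   // runs the IN (...) query for this batch only
    merged.Merge(fromDb);                      // one merge per batch instead of 40000 single merges
}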
I have a data acquisition system that reads values from some industrial devices and records the values in a Microsoft SQL Server 2008 R2 database. The record interval is approximately 20 seconds, and each record contains approximately 600 bytes of data.
Now I need to insert data from a new piece of hardware, but this time the record interval has to be 1 second. In other words, I insert one 600-byte record into the SQL Server database every second.
I have two questions:
Is there any possible problem that I may run into while inserting data every second? I think Microsoft SQL Server is quite OK for this frequency of insertion, but I am not sure about the long term.
The program is a long-running application. I clear the data table approximately every week. When I record data every second I will have 3600 rows in the table every hour, 86400 rows every day, and approximately 600K rows at the end of the week. Is this OK for reading the data back at a reasonable speed, or should I change my approach so that I don't end up with that many rows in the table?
By the way, I use LINQ to SQL for all my database operations and C# for programming.
Is there any possible problem that I may run into while inserting data every second? I think Microsoft SQL Server is quite OK for this frequency of insertion, but I am not sure about the long term.
If the database is properly designed then you should not run into any problems. We save GIS data at a much greater rate without any issues.
Is this OK for reading the data back at a reasonable speed, or should I change my approach so that I don't end up with that many rows in the table?
It depends. If you need all the data, then how can you change the approach? And if you don't need it, why do you save it?
First of all, you must think about the existing indexes on the tables you insert into, because indexes slow down the insert process. Second, if you have the FULL recovery model, then every insert will be written to the transaction log, and your log file will grow rapidly.
Think about changing your recovery model to SIMPLE and disabling your indexes.
Of course, selecting rows from that table will then be slower, but I don't know what your query requirements are.
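A rough sketch of the two changes, with placeholder database, table and index names (only disable non-clustered indexes; disabling the clustered index makes the whole table inaccessible):

// Requires System.Data.SqlClient.
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (var cmd = conn.CreateCommand())
    {
        // Switch to the SIMPLE recovery model so inserts don't grow the log as fast
        cmd.CommandText = "ALTER DATABASE MyDatabase SET RECOVERY SIMPLE;";
        cmd.ExecuteNonQuery();

        // Disable a non-clustered index before a heavy insert window...
        cmd.CommandText = "ALTER INDEX IX_Measurements_Time ON dbo.Measurements DISABLE;";
        cmd.ExecuteNonQuery();

        // ... perform the inserts here ...

        // ...and rebuild it afterwards so reads are fast again
        cmd.CommandText = "ALTER INDEX IX_Measurements_Time ON dbo.Measurements REBUILD;";
        cmd.ExecuteNonQuery();
    }
}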
Based on my thesis experience in college, if your system is fully stable and doesn't crash, overflow, etc., you can buffer the records in a DataTable and use SqlBulkCopy to avoid an I/O operation per record.
This is sample code for bulk copying a DataTable; the method should be called every hour:
private void SaveNewData()
{
    // Close the command's connection so it does not conflict with the bulk copy's own connection
    if (cmdThesis.Connection.State == ConnectionState.Open)
    {
        cmdThesis.Connection.Close();
    }

    using (var bulkCopy = new SqlBulkCopy(@"Data Source=.;Initial Catalog=YourDb;Integrated Security=True"))
    {
        bulkCopy.BatchSize = 3000;
        bulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping("Col1", "Col1"));
        bulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping("Col2", "Col2"));
        bulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping("Col3", "Col3"));
        bulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping("Col4", "Col4"));
        bulkCopy.DestinationTableName = "DestinationTable";

        // Result is the DataTable in which the last hour's records have been buffered
        bulkCopy.WriteToServer(Result);
    }

    // Clear the buffer for the next hour
    Result.Rows.Clear();
}
Although I think you should be ok, since you are apparently using a .NET platform, you can check out StreamInsight: http://technet.microsoft.com/en-us/library/ee391416.aspx
I have a database table (in MS Access) of GPS information, with a record of the speed, location (lat/long) and bearing of a vehicle for every second. There is a field that shows the time, like 2007-09-25 07:59:53. The problem is that this table has merged information from several files that were collected on this project. So, for example, 2007-09-25 07:59:53 to 2007-09-25 08:15:42 could be one file, and after a gap of more than 10 seconds, the next file will start, e.g. 2007-09-25 08:15:53 to 2007-09-25 08:22:12. I need to populate a File number field in this table, and the separating criterion for each file will be that the gap in time between the last and next file is more than 10 seconds. I did this using C# code by iterating over the table, comparing each record to the next, and changing the file number whenever the gap is more than 10 seconds.
My question is, should this type of problem be solved using programming or is it better solved using a SQL query? I can load the data into a database like SQL Server, so there is no limitation to what tool I can use. I just want to know the best approach.
If it is better to solve this using SQL, will I need to use cursors?
When solving this using programming (for example C#), what is an efficient way to update a table when 20000+ records need to be updated based on an updated DataSet? I used the DataAdapter.Update() method and it seemed to take a long time to update the table (30 minutes or so).
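For reference, my row-by-row pass looks roughly like this (the column names are placeholders for the real schema):

// gpsTable is the DataTable loaded from the Access table, ordered by time;
// "GpsTime" and "FileNumber" are placeholder column names.
int fileNumber = 1;
DateTime? previousTime = null;

foreach (DataRow row in gpsTable.Rows)
{
    var currentTime = (DateTime)row["GpsTime"];

    // A gap of more than 10 seconds means the next file starts here
    if (previousTime.HasValue && (currentTime - previousTime.Value).TotalSeconds > 10)
        fileNumber++;

    row["FileNumber"] = fileNumber;
    previousTime = currentTime;
}

adapter.Update(gpsTable);   // this is the step that takes ~30 minutes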
Assuming SQL Server 2008 and CTEs from your comments:
The best time to use SQL is generally when you are comparing or evaluating large sets of data.
Iterative programming languages like C# are better suited to more expansive analysis of individual records or analysis of rows one at a time (Row By Agonizing Row, i.e. RBAR).
For examples of recursive CTEs, see here. MS has a good reference.
Also, depending on data structure, you could do this with a normal JOIN:
SELECT <stuff>
FROM MyTable t
INNER JOIN MyTable t2
    ON t2.pk = (SELECT MIN(pk) FROM MyTable WHERE pk > t.pk)
WHERE DATEDIFF(second, t.timefield, t2.timefield) > 10
I've got some text data that I'm loading into a SQL Server 2005 database using Linq-to-SQL with this method (pseudo-code):
Create a DataContext
While (new data exists)
{
    Read a record from the text file
    Create a new Record
    Populate the record
    dataContext.InsertOnSubmit(record);
}
dataContext.SubmitChanges();
The code is a little C# console application. This works fine so far, but I'm about to do an import of the real data (rather than a test subset) and this contains about 2 million rows instead of the 1000 I've tested. Am I going to have to do some clever batching or something similar to avoid the code falling over or performing woefully, or should Linq-to-SQL handle this gracefully?
It looks like this would work; however, the changes (and thus memory) kept by the DataContext are going to grow with each InsertOnSubmit. Maybe it's advisable to perform a SubmitChanges every 100 records?
I would also take a look at SqlBulkCopy to see if it doesn't fit your use case better.
If you need to do bulk inserts, you should check out SqlBulkCopy.
Linq-to-SQL is not really suited for doing large-scale bulk inserts.
You would want to call SubmitChanges() every 1000 records or so to flush the changes so far; otherwise you'll run out of memory.
If you want performance, you might want to bypass Linq-To-SQL and go for System.Data.SqlClient.SqlBulkCopy instead.
Just for the record, I did as marc_s and Peter suggested and chunked the data. It's not especially fast (it took about an hour and a half in the Debug configuration, with the debugger attached and quite a lot of console progress output), but it's perfectly adequate for our needs:
Create a DataContext
numRows = 0;
While (new data exists)
{
    Read a record from the text file
    Create a new Record
    Populate the record
    dataContext.InsertOnSubmit(record)

    // Submit the changes in thousand row batches
    if (numRows % 1000 == 999)
        dataContext.SubmitChanges()

    numRows++
}
dataContext.SubmitChanges()
Nightly, I need to fill a SQL Server 2005 table from an ODBC source with over 8 million records. Currently I am using an insert statement from a linked server, with syntax similar to this:
INSERT INTO SQLStagingTable SELECT * FROM OPENQUERY(ODBCSource, 'SELECT * FROM SourceTable')
This is really inefficient and takes hours to run. I'm in the middle of coding a solution using SqlBulkCopy code similar to the code found in this question.
The code in that question first populates a DataTable in memory and then passes that DataTable to SqlBulkCopy's WriteToServer method.
What should I do if the populated DataTable uses more memory than is available on the machine it is running on (a server with 16 GB of memory in my case)?
I've thought about using the overloaded OdbcDataAdapter Fill method which allows you to fill only the records from x to n (where x is the start index and n is the number of records to fill). However, that could turn out to be an even slower solution than what I currently have, since it would mean re-running the select statement on the source a number of times.
What should I do? Just populate the whole thing at once and let the OS manage the memory? Should I populate it in chunks? Is there another solution I haven't thought of?
The easiest way would be to use ExecuteReader() against your ODBC data source and pass the IDataReader to the WriteToServer(IDataReader) overload.
Most data reader implementations will only keep a very small portion of the total results in memory.
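A minimal sketch of that, assuming an ODBC connection string for the source and the existing SQLStagingTable as the destination:

// Requires System.Data, System.Data.Odbc and System.Data.SqlClient.
using (var source = new OdbcConnection(odbcConnectionString))
using (var destination = new SqlConnection(sqlConnectionString))
{
    source.Open();
    destination.Open();

    using (var cmd = new OdbcCommand("SELECT * FROM SourceTable", source))
    using (IDataReader reader = cmd.ExecuteReader())
    using (var bulk = new SqlBulkCopy(destination))
    {
        bulk.DestinationTableName = "SQLStagingTable";
        bulk.BatchSize = 10000;        // commit in chunks rather than one giant batch
        bulk.BulkCopyTimeout = 0;      // no timeout for a long-running nightly load

        // WriteToServer pulls rows from the reader as it writes, so only a small
        // window of rows is ever held in memory.
        bulk.WriteToServer(reader);
    }
}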
SSIS performs well and is very tweakable. In my experience 8 million rows is not out of its league. One of my larger ETLs pulls in 24 million rows a day and does major conversions and dimensional data warehouse manipulations.
If you have indexes on the destination table, you might consider disabling them until the records have been inserted.