Microsoft SQL Server: inserting data into a large table every second

I have a data acquisition system that reads values from industrial devices and records them in a Microsoft SQL Server 2008 R2 database. The recording interval is approximately 20 seconds, and each record contains approximately 600 bytes of data.
Now I need to insert data from new hardware, but this time the recording interval has to be 1 second. In other words, I will insert one 600-byte record into the SQL Server database every second.
I have two questions:
Is there any problem I may run into while inserting data every second? I think Microsoft SQL Server can handle this insert frequency, but I am not sure about the long run.
The program is a long-running application, and I clear the data table approximately every week. Recording data every second, I will have 3,600 rows in the table every hour, 86,400 rows every day and approximately 600K rows at the end of the week. Is this OK for good read performance? Or should I change my approach so that the table does not hold this many rows?
By the way, I use LINQ to SQL for all my database operations and C# for programming.

"Is there any problem I may run into while inserting data every second? I think Microsoft SQL Server can handle this insert frequency, but I am not sure about the long run."
If the database is properly designed, then you should not run into any problem. We save GIS data at a much greater rate without any issue.
"Is this OK for good read performance? Or should I change my approach so that the table does not hold this many rows?"
It depends. If you need all the data, then how can you change the approach? If you don't need it, why do you save it?

First of all, think about the existing indexes on the tables you insert into, because indexes slow down inserts. Second, if you use the FULL recovery model, every insert is written to the transaction log, and the log file will grow rapidly until it is backed up.
Consider changing your recovery model to SIMPLE and disabling your indexes.
Of course, selecting rows from that table will then be slower, but I don't know what your requirements are.

Based on my thesis experience in college: if your system is fully stable and does not crash, overflow, and so on, you can use SqlBulkCopy to avoid one I/O operation per record.
This is sample bulk copy code for a DataTable; the method should be called every hour:
// Requires: using System.Data; using System.Data.SqlClient;
// Result is a DataTable field holding the buffered rows; cmdThesis is an existing command field.
private void SaveNewData()
{
    if (cmdThesis.Connection.State == ConnectionState.Open)
    {
        cmdThesis.Connection.Close();
    }
    // SqlBulkCopy opens its own connection from the connection string; dispose it when done.
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(@"Data Source=.;Initial Catalog=YourDb;Integrated Security=True"))
    {
        bulkCopy.BatchSize = 3000;
        bulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping("Col1", "Col1"));
        bulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping("Col2", "Col2"));
        bulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping("Col3", "Col3"));
        bulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping("Col4", "Col4"));
        bulkCopy.DestinationTableName = "DestinationTable";
        bulkCopy.WriteToServer(Result);
    }
    // Empty the in-memory buffer only after the copy has completed.
    Result.Rows.Clear();
}

Although I think you should be ok, since you are apparently using a .NET platform, you can check out StreamInsight: http://technet.microsoft.com/en-us/library/ee391416.aspx

Related

What is the best way to load huge result set in memory?

I am trying to load two huge result sets (source and target) coming from different RDBMSs, but the problem I am struggling with is getting those two huge result sets into memory.
These are the queries that pull the data from source and target:
Sql Server -
select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn
Oracle -
select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn
Records in Source : 12377200
Records in Target : 12266800
Following are the approaches i have tried with some statistics:
1) Open data reader approach for reading source and target data:
Total jobs running in parallel = 3
Time taken by Job1 = 01:47:25
Time taken by Job2 = 01:47:25
Time taken by Job3 = 01:48:32
There is no index on the Id column.
Most of the time is spent here:
var dr = command.ExecuteReader();
Problems:
There are also timeout issues, for which I have to set CommandTimeout to 0 (infinite), and that is bad.
2) Chunk by chunk reading approach for reading source and target data:
Total jobs = 1
Chunk size : 100000
Time Taken : 02:02:48
There is no index on Id Column.
3) Chunk by chunk reading approach for reading source and target data:
Total jobs = 1
Chunk size : 100000
Time Taken : 00:39:40
Index is present on Id column.
4) open data reader approach for reading source and target data:
Total jobs = 1
Index : Yes
Time: 00:01:43
5) open data reader approach for reading source and target data:
Total jobs running in parallel = 3
Index : Yes
Time: 00:25:12
I observed that while having an index on LinkedColumn does improve performance, the problem is we are dealing with a 3rd party RDBMS table which might not have an index.
We would like to keep the database server as free as possible, so the data reader approach doesn't seem like a good idea: there will be lots of jobs running in parallel, which will put more pressure on the database server than we want.
Hence we want to fetch the source and target records into application memory and do a 1-to-1 record comparison there, to keep the database server free.
Note: I want to do this in my C# application and don't want to use SSIS or a linked server.
Update:
Source Sql Query Execution time in sql server management studio: 00:01:41
Target Sql Query Execution time in sql server management studio:00:01:40
What will be the best way to read huge result set in memory?
Code:
// Requires: using System; using System.Data; using System.Data.SqlClient; using System.Diagnostics;
static void Main(string[] args)
{
    // Running 3 jobs in parallel
    //Task<string>[] taskArray = { Task<string>.Factory.StartNew(() => Compare()),
    //Task<string>.Factory.StartNew(() => Compare()),
    //Task<string>.Factory.StartNew(() => Compare())
    //};
    Compare(); // Run single job
    Console.ReadKey();
}

public static string Compare()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    var srcConnection = new SqlConnection("Source Connection String");
    srcConnection.Open();
    var command1 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn", srcConnection);
    var tgtConnection = new SqlConnection("Target Connection String");
    tgtConnection.Open();
    var command2 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn", tgtConnection);
    var drA = GetReader(command1);
    var drB = GetReader(command2);
    stopwatch.Stop();
    // Verbatim string (@) so the backslashes in the TimeSpan format are kept as-is.
    string a = stopwatch.Elapsed.ToString(@"d\.hh\:mm\:ss");
    Console.WriteLine(a);
    return a;
}

private static IDataReader GetReader(SqlCommand command)
{
    command.CommandTimeout = 0; // 0 = wait indefinitely
    return command.ExecuteReader(); // Culprit
}

There is nothing (that I know of) faster than a DataReader for fetching database records.
Working with large databases comes with its challenges; reading 10 million records in under 2 minutes is pretty good.
If you want it faster, you can:
Follow jdwend's suggestion:
Use sqlcmd.exe and the Process class to run the query and put the results into a CSV file, and then read the CSV into C#. sqlcmd.exe is designed to archive large databases and runs 100x faster than the C# interface. Using LINQ methods is also faster than the SQL Client class.
Parallelize your queries and fetch concurrently, merging the results: https://shahanayyub.wordpress.com/2014/03/30/how-to-load-large-dataset-in-datagridview/
The easiest option (and IMO the best for a SELECT * of everything) is to throw hardware at it:
https://blog.codinghorror.com/hardware-is-cheap-programmers-are-expensive/
Also make sure you're testing on the PROD hardware and in release mode, as anything else could skew your benchmarks.

This is a pattern that I use. It gets the data for a particular record set into a System.Data.DataTable instance and then closes and disposes all unmanaged resources ASAP. The pattern also works for other providers under System.Data, including System.Data.OleDb, System.Data.SqlClient, etc. I believe the Oracle Client SDK implements the same pattern.
// don't forget these using directives
using System.Data;
using System.Data.SqlClient;

// here's the code.
var connectionstring = "YOUR_CONN_STRING";
var table = new DataTable("MyData");
using (var cn = new SqlConnection(connectionstring))
{
    cn.Open();
    using (var cmd = cn.CreateCommand())
    {
        cmd.CommandText = "Select [Fields] From [Table] etc etc";
        // your SQL statement here.
        using (var adapter = new SqlDataAdapter(cmd))
        {
            adapter.Fill(table);
        } // dispose adapter
    } // dispose cmd
    cn.Close();
} // dispose cn

foreach (DataRow row in table.Rows)
{
    // do something with the data set.
}

I think I would deal with this problem in a different way.
But first, let's make some assumptions:
According to your question, you will get data from SQL Server and Oracle.
Each query will return a large amount of data.
You do not specify the purpose of getting all that data into memory, nor how it will be used.
I assume that the data will be processed multiple times and that you do not want to repeat both queries each time.
And whatever you do with the data, it is probably not going to be displayed to the user all at once.
With these points as a foundation, I would proceed as follows:
Think of this problem as a data processing task.
Have a third database, or some other place with auxiliary database tables, where you can store the full results of the two queries.
To avoid timeouts, obtain the data using paging (a few thousand rows at a time) and save them in these auxiliary tables, NOT in RAM.
As soon as your logic has finished loading all the data (the import/migration step), you can start processing it.
Data processing is a core strength of database engines: they are efficient and have evolved over many years, so don't spend time reinventing the wheel. Use a stored procedure to crunch/process/merge the two auxiliary tables into a single one.
Now that you have all the merged data in a third auxiliary table, you can use it for display or for whatever else you need.

If you want to read it faster, use the native API to get the data. Avoid frameworks like LINQ and rely on DataReader instead. Check whether you can accept something like dirty reads (WITH (NOLOCK) in SQL Server).
If your data is very large, try to implement partial reads, something like an index over your data. For example, you could add a condition on a date range (from - to) and repeat it until everything has been selected.
After that, consider using threading to parallelize the flow: one thread fetches for job 1, another thread fetches for job 2. This will cut a lot of time.

Technicalities aside, I think there is a more fundamental problem here.
"select [...] order by LinkedColumn"
"I observed that while having an index on LinkedColumn does improve performance, the problem is we are dealing with a 3rd party RDBMS table which might not have an index."
"We would like to keep database server as free as possible"
If you cannot ensure that the DB has a tree-based index on that column, the DB will be quite busy sorting your millions of elements. That is slow and resource hungry. Get rid of the order by in the SQL statement and perform the sort on the application side to get results faster and reduce load on the DB ...or ensure the DB has such an index!!!
...depending on whether this fetch is a common or a rare operation, you'll want to either enforce a proper index in the DB, or just fetch it all and sort it client side.
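A minimal sketch of the client-side option, assuming both result sets have already been read (without ORDER BY) into lists of (Id, CompareColumn) pairs. The names RowComparer and CountDifferences are hypothetical, and the sketch assumes Id is an integer that is unique on each side.

```csharp
using System;
using System.Collections.Generic;

public static class RowComparer
{
    // Sorts both lists by Id in application memory, then walks them in step
    // (a merge comparison). Returns how many Ids are missing on one side or
    // carry a different CompareColumn value. Note: both input lists are sorted in place.
    public static int CountDifferences(
        List<KeyValuePair<int, string>> source,
        List<KeyValuePair<int, string>> target)
    {
        Comparison<KeyValuePair<int, string>> byId = (a, b) => a.Key.CompareTo(b.Key);
        source.Sort(byId); // replaces ORDER BY on the source server
        target.Sort(byId); // replaces ORDER BY on the target server

        int i = 0, j = 0, differences = 0;
        while (i < source.Count && j < target.Count)
        {
            int order = source[i].Key.CompareTo(target[j].Key);
            if (order < 0) { differences++; i++; }      // Id exists only in source
            else if (order > 0) { differences++; j++; } // Id exists only in target
            else
            {
                if (!string.Equals(source[i].Value, target[j].Value, StringComparison.Ordinal))
                    differences++;                      // same Id, different value
                i++;
                j++;
            }
        }

        // Rows left over on either side have no partner.
        differences += (source.Count - i) + (target.Count - j);
        return differences;
    }
}
```

The sort costs application CPU and memory instead of database resources, which is the trade-off this answer argues for when no index can be guaranteed.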

I had a similar situation many years ago. Before I looked at the problem it took 5 days running continuously to move data between 2 systems using SQL.
I took a different approach.
We extracted the data from the source system into just a small number of files representing a flattened out data model and arranged the data in each file so it all naturally flowed in the proper sequence as we read from the files.
I then wrote a Java program that processed these flattened data files and produced individual table load files for the target system. So, for example, the source extract had less than a dozen data files from the source system which turned into 30 to 40 or so load files for the target database.
That process would run in just a few minutes and I incorporated full auditing and error reporting and we could quickly spot problems and discrepancies in the source data, get them fixed, and run the processor again.
The final piece of the puzzle was a multi-threaded utility I wrote that performed a parallel bulk load on each load file into the target Oracle database. This utility created a Java process for each table and used Oracle's bulk table load program to quickly push the data into the Oracle DB.
When all was said and done that 5 day SQL-SQL transfer of millions of records turned into just 30 minutes using a combination of Java and Oracle's bulk load capabilities. And there were no errors and we accounted for every penny of every account that was transferred between systems.
So, maybe think outside the SQL box and use Java, the file system, and Oracle's bulk loader. And make sure you're doing your file IO on solid state hard drives.

If you need to process large database result sets from Java, you can opt for JDBC to give you the low level control required. On the other hand, if you are already using an ORM in your application, falling back to JDBC might imply some extra pain. You would be losing features such as optimistic locking, caching, automatic fetching when navigating the domain model and so forth. Fortunately most ORMs, like Hibernate, have some options to help you with that. While these techniques are not new, there are a couple of possibilities to choose from.
A simplified example: let's assume we have a table (mapped to the class "DemoEntity") with 100,000 records. Each record consists of a single column (mapped to the property "property" in DemoEntity) holding some random alphanumeric data of about 2KB. The JVM is run with -Xmx250m. Let's assume that 250MB is the overall maximum memory that can be assigned to the JVM on our system. Your job is to read all records currently in the table, do some processing that is not specified further, and finally store the result. We'll assume that the entities resulting from our bulk operation are not modified

Best way to split up a long file: programming or SQL?

I have a database table (in MS Access) of GPS information, with a record of the speed, location (lat/long) and bearing of a vehicle for every second. There is a field that shows the time like this: 2007-09-25 07:59:53. The problem is that this table holds merged information from several files that were collected on this project. So, for example, 2007-09-25 07:59:53 to 2007-09-25 08:15:42 could be one file and, after a gap of more than 10 seconds, the next file will start, like 2007-09-25 08:15:53 to 2007-09-25 08:22:12. I need to populate a file number field in this table, and the separating criterion for each file is that the time gap between the end of one file and the start of the next is more than 10 seconds. I did this in C# code by iterating over the table, comparing each record to the next and changing the file number whenever the gap is more than 10 seconds.
My question is: should this type of problem be solved using programming, or is it better solved using a SQL query? I can load the data into a database like SQL Server, so there is no limitation on what tool I can use. I just want to know the best approach.
If it is better to solve this using SQL, will I need to use cursors?
When solving this using programming (for example C#), what is an efficient way to update a table when 20,000+ records need to be updated based on an updated DataSet? I used the DataAdapter.Update() method, and it seemed to take a long time to update the table (30 minutes or so).

Assuming SQL Server 2008 and CTEs from your comments:
The best time to use SQL is generally when you are comparing or evaluating large sets of data.
Iterative programming languages like C# are better suited to more expansive analysis of individual records, or analysis of rows one at a time (RBAR: Row By Agonizing Row).
For examples of recursive CTEs, see here. MS has a good reference.
Also, depending on data structure, you could do this with a normal JOIN:
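-- Pair each row with the next row by primary key and keep the pairs that are
-- more than 10 seconds apart, i.e. the points where a new file starts.
SELECT <stuff>
FROM MyTable T
INNER JOIN MyTable T2
ON T2.pk = (SELECT MIN(pk) FROM MyTable WHERE pk > T.pk)
WHERE DATEDIFF(second, T.timefield, T2.timefield) > 10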
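For comparison, the iterative approach described in the question can be sketched in a few lines of C#. This is only an illustration: the names FileNumbering and AssignFileNumbers are hypothetical, and the timestamps are assumed to be already loaded and sorted in ascending order.

```csharp
using System;
using System.Collections.Generic;

public static class FileNumbering
{
    // Assigns a 1-based file number to each timestamp. A new file starts
    // whenever the gap to the previous record is strictly greater than maxGap.
    // The input must already be sorted in ascending order.
    public static int[] AssignFileNumbers(IReadOnlyList<DateTime> sortedTimes, TimeSpan maxGap)
    {
        var result = new int[sortedTimes.Count];
        int fileNumber = 1;
        for (int i = 0; i < sortedTimes.Count; i++)
        {
            if (i > 0 && sortedTimes[i] - sortedTimes[i - 1] > maxGap)
            {
                fileNumber++; // gap exceeded: this record opens the next file
            }
            result[i] = fileNumber;
        }
        return result;
    }
}
```

With a 10-second threshold, a call would look like FileNumbering.AssignFileNumbers(times, TimeSpan.FromSeconds(10)). The pass itself is linear in the number of records, so writing the numbers back to the table is likely to dominate the run time.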

Can I do a very large insert with Linq-to-SQL?

I've got some text data that I'm loading into a SQL Server 2005 database with Linq-to-SQL using this method (pseudo-code):
Create a DataContext
While (new data exists)
{
    Read a record from the text file
    Create a new Record
    Populate the record
    dataContext.InsertOnSubmit(record);
}
dataContext.SubmitChanges();
The code is a little C# console application. This works fine so far, but I'm about to do an import of the real data (rather than a test subset) and this contains about 2 million rows instead of the 1000 I've tested. Am I going to have to do some clever batching or something similar to avoid the code falling over or performing woefully, or should Linq-to-SQL handle this gracefully?

It looks like this would work; however, the changes (and thus the memory) that are kept by the DataContext are going to grow with each InsertOnSubmit. Maybe it's advisable to perform a SubmitChanges every 100 records?
I would also take a look at SqlBulkCopy to see whether it fits your use case better.

If you need to do bulk inserts, you should check out SqlBulkCopy.
Linq-to-SQL is not really suited for doing large-scale bulk inserts.

You would want to call SubmitChanges() every 1000 records or so to flush the changes so far; otherwise you'll run out of memory.
If you want performance, you might want to bypass Linq-To-SQL and go for System.Data.SqlClient.SqlBulkCopy instead.

Just for the record, I did as marc_s and Peter suggested and chunked the data. It's not especially fast (it took about an hour and a half in Debug configuration, with the debugger attached and quite a lot of console progress output), but it's perfectly adequate for our needs:
Create a DataContext
numRows = 0;
While (new data exists)
{
    Read a record from the text file
    Create a new Record
    Populate the record
    dataContext.InsertOnSubmit(record)
    // Submit the changes in thousand row batches
    if (numRows % 1000 == 999)
        dataContext.SubmitChanges()
    numRows++
}
dataContext.SubmitChanges()
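The chunking in the pseudo-code above could also be factored into a small reusable helper. This is a sketch rather than the poster's actual code; the names Batching and InBatches are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class Batching
{
    // Lazily splits a sequence into lists of at most batchSize items,
    // so only one batch needs to be held in memory at a time.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> items, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batch = new List<T>(batchSize);
        foreach (var item in items)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize); // start a fresh batch
            }
        }

        if (batch.Count > 0)
        {
            yield return batch; // final, partially filled batch
        }
    }
}
```

With Linq-to-SQL, each batch could then be passed to the table's InsertAllOnSubmit method, followed by a SubmitChanges call. Creating a new DataContext per batch may also keep the tracked change set small, though that has not been measured here.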

Bulk Operations with SQL CE / LINQ

I'm trying to write a program to convert a large amount of data from a legacy SQL Express system to a newer system based on SQL CE. Here's a quick snapshot of what's going on:
Most of the tables in the SQL Express install are small (< 10K records)
One table is extremely large, and is well over 1 million records
For the smaller tables I can use LINQ just fine -- but the large table gives me problems. The standard way of:
foreach(var dataRow in ...)
{
    table.InsertOnSubmit(dataRow);
}
database.SubmitChanges();
Is painfully slow and takes several hours to complete. I've even tried doing some simple "bulk" operations to try and eliminate one giant insertion at the end of the loop, ie:
foreach(var dataRow in ...)
{
    if(count == BULK_LIMIT)
    {
        count = 0;
        database.SubmitChanges();
    }
    count++;
    table.InsertOnSubmit(dataRow);
}
// Final submit, to catch the last BULK_LIMIT item block
database.SubmitChanges();
I've tried a variety of bulk sizes, from relatively small values like 1K-5K to larger sizes up to 300K.
Ultimately I'm stuck and the process takes roughly the same amount of time (several hours) regardless of the bulk size.
So - does anyone know of a way to crank up the speed? The typical solution would be to use SqlBulkCopy, but that isn't compatible with SQL CE.
A couple of notes:
Yes I really do want all the records in SQL CE, and yes I've setup the connection to allow the database to max out at 4 GB.
Yes I really do need every last of the 1M+ records.
The stuff in each data row is all primitive, and is a mix of strings and timestamps.
The size of the legacy SQL Express database is ~400 MB.
Thanks in advance - all help is appreciated!
-- Dan

Use a parameterised INSERT statement: prepare a command, set the parameter values in a loop and reuse the same command for each INSERT.
Remove any indexes and re-apply them after you have performed all the INSERTs.
Update: Chris Tacke has the fastest solution here, using SqlCeResultSet: Bulk Insert In SQLCE

What's the best way to use SqlBulkCopy to fill a really large table?

Nightly, I need to fill a SQL Server 2005 table from an ODBC source with over 8 million records. Currently I am using an insert statement with a select from a linked server, with syntax similar to this:
Insert Into SQLStagingTable Select * from OpenQuery(ODBCSource, 'Select * from SourceTable')
This is really inefficient and takes hours to run. I'm in the middle of coding a solution using SqlBulkCopy, with code similar to the code found in this question.
The code in that question first populates a DataTable in memory and then passes that DataTable to SqlBulkCopy's WriteToServer method.
What should I do if the populated DataTable uses more memory than is available on the machine it is running on (a server with 16GB of memory in my case)?
I've thought about using the overloaded OdbcDataAdapter Fill method, which allows you to fill only the records from x to n (where x is the start index and n is the number of records to fill). However, that could turn out to be even slower than what I currently have, since it would mean re-running the select statement on the source a number of times.
What should I do? Just populate the whole thing at once and let the OS manage the memory? Should I populate it in chunks? Is there another solution I haven't thought of?

The easiest way would be to use ExecuteReader() against your ODBC data source and pass the resulting IDataReader to the WriteToServer(IDataReader) overload.
Most data reader implementations will only keep a very small portion of the total results in memory.
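A minimal sketch of that approach, assuming the table names from the question (SourceTable, SQLStagingTable) and matching column layouts on both sides. The class and method names (NightlyLoad, CopyAll), the batch size and the timeout values are illustrative assumptions and have not been tested against a real ODBC source.

```csharp
using System.Data.Odbc;
using System.Data.SqlClient;

public static class NightlyLoad
{
    // Streams rows from the ODBC source straight into SQL Server,
    // without first materialising them in a DataTable.
    public static void CopyAll(string odbcConnectionString, string sqlConnectionString)
    {
        using (var source = new OdbcConnection(odbcConnectionString))
        using (var command = new OdbcCommand("SELECT * FROM SourceTable", source))
        {
            source.Open();
            command.CommandTimeout = 0; // assumption: no timeout for the long-running select

            using (var reader = command.ExecuteReader())
            using (var bulkCopy = new SqlBulkCopy(sqlConnectionString))
            {
                bulkCopy.DestinationTableName = "SQLStagingTable";
                bulkCopy.BatchSize = 10000;   // illustrative value; tune for your data
                bulkCopy.BulkCopyTimeout = 0; // assumption: no timeout for the copy
                bulkCopy.WriteToServer(reader);
            }
        }
    }
}
```

Because the reader is consumed row by row, memory use should stay roughly flat however many records the source returns.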

SSIS performs well and is very tweakable. In my experience 8 million rows is not out of its league. One of my larger ETLs pulls in 24 million rows a day and does major conversions and dimensional data warehouse manipulations.

If you have indexes on the destination table, you might consider disabling them until the records have been inserted.
