I have a .NET Core C# console application that performs a large number of calculations and then writes the results to a SQL Server 2016 Developer edition database using Dapper (and Dapper.Contrib). The issue I'm having is that when I run a lot of items in parallel (greater than 1000, for example), I start getting intermittent connection failures on the .Open() call, saying
A network-related or instance-specific error occurred...
This often happens after several thousand rows have already been inserted successfully.
A simplified version of the code would look like the following:
Parallel.ForEach(collection, (item) =>
{
var results = item.Calculate(parameters);
dal.Results.Insert(results);
allResults.AddRange(results);
});
And inside the Insert method, it looks like this:
public override void Insert(IEnumerable<Result> entities)
{
using (var connection = GetConnection())
{
connection.Open();
using (var transaction = connection.BeginTransaction(IsolationLevel.ReadCommitted))
{
connection.Insert(entities, transaction);
transaction.Commit();
}
}
}
Some other things about the code that I don't think are affecting this but might be relevant:
dal.Results is simply a repository that contains that Insert() method and is preinitialized with a connection string that is used to instantiate a new SqlConnection(connectionString) every time GetConnection() is called.
allResults is a ConcurrentBag<Result> that I'm using to store all the results for later use outside the Parallel.ForEach
I'm using a transaction because it seems to perform better this way, but I'm open to suggestions if that could be causing problems.
Thanks in advance for any guidance on this issue!
There is no advantage to executing heavily IO-bound DB operations in parallel.
You should create fewer but bigger batches of data to be inserted, with a minimum number of database transactions. That can be achieved in several ways:
With SQL bulk insert operations provided by the .NET Framework
By using an external library specialized in high-speed bulk operations
By crafting a SQL stored procedure that takes an array of data as a parameter. More information about table-valued parameters can be found at https://learn.microsoft.com/en-us/sql/relational-databases/tables/use-table-valued-parameters-database-engine
So try the following: execute the CPU-intensive calculations in the parallel loop and save allResults to the database after the loop, as sketched below.
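A minimal sketch of that restructuring, reusing the collection, parameters, dal and allResults members from your question (Calculate and the repository are assumed to behave as you describe):
// Sketch only: do the CPU-bound work in parallel, insert once afterwards.
var allResults = new ConcurrentBag<Result>();

Parallel.ForEach(collection, item =>
{
    var results = item.Calculate(parameters);   // CPU-bound, safe to run in parallel
    foreach (var r in results)
        allResults.Add(r);                      // ConcurrentBag is thread-safe
});

// One connection, one transaction, one batch of inserts after the loop.
dal.Results.Insert(allResults);
This keeps the parallelism where it pays off (the calculations) and gives the database a single, sequential write path.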
I have read and implemented several different versions of Microsoft's suggested methods for querying a SQL Server database. In all that I have read, each query is surrounded by a using statement, e.g., in some method DoQuery:
List<List<string>> DoQuery(string cStr, string query)
{
    var rows = new List<List<string>>();
    using (SqlConnection c = new SqlConnection(cStr))
    {
        c.Open();
        using (SqlCommand cmd = new SqlCommand(query, c))
        {
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    ...
                    // read columns and put into a list to add to rows
                }
                // close all of the using blocks
            }
        }
    }
    // return the list of rows containing the list of column values.
    return rows;
}
I need to run this code several hundred times for different query strings against the same database. It seems that creating a new connection each time would be inefficient and dropping it each time wasteful.
How should I structure this so that it is efficient? When I tried not using a using block and passing the connection into the DoQuery method, I got messages saying the connection had not been closed. If I closed it after the query, then I got messages saying it wasn't open.
I'm also trying to improve this because I keep getting somewhat random
IOException: Unable to read data from the transport connection: Operation on non-blocking socket would block.
I'm the only user of the database at this time and I'm not doing anything in multiple threads or async, etc. Just looping through query strings and running DoQuery on them.
Could my structure be part of that problem, i.e. not releasing the resources fast enough and thereby seeing the connection blocked?
I'm stuck here on efficiency and this blocking problem. Thanks in advance.
As it turns out, the query structure was fine and the queries were fine. The problem was that I had an ‘order by X desc’ on each query and that column was not indexed. This caused a full table scan to order the rows even if only returning 2. The table has about 3 million rows and I thought it could handle that better than it does. It timed out with 360 second connection timeout! I indexed the column and no more ‘blocking’ nonsense, which BTW, is a horrible message to return when it was actually a timeout. The queries now run fine if I index every column that appears in a where clause.
I am trying to load two huge result sets (source and target) coming from different RDBMSs, but the problem I am struggling with is getting those two huge result sets into memory.
Below are the queries used to pull data from source and target:
Sql Server -
select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn
Oracle -
select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn
Records in Source : 12377200
Records in Target : 12266800
Following are the approaches I have tried, with some statistics:
1) open data reader approach for reading source and target data:
Total jobs running in parallel = 3
Time taken by Job1 = 01:47:25
Time taken by Job2 = 01:47:25
Time taken by Job3 = 01:48:32
There is no index on Id Column.
Major time is spent here:
var dr = command.ExecuteReader();
Problems:
There are also timeout issues, for which I had to set CommandTimeout to 0 (infinite), and that is bad.
2) Chunk by chunk reading approach for reading source and target data:
Total jobs = 1
Chunk size : 100000
Time Taken : 02:02:48
There is no index on Id Column.
3) Chunk by chunk reading approach for reading source and target data:
Total jobs = 1
Chunk size : 100000
Time Taken : 00:39:40
Index is present on Id column.
4) open data reader approach for reading source and target data:
Total jobs = 1
Index : Yes
Time: 00:01:43
5) open data reader approach for reading source and target data:
Total jobs running in parallel = 3
Index : Yes
Time: 00:25:12
I observed that while having an index on LinkedColumn does improve performance, the problem is we are dealing with a 3rd party RDBMS table which might not have an index.
We would like to keep the database server as free as possible, so the data reader approach doesn't seem like a good idea: there will be lots of jobs running in parallel, which will put a lot of pressure on the database server, and we don't want that.
Hence we want to fetch records from source and target into application memory and do a 1-to-1 record comparison, to keep the database server free.
Note: I want to do this in my c# application and don't want to use SSIS or Linked Server.
Update:
Source SQL query execution time in SQL Server Management Studio: 00:01:41
Target SQL query execution time in SQL Server Management Studio: 00:01:40
What will be the best way to read huge result set in memory?
Code:
static void Main(string[] args)
{
// Running 3 jobs in parallel
//Task<string>[] taskArray = { Task<string>.Factory.StartNew(() => Compare()),
//Task<string>.Factory.StartNew(() => Compare()),
//Task<string>.Factory.StartNew(() => Compare())
//};
Compare();//Run single job
Console.ReadKey();
}
public static string Compare()
{
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
var srcConnection = new SqlConnection("Source Connection String");
srcConnection.Open();
var command1 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn", srcConnection);
var tgtConnection = new SqlConnection("Target Connection String");
tgtConnection.Open();
var command2 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn", tgtConnection);
var drA = GetReader(command1);
var drB = GetReader(command2);
stopwatch.Stop();
string a = stopwatch.Elapsed.ToString(@"d\.hh\:mm\:ss");
Console.WriteLine(a);
return a;
}
private static IDataReader GetReader(SqlCommand command)
{
command.CommandTimeout = 0;
return command.ExecuteReader();//Culprit
}
There is nothing (I know of) faster than a DataReader for fetching db records.
Working with large databases comes with its challenges; reading 10+ million records in under 2 minutes is pretty good.
If you want faster you can:
jdwend's suggestion (a rough sketch of this approach follows the answer):
Use sqlcmd.exe and the Process class to run query and put results into a csv file and then read the csv into c#. sqlcmd.exe is designed to archive large databases and runs 100x faster than the c# interface. Using linq methods are also faster than the SQL Client class
Parallelize your queries and fetch them concurrently, merging the results: https://shahanayyub.wordpress.com/2014/03/30/how-to-load-large-dataset-in-datagridview/
The easiest (and IMO the best for a SELECT-everything query) is to throw hardware at it:
https://blog.codinghorror.com/hardware-is-cheap-programmers-are-expensive/
Also make sure you're testing on the PROD hardware, in release mode, as anything else could skew your benchmarks.
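For what it's worth, here is a rough sketch of the sqlcmd.exe idea; the server, database, and file paths are placeholders, and the switches should be double-checked against your installed sqlcmd version:
using System.Diagnostics;

// -S server, -d database, -E trusted connection, -Q "query" (run and exit),
// -o output file, -s column separator, -W trim trailing spaces, -h -1 no header row.
var psi = new ProcessStartInfo
{
    FileName = "sqlcmd.exe",
    Arguments = "-S MyServer -d MyDb -E " +
                "-Q \"SET NOCOUNT ON; select Id as LinkedColumn, CompareColumn from Source order by LinkedColumn\" " +
                "-o C:\\temp\\source.csv -s \",\" -W -h -1",
    UseShellExecute = false
};

using (var proc = Process.Start(psi))
{
    proc.WaitForExit();
}

// Stream the file back instead of holding a DataReader open against the server.
foreach (var line in System.IO.File.ReadLines(@"C:\temp\source.csv"))
{
    var parts = line.Split(',');   // naive split; fine only if the data has no embedded commas
    // compare parts[0] / parts[1] here
}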
This is a pattern that I use. It gets the data for a particular record set into a System.Data.DataTable instance and then closes and disposes all unmanaged resources ASAP. The pattern also works for other providers under System.Data, including System.Data.OleDb, System.Data.SqlClient, etc. I believe the Oracle Client SDK implements the same pattern.
// don't forget these using directives
using System.Data;
using System.Data.SqlClient;
// here's the code.
var connectionstring = "YOUR_CONN_STRING";
var table = new DataTable("MyData");
using (var cn = new SqlConnection(connectionstring))
{
cn.Open();
using (var cmd = cn.CreateCommand())
{
cmd.CommandText = "Select [Fields] From [Table] etc etc";
// your SQL statement here.
using (var adapter = new SqlDataAdapter(cmd))
{
adapter.Fill(table);
} // dispose adapter
} // dispose cmd
cn.Close();
} // dispose cn
foreach(DataRow row in table.Rows)
{
// do something with the data set.
}
I think I would deal with this problem in a different way.
But before that, let's make some assumptions:
According to your question description, you will get data from SQL Server and Oracle.
Each query will return a bunch of data.
You do not specify the point of getting all that data into memory, nor what it will be used for.
I assume that the data you will process is going to be used multiple times and that you will not repeat both queries multiple times.
And whatever you do with the data, it probably is not going to be displayed to the user all at the same time.
With these foundation points, I would proceed as follows:
Think of this problem as data processing.
Have a third database, or some other place with auxiliary database tables, where you can store all the results of the 2 queries.
To avoid timeouts and the like, try to obtain the data using paging (get thousands of rows at a time) and save them in these aux DB tables, NOT in "RAM" memory (see the sketch after this list).
As soon as your logic completes all the data loading (import/migration), you can start processing it.
Data processing is a key strength of database engines; they are efficient and have evolved over many years, so don't spend time reinventing the wheel. Use a stored procedure to "crunch/process/merge" the 2 auxiliary tables into only 1.
Now that you have all the "merged" data in a third aux table, you can use it for display or whatever else you need.
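A rough sketch of the paged loading step, assuming SQL Server 2012+ OFFSET/FETCH syntax on the source side; the write into the auxiliary table is left as a placeholder since it depends on where that table lives:
const int pageSize = 100000;
int offset = 0;

using (var src = new SqlConnection("Source Connection String"))
{
    src.Open();

    while (true)
    {
        var page = new DataTable();
        using (var cmd = new SqlCommand(
            "select Id as LinkedColumn, CompareColumn from Source " +
            "order by Id offset @offset rows fetch next @pageSize rows only", src))
        {
            cmd.Parameters.AddWithValue("@offset", offset);
            cmd.Parameters.AddWithValue("@pageSize", pageSize);
            using (var adapter = new SqlDataAdapter(cmd))
            {
                adapter.Fill(page);          // only one page is ever held in memory
            }
        }

        if (page.Rows.Count == 0)
            break;                           // no more rows

        // ...write this page into the auxiliary table here (bulk insert, plain inserts, etc.)...

        offset += pageSize;
    }
}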
If you want to read it faster, you must use the original API to get the data. Avoid frameworks like LINQ and rely on DataReader instead. Check whether you can tolerate something like a dirty read (WITH (NOLOCK) in SQL Server).
If your data is very huge, try to implement partial reads, e.g. by indexing your data and reading it in ranges (say, a date from/to condition) until everything has been selected.
After that, you should consider using threading in your system to parallelize the flow: one thread to read from job 1, another thread to read from job 2, as in the sketch below. This will cut a lot of time.
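A small sketch of that threading point, reusing command1, command2 and GetReader from the question's code so both ExecuteReader calls run concurrently:
var taskA = Task.Run(() => GetReader(command1));   // job 1
var taskB = Task.Run(() => GetReader(command2));   // job 2
Task.WaitAll(taskA, taskB);

var drA = taskA.Result;
var drB = taskB.Result;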
Technicalities aside, I think there is a more fundamental problem here.
select [...] order by LinkedColumn
I observed that while having an index on LinkedColumn does improve performance, the problem is we are dealing with 3rd party RDBMS tables which might or might not have an index.
We would like to keep database server as free as possible
If you cannot ensure that the DB has a tree-based index on that column, it means the DB will be quite busy sorting your millions of elements. It's slow and resource hungry. Get rid of the ORDER BY in the SQL statement and perform the sort on the application side to get results faster and reduce load on the DB... or ensure the DB has such an index!
...depending on whether this fetching is a common or a rare operation, you'll want to either enforce a proper index in the DB, or just fetch it all and sort it client side, as in the sketch below.
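A minimal sketch of the client-side variant; it assumes Id is an int and CompareColumn a string (which may differ in your schema), and "command" stands for a SqlCommand whose SQL has no ORDER BY:
var rows = new List<(int LinkedColumn, string CompareColumn)>();

using (var reader = command.ExecuteReader())
{
    while (reader.Read())
        rows.Add((reader.GetInt32(0), reader.GetString(1)));
}

// Sort in the application instead of on the database server.
rows.Sort((a, b) => a.LinkedColumn.CompareTo(b.LinkedColumn));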
I had a similar situation many years ago. Before I looked at the problem it took 5 days running continuously to move data between 2 systems using SQL.
I took a different approach.
We extracted the data from the source system into just a small number of files representing a flattened out data model and arranged the data in each file so it all naturally flowed in the proper sequence as we read from the files.
I then wrote a Java program that processed these flattened data files and produced individual table load files for the target system. So, for example, the source extract had less than a dozen data files from the source system which turned into 30 to 40 or so load files for the target database.
That process would run in just a few minutes and I incorporated full auditing and error reporting and we could quickly spot problems and discrepancies in the source data, get them fixed, and run the processor again.
The final piece of the puzzle was a multi-threaded utility I wrote that performed a parallel bulk load on each load file into the target Oracle database. This utility created a Java process for each table and used Oracle's bulk table load program to quickly push the data into the Oracle DB.
When all was said and done that 5 day SQL-SQL transfer of millions of records turned into just 30 minutes using a combination of Java and Oracle's bulk load capabilities. And there were no errors and we accounted for every penny of every account that was transferred between systems.
So, maybe think outside the SQL box and use Java, the file system, and Oracle's bulk loader. And make sure you're doing your file IO on solid state hard drives.
If you need to process large database result sets from Java, you can opt for JDBC to give you the low level control required. On the other hand, if you are already using an ORM in your application, falling back to JDBC might imply some extra pain. You would be losing features such as optimistic locking, caching, automatic fetching when navigating the domain model and so forth. Fortunately most ORMs, like Hibernate, have some options to help you with that. While these techniques are not new, there are a couple of possibilities to choose from.
A simplified example: let's assume we have a table (mapped to class "DemoEntity") with 100,000 records. Each record consists of a single column (mapped to the property "property" in DemoEntity) holding some random alphanumerical data of about ~2 KB. The JVM is run with -Xmx250m. Let's assume that 250 MB is the overall maximum memory that can be assigned to the JVM on our system. Your job is to read all records currently in the table, do some not further specified processing, and finally store the result. We'll assume that the entities resulting from our bulk operation are not modified.
I'm right now using SQL Server 2008 in my project to store and fetch data. This has been working perfectly so far: I can fetch 20000 records in less than 50 ms (JSON). But I'm facing a problem with inserts. In my project I need to be able to insert something like 100000 records every minute, and this seems to be very slow with SQL Server.
I've tried another database (a NoSQL DB), MongoDB, which is very fast at storing data (5 s) compared to SQL Server (270 s), but not as fast as SQL at fetching data (20000 => 180 ms).
So I'm asking here if there is any way to make SQL Server faster at storing, or to make MongoDB faster at fetching (I'm not an expert in MongoDB; I know only the very basic things about it).
public static void ExecuteNonQuery(string sql)
{
SqlConnection con = GetConnection();
con.Open();
SqlCommand cmd = new SqlCommand(sql, con);
try
{
cmd.ExecuteNonQuery();
}
finally
{
con.Close();
}
}
SQL's Insert function
public IEnumerable<T> GetRecords<T>(System.Linq.Expressions.Expression<Func<T, bool>> expression, int from, int to) where T : class, new()
{
return _db.GetCollection<T>(collectionName).Find<T>(expression).Skip(from).Limit(to).Documents;
}
Mongo's Select function ( MongoDB 1.6 )
Update: data structure: (int) Id, (string) Data
I guess that you are executing each insert in a transaction of its own (an implicit transaction is created if you do not provide one explicitly). As SQL Server needs to ensure that each transaction is committed to the hard drive, every transaction carries a very significant overhead.
To get things to go faster, try to perform many inserts (try with a thousand or so) in a single ExecuteNonQuery() call. Also do not open and close the connection for every insert, but keep it open (and thus stay in the same transaction) for several inserts; a sketch follows.
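A minimal sketch of that batching, reusing GetConnection from the question's code; the table name "MyTable" and the "batch" collection are placeholders, and the columns follow the "(int) Id, (string) Data" structure from the update:
using (var con = GetConnection())
{
    con.Open();
    using (var tran = con.BeginTransaction())
    using (var cmd = new SqlCommand("INSERT INTO MyTable (Id, Data) VALUES (@id, @data)", con, tran))
    {
        cmd.Parameters.Add("@id", SqlDbType.Int);
        cmd.Parameters.Add("@data", SqlDbType.NVarChar, -1);

        foreach (var item in batch)              // e.g. a thousand rows per transaction
        {
            cmd.Parameters["@id"].Value = item.Id;
            cmd.Parameters["@data"].Value = item.Data;
            cmd.ExecuteNonQuery();               // same open connection, same transaction
        }

        tran.Commit();                           // one commit for the whole batch
    }
}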
You should have a look at the SqlBulkCopy Class
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx
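As a hedged sketch of what SqlBulkCopy could look like for the same "(int) Id, (string) Data" shape (again, "MyTable" and "batch" are placeholders):
var table = new DataTable();
table.Columns.Add("Id", typeof(int));
table.Columns.Add("Data", typeof(string));
foreach (var item in batch)
    table.Rows.Add(item.Id, item.Data);

using (var con = GetConnection())
{
    con.Open();
    using (var bulk = new SqlBulkCopy(con) { DestinationTableName = "MyTable", BatchSize = 10000 })
    {
        bulk.WriteToServer(table);   // one bulk operation instead of 100000 single inserts
    }
}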
MongoDB is very fast on reads and writes. 50k reads and writes per second is doable on commodity hardware, depending on the data size. In addition to that, you always have the option to scale out with sharding and replica sets, but as said: 20k operations per second with MongoDB is nothing.
Generally, the speed of inserting data into the database is a function of the complexity of the operation.
If your inserts are significantly slow, then it points to optimisation problems with the inserts. Identify exactly what SQL insert statements your program is generating, and then use the database's EXPLAIN function to figure out what operations the underlying database is performing. This often gives you a clue as to how you need to change your setup to increase the speed of these operations.
It might mean you have to change your database, or it might mean batching your inserts into a single call rather than inserting each item separately.
I see you are setting up and closing the connection each time; this takes significant time in itself. Try using a persistent connection.
I'm a novice C# dev and I'm writing a database app that performs updates on two different tables and inserts on another two tables, with each process running on its own separate thread. So I have two threads handling inserts on two different tables and two threads handling updates on two different tables. Each process is updating or inserting approximately 4 or 5 times per second, so I don't close the connection until the complete session is over, and then I close the entire app. I wanted to know if I should be closing the connection after each insert and update even though I'm performing these operations so frequently. Second, should each thread be running on its own connection and command object?
By the way, I'm writing the app in C# and the database is MySQL. Also, as of now I'm using one connection and command object for all four threads. I keep getting an error message saying "There is already an open DataReader associated with this connection that must be closed first"; that's why I'm asking if I should be using multiple connection and command objects.
Thanks
-Donld
If you enable connection pooling, it should enable optimal use of MySql connections for your scenario. Either way, generally the best pattern to follow is:
Acquire and open connection
Do work
Close/release connection
Something similar to (I'm a bit rusty on the class names for the MySql connector, so this may not be exactly correct, but you should get the general idea!):
private void DoMyPieceOfWork(int value1, int value2)
{
using(MySqlConnection connection = new MySqlConnection(
CONNECTION_STRING_GOES_HERE))
{
connection.Open();
using(MySqlCommand command = new MySqlCommand(
"INSERT INTO `blah` (Column1, Column2) VALUES (@column1, @column2)",
connection))
{
command.Parameters.Add("@column1", MySqlDbType.Int32).Value = value1;
command.Parameters.Add("@column2", MySqlDbType.Int32).Value = value2;
command.ExecuteNonQuery();
}
connection.Close();
}
}
Of course this is a contrived, simplistic, example but the gist of it stands.
You either have to create a new connection for each thread, or (as an idea) create a synchronized queue of commands and then process the queue in a single worker thread.
You may also take a look at the Task class of the .NET Framework 4; a sketch of the queue idea follows.
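A sketch of that queue idea, assuming a hypothetical WorkItem type that knows how to build its own MySqlCommand; BlockingCollection comes from System.Collections.Concurrent (.NET 4), and CONNECTION_STRING_GOES_HERE is the same placeholder as in the code above:
var queue = new BlockingCollection<WorkItem>();

// Single writer task owning the one and only connection.
var writer = Task.Factory.StartNew(() =>
{
    using (var connection = new MySqlConnection(CONNECTION_STRING_GOES_HERE))
    {
        connection.Open();
        foreach (var work in queue.GetConsumingEnumerable())     // blocks until items arrive
        {
            using (var command = work.BuildCommand(connection))  // hypothetical helper
            {
                command.ExecuteNonQuery();
            }
        }
    }
});

// The four worker threads just enqueue work instead of sharing the connection:
queue.Add(new WorkItem(/* ... */));

// At shutdown:
queue.CompleteAdding();
writer.Wait();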
Update: Looks like the query does not throw any timeout. The connection is timing out.
This is a sample code for executing a query. Sometimes, while executing time consuming queries, it throws a timeout exception.
I cannot use any of these techniques:
1) Increase timeout.
2) Run it asynchronously with a callback. This needs to run in a synchronous manner.
Please suggest any other techniques to keep the connection alive while executing a time-consuming query.
private static void CreateCommand(string queryString,
string connectionString)
{
using (SqlConnection connection = new SqlConnection(
connectionString))
{
SqlCommand command = new SqlCommand(queryString, connection);
command.Connection.Open();
command.ExecuteNonQuery();
}
}
Since you are using ExecuteNonQuery, which does not return any rows, you can try this polling-based approach. It executes the query in an async manner (without a callback), but the application will wait (inside a while loop) until the query is complete. The code is from MSDN. This should solve the timeout problem; please try it out.
But I agree with others that you should think more about optimizing the query to run in under 30 seconds.
IAsyncResult result = command.BeginExecuteNonQuery();
int count = 0;
while (!result.IsCompleted)
{
Console.WriteLine("Waiting ({0})", count++);
System.Threading.Thread.Sleep(1000);
}
Console.WriteLine("Command complete. Affected {0} rows.",
command.EndExecuteNonQuery(result));
You should first check your query to see whether it's optimized and isn't somehow suffering from missing indexes. 30 seconds is a lot for most queries, even on large databases, if they are properly tuned. If you have solid proof, using the query plan, that the query can't be executed any faster than that, then you should increase the timeout. There's no other way to keep the connection; that's the purpose of the timeout: to terminate the connection if the query doesn't complete in that time frame.
I have to agree with Terrapin.
You have a few options on how to get your time down. First, if your company employs DBAs, I'd recommend asking them for suggestions.
If that's not an option, or if you want to try some other things first here are your three major options:
Break up the query into components that run under the timeout. This is probably the easiest.
Change the query to optimize the access path through the database (generally: hitting an index as closely as you can)
Change or add indices to affect your query's access path.
If you are constrained from using the default process of changing the timeout value, you will most likely have to do a lot more work. The following options come to mind:
Validate with your DBAs and another code review that you have truly optimized the query as best as possible.
Work on the underlying DB structure to see if there is any gain you can get on the DB side, such as creating or modifying indexes.
Divide it into multiple parts, even if this means running procedures with multiple return parameters that simply call another proc. (This option is not elegant, and honestly, if your code REALLY is going to take this much time, I would go to management and re-discuss the 30-second timeout.)
We recently had a similar issue on a SQL Server 2000 database.
During your query, run this query on your master database on the db server and see if there are any locks you should troubleshoot:
select
spid,
db_name(sp.dbid) as DBname,
blocked as BlockedBy,
waittime as WaitInMs,
lastwaittype,
waitresource,
cpu,
physical_io,
memusage,
loginame,
login_time,
last_batch,
hostname,
sql_handle
from sysprocesses sp
where (waittype > 0 and spid > 49) or spid in (select blocked from sysprocesses where blocked > 0)
SQL Server Management Studio 2008 also contains a very cool activity monitor which lets you see the health of your database during your query.
In our case, it was a NETWORKIO wait which kept the database busy: some legacy VB code didn't disconnect its result set quickly enough.
If you are prohibited from using the features of the data access API to allow a query to last more than 30 seconds, then we need to see the SQL.
The performance gains to be made by optimizing the use of ADO.NET are slight in comparison to the gains of optimizing the SQL.
And you already are using the most efficient method of executing SQL. Other techniques would be mind numbingly slower (although, if you did a quick retrieval of your rows and some really slow client side processing using DataSets, you might be able to get the initial retrieval down to less than 30 seconds, but I doubt it.)
If we knew if you were doing inserts, then maybe you should be using bulk insert. But we don't know the content of your sql.
This is an UGLY hack, but might help solve your problem temporarily until you can fix the real problem
private static void CreateCommand(string queryString,string connectionString)
{
int maxRetries = 3;
int retries = 0;
while(true)
{
try
{
using (SqlConnection connection = new SqlConnection(connectionString))
{
SqlCommand command = new SqlCommand(queryString, connection);
command.Connection.Open();
command.ExecuteNonQuery();
}
break;
}
catch (SqlException se)
{
if (se.Message.IndexOf("Timeout", StringComparison.InvariantCultureIgnoreCase) == -1)
throw; //not a timeout
if (retries >= maxRetries)
throw new Exception( String.Format("Timedout {0} Times", retries),se);
//or break to throw no error
retries++;
}
}
}
command.CommandTimeout *= 2;
That will double the default time-out, which is 30 seconds.
Or, put the value for CommandTimeout in a configuration file, so you can adjust it as needed without recompiling.
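For example, with a plain appSettings entry; the key name "CommandTimeoutSeconds" is just an illustration, and this assumes a reference to System.Configuration:
using System.Configuration;

// app.config:  <appSettings><add key="CommandTimeoutSeconds" value="120" /></appSettings>
int timeoutSeconds;
if (!int.TryParse(ConfigurationManager.AppSettings["CommandTimeoutSeconds"], out timeoutSeconds))
    timeoutSeconds = 30;                 // fall back to the default

command.CommandTimeout = timeoutSeconds; // adjustable without recompiling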
You should break your query up into multiple chunks that each execute within the timeout period.
If you absolutely cannot increase the timeout, your only option is to reduce the time of the query to execute within the default 30 second timeout.
I tend to dislike increasing the connection/command timeout, since in my mind that would be treating the symptom rather than the problem.
Have you thought about breaking the query down into several smaller chunks?
Also, have you run your query through the Database Engine Tuning Advisor, in:
Management Studio > Tools > Database Engine Tuning Advisor
Lastly, could we get a look at the query itself?
cheers
Have you tried wrapping your SQL inside a stored procedure? They seem to have better memory management. I have seen timeouts like this before in plain SQL statements with internal queries, using classic ADO, i.e. select * from (select ....) t inner join somethingTable, where the internal query was returning a very large number of results.
Other tips:
1. Perform reads with the WITH (NOLOCK) table hint; it's dirty and I don't recommend it, but it will tend to be faster.
2. Also look at the execution plan of the SQL you're trying to run and reduce the row scanning and the order in which you join tables.
3. Look at adding some indexes to your tables for faster reads.
4. I've also found that deleting rows is very expensive; you could try to limit the number of rows per call.
5. Swapping @table variables for #temporary tables has also worked for me in the past.
6. You may also have a saved bad execution plan (heard of it, never seen it).
Hope this helps
Update: Looks like the query does not throw any timeout. The connection is timing out.
In other words, even if you don't execute a query, the connection times out? Because there are two time-outs: connection and query. Everybody seems to focus on the query, but if you get connection timeouts, it's a network problem and has nothing to do with the query: the connection first has to be established before a query can be run, obviously.
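To make the distinction concrete, a small sketch; the connection-string values are illustrative only, and queryString is the same variable as in the question's code:
// Connect Timeout governs how long .Open() may take; CommandTimeout governs the query itself.
var connectionString =
    "Data Source=MyServer;Initial Catalog=MyDb;Integrated Security=True;Connect Timeout=60";

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(queryString, connection))
{
    command.CommandTimeout = 120;   // query/command timeout, independent of the 60s above
    connection.Open();              // subject to Connect Timeout
    command.ExecuteNonQuery();      // subject to CommandTimeout
}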
It might be worth trying paging the results back.
Just set the SqlCommand's CommandTimeout property to 0; this will cause the command to wait until the query finishes...
eg:
SqlCommand cmd = new SqlCommand(spName,conn);
cmd.CommandType = CommandType.StoredProcedure;
cmd.CommandTimeout = 0;