ADO.Net DataReader timeout issue - c#

I am using C# + VSTS 2008 + ADO.NET to connect to SQL Server 2008 Enterprise. I am following almost the same pattern/sample mentioned here -- using an ADO.NET DataReader to retrieve data one entry (row) at a time.
http://msdn.microsoft.com/en-us/library/haa3afyz.aspx
My question is, if I set the SqlCommand timeout in this sample:
1. I think the timeout applies to the maximum time allowed to retrieve one specific row, not to the whole entry-by-entry loop as a total?
BTW, by loop I mean:
while (reader.Read())
{
    Console.WriteLine("{0}\t{1}", reader.GetInt32(0),
        reader.GetString(1));
}
2. And this timeout only concerns how long it takes to retrieve a data entry from the database; it has nothing to do with how long we spend processing each entry (e.g. if we set the timeout to 20 seconds, it takes 1 second to retrieve a data entry from the database, and my application logic takes 30 seconds to manipulate that entry, the timeout will never fire).
Correct understanding?

The command timeout that you can set applies to how long you give ADO.NET to do its job.
If you call cmdQuery.ExecuteNonQuery() which returns nothing but performs a SQL statement it's the time needed to perform that statement.
If you call cmdQuery.ExecuteReader() which returns a data reader, it's the time needed for ADO.NET to set up / construct that data reader so that you can then use it.
If you call cmdQuery.ExecuteScalar() which returns a single scalar value, it's the time needed to execute the query and grab that single result.
If you use dataAdapter.Fill() to fill a data table or data set, it's the time needed for ADO.NET to retrieve the data and then fill the data table or data set.
So overall: the timeout applies to the portion of the job that ADO.NET can do - execute the statement, fill a data set, return a scalar value.
Of course it does NOT apply to the time it takes YOU to iterate through the results (in case of a data reader). That wouldn't make sense at all...
Marc
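Not part of the answer above — just a minimal sketch of where that timeout sits, assuming a hypothetical Customers(Id, Name) table and connection string. CommandTimeout covers ExecuteReader() and the waiting inside each Read(); it does not count the work your loop body does with a row:

using System;
using System.Data.SqlClient;

class TimeoutDemo
{
    static void Main()
    {
        // hypothetical connection string and table
        var connectionString = "Data Source=.;Initial Catalog=MyDb;Integrated Security=true";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("SELECT Id, Name FROM dbo.Customers", connection))
        {
            command.CommandTimeout = 20;                 // 20 seconds for ADO.NET's part of the work
            connection.Open();

            using (var reader = command.ExecuteReader()) // the timeout applies here...
            {
                while (reader.Read())                    // ...and to each wait for more data
                {
                    // Anything done here (business logic, I/O, a 30-second computation)
                    // is NOT counted against CommandTimeout.
                    Console.WriteLine("{0}\t{1}", reader.GetInt32(0), reader.GetString(1));
                }
            }
        }
    }
}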

Yes, you are right. The CommandTimeout is the time the database needs to execute the command (any command).


C# - Query still running in SQL Server Database even though CommandTimeout was set to terminate query after a set time

Within my C# code I am using CommandTimeout to ensure that any query that executes for longer than 30s is terminated on both the server and the database. However, when listing the currently running queries on the database, the query that was set to cancel after 30s runs well beyond 30s.
DataTable response = new DataTable(); // 'response' is not declared in the original snippet; assumed to be a DataTable
using (SqlConnection connection = new SqlConnection(connectionString))
{
    connection.Open();
    SqlCommand sqlCommand = new SqlCommand(query, connection);
    // Set timeout to 30s
    sqlCommand.CommandTimeout = 30;
    SqlDataAdapter da = new SqlDataAdapter(sqlCommand);
    da.Fill(response);
    connection.Close();
    da.Dispose();
}
Why is the query still running in the DB? Is my only option right now to send another query from the server to kill it (KILL [session_id]) after 30s?
EDIT: 300Mb of data is being returned for this query.
There are a number of posts on StackOverflow indicating that SqlCommand.CommandTimeout won't affect the behavior of SqlDataAdapter.Fill. Instead, you supposedly have to set the SqlDataAdapter's SelectCommand.CommandTimeout property.
However, there are other posts which seem to indicate that even this doesn't work. This one in particular makes me think that the query will only be canceled if the timeout occurs before the query starts yielding results. Once results start coming in, it appears to ignore all timeouts.
My recommendation would be to reconsider using SqlDataAdapter. Depending on your use case, maybe a library like Dapper would work better for you?
You may also want to consider reporting this as a defect to the .NET team. I've had mixed success in the past reporting such errors; it depends on whether the team wants to prioritize fixing the issue.
Update
It looks like this may be the intended, documented behavior, as Marc Gravell points out here.
lol: from the documentation
(https://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlcommand.commandtimeout(v=vs.110).aspx)
For example, with a 30 second time out, if Read requires two network
packets, then it has 30 seconds to read both network packets. If you
call Read again, it will have another 30 seconds to read any data that
it requires.
So: this timeout resets itself on every Read. So the only way it'll trip
is if any single Read operation takes longer than 30s. As long as the
SQL Server manages to get at least one row onto the pipe in that time,
it won't time out via either API.
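Not from the posts referenced above — just a sketch of what the SelectCommand.CommandTimeout suggestion looks like in the question's code (connectionString and query as in the question); per the documentation quoted above, even this may never trip once SQL Server starts streaming rows:

// requires: using System.Data; using System.Data.SqlClient;
DataTable response = new DataTable();
using (var connection = new SqlConnection(connectionString))
using (var da = new SqlDataAdapter(query, connection))
{
    da.SelectCommand.CommandTimeout = 30; // timeout set on the adapter's own SELECT command
    da.Fill(response);                    // Fill opens and closes the connection itself
}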

What is the best way to load huge result set in memory?

I am trying to load 2 huge result sets (source and target) coming from different RDBMSs, and the problem I am struggling with is getting those 2 huge result sets into memory.
Below are the queries used to pull data from source and target:
Sql Server -
select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn
Oracle -
select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn
Records in Source : 12377200
Records in Target : 12266800
Following are the approaches I have tried, with some statistics:
1) open data reader approach for reading source and target data:
Total jobs running in parallel = 3
Time taken by Job1 = 01:47:25
Time taken by Job2 = 01:47:25
Time taken by Job3 = 01:48:32
There is no index on Id Column.
Major time is spent here:
var dr = command.ExecuteReader();
Problems:
There are also timeout issues, for which I have to keep CommandTimeout at 0 (infinite), and that is bad.
2) Chunk by chunk reading approach for reading source and target data:
Total jobs = 1
Chunk size : 100000
Time Taken : 02:02:48
There is no index on Id Column.
3) Chunk by chunk reading approach for reading source and target data:
Total jobs = 1
Chunk size : 100000
Time Taken : 00:39:40
Index is present on Id column.
4) open data reader approach for reading source and target data:
Total jobs = 1
Index : Yes
Time: 00:01:43
5) open data reader approach for reading source and target data:
Total jobs running in parallel = 3
Index : Yes
Time: 00:25:12
I observed that while having an index on LinkedColumn does improve performance, the problem is we are dealing with a 3rd party RDBMS table which might not have an index.
We would like to keep the database server as free as possible, so the data reader approach doesn't seem like a good idea: there will be lots of jobs running in parallel, which will put a lot of pressure on the database server, which we don't want.
Hence we want to fetch the records into application memory from both source and target and do a 1-to-1 record comparison, to keep the database server free.
Note: I want to do this in my c# application and don't want to use SSIS or Linked Server.
Update:
Source SQL query execution time in SQL Server Management Studio: 00:01:41
Target SQL query execution time in SQL Server Management Studio: 00:01:40
What will be the best way to read huge result set in memory?
Code:
static void Main(string[] args)
{
    // Running 3 jobs in parallel
    //Task<string>[] taskArray = { Task<string>.Factory.StartNew(() => Compare()),
    //    Task<string>.Factory.StartNew(() => Compare()),
    //    Task<string>.Factory.StartNew(() => Compare())
    //};
    Compare(); // Run single job
    Console.ReadKey();
}

public static string Compare()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();

    var srcConnection = new SqlConnection("Source Connection String");
    srcConnection.Open();
    var command1 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn", srcConnection);

    var tgtConnection = new SqlConnection("Target Connection String");
    tgtConnection.Open();
    var command2 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn", tgtConnection);

    var drA = GetReader(command1);
    var drB = GetReader(command2);

    stopwatch.Stop();
    string a = stopwatch.Elapsed.ToString(@"d\.hh\:mm\:ss");
    Console.WriteLine(a);
    return a;
}

private static IDataReader GetReader(SqlCommand command)
{
    command.CommandTimeout = 0;
    return command.ExecuteReader(); // Culprit
}
There is nothing (I know of) faster than a DataReader for fetching db records.
Working with large databases comes with its own challenges; reading 10 million records in under 2 minutes is pretty good.
If you want faster you can:
jdwend's suggestion:
Use sqlcmd.exe and the Process class to run the query and put the results into a CSV file, then read the CSV into C#. sqlcmd.exe is designed to archive large databases and runs 100x faster than the C# interface. Using LINQ methods is also faster than the SQL client classes.
Parallelize your queries and fetch concurrently, merging the results (see the sketch after this answer): https://shahanayyub.wordpress.com/2014/03/30/how-to-load-large-dataset-in-datagridview/
The easiest (and IMO the best for a "SELECT * everything" load) is to throw hardware at it:
https://blog.codinghorror.com/hardware-is-cheap-programmers-are-expensive/
Also make sure you're testing on the PROD hardware and in release mode, as anything else could skew your benchmarks.
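A rough illustration of the "parallelize your queries" suggestion (not taken from the linked post). It assumes LinkedColumn is an int and CompareColumn a string, and uses SqlConnection for both sides even though the question's target is Oracle:

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

static class ParallelFetch
{
    static List<KeyValuePair<int, string>> Fetch(string connectionString, string sql)
    {
        var rows = new List<KeyValuePair<int, string>>();
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    rows.Add(new KeyValuePair<int, string>(reader.GetInt32(0), reader.GetString(1)));
            }
        }
        return rows;
    }

    static void Main()
    {
        // read source and target concurrently, each on its own connection, then compare
        var sourceTask = Task.Run(() => Fetch("Source Connection String",
            "select Id as LinkedColumn, CompareColumn from Source order by LinkedColumn"));
        var targetTask = Task.Run(() => Fetch("Target Connection String",
            "select Id as LinkedColumn, CompareColumn from Target order by LinkedColumn"));

        Task.WaitAll(sourceTask, targetTask);

        Console.WriteLine("Source rows: {0}, Target rows: {1}",
            sourceTask.Result.Count, targetTask.Result.Count);
        // ...compare sourceTask.Result with targetTask.Result here
    }
}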
This is a pattern that I use. It gets the data for a particular record set into a System.Data.DataTable instance and then closes and disposes all unmanaged resources ASAP. The pattern also works for other providers under System.Data, including System.Data.OleDb, System.Data.SqlClient, etc. I believe the Oracle client SDK implements the same pattern.
// don't forget these using directives
using System.Data;
using System.Data.SqlClient;

// here's the code
var connectionstring = "YOUR_CONN_STRING";
var table = new DataTable("MyData");
using (var cn = new SqlConnection(connectionstring))
{
    cn.Open();
    using (var cmd = cn.CreateCommand())
    {
        cmd.CommandText = "Select [Fields] From [Table] etc etc";
        // your SQL statement here.
        using (var adapter = new SqlDataAdapter(cmd))
        {
            adapter.Fill(table);
        } // dispose adapter
    } // dispose cmd
    cn.Close();
} // dispose cn

foreach (DataRow row in table.Rows)
{
    // do something with the data set.
}
I think I would deal with this problem in a different way.
But first, let's make some assumptions:
According to your question description, you will get data from SQL Server and Oracle.
Each query will return a bunch of data.
You do not specify the point of getting all that data into memory, nor how it will be used.
I assume that the data you process is going to be used multiple times and that you will not repeat both queries multiple times.
And whatever you do with the data is probably not going to be displayed to the user all at once.
With these points as a foundation, I would proceed as follows:
Think of this problem as data processing.
Have a third database, or some other place with auxiliary database tables, where you can store the full results of the 2 queries.
To avoid timeouts and the like, try to obtain the data using paging (get thousands of rows at a time) and save them into these aux DB tables, NOT in RAM (a sketch of this follows the list).
As soon as your logic completes the data loading (import/migration), you can start processing it.
Data processing is what database engines are built for; they are efficient and have evolved over many years, so don't spend time reinventing the wheel. Use a stored procedure to crunch/process/merge the 2 auxiliary tables into just 1.
Now that you have all the merged data in a third aux table, you can use it for display or whatever else you need.
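A sketch of the paging idea above. The auxiliary table name (dbo.AuxSource) and the page size are assumptions, and OFFSET/FETCH needs SQL Server 2012 or later; the point is that each page goes straight into the aux table rather than accumulating in RAM:

// requires: using System.Data; using System.Data.SqlClient;
static void CopySourceInPages(string sourceConnStr, string auxConnStr)
{
    const int pageSize = 100000;
    int offset = 0;

    while (true)
    {
        var page = new DataTable();
        using (var src = new SqlConnection(sourceConnStr))
        using (var cmd = new SqlCommand(
            @"select Id as LinkedColumn, CompareColumn
              from Source
              order by LinkedColumn
              offset @offset rows fetch next @pageSize rows only", src))
        using (var da = new SqlDataAdapter(cmd))
        {
            cmd.Parameters.AddWithValue("@offset", offset);
            cmd.Parameters.AddWithValue("@pageSize", pageSize);
            src.Open();
            da.Fill(page);
        }

        if (page.Rows.Count == 0)
            break; // no more data

        using (var aux = new SqlConnection(auxConnStr))
        {
            aux.Open();
            using (var bulk = new SqlBulkCopy(aux) { DestinationTableName = "dbo.AuxSource" })
            {
                bulk.WriteToServer(page); // land the page in the aux table, not in RAM
            }
        }

        offset += pageSize;
    }
}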
If you want to read the data faster, use the raw API. Avoid frameworks like LINQ and rely on the DataReader. Check whether you can tolerate something like a dirty read (WITH (NOLOCK) in SQL Server).
If your data is very large, try to implement partial reads, for example by indexing your data and adding a condition such as a date range (from - to) until everything has been selected (sketched just after this answer).
After that, consider using threading in your system to parallelize the flow: one thread to fetch for job 1, another thread to fetch for job 2. That will cut a lot of time.
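A sketch of the partial-read idea above. The CreatedDate column is hypothetical, the @from/@to window boundaries (rangeStart/rangeEnd) are assumed to come from a loop that walks over the windows, and WITH (NOLOCK) is only appropriate if dirty reads are acceptable:

using (var connection = new SqlConnection("Source Connection String"))
using (var command = new SqlCommand(
    @"select Id as LinkedColumn, CompareColumn
      from Source with (nolock)
      where CreatedDate >= @from and CreatedDate < @to
      order by LinkedColumn", connection))
{
    command.Parameters.AddWithValue("@from", rangeStart); // e.g. one day (or one Id band) at a time
    command.Parameters.AddWithValue("@to", rangeEnd);
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // process this slice, then move the @from/@to window forward
        }
    }
}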
Technicalities aside, I think there is a more fundamental problem here.
select [...] order by LinkedColumn
I observed that while having an index on LinkedColumn does improve performance, the problem is we are dealing with 3rd party RDBMS tables which might or might not have an index.
We would like to keep database server as free as possible
If you cannot ensure that the DB has a tree-based index on that column, the DB will be quite busy sorting your millions of rows, which is slow and resource hungry. Get rid of the ORDER BY in the SQL statement and perform the sort on the application side to get results faster and reduce the load on the DB... or ensure the DB has such an index.
Depending on whether this fetching is a common or a rare operation, you'll want to either enforce a proper index in the DB, or just fetch it all and sort it client-side (sketched below).
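A sketch of "fetch it all and sort client-side", reusing the question's column shape (int LinkedColumn, string CompareColumn); nothing here is from the original answer:

var rows = new List<Tuple<int, string>>();
using (var connection = new SqlConnection("Source Connection String"))
using (var command = new SqlCommand(
    "select Id as LinkedColumn, CompareColumn from Source", connection)) // no ORDER BY: the DB just streams
{
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
            rows.Add(Tuple.Create(reader.GetInt32(0), reader.GetString(1)));
    }
}
rows.Sort((a, b) => a.Item1.CompareTo(b.Item1)); // sort in the application instead of on the server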
I had a similar situation many years ago. Before I looked at the problem it took 5 days running continuously to move data between 2 systems using SQL.
I took a different approach.
We extracted the data from the source system into just a small number of files representing a flattened out data model and arranged the data in each file so it all naturally flowed in the proper sequence as we read from the files.
I then wrote a Java program that processed these flattened data files and produced individual table load files for the target system. So, for example, the source extract had less than a dozen data files from the source system which turned into 30 to 40 or so load files for the target database.
That process would run in just a few minutes and I incorporated full auditing and error reporting and we could quickly spot problems and discrepancies in the source data, get them fixed, and run the processor again.
The final piece of the puzzle was a multi-threaded utility I wrote that performed a parallel bulk load on each load file into the target Oracle database. This utility created a Java process for each table and used Oracle's bulk table load program to quickly push the data into the Oracle DB.
When all was said and done that 5 day SQL-SQL transfer of millions of records turned into just 30 minutes using a combination of Java and Oracle's bulk load capabilities. And there were no errors and we accounted for every penny of every account that was transferred between systems.
So, maybe think outside the SQL box and use Java, the file system, and Oracle's bulk loader. And make sure you're doing your file IO on solid state hard drives.
If you need to process large database result sets from Java, you can opt for JDBC to give you the low-level control required. On the other hand, if you are already using an ORM in your application, falling back to JDBC might imply some extra pain. You would be losing features such as optimistic locking, caching, and automatic fetching when navigating the domain model, and so forth. Fortunately most ORMs, like Hibernate, have some options to help you with that. While these techniques are not new, there are a couple of possibilities to choose from.
A simplified example: let's assume we have a table (mapped to class "DemoEntity") with 100,000 records. Each record consists of a single column (mapped to the property "property" in DemoEntity) holding some random alphanumeric data of about ~2KB. The JVM is run with -Xmx250m. Let's assume that 250MB is the overall maximum memory that can be assigned to the JVM on our system. Your job is to read all records currently in the table, do some not-further-specified processing, and finally store the result. We'll assume that the entities resulting from our bulk operation are not modified.

SQL - Better two queries instead of one big one

I am working on a C# application, which loads data from a MS SQL 2008 or 2008 R2 database. The table looks something like this:
ID | binary_data | Timestamp
I need to get only the last entry and only the binary data. Entries to this table are added irregularly from another program, so I have no way of knowing if there is a new entry.
Which version is better (performance etc.) and why?
//Always a query, which might not be needed
public void ProcessData()
{
byte[] data = "query code get latest binary data from db"
}
vs
//Always a smaller check-query, and sometimes two queries
public void ProcessData()
{
DateTime timestamp = "query code get latest timestamp from db"
if(timestamp > old_timestamp)
data = "query code get latest binary data from db"
}
The binary_data field size will be around 30 kB. The function "ProcessData" will be called several times per minute, but can sometimes be called every 1-2 seconds. This is only a small part of a bigger program with lots of threading/database access, so I want the "lightest" solution. Thanks.
Luckily, you can have both:
SELECT TOP 1 binary_data
FROM myTable
WHERE Timestamp > @last_timestamp
ORDER BY Timestamp DESC
If there is no record newer than @last_timestamp, no record is returned and thus no data transmission takes place (= fast). If there are new records, the binary data of the newest one is returned immediately (= no need for a second query).
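A sketch of calling that query from C#; the connectionString variable and the cached old_timestamp are assumed to come from the surrounding code:

byte[] data = null;
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    @"SELECT TOP 1 binary_data
      FROM myTable
      WHERE Timestamp > @last_timestamp
      ORDER BY Timestamp DESC", connection))
{
    command.Parameters.AddWithValue("@last_timestamp", old_timestamp);
    connection.Open();
    object result = command.ExecuteScalar();          // null when there is nothing newer
    if (result != null && result != DBNull.Value)
        data = (byte[])result;
}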
I would suggest you perform tests using both methods as the answer would depend on your usages. Simulate some expected behaviour.
I would say, though, that you are probably okay just doing the first query. Do what works. Don't prematurely optimise; if the single query turns out to be too slow, try your second, two-query approach.
A two-step approach is more efficient from the point of view of the overall system workload:
Get informed that you need to query new data
Query new data
There are several ways to implement this approach. Here are a pair of them.
Using Query Notifications, which is built-in SQL Server functionality supported in .NET (a sketch follows this list).
Using an implied method of getting informed of a database table update, e.g. the one described in this article on the SQL Authority blog.
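A rough sketch of the Query Notifications route via SqlDependency (not from the linked article). It assumes Service Broker is enabled on the database and that the query follows the notification rules (explicit column list, two-part table names, no SELECT *):

// requires: using System.Data.SqlClient;
static void Subscribe(string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "SELECT ID, binary_data, Timestamp FROM dbo.myTable", connection))
    {
        var dependency = new SqlDependency(command);
        dependency.OnChange += (sender, e) =>
        {
            // Fires once when the data changes: fetch the new row, then re-subscribe.
            Subscribe(connectionString);
        };

        connection.Open();
        using (var reader = command.ExecuteReader()) // executing the command registers the subscription
        {
            while (reader.Read()) { /* consume the current data if needed */ }
        }
    }
}

// at application startup:
// SqlDependency.Start(connectionString);
// Subscribe(connectionString);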
I think the better path is a stored procedure that keeps the logic inside the database: something with an output parameter carrying the required data and a return value (e.g. TRUE/FALSE) to signal the presence of new data.
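A sketch of what calling such a stored procedure could look like from C#; the procedure name (dbo.GetLatestBinaryData), its OUTPUT parameter and its return-value convention are all hypothetical, as are connectionString and old_timestamp:

// requires: using System.Data; using System.Data.SqlClient;
byte[] data = null;
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("dbo.GetLatestBinaryData", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    command.Parameters.AddWithValue("@last_timestamp", old_timestamp);

    var dataParam = command.Parameters.Add("@binary_data", SqlDbType.VarBinary, -1);
    dataParam.Direction = ParameterDirection.Output;

    var returnValue = command.Parameters.Add("@return", SqlDbType.Int);
    returnValue.Direction = ParameterDirection.ReturnValue;

    connection.Open();
    command.ExecuteNonQuery();

    bool hasNewData = (int)returnValue.Value != 0;    // the TRUE/FALSE signal
    if (hasNewData && dataParam.Value != DBNull.Value)
        data = (byte[])dataParam.Value;
}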

Really odd DataReader performance issue

I have a SQL Server database and I'm using ADO.NET ExecuteReader to get a datareader. My stored procedure returns around 35,000 records.
The call to ExecuteReader is taking roughly 3 seconds to return the datareader.
I'm using code very similar to this to get my items.
var items = new List<Item>(); // 'items' collection assumed from the surrounding code
using (var conn = new SqlConnection(MySQLHelper.ConnectionString))
{
    conn.Open();
    var sqlCommand = SqlHelper.CreateCommand(conn, "spGetItems");
    using (var dr = sqlCommand.ExecuteReader())
    {
        while (dr.Read())
        {
            var item = new Item { ID = dr.GetInt32(0), ItemName = dr.GetString(1) };
            items.Add(item);
        }
    }
}
The majority of the reads take 0 milliseconds. However, intermittently I'm getting a Read that takes about 5.5 seconds (5000+ milliseconds). I've looked at the data and could find nothing out of the ordinary, so I then started looking at how frequently the slow reads occurred.
This was interesting. While not completely consistent, they were close. The records that were taking a long time to load were as follows...
Record #s: 29, 26, 26, 27, 27, 29, 30, 28, 27, 27, 30, 30, 26, 27
So it looks like 26 to 30 records would read in 0 to a few milliseconds, and then it would take 5 seconds, then the next 26 to 30 records would again read as expected.
I'm at a complete loss here. I can post more code, but there isn't much to it. It's pretty simple code.
EDIT
None of my fields are varchar(max), or even close. My largest field is a numeric(28,12).
After modifying my stored procedure, I'm no longer having issues. I first modified it to SELECT TOP 100, then raised that to TOP 1000, then 10,000, and then 100,000. I never had the issue with those. Then I removed the TOP and now I'm not having the issue I had earlier.
SqlDataReader buffers results sent to the client. See this page on MSDN for details:
When the results are sent back to the client, SQL Server puts as many result set rows as it can into each packet, minimizing the number of packets sent to the client.
I suspect that you're getting 26-30 records per packet. As you iterate through the records, you get a delay as new records are loaded.
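If you want to see those packet boundaries yourself, timing each Read call (roughly what the question describes measuring; this is not the poster's code) would look something like this, with sqlCommand as in the snippet above:

var sw = new System.Diagnostics.Stopwatch();
using (var dr = sqlCommand.ExecuteReader())
{
    while (true)
    {
        sw.Restart();
        bool hasRow = dr.Read();  // waits for the next row, possibly for the next network packet
        sw.Stop();
        if (!hasRow) break;

        if (sw.ElapsedMilliseconds > 100)
            Console.WriteLine("Slow Read: {0} ms", sw.ElapsedMilliseconds);
        // ...materialize the row here...
    }
}
// If the delay is per network packet, the slow reads should appear every 26-30 rows.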
I had a similar problem. The answer was to cast all the text fields to nvarchar(max); after that, .NET's ExecuteReader returned in a similar time to an EXEC of the sproc in Management Studio. Note that the sproc didn't contain transactions, but the .NET call was wrapped in a transaction.

using SQL server to store data

I'm currently using SQL Server 2008 in my project to store and fetch data. This has worked perfectly so far: I can fetch 20000 records in less than 50 ms (JSON). But I am facing a problem with inserts. In my project I need to be able to insert something like 100000 records every minute, and this seems to be very slow with SQL Server.
I've tried another database (a NoSQL DB), MongoDB, which is very fast at storing data (5s compared to SQL Server's 270s) but not as fast as SQL Server at fetching data (20000 records => 180 ms).
So I'm asking here if there is any way to make SQL Server faster at storing, or to make MongoDB faster at fetching (I'm not an expert in MongoDB; I know only the very basics).
public static void ExecuteNonQuery(string sql)
{
    SqlConnection con = GetConnection();
    con.Open();
    SqlCommand cmd = new SqlCommand(sql, con);
    try
    {
        cmd.ExecuteNonQuery();
    }
    finally
    {
        con.Close();
    }
}
SQL's Insert function
public IEnumerable<T> GetRecords<T>(System.Linq.Expressions.Expression<Func<T, bool>> expression, int from, int to) where T : class, new()
{
    return _db.GetCollection<T>(collectionName).Find<T>(expression).Skip(from).Limit(to).Documents;
}
Mongo's Select function (MongoDB 1.6)
Update: data structure: (int) Id, (string) Data
I guess that you are executing each insert in a transaction of its own (an implicit transaction is created if you do not provide one explicitly). As SQL Server needs to ensure that each transaction is committed to the hard drive, every transaction carries a very significant overhead.
To make things go faster, try to perform many inserts (a thousand or so) in a single ExecuteNonQuery() call. Also, do not open and close the connection for every insert; keep it open and wrap several inserts in the same explicit transaction (a sketch follows).
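A sketch of the batching idea; the table and columns (dbo.MyData(Id, Data)) follow the "(int) Id, (string) Data" structure from the update but are otherwise assumptions, as is GetConnection() returning an unopened SqlConnection like in the question's code:

using (var con = GetConnection())
{
    con.Open();
    using (var tx = con.BeginTransaction())
    using (var cmd = con.CreateCommand())
    {
        cmd.Transaction = tx;
        var sb = new System.Text.StringBuilder();
        for (int i = 0; i < 100000; i += 1000)
        {
            sb.Clear();
            for (int j = i; j < i + 1000; j++)
                sb.AppendFormat("insert into dbo.MyData (Id, Data) values ({0}, 'row {0}');", j);
            cmd.CommandText = sb.ToString();
            cmd.ExecuteNonQuery();            // 1000 inserts per round trip
        }
        tx.Commit();                          // the log is flushed once per commit, not once per row
    }
}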
You should have a look at the SqlBulkCopy Class
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx
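A sketch of SqlBulkCopy for this workload; the destination table and its (Id, Data) columns match the data structure in the update, the payload strings are stand-ins, and GetConnection() is the question's own helper:

// requires: using System.Data; using System.Data.SqlClient;
var table = new DataTable();
table.Columns.Add("Id", typeof(int));
table.Columns.Add("Data", typeof(string));
for (int i = 0; i < 100000; i++)
    table.Rows.Add(i, "payload " + i);     // stand-in data

using (var con = GetConnection())
{
    con.Open();
    using (var bulk = new SqlBulkCopy(con))
    {
        bulk.DestinationTableName = "dbo.MyData";
        bulk.BatchSize = 10000;            // send in 10k-row batches
        bulk.WriteToServer(table);         // far faster than row-by-row INSERT statements
    }
}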
MongoDB is very fast on reads and writes. 50k reads and writes per second is doable on commodity hardware, depending on the data size. In addition to that, you always have the option to scale out with sharding and replica sets, but as said: 20k operations per second is nothing for MongoDB.
Generally, the speed of inserting data into the database is a function of the complexity of the operation.
If your inserts are significantly slow, that points to an optimisation problem with the inserts. Identify exactly what SQL insert statements your program is generating and then use the database's EXPLAIN function to figure out what operations the underlying database is performing. This often gives you a clue as to how you need to change your setup to increase the speed of these operations.
It might mean you have to change your database, or it might mean batching your inserts into a single call rather than inserting each item separately.
I see you are setting up and closing the connection each time; this takes significant time in itself. Try using a persistent connection.
