Related
When .NET first came out, I was one of many who complained about .NET's lack of deterministic finalization (class destructors being called on an unpredictable schedule). The compromise Microsoft came up with at the time was the using statement.
Although not perfect, I think using using is important to ensuring that unmanaged resources are cleaned up in a timely manner.
However, I'm writing some ADO.NET code and noticed that almost every class implements IDisposable. That leads to code that looks like this.
using (SqlConnection connection = new SqlConnection(connectionString))
using (SqlCommand command = new SqlCommand(query, connection))
using (SqlDataAdapter adapter = new SqlDataAdapter(command))
using (SqlCommandBuilder builder = new SqlCommandBuilder(adapter))
using (DataSet dataset = new DataSet())
{
command.Parameters.AddWithValue("#FirstValue", 2);
command.Parameters.AddWithValue("#SecondValue", 3);
adapter.Fill(dataset);
DataTable table = dataset.Tables[0];
foreach (DataRow row in table.Rows) // search whole table
{
if ((int)row["Id"] == 4)
{
row["Value2"] = 12345;
}
else if ((int)row["Id"] == 5)
{
row.Delete();
}
}
adapter.Update(table);
}
I strongly suspect I do not need all of those using statements. But without understanding the code for each class in some detail, it's hard to be certain which ones I can leave out. The results is kind of ugly and detracts from the main logic in my code.
Does anyone know why all these classes need to implement IDisposable? (Microsoft has many code examples online that don't worry about disposing many of these objects.) Are other developers writing using statements for all of them? And, if not, how do you decide which ones can do without?
One part of the problem here is that ADO.NET is an abstract provider model. We don't know what a specific implementation (a specific ADO.NET provider) will need with regards to disposal. Sure, we can reasonably assume the connection and transaction need to be disposed, but the command? Maybe. Reader? Probably, not least because one of the command-flags options allows you to associate a connection's lifetime to the reader (so the connection closes when the reader does, which should logically extend to disposal).
So overall, I think it is probably fine.
Most of the time, people aren't messing with ADO.NET by hand, and any ORM tools (or micro-ORM tools, such as "Dapper") will get this right for you without you having to worry about it.
I will openly confess that on the few occasions when I've used DataTable (seriously, it is 2018 - that shouldn't be your default model for representing data, except for some niche scenarios): I haven't disposed them. That one kinda makes no sense :)
If a class implements IDisposable, then you should always make sure that it gets disposed, without making any assumptions about its implementation.
I think that's the minimal amount of code that you could possibly write that ensures that your objects will be disposed. I see absolutely no problem with it, and it does not distract me at all from the main logic of my code.
If reducing this code is of paramount importance to you, then you can come up with your own replacement (wrapper) of SqlConnection whose child classes do not implement IDisposable and instead get automatically destroyed by the connection once the connection is closed. However, that would be a huge amount of work, and you would lose some precision with regards to when something gets disposed.
Yes. almost anything in ADO.Net is implementing IDisposable, whether that's actually needed - in case of, say, SqlConnection(because of connection pooling) or not - in case of, say, DataTable (Should I Dispose() DataSet and DataTable?
).
The problem is, as noted in the question:
without understanding the code for each class in some detail, it's hard to be certain which ones I can leave out.
I think that this alone is a good enough reason to keep everything inside a using statement - in one word: Encapsulation.
In many words:
It shouldn't be necessary to get intimately familiar, or even remotely familiar for that matter, with the implementation of every class you are working with. You only need to know the surface area - meaning public methods, properties, events, indexers (and fields, if that class has public fields). from the user of a class point of view, anything other then it's public surface area is an implementation detail.
Regarding all the using statements in your code - You can write them only once by creating a method that will accept an SQL statement, an Action<DataSet> and a params array of parameters. Something like this:
void DoStuffWithDataTable(string query, Action<DataTable> action, params SqlParameter[] parameters)
{
using (SqlConnection connection = new SqlConnection(connectionString))
using (SqlCommand command = new SqlCommand(query, connection))
using (SqlDataAdapter adapter = new SqlDataAdapter(command))
using (SqlCommandBuilder builder = new SqlCommandBuilder(adapter))
using (var table = new DataTable())
{
foreach(var param in parameters)
{
command.Parameters.Add(param);
}
// SqlDataAdapter has a fill overload that only needs a data table
adapter.Fill(table);
action();
adapter.Update(table);
}
}
And you use it like that, for all the actions you need to do with your data table:
DoStuffWithDataTable(
"Select...",
table =>
{ // of course, that doesn't have to be a lambda expression here...
foreach (DataRow row in table.Rows) // search whole table
{
if ((int)row["Id"] == 4)
{
row["Value2"] = 12345;
}
else if ((int)row["Id"] == 5)
{
row.Delete();
}
}
},
new SqlParameter[]
{
new SqlParameter("#FirstValue", 2),
new SqlParameter("#SecondValue", 3)
}
);
This way your code is "safe" with regards of disposing any IDisposable, while you have only written the plumbing code for that just once.
I'd like to ask a question. I've been trying to find some information regarding transactions with multiple connections, but I haven't been able to find any good source of information.
Now for what I'm trying to do. I have code that looks like this:
using (var Connection1 = m_Db.CreateConnection())
using (var Connection2 = m_Db.CreateConnection())
{
Connection1.DoRead(..., (IDataReader Reader) =>
{
// Do stuff
Connection2.DoWrite(...);
Connection2.DoRead(..., (IDataReader Reader) =>
{
// Do more stuff
using (var Connection3 = m_Db.CreateConnection())
{
Connection3.DoWrite(...);
Connection3.Commit(); // Is this even right?
}
});
});
Connection1.DoRead(..., (IDataReader) =>
{
// Do yet more stuff
});
Connection1.Commit();
Connection2.Commit();
}
Each CreateConnection creates a new transaction using MySqlConnection::BeginTransaction. The CreateConnection method creates a Connection object which wraps a MySqlConnection. The DoRead function executes some SQL, and disposes the IDataReader when done.
Every Connection will do a Rollback when disposed.
Now for some notes:
I have ONE server with multiple databases.
I am running MySql server with InnoDB databases.
I am doing both reads and writes to these databases.
For performance reasons and not to mess up the database, I am using transactions.
The code is (at least, for now) entirely serial. There are NO concurrent threads. All inserts and queries are done in serial fashion.
I use multiple connections to the database because a read or write is not allowed while another read is in progress (basically the reader object has not yet been disposed).
I basically want every connection to see all changes. So for example, after Connection 3 does some writes, Connection 1 should see those. But the data should be in the transaction and not written to the database (yet).
Now, as for my questions:
Does this work? Will everything ONLY be committed only once the last Commit function is called? Should I use another approach?
Is this right? Is my approach completely and utterly wrong and silly?
Any drawbacks? Especially regarding performance.
Thanks.
Welp, it seems no one knows. But that's okay.
For now, I just went with the method of just using one connection and reading all the results into a List>, then closing the reader, thereby avoiding the problem of having to use multiple connections.
Might there be performance problems? Maybe, but it's better than having to deal with uncertainty and deadlocks.
I have a SQL Server database with 500,000 records in table main. There are also three other tables called child1, child2, and child3. The many to many relationships between child1, child2, child3, and main are implemented via the three relationship tables: main_child1_relationship, main_child2_relationship, and main_child3_relationship. I need to read the records in main, update main, and also insert into the relationship tables new rows as well as insert new records in the child tables. The records in the child tables have uniqueness constraints, so the pseudo-code for the actual calculation (CalculateDetails) would be something like:
for each record in main
{
find its child1 like qualities
for each one of its child1 qualities
{
find the record in child1 that matches that quality
if found
{
add a record to main_child1_relationship to connect the two records
}
else
{
create a new record in child1 for the quality mentioned
add a record to main_child1_relationship to connect the two records
}
}
...repeat the above for child2
...repeat the above for child3
}
This works fine as a single threaded app. But it is too slow. The processing in C# is pretty heavy duty and takes too long. I want to turn this into a multi-threaded app.
What is the best way to do this? We are using Linq to Sql.
So far my approach has been to create a new DataContext object for each batch of records from main and use ThreadPool.QueueUserWorkItem to process it. However these batches are stepping on each other's toes because one thread adds a record and then the next thread tries to add the same one and ... I am getting all kinds of interesting SQL Server dead locks.
Here is the code:
int skip = 0;
List<int> thisBatch;
Queue<List<int>> allBatches = new Queue<List<int>>();
do
{
thisBatch = allIds
.Skip(skip)
.Take(numberOfRecordsToPullFromDBAtATime).ToList();
allBatches.Enqueue(thisBatch);
skip += numberOfRecordsToPullFromDBAtATime;
} while (thisBatch.Count() > 0);
while (allBatches.Count() > 0)
{
RRDataContext rrdc = new RRDataContext();
var currentBatch = allBatches.Dequeue();
lock (locker)
{
runningTasks++;
}
System.Threading.ThreadPool.QueueUserWorkItem(x =>
ProcessBatch(currentBatch, rrdc));
lock (locker)
{
while (runningTasks > MAX_NUMBER_OF_THREADS)
{
Monitor.Wait(locker);
UpdateGUI();
}
}
}
And here is ProcessBatch:
private static void ProcessBatch(
List<int> currentBatch, RRDataContext rrdc)
{
var topRecords = GetTopRecords(rrdc, currentBatch);
CalculateDetails(rrdc, topRecords);
rrdc.Dispose();
lock (locker)
{
runningTasks--;
Monitor.Pulse(locker);
};
}
And
private static List<Record> GetTopRecords(RecipeRelationshipsDataContext rrdc,
List<int> thisBatch)
{
List<Record> topRecords;
topRecords = rrdc.Records
.Where(x => thisBatch.Contains(x.Id))
.OrderBy(x => x.OrderByMe).ToList();
return topRecords;
}
CalculateDetails is best explained by the pseudo-code at the top.
I think there must be a better way to do this. Please help. Many thanks!
Here's my take on the problem:
When using multiple threads to insert/update/query data in SQL Server, or any database, then deadlocks are a fact of life. You have to assume they will occur and handle them appropriately.
That's not so say we shouldn't attempt to limit the occurence of deadlocks. However, it's easy to read up on the basic causes of deadlocks and take steps to prevent them, but SQL Server will always surprise you :-)
Some reason for deadlocks:
Too many threads - try to limit the number of threads to a minimum, but of course we want more threads for maximum performance.
Not enough indexes. If selects and updates aren't selective enough SQL will take out larger range locks than is healthy. Try to specify appropriate indexes.
Too many indexes. Updating indexes causes deadlocks, so try to reduce indexes to the minimum required.
Transaction isolational level too high. The default isolation level when using .NET is 'Serializable', whereas the default using SQL Server is 'Read Committed'. Reducing the isolation level can help a lot (if appropriate of course).
This is how I might tackle your problem:
I wouldn't roll my own threading solution, I would use the TaskParallel library. My main method would look something like this:
using (var dc = new TestDataContext())
{
// Get all the ids of interest.
// I assume you mark successfully updated rows in some way
// in the update transaction.
List<int> ids = dc.TestItems.Where(...).Select(item => item.Id).ToList();
var problematicIds = new List<ErrorType>();
// Either allow the TaskParallel library to select what it considers
// as the optimum degree of parallelism by omitting the
// ParallelOptions parameter, or specify what you want.
Parallel.ForEach(ids, new ParallelOptions {MaxDegreeOfParallelism = 8},
id => CalculateDetails(id, problematicIds));
}
Execute the CalculateDetails method with retries for deadlock failures
private static void CalculateDetails(int id, List<ErrorType> problematicIds)
{
try
{
// Handle deadlocks
DeadlockRetryHelper.Execute(() => CalculateDetails(id));
}
catch (Exception e)
{
// Too many deadlock retries (or other exception).
// Record so we can diagnose problem or retry later
problematicIds.Add(new ErrorType(id, e));
}
}
The core CalculateDetails method
private static void CalculateDetails(int id)
{
// Creating a new DeviceContext is not expensive.
// No need to create outside of this method.
using (var dc = new TestDataContext())
{
// TODO: adjust IsolationLevel to minimize deadlocks
// If you don't need to change the isolation level
// then you can remove the TransactionScope altogether
using (var scope = new TransactionScope(
TransactionScopeOption.Required,
new TransactionOptions {IsolationLevel = IsolationLevel.Serializable}))
{
TestItem item = dc.TestItems.Single(i => i.Id == id);
// work done here
dc.SubmitChanges();
scope.Complete();
}
}
}
And of course my implementation of a deadlock retry helper
public static class DeadlockRetryHelper
{
private const int MaxRetries = 4;
private const int SqlDeadlock = 1205;
public static void Execute(Action action, int maxRetries = MaxRetries)
{
if (HasAmbientTransaction())
{
// Deadlock blows out containing transaction
// so no point retrying if already in tx.
action();
}
int retries = 0;
while (retries < maxRetries)
{
try
{
action();
return;
}
catch (Exception e)
{
if (IsSqlDeadlock(e))
{
retries++;
// Delay subsequent retries - not sure if this helps or not
Thread.Sleep(100 * retries);
}
else
{
throw;
}
}
}
action();
}
private static bool HasAmbientTransaction()
{
return Transaction.Current != null;
}
private static bool IsSqlDeadlock(Exception exception)
{
if (exception == null)
{
return false;
}
var sqlException = exception as SqlException;
if (sqlException != null && sqlException.Number == SqlDeadlock)
{
return true;
}
if (exception.InnerException != null)
{
return IsSqlDeadlock(exception.InnerException);
}
return false;
}
}
One further possibility is to use a partitioning strategy
If your tables can naturally be partitioned into several distinct sets of data, then you can either use SQL Server partitioned tables and indexes, or you could manually split your existing tables into several sets of tables. I would recommend using SQL Server's partitioning, since the second option would be messy. Also built-in partitioning is only available on SQL Enterprise Edition.
If partitioning is possible for you, you could choose a partion scheme that broke you data in lets say 8 distinct sets. Now you could use your original single threaded code, but have 8 threads each targetting a separate partition. Now there won't be any (or at least a minimum number of) deadlocks.
I hope that makes sense.
Overview
The root of your problem is that the L2S DataContext, like the Entity Framework's ObjectContext, is not thread-safe. As explained in this MSDN forum exchange, support for asynchronous operations in the .NET ORM solutions is still pending as of .NET 4.0; you'll have to roll your own solution, which as you've discovered isn't always easy to do when your framework assume single-threadedness.
I'll take this opportunity to note that L2S is built on top of ADO.NET, which itself fully supports asynchronous operation - personally, I would much prefer to deal directly with that lower layer and write the SQL myself, just to make sure that I fully understood what was transpiring over the network.
SQL Server Solution?
That being said, I have to ask - must this be a C# solution? If you can compose your solution out of a set of insert/update statements, you can just send over the SQL directly and your threading and performance problems vanish.* It seems to me that your problems are related not to the actual data transformations to be made, but center around making them performant from .NET. If .NET is removed from the equation, your task becomes simpler. After all, the best solution is often the one that has you writing the smallest amount of code, right? ;)
Even if your update/insert logic can't be expressed in a strictly set-relational manner, SQL Server does have a built-in mechanism for iterating over records and performing logic - while they are justly maligned for many use cases, cursors may in fact be appropriate for your task.
If this is a task that has to happen repeatedly, you could benefit greatly from coding it as a stored procedure.
*of course, long-running SQL brings its own problems like lock escalation and index usage that you'll have to contend with.
C# Solution
Of course, it may be that doing this in SQL is out of the question - maybe your code's decisions depend on data that comes from elsewhere, for example, or maybe your project has a strict 'no-SQL-allowed' convention. You mention some typical multithreading bugs, but without seeing your code I can't really be helpful with them specifically.
Doing this from C# is obviously viable, but you need to deal with the fact that a fixed amount of latency will exist for each and every call you make. You can mitigate the effects of network latency by using pooled connections, enabling multiple active result sets, and using the asynchronous Begin/End methods for executing your queries. Even with all of those, you will still have to accept that there is a cost to shipping data from SQL Server to your application.
One of the best ways to keep your code from stepping all over itself is to avoid sharing mutable data between threads as much as possible. That would mean not sharing the same DataContext across multiple threads. The next best approach is to lock critical sections of code that touch the shared data - lock blocks around all DataContext access, from the first read to the final write. That approach might just obviate the benefits of multithreading entirely; you can likely make your locking more fine-grained, but be ye warned that this is a path of pain.
Far better is to keep your operations separate from each other entirely. If you can partition your logic across 'main' records, that's ideal - that is to say, as long as there aren't relationships between the various child tables, and as long as one record in 'main' doesn't have implications for another, you can split your operations across multiple threads like this:
private IList<int> GetMainIds()
{
using (var context = new MyDataContext())
return context.Main.Select(m => m.Id).ToList();
}
private void FixUpSingleRecord(int mainRecordId)
{
using (var localContext = new MyDataContext())
{
var main = localContext.Main.FirstOrDefault(m => m.Id == mainRecordId);
if (main == null)
return;
foreach (var childOneQuality in main.ChildOneQualities)
{
// If child one is not found, create it
// Create the relationship if needed
}
// Repeat for ChildTwo and ChildThree
localContext.SaveChanges();
}
}
public void FixUpMain()
{
var ids = GetMainIds();
foreach (var id in ids)
{
var localId = id; // Avoid closing over an iteration member
ThreadPool.QueueUserWorkItem(delegate { FixUpSingleRecord(id) });
}
}
Obviously this is as much a toy example as the pseudocode in your question, but hopefully it gets you thinking about how to scope your tasks such that there is no (or minimal) shared state between them. That, I think, will be the key to a correct C# solution.
EDIT Responding to updates and comments
If you're seeing data consistency issues, I'd advise enforcing transaction semantics - you can do this by using a System.Transactions.TransactionScope (add a reference to System.Transactions). Alternately, you might be able to do this on an ADO.NET level by accessing the inner connection and calling BeginTransaction on it (or whatever the DataConnection method is called).
You also mention deadlocks. That you're battling SQL Server deadlocks indicates that the actual SQL queries are stepping on each other's toes. Without knowing what is actually being sent over the wire, it's difficult to say in detail what's happening and how to fix it. Suffice to say that SQL deadlocks result from SQL queries, and not necessarily from C# threading constructs - you need to examine what exactly is going over the wire. My gut tells me that if each 'main' record is truly independent of the others, then there shouldn't be a need for row and table locks, and that Linq to SQL is likely the culprit here.
You can get a dump of the raw SQL emitted by L2S in your code by setting the DataContext.Log property to something e.g. Console.Out. Though I've never personally used it, I understand the LINQPad offers L2S facilities and you may be able to get at the SQL there, too.
SQL Server Management Studio will get you the rest of the way there - using the Activity Monitor, you can watch for lock escalation in real time. Using the Query Analyzer, you can get a view of exactly how SQL Server will execute your queries. With those, you should be able to get a good notion of what your code is doing server-side, and in turn how to go about fixing it.
I would recommend moving all the XML processing into the SQL server, too. Not only will all your deadlocks disappear, but you will see such a boost in performance that you will never want to go back.
It will be best explained by an example. In this example I assume that the XML blob already is going into your main table (I call it closet). I will assume the following schema:
CREATE TABLE closet (id int PRIMARY KEY, xmldoc ntext)
CREATE TABLE shoe(id int PRIMARY KEY IDENTITY, color nvarchar(20))
CREATE TABLE closet_shoe_relationship (
closet_id int REFERENCES closet(id),
shoe_id int REFERENCES shoe(id)
)
And I expect that your data (main table only) initially looks like this:
INSERT INTO closet(id, xmldoc) VALUES (1, '<ROOT><shoe><color>blue</color></shoe></ROOT>')
INSERT INTO closet(id, xmldoc) VALUES (2, '<ROOT><shoe><color>red</color></shoe></ROOT>')
Then your whole task is as simple as the following:
INSERT INTO shoe(color) SELECT DISTINCT CAST(CAST(xmldoc AS xml).query('//shoe/color/text()') AS nvarchar) AS color from closet
INSERT INTO closet_shoe_relationship(closet_id, shoe_id) SELECT closet.id, shoe.id FROM shoe JOIN closet ON CAST(CAST(closet.xmldoc AS xml).query('//shoe/color/text()') AS nvarchar) = shoe.color
But given that you will do a lot of similar processing, you can make your life easier by declaring your main blob as XML type, and further simplifying to this:
INSERT INTO shoe(color)
SELECT DISTINCT CAST(xmldoc.query('//shoe/color/text()') AS nvarchar)
FROM closet
INSERT INTO closet_shoe_relationship(closet_id, shoe_id)
SELECT closet.id, shoe.id
FROM shoe JOIN closet
ON CAST(xmldoc.query('//shoe/color/text()') AS nvarchar) = shoe.color
There are additional performance optimizations possible, like pre-computing repeatedly invoked Xpath results in a temporary or permanent table, or converting the initial population of the main table into a BULK INSERT, but I don't expect that you will really need those to succeed.
sql server deadlocks are normal & to be expected in this type of scenario - MS's recommendation is that these should be handled on the application side rather than the db side.
However if you do need to make sure that a stored procedure is only called once then you can use a sql mutex lock using sp_getapplock. Here's an example of how to implement this
BEGIN TRAN
DECLARE #mutex_result int;
EXEC #mutex_result = sp_getapplock #Resource = 'CheckSetFileTransferLock',
#LockMode = 'Exclusive';
IF ( #mutex_result < 0)
BEGIN
ROLLBACK TRAN
END
-- do some stuff
EXEC #mutex_result = sp_releaseapplock #Resource = 'CheckSetFileTransferLock'
COMMIT TRAN
This may be obvious, but looping through each tuple and doing your work in your servlet container involves a lot of per-record overhead.
If possible, move some or all of that processing to the SQL server by rewriting your logic as one or more stored procedures.
If
You don't have a lot of time to spend on this issue and need it to fix it right now
You are sure that your code is done so that different thread will NOT modify the same record
You are not afraid
Then ... you can just add "WITH NO LOCK" to your queries so that MSSQL doesn't apply the locks.
To use with caution :)
But anyway, you didn't tell us where the time is lost (in the mono-threaded version). Because if it's in the code, I'll advise you to write everything in the DB directly to avoid continuous data exchange. If it's in the db, I'll advise to check index (too much ?), i/o, cpu etc.
hello there i have a situation the entities are customerManager warehouse customer and suppliers
my goals are :
the warehouse is singletone and open db in runtime.
the customerManager manage customers as threads who query the warehouse and update it (after buying staff).
when one of the items in the warehouse is run out of we ask supplier in a diffrent thread to supply it for us ' while the supplier does his thing (let's assume it's something like 5 seconds ) the customer waits(in queue) and invoked when the supplier method returned true (let's assume it return true always)..
so my questions are about 3 things :
design - should the customerManager holds inside him the warehouse and the customers ? it seems like the best soultion, does someone recommend otherwise?(c# design topic )
how many threads can go to the db at once ? can a db handle it by himself so i wont need to do it myself ? should I hold for them SqlCommand(s)? should I use dataset or datareader? in other words can someone advice me how to do it ?
should i do for 10 threads :
for (int i = 0 ; i < 10 ; i++)
{
SqlConnection sqlConnection = new SqlConnection(r_ConnectionString);
sqlConnection.Open();
sqlConnection.Close();
}
...so the conection pool would be open for 10 connectiones ?
** database ADO.NET ** topic
how should the threads wait in a queue?(in order to wait to the supplier method to awake them ) how to wake them ? is there a good solution in c# for that ? (c# threads topic)
I think the question is too long but otherwise would be too out of context so I would appreciate if you would write in the the title what question you want to reference.
thank you .
Your worker threads could be fed work via BlockingCollection or ConcurrentQueue.
For connection management you are better off doing this:
using (SqlConnection conn = new SqlConnection(...))
{
}
since this ensures Dispose() gets called for you. As noted in other feedback, you can do this without worrying about actual conn count to the DB since ADO.Net manages a pool of physical connections behind the scenes.
Nobody can tell you whether DataSet or DataReader works best, it depends on your usage of the data once it's loaded. DataReader provides a sequential read of each record in turn, while DataSet provides an in-memory cache of underlying DB data and in that sense is a 'higher-level' abstraction.
SqlConnections are implemented in .net using a connection pool. You do not need to worry about the management of the connections themselves. The only requirement on you is that adfter you open them you call .close. .net will manage the rest for you in an efficient way.
If you wanted to run multiple queries simultaneously then you can call the sqlcommand wtih begin invoke and end invoke.
By using both of these you cna work at a level that does not require you to mange the threads whilst gettting a multi threaded behaviour.
However you shuold read up on ADO.Net because a lot of what you are talking about is unnecessary when you kow how it works.
as for dataset or datareader that depends on your problem, Dataset is a very heavy object though, datareaders are lightweight and fast that allow you to populate a collection fairly easily.
I prefer using linq2sql though or entity framework. ADO.Net is kinda fragile because you ahve to do a lot of casting on data adn manual mapping onto objects that is prone to errors at run time rather than compile time.
Background: I've got a bunch of strings that I'm getting from a database, and I want to return them. Traditionally, it would be something like this:
public List<string> GetStuff(string connectionString)
{
List<string> categoryList = new List<string>();
using (SqlConnection sqlConnection = new SqlConnection(connectionString))
{
string commandText = "GetStuff";
using (SqlCommand sqlCommand = new SqlCommand(commandText, sqlConnection))
{
sqlCommand.CommandType = CommandType.StoredProcedure;
sqlConnection.Open();
SqlDataReader sqlDataReader = sqlCommand.ExecuteReader();
while (sqlDataReader.Read())
{
categoryList.Add(sqlDataReader["myImportantColumn"].ToString());
}
}
}
return categoryList;
}
But then I figure the consumer is going to want to iterate through the items and doesn't care about much else, and I'd like to not box myself in to a List, per se, so if I return an IEnumerable everything is good/flexible. So I was thinking I could use a "yield return" type design to handle this...something like this:
public IEnumerable<string> GetStuff(string connectionString)
{
using (SqlConnection sqlConnection = new SqlConnection(connectionString))
{
string commandText = "GetStuff";
using (SqlCommand sqlCommand = new SqlCommand(commandText, sqlConnection))
{
sqlCommand.CommandType = CommandType.StoredProcedure;
sqlConnection.Open();
SqlDataReader sqlDataReader = sqlCommand.ExecuteReader();
while (sqlDataReader.Read())
{
yield return sqlDataReader["myImportantColumn"].ToString();
}
}
}
}
But now that I'm reading a bit more about yield (on sites like this...msdn didn't seem to mention this), it's apparently a lazy evaluator, that keeps the state of the populator around, in anticipation of someone asking for the next value, and then only running it until it returns the next value.
This seems fine in most cases, but with a DB call, this sounds a bit dicey. As a somewhat contrived example, if someone asks for an IEnumerable from that I'm populating from a DB call, gets through half of it, and then gets stuck in a loop...as far as I can see my DB connection is going to stay open forever.
Sounds like asking for trouble in some cases if the iterator doesn't finish...am I missing something?
It's a balancing act: do you want to force all the data into memory immediately so you can free up the connection, or do you want to benefit from streaming the data, at the cost of tying up the connection for all that time?
The way I look at it, that decision should potentially be up to the caller, who knows more about what they want to do. If you write the code using an iterator block, the caller can very easily turned that streaming form into a fully-buffered form:
List<string> stuff = new List<string>(GetStuff(connectionString));
If, on the other hand, you do the buffering yourself, there's no way the caller can go back to a streaming model.
So I'd probably use the streaming model and say explicitly in the documentation what it does, and advise the caller to decide appropriately. You might even want to provide a helper method to basically call the streamed version and convert it into a list.
Of course, if you don't trust your callers to make the appropriate decision, and you have good reason to believe that they'll never really want to stream the data (e.g. it's never going to return much anyway) then go for the list approach. Either way, document it - it could very well affect how the return value is used.
Another option for dealing with large amounts of data is to use batches, of course - that's thinking somewhat away from the original question, but it's a different approach to consider in the situation where streaming would normally be attractive.
You're not always unsafe with the IEnumerable. If you leave the framework call GetEnumerator (which is what most of the people will do), then you're safe. Basically, you're as safe as the carefullness of the code using your method:
class Program
{
static void Main(string[] args)
{
// safe
var firstOnly = GetList().First();
// safe
foreach (var item in GetList())
{
if(item == "2")
break;
}
// safe
using (var enumerator = GetList().GetEnumerator())
{
for (int i = 0; i < 2; i++)
{
enumerator.MoveNext();
}
}
// unsafe
var enumerator2 = GetList().GetEnumerator();
for (int i = 0; i < 2; i++)
{
enumerator2.MoveNext();
}
}
static IEnumerable<string> GetList()
{
using (new Test())
{
yield return "1";
yield return "2";
yield return "3";
}
}
}
class Test : IDisposable
{
public void Dispose()
{
Console.WriteLine("dispose called");
}
}
Whether you can affort to leave the database connection open or not depends on your architecture as well. If the caller participates in an transaction (and your connection is auto enlisted), then the connection will be kept open by the framework anyway.
Another advantage of yield is (when using a server-side cursor), your code doesn't have to read all data (example: 1,000 items) from the database, if your consumer wants to get out of the loop earlier (example: after the 10th item). This can speed up querying data. Especially in an Oracle environment, where server-side cursors are the common way to retrieve data.
You are not missing anything. Your sample shows how NOT to use yield return. Add the items to a list, close the connection, and return the list. Your method signature can still return IEnumerable.
Edit: That said, Jon has a point (so surprised!): there are rare occasions where streaming is actually the best thing to do from a performance perspective. After all, if it's 100,000 (1,000,000? 10,000,000?) rows we're talking about here, you don't want to be loading that all into memory first.
As an aside - note that the IEnumerable<T> approach is essentially what the LINQ providers (LINQ-to-SQL, LINQ-to-Entities) do for a living. The approach has advantages, as Jon says. However, there are definite problems too - in particular (for me) in terms of (the combination of) separation | abstraction.
What I mean here is that:
in a MVC scenario (for example) you want your "get data" step to actually get data, so that you can test it works at the controller, not the view (without having to remember to call .ToList() etc)
you can't guarantee that another DAL implementation will be able to stream data (for example, a POX/WSE/SOAP call can't usually stream records); and you don't necessarily want to make the behaviour confusingly different (i.e. connection still open during iteration with one implementation, and closed for another)
This ties in a bit with my thoughts here: Pragmatic LINQ.
But I should stress - there are definitely times when the streaming is highly desirable. It isn't a simple "always vs never" thing...
Slightly more concise way to force evaluation of iterator:
using System.Linq;
//...
var stuff = GetStuff(connectionString).ToList();
No, you are on the right path... the yield will lock the reader... you can test it doing another database call while calling the IEnumerable
The only way this would cause problems is if the caller abuses the protocol of IEnumerable<T>. The correct way to use it is to call Dispose on it when it is no longer needed.
The implementation generated by yield return takes the Dispose call as a signal to execute any open finally blocks, which in your example will call Dispose on the objects you've created in the using statements.
There are a number of language features (in particular foreach) which make it very easy to use IEnumerable<T> correctly.
You could always use a separate thread to buffer the data (perhaps to a queue) while also doing a yeild to return the data. When the user requests data (returned via a yeild), an item is removed from the queue. Data is also being continuously added to the queue via the separate thread. That way, if the user requests the data fast enough, the queue is never very full and you do not have to worry about memory issues. If they don't, then the queue will fill up, which may not be so bad. If there is some sort of limitation you would like to impose on memory, you could enforce a maximum queue size (at which point the other thread would wait for items to be removed before adding more to the queue). Naturally, you will want to make sure you handle resources (i.e., the queue) correctly between the two threads.
As an alternative, you could force the user to pass in a boolean to indicate whether or not the data should be buffered. If true, the data is buffered and the connection is closed as soon as possible. If false, the data is not buffered and the database connection stays open as long as the user needs it to be. Having a boolean parameter forces the user to make the choice, which ensures they know about the issue.
I've bumped into this wall a few times. SQL database queries are not easily streamable like files. Instead, query only as much as you think you'll need and return it as whatever container you want (IList<>, DataTable, etc.). IEnumerable won't help you here.
What you can do is use a SqlDataAdapter instead and fill a DataTable. Something like this:
public IEnumerable<string> GetStuff(string connectionString)
{
DataTable table = new DataTable();
using (SqlConnection sqlConnection = new SqlConnection(connectionString))
{
string commandText = "GetStuff";
using (SqlCommand sqlCommand = new SqlCommand(commandText, sqlConnection))
{
sqlCommand.CommandType = CommandType.StoredProcedure;
SqlDataAdapter dataAdapter = new SqlDataAdapter(sqlCommand);
dataAdapter.Fill(table);
}
}
foreach(DataRow row in table.Rows)
{
yield return row["myImportantColumn"].ToString();
}
}
This way, you're querying everything in one shot, and closing the connection immediately, yet you're still lazily iterating the result. Furthermore, the caller of this method can't cast the result to a List and do something they shouldn't be doing.
Dont use yield here. your sample is fine.