The work I do involves downloading huge amounts of data into memory from a SQL Server database. To accomplish this, we have custom dataset definitions that we load using a SqlDataReader, then iterate through the DataTable, build each row into an object, and usually package those objects into a massive dictionary.
The amount of data we are using is large enough that sometimes it cannot fit into a single DataTable, which has a memory cap. The dictionaries have grown large enough to surpass 8 GB of system memory in the most extreme cases. I was given the task of fixing the OutOfMemoryExceptions thrown when the DataTables overflowed. I did this by implementing a batch-processing method that seemed to conflict with how DataTables are meant to be used, but it worked for the time being.
I now have the task of further reducing the memory requirements of this process. My idea is to create a generic class implementing IEnumerator<T> that takes a SqlDataReader and essentially uses the reader as the collection it enumerates. The MoveNext() method would advance the reader, and the Current property would return the typed object, built from the reader's current row by a builder method.
My question: is this a feasible idea? I've never heard of anything like it and can't find anything similar online.
Also, logistically: How would I call the specific builder function that the type declaration demands when the Current property is called?
I'm open to criticism and chastising for dreaming up a silly idea. I'm most interested in finding the best practice for approaching the overall goal.
Seems reasonably sensible, and actually pretty straightforward using an iterator block:
private static IEnumerable<Foo> WrapReader(SqlDataReader reader)
{
    while (reader.Read())
    {
        Foo foo = ...; // TODO: Build a Foo from the reader
        yield return foo;
    }
}
Then you can use it with:
using (SqlDataReader reader = ...)
{
    foreach (Foo foo in WrapReader(reader))
    {
        ...
    }
}
You can even use LINQ to Objects if you're careful:
using (SqlDataReader reader = ...)
{
    var query = from foo in WrapReader(reader)
                where foo.Price > 100
                select foo.Name;
    // Use the query...
}
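To address the "which builder does Current call" part of the question, the wrapper can be made generic, with the caller supplying the builder as a delegate. A minimal sketch (this generic overload is an illustration, not part of the original answer):

private static IEnumerable<T> WrapReader<T>(SqlDataReader reader, Func<IDataRecord, T> builder)
{
    // The builder delegate turns the reader's current row into a T,
    // so no per-type dispatch is needed inside the iterator.
    while (reader.Read())
    {
        yield return builder(reader);
    }
}

Usage would look something like WrapReader(reader, r => new Foo((int)r["Id"], (string)r["Name"])).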
Recently I started to investigate the Pipeline pattern, also known as Pipes and Filters. I thought it would be a good way to structure code and applications that simply process data.
I used this article as a base for my pipeline and step implementations (but this is not so important).
As usual, the blog covers a simple scenario, but in my case I need (or maybe not) to work on IDisposable objects which might travel through the process.
For instance, Streams.
Let's consider a simple pipeline which should load a CSV file and insert its rows into some DB. In a simple abstraction we could implement such functions:
Stream Step1(string filePath)
IEnumerable<RowType> Step2(Stream stream)
bool Step3(IEnumerable<RowType> data)
Now my question is whether this is a good approach, because if we implement it as step-after-step processing, the Stream object leaves the first step and it is easy to run into a memory leak.
I know some might say that Step1 should do the loading and deserializing of the data, but we are considering a simple process; we might have more complex ones where passing a Stream makes more sense.
I am wondering how I can implement such pipelines to avoid memory leaks while also avoiding loading the whole file into a MemoryStream (which would be safer). Should I somehow wrap each step in try..catch blocks to call Dispose() if something goes wrong? Or should I pass all IDisposable resources into a Pipeline object wrapped in a using block, so that all resources produced during processing are disposed correctly?
If it's planned to be used like Step3( Step2( Step1(filePath) ) ), then
Step2 should dispose the stream. It can use C#'s yield return feature, which creates an implementation of IEnumerator<> underneath; that implementation also implements IDisposable, which allows "subscribing" to the "event" of the enumeration finishing and calling Stream.Dispose at that point. E.g.:
IEnumerable<RowType> Step2(Stream stream)
{
    using (stream)
    using (StreamReader sr = new StreamReader(stream))
    {
        while (!sr.EndOfStream)
        {
            yield return Parse(sr.ReadLine()); // yield return implements IEnumerator<>
        }
    } // the finally part of the using will be called from IEnumerator<>.Dispose()
}
Then if Step3 either uses LINQ
bool Step3(IEnumerable<RowType> data) => data.Any(item => SomeDecisionLogic(item));
or foreach
bool Step3(IEnumerable<RowType> data)
{
    foreach (var item in data)
    {
        if (SomeDecisionLogic(item))
            return true;
    }
    return false;
}
for enumerating, both of them guarantee to call IEnumerator<>.Dispose() (ref1, ECMA-334 C# Spec, ch. 13.9.5), which will call Stream.Dispose.
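Putting it together, a minimal sketch of the composed call (Step1 simply opening the file, and the sample file name, are assumptions for illustration):

// Step1 just opens the file; ownership passes to Step2, which disposes it.
Stream Step1(string filePath) => File.OpenRead(filePath);

// Whether Step3 finishes the enumeration or stops early, foreach/LINQ
// dispose the enumerator, which runs Step2's using blocks.
bool result = Step3(Step2(Step1("data.csv")));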
IMO it's worth having a pipeline if the interaction is between at least 2 different systems and if the work can be executed in parallel. Otherwise it's more overhead.
In this case there are 2 systems: the file system where the CSV file is and the database. I think the pipeline should have at least 2 steps that run in parallel:
IEnumerable<Row> ReadFromCsv(string csvFilePath)
void UpdateDatabase(IEnumerable<Row> rows)
In this case it should be clear that the Stream is bound to ReadFromCsv.
IEnumerable<Row> ReadFromCsv(string path)
{
    using (var stream = File.OpenRead(path))
    {
        var lines = GetLines(stream); // yields one line at a time, not all at once
        foreach (var line in lines) yield return GetRow(line);
    }
}
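The GetLines helper isn't shown in the answer; a hedged sketch of what it might look like:

static IEnumerable<string> GetLines(Stream stream)
{
    // Yields lines lazily from an already-open stream. It deliberately does
    // not dispose the reader: ReadFromCsv's using block owns the stream.
    var reader = new StreamReader(stream);
    string line;
    while ((line = reader.ReadLine()) != null)
        yield return line;
}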
I guess the scope depends on the steps - which in turn depend on the way you design the pipeline based on your needs.
Here is the sample code that I am using to fetch data from the database:
On the DAO layer:
public IEnumerable<IDataRecord> GetDATA(ICommonSearchCriteriaDto commonSearchCriteriaDto)
{
    using (DbContext)
    {
        DbDataReader reader = DbContext.GetReader("ABC_PACKAGE.GET_DATA", oracleParams.ToArray(), CommandType.StoredProcedure);
        while (reader.Read())
        {
            yield return reader;
        }
    }
}
On the BO layer I am calling the above method like:
List<IGridDataDto> GridDataDtos = MapMultiple(_costDriversGraphDao.GetGraphData(commonSearchCriteriaDto)).ToList();
On the mapper layer the MapMultiple method is defined like:
public IGridDataDto MapSingle(IDataRecord dataRecord)
{
    return new GridDataDto
    {
        Code = Convert.ToString(dataRecord["Code"]),
        Name = Convert.ToString(dataRecord["Name"]),
        Type = Convert.ToString(dataRecord["Type"])
    };
}

public IEnumerable<IGridDataDto> MapMultiple(IEnumerable<IDataRecord> dataRecords)
{
    return dataRecords.Select(MapSingle);
}
The above code works well, but I am wondering about two concerns with it.
How long will the data reader's connection stay open?
Considering only performance, is it a good idea to use 'yield return' instead of adding each record to a list and returning the whole list?
Your code doesn't show where you open/close the connection, but the reader here will only be open while you are iterating the data: deferred execution, etc. The only bit of your code that iterates is the .ToList(), so it'll be fine. In the more general case, yes: the reader will be open for however long you take to iterate it. If you do a .ToList(), that will be minimal; if you do a foreach and (for every item) make an external HTTP request and wait 20 seconds, then yes, it will be open for longer.
Both have their uses; the non-buffered approach is great for huge results that you want to process as a stream, without ever having to load them into a single in-memory list (or even have all of them in memory at once). Returning a list lets the connection close quickly and makes it easy to avoid accidentally using the connection while it already has an open reader, but it is not ideal for large results.
If you return an iterator block, the caller can decide what is sane; if you always return a list, they don't have much option. A third way (that we do in dapper) is to make the choice theirs; we have an optional bool parameter which defaults to "return a list", but which the caller can change to indicate "return an iterator block"; basically:
bool buffered = true
in the parameters, and:
var data = QueryInternal<T>(...blah...);
return buffered ? data.ToList() : data;
in the implementation. In most cases, returning a list is perfectly reasonable and avoids a lot of problems, hence we make that the default.
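A hedged sketch of that pattern (Query and QueryInternal are illustrative names following the description above, not dapper's actual source):

// The caller chooses: a buffered list (default, reader closed quickly)
// or a lazy iterator (streams rows, reader open while iterating).
public IEnumerable<T> Query<T>(string sql, bool buffered = true)
{
    IEnumerable<T> data = QueryInternal<T>(sql); // lazy iterator over the reader
    return buffered ? data.ToList() : data;
}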
How long will the data reader's connection stay open?
The connection will remain open until the reader is disposed, which means it will be open until the iteration is over.
Considering only performance, is it a good idea to use yield return instead of adding each record to a list and returning the whole list?
This depends on several factors:
If you are not planning to fetch the entire result, yield return will save on the amount of data transferred over the network
If you are not planning to convert the returned data to objects, or if multiple rows are used to create a single object, yield return will reduce the memory used at the peak usage point of your program
If you plan to iterate over the entire result set over a short period of time, there will be no performance penalty for using yield return. If the iteration is going to last a significant amount of time on multiple concurrent threads, you may exceed the limit on open cursors on the RDBMS side.
This answer ignores flaws in the shown implementation and covers the general idea.
It is a tradeoff: it is impossible to tell whether it is a good idea without knowing the constraints of your system - the amount of data you expect to get, the memory consumption you are willing to accept, the expected load on the database, etc.
Here is a piece of code:
void MyFunc(List<MyObj> objects)
{
    MyFunc1(objects);
    foreach (MyObj obj in objects.Where(obj1 => obj1.Good))
    {
        // Do Action With Good Object
    }
}

void MyFunc1(List<MyObj> objects)
{
    int iGoodCount = objects.Where(obj1 => obj1.Good).Count();
    BeHappy(iGoodCount);
    // do other stuff with 'objects' collection
}
Here we see that the collection is analyzed twice, and each time the value of the 'Good' property is checked for each member: the first time when calculating the count of good objects, the second when iterating through all good objects.
It is desirable to optimize that, and here is a straightforward solution:
before the call to MyFunc1, create an additional temporary collection of good objects only (goodObjects; it can be an IEnumerable);
get the count of these objects and pass it as an additional parameter to MyFunc1;
in the 'MyFunc' method, iterate not through 'objects.Where(...)' but through the 'goodObjects' collection.
Not too bad an approach (as far as I can see), but an additional variable has to be created in the 'MyFunc' method and an additional parameter has to be passed.
Question: is there any out-of-the-box LINQ functionality that allows caching during the first Where().Count(), remembering the processed collection and using it in the next iteration?
Any thoughts are welcome.
Thanks.
No, LINQ queries are not optimized in this way (what you describe is similar to the way SQL Server reuses a query execution plan). LINQ does not (and, for practical purposes, cannot) know enough about your objects in order to optimize this way. As far as it knows, your collection has changed (or is entirely different) between the two calls.
You're obviously aware of the ability to persist your query into a new List<T>, but apart from that there's really nothing that I can recommend without knowing more about your class and where else MyFunc is used.
As long as MyFunc1 doesn't need to modify the list by adding/removing objects, this will work.
void MyFunc(List<MyObj> objects)
{
    ILookup<bool, MyObj> objLookup = objects.ToLookup(obj1 => obj1.Good);
    MyFunc1(objLookup[true]);
    foreach (MyObj obj in objLookup[true])
    {
        //..
    }
}

void MyFunc1(IEnumerable<MyObj> objects)
{
    //..
}
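If MyFunc1 still needs the count, it can take it from the cached bucket without re-running the predicate; a small sketch:

void MyFunc1(IEnumerable<MyObj> goodObjects)
{
    // Counts the lookup's bucket; the 'Good' predicate is not evaluated
    // again, because ToLookup materialized the grouping once.
    int iGoodCount = goodObjects.Count();
    BeHappy(iGoodCount);
}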
This code:
IEnumerable<string> lines = File.ReadLines("file path");
foreach (var line in lines)
{
    Console.WriteLine(line);
}
foreach (var line in lines)
{
    Console.WriteLine(line);
}
throws an ObjectDisposedException : {"Cannot read from a closed TextReader."} if the second foreach is executed.
It seems that the iterator object returned from File.ReadLines(..) can't be enumerated more than once. You have to obtain a new iterator object by calling File.ReadLines(..) and then use it to iterate.
If I replace File.ReadLines(..) with my own version (parameters are not validated; it's just an example):
public static IEnumerable<string> MyReadLines(string path)
{
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
it's possible to iterate over the lines of the file more than once.
An investigation using .NET Reflector showed that the implementation of File.ReadLines(..) calls a private File.InternalReadLines(TextReader reader) that creates the actual iterator. The reader passed as a parameter is used in the iterator's MoveNext() method to get the lines of the file, and is disposed when we reach the end of the file. This means that once MoveNext() returns false there is no way to iterate a second time, because the reader is closed; you have to get a new reader by creating a new iterator with the ReadLines(..) method. In my version, a new reader is created in the MoveNext() method each time we start a new iteration.
Is this the expected behavior of the File.ReadLines(..) method?
I find it troubling that it's necessary to call the method again each time before you enumerate the results. You would also have to call the method each time before you iterate the results of a LINQ query that uses it.
I know this is old, but I actually just ran into this while working on some code on a Windows 7 machine. Contrary to what people were saying here, this actually was a bug. See this link.
So the easy fix is to update your .NET Framework. I thought this was worth posting since it was the top search result.
I don't think it's a bug, and I don't think it's unusual -- in fact that's what I'd expect for something like a text file reader to do. IO is an expensive operation, so in general you want to do everything in one pass.
It isn't a bug, but I believe you can use ReadAllLines() to do what you want instead. ReadAllLines creates a string array and pulls all the lines into it, instead of providing a simple enumerator over a stream like ReadLines does.
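For example:

// ReadAllLines buffers the whole file into an array up front,
// so it can be iterated any number of times:
string[] lines = File.ReadAllLines("file path");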
If you need to access the lines twice you can always buffer them into a List<T>:
using System.Linq;

List<string> lines = File.ReadLines("file path").ToList();
foreach (var line in lines)
{
    Console.WriteLine(line);
}
foreach (var line in lines)
{
    Console.WriteLine(line);
}
I don't know if it can be considered a bug or not if it's by design, but I can certainly say two things...
This should be posted on Connect, not Stack Overflow, although they're not going to change it before 4.0 is released. And that usually means they won't ever fix it.
The design of the method certainly appears to be flawed.
You are correct in noting that returning an IEnumerable implies that it should be reusable, yet this one does not guarantee the same results if iterated twice. If it had returned an IEnumerator instead, it would be a different story.
So anyway, I think it's a good find and I think the API is a lousy one to begin with. ReadAllLines and ReadAllText give you a nice convenient way of getting at the entire file but if the caller cares enough about performance to be using a lazy enumerable, they shouldn't be delegating so much responsibility to a static helper method in the first place.
I believe you are confusing an IQueryable with an IEnumerable. Yes, it's true that IQueryable can be treated as an IEnumerable, but they are not exactly the same thing. An IQueryable queries each time it's used, while an IEnumerable has no such implied reuse.
A Linq Query returns an IQueryable. ReadLines returns an IEnumerable.
There's a subtle distinction here because of the way an Enumerator is created. An IQueryable creates an IEnumerator when you call GetEnumerator() on it (which is done automatically by foreach). ReadLines() creates the IEnumerator when the ReadLines() function is called. As such, reusing an IQueryable creates a new IEnumerator each time, but since ReadLines() creates the IEnumerator itself (and not an IQueryable), the only way to get a new IEnumerator is to call ReadLines() again.
In other words, you should only be able to expect to reuse an IQueryable, not an IEnumerator.
EDIT:
On further reflection (no pun intended) I think my initial response was a bit too simplistic. If IEnumerable was not reusable, you couldn't do something like this:
List<int> li = new List<int>() {1, 2, 3, 4};
IEnumerable<int> iei = li;
foreach (var i in iei) { Console.WriteLine(i); }
foreach (var i in iei) { Console.WriteLine(i); }
Clearly, one would not expect the second foreach to fail.
The problem, as is so often the case with these kinds of abstractions, is that not everything fits perfectly. For example, Streams are typically one-way, but for network use they had to be adapted to work bi-directionally.
In this case, an IEnumerable was originally envisioned to be a reusable feature, but it has since been adapted to be so generic that reusability is not a guarantee and should not even be expected. Witness the explosion of various libraries that use IEnumerables in non-reusable ways, such as Jeffrey Richter's PowerThreading library.
I simply don't think we can assume IEnumerables are reusable in all cases anymore.
It's not a bug. File.ReadLines() uses lazy evaluation and is not idempotent. That's why it's not safe to enumerate it twice in a row. Remember that an IEnumerable represents a data source that can be enumerated; it does not state that it is safe to enumerate twice. This might be unexpected, since most people are used to using IEnumerable over idempotent collections.
From the MSDN:
The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
Your findings via Reflector are correct and verify this behavior. The implementation you provided avoids the unexpected behavior but still makes use of lazy evaluation.
Background: I've got a bunch of strings that I'm getting from a database, and I want to return them. Traditionally, it would be something like this:
public List<string> GetStuff(string connectionString)
{
    List<string> categoryList = new List<string>();
    using (SqlConnection sqlConnection = new SqlConnection(connectionString))
    {
        string commandText = "GetStuff";
        using (SqlCommand sqlCommand = new SqlCommand(commandText, sqlConnection))
        {
            sqlCommand.CommandType = CommandType.StoredProcedure;
            sqlConnection.Open();
            SqlDataReader sqlDataReader = sqlCommand.ExecuteReader();
            while (sqlDataReader.Read())
            {
                categoryList.Add(sqlDataReader["myImportantColumn"].ToString());
            }
        }
    }
    return categoryList;
}
But then I figure the consumer is going to want to iterate through the items and doesn't care about much else, and I'd like not to box myself into a List, per se, so if I return an IEnumerable everything is good/flexible. So I was thinking I could use a "yield return" type of design to handle this... something like this:
public IEnumerable<string> GetStuff(string connectionString)
{
    using (SqlConnection sqlConnection = new SqlConnection(connectionString))
    {
        string commandText = "GetStuff";
        using (SqlCommand sqlCommand = new SqlCommand(commandText, sqlConnection))
        {
            sqlCommand.CommandType = CommandType.StoredProcedure;
            sqlConnection.Open();
            SqlDataReader sqlDataReader = sqlCommand.ExecuteReader();
            while (sqlDataReader.Read())
            {
                yield return sqlDataReader["myImportantColumn"].ToString();
            }
        }
    }
}
But now that I'm reading a bit more about yield (on sites like this one... MSDN didn't seem to mention this), it's apparently a lazy evaluator that keeps the state of the populator around in anticipation of someone asking for the next value, and then only runs until it returns the next value.
This seems fine in most cases, but with a DB call, it sounds a bit dicey. As a somewhat contrived example, if someone asks for an IEnumerable that I'm populating from a DB call, gets through half of it, and then gets stuck in a loop... as far as I can see my DB connection is going to stay open forever.
Sounds like asking for trouble in some cases if the iterator doesn't finish... am I missing something?
It's a balancing act: do you want to force all the data into memory immediately so you can free up the connection, or do you want to benefit from streaming the data, at the cost of tying up the connection for all that time?
The way I look at it, that decision should potentially be up to the caller, who knows more about what they want to do. If you write the code using an iterator block, the caller can very easily turn that streaming form into a fully-buffered form:
List<string> stuff = new List<string>(GetStuff(connectionString));
If, on the other hand, you do the buffering yourself, there's no way the caller can go back to a streaming model.
So I'd probably use the streaming model and say explicitly in the documentation what it does, and advise the caller to decide appropriately. You might even want to provide a helper method to basically call the streamed version and convert it into a list.
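Such a helper could be as simple as this sketch (the name is illustrative):

// Buffered convenience wrapper around the streaming version.
public static List<string> GetStuffList(string connectionString)
{
    return new List<string>(GetStuff(connectionString));
}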
Of course, if you don't trust your callers to make the appropriate decision, and you have good reason to believe that they'll never really want to stream the data (e.g. it's never going to return much anyway) then go for the list approach. Either way, document it - it could very well affect how the return value is used.
Another option for dealing with large amounts of data is to use batches, of course - that's thinking somewhat away from the original question, but it's a different approach to consider in the situation where streaming would normally be attractive.
You're not always unsafe with the IEnumerable. If you let the framework call GetEnumerator (which is what most people will do), then you're safe. Basically, you're only as safe as the code using your method is careful:
class Program
{
    static void Main(string[] args)
    {
        // safe
        var firstOnly = GetList().First();

        // safe
        foreach (var item in GetList())
        {
            if (item == "2")
                break;
        }

        // safe
        using (var enumerator = GetList().GetEnumerator())
        {
            for (int i = 0; i < 2; i++)
            {
                enumerator.MoveNext();
            }
        }

        // unsafe: the enumerator is never disposed, so the finally block
        // of the using in GetList never runs
        var enumerator2 = GetList().GetEnumerator();
        for (int i = 0; i < 2; i++)
        {
            enumerator2.MoveNext();
        }
    }

    static IEnumerable<string> GetList()
    {
        using (new Test())
        {
            yield return "1";
            yield return "2";
            yield return "3";
        }
    }
}

class Test : IDisposable
{
    public void Dispose()
    {
        Console.WriteLine("dispose called");
    }
}
Whether you can afford to leave the database connection open or not also depends on your architecture. If the caller participates in a transaction (and your connection is auto-enlisted), the connection will be kept open by the framework anyway.
Another advantage of yield (when using a server-side cursor) is that your code doesn't have to read all the data (example: 1,000 items) from the database if your consumer wants to get out of the loop earlier (example: after the 10th item). This can speed up querying data, especially in an Oracle environment, where server-side cursors are the common way to retrieve data.
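For instance, a sketch of such an early exit (using the streaming GetStuff from the question):

// Only about 10 rows are read from the cursor; abandoning the iteration
// disposes the iterator, which closes the reader and the connection.
foreach (var name in GetStuff(connectionString).Take(10))
{
    Console.WriteLine(name);
}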
You are not missing anything. Your sample shows how NOT to use yield return. Add the items to a list, close the connection, and return the list. Your method signature can still return IEnumerable.
Edit: That said, Jon has a point (so surprised!): there are rare occasions where streaming is actually the best thing to do from a performance perspective. After all, if it's 100,000 (1,000,000? 10,000,000?) rows we're talking about here, you don't want to be loading that all into memory first.
As an aside - note that the IEnumerable<T> approach is essentially what the LINQ providers (LINQ-to-SQL, LINQ-to-Entities) do for a living. The approach has advantages, as Jon says. However, there are definite problems too - in particular (for me) in terms of (the combination of) separation | abstraction.
What I mean here is that:
in an MVC scenario (for example) you want your "get data" step to actually get data, so that you can test it works at the controller, not the view (without having to remember to call .ToList() etc.)
you can't guarantee that another DAL implementation will be able to stream data (for example, a POX/WSE/SOAP call can't usually stream records); and you don't necessarily want to make the behaviour confusingly different (i.e. connection still open during iteration with one implementation, and closed for another)
This ties in a bit with my thoughts here: Pragmatic LINQ.
But I should stress - there are definitely times when the streaming is highly desirable. It isn't a simple "always vs never" thing...
A slightly more concise way to force evaluation of the iterator:
using System.Linq;
//...
var stuff = GetStuff(connectionString).ToList();
No, you are on the right path... the yield will lock the reader... you can test it by making another database call while iterating the IEnumerable.
The only way this would cause problems is if the caller abuses the protocol of IEnumerable<T>. The correct way to use it is to call Dispose on it when it is no longer needed.
The implementation generated by yield return takes the Dispose call as a signal to execute any open finally blocks, which in your example will call Dispose on the objects you've created in the using statements.
There are a number of language features (in particular foreach) which make it very easy to use IEnumerable<T> correctly.
You could always use a separate thread to buffer the data (perhaps into a queue) while also doing a yield to return the data. When the user requests data (returned via yield), an item is removed from the queue. Data is also continuously added to the queue via the separate thread, so if the user requests data fast enough, the queue is never very full and you do not have to worry about memory issues. If they don't, the queue will fill up, which may not be so bad. If there is some sort of limit you would like to impose on memory, you could enforce a maximum queue size (at which point the producing thread would wait for items to be removed before adding more). Naturally, you will want to make sure you handle the shared resource (i.e., the queue) correctly between the two threads. A sketch follows.
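A hedged sketch of that producer/consumer idea using BlockingCollection<T> (the capacity of 1000 is an arbitrary assumption, and GetStuff is the streaming version from the question):

// A producer task reads from the database into a bounded queue;
// the consumer drains it lazily via GetConsumingEnumerable.
public IEnumerable<string> GetStuffBuffered(string connectionString)
{
    var queue = new BlockingCollection<string>(boundedCapacity: 1000);
    Task.Run(() =>
    {
        try
        {
            foreach (var item in GetStuff(connectionString))
                queue.Add(item); // blocks while the queue is full
        }
        finally
        {
            queue.CompleteAdding(); // lets the consumer's enumeration complete
        }
    });
    return queue.GetConsumingEnumerable();
}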
As an alternative, you could force the user to pass in a boolean to indicate whether or not the data should be buffered. If true, the data is buffered and the connection is closed as soon as possible. If false, the data is not buffered and the database connection stays open as long as the user needs it to be. Having a boolean parameter forces the user to make the choice, which ensures they know about the issue.
I've bumped into this wall a few times. SQL database queries are not easily streamable like files. Instead, query only as much as you think you'll need and return it as whatever container you want (IList<>, DataTable, etc.). IEnumerable won't help you here.
What you can do is use a SqlDataAdapter instead and fill a DataTable. Something like this:
public IEnumerable<string> GetStuff(string connectionString)
{
    DataTable table = new DataTable();
    using (SqlConnection sqlConnection = new SqlConnection(connectionString))
    {
        string commandText = "GetStuff";
        using (SqlCommand sqlCommand = new SqlCommand(commandText, sqlConnection))
        {
            sqlCommand.CommandType = CommandType.StoredProcedure;
            SqlDataAdapter dataAdapter = new SqlDataAdapter(sqlCommand);
            dataAdapter.Fill(table);
        }
    }
    foreach (DataRow row in table.Rows)
    {
        yield return row["myImportantColumn"].ToString();
    }
}
This way, you're querying everything in one shot, and closing the connection immediately, yet you're still lazily iterating the result. Furthermore, the caller of this method can't cast the result to a List and do something they shouldn't be doing.
Don't use yield here; your first sample is fine.