Recently I started to investigate the Pipeline pattern, also known as Pipes and Filters. I thought it would be a good way to structure code and applications that simply process data.
I used this article as a base for my pipeline and steps implementation (but this is not so important).
As usual, the blog covers a simple scenario, but in my case I need (or maybe not) to work with IDisposable objects that might travel through the process - Streams, for instance.
Let's consider a simple pipeline which should load a CSV file and insert its rows into some DB. In a simple abstraction we could implement the steps as functions:
Stream Step1(string filePath)
IEnumerable<RowType> Step2(Stream stream)
bool Step3(IEnumerable<RowType> data)
Now my question is whether that is a good approach. If we implement it as step-after-step processing, the Stream object leaves the first step and it becomes easy to run into a memory-leak problem.
I know some might say that Step1 should load and deserialise the data itself, but we are considering a simple process here. We might have more complex ones where passing a Stream makes more sense.
I am wondering how I can implement such pipelines to avoid memory leaks while also avoiding loading the whole file into a MemoryStream (which would be safer). Should I somehow wrap each step in try..catch blocks to call Dispose() if something goes wrong? Or should I pass all IDisposable resources into the Pipeline object, which would be wrapped in a using so that all resources produced during processing are disposed correctly?
If it's planned to be used like Step3( Step2( Step1(filePath) ) ), then
Step2 should dispose the stream. It can use C#'s yield return feature, which generates an implementation of IEnumerator<> underneath. That implementation is IDisposable, so it effectively lets you "subscribe" to the end of the enumeration and call Stream.Dispose at that point. E.g.:
IEnumerable<RowType> Step2(Stream stream)
{
    using (stream)
    using (StreamReader sr = new StreamReader(stream))
    {
        while (!sr.EndOfStream)
        {
            yield return Parse(sr.ReadLine()); // yield return compiles into an IEnumerator<> implementation
        }
    } // the finally part of the usings runs from IEnumerator<>.Dispose()
}
Then if Step3 either uses LINQ
bool Step3(IEnumerable<RowType> data) => data.Any(item => SomeDecisionLogic(item));
or foreach
bool Step3(IEnumerable<RowType> data)
{
    foreach (var item in data)
        if (SomeDecisionLogic(item))
            return true;
    return false;
}
for enumerating, both of them are guaranteed to call IEnumerator<>.Dispose() (ref1, ECMA-334 C# Spec, ch. 13.9.5), which in turn calls Stream.Dispose.
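For reference, the foreach above expands to roughly the following (a simplified sketch of the expansion the spec describes; not something you need to write yourself):

bool Step3(IEnumerable<RowType> data)
{
    IEnumerator<RowType> e = data.GetEnumerator();
    try
    {
        while (e.MoveNext())
        {
            if (SomeDecisionLogic(e.Current))
                return true;
        }
        return false;
    }
    finally
    {
        e.Dispose(); // this is where the iterator's finally runs and Stream.Dispose gets called
    }
}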
IMO it's worth having a pipeline if the interaction is between at least 2 different systems and if the work can be executed in parallel. Otherwise it's more overhead.
In this case there are 2 systems: the file system where the CSV file is and the database. I think the pipeline should have at least 2 steps that run in parallel:
IEnumerable<Row> ReadFromCsv(string csvFilePath)
void UpdateDatabase(IEnumerable<Row> rows)
In this case it should be clear that the Stream is bound to ReadFromCsv.
IEnumerable<Row> ReadFromCsv(string path)
{
    using (var stream = File.OpenRead(path))
    {
        var lines = GetLines(stream); // yield one at a time, not all at once
        foreach (var line in lines) yield return GetRow(line);
    }
}
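GetLines isn't shown here; a minimal sketch of what it could look like, assuming a .NET version whose StreamReader constructor accepts leaveOpen so the outer using in ReadFromCsv keeps ownership of the stream (Encoding comes from System.Text):

static IEnumerable<string> GetLines(Stream stream)
{
    // leaveOpen: true, because ReadFromCsv owns and disposes the stream
    using (var reader = new StreamReader(stream, Encoding.UTF8,
        detectEncodingFromByteOrderMarks: true, bufferSize: 1024, leaveOpen: true))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            yield return line; // one line at a time, never the whole file
    }
}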
I guess the scope depends on the steps - which in turn depend on the way you design the pipeline based on your needs.
Related
I've watched the chat on LINQ with IAsyncEnumerable, which gave me some insight into writing extension methods for IAsyncEnumerables, but frankly it wasn't detailed enough for a real-world application, especially at my experience level, and I understand that samples/documentation don't really exist yet for IAsyncEnumerables.
I'm trying to read from a file, do some transformation on the stream, returning an IAsyncEnumerable, and then send those objects downstream after an arbitrary number of objects have been obtained, like:
await foreach (var data in ProcessBlob(downloadedFile))
{
    // TODO: add data to a List<T> called listWithPreConfiguredNumberOfElements
    if (listWithPreConfiguredNumberOfElements.Count == preConfiguredNumber)
        await _messageHandler.Handle(listWithPreConfiguredNumberOfElements);
    // repeat until all the elements in the IAsyncEnumerable returned by ProcessBlob
    // have been sent downstream to the _messageHandler
}
My understanding from reading on the matter so far is that the await foreach line works on data that employs Tasks (or ValueTasks), so we don't have a count up front. I'm also hesitant to use a List variable and just do a length check on it, as sharing that data across threads doesn't seem very thread-safe.
I'm using the System.Linq.Async package in the hope that I can use a relevant extension method. I can see some promise in TakeWhile, but I'm not sure how thread-safe what I intend to do would be, which makes me lose confidence.
Any help or push in the right direction would be massively appreciated, thank you.
There is an operator Buffer that does what you want, in the package System.Interactive.Async.
// Projects each element of an async-enumerable sequence into consecutive
// non-overlapping buffers which are produced based on element count information.
public static IAsyncEnumerable<IList<TSource>> Buffer<TSource>(
this IAsyncEnumerable<TSource> source, int count);
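A sketch of how it could replace the manual counting from the question (ProcessBlob, downloadedFile, preConfiguredNumber and _messageHandler are the names from the question; Handle would need to accept the IList<T> batch, or you would call ToList() on it):

await foreach (var batch in ProcessBlob(downloadedFile).Buffer(preConfiguredNumber))
{
    // batch is an IList<T> with up to preConfiguredNumber items; the last one may be smaller
    await _messageHandler.Handle(batch);
}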
This package contains operators like Amb, Throw, Catch, Defer, Finally etc that do not have a direct equivalent in Linq, but they do have an equivalent in System.Reactive. This is because IAsyncEnumerables are conceptually closer to IObservables than to IEnumerables (because both have a time dimension, while IEnumerables are timeless).
I'm also hesitant to use a List variable and just do a length-check on that as sharing that data across threads doesn't seem very thread-safe.
You need to think in terms of execution flows, not threads, when dealing with async; since you are await-ing the processing step, there isn't actually a concurrency problem accessing the list, because regardless of which threads are used, the list is only accessed by one flow at a time.
If you are still concerned, you could new up a list per batch, but that is probably overkill. What you do need, however, are two additions: a reset between batches, and a final processing step:
var listWithPreConfiguredNumberOfElements = new List<YourType>(preConfiguredNumber);
await foreach (var data in ProcessBlob(downloadedFile)) // CAF?
{
    listWithPreConfiguredNumberOfElements.Add(data);
    if (listWithPreConfiguredNumberOfElements.Count == preConfiguredNumber)
    {
        await _messageHandler.Handle(listWithPreConfiguredNumberOfElements); // CAF?
        listWithPreConfiguredNumberOfElements.Clear(); // reset for a new batch
        // (replace this with a "new" if you're still concerned about concurrency)
    }
}
if (listWithPreConfiguredNumberOfElements.Any())
{
    // process any stragglers
    await _messageHandler.Handle(listWithPreConfiguredNumberOfElements); // CAF?
}
You might also choose to use ConfigureAwait(false) in the three spots marked // CAF?
My case is like this:
I am building an application that can read data from some source (files or database) and write that data to another source (files or database).
So, basically I have objects:
InputHandler -> Queue -> OutputHandler
Looking at a situation where input is some files, InputHandler would:
1. Use FilesReader to read data from all the files (FilesReader encapsulates the logic of reading files and it returns a collection of objects)
2. Add the objects to queue.
(and then it repeats infinitely since InputHandler has a while loop that looks for new files all the time).
The problem appears when files are really big - FilesReader, which reads all files and parses them, is not the best idea here. It would be much better if I could somehow read a portion of the file, parse it, and put it in a queue - and repeat it until the end of each file.
It is doable using Streams; however, I don't want my FilesReader to know anything about the queue - it feels to me that it breaks the OOP principle of separation of concerns.
Could you suggest a solution for this issue?
//UPDATE
Here's some code that shows (in simplified way) what InputHandler does:
public class InputHandler {
    public async Task Start() {
        while (true) {
            var newData = await _filesReader.GetData();
            _queue.Enqueue(newData);
        }
    }
}
This shows how the code looks right now. So, if I have 1000 files, each with lots and lots of data, _filesReader will try to read all of that data and return it - and memory would quickly be exhausted.
Now, if _filesReader were to use streams and return data in parts, memory usage would stay low.
One solution would be to have the _queue object inside _filesReader - it could just read data from the stream and push it directly to the queue. I don't like it - too much responsibility for _filesReader.
Another solution (as proposed by jhilgeman) - filesReader could raise events with the data in them.
Is there some other solution?
I'm not entirely sure I understand why using an IO stream of some kind would change the way you would add objects to the queue.
However, what I would personally do is set up a static custom event in your FilesReader class, like OnObjectRead. Use a stream to read through files and as you read a record, raise the event and pass that object/record to it.
Then have an event subscriber that takes the record and pushes it into the Queue. It would be up to your app architecture to determine the best place to put that subscriber.
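A rough sketch of that idea (shown with an instance event rather than a static one; IDataPoint and ParseLine are assumptions, not part of the original code):

public class FilesReader
{
    public event Action<IDataPoint> OnObjectRead; // raised once per parsed record

    public void ReadFile(string path)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                OnObjectRead?.Invoke(ParseLine(line)); // push each record out as it is read
        }
    }
}

// In InputHandler: subscribe once and forward records to the queue.
_filesReader.OnObjectRead += dataPoint => _queue.Enqueue(dataPoint);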
On a side note, you mentioned your InputHandler has a while loop that looks for new files all the time. I'd strongly recommend you don't use a while loop for this if you're only checking the filesystem. This is the purpose of FileSystemWatcher - to give you an efficient way to be immediately notified about changes in the filesystem without you having to loop. Otherwise you're constantly grinding the filesystem and constantly eating up disk I/O.
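For completeness, a minimal FileSystemWatcher setup might look like this (the directory path and filter are placeholders, and ReadFile is the hypothetical method from the sketch above):

var watcher = new FileSystemWatcher(@"C:\input", "*.csv");
watcher.Created += (sender, e) =>
{
    // e.FullPath is the newly created file; hand it off to the reader here
    _filesReader.ReadFile(e.FullPath);
};
watcher.EnableRaisingEvents = true; // start watching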
This shows how the code looks right now. So, if I have 1000 files, each with lots and lots of data, _filesReader will try to read all of that data and return it - and memory would quickly be exhausted.
Regarding the problem of unlimited memory consumption, a simple solution is to replace the _queue with a BlockingCollection. This class has bounding capabilities out of the box.
public class InputHandler
{
    private readonly BlockingCollection<string> _buffer
        = new BlockingCollection<string>(boundedCapacity: 10);

    public async Task Start()
    {
        while (true)
        {
            var newData = await _filesReader.GetData();
            _buffer.Add(newData); // will block until _buffer
                                  // has fewer than 10 items
        }
    }
}
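The consuming side (e.g. whatever feeds the OutputHandler) could then drain the buffer with GetConsumingEnumerable; a sketch, assuming it has access to the same BlockingCollection and some hypothetical WriteToOutput method:

public Task StartConsuming() => Task.Run(() =>
{
    // Blocks when the buffer is empty and completes once the producer
    // calls _buffer.CompleteAdding().
    foreach (var item in _buffer.GetConsumingEnumerable())
    {
        WriteToOutput(item); // hypothetical: write to the target files/database
    }
});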
I think I came up with an idea. My main goal is to have a FilesReader that does not rely on any specific way of transferring the data out of it. All it should do is read data, return it, and not care about any queues or whatever else I might use. That is InputHandler's job - it knows about the queue and it uses FilesReader to get data to put into that queue.
I changed the FilesReader interface a bit. Now it has a method like this:
Task ReadData(IFileInfo file, Action<IEnumerable<IDataPoint>> resultHandler, CancellationToken cancellationToken)
Now, InputHandler invokes the method like this:
await _filesReader.ReadData(file, data => _queue.Enqueue(data), cancellationToken);
I think it is a good solution in terms of separation of concerns.
FilesReader can read data in chunks and whenever a new chunk is parsed, it just invokes the delegate - and continues working on the rest of the file.
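A minimal sketch of how ReadData might look inside FilesReader (the chunk size, ParseLine, and the way IFileInfo exposes its stream are assumptions):

public async Task ReadData(IFileInfo file, Action<IEnumerable<IDataPoint>> resultHandler, CancellationToken cancellationToken)
{
    using (var stream = file.CreateReadStream()) // assumption: IFileInfo can open a read stream
    using (var reader = new StreamReader(stream))
    {
        var chunk = new List<IDataPoint>(1000);  // hypothetical chunk size
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            cancellationToken.ThrowIfCancellationRequested();
            chunk.Add(ParseLine(line));          // hypothetical parsing helper
            if (chunk.Count == 1000)
            {
                resultHandler(chunk);            // hand the parsed chunk to the caller
                chunk = new List<IDataPoint>(1000);
            }
        }
        if (chunk.Count > 0)
            resultHandler(chunk);                // flush the final partial chunk
    }
}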
What do you think about such a solution?
Here is the sample code that I am using to fetch data from the database:
On the DAO layer:
public IEnumerable<IDataRecord> GetDATA(ICommonSearchCriteriaDto commonSearchCriteriaDto)
{
    using (DbContext)
    {
        DbDataReader reader = DbContext.GetReader("ABC_PACKAGE.GET_DATA", oracleParams.ToArray(), CommandType.StoredProcedure);
        while (reader.Read())
        {
            yield return reader;
        }
    }
}
On the BO layer I am calling the above method like:
List<IGridDataDto> GridDataDtos = MapMultiple(_costDriversGraphDao.GetGraphData(commonSearchCriteriaDto)).ToList();
On the mapper layer the MapMultiple method is defined like:
public IGridDataDto MapSingle(IDataRecord dataRecord)
{
    return new GridDataDto
    {
        Code = Convert.ToString(dataRecord["Code"]),
        Name = Convert.ToString(dataRecord["Name"]),
        Type = Convert.ToString(dataRecord["Type"])
    };
}
public IEnumerable<IGridDataDto> MapMultiple(IEnumerable<IDataRecord> dataRecords)
{
    return dataRecords.Select(MapSingle);
}
The above code works well, but I am wondering about two concerns:
How long will the data reader's connection stay open?
Considering performance only, is it a good idea to use yield return instead of adding the records into a list and returning the whole list?
Your code doesn't show where you open/close the connection, but the reader here will only actually be open while you are iterating the data - deferred execution, etc. The only bit of your code that does this is the .ToList(), so it'll be fine. In the more general case, yes: the reader will be open for the amount of time you take to iterate it; if you do a .ToList() that will be minimal; if you do a foreach and (for every item) make an external HTTP request and wait 20 seconds, then yes - it will be open for longer.
Both have their uses; the non-buffered approach is great for huge results that you want to process as a stream, without ever having to load them into a single in-memory list (or even have all of them in memory at a time). Returning a list gets the connection closed quickly, and makes it easy to avoid accidentally using the connection while it already has an open reader, but it is not ideal for large results.
If you return an iterator block, the caller can decide what is sane; if you always return a list, they don't have much option. A third way (that we do in dapper) is to make the choice theirs; we have an optional bool parameter which defaults to "return a list", but which the caller can change to indicate "return an iterator block"; basically:
bool buffered = true
in the parameters, and:
var data = QueryInternal<T>(...blah...);
return buffered ? data.ToList() : data;
in the implementation. In most cases, returning a list is perfectly reasonable and avoids a lot of problems, hence we make that the default.
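Applied to the GetDATA method from the question, that pattern could look roughly like this (GetDATAInternal standing in for the existing iterator-block implementation, renamed):

public IEnumerable<IDataRecord> GetDATA(ICommonSearchCriteriaDto criteria, bool buffered = true)
{
    var data = GetDATAInternal(criteria);   // the iterator block shown above
    return buffered ? data.ToList() : data; // list by default, streaming on request
}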
How long will the data reader's connection stay open?
The connection will remain open until the reader is disposed, which means it will stay open until the iteration is over.
Considering performance only, is it a good idea to use yield return instead of adding the records into a list and returning the whole list?
This depends on several factors:
If you are not planning to fetch the entire result, yield return will help you save on the amount of data transferred over the network.
If you are not planning to convert the returned data into objects, or if multiple rows are used to create a single object, yield return will help reduce the memory used at the peak usage point of your program.
If you plan to iterate over the entire result set in a short period of time, there is no performance penalty for using yield return. However, if the iteration lasts a significant amount of time on multiple concurrent threads, the maximum number of open cursors on the RDBMS side may be exceeded.
This answer ignores flaws in the shown implementation and covers the general idea.
It is a tradeoff - it is impossible to tell whether it is a good idea without knowing the constraints of your system: the amount of data you expect to get, the memory consumption you are willing to accept, the expected load on the database, and so on.
The work I do involves downloading HUGE amounts of data into memory from a SQL Server database. To accomplish this, we have custom dataset definitions that we load using a SqlDataReader, then iterate through the DataTable, build each row into an object, and usually package those objects into a massive dictionary.
The amount of data we are using is large enough that sometimes it cannot fit into a single DataTable, which has a memory cap. The dictionaries have even grown large enough to surpass 8 GB of system memory in the most extreme cases. I was given the task of fixing the OutOfMemoryExceptions being thrown when the DataTables overflowed. I did this by implementing a batch-processing method that seemed to conflict with how DataTables are meant to be used, but it worked for the time being.
I now have the task of further reducing the memory requirements of this process. My idea is to create a generically typed class implementing IEnumerator<T> that takes a SqlDataReader and essentially uses the reader as the collection it is enumerating. MoveNext() will advance the reader, and the Current property will return the typed object built from the reader's current row by a builder method.
My question: is this a feasible idea? I've never heard of, and can't find online, anything like it.
Also, logistically: How would I call the specific builder function that the type declaration demands when the Current property is called?
I'm open to criticism and chastising for dreaming up a silly idea. I'm most interested in finding the best practice for approaching the overall goal.
Seems reasonably sensible, and actually pretty straightforward using an iterator block:
private static IEnumerable<Foo> WrapReader(SqlDataReader reader)
{
    while (reader.Read())
    {
        Foo foo = ...; // TODO: Build a Foo from the reader
        yield return foo;
    }
}
Then you can use it with:
using (SqlDataReader reader = ...)
{
    foreach (Foo foo in WrapReader(reader))
    {
        ...
    }
}
You can even use LINQ to Objects if you're careful:
using (SqlDataReader reader = ...)
{
    var query = from foo in WrapReader(reader)
                where foo.Price > 100
                select foo.Name;
    // Use the query...
}
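To address the other part of the question - calling the right builder for the declared type - the wrapper can take the builder as a delegate; a small variation on the same idea, not part of the original answer:

private static IEnumerable<T> WrapReader<T>(SqlDataReader reader, Func<IDataRecord, T> build)
{
    while (reader.Read())
    {
        yield return build(reader); // the caller decides how a row becomes a T
    }
}

// Usage: WrapReader(reader, r => new Foo { Name = (string)r["Name"] })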
This code:
IEnumerable<string> lines = File.ReadLines("file path");
foreach (var line in lines)
{
    Console.WriteLine(line);
}
foreach (var line in lines)
{
    Console.WriteLine(line);
}
throws an ObjectDisposedException : {"Cannot read from a closed TextReader."} if the second foreach is executed.
It seems that the iterator object returned from File.ReadLines(..) can't be enumerated more than once. You have to obtain a new iterator object by calling File.ReadLines(..) and then use it to iterate.
If I replace File.ReadLines(..) with my version (parameters are not verified, it's just an example):
public static IEnumerable<string> MyReadLines(string path)
{
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
it's possible to iterate over the lines of the file more than once.
An investigation using .NET Reflector showed that the implementation of File.ReadLines(..) calls a private File.InternalReadLines(TextReader reader) that creates the actual iterator. The reader passed as a parameter is used in the iterator's MoveNext() method to get the lines of the file and is disposed when we reach the end of the file. This means that once MoveNext() returns false there is no way to iterate a second time, because the reader is closed, and you have to get a new reader by creating a new iterator with the ReadLines(..) method. In my version, a new reader is created in the MoveNext() method each time we start a new iteration.
Is this the expected behavior of the File.ReadLines(..) method?
I find it troubling that you have to call the method again each time before you enumerate the results. You would also have to call the method each time before iterating the results of a LINQ query that uses the method.
I know this is old, but I actually just ran into this while working on some code on a Windows 7 machine. Contrary to what people were saying here, this actually was a bug. See this link.
So the easy fix is to update your .NET Framework. I thought this was worth an update since this was the top search result.
I don't think it's a bug, and I don't think it's unusual -- in fact that's what I'd expect for something like a text file reader to do. IO is an expensive operation, so in general you want to do everything in one pass.
It isn't a bug. But I believe you can use ReadAllLines() to do what you want instead. ReadAllLines creates a string array and pulls all the lines into the array, instead of just providing a simple enumerator over a stream like ReadLines does.
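For example (a small sketch; the array can be enumerated as many times as you like because everything is already in memory):

string[] lines = File.ReadAllLines("file path");

foreach (var line in lines)
{
    Console.WriteLine(line);
}
foreach (var line in lines)
{
    Console.WriteLine(line);
}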
If you need to access the lines twice you can always buffer them into a List<T>
using System.Linq;
List<string> lines = File.ReadLines("file path").ToList();
foreach (var line in lines)
{
Console.WriteLine(line);
}
foreach (var line in lines)
{
Console.WriteLine(line);
}
I don't know whether it can be considered a bug if it's by design, but I can certainly say two things...
1. This should be posted on Connect, not Stack Overflow, although they're not going to change it before 4.0 is released. And that usually means they won't ever fix it.
2. The design of the method certainly appears to be flawed.
You are correct in noting that returning an IEnumerable implies that it should be reusable and it does not guarantee the same results if iterated twice. If it had returned an IEnumerator instead then it would be a different story.
So anyway, I think it's a good find and I think the API is a lousy one to begin with. ReadAllLines and ReadAllText give you a nice convenient way of getting at the entire file but if the caller cares enough about performance to be using a lazy enumerable, they shouldn't be delegating so much responsibility to a static helper method in the first place.
I believe you are confusing an IQueryable with an IEnumerable. Yes, it's true that IQueryable can be treated as an IEnumerable, but they are not exactly the same thing. An IQueryable queries each time it's used, while an IEnumerable has no such implied reuse.
A Linq Query returns an IQueryable. ReadLines returns an IEnumerable.
There's a subtle distinction here because of the way an Enumerator is created. An IQueryable creates an IEnumerator when you call GetEnumerator() on it (which is done automatically by foreach). ReadLines() creates the IEnumerator when the ReadLines() function is called. As such, when you reuse an IQueryable, it creates a new IEnumerator when you reuse it, but since the ReadLines() creates the IEnumerator (and not an IQueryable), the only way to get a new IEnumerator is to call ReadLines() again.
In other words, you should only be able to expect to reuse an IQueryable, not an IEnumerator.
EDIT:
On further reflection (no pun intended) I think my initial response was a bit too simplistic. If IEnumerable was not reusable, you couldn't do something like this:
List<int> li = new List<int>() {1, 2, 3, 4};
IEnumerable<int> iei = li;
foreach (var i in iei) { Console.WriteLine(i); }
foreach (var i in iei) { Console.WriteLine(i); }
Clearly, one would not expect the second foreach to fail.
The problem, as is so often the case with these kinds of abstractions, is that not everything fits perfectly. For example, Streams are typically one-way, but for network use they had to be adapted to work bi-directionally.
In this case, an IEnumerable was originally envisioned to be a reusable feature, but it has since been adapted to be so generic that reusability is not a guarantee or even something to be expected. Witness the explosion of various libraries that use IEnumerables in non-reusable ways, such as Jeffrey Richter's PowerThreading library.
I simply don't think we can assume IEnumerables are reusable in all cases anymore.
It's not a bug. File.ReadLines() uses lazy evaluation and it is not idempotent; that's why it's not safe to enumerate it twice in a row. Remember, an IEnumerable represents a data source that can be enumerated; it does not state that it is safe to enumerate twice, although this might be unexpected since most people are used to using IEnumerable over idempotent collections.
From the MSDN:
The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
Your findings via Reflector are correct and verify this behavior. The implementation you provided avoids this unexpected behavior but still makes use of lazy evaluation.