How to separate reading a file from adding data to queue - c#

My case is like this:
I am building an application that can read data from some source (files or database) and write that data to another source (files or database).
So, basically I have objects:
InputHandler -> Queue -> OutputHandler
Looking at a situation where input is some files, InputHandler would:
1. Use FilesReader to read data from all the files (FilesReader encapsulates the logic of reading files and it returns a collection of objects)
2. Add the objects to queue.
(and then it repeats infinitely since InputHandler has a while loop that looks for new files all the time).
The problem appears when files are really big - FilesReader, which reads all files and parses them, is not the best idea here. It would be much better if I could somehow read a portion of the file, parse it, and put it in a queue - and repeat it until the end of each file.
It is doable using Streams; however, I don't want my FilesReader to know anything about the queue - it feels to me that this breaks the OOP principle of separation of concerns.
Could you suggest a solution to this issue?
//UPDATE
Here's some code that shows (in simplified way) what InputHandler does:
public class InputHandler {
    public async Task Start() {
        while (true) {
            var newData = await _filesReader.GetData();
            _queue.Enqueue(newData);
        }
    }
}
This code shows how the code looks right now. So, if I have 1000 files, each having lots and lots of data, _filesReader will try to read all this data and return it - and memory would quickly be exhausted.
Now, if _filesReader was to use streams and return data partially, the memory usage would be kept low.
One solution would be to have the _queue object inside _filesReader - it could just read data from the stream and push it directly to the queue - I don't like it - too much responsibility for _filesReader.
Another solution (as proposed by jhilgeman) - filesReader could raise events with the data in them.
Is there some other solution?

I'm not entirely sure I understand why using an IO stream of some kind would change the way you would add objects to the queue.
However, what I would personally do is set up a static custom event in your FilesReader class, like OnObjectRead. Use a stream to read through files and as you read a record, raise the event and pass that object/record to it.
Then have an event subscriber that takes the record and pushes it into the Queue. It would be up to your app architecture to determine the best place to put that subscriber.
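Here's a minimal sketch of that idea; the Record type, the event args class, and the one-record-per-line parsing are assumptions for illustration (shown as an instance event rather than a static one):
using System;
using System.IO;

// Hypothetical record type and event args, for illustration only.
public class Record { public string Raw { get; set; } }

public class RecordReadEventArgs : EventArgs
{
    public Record Record { get; }
    public RecordReadEventArgs(Record record) { Record = record; }
}

public class FilesReader
{
    // Raised once per record while the file is being streamed.
    public event EventHandler<RecordReadEventArgs> OnObjectRead;

    public void ReadFile(string path)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Publish each record immediately, so the whole file
                // never has to be held in memory at once.
                OnObjectRead?.Invoke(this, new RecordReadEventArgs(new Record { Raw = line }));
            }
        }
    }
}
The subscriber that feeds the queue is then a one-liner: filesReader.OnObjectRead += (s, e) => _queue.Enqueue(e.Record);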
On a side note, you mentioned your InputHandler has a while loop that looks for new files all the time. I'd strongly recommend you don't use a while loop for this if you're only checking the filesystem. This is the purpose of FileSystemWatcher - to give you an efficient way to be immediately notified about changes in the filesystem without you having to loop. Otherwise you're constantly grinding the filesystem and constantly eating up disk I/O.
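A minimal FileSystemWatcher setup could look like this (the folder path and filter are placeholders):
using System;
using System.IO;

var watcher = new FileSystemWatcher(@"C:\incoming", "*.csv");
watcher.Created += (sender, e) =>
{
    // e.FullPath is the newly created file; hand it to FilesReader here
    // instead of polling the directory in a loop.
    Console.WriteLine($"New file: {e.FullPath}");
};
watcher.EnableRaisingEvents = true; // start watching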

This code shows how the code looks right now. So, if I have 1000 files, each having lots and lots of data, _filesReader will try to read all this data and return it - and memory would quickly be exhausted.
Regarding the problem of unlimited memory consumption, a simple solution is to replace the _queue with a BlockingCollection. This class has bounding capabilities out of the box.
public class InputHandler
{
    private readonly BlockingCollection<string> _buffer
        = new BlockingCollection<string>(boundedCapacity: 10);

    public async Task Start()
    {
        while (true)
        {
            var newData = await _filesReader.GetData();
            _buffer.Add(newData); // will block until _buffer
                                  // has fewer than 10 items
        }
    }
}
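For completeness, the consuming side (not shown above) can drain the buffer with GetConsumingEnumerable; Process here is a hypothetical per-item handler:
// Blocks when the buffer is empty and resumes as items are added.
foreach (var item in _buffer.GetConsumingEnumerable())
{
    Process(item); // hypothetical consumer logic
}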

I think I came up with an idea. My main goal is to have a FilesReader that does not rely on any specific way of how data is transferred out of it. All it should do is read data, return it, and not care about any queues or whatever else I could use. That's the job of InputHandler - it knows about the queue and uses FilesReader to get data to put in that queue.
I changed FilesReader interface a bit. Now it has a method like this:
Task ReadData(IFileInfo file, Action<IEnumerable<IDataPoint>> resultHandler, CancellationToken cancellationToken)
Now, InputHandler invokes the method like this:
await _filesReader.ReadData(file, data => _queue.Enqueue(data), cancellationToken);
I think it is a good solution in terms of separation of concerns.
FilesReader can read data in chunks and whenever a new chunk is parsed, it just invokes the delegate - and continues working on the rest of the file.
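For illustration, a chunked implementation might look roughly like this; the chunk size, the ParseLine helper, and the assumption that IFileInfo exposes a FullName path are all placeholders:
public async Task ReadData(
    IFileInfo file,
    Action<IEnumerable<IDataPoint>> resultHandler,
    CancellationToken cancellationToken)
{
    const int chunkSize = 1000; // arbitrary batch size
    var chunk = new List<IDataPoint>(chunkSize);

    using (var reader = new StreamReader(file.FullName)) // assumes IFileInfo has FullName
    {
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            cancellationToken.ThrowIfCancellationRequested();
            chunk.Add(ParseLine(line)); // ParseLine is a hypothetical parser
            if (chunk.Count == chunkSize)
            {
                resultHandler(chunk); // hand a full chunk to the caller
                chunk = new List<IDataPoint>(chunkSize);
            }
        }
    }

    if (chunk.Count > 0)
        resultHandler(chunk); // flush the final partial chunk
}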
What do you think about such a solution?

Related

How to safely iterate over an IAsyncEnumerable to send a collection downstream for message processing in batches

I've watched the chat on LINQ with IAsyncEnumerable, which has given me some insight on dealing with extension methods for IAsyncEnumerables, but frankly it wasn't detailed enough for a real-world application, especially for my experience level, and I understand that samples/documentation don't really exist as of yet for IAsyncEnumerables.
I'm trying to read from a file, do some transformation on the stream, returning a IAsyncEnumerable, and then send those objects downstream after an arbitrary number of objects have been obtained, like:
await foreach (var data in ProcessBlob(downloadedFile))
{
    // todo: add data to a List<T> called listWithPreConfiguredNumberOfElements
    if (listWithPreConfiguredNumberOfElements.Count == preConfiguredNumber)
        await _messageHandler.Handle(listWithPreConfiguredNumberOfElements);
    // repeat the behaviour till all the elements in the IAsyncEnumerable
    // returned by ProcessBlob are sent downstream to the _messageHandler.
}
My understanding from reading on the matter so far is that the await foreach line is working on data that employs the use of Tasks (or ValueTasks), so we don't have a count up front. I'm also hesitant to use a List variable and just do a length-check on that as sharing that data across threads doesn't seem very thread-safe.
I'm using the System.Linq.Async package in the hopes that I could use a relevant extension method. I can see some promise in the form of TakeWhile, but I don't fully understand how thread-safe what I intend to do is, which makes me lose confidence.
Any help or push in the right direction would be massively appreciated, thank you.
There is an operator Buffer that does what you want, in the package System.Interactive.Async.
// Projects each element of an async-enumerable sequence into consecutive
// non-overlapping buffers which are produced based on element count information.
public static IAsyncEnumerable<IList<TSource>> Buffer<TSource>(
    this IAsyncEnumerable<TSource> source, int count);
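Assuming ProcessBlob yields the transformed objects and Handle accepts the batch, usage might look like this (MyData and the batch size of 100 are placeholders):
await foreach (IList<MyData> batch in ProcessBlob(downloadedFile).Buffer(100))
{
    await _messageHandler.Handle(batch);
}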
This package contains operators like Amb, Throw, Catch, Defer, Finally etc that do not have a direct equivalent in Linq, but they do have an equivalent in System.Reactive. This is because IAsyncEnumerables are conceptually closer to IObservables than to IEnumerables (because both have a time dimension, while IEnumerables are timeless).
I'm also hesitant to use a List variable and just do a length-check on that as sharing that data across threads doesn't seem very thread-safe.
You need to think in terms of execution flows, not threads, when dealing with async; since you are await-ing the processing step, there isn't actually a concurrency problem accessing the list, because regardless of which threads are used: the list is only accessed once at a time.
If you are still concerned, you could new a list per batch, but that is probably overkill. What you do need, however, is two additions - a reset between batches, and a final processing step:
var listWithPreConfiguredNumberOfElements = new List<YourType>(preConfiguredNumber);
await foreach (var data in ProcessBlob(downloadedFile)) // CAF?
{
    listWithPreConfiguredNumberOfElements.Add(data);
    if (listWithPreConfiguredNumberOfElements.Count == preConfiguredNumber)
    {
        await _messageHandler.Handle(listWithPreConfiguredNumberOfElements); // CAF?
        listWithPreConfiguredNumberOfElements.Clear(); // reset for a new batch
        // (replace this with a "new" if you're still concerned about concurrency)
    }
}
if (listWithPreConfiguredNumberOfElements.Any())
{   // process any stragglers
    await _messageHandler.Handle(listWithPreConfiguredNumberOfElements); // CAF?
}
You might also choose to use ConfigureAwait(false) in the three spots marked // CAF?

Pipeline pattern and disposable objects

Recently I started to investigate the Pipeline pattern, also known as Pipes and Filters. I thought it was a good way to structure the code of applications that just process data.
I used this article as a base for my pipeline and steps implementation (but this is not so important).
As usual, the blog covers a simple scenario, but in my case I need (or maybe not) to work on IDisposable objects which might travel through the process.
For instance, Streams.
Let's consider a simple pipeline which should load a CSV file and insert its rows into some DB. As a simple abstraction we could implement such functions:
Stream Step1(string filePath)
IEnumerable<RowType> Step2(Stream stream)
bool Step3(IEnumerable<RowType> data)
Now my question is whether that is a good approach. Because if we implement that as step-after-step processing, the Stream object leaves the first step and it is easy to fall into a memory leak problem.
I know that some might say I should have a Step1 which loads and deserialises the data, but we are considering a simple process. We might have more complex ones where passing a Stream makes more sense.
I am wondering how I can implement such pipelines to avoid memory leaks while also avoiding loading the whole file into a MemoryStream (which would be safer). Should I somehow wrap each step in try..catch blocks to call Dispose() if something goes wrong? Or should I pass all IDisposable resources into a Pipeline object which will be wrapped with using, to correctly dispose all resources produced during processing?
If it's planned to be used like Step3( Step2( Step1(filePath) ) ), then
Step2 should dispose the stream. It can use the yield return feature of C#, which creates an implementation of IEnumerator<> underneath; that implementation is IDisposable, which effectively lets you "subscribe" to the "event" of enumeration finishing and call Stream.Dispose at that point. E.g.:
IEnumerable<RowType> Step2(Stream stream)
{
    using (stream)
    using (StreamReader sr = new StreamReader(stream))
    {
        while (!sr.EndOfStream)
        {
            yield return Parse(sr.ReadLine()); // yield return implements IEnumerator<>
        }
    } // the finally part of the using will be called from IEnumerator<>.Dispose()
}
Then if Step3 either uses LINQ
bool Step3(IEnumerable<RowType> data) => data.Any(item => SomeDecisionLogic(item));
or foreach
bool Step3(IEnumerable<RowType> data)
{
    foreach (var item in data)
        if (SomeDecisionLogic(item))
            return true;
    return false; // needed so all code paths return a value
}
for enumerating, both of them guarantee a call to IEnumerator<>.Dispose() (ref1, ECMA-334 C# Spec, ch. 13.9.5), which will call Stream.Dispose.
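To see why, here is roughly what the compiler generates for the foreach version of Step3 - the finally block is what guarantees Dispose runs even on early return or exception:
bool Step3(IEnumerable<RowType> data)
{
    IEnumerator<RowType> e = data.GetEnumerator();
    try
    {
        while (e.MoveNext())
        {
            if (SomeDecisionLogic(e.Current))
                return true;
        }
        return false;
    }
    finally
    {
        e.Dispose(); // runs on every exit path, disposing the stream
    }
}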
IMO it's worth having a pipeline if the interaction is between at least 2 different systems and if the work can be executed in parallel. Otherwise it's more overhead.
In this case there are 2 systems: the file system where the CSV file is and the database. I think the pipeline should have at least 2 steps that run in parallel:
IEnumerable<Row> ReadFromCsv(string csvFilePath)
void UpdateDatabase(IEnumerable<Row> rows)
In this case it should be clear that the Stream is bound to ReadFromCsv.
IEnumerable<Row> ReadFromCsv(string path)
{
    using (var stream = File.OpenRead(path))
    {
        var lines = GetLines(stream); // yields one line at a time, not all at once
        foreach (var line in lines) yield return GetRow(line);
    }
}
I guess the scope depends on the steps - which in turn depend on the way you design the pipeline based on your needs.

Windows Store App Incremental Serialization

So I finally got my ListView content to serialize and write to a file so I can restore my app's state across sessions. Now I'm wondering if there is a way I can incrementally serialize and save my data. Currently, I call this method in the SaveState method of my main page:
private async void writeToFile()
{
    var f = await Windows.Storage.ApplicationData.Current.LocalFolder
        .CreateFileAsync("data.txt", CreationCollisionOption.ReplaceExisting);
    using (var st = await f.OpenStreamForWriteAsync())
    {
        var s = new DataContractSerializer(typeof(ObservableCollection<Item>),
            new Type[] { typeof(Item) });
        s.WriteObject(st, convoStrings);
    }
}
What I think would be more ideal is to write data out to storage as it is generated, so I don't have to serialize my entire list in the small suspend time frame. But I don't know whether it is possible to incrementally serialize my collection, and if it is, how I would do it.
Note that my data doesn't change after it is generated, so I don't have to worry about anything other than appending new data to the end of my currently serialized list.
It depends on your definition of when to save the data to disk. Maybe you want to save the new collection state when an item is added or removed? Or when the content of an item changes?
The main problem with saving everything just-in-time is that it may be dog-slow. If you're using an async programming model it wouldn't be a problem directly, since your app won't hang - everything is async.
I think it may be a better idea to save the collection, say, every minute AND when the user closes the application. This will only work with a certain amount of data, since you only have about 3 seconds to perform all the IO work.
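As a rough sketch of that idea, assuming a XAML app and the writeToFile method from the question:
var saveTimer = new Windows.UI.Xaml.DispatcherTimer
{
    Interval = TimeSpan.FromMinutes(1)
};
saveTimer.Tick += (s, e) => writeToFile(); // periodic fire-and-forget save
saveTimer.Start();
// ...and call writeToFile() one final time from SaveState/OnSuspending.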
As you can see, there is no perfect solution. It really depends on your requirements and the size of the data. Without further information, that's all I can tell you for sure.

Inserting data in background/async task what is the best way?

I have a very quick/lightweight MVC action that is requested very often, and I need to maintain minimal response time under heavy load.
What I need to do, from time to time depending on conditions, is insert a small amount of data into SQL Server (log a unique id for statistics, for ~1-5% of requests).
I don't need the inserted data for the response, and if I lose some of it because of an application restart or similar, I'll survive.
I imagine that I could somehow queue the inserts and do them in the background, maybe even with some kind of buffering - like waiting till the queue collects 100 inserts and then making them in one pass.
I'm pretty sure somebody must have done/seen such an implementation before; there's no need to reinvent the wheel, so if somebody could point me in the right direction, I would be thankful.
You could trigger a background task from your controller action that will do the insertion (fire and forget):
public ActionResult Insert(SomeViewModel model)
{
    Task.Factory.StartNew(() =>
    {
        // do the inserts
    });
    return View();
}
Be aware though that IIS could recycle the application at any time which would kill any running tasks.
Create a class that will store the data that needs to be pushed to the server, and a queue to hold those objects:
Queue<LogData> loggingQueue = new Queue<LogData>();

public class LogData
{
    public object DataToLog { get; set; } // give this whatever type you actually log
}
Then create a timer or some other mechanism within the app that will be triggered every now and then to post the queued data to the database.
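A minimal sketch of that timer-based flush; the 30-second period and the BulkInsert helper are assumptions, and loggingQueue from above is assumed to be a static field:
static readonly object sync = new object();
static readonly System.Threading.Timer flushTimer = new System.Threading.Timer(
    _ => Flush(), null, TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(30));

static void Flush()
{
    List<LogData> batch;
    lock (sync) // snapshot the queue so inserts never run under the lock
    {
        if (loggingQueue.Count == 0) return;
        batch = new List<LogData>(loggingQueue);
        loggingQueue.Clear();
    }
    BulkInsert(batch); // hypothetical: one round trip for the whole batch
}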
I agree with @Darin Dimitrov's approach, although I would add that you could simply use this task to write to MSMQ on the machine. From there you could write a service that reads the queue and inserts the data into the database. That way you could throttle the service that reads the data, or even move the queue onto a different machine.
If you wanted to take this one step further you could use something like nServiceBus and a pub/sub model to write the events into the database.

Multithread access to a LinkedList in .Net

I need a linked list to be able to add items at both ends. The list will hold data to be shown in a trend viewer. As there is a huge amount of data, I need to show data before it is completely read, so what I want to do is read a block of data that I know has already been written, and while I read that block have two threads filling the two ends of the collection.
I thought of using LinkedList, but the documentation says it does not support this scenario. Any ideas on something in the Framework that can help me, or will I have to develop a custom list from scratch?
Thanks in advance.
EDIT: The main idea of the solution is to do it without locking anything, because I'm reading a piece of the list that is not going to be changed while writing happens at other places. I mean, the reader thread will only read one chunk (from A to B) - a section that has already been written. When it finishes and other chunks have been completely written, the reader will read those chunks while the writers write new data.
See the updated diagram.
If you are on .NET 4 you can use two ConcurrentQueues: one for the left side and one for the right side.
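A minimal sketch, where DataPoint, olderPoint, newerPoint, and Render are placeholders:
using System.Collections.Concurrent;

var leftSide = new ConcurrentQueue<DataPoint>();
var rightSide = new ConcurrentQueue<DataPoint>();

// Writer threads push to their own end, no explicit locking needed:
leftSide.Enqueue(olderPoint);   // historical data growing to the left
rightSide.Enqueue(newerPoint);  // live data growing to the right

// The reader drains whatever has been written so far:
while (leftSide.TryDequeue(out var point)) Render(point);
while (rightSide.TryDequeue(out var point)) Render(point);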
If I understand correctly you have a linked list to which you are adding data at the beginning and the end. You are never adding or removing from anywhere else. If this is the case you do not have to worry about threading since the other threads will never interfere.
Simply do something like this:
// Everything between first and last is thread safe since
// the other threads only add before and after.
LinkedListNode<object> first = myList.First;
LinkedListNode<object> current = first;
LinkedListNode<object> last = myList.Last;
bool done = false;

if (current == null) return; // empty list

do
{
    // do stuff with current.Value here
    if (current == last) done = true;
    current = current.Next;
} while (!done && current != null);
After you are done with this section you can do the same with two more sections from the new myList.First to first and from last to the new myList.Last.
You could use the linked list and just use normal .NET threading constructs like the lock keyword in order to protect access to the list.
Any custom list you developed would probably do something like that anyway.
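For example, a minimal lock-based wrapper (DataPoint is a placeholder type):
using System.Collections.Generic;
using System.Linq;

public class TrendBuffer
{
    private readonly object _sync = new object();
    private readonly LinkedList<DataPoint> _list = new LinkedList<DataPoint>();

    public void AddLeft(DataPoint p)  { lock (_sync) _list.AddFirst(p); }
    public void AddRight(DataPoint p) { lock (_sync) _list.AddLast(p); }

    // Copy out under the lock so callers can read without holding it.
    public DataPoint[] Snapshot() { lock (_sync) return _list.ToArray(); }
}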
I would recommend considering another approach with a single data structure to persist incoming data; this way you can keep the order of the incoming data messages.
For instance, you can use a blocking queue; in this SO post you can find a nice example: Creating a blocking Queue in .NET?
Why not use the LinkedList class?
The documentation says it's not thread-safe, so you have to synchronize access to the list yourself, but you have to do this with any data structure accessed by multiple threads.
Performance should be quite good; here is what MSDN says about inserting nodes at any position:
LinkedList provides separate nodes of type LinkedListNode, so insertion and removal are O(1) operations.
You just have to guard read and insert operations with the lock construct.
EDIT
OK, I think I understand what you want. You want a list-like data structure which is split into chunks of items, so that you can independently write and read chunks without locking the whole list.
I suggest using a LinkedList that holds your chunks of data items.
The chunks themselves can be represented as simple List instances, or as LinkedLists as well.
You have to lock access to the global LinkedList.
Now your writer threads each fill one private List with n items at a time. When finished, a writer locks the LinkedList and adds its private list of data items to it.
The reader thread locks the LinkedList, reads one chunk, and releases the lock. Now it can process n data items without locking them.
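A rough sketch of this scheme; the chunk size, the DataPoint type, and the Render call are placeholders:
var chunks = new LinkedList<List<DataPoint>>();
var gate = new object();

// Writer: fill a private list, then publish it under the lock.
var privateChunk = new List<DataPoint>(1000);
// ... fill privateChunk with parsed data points ...
lock (gate) { chunks.AddLast(privateChunk); } // or AddFirst for the other end

// Reader: take one chunk under the lock, process it outside the lock.
List<DataPoint> toProcess = null;
lock (gate)
{
    if (chunks.First != null)
    {
        toProcess = chunks.First.Value;
        chunks.RemoveFirst();
    }
}
if (toProcess != null)
{
    foreach (var p in toProcess) Render(p); // no lock held while rendering
}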
