I'm just starting out with C#'s new async features. I've read plenty of how-tos on parallel downloads etc., but nothing on reading/processing a text file.
I had an old script I use to filter a log file and figured I'd have a go at upgrading it. However, I'm unsure if my usage of the new async/await syntax is correct.
In my head I see this reading the file line by line and passing it on for processing in a different thread so it can continue without waiting for a result.
Am I thinking about it correctly, or what is the best way to implement this?
static async Task<string[]> FilterLogFile(string fileLocation)
{
string line;
List<string> matches = new List<string>();
using(TextReader file = File.OpenText(fileLocation))
{
while((line = await file.ReadLineAsync()) != null)
{
CheckForMatch(line, matches);
}
}
return matches.ToArray();
}
The full script: http://share.linqpad.net/29kgbe.linq
In my head I see this reading the file line by line and passing it on for processing in a different thread so it can continue without waiting for a result.
But that's not what your code does. Instead, you will (asynchronously) return an array when all reading is done. If you actually want to asynchronously return the matches one by one, you would need some sort of asynchronous collection. You could use a block from TPL Dataflow for that. For example:
ISourceBlock<string> FilterLogFile(string fileLocation)
{
var block = new BufferBlock<string>();
Task.Run(async () =>
{
string line;
using(TextReader file = File.OpenText(fileLocation))
{
while((line = await file.ReadLineAsync()) != null)
{
var match = GetMatch(line);
if (match != null)
block.Post(match);
}
}
block.Complete();
});
return block;
}
(You would need to add error handling, probably by faulting the returned block.)
You would then link the returned block to another block that will process the results. Or you could read them directly from the block (by using ReceiveAsync()).
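For example, a minimal sketch of the consuming side, assuming a hypothetical ProcessMatch(string) method and an illustrative file path (requires using System.Threading.Tasks.Dataflow;):
async Task ConsumeMatches()
{
    ISourceBlock<string> source = FilterLogFile(@"C:\logs\app.log");

    // Option 1: link the source to another block and let completion propagate.
    var sink = new ActionBlock<string>(match => ProcessMatch(match));
    source.LinkTo(sink, new DataflowLinkOptions { PropagateCompletion = true });
    await sink.Completion;

    // Option 2: read the matches directly from the block instead.
    // while (await source.OutputAvailableAsync())
    //     ProcessMatch(await source.ReceiveAsync());
}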
But looking at the full code, I'm not sure this approach would be that useful to you. Because of the way you process the results (grouping and then ordering by count in each group), you can't do much with them until you have all of them.
Related
I have a WPF app that reads an Outlook .pst file, extracts each message, and saves both it and any attachments as .pdf files. After that's all done, it does some other processing on the files.
I'm currently using a plain old foreach loop for the first part. Here is a rather simplified version of the code...
// These two are used by the WPF UI to display progress
string BusyContent;
ObservableCollection<string> Msgs = new();
// See note lower down about the quick-and-dirty logging
string _logFile = @"C:\Path\To\LogFile.log";
// _allFiles is used to keep a record of all the files we generate. Used after the loop ends
List<string> _allFiles = new();
// nCurr is used to update BusyContent, which is bound to the UI to show progress
int nCurr = 0;
// The messages would really be extracted from the .pst file. Empty list used for simplicity
List<Message> messages = new();
async Task ProcessMessages() {
using StreamWriter logFile = new(_logFile, true);
foreach (Message msg in messages) {
nCurr++;
string fileName = GenerateFileName(msg);
// We log a lot more, but only one shown for simplicity
Log(logFile, $"File: {fileName}");
_allFiles.Add(fileName);
// Let the user know where we are up to
BusyContent = $"Processing message {nCurr}";
// Msgs is bound to a WPF grid, so we need to use Dispatcher to update
Application.Current.Dispatcher.Invoke(() => Msgs.Add(fileName));
// Finally we write out the .pdf files
await ProcessMessage(msg);
}
}
async Task ProcessMessage(Message msg) {
// The methods called here are omitted as they aren't relevant to my questions
await GenerateMessagePdf(msg);
foreach(Attachment a in msg.Attachments) {
string fileName = GenerateFileName(a);
// Note that we update _allFiles here as well as in the main loop
_allFiles.Add(fileName);
await GenerateAttachmentPdf(a);
}
}
static void Log(StreamWriter logFile, string msg) =>
logFile.WriteLine(DateTime.Now.ToString("yyMMdd-HHmmss.fff") + " - " + msg);
This all works fine, but can take quite some time on a large .pst file. I'm wondering if converting this to use Parallel.ForEach would speed things up. I can see the basic usage of this method, but have a few questions, mainly concerned with the class-level variables that are used within the loop...
The logFile variable is passed around. Will this cause issues? This isn't a major problem, as this logging was added as a quick-and-dirty debugging device and really should be replaced with a proper logging framework, but I'd still like to know if what I'm doing would be an issue in the parallel version.
nCurr is updated inside the loop. Is this safe, or is there a better way to do this?
_allFiles is also updated inside the main loop. I'm only adding entries, not reading or removing, but is this safe?
Similarly, _allFiles is updated inside the ProcessMessage method. I guess the answer to this question depends on the previous one.
Is there a problem updating BusyContent and calling Application.Current.Dispatcher.Invoke inside the loop?
Thanks for any help you can give.
First, it is necessary to use thread-safe collections:
ObservableConcurrentCollection<string> Msgs = new();
ConcurrentQueue<string> _allFiles = new();
ObservableConcurrentCollection can be installed through NuGet; ConcurrentQueue lives in the System.Collections.Concurrent namespace.
Special thanks to Theodor Zoulias for pointing out that there is a better option than ConcurrentBag.
Then it is possible to use either Parallel.ForEach or tasks.
Parallel.ForEach uses a Partitioner, which avoids creating more tasks than necessary, and it runs the loop body in parallel on worker threads. So it is better to drop the async and await keywords from the methods that participate in Parallel.ForEach.
async Task ProcessMessages()
{
using StreamWriter logFile = new(_logFile, true);
await Task.Run(() => {
Parallel.ForEach(messages, msg =>
{
var currentCount = Interlocked.Increment(ref nCurr);
string fileName = GenerateFileName(msg);
Log(logFile, $"File: {fileName}");
_allFiles.Enqueue(fileName);
BusyContent = $"Processing message {currentCount}";
ProcessMessage(msg);
});
});
}
int ProcessMessage(Message msg)
{
// The methods called here are omitted as they aren't relevant to my questions
var message = GenerateMessagePdf(msg);
foreach (Attachment a in msg.Attachments)
{
string fileName = GenerateFileName(a);
_allFiles.Enqueue(fileName);
GenerateAttachmentPdf(a);
}
return msg.Id;
}
private string GenerateAttachmentPdf(Attachment a) => string.Empty;
private string GenerateMessagePdf(Message message) => string.Empty;
string GenerateFileName(Attachment attachment) => string.Empty;
string GenerateFileName(Message message) => string.Empty;
void Log(StreamWriter logFile, string msg) =>
logFile.WriteLine(DateTime.Now.ToString("yyMMdd-HHmmss.fff") + " - " + msg);
Another way is awaiting all the tasks. In this case, there is no need to drop the async and await keywords.
async Task ProcessMessages()
{
using StreamWriter logFile = new(_logFile, true);
var messageTasks = messages.Select(msg =>
{
var currentCount = Interlocked.Increment(ref nCurr);
string fileName = GenerateFileName(msg);
Log(logFile, $"File: {fileName}");
_allFiles.Enqueue(fileName);
BusyContent = $"Processing message {currentCount}";
return ProcessMessage(msg);
});
var msgs = await Task.WhenAll(messageTasks);
}
I'm trying to see how to efficiently read in some data from a file, do some parallel work (per line) then write the new line back to the file system.
I know I can do this, one line at a time .. but I was hoping to do this a few lines at a time -or- .. if one line is 'busy' waiting for the async work to complete, then move on to the next line, etc.
Here's some sample data and logic...
Header
SomeId#1, SomeId#2, SomeId#3, Name, Has this line been processed and cleaned(true/false)
File Data
444,2,12,Leia Organa, true
121,33333,4,Han Solo, true
1,2,3,Jane Doe, false
1,4,11,John Doe, false
So the first 2 lines have been processed and I will skip those lines.
The 3rd and 4th line need to be processed. When the data has been checked, I wish to save this back to the file like
1,33333,3,Jane Doe, true
So this is the general logic...
read line
call DoWorkAsync() <-- which could take a second or 5
save this line back to the file again.
I was just hoping that I didn't have to wait for the DoWorkAsync() to complete before I can save then read the next line. I was hoping that I could start reading the next line ... and if the previous line finishes .. fine .. then save that line to the same line number in the file .. and move on again to the next line.
It's like I could have 5 or 10 lines all working at the same time .. waiting for the results to come back from the 3rd party api ... working in parallel or whatever.
Can this be done in .NET? I'm sure .NET has the functionality for this .. I just can't see the pattern to do this.
NOTE: I usually do async/await for I/O intensive operations (like hitting the filesystem or calling some 3rd party api endpoint) vs Parallel.ForEach which I use for cpu intensive work.
NOTE: Why the true/false at the end of the line? Because I can't process all the lines at once. I have api limits.
Other ideas were to have two files, one for PENDING and one for PROCESSED.
Here is a stub of a parallel processor which uses async/await to process lines in batches.
This approach ensures that the same order is preserved when writing.
public async Task ProcessFile()
{
const int parallelism = 5;
using (var readStream = File.OpenRead(@"c:\myinputfile"))
{
// put HERE your logic for skipping to a specific line
// e.g. readStream.Seek(lastPosition, SeekOrigin.Begin);
using (var reader = new StreamReader(readStream))
{
while (!reader.EndOfStream)
{
var tasks = new List<Task<string>>();
for (var i = 0; i < parallelism; i++)
{
var line = await reader.ReadLineAsync();
tasks.Add(DoWorkAsync(line));
if (reader.EndOfStream)
break;
}
var results = await Task.WhenAll(tasks);
using (var writeStream = File.Open(@"d:\myresultfile", FileMode.Append))
using (var writer = new StreamWriter(writeStream))
{
foreach (var line in results)
await writer.WriteLineAsync(line);
}
}
}
}
}
public async Task<string> DoWorkAsync(string line)
{
await Task.Delay(new Random().Next(1000, 5000));
// do some work and return line with last parameter = true
return line.Replace("false", "true"); // e.g.
}
It surely needs improvement, but it should give you a good base for writing your own.
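For instance, one possible refinement (a sketch under the assumption that you keep the same DoWorkAsync; paths and the parallelism value are illustrative) is to throttle with a SemaphoreSlim instead of fixed batches, so one slow line doesn't stall a whole batch, while still writing the results in the original order:
public async Task ProcessFileThrottled(string inputPath, string outputPath, int parallelism = 5)
{
    using var semaphore = new SemaphoreSlim(parallelism);
    var tasks = new List<Task<string>>();

    foreach (var line in File.ReadLines(inputPath))
    {
        await semaphore.WaitAsync();             // wait for a free slot before starting another line
        tasks.Add(Task.Run(async () =>
        {
            try { return await DoWorkAsync(line); }
            finally { semaphore.Release(); }     // free the slot when this line is done
        }));
    }

    // Tasks complete concurrently, but results are written in read order.
    // Note: completed results are buffered in memory until the read loop finishes.
    using var writer = new StreamWriter(outputPath, append: false);
    foreach (var task in tasks)
        await writer.WriteLineAsync(await task);
}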
I'm stuck with .NET 4.0 on a project. StreamReader offers no Async or Begin/End version of ReadLine. The underlying Stream object has BeginRead/BeginEnd but these take a byte array so I'd have to implement the logic for reading line by line.
Is there something in the 4.0 Framework to achieve this?
You can use Task. You don't show the rest of your code, so I don't know exactly what you want to do. I advise you to avoid Task.Wait, because it blocks the calling thread while waiting for the task to finish, which is not really async. If you want to perform some other action after the file has been read in the task, you can use task.ContinueWith.
Here is a full example of how to do it without blocking the UI thread:
static void Main(string[] args)
{
string filePath = #"FILE PATH";
Task<string[]> task = Task.Factory.StartNew(() => ReadFile(filePath)); // Task.Run is not available in .NET 4.0; Task.Factory.StartNew is the 4.0 equivalent
bool stopWhile = false;
//if you want to not block the UI with Task.Wait() for the result
// and you want to perform some other operations with the already read file
Task continueTask = task.ContinueWith((x) => {
string[] result = x.Result; // result of the read file
foreach(var a in result)
{
Console.WriteLine(a);
}
stopWhile = true;
});
//here do other actions not related with the result of the file content
while(!stopWhile)
{
Console.WriteLine("TEST");
}
}
public static string[] ReadFile(string filePath)
{
List<String> lines = new List<String>();
string line = "";
using (StreamReader sr = new StreamReader(filePath))
{
while ((line = sr.ReadLine()) != null)
lines.Add(line);
}
Console.WriteLine("File Readed");
return lines.ToArray();
}
You can use the Task Parallel Library (TPL) to do some of the async behavior you're trying to do.
Wrap the synchronous method in a task:
var asyncTask = Task.Run(() => YourMethod(args, ...));
asyncTask.Wait(); // You can also use Task.WaitAll or other methods if you have several of these that you want to run in parallel.
var result = asyncTask.Result;
If you need to do this a lot for StreamReader, you can then go on to make this into an extension method for StreamReader if you want to simulate the regular async methods. Just take care with the error handling and other quirks of using the TPL.
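For example, a minimal sketch of such an extension method (the ReadLineAsync name here just mirrors the 4.5 API; it only offloads the blocking ReadLine call to a thread-pool thread rather than doing true asynchronous I/O):
public static class StreamReaderExtensions
{
    public static Task<string> ReadLineAsync(this StreamReader reader)
    {
        // Offload the synchronous read; works on .NET 4.0, which has no built-in async version.
        return Task.Factory.StartNew(() => reader.ReadLine());
    }
}

// Usage with ContinueWith (await is not available with the .NET 4.0 / C# 4 toolchain):
// reader.ReadLineAsync().ContinueWith(t => Console.WriteLine(t.Result));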
I have very big files that I have to read and process. Can this be done in parallel using Threading?
Here is a bit of code that I've written, but it doesn't seem to give a shorter execution time than reading and processing the files one after the other.
String[] files = openFileDialog1.FileNames;
Parallel.ForEach(files, f =>
{
readTraceFile(f);
});
private void readTraceFile(String file)
{
StreamReader reader = new StreamReader(file);
String line;
while ((line = reader.ReadLine()) != null)
{
String pattern = "\\s{4,}";
foreach (String trace in Regex.Split(line, pattern))
{
if (trace != String.Empty)
{
String[] details = Regex.Split(trace, "\\s+");
Instruction instruction = new Instruction(details[0],
int.Parse(details[1]),
int.Parse(details[2]));
Console.WriteLine("computing...");
instructions.Add(instruction);
}
}
}
}
It looks like your application's performance is mostly limited by IO. However, you still have a bit of CPU-bound work in your code. These two bits of work are interdependent: your CPU-bound work cannot start until the IO has done its job, and the IO does not move on to the next work item until your CPU has finished with the previous one. They're both holding each other up. Therefore, it is possible (explained at the very bottom) that you will see an improvement in throughput if you perform your IO- and CPU-bound work in parallel, like so:
void ReadAndProcessFiles(string[] filePaths)
{
// Our thread-safe collection used for the handover.
var lines = new BlockingCollection<string>();
// Build the pipeline.
var stage1 = Task.Run(() =>
{
try
{
foreach (var filePath in filePaths)
{
using (var reader = new StreamReader(filePath))
{
string line;
while ((line = reader.ReadLine()) != null)
{
// Hand over to stage 2 and continue reading.
lines.Add(line);
}
}
}
}
finally
{
lines.CompleteAdding();
}
});
var stage2 = Task.Run(() =>
{
// Process lines on a ThreadPool thread
// as soon as they become available.
foreach (var line in lines.GetConsumingEnumerable())
{
String pattern = "\\s{4,}";
foreach (String trace in Regex.Split(line, pattern))
{
if (trace != String.Empty)
{
String[] details = Regex.Split(trace, "\\s+");
Instruction instruction = new Instruction(details[0],
int.Parse(details[1]),
int.Parse(details[2]));
Console.WriteLine("computing...");
instructions.Add(instruction);
}
}
}
});
// Block until both tasks have completed.
// This makes this method prone to deadlocking.
// Consider using 'await Task.WhenAll' instead.
Task.WaitAll(stage1, stage2);
}
I highly doubt that it's your CPU work holding things up, but if it happens to be the case, you can also parallelise stage 2 like so:
var stage2 = Task.Run(() =>
{
var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
Parallel.ForEach(lines.GetConsumingEnumerable(), parallelOptions, line =>
{
String pattern = "\\s{4,}";
foreach (String trace in Regex.Split(line, pattern))
{
if (trace != String.Empty)
{
String[] details = Regex.Split(trace, "\\s+");
Instruction instruction = new Instruction(details[0],
int.Parse(details[1]),
int.Parse(details[2]));
Console.WriteLine("computing...");
instructions.Add(instruction);
}
}
});
});
Mind you, if your CPU work component is negligible in comparison to the IO component, you won't see much speed-up. The more even the workload is, the better the pipeline is going to perform in comparison with sequential processing.
Since we're talking about performance, note that I am not particularly thrilled about the number of blocking calls in the above code. If I were doing this in my own project, I would have gone the async/await route. I chose not to do so in this case because I wanted to keep things easy to understand and easy to integrate.
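For reference, a rough sketch of that async/await route (same BlockingCollection hand-over; ProcessLine is a hypothetical helper standing in for the parsing code shown above, and only the file read and the final wait become asynchronous):
async Task ReadAndProcessFilesAsync(string[] filePaths)
{
    var lines = new BlockingCollection<string>();

    var stage1 = Task.Run(async () =>
    {
        try
        {
            foreach (var filePath in filePaths)
            {
                using (var reader = new StreamReader(filePath))
                {
                    string line;
                    while ((line = await reader.ReadLineAsync()) != null)
                        lines.Add(line);  // hand over to stage 2
                }
            }
        }
        finally
        {
            lines.CompleteAdding();
        }
    });

    var stage2 = Task.Run(() =>
    {
        foreach (var line in lines.GetConsumingEnumerable())
            ProcessLine(line);  // the same regex/parsing work as in stage 2 above
    });

    // Awaiting instead of blocking frees the calling thread and avoids the deadlock risk.
    await Task.WhenAll(stage1, stage2);
}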
From the look of what you are trying to do, you are almost certainly I/O bound. Attempting parallel processing in this case will not help and may in fact slow down processing due to additional seek operations on the disk drives (unless you can have the data split over multiple spindles).
Try processing the lines in parallel instead. For example:
var q = from file in files
from line in File.ReadLines(file).AsParallel() // for smaller files File.ReadAllLines(file).AsParallel() might be faster
from trace in line.Split(new [] {"    "}, StringSplitOptions.RemoveEmptyEntries) // split by 4 spaces and no need for trace != "" check
let details = trace.Split(null as char[], StringSplitOptions.RemoveEmptyEntries) // like Regex.Split(trace, "\\s+") but removes empty strings too
select new Instruction(details[0], int.Parse(details[1]), int.Parse(details[2]));
List<Instruction> instructions = q.ToList(); // all of the file reads and work is done here with .ToList
Random access to a non-SSD hard drive (when you try to read/write different files at the same time, or a fragmented file) is usually much slower than sequential access (for example, reading a single defragmented file), so I expect processing a single file in parallel to be faster with defragmented files.
Also, sharing resources across the threads (for example Console.Write or adding to a thread safe blocking collection) can slow down or block/deadlock the execution, because some of the threads will have to wait for the other threads to finish accessing that resource.
var entries = new ConcurrentBag<object>();
var files = Directory.GetFiles(path, "*.txt", SearchOption.AllDirectories);
int fileCounter = 0;
Parallel.ForEach(files.ToList(), file =>
{
var lines = File.ReadAllLines(file, Encoding.Default);
entries.Add(new { lineCount = lines.Length });
Interlocked.Increment(ref fileCounter);
});
I am educating myself on Parallel.Invoke, and parallel processing in general, for use in my current project. I need a push in the right direction to understand how you can dynamically/intelligently allocate more parallel 'threads' as required.
As an example. Say you are parsing large log files. This involves reading from file, some sort of parsing of the returned lines and finally writing to a database.
So to me this is a typical problem that can benefit from parallel processing.
As a simple first pass the following code implements this.
Parallel.Invoke(
()=> readFileLinesToBuffer(),
()=> parseFileLinesFromBuffer(),
()=> updateResultsToDatabase()
);
Behind the scenes
readFileLinesToBuffer() reads each line and stores it in a buffer.
parseFileLinesFromBuffer() comes along, consumes lines from that buffer and, let's say, puts them on another buffer so that updateResultsToDatabase() can come along and consume it.
So the code shown assumes that each of the three steps uses the same amount of time/resources, but let's say parseFileLinesFromBuffer() is a long-running process, so instead of running just one instance of that method you want to run two in parallel.
How can you have the code intelligently decide to do this based on any bottlenecks it might perceive?
Conceptually I can see how some approach of monitoring the buffer sizes might work, spawning a new 'thread' to consume the buffer at an increased rate for example...but I figure this type of issue has been considered in putting together the TPL library.
Some sample code would be great but I really just need a clue as to what concepts I should investigate next. It looks like maybe the System.Threading.Tasks.TaskScheduler holds the key?
Have you tried the Reactive Extensions?
http://msdn.microsoft.com/en-us/data/gg577609.aspx
Rx is a new technology from Microsoft; its focus, as stated on the official site:
The Reactive Extensions (Rx) ... is a library to compose asynchronous and event-based programs using observable collections and LINQ-style query operators.
You can download it as a NuGet package:
https://nuget.org/packages/Rx-Main/1.0.11226
Since I am currently learning Rx, I wanted to take this example and just write code for it. The code I ended up with is not actually executed in parallel, but it is completely asynchronous and guarantees the source lines are processed in order.
Perhaps this is not the best implementation, but like I said, I am learning Rx (making it thread-safe would be a good improvement).
This is a DTO that I am using to return data from the background threads
class MyItem
{
public string Line { get; set; }
public int CurrentThread { get; set; }
}
These are the basic methods doing the real work. I am simulating the time with a simple Thread.Sleep, and I am returning the thread used to execute each method via Thread.CurrentThread.ManagedThreadId. Note that ProcessLine sleeps for 4 seconds; it's the most time-consuming operation.
private IEnumerable<MyItem> ReadLinesFromFile(string fileName)
{
var source = from e in Enumerable.Range(1, 10)
let v = e.ToString()
select v;
foreach (var item in source)
{
Thread.Sleep(1000);
yield return new MyItem { CurrentThread = Thread.CurrentThread.ManagedThreadId, Line = item };
}
}
private MyItem UpdateResultToDatabase(string processedLine)
{
Thread.Sleep(700);
return new MyItem { Line = "s" + processedLine, CurrentThread = Thread.CurrentThread.ManagedThreadId };
}
private MyItem ProcessLine(string line)
{
Thread.Sleep(4000);
return new MyItem { Line = "p" + line, CurrentThread = Thread.CurrentThread.ManagedThreadId };
}
I am using the following method just to update the UI:
private void DisplayResults(MyItem myItem, Color color, string message)
{
this.listView1.Items.Add(
new ListViewItem(
new[]
{
message,
myItem.Line ,
myItem.CurrentThread.ToString(),
Thread.CurrentThread.ManagedThreadId.ToString()
}
)
{
ForeColor = color
}
);
}
And finally this is the method that calls the Rx API
private void PlayWithRx()
{
// we initialize the observable with the lines read from the file
var source = this.ReadLinesFromFile("some file").ToObservable(Scheduler.TaskPool);
source.ObserveOn(this).Subscribe(x =>
{
// for each line read, we update the UI
this.DisplayResults(x, Color.Red, "Read");
// for each line read, we subscribe the line to the ProcessLine method
var process = Observable.Start(() => this.ProcessLine(x.Line), Scheduler.TaskPool)
.ObserveOn(this).Subscribe(c =>
{
// for each line processed, we update the UI
this.DisplayResults(c, Color.Blue, "Processed");
// for each line processed we subscribe to the final process the UpdateResultToDatabase method
// finally, we update the UI when the line processed has been saved to the database
var persist = Observable.Start(() => this.UpdateResultToDatabase(c.Line), Scheduler.TaskPool)
.ObserveOn(this).Subscribe(z => this.DisplayResults(z, Color.Black, "Saved"));
});
});
}
This process runs entirely in the background.
In an async/await world, you'd have something like:
public async Task ProcessFileAsync(string filename)
{
var lines = await ReadLinesFromFileAsync(filename);
var parsed = await ParseLinesAsync(lines);
await UpdateDatabaseAsync(parsed);
}
Then a caller could just do var tasks = filenames.Select(ProcessFileAsync).ToArray(); followed by whatever fits the context (Task.WaitAll, Task.WhenAll, etc.).
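For example, a minimal sketch of such a caller (assuming the ProcessFileAsync above):
public async Task ProcessAllFilesAsync(IEnumerable<string> filenames)
{
    var tasks = filenames.Select(ProcessFileAsync).ToArray();
    await Task.WhenAll(tasks);  // or Task.WaitAll(tasks) from a synchronous caller
}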
Use a couple of BlockingCollections. Here is an example.
The idea is that you create a producer that puts data into the collection:
while (true) {
var data = ReadData();
blockingCollection1.Add(data);
}
Then you create any number of consumers that read from the collection:
while (true) {
var data = blockingCollection1.Take();
var processedData = ProcessData(data);
blockingCollection2.Add(processedData);
}
and so on
You can also let the TPL handle the number of consumers by using Parallel.ForEach:
Parallel.ForEach(blockingCollection1.GetConsumingPartitioner(),
data => {
var processedData = ProcessData(data);
blockingCollection2.Add(processedData);
});
(Note that you need to use GetConsumingPartitioner, not GetConsumingEnumerable; see here.)
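Putting the pieces together, a minimal end-to-end sketch might look like the following; ProcessData and SaveData are trivial stand-ins for the real parse/database steps, and "input.log" is an illustrative path (requires System.Collections.Concurrent, System.IO, System.Linq and System.Threading.Tasks):
static string ProcessData(string line) => line.ToUpperInvariant();   // placeholder parse step
static void SaveData(string result) { /* placeholder: write to the database */ }

static void RunPipeline()
{
    var rawLines = new BlockingCollection<string>(boundedCapacity: 100);
    var processed = new BlockingCollection<string>(boundedCapacity: 100);

    var producer = Task.Run(() =>
    {
        try
        {
            foreach (var line in File.ReadLines("input.log"))
                rawLines.Add(line);
        }
        finally
        {
            rawLines.CompleteAdding();           // signal: no more raw lines
        }
    });

    var consumers = Enumerable.Range(0, 2).Select(_ => Task.Run(() =>
    {
        foreach (var line in rawLines.GetConsumingEnumerable())
            processed.Add(ProcessData(line));    // two consumers drain the first buffer
    })).ToArray();

    var writer = Task.Run(() =>
    {
        foreach (var result in processed.GetConsumingEnumerable())
            SaveData(result);
    });

    Task.WaitAll(consumers);
    processed.CompleteAdding();                  // consumers done -> close the second buffer
    Task.WaitAll(producer, writer);
}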