I have a WPF app that reads an Outlook .pst file, extracts each message, and saves both it and any attachments as .pdf files. After that's all done, it does some other processing on the files.
I'm currently using a plain old foreach loop for the first part. Here is a rather simplified version of the code...
// These two are used by the WPF UI to display progress
string BusyContent;
ObservableCollection<string> Msgs = new();
// See note lower down about the quick-and-dirty logging
string _logFile = #"C:\Path\To\LogFile.log";
// _allFiles is used to keep a record of all the files we generate. Used after the loop ends
List<string> _allFiles = new();
// nCurr is used to update BusyContent, which is bound to the UI to show progress
int nCurr = 0;
// The messages would really be extracted from the .pst file. Empty list used for simplicity
List<Message> messages = new();
async Task ProcessMessages() {
using StreamWriter logFile = new(_logFile, true);
foreach (Message msg in messages) {
nCurr++;
string fileName = GenerateFileName(msg);
// We log a lot more, but only one shown for simplicity
Log(logFile, $"File: {fileName}");
_allFiles.Add(fileName);
// Let the user know where we are up to
BusyContent = $"Processing message {nCurr}";
// Msgs is bound to a WPF grid, so we need to use Dispatcher to update
Application.Current.Dispatcher.Invoke(() => Msgs.Add(fileName));
// Finally we write out the .pdf files
await ProcessMessage(msg);
}
}
async Task ProcessMessage(Message msg) {
// The methods called here are omitted as they aren't relevant to my questions
await GenerateMessagePdf(msg);
foreach(Attachment a in msg.Attachments) {
string fileName = GenerateFileName(a);
// Note that we update _allFiles here as well as in the main loop
_allFiles.Add(fileName);
await GenerateAttachmentPdf(a);
}
}
static void Log(StreamWriter logFile, string msg) =>
logFile.WriteLine(DateTime.Now.ToString("yyMMdd-HHmmss.fff") + " - " + msg);
This all works fine, but can take quite some time on a large .pst file. I'm wondering if converting this to use Parallel.ForEach would speed things up. I can see the basic usage of this method, but have a few questions, mainly concerned with the class-level variables that are used within the loop...
The logFile variable is passed around. Will this cause issues? This isn't a major problem, as this logging was added as a quick-and-dirty debugging device, and really should be replaced with a proper logging framework, but I'd still like to know if what I'm dong would be an issue in the parallel version
nCurr is updated inside the loop. Is this safe, or is there a better way to do this?
_allFiles is also updated inside the main loop. I'm only adding entries, not reading or removing, but is this safe?
Similarly, _allFiles is updated inside the ProcessMessage method. I guess the answer to this question depends on the previous one.
Is there a problem updating BusyContent and calling Application.Current.Dispatcher.Invoke inside the loop?
Thanks for any help you can give.
At first, it is necessary to use thread safe collections:
ObservableConcurrentCollection<string> Msgs = new();
ConcurrentQueue<string> _allFiles = new();
ObservableConcurrentCollection can be installed through NuGet. ConcurrentQueue is located in using System.Collections.Concurrent;.
Special thanks to Theodor Zoulias for the pointing out that there is better option for ConcurentBag.
And then it is possible to use Parallel.ForEachor Task.
Parallel.ForEach uses Partitioner which allows to avoid creation more tasks than necessary. So it tries to run each method in parallel. So it is better to exclude async and await keywords of methods which participate in Parallel.ForEach.
async Task ProcessMessages()
{
using StreamWriter logFile = new(_logFile, true);
await Task.Run(() => {
Parallel.ForEach(messages, msg =>
{
var currentCount = Interlocked.Increment(ref nCurr);
string fileName = GenerateFileName(msg);
Log(logFile, $"File: {fileName}");
_allFiles.Enqueue(fileName);
BusyContent = $"Processing message {currentCount}";
ProcessMessage(msg);
});
});
}
int ProcessMessage(Message msg)
{
// The methods called here are omitted as they aren't relevant to my questions
var message = GenerateMessagePdf(msg);
foreach (Attachment a in msg.Attachments)
{
string fileName = GenerateFileName(a);
_allFiles.Enqueue(fileName);
GenerateAttachmentPdf(a);
}
return msg.Id;
}
private string GenerateAttachmentPdf(Attachment a) => string.Empty;
private string GenerateMessagePdf(Message message) => string.Empty;
string GenerateFileName(Attachment attachment) => string.Empty;
string GenerateFileName(Message message) => string.Empty;
void Log(StreamWriter logFile, string msg) =>
logFile.WriteLine(DateTime.Now.ToString("yyMMdd-HHmmss.fff") + " - " + msg);
And another way is awaiting all tasks. In this case, there is no need to exclude async and await keywords.
async Task ProcessMessages()
{
using StreamWriter logFile = new(_logFile, true);
var messageTasks = messages.Select(msg =>
{
var currentCount = Interlocked.Increment(ref nCurr);
string fileName = GenerateFileName(msg);
Log(logFile, $"File: {fileName}");
_allFiles.Enqueue(fileName);
BusyContent = $"Processing message {currentCount}";
return ProcessMessage(msg);
});
var msgs = await Task.WhenAll(messageTasks);
}
Related
I'm stuck with .NET 4.0 on a project. StreamReader offers no Async or Begin/End version of ReadLine. The underlying Stream object has BeginRead/BeginEnd but these take a byte array so I'd have to implement the logic for reading line by line.
Is there something in the 4.0 Framework to achieve this?
You can use Task. You don't specify other part of your code so I don't know what you want to do. I advise you to avoid using Task.Wait because this is blocking the UI thread and waiting for the task to finish, which became not really async ! If you want to do some other action after the file is readed in the task, you can use task.ContinueWith.
Here full example, how to do it without blocking the UI thread
static void Main(string[] args)
{
string filePath = #"FILE PATH";
Task<string[]> task = Task.Run<string[]>(() => ReadFile(filePath));
bool stopWhile = false;
//if you want to not block the UI with Task.Wait() for the result
// and you want to perform some other operations with the already read file
Task continueTask = task.ContinueWith((x) => {
string[] result = x.Result; //result of readed file
foreach(var a in result)
{
Console.WriteLine(a);
}
stopWhile = true;
});
//here do other actions not related with the result of the file content
while(!stopWhile)
{
Console.WriteLine("TEST");
}
}
public static string[] ReadFile(string filePath)
{
List<String> lines = new List<String>();
string line = "";
using (StreamReader sr = new StreamReader(filePath))
{
while ((line = sr.ReadLine()) != null)
lines.Add(line);
}
Console.WriteLine("File Readed");
return lines.ToArray();
}
You can use the Task Parallel Library (TPL) to do some of the async behavior you're trying to do.
Wrap the synchronous method in a task:
var asyncTask = Task.Run(() => YourMethod(args, ...));
var asyncTask.Wait(); // You can also Task.WaitAll or other methods if you have several of these that you want to run in parallel.
var result = asyncTask.Result;
If you need to do this a lot for StreamReader, you can then go on to make this into an extension method for StreamReader if you want to simulate the regular async methods. Just take care with the error handling and other quirks of using the TPL.
I'm just starting out with C#'s new async features. I've read plenty of how-to's now on parallel downloads etc. but nothing on reading/processing a text file.
I had an old script I use to filter a log file and figured I'd have a go at upgrading it. However I'm unsure if my usage of the new async/await syntax is correct.
In my head I see this reading the file line by line and passing it on for processing in different thread so it can continue without waiting for a result.
Am I thinking about it correctly, or what is the best way to implement this?
static async Task<string[]> FilterLogFile(string fileLocation)
{
string line;
List<string> matches = new List<string>();
using(TextReader file = File.OpenText(fileLocation))
{
while((line = await file.ReadLineAsync()) != null)
{
CheckForMatch(line, matches);
}
}
return matches.ToArray();
}
The full script: http://share.linqpad.net/29kgbe.linq
In my head I see this reading the file line by line and passing it on for processing in different thread so it can continue without waiting for a result.
But that's not what your code does. Instead, you will (asynchronously) return an array when all reading is done. If you actually want to asynchronously return the matches one by one, you would need some sort of asynchronous collection. You could use a block from TPL Dataflow for that. For example:
ISourceBlock<string> FilterLogFile(string fileLocation)
{
var block = new BufferBlock<string>();
Task.Run(async () =>
{
string line;
using(TextReader file = File.OpenText(fileLocation))
{
while((line = await file.ReadLineAsync()) != null)
{
var match = GetMatch(line);
if (match != null)
block.Post(match);
}
}
block.Complete();
});
return block;
}
(You would need to add error handling, probably by faulting the returned block.)
You would then link the returned block to another block that will process the results. Or you could read them directly from the block (by using ReceiveAsync()).
But looking at the full code, I'm not sure this approach would be that useful to you. Because of the way you process the results (grouping and then ordering by count in each group), you can't do much with them until you have all of them.
I'm writing the video converting tool for one web site.
Logic is next
User upload the file
System conver it for user (of course user should not wait while converting)
I have a code:
public async void AddVideo(HotelVideosViewModel model)
{
var sourceFile = FileHandler.SaveFile(model.VideoFile);
var fullFilePath = HttpContext.Current.Server.MapPath(sourceFile);
await Task.Factory.StartNew(() => video.Convert(fullFilePath));
model.FileLocation = sourceFile;
model.IsConverted = true;
this.Add<HotelVideos>(Mapper.Map<HotelVideosViewModel, HotelVideos>(model));
}
This start encoding:
await Task.Factory.StartNew(() => video.Convert(fullFilePath));
But the code after this is never executed... Does somebody has a solution?
PS: under video.Conver(...) is have something like:
public bool Convert(string fileLocation)
{
// do staff
}
--- UPDATE ---
Just got the interesting thing. If i'm going to exchange video.Convert(fullFilePath) to Thread.Sleep(2000) everything start to work. So the issue is in Second method i think.
So i'm adding the next file code (draft now):
using Softpae.Media;
/// <summary>
/// TODO: Update summary.
/// </summary>
public class VideoConvertManager
{
private readonly Job2Convert jobConverter = new Job2Convert();
private readonly MediaServer mediaServer = new MediaServer();
public bool Convert(string fileLocation)
{
this.jobConverter.pszSrcFile = fileLocation;
this.jobConverter.pszDstFile = fileLocation + ".mp4";
this.jobConverter.pszDstFormat = "mp4";
this.jobConverter.pszAudioCodec = null;
this.jobConverter.pszVideoCodec = "h264";
if (this.mediaServer.ConvertFile(this.jobConverter))
{
FileHandler.DeleteFile(fileLocation);
return true;
};
FileHandler.DeleteFile(fileLocation);
return false;
}
}
You need to make AddVideo return Task, and await its result before completing the ASP.NET request.
Update: DO NOT call Wait or other blocking methods on an asynchronous Task! This causes deadlock.
You may find my async intro helpful.
P.S. async doesn't change the way HTTP works. You still only have one response per request. If you want your end user (e.g., browser) to not block, then you'll have to implement some kind of queue system where the browser can start video conversion jobs and be notified of their result (e.g., using SignalR). async won't just magically make this happen - it only works within the context of a single request/response pair.
Adding ConfigureAwait(false) will keep the continuation from being forced on the request thread
I am educating myself on Parallel.Invoke, and parallel processing in general, for use in current project. I need a push in the right direction to understand how you can dynamically\intelligently allocate more parallel 'threads' as required.
As an example. Say you are parsing large log files. This involves reading from file, some sort of parsing of the returned lines and finally writing to a database.
So to me this is a typical problem that can benefit from parallel processing.
As a simple first pass the following code implements this.
Parallel.Invoke(
()=> readFileLinesToBuffer(),
()=> parseFileLinesFromBuffer(),
()=> updateResultsToDatabase()
);
Behind the scenes
readFileLinesToBuffer() reads each line and stores to a buffer.
parseFileLinesFromBuffer comes along and consumes lines from buffer and then let's say it put them on another buffer so that updateResultsToDatabase() can come along and consume this buffer.
So the code shown assumes that each of the three steps uses the same amount of time\resources but lets say the parseFileLinesFromBuffer() is a long running process so instead of running just one of these methods you want to run two in parallel.
How can you have the code intelligently decide to do this based on any bottlenecks it might perceive?
Conceptually I can see how some approach of monitoring the buffer sizes might work, spawning a new 'thread' to consume the buffer at an increased rate for example...but I figure this type of issue has been considered in putting together the TPL library.
Some sample code would be great but I really just need a clue as to what concepts I should investigate next. It looks like maybe the System.Threading.Tasks.TaskScheduler holds the key?
Have you tried the Reactive Extensions?
http://msdn.microsoft.com/en-us/data/gg577609.aspx
The Rx is a new tecnology from Microsoft, the focus as stated in the official site:
The Reactive Extensions (Rx)... ...is a library to compose
asynchronous and event-based programs using observable collections and
LINQ-style query operators.
You can download it as a Nuget package
https://nuget.org/packages/Rx-Main/1.0.11226
Since I am currently learning Rx I wanted to take this example and just write code for it, the code I ended up it is not actually executed in parallel, but it is completely asynchronous, and guarantees the source lines are executed in order.
Perhaps this is not the best implementation, but like I said I am learning Rx, (thread-safe should be a good improvement)
This is a DTO that I am using to return data from the background threads
class MyItem
{
public string Line { get; set; }
public int CurrentThread { get; set; }
}
These are the basic methods doing the real work, I am simulating the time with a simple Thread.Sleep and I am returning the thread used to execute each method Thread.CurrentThread.ManagedThreadId. Note the timer of the ProcessLine it is 4 sec, it's the most time-consuming operation
private IEnumerable<MyItem> ReadLinesFromFile(string fileName)
{
var source = from e in Enumerable.Range(1, 10)
let v = e.ToString()
select v;
foreach (var item in source)
{
Thread.Sleep(1000);
yield return new MyItem { CurrentThread = Thread.CurrentThread.ManagedThreadId, Line = item };
}
}
private MyItem UpdateResultToDatabase(string processedLine)
{
Thread.Sleep(700);
return new MyItem { Line = "s" + processedLine, CurrentThread = Thread.CurrentThread.ManagedThreadId };
}
private MyItem ProcessLine(string line)
{
Thread.Sleep(4000);
return new MyItem { Line = "p" + line, CurrentThread = Thread.CurrentThread.ManagedThreadId };
}
The following method I am using it just to update the UI
private void DisplayResults(MyItem myItem, Color color, string message)
{
this.listView1.Items.Add(
new ListViewItem(
new[]
{
message,
myItem.Line ,
myItem.CurrentThread.ToString(),
Thread.CurrentThread.ManagedThreadId.ToString()
}
)
{
ForeColor = color
}
);
}
And finally this is the method that calls the Rx API
private void PlayWithRx()
{
// we init the observavble with the lines read from the file
var source = this.ReadLinesFromFile("some file").ToObservable(Scheduler.TaskPool);
source.ObserveOn(this).Subscribe(x =>
{
// for each line read, we update the UI
this.DisplayResults(x, Color.Red, "Read");
// for each line read, we subscribe the line to the ProcessLine method
var process = Observable.Start(() => this.ProcessLine(x.Line), Scheduler.TaskPool)
.ObserveOn(this).Subscribe(c =>
{
// for each line processed, we update the UI
this.DisplayResults(c, Color.Blue, "Processed");
// for each line processed we subscribe to the final process the UpdateResultToDatabase method
// finally, we update the UI when the line processed has been saved to the database
var persist = Observable.Start(() => this.UpdateResultToDatabase(c.Line), Scheduler.TaskPool)
.ObserveOn(this).Subscribe(z => this.DisplayResults(z, Color.Black, "Saved"));
});
});
}
This process runs totally in the background, this is the output generated:
in an async/await world, you'd have something like:
public async Task ProcessFileAsync(string filename)
{
var lines = await ReadLinesFromFileAsync(filename);
var parsed = await ParseLinesAsync(lines);
await UpdateDatabaseAsync(parsed);
}
then a caller could just do var tasks = filenames.Select(ProcessFileAsync).ToArray(); and whatever (WaitAll, WhenAll, etc, depending on context)
Use a couple of BlockingCollection. Here is an example
The idea is that you create a producer that puts data into the collection
while (true) {
var data = ReadData();
blockingCollection1.Add(data);
}
Then you create any number of consumers that reads from the collection
while (true) {
var data = blockingCollection1.Take();
var processedData = ProcessData(data);
blockingCollection2.Add(processedData);
}
and so on
You can also let TPL handle the number of consumers by using Parallel.Foreach
Parallel.ForEach(blockingCollection1.GetConsumingPartitioner(),
data => {
var processedData = ProcessData(data);
blockingCollection2.Add(processedData);
});
(note that you need to use GetConsumingPartitioner not GetConsumingEnumerable (see here)
I've just begun to explore the TPL and have a design question.
My Scenario:
I have a list of URLs that each refer to an image. I want each image to be downloaded in parallel. As soon as at least one image is downloaded, I want to execute a method that does something with the downloaded image. That method should NOT be parallelized -- it should be serial.
I think the following will work but I'm not sure if this is the right way to do it. Because I have separate classes for collecting the images and for doing "something" with the collected images, I end up passing around an array of Tasks which seems wrong since it exposes the inner workings of how images are retrieved. But I don't know a way around it. In reality there is more to both of these methods but that's not important for this. Just know that they really shouldn't be lumped into one large method that both retrieves and does something with the image.
//From the Director class
Task<Image>[] downloadTasks = collector.RetrieveImages(listOfURLs);
for (int i = 0; i < listOfURLs.Count; i++)
{
//Wait for any of the remaining downloads to complete
int completedIndex = Task<Image>.WaitAny(downloadTasks);
Image completedImage = downloadTasks[completedIndex].Result;
//Now do something with the image (this "something" must happen serially)
//Uses the "Formatter" class to accomplish this let's say
}
///////////////////////////////////////////////////
//From the Collector class
public Task<Image>[] RetrieveImages(List<string> urls)
{
Task<Image>[] tasks = new Task<Image>[urls.Count];
int index = 0;
foreach (string url in urls)
{
string lambdaVar = url; //Required... Bleh
tasks[index] = Task<Image>.Factory.StartNew(() =>
{
using (WebClient client = new WebClient())
{
//TODO: Replace with live image locations
string fileName = String.Format("{0}.png", i);
client.DownloadFile(lambdaVar, Path.Combine(
Application.StartupPath, fileName));
}
return Image.FromFile(Path.Combine(Application.StartupPath, fileName));
},
TaskCreationOptions.LongRunning | TaskCreationOptions.AttachedToParent);
index++;
}
return tasks;
}
Typically you use WaitAny to wait for one task when you don't care about the results of any of the others. For example if you just cared about the first image that happened to get returned.
How about this instead.
This creates two tasks, one which loads images and adds them to a blocking collection. The second task waits on the collection and processes any images added to the queue. When all the images are loaded the first task closes the queue down so the second task can shut down.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Net;
using System.Threading.Tasks;
namespace ClassLibrary1
{
public class Class1
{
readonly string _path = Directory.GetCurrentDirectory();
public void Demo()
{
IList<string> listOfUrls = new List<string>();
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/editicon.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/favorite-star-on.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/arrow_dsc_green.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/editicon.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/favorite-star-on.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/arrow_dsc_green.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/editicon.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/favorite-star-on.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/arrow_dsc_green.gif");
BlockingCollection<Image> images = new BlockingCollection<Image>();
Parallel.Invoke(
() => // Task 1: load the images
{
Parallel.For(0, listOfUrls.Count, (i) =>
{
Image img = RetrieveImages(listOfUrls[i], i);
img.Tag = i;
images.Add(img); // Add each image to the queue
});
images.CompleteAdding(); // Done with images.
},
() => // Task 2: Process images serially
{
foreach (var img in images.GetConsumingEnumerable())
{
string newPath = Path.Combine(_path, String.Format("{0}_rot.png", img.Tag));
Console.WriteLine("Rotating image {0}", img.Tag);
img.RotateFlip(RotateFlipType.RotateNoneFlipXY);
img.Save(newPath);
}
});
}
public Image RetrieveImages(string url, int i)
{
using (WebClient client = new WebClient())
{
string fileName = Path.Combine(_path, String.Format("{0}.png", i));
Console.WriteLine("Downloading {0}...", url);
client.DownloadFile(url, Path.Combine(_path, fileName));
Console.WriteLine("Saving {0} as {1}.", url, fileName);
return Image.FromFile(Path.Combine(_path, fileName));
}
}
}
}
WARNING: The code doesn't have any error checking or cancelation. It's late and you need something to do right? :)
This is an example of the pipeline pattern. It assumes that getting an image is pretty slow and that the cost of locking inside the blocking collection isn't going to cause a problem because it happens relatively infrequently compared to the time spent downloading images.
Our book... You can read more about this and other patterns for parallel programming at http://parallelpatterns.codeplex.com/
Chapter 7 covers pipelines and the accompanying examples show pipelines with error handling and cancellation.
TPL already provides the ContinueWith function to execute one task when another finishes. Task chaining is one of the main patterns used in TPL for asynchronous operations.
The following method downloads a set of images and continues by renaming each of the files
static void DownloadInParallel(string[] urls)
{
var tempFolder = Path.GetTempPath();
var downloads = from url in urls
select Task.Factory.StartNew<string>(() =>{
using (var client = new WebClient())
{
var uri = new Uri(url);
string file = Path.Combine(tempFolder,uri.Segments.Last());
client.DownloadFile(uri, file);
return file;
}
},TaskCreationOptions.LongRunning|TaskCreationOptions.AttachedToParent)
.ContinueWith(t=>{
var filePath = t.Result;
File.Move(filePath, filePath + ".test");
},TaskContinuationOptions.ExecuteSynchronously);
var results = downloads.ToArray();
Task.WaitAll(results);
}
You should also check the WebClient Async Tasks from the ParallelExtensionsExtras samples. The DownloadXXXTask extension methods handle both the creation of tasks and the asynchronous downloading of files.
The following method uses the DownloadDataTask extension to get the image's data and rotate it before saving it to disk
static void DownloadInParallel2(string[] urls)
{
var tempFolder = Path.GetTempPath();
var downloads = from url in urls
let uri=new Uri(url)
let filePath=Path.Combine(tempFolder,uri.Segments.Last())
select new WebClient().DownloadDataTask(uri)
.ContinueWith(t=>{
var img = Image.FromStream(new MemoryStream(t.Result));
img.RotateFlip(RotateFlipType.RotateNoneFlipY);
img.Save(filePath);
},TaskContinuationOptions.ExecuteSynchronously);
var results = downloads.ToArray();
Task.WaitAll(results);
}
The best way to do this would probably be by implementing the Observer pattern: have your RetreiveImages function implement IObservable, put your "completed image action" into an IObserver object's OnNext method, and subscribe it to RetreiveImages.
I haven't tried this myself yet (still have to play more with the task library) but I think this is the "right" way to do it.
//download all images
private async void GetAllImages ()
{
var downloadTasks = listOfURLs.Where(url => !string.IsNullOrEmpty(url)).Select(async url =>
{
var ret = await RetrieveImage(url);
return ret;
}).ToArray();
var counts = await Task.WhenAll(downloadTasks);
}
//From the Collector class
public async Task<Image> RetrieveImage(string url)
{
var lambdaVar = url; //Required... Bleh
using (WebClient client = new WebClient())
{
//TODO: Replace with live image locations
var fileName = String.Format("{0}.png", i);
await client.DownloadFile(lambdaVar, Path.Combine(Application.StartupPath, fileName));
}
return Image.FromFile(Path.Combine(Application.StartupPath, fileName));
}