Is there a .NET 4.0 replacement for StreamReader.ReadLineAsync? - c#

I'm stuck with .NET 4.0 on a project. StreamReader offers no Async or Begin/End version of ReadLine. The underlying Stream object has BeginRead/EndRead, but these take a byte array, so I'd have to implement the logic for reading line by line myself.
Is there something in the 4.0 Framework to achieve this?

You can use a Task. You don't show the rest of your code, so I don't know exactly what you want to do. I advise you to avoid Task.Wait, because it blocks the calling thread (the UI thread, in a UI application) until the task finishes, which is not really asynchronous. If you want to do something after the file has been read in the task, you can use task.ContinueWith.
Here is a full example of how to do it without blocking the UI thread:
static void Main(string[] args)
{
    string filePath = @"FILE PATH";
    // Task.Run only exists from .NET 4.5 onwards; on .NET 4.0 use Task.Factory.StartNew
    Task<string[]> task = Task.Factory.StartNew(() => ReadFile(filePath));
    bool stopWhile = false;
    // If you don't want to block the UI with Task.Wait() for the result
    // and you want to perform some other operations with the already read file
    Task continueTask = task.ContinueWith((x) => {
        string[] result = x.Result; // lines of the file that was read
        foreach (var a in result)
        {
            Console.WriteLine(a);
        }
        stopWhile = true;
    });
    // Here do other actions not related to the file content
    while (!stopWhile)
    {
        Console.WriteLine("TEST");
    }
}
public static string[] ReadFile(string filePath)
{
    List<String> lines = new List<String>();
    string line = "";
    using (StreamReader sr = new StreamReader(filePath))
    {
        while ((line = sr.ReadLine()) != null)
            lines.Add(line);
    }
    Console.WriteLine("File read");
    return lines.ToArray();
}

You can use the Task Parallel Library (TPL) to do some of the async behavior you're trying to do.
Wrap the synchronous method in a task:
// On .NET 4.0 use Task.Factory.StartNew; Task.Run was added in .NET 4.5
var asyncTask = Task.Factory.StartNew(() => YourMethod(args, ...));
asyncTask.Wait(); // You can also use Task.WaitAll or other methods if you have several of these that you want to run in parallel.
var result = asyncTask.Result;
If you need to do this a lot for StreamReader, you can then go on to make this into an extension method for StreamReader if you want to simulate the regular async methods. Just take care with the error handling and other quirks of using the TPL.
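The extension-method idea could be sketched roughly as follows. This is a minimal illustration that compiles against .NET 4.0, not a framework API: the name ReadLineTaskAsync is hypothetical, and the blocking ReadLine call is simply moved to a thread-pool thread, so a thread is still occupied while the read is in progress. A StreamReader is not thread-safe, so only one such call should be outstanding per reader at a time.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

public static class StreamReaderExtensions
{
    // Hypothetical helper (not part of the framework): runs the blocking
    // ReadLine call on a thread-pool thread and exposes it as a Task<string>.
    public static Task<string> ReadLineTaskAsync(this StreamReader reader)
    {
        if (reader == null) throw new ArgumentNullException("reader");
        // Task.Factory.StartNew is available on .NET 4.0 (Task.Run is not).
        return Task.Factory.StartNew(() => reader.ReadLine());
    }
}

public static class Program
{
    public static void Main()
    {
        byte[] data = Encoding.UTF8.GetBytes("first\nsecond");
        using (var reader = new StreamReader(new MemoryStream(data)))
        {
            Task<string> lineTask = reader.ReadLineTaskAsync();
            // ContinueWith runs once the line is available, without blocking here.
            Task printTask = lineTask.ContinueWith(t => Console.WriteLine(t.Result));
            // Waiting is acceptable in a console demo; avoid it on a UI thread.
            printTask.Wait();
        }
    }
}
```

A faulted read surfaces as an AggregateException when Result or Wait is used, which is one of the TPL quirks mentioned above.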

Related

How do I update shared state with Parallel.ForEach

I have a WPF app that reads an Outlook .pst file, extracts each message, and saves both it and any attachments as .pdf files. After that's all done, it does some other processing on the files.
I'm currently using a plain old foreach loop for the first part. Here is a rather simplified version of the code...
// These two are used by the WPF UI to display progress
string BusyContent;
ObservableCollection<string> Msgs = new();
// See note lower down about the quick-and-dirty logging
string _logFile = @"C:\Path\To\LogFile.log";
// _allFiles is used to keep a record of all the files we generate. Used after the loop ends
List<string> _allFiles = new();
// nCurr is used to update BusyContent, which is bound to the UI to show progress
int nCurr = 0;
// The messages would really be extracted from the .pst file. Empty list used for simplicity
List<Message> messages = new();
async Task ProcessMessages() {
using StreamWriter logFile = new(_logFile, true);
foreach (Message msg in messages) {
nCurr++;
string fileName = GenerateFileName(msg);
// We log a lot more, but only one shown for simplicity
Log(logFile, $"File: {fileName}");
_allFiles.Add(fileName);
// Let the user know where we are up to
BusyContent = $"Processing message {nCurr}";
// Msgs is bound to a WPF grid, so we need to use Dispatcher to update
Application.Current.Dispatcher.Invoke(() => Msgs.Add(fileName));
// Finally we write out the .pdf files
await ProcessMessage(msg);
}
}
async Task ProcessMessage(Message msg) {
// The methods called here are omitted as they aren't relevant to my questions
await GenerateMessagePdf(msg);
foreach(Attachment a in msg.Attachments) {
string fileName = GenerateFileName(a);
// Note that we update _allFiles here as well as in the main loop
_allFiles.Add(fileName);
await GenerateAttachmentPdf(a);
}
}
static void Log(StreamWriter logFile, string msg) =>
logFile.WriteLine(DateTime.Now.ToString("yyMMdd-HHmmss.fff") + " - " + msg);
This all works fine, but can take quite some time on a large .pst file. I'm wondering if converting this to use Parallel.ForEach would speed things up. I can see the basic usage of this method, but have a few questions, mainly concerned with the class-level variables that are used within the loop...
The logFile variable is passed around. Will this cause issues? This isn't a major problem, as this logging was added as a quick-and-dirty debugging device, and really should be replaced with a proper logging framework, but I'd still like to know if what I'm doing would be an issue in the parallel version.
nCurr is updated inside the loop. Is this safe, or is there a better way to do this?
_allFiles is also updated inside the main loop. I'm only adding entries, not reading or removing, but is this safe?
Similarly, _allFiles is updated inside the ProcessMessage method. I guess the answer to this question depends on the previous one.
Is there a problem updating BusyContent and calling Application.Current.Dispatcher.Invoke inside the loop?
Thanks for any help you can give.
First, you need to use thread-safe collections:
ObservableConcurrentCollection<string> Msgs = new();
ConcurrentQueue<string> _allFiles = new();
ObservableConcurrentCollection can be installed through NuGet. ConcurrentQueue is in the System.Collections.Concurrent namespace.
Special thanks to Theodor Zoulias for pointing out that there is a better option than ConcurrentBag.
Then you can use either Parallel.ForEach or Task.
Parallel.ForEach uses a Partitioner, which avoids creating more tasks than necessary, and runs the loop body for the partitions in parallel. It does not await asynchronous delegates, however, so it is better to remove the async and await keywords from the methods that take part in Parallel.ForEach.
async Task ProcessMessages()
{
using StreamWriter logFile = new(_logFile, true);
await Task.Run(() => {
Parallel.ForEach(messages, msg =>
{
var currentCount = Interlocked.Increment(ref nCurr);
string fileName = GenerateFileName(msg);
Log(logFile, $"File: {fileName}");
_allFiles.Enqueue(fileName);
BusyContent = $"Processing message {currentCount}";
ProcessMessage(msg);
});
});
}
int ProcessMessage(Message msg)
{
// The methods called here are omitted as they aren't relevant to my questions
var message = GenerateMessagePdf(msg);
foreach (Attachment a in msg.Attachments)
{
string fileName = GenerateFileName(a);
_allFiles.Enqueue(fileName);
GenerateAttachmentPdf(a);
}
return msg.Id;
}
private string GenerateAttachmentPdf(Attachment a) => string.Empty;
private string GenerateMessagePdf(Message message) => string.Empty;
string GenerateFileName(Attachment attachment) => string.Empty;
string GenerateFileName(Message message) => string.Empty;
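// StreamWriter is not thread-safe, so writes from parallel iterations are serialized with a lock
void Log(StreamWriter logFile, string msg)
{
    lock (logFile)
        logFile.WriteLine(DateTime.Now.ToString("yyMMdd-HHmmss.fff") + " - " + msg);
}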
Another way is to await all the tasks. In this case there is no need to remove the async and await keywords (ProcessMessage is then assumed to be an async method returning Task<int>).
async Task ProcessMessages()
{
using StreamWriter logFile = new(_logFile, true);
var messageTasks = messages.Select(msg =>
{
var currentCount = Interlocked.Increment(ref nCurr);
string fileName = GenerateFileName(msg);
Log(logFile, $"File: {fileName}");
_allFiles.Enqueue(fileName);
BusyContent = $"Processing message {currentCount}";
return ProcessMessage(msg);
});
var msgs = await Task.WhenAll(messageTasks);
}

How to optimize reading a list of files and storing them in a database?

I was recently asked a question in an interview and it really got me thinking.
I am trying to understand and learn more about multithreading, parallelism and concurrency, and performance.
The scenario is that you have a list of file paths. Files are saved on your HDD or on blob storage.
You have to read the files and store them in a database. How would you do it in the most optimal manner?
The following are some of the ways that I could think of:
The simplest way is to loop through the list and perform this task sequentially.
foreach (var filePath in filePaths)
{
ProcessFile(filePath);
}
public void ProcessFile(string filePath)
{
var file = readFile(filePath);
storeInDb(file);
}
2nd way I could think of is creating multiple threads perhaps:
foreach (var filePath in filePaths)
{
    // Thread needs a delegate, so wrap the call in a lambda
    Thread t = new Thread(() => ProcessFile(filePath));
    t.Start();
}
(not sure if the above code is correct.)
3rd way is using async await
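List<Task> listOfTasks = new List<Task>();
foreach (var filePath in filePaths)
{
    var task = ProcessFile(filePath);
    listOfTasks.Add(task);
}
await Task.WhenAll(listOfTasks);
// Returns Task (not void) so that the caller can await it
public async Task ProcessFile(string filePath)
{
    // Assumes asynchronous versions of the two helpers exist
    var file = await readFileAsync(filePath);
    await storeInDbAsync(file);
}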
4th way is Parallel.For:
Parallel.For(0,filePaths.Count , new ParallelOptions { MaxDegreeOfParallelism = 10 }, i =>
{
ProcessFile(filePaths[i]);
});
What are the differences between them? Which one would be better suited for the job, and is there anything better?
You could also use Microsoft's Reactive Framework (aka Rx) - NuGet System.Reactive and add using System.Reactive.Linq; - then you can do this:
IObservable<string> query =
from filePath in filePaths.ToObservable()
from file in Observable.Start(() => ReadFile(filePath))
from db in Observable.Start(() => StoreInDb(file))
select filePath;
IDisposable subscription =
query
.Subscribe(
filePath => Console.WriteLine($"{filePath} Processed."),
() => Console.WriteLine("Done."));
I wrote a simple extension method to help start async tasks, limit the amount of concurrency, and wait for them all to complete:
public static async Task WhenAll(this IEnumerable<Task> tasks, int batchSize)
{
var started = new List<Task>();
foreach(var t in tasks)
{
started.Add(t);
if (started.Count >= batchSize)
{
var ended = await Task.WhenAny(started);
started.Remove(ended);
}
}
await Task.WhenAll(started);
}
Then you'd want a method to stream the file contents directly into the database. For example:
async Task Process(string filename){
    using var stream = File.OpenRead(filename);
    // TODO connect to the database
    var sqlCommand = ...;
    sqlCommand.CommandText = "update [table] set [column] = @stream";
    sqlCommand.Parameters.Add(new SqlParameter("@stream", SqlDbType.VarBinary)
    {
        Value = stream
    });
    await sqlCommand.ExecuteNonQueryAsync();
}
IEnumerable<string> files = ...;
await files.Select(f => Process(f)).WhenAll(20);
Is this the best approach? Probably not, since it's too easy to misuse this extension, for example by accidentally starting tasks multiple times or starting them all at once.
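One commonly used alternative that is harder to misuse is to pass the work as a delegate and gate it with a SemaphoreSlim, so that no task is started before a slot is free. The sketch below is only an illustration of that technique; the class name Throttled and the method ForEachAsync are hypothetical, not part of the answer above or of the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class Throttled
{
    // Hypothetical helper: invokes `work` for every item, with at most
    // `maxConcurrency` invocations in flight at any time.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> work)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var running = new List<Task>();
            foreach (T item in items)
            {
                // Wait for a free slot before the next item is started.
                await gate.WaitAsync();
                running.Add(RunOneAsync(item, work, gate));
            }
            // All work has been started; wait for the remainder to finish.
            await Task.WhenAll(running);
        }
    }

    private static async Task RunOneAsync<T>(
        T item, Func<T, Task> work, SemaphoreSlim gate)
    {
        try
        {
            await work(item);
        }
        finally
        {
            // Free the slot even if the work item throws.
            gate.Release();
        }
    }
}
```

With the Process method shown earlier, the call would look like await Throttled.ForEachAsync(files, 20, f => Process(f)); because the delegate is only invoked once a slot is free, the tasks cannot all be started at once by accident.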

Read a file from background task

I'm trying to call a method from inside the Run method of a background task which, among other things, deserializes an XML file. The problem is that I end up in a deadlock. This is the method that reads the file:
protected async Task<Anniversaries> readFile(string fileName)
{
IStorageFile file;
Anniversaries tempAnniversaries;
file = await ApplicationData.Current.LocalFolder.GetFileAsync(fileName);
using (IRandomAccessStream stream =
await file.OpenAsync(FileAccessMode.Read))
using (Stream inputStream = stream.AsStreamForRead())
{
DataContractSerializer serializer = new DataContractSerializer(typeof(Anniversaries));
tempAnniversaries = serializer.ReadObject(inputStream) as Anniversaries;
}
return tempAnniversaries;
}
and here is the Run method
public sealed class TileUpdater : IBackgroundTask
{
GeneralAnniversariesManager generalManager = new GeneralAnniversariesManager();
Anniversaries tempAnn = new Anniversaries();
string test = "skata";
public async void Run(IBackgroundTaskInstance taskInstance)
{
DateTime curentTime = new DateTime();
var defferal = taskInstance.GetDeferral();
await generalManager.InitializeAnniversariesAsync().AsAsyncAction();
curentTime = DateTime.Now;
var updater = TileUpdateManager.CreateTileUpdaterForApplication();
updater.EnableNotificationQueue(true);
updater.Clear();
for (int i = 1; i < 6; i++)
{
var tile = TileUpdateManager.GetTemplateContent(TileTemplateType.TileWide310x150BlockAndText01);
tile.GetElementsByTagName("text")[0].InnerText = test + i;
tile.GetElementsByTagName("text")[1].InnerText = curentTime.ToString();
updater.Update(new TileNotification(tile));
}
defferal.Complete();
}
}
I'm assuming that by deadlock you mean that the deserialization method finishes too late and your original program tries to read the data before it's finished loading.
It depends on how complicated/reliable you want your solution to be and how you're intending to use the program. The simplest way relies on the fact that the directory creation function is always 100% atomic in Windows/Unix and OSX. For example at the top of your readFile function have something like this.
Directory.CreateDirectory("lock");
Before you start parsing the results of your async action in TileUpdater, have a loop that looks like this.
while (Directory.Exists("lock"))
{
Thread.Sleep(50);
}
This assumes that everything is happening in the same directory. Generally you'll want to replace "lock" with a path that leads to the user's temp directory for their version of Windows/Linux/OSX.
If you want to implement something more complicated where you're reading from a series of files while at the same time reading the deserialized output into your class, you'll want to use something like a System.Collections.Concurrent.ConcurrentQueue that allows your threads to act completely independently without blocking each other.
Incidentally, I'm assuming you know that the Process class and its WaitForExit() method exist. You can start a separate process and then, at a later point, halt the main thread until that process finishes.
Actually, I think I've found where the problem is: the namespaces. I wrapped the code in a try/catch and got an exception about different namespaces being used by the DataContractSerializer. I have updated the code like this:
file = await ApplicationData.Current.LocalFolder.GetFileAsync("EortologioMovingEntries.xml");
try
{
using (IRandomAccessStream stream =
await file.OpenAsync(FileAccessMode.Read))
using (Stream inputStream = stream.AsStreamForRead())
{
DataContractSerializer serializer = new DataContractSerializer(typeof(Anniversaries), "Anniversaries", "http://schemas.datacontract.org/2004/07/Eortologio.Model");
tempAnniversaries = serializer.ReadObject(inputStream) as Anniversaries;
}
}
catch (Exception ex)
{
error = ex.ToString();
tempAnniversaries.Entries.Add(new AnniversaryEntry("Ena", DateTime.Now, "skata", PriorityEnum.High));
}
I don't get any exceptions now but the tempAnniversaries returns null. Any ideas?

Freeze when trying to read a file using PCLStorage

I am using PCLStorage library in my project so that I can access filesystem from my PCL lib. I am trying to read a file as follows:
static async Task<T> LoadAsync<T> (string fileName) where T : class
{
var rootFolder = FileSystem.Current.LocalStorage; // debugger stops here
var m5cacheFolder = await rootFolder.GetFolderAsync (CacheFolderName); // but instead of going to this line, jumps to end of this method
var userInfoFile = await m5cacheFolder.GetFileAsync (fileName);
var userInfoFileContent = await userInfoFile.ReadAllTextAsync ();
var stringReader = new StringReader (userInfoFileContent);
var serializer = new XmlSerializer (typeof(T));
return (T)serializer.Deserialize (stringReader);
}
Since PCLStorage is asynchronous and I want to use it in synchronous code, I am calling it like this:
var task = LoadAsync<User> (UserInfoFileName);
user = task.Result;
The problem is that the whole application freezes when I try to execute this code. As described in the comments above, the code in the LoadAsync method is not executed. I am using the newest Xamarin 3. My PCL library is referenced in a Xamarin iOS project. Both projects reference PCLStorage through NuGet.
On the other hand, the following code is executed correctly:
static async void PersistAsync (object obj, string fileName)
{
var rootFolder = FileSystem.Current.LocalStorage;
var m5cacheFolder = await rootFolder.CreateFolderAsync (CacheFolderName, CreationCollisionOption.OpenIfExists);
var userInfoFile = await m5cacheFolder.CreateFileAsync (fileName, CreationCollisionOption.ReplaceExisting);
var serializer = new XmlSerializer (obj.GetType ());
var stringWriter = new StringWriter ();
serializer.Serialize (stringWriter, obj);
await userInfoFile.WriteAllTextAsync (stringWriter.ToString ());
}
This is always a potential recipe for disaster:
var task = LoadAsync<User>(UserInfoFileName);
user = task.Result;
If this is happening in the UI thread, then you're basically blocking the UI thread until the task has completed - but the task will need to execute its continuations on the UI thread too. You've got a deadlock.
Basically you should be trying to structure your app to embrace asynchrony more. You could use Task.ConfigureAwait() to schedule the continuations in LoadAsync to execute on thread-pool threads instead, but you're still going to be blocking the UI until it's completed, which is against the spirit of asynchrony.
Asynchrony is somewhat viral - if you try to make just one part of your app asynchronous, you're going to have a hard time. You need to be asynchronous all the way up, at least in terms of UI operations.
(If you block waiting for the task returned by PersistAsync you'll have a similar issue, by the way.)
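As a rough sketch of the two points above, the example below uses a plain StreamReader instead of PCLStorage so that it is self-contained; the class and method names (UserStore, LoadTextAsync, ShowUserAsync) are hypothetical. The library-style method uses ConfigureAwait(false) so its continuation does not need the caller's context, and the caller awaits the result instead of blocking on Result.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class UserStore
{
    // Library-style method: ConfigureAwait(false) lets the continuation run
    // on a thread-pool thread instead of being posted back to the UI context.
    public static async Task<string> LoadTextAsync(string path)
    {
        using (var reader = new StreamReader(path))
        {
            return await reader.ReadToEndAsync().ConfigureAwait(false);
        }
    }

    // Caller stays asynchronous "all the way up": it awaits the task instead
    // of blocking on task.Result, so the UI thread is never held hostage.
    public static async Task ShowUserAsync(string path)
    {
        string text = await LoadTextAsync(path);
        Console.WriteLine(text);
    }
}
```

In a UI application, ShowUserAsync would typically be awaited from an async event handler, which is the one place where async void is generally considered acceptable.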

Using C# 5.0 async to read a file

I'm just starting out with C#'s new async features. I've read plenty of how-to's now on parallel downloads etc. but nothing on reading/processing a text file.
I had an old script I use to filter a log file and figured I'd have a go at upgrading it. However I'm unsure if my usage of the new async/await syntax is correct.
In my head I see this reading the file line by line and passing it on for processing in different thread so it can continue without waiting for a result.
Am I thinking about it correctly, or what is the best way to implement this?
static async Task<string[]> FilterLogFile(string fileLocation)
{
string line;
List<string> matches = new List<string>();
using(TextReader file = File.OpenText(fileLocation))
{
while((line = await file.ReadLineAsync()) != null)
{
CheckForMatch(line, matches);
}
}
return matches.ToArray();
}
The full script: http://share.linqpad.net/29kgbe.linq
In my head I see this reading the file line by line and passing it on for processing in different thread so it can continue without waiting for a result.
But that's not what your code does. Instead, you will (asynchronously) return an array when all reading is done. If you actually want to asynchronously return the matches one by one, you would need some sort of asynchronous collection. You could use a block from TPL Dataflow for that. For example:
ISourceBlock<string> FilterLogFile(string fileLocation)
{
var block = new BufferBlock<string>();
Task.Run(async () =>
{
string line;
using(TextReader file = File.OpenText(fileLocation))
{
while((line = await file.ReadLineAsync()) != null)
{
var match = GetMatch(line);
if (match != null)
block.Post(match);
}
}
block.Complete();
});
return block;
}
(You would need to add error handling, probably by faulting the returned block.)
You would then link the returned block to another block that will process the results. Or you could read them directly from the block (by using ReceiveAsync()).
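Reading directly from the returned block could look roughly like the sketch below. It assumes the System.Threading.Tasks.Dataflow NuGet package and C# 7.1 or later (for async Main); the Consumer class and PrintAllAsync method are hypothetical names, and a pre-filled BufferBlock stands in for the block returned by FilterLogFile.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class Consumer
{
    // Hypothetical helper: prints every item from the source block as it
    // arrives and returns how many items were received.
    public static async Task<int> PrintAllAsync(ISourceBlock<string> source)
    {
        int count = 0;
        // OutputAvailableAsync completes with false once the block has been
        // completed and all buffered items have been consumed.
        while (await source.OutputAvailableAsync())
        {
            string match = await source.ReceiveAsync();
            Console.WriteLine(match);
            count++;
        }
        return count;
    }

    public static async Task Main()
    {
        // Stand-in for the block returned by FilterLogFile(fileLocation).
        var block = new BufferBlock<string>();
        block.Post("match one");
        block.Post("match two");
        block.Complete();

        int received = await PrintAllAsync(block);
        Console.WriteLine("Received " + received + " matches.");
    }
}
```

This loop assumes a single consumer; with several consumers reading from the same block, an item reported by OutputAvailableAsync may already have been taken by the time ReceiveAsync is called.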
But looking at the full code, I'm not sure this approach would be that useful to you. Because of the way you process the results (grouping and then ordering by count in each group), you can't do much with them until you have all of them.
