Read text file with IAsyncEnumerable - c#

I came across IAsyncEnumerable while testing C# 8.0 features. I found some remarkable examples from Anthony Chu (https://anthonychu.ca/post/async-streams-dotnet-core-3-iasyncenumerable/). It is an async stream and a replacement for Task<IEnumerable<T>>:
// Data Access Layer.
public async IAsyncEnumerable<Product> GetAllProducts()
{
    Container container = cosmosClient.GetContainer(DatabaseId, ContainerId);
    var iterator = container.GetItemQueryIterator<Product>("SELECT * FROM c");
    while (iterator.HasMoreResults)
    {
        foreach (var product in await iterator.ReadNextAsync())
        {
            yield return product;
        }
    }
}

// Usage
await foreach (var product in productsRepository.GetAllProducts())
{
    Console.WriteLine(product);
}
I am wondering if this can be applied to reading text files, like the usage below that reads a file line by line.
foreach (var line in File.ReadLines("Filename"))
{
    // ...process line.
}
I really want to know how to apply async with IAsyncEnumerable<string> to the above foreach loop so that it streams while reading. How do I implement the iterator so that I can use yield return to read line by line?

Exactly the same, however there is no async workload, so let's pretend
public async IAsyncEnumerable<string> SomeSortOfAwesomeness()
{
    foreach (var line in File.ReadLines("Filename.txt"))
    {
        // simulates an async workload,
        // otherwise why would we be using IAsyncEnumerable?
        // -- added due to popular demand
        await Task.Delay(100);
        yield return line;
    }
}
or
This is just a wrapped APM workload; see Stephen Cleary's comments for clarification:
public static async IAsyncEnumerable<string> SomeSortOfAwesomeness()
{
    using StreamReader reader = File.OpenText("Filename.txt");
    while (!reader.EndOfStream)
        yield return await reader.ReadLineAsync();
}
Usage
await foreach (var line in SomeSortOfAwesomeness())
{
    Console.WriteLine(line);
}
Update from Stephen Cleary
File.OpenText sadly only allows synchronous I/O; the async APIs are
implemented poorly in that scenario. To open a true asynchronous file,
you'd need to use a FileStream constructor passing isAsync: true or
FileOptions.Asynchronous.
ReadLineAsync basically results in the following code; as you can see, it's only the Stream APM Begin and End methods wrapped:
private Task<Int32> BeginEndReadAsync(Byte[] buffer, Int32 offset, Int32 count)
{
    return TaskFactory<Int32>.FromAsyncTrim(
        this, new ReadWriteParameters { Buffer = buffer, Offset = offset, Count = count },
        (stream, args, callback, state) => stream.BeginRead(args.Buffer, args.Offset, args.Count, callback, state), // cached by compiler
        (stream, asyncResult) => stream.EndRead(asyncResult)); // cached by compiler
}

I did some performance tests and it seems that a large bufferSize is helpful, together with the FileOptions.SequentialScan option.
public static async IAsyncEnumerable<string> ReadLinesAsync(string filePath)
{
    using var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read,
        FileShare.Read, 32768, FileOptions.Asynchronous | FileOptions.SequentialScan);
    using var reader = new StreamReader(stream);
    while (true)
    {
        var line = await reader.ReadLineAsync().ConfigureAwait(false);
        if (line == null) break;
        yield return line;
    }
}
The enumeration is not truly asynchronous though. According to my experiments (.NET Core 3.1), the xxxAsync methods of the StreamReader class block the current thread for a duration longer than the awaiting period of the Task they return. For example, reading a 6 MB file with the method ReadToEndAsync on my PC blocks the current thread for 120 msec before returning the task, and then the task completes in just 20 msec. So I am not sure that there is much value in using these methods. Faking asynchrony is much easier by using the synchronous APIs together with some Linq.Async:
IAsyncEnumerable<string> lines = File.ReadLines("SomeFile.txt").ToAsyncEnumerable();
.NET 6 update: The implementation of the asynchronous filesystem APIs has been improved on .NET 6. For experimental data with the File.ReadAllLinesAsync method, see here.

Related

Connection problems while using Parallel.ForEach Loop

I have a foreach loop which is responsible for executing a certain set of statements. A part of that is to save an image from a URL to Azure storage. I have to do this for a large set of data. To achieve the same I have converted the foreach loop into a Parallel.ForEach loop.
Parallel.ForEach(listSkills, item =>
{
    // some business logic
    var b = getImageFromUrl(item.Url);
    Stream ms = new MemoryStream(b);
    saveImage(ms);
    // more business logic
});

private static byte[] getByteArray(Stream input)
{
    using (MemoryStream ms = new MemoryStream())
    {
        input.CopyTo(ms);
        return ms.ToArray();
    }
}

public static byte[] getImageFromUrl(string url)
{
    HttpWebRequest request = null;
    HttpWebResponse response = null;
    byte[] b = null;
    request = (HttpWebRequest)WebRequest.Create(url);
    response = (HttpWebResponse)request.GetResponse();
    if (request.HaveResponse)
    {
        if (response.StatusCode == HttpStatusCode.OK)
        {
            Stream receiveStream = response.GetResponseStream();
            b = getByteArray(receiveStream);
        }
    }
    return b;
}

public static void saveImage(Stream fileContent)
{
    fileContent.Seek(0, SeekOrigin.Begin);
    byte[] bytes = getByteArray(fileContent);
    CloudBlockBlob blob = null; // blob initialization elided in the original (`var blob = null;` does not compile)
    blob.UploadFromByteArrayAsync(bytes, 0, bytes.Length).Wait();
}
Although there are instances when I am getting the below error and the image is not getting saved.
An existing connection was forcibly closed by the remote host.
Also sharing the StackTrace :
at System.Net.Sockets.NetworkStream.Read(Span`1 buffer)
at System.Net.Security.SslStream.<FillBufferAsync>d__183`1.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Net.Security.SslStream.<ReadAsyncInternal>d__181`1.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Net.Security.SslStream.Read(Byte[] buffer, Int32 offset, Int32 count)
at System.IO.Stream.Read(Span`1 buffer)
at System.Net.Http.HttpConnection.Read(Span`1 destination)
at System.Net.Http.HttpConnection.ContentLengthReadStream.Read(Span`1 buffer)
at System.Net.Http.HttpBaseStream.Read(Byte[] buffer, Int32 offset, Int32 count)
at System.IO.Stream.CopyTo(Stream destination, Int32 bufferSize)
at Utilities.getByteArray(Stream input) in D:\repos\SampleProj\Sample\Helpers\CH.cs:line 238
at Utilities.getImageFromUrl(String url) in D:\repos\SampleProj\Sample\Helpers\CH.cs:line 178
I am guessing this may be because I am not using locks? I am unsure whether to use locks within a Parallel.ForEach loop.
According to another question on stackoverflow, here are the potential causes for An existing connection was forcibly closed by the remote host. :
You are sending malformed data to the application (which could include sending an HTTPS request to an HTTP server)
The network link between the client and server is going down for some reason
You have triggered a bug in the third-party application that caused it to crash
The third-party application has exhausted system resources
Since only some of your requests are affected, I think we can exclude the first one. This can, of course, be a network issue, in which case it would happen from time to time depending on the quality of the network between you and the server.
Unless you find indications of an Azure Storage bug from other users, there is a high probability your calls are consuming too much of the remote server's resources (connections/data) at the same time. Servers and proxies have limits on how many connections they can handle at the same time (especially from the same client machine).
Depending on the size of your listSkills list, your code may launch a big number of requests in parallel (as many as your thread pool allows), possibly flooding the server.
You could at least limit the number of parallel tasks launched using MaxDegreeOfParallelism, like this:
Parallel.ForEach(listSkills,
    new ParallelOptions { MaxDegreeOfParallelism = 4 },
    item =>
    {
        // some business logic
        var b = getImageFromUrl(item.Url);
        Stream ms = new MemoryStream(b);
        saveImage(ms);
        // more business logic
    });
You can control parallelism like:
listSkills.AsParallel()
    .WithDegreeOfParallelism(10)
    .Select(item =>
    {
        // your logic
        getImageFromUrl(item.Url);
        saveImage(your_stream);
        return item;
    })
    .ToList(); // PLINQ is lazy; without a terminal operation nothing executes
But Parallel.ForEach is not good for IO, because it's designed for CPU-intensive tasks. If you use it for IO-bound operations, especially making web requests, you may waste thread pool threads blocked while waiting for a response.
You can use asynchronous web request methods like HttpWebRequest.GetResponseAsync. Alternatively, you can use thread synchronization constructs, for example a Semaphore. A Semaphore is like a queue: it allows X threads to pass, and the rest wait until one of the busy threads finishes its work.
First make your getImageFromUrl method async (this is not a great solution, but it is better):
public static async Task getImageFromUrl(SemaphoreSlim semaphore, string url)
{
    try
    {
        HttpWebRequest request = null;
        byte[] b = null;
        request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = await request.GetResponseAsync().ConfigureAwait(false))
        {
            // your logic
        }
    }
    catch (Exception ex)
    {
        // handle exp
    }
    finally
    {
        // release
        semaphore.Release();
    }
}
and then:
var tasks = new List<Task>();
using (var semaphore = new SemaphoreSlim(10))
{
    foreach (var url in urls)
    {
        // await here until there is room for this task
        await semaphore.WaitAsync();
        tasks.Add(getImageFromUrl(semaphore, url));
    }
    // await the rest of the tasks to complete
    await Task.WhenAll(tasks);
}
You should not use Parallel or Task.Run; instead you can have an async handler method like:
public async Task handleResponse(Task<HttpResponseMessage> responseTask)
{
    HttpResponseMessage response = await responseTask;
    // Process your data
}
and then use Task.WhenAll like:
Task[] requests = myList.Select(l => getImageFromUrl(l.Id))
    .Select(r => handleResponse(r))
    .ToArray();
await Task.WhenAll(requests);
In the end, there are several solutions for your scenario, but forget Parallel.ForEach and use an optimized solution instead.
There are several problems with this code:
Parallel.ForEach is meant for data parallelism, not IO. The code freezes all CPU cores waiting for IO to complete.
HttpWebRequest is a wrapper over HttpClient in .NET Core. Using HttpWebRequest is inefficient and far more complex than needed.
HttpClient can retrieve or post stream contents, which means there's no reason to load stream contents in memory. HttpClient is thread-safe and meant to be reused too.
There are several ways to execute many IO operations concurrently in .NET Core.
.NET 6
In the current Long-Term-Support version of .NET, .NET 6, this can be done using Parallel.ForEachAsync. Scott Hanselman shows how easy it is to use it for API calls.
You can retrieve the data directly with GetByteArrayAsync:
record CopyRequest(Uri sourceUri, Uri targetUri);
...
var requests = new List<CopyRequest>();
//Load some source/target URLs
var client = new HttpClient();
await Parallel.ForEachAsync(requests, async (req, ct) =>
{
    var bytes = await client.GetByteArrayAsync(req.sourceUri, ct);
    var blob = new CloudAppendBlob(req.targetUri);
    await blob.UploadFromByteArrayAsync(bytes, 0, bytes.Length);
});
A better option would be to retrieve the data as a stream and send it directly to the blob :
await Parallel.ForEachAsync(requests, async (req, ct) =>
{
    var response = await client.GetAsync(req.sourceUri,
        HttpCompletionOption.ResponseHeadersRead, ct);
    using var sourceStream = await response.Content.ReadAsStreamAsync(ct);
    var blob = new CloudAppendBlob(req.targetUri);
    await blob.UploadFromStreamAsync(sourceStream);
});
HttpCompletionOption.ResponseHeadersRead causes GetAsync to return as soon as the response headers are received, without buffering any of the response data.
.NET 3.1
In older .NET Core versions (which are reaching End-of-Life in a few months) you can use e.g. an ActionBlock with a Degree-Of-Parallelism greater than 1:
var options = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 };
var copyBlock = new ActionBlock<CopyRequest>(async req =>
{
    var response = await client.GetAsync(req.sourceUri,
        HttpCompletionOption.ResponseHeadersRead);
    using var sourceStream = await response.Content.ReadAsStreamAsync();
    var blob = new CloudAppendBlob(req.targetUri);
    await blob.UploadFromStreamAsync(sourceStream);
}, options);
The block classes in the TPL Dataflow library can be used to construct processing pipelines similar to a shell script pipeline, with each block piping its output to the next block.
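As a sketch of that pipeline idea (the class and variable names below are illustrative, not from the question, and Task.Delay stands in for the real HTTP download), a TransformBlock can download in parallel while a linked ActionBlock consumes each result:

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

static class PipelineSketch
{
    // "Downloads" (here: fakes) each URL's content with bounded parallelism,
    // then hands each result to a consumer block. Returns the processed results.
    public static async Task<ConcurrentBag<string>> RunAsync(string[] urls)
    {
        var results = new ConcurrentBag<string>();

        var download = new TransformBlock<string, string>(
            async url =>
            {
                await Task.Delay(10);          // stand-in for the HTTP GET
                return $"content of {url}";
            },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        var consume = new ActionBlock<string>(content => results.Add(content));

        // Completion propagates through the link, shell-pipeline style.
        download.LinkTo(consume, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var url in urls)
            download.Post(url);

        download.Complete();                   // no more input
        await consume.Completion;              // wait for the whole pipeline to drain
        return results;
    }
}
```

Awaiting `consume.Completion` after `download.Complete()` is what gives the pipeline a clean shutdown, the same way the ActionBlock above is completed and awaited.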

how to implement cancellation with yield return readlineAsync() [duplicate]

When I cancel my async method with the following content by calling the Cancel() method of my CancellationTokenSource, it will stop eventually. However since the line Console.WriteLine(await reader.ReadLineAsync()); takes quite a bit to complete, I tried to pass my CancellationToken to ReadLineAsync() as well (expecting it to return an empty string) in order to make the method more responsive to my Cancel() call. However I could not pass a CancellationToken to ReadLineAsync().
Can I cancel a call to Console.WriteLine() or Streamreader.ReadLineAsync() and if so, how do I do it?
Why is ReadLineAsync() not accepting a CancellationToken? I thought it was good practice to give async methods an optional CancellationToken parameter even if the method still completes after being canceled.
StreamReader reader = new StreamReader(dataStream);
while (!reader.EndOfStream)
{
    if (ct.IsCancellationRequested)
    {
        ct.ThrowIfCancellationRequested();
        break;
    }
    else
    {
        Console.WriteLine(await reader.ReadLineAsync());
    }
}
Update:
Like stated in the comments below, the Console.WriteLine() call alone was already taking up several seconds due to a poorly formatted input string of 40.000 characters per line. Breaking this down solves my response-time issues, but I am still interested in any suggestions or workarounds on how to cancel this long-running statement if for some reason writing 40.000 characters into one line was intended (for example when dumping the whole string into a file).
.NET 6 brings Task.WaitAsync(CancellationToken). So one can write:
using StreamReader reader = new StreamReader(dataStream);
while (!reader.EndOfStream)
{
    Console.WriteLine(await reader.ReadLineAsync().WaitAsync(cancellationToken).ConfigureAwait(false));
}
In .NET 7 (not yet released), it should be possible to write simply:
using StreamReader reader = new StreamReader(dataStream);
while (!reader.EndOfStream)
{
    Console.WriteLine(await reader.ReadLineAsync(cancellationToken).ConfigureAwait(false));
}
based on https://github.com/dotnet/runtime/issues/20824 and https://github.com/dotnet/runtime/pull/61898.
You can't cancel the operation unless it's cancellable. You can use the WithCancellation extension method to have your code flow behave as if it was cancelled, but the underlying operation would still run:
public static Task<T> WithCancellation<T>(this Task<T> task, CancellationToken cancellationToken)
{
    return task.IsCompleted // fast-path optimization
        ? task
        : task.ContinueWith(
            completedTask => completedTask.GetAwaiter().GetResult(),
            cancellationToken,
            TaskContinuationOptions.ExecuteSynchronously,
            TaskScheduler.Default);
}
Usage:
await task.WithCancellation(cancellationToken);
You can't cancel Console.WriteLine, and you don't need to. It's instantaneous if you have a reasonably sized string.
About the guideline: if your implementation doesn't actually support cancellation, you shouldn't accept a token, since it sends a mixed message.
If you do have a huge string to write to the console, you shouldn't use Console.WriteLine. You can write the string one character at a time and have that method be cancellable:
public void DumpHugeString(string line, CancellationToken token)
{
    foreach (var character in line)
    {
        token.ThrowIfCancellationRequested();
        Console.Write(character);
    }
    Console.WriteLine();
}
An even better solution would be to write in batches instead of single characters. Here's an implementation using MoreLinq's Batch:
public void DumpHugeString(string line, CancellationToken token)
{
    foreach (var characterBatch in line.Batch(100))
    {
        token.ThrowIfCancellationRequested();
        Console.Write(characterBatch.ToArray());
    }
    Console.WriteLine();
}
So, in conclusion:
var reader = new StreamReader(dataStream);
while (!reader.EndOfStream)
{
    DumpHugeString(await reader.ReadLineAsync().WithCancellation(token), token);
}
I generalized this answer to this:
public static async Task<T> WithCancellation<T>(this Task<T> task, CancellationToken cancellationToken, Action action, bool useSynchronizationContext = true)
{
    using (cancellationToken.Register(action, useSynchronizationContext))
    {
        try
        {
            return await task;
        }
        catch (Exception ex)
        {
            if (cancellationToken.IsCancellationRequested)
            {
                // the Exception will be available as Exception.InnerException
                throw new OperationCanceledException(ex.Message, ex, cancellationToken);
            }
            // cancellation hasn't been requested, rethrow the original Exception
            throw;
        }
    }
}
Now you can use your cancellation token on any cancelable async method. For example WebRequest.GetResponseAsync:
var request = (HttpWebRequest)WebRequest.Create(url);
using (var response = await request.GetResponseAsync())
{
    . . .
}
will become:
var request = (HttpWebRequest)WebRequest.Create(url);
using (WebResponse response = await request.GetResponseAsync().WithCancellation(CancellationToken.None, request.Abort, true))
{
    . . .
}
See example http://pastebin.com/KauKE0rW
I like to use an infinite delay; the code is quite clean.
If the wait completes first, WhenAny returns and the cancellationToken will throw. Otherwise, the result of the task is returned.
public static async Task<T> WithCancellation<T>(this Task<T> task, CancellationToken cancellationToken)
{
    using (var delayCTS = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken))
    {
        var waiting = Task.Delay(-1, delayCTS.Token);
        var doing = task;
        await Task.WhenAny(waiting, doing);
        delayCTS.Cancel();
        cancellationToken.ThrowIfCancellationRequested();
        return await doing;
    }
}
You can't cancel StreamReader.ReadLineAsync(). IMHO this is because reading a single line should be very quick. But you can easily prevent the Console.WriteLine() from happening by using a separate task variable.
The check for ct.IsCancellationRequested is also redundant, as ct.ThrowIfCancellationRequested() will only throw if cancellation is requested.
StreamReader reader = new StreamReader(dataStream);
while (!reader.EndOfStream)
{
    ct.ThrowIfCancellationRequested();
    string line = await reader.ReadLineAsync();
    ct.ThrowIfCancellationRequested();
    Console.WriteLine(line);
}

Is there a .NET 4.0 replacement for StreamReader.ReadLineAsync?

I'm stuck with .NET 4.0 on a project. StreamReader offers no Async or Begin/End version of ReadLine. The underlying Stream object has BeginRead/BeginEnd but these take a byte array so I'd have to implement the logic for reading line by line.
Is there something in the 4.0 Framework to achieve this?
You can use Task. You don't show the other parts of your code, so I don't know exactly what you want to do. I advise you to avoid Task.Wait, because it blocks the calling thread while waiting for the task to finish, which is not really async! If you want to perform some other action after the file is read in the task, you can use task.ContinueWith.
Here is a full example of how to do it without blocking the UI thread:
static void Main(string[] args)
{
    string filePath = @"FILE PATH";
    Task<string[]> task = Task.Run<string[]>(() => ReadFile(filePath));
    bool stopWhile = false;
    //if you want to not block the UI with Task.Wait() for the result
    // and you want to perform some other operations with the already read file
    Task continueTask = task.ContinueWith((x) =>
    {
        string[] result = x.Result; //result of the read file
        foreach (var a in result)
        {
            Console.WriteLine(a);
        }
        stopWhile = true;
    });
    //here do other actions not related to the result of the file content
    while (!stopWhile)
    {
        Console.WriteLine("TEST");
    }
}
public static string[] ReadFile(string filePath)
{
    List<string> lines = new List<string>();
    string line = "";
    using (StreamReader sr = new StreamReader(filePath))
    {
        while ((line = sr.ReadLine()) != null)
            lines.Add(line);
    }
    Console.WriteLine("File read");
    return lines.ToArray();
}
You can use the Task Parallel Library (TPL) to achieve some of the async behavior you're looking for.
Wrap the synchronous method in a task:
var asyncTask = Task.Run(() => YourMethod(args, ...));
asyncTask.Wait(); // You can also use Task.WaitAll or other methods if you have several of these that you want to run in parallel.
var result = asyncTask.Result;
If you need to do this a lot for StreamReader, you can then go on to make this into an extension method for StreamReader if you want to simulate the regular async methods. Just take care with the error handling and other quirks of using the TPL.
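For example, a minimal extension in that spirit might look like this (a sketch; the method name is illustrative, and it merely offloads the blocking ReadLine to a thread-pool thread rather than doing true async I/O, which is exactly one of the TPL quirks to keep in mind):

```csharp
using System.IO;
using System.Threading.Tasks;

public static class StreamReaderExtensions
{
    // .NET 4.0-compatible stand-in for ReadLineAsync: runs the synchronous
    // ReadLine on a thread-pool thread so the calling thread stays free.
    // Returns null at end of stream, just like ReadLine.
    public static Task<string> ReadLineTaskAsync(this StreamReader reader)
    {
        return Task.Factory.StartNew(() => reader.ReadLine());
    }
}
```

On C# 5+ you can `await reader.ReadLineTaskAsync()`; on .NET 4.0 with C# 4 you would chain `ContinueWith` on the returned task instead.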

How to make sure that the data of multiple Async downloads are saved in the order they were started?

I'm writing a basic Http Live Stream (HLS) downloader, where I'm re-downloading a m3u8 media playlist at an interval specified by "#EXT-X-TARGETDURATION" and then download the *.ts segments as they become available.
This is what the m3u8 media playlist might look like when first downloaded.
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:12
#EXT-X-MEDIA-SEQUENCE:1
#EXTINF:7.975,
http://website.com/segment_1.ts
#EXTINF:7.941,
http://website.com/segment_2.ts
#EXTINF:7.975,
http://website.com/segment_3.ts
I'd like to download these *.ts segments all at the same time with HttpClient async/await. The segments do not have the same size, so even though the download of "segment_1.ts" is started first, it might finish after the other two segments.
These segments are all part of one large video, so it's important that the data of the downloaded segments are written in the order they were started, NOT in the order they finished.
My code below works perfectly fine if the segments are downloaded one after another, but not when multiple segments are downloaded at the same time, because sometimes they don't finish in the order they were started.
I thought about using Task.WhenAll, which guarantees a correct order, but I don't want to keep the downloaded segments in memory unnecessarily, because they can be a few megabytes in size. If the download of "segment_1.ts" finishes first, then it should be written to disk right away, without having to wait for the other segments to finish. Writing all the *.ts segments to separate files and joining them in the end is not an option either, because it would require double the disk space, and the total video can be a few gigabytes in size.
I have no idea how to do this and I'm wondering if somebody can help me with that. I'm looking for a way that doesn't require me to create threads manually or block a ThreadPool thread for a long period of time.
Some of the code and exception handling have been removed to make it easier to see what is going on.
// Async BlockingCollection from the AsyncEx library
private AsyncCollection<byte[]> segmentDataQueue = new AsyncCollection<byte[]>();

public void Start()
{
    RunConsumer();
    RunProducer();
}

private async void RunProducer()
{
    while (!_isCancelled)
    {
        var response = await _client.GetAsync(_playlistBaseUri + _playlistFilename, _cts.Token).ConfigureAwait(false);
        var data = await response.Content.ReadAsStringAsync().ConfigureAwait(false);

        string[] lines = data.Split(new string[] { "\n" }, StringSplitOptions.RemoveEmptyEntries);
        if (!lines.Any() || lines[0] != "#EXTM3U")
            throw new Exception("Invalid m3u8 media playlist.");

        for (var i = 1; i < lines.Length; i++)
        {
            var line = lines[i];
            if (line.StartsWith("#EXT-X-TARGETDURATION"))
            {
                ParseTargetDuration(line);
            }
            else if (line.StartsWith("#EXT-X-MEDIA-SEQUENCE"))
            {
                ParseMediaSequence(line);
            }
            else if (!line.StartsWith("#"))
            {
                if (_isNewSegment)
                {
                    // Fire and forget
                    DownloadTsSegment(line);
                }
            }
        }

        // Wait until it's time to reload the m3u8 media playlist again
        await Task.Delay(_targetDuration * 1000, _cts.Token).ConfigureAwait(false);
    }
}

// async void. We never await this method, so we can download multiple segments at once
private async void DownloadTsSegment(string tsUrl)
{
    var response = await _client.GetAsync(tsUrl, _cts.Token).ConfigureAwait(false);
    var data = await response.Content.ReadAsByteArrayAsync().ConfigureAwait(false);

    // Add the downloaded segment data to the AsyncCollection
    await segmentDataQueue.AddAsync(data, _cts.Token).ConfigureAwait(false);
}

private async void RunConsumer()
{
    using (FileStream fs = new FileStream(_filePath, FileMode.Create, FileAccess.Write, FileShare.Read))
    {
        while (!_isCancelled)
        {
            // Wait until new segment data is added to the AsyncCollection and write it to disk
            var data = await segmentDataQueue.TakeAsync(_cts.Token).ConfigureAwait(false);
            await fs.WriteAsync(data, 0, data.Length).ConfigureAwait(false);
        }
    }
}
I don't think you need a producer/consumer queue at all here. However, I do think you should avoid "fire and forget".
You can start them all at the same time, and just process them as they complete.
First, define how to download a single segment:
private async Task<byte[]> DownloadTsSegmentAsync(string tsUrl)
{
    var response = await _client.GetAsync(tsUrl, _cts.Token).ConfigureAwait(false);
    return await response.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
}
Then add the parsing of the playlist which results in a list of segment downloads (which are all in progress already):
private List<Task<byte[]>> DownloadTasks(string data)
{
    var result = new List<Task<byte[]>>();

    string[] lines = data.Split(new string[] { "\n" }, StringSplitOptions.RemoveEmptyEntries);
    if (!lines.Any() || lines[0] != "#EXTM3U")
        throw new Exception("Invalid m3u8 media playlist.");

    ...
    if (_isNewSegment)
    {
        result.Add(DownloadTsSegmentAsync(line));
    }
    ...

    return result;
}
Consume this list one at a time (in order) by writing to a file:
private async Task RunConsumerAsync(List<Task<byte[]>> downloads)
{
    using (FileStream fs = new FileStream(_filePath, FileMode.Create, FileAccess.Write, FileShare.Read))
    {
        foreach (var task in downloads)
        {
            var data = await task.ConfigureAwait(false);
            await fs.WriteAsync(data, 0, data.Length).ConfigureAwait(false);
        }
    }
}
And kick it all off with a producer:
public async Task RunAsync()
{
    // TODO: consider CancellationToken instead of a boolean.
    while (!_isCancelled)
    {
        var response = await _client.GetAsync(_playlistBaseUri + _playlistFilename, _cts.Token).ConfigureAwait(false);
        var data = await response.Content.ReadAsStringAsync().ConfigureAwait(false);

        var tasks = DownloadTasks(data);
        await RunConsumerAsync(tasks);

        await Task.Delay(_targetDuration * 1000, _cts.Token).ConfigureAwait(false);
    }
}
Note that this solution does run all downloads concurrently, and this can cause memory pressure. If this is a problem, I recommend you restructure to use TPL Dataflow, which has built-in support for throttling.
Assign each download a sequence number. Put the results into a Dictionary<int, byte[]>. Each time a download completes it adds its own result.
It then checks if there are segments to write to disk:
while (dict.ContainsKey(lowestWrittenSegmentNumber + 1))
{
    WriteSegment(dict[lowestWrittenSegmentNumber + 1]);
    lowestWrittenSegmentNumber++;
}
That way all segments end up on disk, in order and with buffering.
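A minimal sketch of that reorder buffer (class and member names here are illustrative): each completed download calls Add with its sequence number, and segments are flushed to the output stream only once the next expected number is present. In the concurrent download scenario you would also need a lock around Add; it is omitted here for brevity.

```csharp
using System.Collections.Generic;
using System.IO;

// Buffers out-of-order segment completions and writes them to the output
// stream strictly in sequence-number order.
public class SegmentReorderBuffer
{
    private readonly Dictionary<int, byte[]> _pending = new Dictionary<int, byte[]>();
    private readonly Stream _output;
    private int _nextToWrite;

    public SegmentReorderBuffer(Stream output, int firstSequenceNumber)
    {
        _output = output;
        _nextToWrite = firstSequenceNumber;
    }

    // Called whenever a download completes, in any order.
    public void Add(int sequenceNumber, byte[] data)
    {
        _pending[sequenceNumber] = data;
        // Flush every segment that is now contiguous with what's already written.
        while (_pending.TryGetValue(_nextToWrite, out var segment))
        {
            _output.Write(segment, 0, segment.Length);
            _pending.Remove(_nextToWrite);
            _nextToWrite++;
        }
    }
}
```

If segment 2 arrives before segment 1, it simply waits in the dictionary; when segment 1 arrives, both are written in one pass.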
RunConsumer();
RunProducer();
Make sure to use async Task so that you can wait for completion with await Task.WhenAll(RunConsumer(), RunProducer());. But you should not need RunConsumer any longer.

Downloading multiple files fast and efficiently (async)

I have many files that I have to download, so I am trying to use the power of the new async features, as below.
var streamTasks = urls.Select(async url => (await WebRequest.CreateHttp(url).GetResponseAsync()).GetResponseStream()).ToList();
var streams = await Task.WhenAll(streamTasks);
foreach (var stream in streams)
{
    using (var fileStream = new FileStream("blabla", FileMode.Create))
    {
        await stream.CopyToAsync(fileStream);
    }
}
What I am afraid of about this code is that it will cause big memory usage: if there are 1000 files of 2 MB each, it will load 1000 * 2 MB of streams into memory.
I may be missing something, or I may be totally right. If I am not missing something, is it the best approach to await every request and consume each stream one at a time?
Both options could be problematic. Downloading only one at a time doesn't scale and takes time, while downloading all files at once could be too much of a load (also, there is no need to wait for all downloads before you process them).
I prefer to always cap such operations at a configurable size. A simple way to do so is to use an AsyncLock (which utilizes SemaphoreSlim). A more robust way is to use TPL Dataflow with a MaxDegreeOfParallelism:
var block = new ActionBlock<string>(async url =>
{
    var stream = (await WebRequest.CreateHttp(url).GetResponseAsync()).GetResponseStream();
    using (var fileStream = new FileStream("blabla", FileMode.Create))
    {
        await stream.CopyToAsync(fileStream);
    }
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 100 });
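The simpler SemaphoreSlim-based cap mentioned above could look like this (a sketch; the helper name and the `work` delegate standing in for the real per-URL download are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttled
{
    // Runs `work` for every item, but never more than `maxConcurrency` at a time.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> work)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = items.Select(async item =>
            {
                await gate.WaitAsync();        // take a slot
                try { await work(item); }
                finally { gate.Release(); }    // free the slot for the next item
            }).ToList();
            await Task.WhenAll(tasks);
        }
    }
}
```

Usage would be along the lines of `await Throttled.ForEachAsync(urls, 100, DownloadAndSaveAsync);`, where DownloadAndSaveAsync is whatever per-URL download method you already have.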
Your code will load the streams into memory whether you use async or not. Doing the work asynchronously handles the I/O part by returning to the caller until your ResponseStream returns.
The choice you have to make doesn't concern async, but rather how your program should read a big stream input.
If I were you, I would think about how to split the workload into chunks. You might read the ResponseStreams in parallel and save each stream to a different destination (such as a file), releasing it from memory as you go.
This is my own answer, using the chunking idea from Yuval Itzchakov, and I provide an implementation. Please provide feedback on this implementation.
foreach (var chunk in urls.Batch(5))
{
    var streamTasks = chunk
        .Select(async url => await WebRequest.CreateHttp(url).GetResponseAsync())
        .Select(async response => (await response).GetResponseStream());
    var streams = await Task.WhenAll(streamTasks);
    foreach (var stream in streams)
    {
        using (var fileStream = new FileStream("blabla", FileMode.Create))
        {
            await stream.CopyToAsync(fileStream);
        }
    }
}
Batch is an extension method, implemented simply as below.
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int chunksize)
{
    // Note: this re-enumerates `source` on every iteration (Any/Take/Skip),
    // so it is best used with an in-memory collection.
    while (source.Any())
    {
        yield return source.Take(chunksize);
        source = source.Skip(chunksize);
    }
}
