I have a lot of files to download, so I am trying to use the power of the new async features, as below.
var streamTasks = urls.Select(async url => (await WebRequest.CreateHttp(url).GetResponseAsync()).GetResponseStream()).ToList();
var streams = await Task.WhenAll(streamTasks);
foreach (var stream in streams)
{
using (var fileStream = new FileStream("blabla", FileMode.Create))
{
await stream.CopyToAsync(fileStream);
}
}
What I am afraid of is that this code will cause high memory usage: if there are 1000 files of 2 MB each, will it load 1000 * 2 MB of streams into memory?
I may be missing something, or I may be totally right. If I am not missing anything, is it better to await each request and consume its stream before starting the next one?
Both options could be problematic. Downloading only one at a time doesn't scale and takes time while downloading all files at once could be too much of a load (also, no need to wait for all to download before you process them).
I prefer to always cap such operations with a configurable limit. A simple way to do so is to use an AsyncLock (which utilizes SemaphoreSlim). A more robust way is to use TPL Dataflow with a MaxDegreeOfParallelism.
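As a rough illustration of the SemaphoreSlim idea, here is a minimal sketch of a throttled loop. The `Throttle` class, its `ForEachAsync` method and the `work` delegate are hypothetical names made up for this example, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class Throttle
{
    // Hypothetical helper: runs `work` for every item, with at most
    // `maxConcurrency` invocations in flight at any time.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> work)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);
        var tasks = new List<Task>();

        foreach (var item in items)
        {
            // Wait for a free slot before starting the next item.
            await gate.WaitAsync();
            tasks.Add(RunOneAsync(item));
        }

        // Wait for the items that are still running.
        await Task.WhenAll(tasks);

        async Task RunOneAsync(T item)
        {
            try { await work(item); }
            finally { gate.Release(); } // free the slot even if work fails
        }
    }
}
```

With something like this, the download loop would become `await Throttle.ForEachAsync(urls, 10, url => DownloadAsync(url));`, where `DownloadAsync` stands for whatever method fetches one URL and writes it to disk.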
var block = new ActionBlock<string>(async url =>
{
var stream = (await WebRequest.CreateHttp(url).GetResponseAsync()).GetResponseStream();
using (var fileStream = new FileStream("blabla", FileMode.Create))
{
await stream.CopyToAsync(fileStream);
}
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 100 });
Your code will load the stream into memory whether you use async or not. Doing async work handles the I/O part by returning to the caller until your ResponseStream returns.
The choice you have to make doesn't concern async, but rather how your program reads a large stream of input.
If I were you, I would think about how to split the work load into chunks. You might read the ResponseStream in parallel and save each stream to a different source (might be to a file) and release it from memory.
This is my own answer, implementing the chunking idea from Yuval Itzchakov. Please provide feedback on this implementation.
foreach (var chunk in urls.Batch(5))
{
var streamTasks = chunk
.Select(async url => await WebRequest.CreateHttp(url).GetResponseAsync())
.Select(async response => (await response).GetResponseStream());
var streams = await Task.WhenAll(streamTasks);
foreach (var stream in streams)
{
using (var fileStream = new FileStream("blabla", FileMode.Create))
{
await stream.CopyToAsync(fileStream);
}
}
}
Batch is an extension method, which is simply as below.
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int chunksize)
{
while (source.Any())
{
yield return source.Take(chunksize);
source = source.Skip(chunksize);
}
}
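One thing to watch with the Batch above: source.Any(), Take and Skip each enumerate the source again from the start, so a lazy source gets re-evaluated for every chunk. A single-pass variant could look roughly like the sketch below. `BatchOnce` is a made-up name for this example; on .NET 6 and later the built-in Enumerable.Chunk does the same job.

```csharp
using System.Collections.Generic;

public static class BatchExtensions
{
    // Hypothetical single-pass alternative to Batch: walks the source once
    // and yields lists of up to chunkSize items.
    public static IEnumerable<IReadOnlyList<T>> BatchOnce<T>(
        this IEnumerable<T> source, int chunkSize)
    {
        var chunk = new List<T>(chunkSize);
        foreach (var item in source)
        {
            chunk.Add(item);
            if (chunk.Count == chunkSize)
            {
                yield return chunk;
                chunk = new List<T>(chunkSize); // start a fresh list for the next chunk
            }
        }

        // Emit the final, possibly shorter, chunk.
        if (chunk.Count > 0)
            yield return chunk;
    }
}
```

It drops into the loop above as `foreach (var chunk in urls.BatchOnce(5))`.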
I am trying to parallelize this code to improve the upload time of a file, but what I have tried so far has not changed the time much.
I want to upload the blocks side by side and then commit them. How can I do this in parallel?
public static async Task UploadInBlocks
(BlobContainerClient blobContainerClient, string localFilePath, int blockSize)
{
string fileName = Path.GetFileName(localFilePath);
BlockBlobClient blobClient = blobContainerClient.GetBlockBlobClient(fileName);
FileStream fileStream = File.OpenRead(localFilePath);
ArrayList blockIDArrayList = new ArrayList();
byte[] buffer;
var bytesLeft = (fileStream.Length - fileStream.Position);
while (bytesLeft > 0)
{
if (bytesLeft >= blockSize)
{
buffer = new byte[blockSize];
await fileStream.ReadAsync(buffer, 0, blockSize);
}
else
{
buffer = new byte[bytesLeft];
await fileStream.ReadAsync(buffer, 0, Convert.ToInt32(bytesLeft));
bytesLeft = (fileStream.Length - fileStream.Position);
}
using (var stream = new MemoryStream(buffer))
{
string blockID = Convert.ToBase64String
(Encoding.UTF8.GetBytes(Guid.NewGuid().ToString()));
blockIDArrayList.Add(blockID);
await blobClient.StageBlockAsync(blockID, stream);
}
bytesLeft = (fileStream.Length - fileStream.Position);
}
string[] blockIDArray = (string[])blockIDArrayList.ToArray(typeof(string));
await blobClient.CommitBlockListAsync(blockIDArray);
}
Of course. You shouldn't expect any improvements - quite the opposite. Blob storage doesn't have any simplistic throughput throttling that would benefit from uploading in multiple streams, and you're already doing extremely light-weight I/O which is going to be entirely I/O bound.
Good I/O code has absolutely no benefits from parallelization. No matter how many workers you put on the job, the pipe is only this thick and will not allow you to pass more data through.
All your code just reimplements the already very efficient mechanisms that the blob storage library has... and you do it considerably worse, with pointless allocation, wrong arguments and new opportunities for bugs. Don't do that. The library can deal with streams just fine.
I have a foreach loop which is responsible for executing a certain set of statements. A part of that is to save an image from a URL to Azure storage. I have to do this for a large set of data. To achieve the same I have converted the foreach loop into a Parallel.ForEach loop.
Parallel.ForEach(listSkills, item =>
{
// some business logic
var b = getImageFromUrl(item.Url);
Stream ms = new MemoryStream(b);
saveImage(ms);
// more business logic
});
private static byte[] getByteArray(Stream input)
{
using (MemoryStream ms = new MemoryStream())
{
input.CopyTo(ms);
return ms.ToArray();
}
}
public static byte[] getImageFromUrl(string url)
{
HttpWebRequest request = null;
HttpWebResponse response = null;
byte[] b = null;
request = (HttpWebRequest)WebRequest.Create(url);
response = (HttpWebResponse)request.GetResponse();
if (request.HaveResponse)
{
if (response.StatusCode == HttpStatusCode.OK)
{
Stream receiveStream = response.GetResponseStream();
b = getByteArray(receiveStream);
}
}
return b;
}
public static void saveImage(Stream fileContent)
{
fileContent.Seek(0, SeekOrigin.Begin);
byte[] bytes = getByteArray(fileContent);
var blob = null;
blob.UploadFromByteArrayAsync(bytes, 0, bytes.Length).Wait();
}
Although there are instances when I am getting the below error and the image is not getting saved.
An existing connection was forcibly closed by the remote host.
Also sharing the StackTrace :
at System.Net.Sockets.NetworkStream.Read(Span`1 buffer)
at System.Net.Security.SslStream.<FillBufferAsync>d__183`1.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Net.Security.SslStream.<ReadAsyncInternal>d__181`1.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Net.Security.SslStream.Read(Byte[] buffer, Int32 offset, Int32 count)
at System.IO.Stream.Read(Span`1 buffer)
at System.Net.Http.HttpConnection.Read(Span`1 destination)
at System.Net.Http.HttpConnection.ContentLengthReadStream.Read(Span`1 buffer)
at System.Net.Http.HttpBaseStream.Read(Byte[] buffer, Int32 offset, Int32 count)
at System.IO.Stream.CopyTo(Stream destination, Int32 bufferSize)
at Utilities.getByteArray(Stream input) in D:\repos\SampleProj\Sample\Helpers\CH.cs:line 238
at Utilities.getImageFromUrl(String url) in D:\repos\SampleProj\Sample\Helpers\CH.cs:line 178
I am guessing this maybe because I am not using locks? I am unsure whether to use locks within a Parallel.ForEach loop.
According to another question on stackoverflow, here are the potential causes for An existing connection was forcibly closed by the remote host. :
You are sending malformed data to the application (which could include sending an HTTPS request to an HTTP server)
The network link between the client and server is going down for some reason
You have triggered a bug in the third-party application that caused it to crash
The third-party application has exhausted system resources
Since only some of your requests are affected, I think we can exclude the first one. This can, of course, be a network issue, and in that case it would happen from time to time depending on the quality of the network between you and the server.
Unless you find indications of an Azure Storage bug from other users, there is a high probability that your calls are consuming too much of the remote server's resources (connections/data) at the same time. Servers and proxies have limits on how many connections they can handle at the same time (especially from the same client machine).
Depending on the size of your listSkills list, your code may launch a large number of requests in parallel (as many as your thread pool allows), possibly flooding the server.
You could at least limit the number of parallel tasks launched by using MaxDegreeOfParallelism, like this:
Parallel.ForEach(listSkills,
new ParallelOptions { MaxDegreeOfParallelism = 4 },
item =>
{
// some business logic
var b = getImageFromUrl(item.Url);
Stream ms = new MemoryStream(b);
saveImage(ms);
// more business logic
});
You can control parallelism like:
listSkills.AsParallel()
.Select(item => {/*Your Logic*/ return item; })
.WithDegreeOfParallelism(10)
.Select(item =>
{
getImageFromUrl(item.Url);
saveImage(your_stream);
return item;
});
But Parallel.ForEach is not good for IO because it's designed for CPU-intensive tasks. If you use it for IO-bound operations, especially web requests, you may waste thread pool threads that sit blocked while waiting for a response.
You can use asynchronous web request methods like HttpWebRequest.GetResponseAsync. On the other hand, you can also use thread synchronization constructs for that, for example a semaphore. A semaphore is like a queue: it allows X threads to pass, and the rest have to wait until one of the busy threads finishes its work.
First make your getImageFromUrl method async, like this (this is not a good solution, but it can be better):
public static async Task getImageFromUrl(SemaphoreSlim semaphore, string url)
{
try
{
HttpWebRequest request = null;
byte[] b = null;
request = (HttpWebRequest)WebRequest.Create(url);
using (var response = await request.GetResponseAsync().ConfigureAwait(false))
{
// your logic
}
}
catch (Exception ex)
{
// handle exp
}
finally
{
// release
semaphore.Release();
}
}
and then:
var tasks = new List<Task>();
using (var semaphore = new SemaphoreSlim(10))
{
foreach (var url in urls)
{
// await here until there is a room for this task
await semaphore.WaitAsync();
tasks.Add(getImageFromUrl(semaphore, url));
}
// await for the rest of tasks to complete
await Task.WhenAll(tasks);
}
You should not use Parallel or Task.Run; instead, you can have an async handler method like:
public async Task handleResponse(Task<HttpResponseMessage> responseTask)
{
HttpResponseMessage response = await responseTask;
//Process your data
}
and then use Task.WhenAll like:
Task[] requests = myList.Select(l => getImageFromUrl(l.Id))
.Select(r => handleResponse(r))
.ToArray();
await Task.WhenAll(requests);
In the end, there are several solutions for your scenario, but forget Parallel.ForEach and use a solution suited to IO instead.
There are several problems with this code:
Parallel.ForEach is meant for data parallelism, not IO. The code ties up its worker threads while they wait for IO to complete
HttpWebRequest is a wrapper over HttpClient in .NET Core. Using HttpWebRequest is inefficient and far more complex than needed.
HttpClient can retrieve or post stream contents, which means there's no reason to load stream contents in memory. HttpClient is thread-safe and meant to be reused too.
There are several ways to execute many IO operations concurrently in .NET Core.
.NET 6
In the current Long-Term-Support version of .NET, .NET 6, this can be done using Parallel.ForEachAsync. Scott Hanselman shows how easy it is to use it for API calls
You can retrieve the data directly with GetByteArrayAsync :
record CopyRequest(Uri sourceUri,Uri targetUri);
...
var requests=new List<CopyRequest>();
//Load some source/target URLs
var client=new HttpClient();
//The body delegate receives the item and a CancellationToken
await Parallel.ForEachAsync(requests,async (req,cancellationToken)=>{
var bytes=await client.GetByteArrayAsync(req.sourceUri);
var blob=new CloudAppendBlob(req.targetUri);
await blob.UploadFromByteArrayAsync(bytes, 0, bytes.Length);
});
A better option would be to retrieve the data as a stream and send it directly to the blob :
await Parallel.ForEachAsync(requests,async (req,cancellationToken)=>{
var response=await client.GetAsync(req.sourceUri,
HttpCompletionOption.ResponseHeadersRead);
using var sourceStream=await response.Content.ReadAsStreamAsync();
var blob=new CloudAppendBlob(req.targetUri);
await blob.UploadFromStreamAsync(sourceStream);
});
HttpCompletionOption.ResponseHeadersRead causes GetAsync to return as soon as the response headers are received, without buffering any of the response data.
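To make the effect of ResponseHeadersRead concrete without the blob storage part, here is a minimal sketch that streams a response body into any destination stream. The `StreamingDownload` class and `DownloadToAsync` method are hypothetical names for this example; the 81920 buffer size is just a commonly used value, not a recommendation from the answer.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class StreamingDownload
{
    // Hypothetical helper: copies the response body to `destination`
    // in buffer-sized pieces instead of loading it all into a byte[] first.
    public static async Task DownloadToAsync(
        HttpClient client, Uri source, Stream destination,
        CancellationToken cancellationToken = default)
    {
        // Return as soon as the headers arrive; the body is read lazily below.
        using var response = await client.GetAsync(
            source, HttpCompletionOption.ResponseHeadersRead, cancellationToken);
        response.EnsureSuccessStatusCode();

        using var body = await response.Content.ReadAsStreamAsync();
        await body.CopyToAsync(destination, 81920, cancellationToken);
    }
}
```

A call would look like `await StreamingDownload.DownloadToAsync(client, req.sourceUri, fileStream);` with a single shared `client`, so only about one buffer's worth of data per download is held in memory at a time.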
.NET Core 3.1
In older .NET Core versions (which are reaching End-of-Life in a few months) you can use, e.g., an ActionBlock with a degree of parallelism greater than 1:
var options=new ExecutionDataflowBlockOptions{ MaxDegreeOfParallelism = 8};
var copyBlock=new ActionBlock<CopyRequest>(async req=>{
var response=await client.GetAsync(req.sourceUri,
HttpCompletionOption.ResponseHeadersRead);
using var sourceStream=await response.Content.ReadAsStreamAsync();
var blob=new CloudAppendBlob(req.targetUri);
await blob.UploadFromStreamAsync(sourceStream);
}, options);
The block classes in the TPL Dataflow library can be used to construct processing pipelines similar to a shell script pipeline, with each block piping its output to the next block.
I came across IAsyncEnumerable while testing C# 8.0 features, and I found remarkable examples from Anthony Chu (https://anthonychu.ca/post/async-streams-dotnet-core-3-iasyncenumerable/). It is an async stream and a replacement for Task<IEnumerable<T>>.
// Data Access Layer.
public async IAsyncEnumerable<Product> GetAllProducts()
{
Container container = cosmosClient.GetContainer(DatabaseId, ContainerId);
var iterator = container.GetItemQueryIterator<Product>("SELECT * FROM c");
while (iterator.HasMoreResults)
{
foreach (var product in await iterator.ReadNextAsync())
{
yield return product;
}
}
}
// Usage
await foreach (var product in productsRepository.GetAllProducts())
{
Console.WriteLine(product);
}
I am wondering if this can be applied to reading text files, like the usage below that reads a file line by line.
foreach (var line in File.ReadLines("Filename"))
{
// ...process line.
}
I really want to know how to apply async with IAsyncEnumerable<string> to the above foreach loop so that it streams while reading.
How do I implement an iterator so that I can use yield return to read line by line?
Exactly the same; however, there is no async workload, so let's pretend:
public async IAsyncEnumerable<string> SomeSortOfAwesomeness()
{
foreach (var line in File.ReadLines("Filename.txt"))
{
// simulates an async workload,
// otherwise why would be using IAsyncEnumerable?
// -- added due to popular demand
await Task.Delay(100);
yield return line;
}
}
or
This is just a wrapped APM workload; see Stephen Cleary's comments for clarification.
public static async IAsyncEnumerable<string> SomeSortOfAwesomeness()
{
using StreamReader reader = File.OpenText("Filename.txt");
while(!reader.EndOfStream)
yield return await reader.ReadLineAsync();
}
Usage
await foreach(var line in SomeSortOfAwesomeness())
{
Console.WriteLine(line);
}
Update from Stephen Cleary
File.OpenText sadly only allows synchronous I/O; the async APIs are
implemented poorly in that scenario. To open a true asynchronous file,
you'd need to use a FileStream constructor passing isAsync: true or
FileOptions.Asynchronous.
ReadLineAsync basically results in this code. As you can see, it's only the Stream APM Begin and End methods wrapped:
private Task<Int32> BeginEndReadAsync(Byte[] buffer, Int32 offset, Int32 count)
{
return TaskFactory<Int32>.FromAsyncTrim(
this, new ReadWriteParameters { Buffer = buffer, Offset = offset, Count = count },
(stream, args, callback, state) => stream.BeginRead(args.Buffer, args.Offset, args.Count, callback, state), // cached by compiler
(stream, asyncResult) => stream.EndRead(asyncResult)); // cached by compiler
}
I did some performance tests and it seems that a large bufferSize is helpful, together with the FileOptions.SequentialScan option.
public static async IAsyncEnumerable<string> ReadLinesAsync(string filePath)
{
using var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read,
FileShare.Read, 32768, FileOptions.Asynchronous | FileOptions.SequentialScan);
using var reader = new StreamReader(stream);
while (true)
{
var line = await reader.ReadLineAsync().ConfigureAwait(false);
if (line == null) break;
yield return line;
}
}
The enumeration is not truly asynchronous though. According to my experiments (.NET Core 3.1), the xxxAsync methods of the StreamReader class block the current thread for longer than the awaiting period of the Task they return. For example, reading a 6 MB file with the method ReadToEndAsync on my PC blocks the current thread for 120 msec before returning the task, and then the task completes in just 20 msec. So I am not sure that there is much value in using these methods. Faking asynchrony is much easier by using the synchronous APIs together with some Linq.Async:
IAsyncEnumerable<string> lines = File.ReadLines("SomeFile.txt").ToAsyncEnumerable();
.NET 6 update: The implementation of the asynchronous filesystem APIs has been improved on .NET 6. For experimental data with the File.ReadAllLinesAsync method, see here.
I am writing a small logger and I want to open the log file once, keep writing reactively as log messages arrive, and dispose of everything on program termination.
I am not sure how I can keep the FileStream open and reactively write the messages as they arrive.
I would like to update the design from my previous solution where I had a ConcurrentQueue acting as a buffer, and a loop inside the using statements that consumed the queue.
Specifically, I want to simultaneously take advantage of the using statement construct, so I don't have to explicitly close the stream and writer, and of the reactive, loopless programming style. Currently I only know how to use one of these constructs at once: either the using/loop combination, or the explicit-stream-close/reactive combination.
Here's my code:
BufferBlock<LogEntry> _buffer = new BufferBlock<LogEntry>();
// CONSTRUCTOR
public DefaultLogger(string folder)
{
_filePath = Path.Combine(folder, $"{DateTime.Now.ToString("yyyy.MM.dd")}.log");
_cancellation = new CancellationTokenSource();
var observable = _buffer.AsObservable();
using (var stream = File.Create(_filePath))
using (var writer = new StreamWriter(stream))
using (var subscription = observable.Subscribe(entry =>
writer.Write(GetFormattedString(entry))))
{
while (!_cancellation.IsCancellationRequested)
{
// what do I do here?
}
}
}
You need to use Observable.Using. It's designed to create an IDisposable resource that gets disposed when the sequence ends.
Try something like this:
IDisposable subscription =
Observable.Using(() => File.Create(_filePath),
stream => Observable.Using(() => new StreamWriter(stream),
writer => _buffer.AsObservable().Select(entry => new { entry, writer })))
.Subscribe(x => x.writer.Write(GetFormattedString(x.entry)));
I'm writing a basic Http Live Stream (HLS) downloader, where I'm re-downloading a m3u8 media playlist at an interval specified by "#EXT-X-TARGETDURATION" and then download the *.ts segments as they become available.
This is what the m3u8 media playlist might look like when first downloaded.
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:12
#EXT-X-MEDIA-SEQUENCE:1
#EXTINF:7.975,
http://website.com/segment_1.ts
#EXTINF:7.941,
http://website.com/segment_2.ts
#EXTINF:7.975,
http://website.com/segment_3.ts
I'd like to download these *.ts segments all at the same time with HttpClient async/await. The segments do not have the same size, so even though the download of "segment_1.ts" is started first, it might finish after the other two segments.
These segments are all part of one large video, so it's important that the data of the downloaded segments are written in the order they were started, NOT in the order they finished.
My code below works perfectly fine if the segments are downloaded one after another, but not when multiple segments are downloaded at the same time, because sometimes they don't finish in the order they were started.
I thought about using Task.WhenAll, which guarantees a correct order, but I don't want to keep the downloaded segments in memory unnecessarily, because they can be a few megabytes in size. If the download of "segment_1.ts" does finish first, then it should be written to disk right away, without having to wait for the other segments to finish. Writing all the *.ts segments to separate files and joining them in the end is not an option either, because it would require double the disk space and the total video can be a few gigabytes in size.
I have no idea how to do this and I'm wondering if somebody can help me with that. I'm looking for a way that doesn't require me to create threads manually or block a ThreadPool thread for a long period of time.
Some of the code and exception handling have been removed to make it easier to see what is going on.
// Async BlockingCollection from the AsyncEx library
private AsyncCollection<byte[]> segmentDataQueue = new AsyncCollection<byte[]>();
public void Start()
{
RunConsumer();
RunProducer();
}
private async void RunProducer()
{
while (!_isCancelled)
{
var response = await _client.GetAsync(_playlistBaseUri + _playlistFilename, _cts.Token).ConfigureAwait(false);
var data = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
string[] lines = data.Split(new string[] { "\n" }, StringSplitOptions.RemoveEmptyEntries);
if (!lines.Any() || lines[0] != "#EXTM3U")
throw new Exception("Invalid m3u8 media playlist.");
for (var i = 1; i < lines.Length; i++)
{
var line = lines[i];
if (line.StartsWith("#EXT-X-TARGETDURATION"))
{
ParseTargetDuration(line);
}
else if (line.StartsWith("#EXT-X-MEDIA-SEQUENCE"))
{
ParseMediaSequence(line);
}
else if (!line.StartsWith("#"))
{
if (_isNewSegment)
{
// Fire and forget
DownloadTsSegment(line);
}
}
}
// Wait until it's time to reload the m3u8 media playlist again
await Task.Delay(_targetDuration * 1000, _cts.Token).ConfigureAwait(false);
}
}
// async void. We never await this method, so we can download multiple segments at once
private async void DownloadTsSegment(string tsUrl)
{
var response = await _client.GetAsync(tsUrl, _cts.Token).ConfigureAwait(false);
var data = await response.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
// Add the downloaded segment data to the AsyncCollection
await segmentDataQueue.AddAsync(data, _cts.Token).ConfigureAwait(false);
}
private async void RunConsumer()
{
using (FileStream fs = new FileStream(_filePath, FileMode.Create, FileAccess.Write, FileShare.Read))
{
while (!_isCancelled)
{
// Wait until new segment data is added to the AsyncCollection and write it to disk
var data = await segmentDataQueue.TakeAsync(_cts.Token).ConfigureAwait(false);
await fs.WriteAsync(data, 0, data.Length).ConfigureAwait(false);
}
}
}
I don't think you need a producer/consumer queue at all here. However, I do think you should avoid "fire and forget".
You can start them all at the same time, and just process them as they complete.
First, define how to download a single segment:
private async Task<byte[]> DownloadTsSegmentAsync(string tsUrl)
{
var response = await _client.GetAsync(tsUrl, _cts.Token).ConfigureAwait(false);
return await response.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
}
Then add the parsing of the playlist which results in a list of segment downloads (which are all in progress already):
private List<Task<byte[]>> DownloadTasks(string data)
{
var result = new List<Task<byte[]>>();
string[] lines = data.Split(new string[] { "\n" }, StringSplitOptions.RemoveEmptyEntries);
if (!lines.Any() || lines[0] != "#EXTM3U")
throw new Exception("Invalid m3u8 media playlist.");
...
if (_isNewSegment)
{
result.Add(DownloadTsSegmentAsync(line));
}
...
return result;
}
Consume this list one at a time (in order) by writing to a file:
private async Task RunConsumerAsync(List<Task<byte[]>> downloads)
{
using (FileStream fs = new FileStream(_filePath, FileMode.Create, FileAccess.Write, FileShare.Read))
{
foreach (var task in downloads)
{
var data = await task.ConfigureAwait(false);
await fs.WriteAsync(data, 0, data.Length).ConfigureAwait(false);
}
}
}
And kick it all off with a producer:
public async Task RunAsync()
{
// TODO: consider CancellationToken instead of a boolean.
while (!_isCancelled)
{
var response = await _client.GetAsync(_playlistBaseUri + _playlistFilename, _cts.Token).ConfigureAwait(false);
var data = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
var tasks = DownloadTasks(data);
await RunConsumerAsync(tasks);
await Task.Delay(_targetDuration * 1000, _cts.Token).ConfigureAwait(false);
}
}
Note that this solution does run all downloads concurrently, and this can cause memory pressure. If this is a problem, I recommend you restructure to use TPL Dataflow, which has built-in support for throttling.
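If pulling in TPL Dataflow feels like too much, the memory pressure can also be limited with a plain sliding window. This is a different technique from the Dataflow throttling mentioned above, shown only as a minimal sketch: `OrderedDownloads`, `WriteInOrderAsync` and the `download` delegate are hypothetical names for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class OrderedDownloads
{
    // Hypothetical helper: keeps at most `maxConcurrency` downloads in flight
    // and writes their results to `output` in the order they were started.
    public static async Task WriteInOrderAsync<T>(
        IEnumerable<T> inputs, int maxConcurrency,
        Func<T, Task<byte[]>> download, Stream output)
    {
        var inFlight = new Queue<Task<byte[]>>();

        foreach (var input in inputs)
        {
            if (inFlight.Count == maxConcurrency)
            {
                // Window is full: finish the oldest download before starting another.
                var data = await inFlight.Dequeue();
                await output.WriteAsync(data, 0, data.Length);
            }
            inFlight.Enqueue(download(input));
        }

        // Drain whatever is still running, again oldest first.
        while (inFlight.Count > 0)
        {
            var data = await inFlight.Dequeue();
            await output.WriteAsync(data, 0, data.Length);
        }
    }
}
```

The trade-off is that a slow segment at the head of the window holds back the start of new downloads, but no more than `maxConcurrency` segments are ever buffered.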
Assign each download a sequence number. Put the results into a Dictionary<int, byte[]>. Each time a download completes it adds its own result.
It then checks if there are segments to write to disk:
while (dict.ContainsKey(lowestWrittenSegmentNumber + 1)) {
WriteSegment(dict[lowestWrittenSegmentNumber + 1]);
lowestWrittenSegmentNumber++;
}
That way all segments end up on disk, in order and with buffering.
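Fleshed out a little, the dictionary idea might look like the sketch below. `ReorderBuffer` and its members are hypothetical names for this example. Unlike the snippet above, it tracks the next expected number rather than the last written one, removes entries once written so their memory can be reclaimed, and takes a lock because downloads may complete on different threads. The `write` callback is assumed to be synchronous.

```csharp
using System;
using System.Collections.Generic;

public sealed class ReorderBuffer
{
    private readonly Dictionary<int, byte[]> _pending = new Dictionary<int, byte[]>();
    private readonly Action<byte[]> _write;
    private readonly object _gate = new object();
    private int _next; // sequence number of the next segment to write

    public ReorderBuffer(int firstSequenceNumber, Action<byte[]> write)
    {
        _next = firstSequenceNumber;
        _write = write;
    }

    // Called by each download when it completes, in any order.
    public void Add(int sequenceNumber, byte[] data)
    {
        lock (_gate)
        {
            _pending[sequenceNumber] = data;

            // Flush every segment that is now contiguous with what was written.
            while (_pending.TryGetValue(_next, out var ready))
            {
                _pending.Remove(_next);
                _write(ready);
                _next++;
            }
        }
    }
}
```

Each download would call `buffer.Add(sequenceNumber, data)` when it finishes; segments that arrive early simply wait in the dictionary until the gap before them is filled.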
RunConsumer();
RunProducer();
Make sure to use async Task so that you can wait for completion with await Task.WhenAll(RunConsumer(), RunProducer());. But you should not need RunConsumer any longer.