I am trying to convert this to run in parallel to improve the upload time of a file, but what I have tried so far hasn't changed the time much.
I want to upload the blocks side by side and then commit them. How can I manage to do it in parallel?
public static async Task UploadInBlocks
    (BlobContainerClient blobContainerClient, string localFilePath, int blockSize)
{
    string fileName = Path.GetFileName(localFilePath);
    BlockBlobClient blobClient = blobContainerClient.GetBlockBlobClient(fileName);

    FileStream fileStream = File.OpenRead(localFilePath);
    ArrayList blockIDArrayList = new ArrayList();
    byte[] buffer;

    var bytesLeft = (fileStream.Length - fileStream.Position);

    while (bytesLeft > 0)
    {
        if (bytesLeft >= blockSize)
        {
            buffer = new byte[blockSize];
            await fileStream.ReadAsync(buffer, 0, blockSize);
        }
        else
        {
            buffer = new byte[bytesLeft];
            await fileStream.ReadAsync(buffer, 0, Convert.ToInt32(bytesLeft));
            bytesLeft = (fileStream.Length - fileStream.Position);
        }

        using (var stream = new MemoryStream(buffer))
        {
            string blockID = Convert.ToBase64String
                (Encoding.UTF8.GetBytes(Guid.NewGuid().ToString()));
            blockIDArrayList.Add(blockID);
            await blobClient.StageBlockAsync(blockID, stream);
        }
        bytesLeft = (fileStream.Length - fileStream.Position);
    }

    string[] blockIDArray = (string[])blockIDArrayList.ToArray(typeof(string));
    await blobClient.CommitBlockListAsync(blockIDArray);
}
Of course. You shouldn't expect any improvement - quite the opposite. Blob storage doesn't have any simplistic throughput throttling that would benefit from uploading over multiple streams, and you're already doing extremely lightweight I/O, which is going to be entirely I/O bound.
Good I/O code gains nothing from parallelization. No matter how many workers you put on the job, the pipe is only so thick and will not let you push more data through it.
All your code does is reimplement the already very efficient mechanisms that the blob storage library has... and it does so considerably worse, with pointless allocation, wrong arguments and new opportunities for bugs. Don't do that. The library can deal with streams just fine.
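To illustrate that last point, here is a minimal sketch of letting the library handle the chunking, assuming the v12 Azure.Storage.Blobs SDK; the StorageTransferOptions values are illustrative knobs, not recommendations:

public static async Task UploadWholeFile
    (BlobContainerClient blobContainerClient, string localFilePath)
{
    BlockBlobClient blobClient = blobContainerClient
        .GetBlockBlobClient(Path.GetFileName(localFilePath));

    var options = new BlobUploadOptions
    {
        TransferOptions = new StorageTransferOptions
        {
            // Illustrative values; the defaults are usually sensible.
            MaximumConcurrency = 4,
            MaximumTransferSize = 8 * 1024 * 1024
        }
    };

    using (FileStream fileStream = File.OpenRead(localFilePath))
    {
        // The SDK splits the stream into blocks, stages them, and
        // commits the block list for you.
        await blobClient.UploadAsync(fileStream, options);
    }
}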
Related
I started a small project for fun, and I liked it so much that I started to expand it. I needed a simple file explorer (like the Windows one) just to see and open files, but now that I'm expanding it I have a problem with multiple files. I want to copy some files from directory A and paste them in B, and while that is running, copy some files from directory C and paste them in D. If the copy A -> B is in progress, the copy C -> D should be paused, and when the copy A -> B finishes, the second copy can start. For now, I can copy files from A to B. Is there anything not too complex I can try?
I'm using a new form to display the progress bar, file name, file count, and size when starting a copy, and I'm using a BackgroundWorker.
I am assuming you are just calling File.Move or File.Copy; those don't give you the ability to pause the actual operation, so you will have to write your own Move/Copy operations.
E.g. to copy the file you could do the following:
public void CopyFile(string sourceFileName, string destFileName, bool overwrite)
{
    var outputFileMode = overwrite ? FileMode.Create : FileMode.CreateNew;
    using (var inputStream = new FileStream(sourceFileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (var outputStream = new FileStream(destFileName, outputFileMode, FileAccess.Write, FileShare.None))
    {
        const int bufferSize = 16384; // 16 KB
        var buffer = new byte[bufferSize];
        int bytesRead;
        do
        {
            bytesRead = inputStream.Read(buffer, 0, bufferSize);
            outputStream.Write(buffer, 0, bytesRead);
        } while (bytesRead == bufferSize);
    }
}
Now that you have this code, you can simply add a while loop with a condition to "pause" the copy, e.g.
while (_pause)
{
    Thread.Sleep(100);
}
This while loop goes inside the do loop from the code above.
Here's the complete idea:
public void CopyFile(string sourceFileName, string destFileName, bool overwrite)
{
    var outputFileMode = overwrite ? FileMode.Create : FileMode.CreateNew;
    using (var inputStream = new FileStream(sourceFileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (var outputStream = new FileStream(destFileName, outputFileMode, FileAccess.Write, FileShare.None))
    {
        const int bufferSize = 16384; // 16 KB
        var buffer = new byte[bufferSize];
        int bytesRead;
        do
        {
            // run this loop until _pause == false
            while (_pause)
            {
                Thread.Sleep(100);
            }
            bytesRead = inputStream.Read(buffer, 0, bufferSize);
            outputStream.Write(buffer, 0, bytesRead);
        } while (bytesRead == bufferSize);
    }
}
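For completeness, the _pause flag isn't declared in the snippet above; a minimal sketch of the assumed declaration would be a volatile field toggled from the UI:

// Assumed declaration for the flag used above; volatile so the copying
// thread promptly sees changes made from the UI thread.
private volatile bool _pause;

public void Pause() => _pause = true;
public void Resume() => _pause = false;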
You could use objects to represent different operations, something like
public interface IFileOperation
{
    void Execute(Action<double> reportProgress, CancellationToken cancel);
}
You could then create a queue of multiple operations, and create a task on another thread to process each item
private CancellationTokenSource cts = new CancellationTokenSource();
public double CurrentProgress { get; private set; }
public void CancelCurrentOperation() => cts.Cancel();
public BlockingCollection<IFileOperation> Queue = new BlockingCollection<IFileOperation>(new ConcurrentQueue<IFileOperation>());

public void RunOnWorkerThread()
{
    foreach (var op in Queue.GetConsumingEnumerable())
    {
        cts = new CancellationTokenSource();
        CurrentProgress = 0;
        op.Execute(p => CurrentProgress = p, cts.Token);
    }
}
This will run file operations, one at a time, on a background thread, while allowing new operations to be added from the main thread. To report progress you would need a non-modal progress bar, i.e. instead of showing a dialog you should add a progress bar control somewhere in your UI; otherwise you would not be able to add new operations without cancelling the current one. You will also need some way to connect the progress bar to the currently running operation, for example by running a timer on the main thread that updates the property the progress bar is bound to. You can run the method either as a long-running task or on a dedicated thread (see the sketch below).
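A sketch of that last point, assuming the members above; TaskCreationOptions.LongRunning hints to the scheduler that this loop will occupy its thread indefinitely:

// Start the consumer loop off the UI thread; LongRunning gives it a
// dedicated thread instead of tying up a thread-pool worker.
Task.Factory.StartNew(RunOnWorkerThread, TaskCreationOptions.LongRunning);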
You could, if you wish, add pause/resume methods to the file operation. The answer by Rand Random shows how to copy files manually, so I will skip that here. You could also create a UI that shows a list of all queued file operations and allows removing queued tasks. You could even, with some more work, run multiple operations in parallel and show separate progress for each one.
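To make the interface concrete, here is a hypothetical implementation sketch; CopyFileOperation and its members are illustrative names, not from either answer, and the copy loop is the same pattern shown above:

// A hypothetical IFileOperation that copies one file and reports progress.
public class CopyFileOperation : IFileOperation
{
    private readonly string _source;
    private readonly string _destination;

    public CopyFileOperation(string source, string destination)
    {
        _source = source;
        _destination = destination;
    }

    public void Execute(Action<double> reportProgress, CancellationToken cancel)
    {
        var buffer = new byte[81920];
        using (var input = File.OpenRead(_source))
        using (var output = File.Create(_destination))
        {
            int bytesRead;
            while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                cancel.ThrowIfCancellationRequested();
                output.Write(buffer, 0, bytesRead);
                reportProgress((double)input.Position / input.Length);
            }
        }
    }
}

Enqueuing a copy then becomes Queue.Add(new CopyFileOperation(source, dest));.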
I have the following code to loop through a file and read 1024 bytes at a time. The first iteration uses FileStream.Read() and the second iteration uses FileStream.ReadAsync().
private async void Button_Click(object sender, RoutedEventArgs e)
{
    await Task.Run(() => Test()).ConfigureAwait(false);
}

private async Task Test()
{
    Stopwatch sw = new Stopwatch();
    sw.Start();

    int readSize;
    int blockSize = 1024;
    byte[] data = new byte[blockSize];
    string theFile = @"C:\test.mp4";
    long totalRead = 0;

    using (FileStream fs = new FileStream(theFile, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    {
        readSize = fs.Read(data, 0, blockSize);
        while (readSize > 0)
        {
            totalRead += readSize;
            readSize = fs.Read(data, 0, blockSize);
        }
    }
    sw.Stop();
    Console.WriteLine($"Read() Took {sw.ElapsedMilliseconds}ms and totalRead: {totalRead}");

    sw.Reset();
    sw.Start();
    totalRead = 0;

    using (FileStream fs = new FileStream(theFile, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, (blockSize * 2), FileOptions.Asynchronous | FileOptions.SequentialScan))
    {
        readSize = await fs.ReadAsync(data, 0, blockSize).ConfigureAwait(false);
        while (readSize > 0)
        {
            totalRead += readSize;
            readSize = await fs.ReadAsync(data, 0, blockSize).ConfigureAwait(false);
        }
    }
    sw.Stop();
    Console.WriteLine($"ReadAsync() Took {sw.ElapsedMilliseconds}ms and totalRead: {totalRead}");
}
And the result:
Read() Took 162ms and totalRead: 162835040
ReadAsync() Took 15597ms and totalRead: 162835040
ReadAsync() is about 100 times slower. Am I missing anything? The only thing I can think of is the overhead of creating and destroying tasks with ReadAsync(), but is that overhead really that large?
UPDATE:
I've changed the above code to reflect the suggestion by @Cory. There is a slight improvement:
Read() Took 142ms and totalRead: 162835040
ReadAsync() Took 12288ms and totalRead: 162835040
When I increase the read block size to 1 MB as suggested by @Alexandru, the results are much more acceptable:
Read() Took 32ms and totalRead: 162835040
ReadAsync() Took 76ms and totalRead: 162835040
So, it hints that it is indeed the overhead from the sheer number of tasks that causes the slowness. But if the creation and destruction of a task takes merely 100µs, things still don't really add up for the slowness with a small block size.
Stick with big buffers if you're doing async, and make sure to turn on async mode in the FileStream constructor, and you should be okay. Async methods that you await like this trap in and out of the current thread (mind you, the current thread is the UI thread in your case, which can be lagged by any other async method doing the same in-and-out thread trapping), so there is some overhead involved if you have a large number of calls (imagine calling a new thread constructor and waiting for it to finish about 100K times, especially in a UI app where you need to wait for the UI thread to be free in order to trap back into it once the async function completes). To reduce these calls, we simply read larger increments of data, focusing the application on reading more data at a time by increasing the buffer size. Make sure to test this code in Release mode so all of the compiler optimizations are available to us, and so that the debugger does not slow us down:
class Program
{
    static void Main(string[] args)
    {
        DoStuff();
        Console.ReadLine();
    }

    public static async void DoStuff()
    {
        var filename = @"C:\Example.txt";
        var sw = new Stopwatch();
        sw.Start();
        ReadAllFile(filename);
        sw.Stop();
        Console.WriteLine("Sync: " + sw.Elapsed);
        sw.Restart();
        await ReadAllFileAsync(filename);
        sw.Stop();
        Console.WriteLine("Async: " + sw.Elapsed);
    }

    static void ReadAllFile(string filename)
    {
        byte[] buffer = new byte[131072];
        using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.Read, buffer.Length, false))
            while (true)
                if (file.Read(buffer, 0, buffer.Length) <= 0)
                    break;
    }

    static async Task ReadAllFileAsync(string filename)
    {
        byte[] buffer = new byte[131072];
        using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.Read, buffer.Length, true))
            while (true)
                if ((await file.ReadAsync(buffer, 0, buffer.Length)) <= 0)
                    break;
    }
}
Results:
Sync: 00:00:00.3092809
Async: 00:00:00.5541262
Pretty negligible...the file is about 1 GB.
Let's say I go even bigger, a 1 MB buffer, AKA new byte[1048576] (come on man, everyone has 1 MB of RAM these days):
Sync: 00:00:00.2925763
Async: 00:00:00.3402034
Then it's just a few hundredths of a second of difference. If you blink, you'll miss it.
Your method signature suggests you're doing this from a WPF app. While the blocking code will tie up the UI thread for the duration, the async code will be forced to go through the UI message queue every time an asynchronous operation completes, slowing it down and competing with any UI messages. You should try moving it off the UI thread, like so:
void Button_Click(object sender, RoutedEventArgs e)
{
    Task.Run(() => Button_Click_Impl());
}

async Task Button_Click_Impl()
{
    // put code here.
}
Next, open the file in async mode. If you don't do this, async is emulated and will go much slower:
new FileStream(theFile, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 4096,
    FileOptions.Asynchronous | FileOptions.SequentialScan)
Finally, you may also be able to extract a small performance gain by using ConfigureAwait(false) to avoid moving between threads:
readSize = await fs.ReadAsync(data, 0, 1024).ConfigureAwait(false);
The overhead of a single ReadAsync operation is much higher than that of a single Read operation (especially if you do not use the right mode when opening the file; see the other answers). If you eventually end up with the whole file in memory anyway, just query the file's size, allocate a large enough buffer, and read it all at once (sketched below). Otherwise, you can still increase the buffer size to e.g. 32 MiB, or even larger if you expect larger file sizes. That should speed everything up considerably.
Only bother with launching a new task if there is considerable CPU-bound work for each block. Otherwise the UI should be kept responsive by the ReadAsync operations (with a sufficiently large buffer) taking their time (if one completes immediately, you may still be blocking the UI; see the remarks at Task.Yield()).
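A minimal sketch of the "allocate once, read all" suggestion, assuming the file fits in memory and reusing the variable names from the question:

// Query the size up front and read the whole file with a handful of
// large async reads instead of thousands of 1 KB ones.
using (FileStream fs = new FileStream(theFile, FileMode.Open, FileAccess.Read,
    FileShare.ReadWrite, 4096, FileOptions.Asynchronous | FileOptions.SequentialScan))
{
    byte[] data = new byte[fs.Length];
    int offset = 0;
    while (offset < data.Length)
    {
        int read = await fs.ReadAsync(data, offset, data.Length - offset).ConfigureAwait(false);
        if (read == 0)
            break; // unexpected end of file
        offset += read;
    }
}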
I'm writing a basic HTTP Live Streaming (HLS) downloader, where I re-download an m3u8 media playlist at an interval specified by "#EXT-X-TARGETDURATION" and then download the *.ts segments as they become available.
This is what the m3u8 media playlist might look like when first downloaded.
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:12
#EXT-X-MEDIA-SEQUENCE:1
#EXTINF:7.975,
http://website.com/segment_1.ts
#EXTINF:7.941,
http://website.com/segment_2.ts
#EXTINF:7.975,
http://website.com/segment_3.ts
I'd like to download these *.ts segments all at the same time with HttpClient async/await. The segments do not have the same size, so even though the download of "segment_1.ts" is started first, it might finish after the other two segments.
These segments are all part of one large video, so it's important that the data of the downloaded segments are written in the order they were started, NOT in the order they finished.
My code below works perfectly fine if the segments are downloaded one after another, but not when multiple segments are downloaded at the same time, because sometimes they don't finish in the order they were started.
I thought about using Task.WhenAll, which guarantees the correct order, but I don't want to keep the downloaded segments in memory unnecessarily, because they can each be a few megabytes in size. If the download of "segment_1.ts" finishes first, it should be written to disk right away, without having to wait for the other segments to finish. Writing all the *.ts segments to separate files and joining them at the end is not an option either, because it would require double the disk space, and the total video can be a few gigabytes in size.
I have no idea how to do this and I'm wondering if somebody can help me with that. I'm looking for a way that doesn't require me to create threads manually or block a ThreadPool thread for a long period of time.
Some of the code and exception handling have been removed to make it easier to see what is going on.
// Async BlockingCollection from the AsyncEx library
private AsyncCollection<byte[]> segmentDataQueue = new AsyncCollection<byte[]>();

public void Start()
{
    RunConsumer();
    RunProducer();
}

private async void RunProducer()
{
    while (!_isCancelled)
    {
        var response = await _client.GetAsync(_playlistBaseUri + _playlistFilename, _cts.Token).ConfigureAwait(false);
        var data = await response.Content.ReadAsStringAsync().ConfigureAwait(false);

        string[] lines = data.Split(new string[] { "\n" }, StringSplitOptions.RemoveEmptyEntries);
        if (!lines.Any() || lines[0] != "#EXTM3U")
            throw new Exception("Invalid m3u8 media playlist.");

        for (var i = 1; i < lines.Length; i++)
        {
            var line = lines[i];
            if (line.StartsWith("#EXT-X-TARGETDURATION"))
            {
                ParseTargetDuration(line);
            }
            else if (line.StartsWith("#EXT-X-MEDIA-SEQUENCE"))
            {
                ParseMediaSequence(line);
            }
            else if (!line.StartsWith("#"))
            {
                if (_isNewSegment)
                {
                    // Fire and forget
                    DownloadTsSegment(line);
                }
            }
        }

        // Wait until it's time to reload the m3u8 media playlist again
        await Task.Delay(_targetDuration * 1000, _cts.Token).ConfigureAwait(false);
    }
}

// async void. We never await this method, so we can download multiple segments at once
private async void DownloadTsSegment(string tsUrl)
{
    var response = await _client.GetAsync(tsUrl, _cts.Token).ConfigureAwait(false);
    var data = await response.Content.ReadAsByteArrayAsync().ConfigureAwait(false);

    // Add the downloaded segment data to the AsyncCollection
    await segmentDataQueue.AddAsync(data, _cts.Token).ConfigureAwait(false);
}

private async void RunConsumer()
{
    using (FileStream fs = new FileStream(_filePath, FileMode.Create, FileAccess.Write, FileShare.Read))
    {
        while (!_isCancelled)
        {
            // Wait until new segment data is added to the AsyncCollection and write it to disk
            var data = await segmentDataQueue.TakeAsync(_cts.Token).ConfigureAwait(false);
            await fs.WriteAsync(data, 0, data.Length).ConfigureAwait(false);
        }
    }
}
I don't think you need a producer/consumer queue at all here. However, I do think you should avoid "fire and forget".
You can start them all at the same time, and just process them as they complete.
First, define how to download a single segment:
private async Task<byte[]> DownloadTsSegmentAsync(string tsUrl)
{
    var response = await _client.GetAsync(tsUrl, _cts.Token).ConfigureAwait(false);
    return await response.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
}
Then add the parsing of the playlist which results in a list of segment downloads (which are all in progress already):
private List<Task<byte[]>> DownloadTasks(string data)
{
    var result = new List<Task<byte[]>>();

    string[] lines = data.Split(new string[] { "\n" }, StringSplitOptions.RemoveEmptyEntries);
    if (!lines.Any() || lines[0] != "#EXTM3U")
        throw new Exception("Invalid m3u8 media playlist.");

    ...
    if (_isNewSegment)
    {
        result.Add(DownloadTsSegmentAsync(line));
    }
    ...

    return result;
}
Consume this list one at a time (in order) by writing to a file:
private async Task RunConsumerAsync(List<Task<byte[]>> downloads)
{
    using (FileStream fs = new FileStream(_filePath, FileMode.Create, FileAccess.Write, FileShare.Read))
    {
        foreach (var task in downloads)
        {
            var data = await task.ConfigureAwait(false);
            await fs.WriteAsync(data, 0, data.Length).ConfigureAwait(false);
        }
    }
}
And kick it all off with a producer:
public async Task RunAsync()
{
    // TODO: consider CancellationToken instead of a boolean.
    while (!_isCancelled)
    {
        var response = await _client.GetAsync(_playlistBaseUri + _playlistFilename, _cts.Token).ConfigureAwait(false);
        var data = await response.Content.ReadAsStringAsync().ConfigureAwait(false);

        var tasks = DownloadTasks(data);
        await RunConsumerAsync(tasks);

        await Task.Delay(_targetDuration * 1000, _cts.Token).ConfigureAwait(false);
    }
}
Note that this solution runs all downloads concurrently, which can cause memory pressure. If that is a problem, I recommend restructuring to use TPL Dataflow, which has built-in support for throttling; a sketch follows.
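A sketch of that restructuring, using System.Threading.Tasks.Dataflow; a handy property of TransformBlock is that it emits results in input order even when it processes items in parallel, so the writes stay ordered (the capacity numbers, fs, and segmentUrls are illustrative):

// Throttled, ordered pipeline: bounded download stage -> serial write stage.
var download = new TransformBlock<string, byte[]>(
    url => _client.GetByteArrayAsync(url),
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = 4, // at most 4 concurrent downloads
        BoundedCapacity = 8         // at most 8 segments buffered in memory
    });

var write = new ActionBlock<byte[]>(
    data => fs.WriteAsync(data, 0, data.Length),
    new ExecutionDataflowBlockOptions { BoundedCapacity = 2 });

download.LinkTo(write, new DataflowLinkOptions { PropagateCompletion = true });

foreach (var url in segmentUrls)
    await download.SendAsync(url); // waits when the pipeline is full
download.Complete();
await write.Completion;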
Assign each download a sequence number. Put the results into a Dictionary<int, byte[]>. Each time a download completes, it adds its own result.
It then checks whether there are segments ready to be written to disk:
while (dict.ContainsKey(lowestWrittenSegmentNumber + 1))
{
    WriteSegment(dict[lowestWrittenSegmentNumber + 1]);
    lowestWrittenSegmentNumber++;
}
That way all segments end up on disk, in order and with buffering.
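A hedged sketch of that bookkeeping; the names are illustrative, and a lock keeps the check-and-write step atomic when downloads complete on different threads:

private readonly object _gate = new object();
private readonly Dictionary<int, byte[]> _finished = new Dictionary<int, byte[]>();
private int _lowestWrittenSegmentNumber = 0;

// Called whenever a download completes, in any order.
private void OnSegmentDownloaded(int sequenceNumber, byte[] data)
{
    lock (_gate)
    {
        _finished[sequenceNumber] = data;

        // Flush the contiguous run of completed segments, in order.
        while (_finished.ContainsKey(_lowestWrittenSegmentNumber + 1))
        {
            WriteSegment(_finished[_lowestWrittenSegmentNumber + 1]);
            _finished.Remove(_lowestWrittenSegmentNumber + 1);
            _lowestWrittenSegmentNumber++;
        }
    }
}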
RunConsumer();
RunProducer();
Make sure to use async Task signatures so that you can wait for completion with await Task.WhenAll(RunConsumer(), RunProducer()). But you should not need RunConsumer any longer.
I have many files that I have to download, so I'm trying to use the power of the new async features, as below.
var streamTasks = urls.Select(async url => (await WebRequest.CreateHttp(url).GetResponseAsync()).GetResponseStream()).ToList();
var streams = await Task.WhenAll(streamTasks);
foreach (var stream in streams)
{
    using (var fileStream = new FileStream("blabla", FileMode.Create))
    {
        await stream.CopyToAsync(fileStream);
    }
}
What I am afraid of with this code is that it will cause big memory usage: if there are 1000 files of 2 MB each, will this code load 1000 * 2 MB of streams into memory?
I may be missing something, or I may be totally right. If I'm not missing something, is awaiting every request and consuming its stream one at a time the best approach?
Both options could be problematic. Downloading only one at a time doesn't scale and takes time, while downloading all files at once could be too much of a load (and there's no need to wait for all downloads to finish before you process them).
I prefer to always cap such an operation with a configurable size. A simple way to do so is to use an AsyncLock (which utilizes SemaphoreSlim). A more robust way is to use TPL Dataflow with a MaxDegreeOfParallelism:
var block = new ActionBlock<string>(async url =>
{
    var stream = (await WebRequest.CreateHttp(url).GetResponseAsync()).GetResponseStream();
    using (var fileStream = new FileStream("blabla", FileMode.Create))
    {
        await stream.CopyToAsync(fileStream);
    }
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 100 });
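For comparison, a minimal sketch of the simpler SemaphoreSlim cap mentioned above, with no extra libraries:

// Allow at most 100 downloads in flight at any time.
var throttle = new SemaphoreSlim(100);

var tasks = urls.Select(async url =>
{
    await throttle.WaitAsync();
    try
    {
        var stream = (await WebRequest.CreateHttp(url).GetResponseAsync()).GetResponseStream();
        using (var fileStream = new FileStream("blabla", FileMode.Create))
        {
            await stream.CopyToAsync(fileStream);
        }
    }
    finally
    {
        throttle.Release();
    }
});

await Task.WhenAll(tasks);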
Your code will load the stream into memory whether you use async or not. Doing the work asynchronously handles the I/O part by returning to the caller until your response stream arrives.
The choice you have to make doesn't concern async, but rather how your program should handle reading a big stream of input.
If I were you, I would think about how to split the workload into chunks. You might read the response streams in parallel and save each stream to a different destination (such as a file), releasing each one from memory as you go.
This is my own answer, implementing the chunking idea from Yuval Itzchakov. Please provide feedback on this implementation.
foreach (var chunk in urls.Batch(5))
{
    var streamTasks = chunk
        .Select(async url => await WebRequest.CreateHttp(url).GetResponseAsync())
        .Select(async response => (await response).GetResponseStream());

    var streams = await Task.WhenAll(streamTasks);

    foreach (var stream in streams)
    {
        using (var fileStream = new FileStream("blabla", FileMode.Create))
        {
            await stream.CopyToAsync(fileStream);
        }
    }
}
Batch is an extension method, which is simply as below:
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int chunksize)
{
    while (source.Any())
    {
        yield return source.Take(chunksize);
        source = source.Skip(chunksize);
    }
}
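One caveat with this Batch implementation is that Skip/Take re-enumerates the source for every chunk, which is wasteful for large lists and risky for lazy sequences. A sketch of a single-pass alternative:

// Enumerates the source exactly once, yielding lists of up to chunksize items.
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int chunksize)
{
    var chunk = new List<T>(chunksize);
    foreach (var item in source)
    {
        chunk.Add(item);
        if (chunk.Count == chunksize)
        {
            yield return chunk;
            chunk = new List<T>(chunksize);
        }
    }
    if (chunk.Count > 0)
        yield return chunk;
}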
I'm hoping someone can help me; I have a question about writing into a file using multiple threads/tasks. See my code sample below...
AddFile returns an array of longs holding the blob number, the offset inside the blob, and the size of the data written into the blob.
public long[] AddFile(byte[] data)
{
    long[] values = new long[3];
    values[0] = WorkingIndex = getBlobIndex(data); // blobNumber
    values[1] = blobFS[WorkingIndex].Position;     // offset
    values[2] = data.Length;                       // size

    // blobFS is a collection of FileStreams
    blobFS[WorkingIndex].Write(data, 0, data.Length);
    return values;
}
So let's say I use the AddFile function inside a foreach loop like the one below.
List<Task> tasks = new List<Task>(System.Environment.ProcessorCount);

foreach (var file in Directory.GetFiles(@"C:\Documents"))
{
    var task = Task.Factory.StartNew(() => {
        byte[] data = File.ReadAllBytes(file);
        long[] info = blob.AddFile(data);
        return info;
    });
    task.ContinueWith(/* do some stuff */);
    tasks.Add(task);
}
Task.WaitAll(tasks.ToArray());
return result;
I can imagine that this will totally fail, in the sense that files will overwrite each other inside the blob, because the Write function hasn't finished writing file1 while another task is already writing file2 at the same time.
So what is the best way to solve this problem? Maybe using asynchronous write functions...
Your help would be appreciated!
Kind regards,
Martijn
My suggestion here would be to not run these tasks in parallel. Disk I/O is likely to be the bottleneck for any file-based operation, so running them in parallel will just cause each thread to block waiting on disk access. Ultimately, you may well find that your code runs significantly slower as written than it would run serially.
Is there a particular reason you want these in parallel? Can you handle the disk writes serially and just call ContinueWith() on separate threads instead? This would have the benefit of removing the problem you're posting about, too.
EDIT: an example naive reimplementation of your for loop:
foreach (var file in Directory.GetFiles(@"C:\Documents"))
{
    byte[] data = File.ReadAllBytes(file); // this happens on the main thread

    // processing of each file is handled in multiple threads in parallel to disk IO
    var task = Task.Factory.StartNew(() => {
        long[] info = blob.AddFile(data);
        return info;
    });
    task.ContinueWith(/* do some stuff */);
    tasks.Add(task);
}