When uploading a file to S3 using the TransferUtility class, there is an option to either use FilePath or an input stream. I'm using multi-part uploads.
I'm uploading a variety of things, of which some are files on disk and others are raw streams. I'm currently using the InputStream variety for everything, which works OK, but I'm wondering if I should specialize the method further. For the files on disk, I'm basically using File.OpenRead and passing that stream to the InputStream of the transfer request.
Are there any performance gains, or other reasons, to prefer the FilePath method over the InputStream one when the input is known to be a file?
In short: Is this the same thing
using (var fs = File.OpenRead("some path"))
{
    var uploadMultipartRequest = new TransferUtilityUploadRequest
    {
        BucketName = "defaultBucket",
        Key = "key",
        InputStream = fs,
        PartSize = partSize
    };

    using (var transferUtility = new TransferUtility(s3Client))
    {
        await transferUtility.UploadAsync(uploadMultipartRequest);
    }
}
As:
var uploadMultipartRequest = new TransferUtilityUploadRequest
{
    BucketName = "defaultBucket",
    Key = "key",
    FilePath = "some path",
    PartSize = partSize
};

using (var transferUtility = new TransferUtility(s3Client))
{
    await transferUtility.UploadAsync(uploadMultipartRequest);
}
Or is there any significant difference between the two? I know whether the files are large or not, and could prefer one method or the other based on that.
Edit: I've also done some decompiling of the S3Client, and there does indeed seem to be some difference with regard to the concurrency level of the transfer, as found in MultipartUploadCommand.cs:
private int CalculateConcurrentServiceRequests()
{
int num = !this._fileTransporterRequest.IsSetFilePath() || this._s3Client is AmazonS3EncryptionClient ? 1 : this._config.ConcurrentServiceRequests;
if (this._totalNumberOfParts < num)
num = this._totalNumberOfParts;
return num;
}
From the TransferUtility documentation:
When uploading large files by specifying file paths instead of a
stream, TransferUtility uses multiple threads to upload multiple parts
of a single upload at once. When dealing with large content sizes and
high bandwidth, this can increase throughput significantly.
Which suggests that using a file path will use multipart upload, but using a stream won't.
But when I read through this Upload Method (stream, bucketName, key):
Uploads the contents of the specified stream. For large uploads, the
file will be divided and uploaded in parts using Amazon S3's multipart
API. The parts will be reassembled as one object in Amazon S3.
Which means that MultiPart is used on Streams as well.
Amazon recommends using multipart upload if the file size is larger than 100 MB: http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
Multipart upload allows you to upload a single object as a set of
parts. Each part is a contiguous portion of the object's data. You can
upload these object parts independently and in any order. If
transmission of any part fails, you can retransmit that part without
affecting other parts. After all parts of your object are uploaded,
Amazon S3 assembles these parts and creates the object. In general,
when your object size reaches 100 MB, you should consider using
multipart uploads instead of uploading the object in a single
operation.
Using multipart upload provides the following advantages:
Improved throughput—You can upload parts in parallel to improve throughput.
Quick recovery from any network issues—Smaller part size minimizes the impact of restarting a failed upload due to a network error.
Pause and resume object uploads—You can upload object parts over time. Once you initiate a multipart upload there is no expiry; you must explicitly complete or abort the multipart upload.
Begin an upload before you know the final object size—You can upload an object as you are creating it.
So based on the Amazon S3 documentation there is no difference between using a Stream or a File Path, but it might make a slight performance difference depending on your code and OS.
I think the difference may be that they both use the Multipart Upload API, but using a FilePath allows for concurrent uploads. However:
When you're using a stream for the source of data, the TransferUtility
class does not do concurrent uploads.
https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingTheMPDotNetAPI.html
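To make the comparison concrete, here is a rough sketch of the FilePath variant with the concurrency setting made explicit; ConcurrentServiceRequests lives on TransferUtilityConfig, and the value of 10 is only an example.
// FilePath upload: the SDK can upload several parts of the same object concurrently.
var config = new TransferUtilityConfig
{
    ConcurrentServiceRequests = 10 // example value
};

var uploadMultipartRequest = new TransferUtilityUploadRequest
{
    BucketName = "defaultBucket",
    Key = "key",
    FilePath = "some path",
    PartSize = partSize
};

using (var transferUtility = new TransferUtility(s3Client, config))
{
    await transferUtility.UploadAsync(uploadMultipartRequest);
}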
Related
I have the following request flow, where the customer can request to download a CSV file from the server. The issue is that the blob file is too large, so the customer has to wait a long time before the actual download starts (and thinks something is broken and closes the browser). How can the download be made more efficient using streams?
Current sequence is as below:
Request Sequence:
Client clicks the download button from the browser.
Backend receives the request.
Backend Server Downloads the Blob from the Azure Storage Account.
There is some custom processing that needs to be done.
Once the processing is completed, start sending the response back to the client.
Now the issue is that, while using the DownloadTo(Stream) function of BlobBaseClient, the file is entirely downloaded to memory before I can do anything with it.
How can I download the blob file in chunks, do the processing and start sending it to the customer?
Part of Download Controller:
var contentDisposition = new ContentDispositionHeaderValue("attachment")
{
    FileName = "customer-file.csv",
    CreationDate = DateTimeOffset.UtcNow
};
Response.Headers.Add("Content-Disposition", contentDisposition.ToString());

var result = blobService.DownloadAndProcessContent();
foreach (var line in result)
{
    yield return line;
}
Response.BodyWriter.FlushAsync();
Part of DownloadAndProcessContent Function:
var stream = new MemoryStream();
var blob = container.GetAppendBlobClient(blobName);
blob.DownloadTo(stream);
stream.Position = 0; // rewind before reading, otherwise the reader starts at the end

// Processing is done on the Blob Data
var streamReader = new StreamReader(stream);
while (!streamReader.EndOfStream)
{
    string currentLine = streamReader.ReadLine();
    // process the line.
    string processDataLine = ProcessData(currentLine);
    yield return processDataLine;
}
Did you consider using the built-in OpenRead method, so you can apply the StreamReader directly to the blob stream without needing a MemoryStream in the middle? This gives you a way to process line by line, as you do in the loop.
Also note that it's recommended to take the async/await approach all the way through: with an async controller you avoid blocking on I/O and turning the .NET thread pool into a bottleneck for handling concurrent requests to your API.
This answer doesn't address returning an HTTP response with streaming; that's separate from streaming a downloaded blob.
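A minimal sketch of that suggestion, assuming the Azure.Storage.Blobs v12 client and an async iterator (C# 8+); container, blobName and ProcessData are taken from the question, everything else is illustrative.
// Stream the blob and process it per line, without buffering it all in a MemoryStream first.
private async IAsyncEnumerable<string> DownloadAndProcessContentAsync()
{
    var blob = container.GetAppendBlobClient(blobName);

    // OpenReadAsync returns a read-only Stream over the blob; data is pulled as you read.
    using Stream blobStream = await blob.OpenReadAsync();
    using var reader = new StreamReader(blobStream);

    string currentLine;
    while ((currentLine = await reader.ReadLineAsync()) != null)
    {
        // same per-line processing as in the question
        yield return ProcessData(currentLine);
    }
}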
I am converting an application that currently uses the Windows file system to read and store files.
While reading each line of an input file, it modifies the data, and then writes it out to an output file:
using (var writer = new StreamWriter(@"C:\temp\out.txt", false))
{
    using (var reader = new StreamReader(@"C:\temp\in.txt", Encoding.UTF8))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            //Create modifiedLine from line data
            ...
            writer.WriteLine(modifiedLine);
        }
    }
}
So far, I have been able to write to S3 using a PutObjectRequest, but only with the entire file contents at once:
//Set up stream
var stream = new MemoryStream();
var writer = new StreamWriter(stream);
writer.Write(theEntireModifiedFileContents);
writer.Flush();
stream.Position = 0;
var putRequest = new PutObjectRequest()
{
    BucketName = destinationBucket,
    Key = destinationKey,
    InputStream = stream
};
var response = await s3Client.PutObjectAsync(putRequest);
Given that these are going to be large files, I would prefer to keep the line-by-line approach rather than having to send the entire file contents at once.
Is there any way to maintain a similar behavior to the file system example above with S3?
S3 is an object store and does not support modifications in-place, appending, etc.
However, it is possible to meet your goals if certain criteria are met / understood:
1) Realize that it will take more code to do this than simply modifying your code to buffer the line output and then upload as a single object.
2) You can upload each line as part of the REST API PUT stream. This means that you will need to continuously upload data until complete. Basically you are doing a slow upload of a single S3 object while you process each line.
3) You can use the multi-part API to upload each line as a single part of a multi-part transfer, then combine the parts once complete. Note: parts do not have to be the same size, but every part except the last must be at least 5 MB, and the total number of parts is limited to 10,000. If your number of lines exceeds that limit, you will need to buffer, so go back to method #1 or add buffering to reduce the number of parts.
Unless you are a really motivated developer, realize that method #1 is going to be far easier to implement and test. Methods #2 and #3 will require you to understand how S3 works at a very low level using HTTP PUT requests.
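A rough sketch of method #1, under the assumption that buffering to a temporary file is acceptable: keep the existing line-by-line loop, write the modified lines to a temp file, and hand that file to TransferUtility in one go (the transformation below is just a placeholder).
// Method #1 sketch: transform line by line into a temp file, then upload once.
var tempPath = Path.GetTempFileName();

using (var writer = new StreamWriter(tempPath, false))
using (var reader = new StreamReader(@"C:\temp\in.txt", Encoding.UTF8))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // create modifiedLine from line data (placeholder transformation)
        var modifiedLine = line.ToUpperInvariant();
        writer.WriteLine(modifiedLine);
    }
}

using (var transferUtility = new TransferUtility(s3Client))
{
    // For large files TransferUtility performs a multipart upload under the hood.
    await transferUtility.UploadAsync(tempPath, destinationBucket, destinationKey);
}

File.Delete(tempPath);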
My Azure Function receives large video files and images and stores them in Azure Blob storage. The client API sends data in chunks to my Azure HTTP-trigger function. Do I have to do something at the receiving end to improve performance, like receiving data in chunks?
Bruce, actually the client code is being developed by another team. Right now I am testing it with Postman and getting the files from a multipart HTTP request.
foreach (HttpContent ctnt in provider.Contents)
{
    var dataStream = await ctnt.ReadAsStreamAsync();
    if (ctnt.Headers.ContentDisposition.Name.Trim().Replace("\"", "") == "file")
    {
        byte[] ImageBytes = ReadFully(dataStream);
        var fileName = WebUtility.UrlDecode(ctnt.Headers.ContentDisposition.FileName);
    }
}
ReadFully Function
public static byte[] ReadFully(Stream input)
{
    using (MemoryStream ms = new MemoryStream())
    {
        input.CopyTo(ms);
        return ms.ToArray();
    }
}
As BlobRequestOptions.ParallelOperationThreadCount states:
Gets or sets the number of blocks that may be simultaneously uploaded.
Remarks:
When using the UploadFrom* methods on a blob, the blob will be broken up into blocks. Setting this
value limits the number of outstanding I/O "put block" requests that the library will have in-flight
at a given time. Default is 1 (no parallelism). Setting this value higher may result in
faster blob uploads, depending on the network between the client and the Azure Storage service.
If blobs are small (less than 256 MB), keeping this value equal to 1 is advised.
I assumed that you could explicitly set the ParallelOperationThreadCount for faster blob uploading.
var requestOption = new BlobRequestOptions()
{
    ParallelOperationThreadCount = 5 // the number of blocks that may be simultaneously uploaded
};

//upload a blob from the local file system
await blockBlob.UploadFromFileAsync("{your-file-path}", null, requestOption, null);

//upload a blob from the stream
await blockBlob.UploadFromStreamAsync({stream-for-upload}, null, requestOption, null);
foreach (HttpContent ctnt in provider.Contents)
Based on your code, I assumed that you retrieve the provider instance as follows:
MultipartMemoryStreamProvider provider = await request.Content.ReadAsMultipartAsync();
At this time, you could use the following code for uploading your new blob:
var blobname = ctnt.Headers.ContentDisposition.FileName.Trim('"');
CloudBlockBlob blockBlob = container.GetBlockBlobReference(blobname);
//set the content-type for the current blob
blockBlob.Properties.ContentType = ctnt.Headers.ContentType.MediaType;
await blockBlob.UploadFromStreamAsync(await ctnt.ReadAsStreamAsync(), null, requestOption, null);
I would prefer to use MultipartFormDataStreamProvider, which stores the uploaded files from the client on the file system, instead of MultipartMemoryStreamProvider, which uses server memory to temporarily store the data sent from the client. For the MultipartFormDataStreamProvider approach, you could follow this similar issue. Moreover, I would prefer to use the Azure Storage Client Library with my Azure Function; you could follow Get started with Azure Blob storage using .NET.
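A minimal sketch of that approach, assuming the same container, requestOption and classic CloudBlockBlob client as above; the temp path and cleanup are illustrative.
// Buffer the multipart request to disk instead of memory, then push each file to blob storage.
var provider = new MultipartFormDataStreamProvider(Path.GetTempPath());
await request.Content.ReadAsMultipartAsync(provider);

foreach (MultipartFileData fileData in provider.FileData)
{
    var blobName = fileData.Headers.ContentDisposition.FileName.Trim('"');
    CloudBlockBlob blockBlob = container.GetBlockBlobReference(blobName);
    blockBlob.Properties.ContentType = fileData.Headers.ContentType?.MediaType;

    // UploadFromFileAsync streams from the temp file; requestOption carries ParallelOperationThreadCount.
    await blockBlob.UploadFromFileAsync(fileData.LocalFileName, null, requestOption, null);

    File.Delete(fileData.LocalFileName); // clean up the buffered copy
}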
UPDATE:
Moreover, you could follow this tutorial about breaking a large file into small chunks, uploading them on the client side, and then merging them back on the server side.
I have an object that has to be converted to JSON format and uploaded via a Stream object. This is the AWS S3 upload code:
AWSS3Client.PutObjectAsync(new PutObjectRequest()
{
    InputStream = stream,
    BucketName = name,
    Key = keyName
}).Wait();
Here stream is of type Stream, which is read by the AWSS3Client.
The data that I am uploading is a complex object that has to be in Json format.
I can convert the object to a string using JsonConvert.SerializeObject, or serialize it to a file using JsonSerializer, but since the amount of data is quite significant I would prefer to avoid a temporary string or file and convert the object to a readable Stream right away. My ideal code would look something like this:
AWSS3Client.PutObjectAsync(new PutObjectRequest()
{
    InputStream = MagicJsonConverter.ToStream(myDataObject),
    BucketName = name,
    Key = keyName
}).Wait();
Is there a way to achieve this using Newtonsoft.Json?
You need two things here: first, a producer/consumer stream, e.g. the BlockingStream from this StackOverflow question; and second, a Json.NET serializer writing to that stream, as in this other SO question.
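For the second piece, a minimal sketch of Json.NET writing straight into a writable stream; targetStream stands in for the producer side of the producer/consumer stream (or any other writable sink), and myDataObject is the object from the question.
// Serialize directly into a stream, without materializing the whole JSON string in memory.
using (var streamWriter = new StreamWriter(targetStream, Encoding.UTF8, 8192, leaveOpen: true))
using (var jsonWriter = new JsonTextWriter(streamWriter))
{
    var serializer = JsonSerializer.CreateDefault();
    serializer.Serialize(jsonWriter, myDataObject);
    jsonWriter.Flush();
}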
Another practical option is to wrap the memory stream with a gzip stream (two lines of code).
Usually, JSON compresses very well (a 1 GB file can be compressed down to about 50 MB).
Then, when serving the stream to S3, wrap it with a gzip stream that decompresses it.
I guess the trade-off compared to a temp file is CPU vs. IO (both will probably work well). If you can store it compressed on S3, it will save you space and improve networking efficiency too.
Example code:
var compressed = new MemoryStream();
using (var zip = new GZipStream(compressed, CompressionLevel.Fastest, true))
using (var writer = new StreamWriter(zip))
{
    // write the serialized JSON into the gzip stream
    JsonSerializer.CreateDefault().Serialize(writer, myDataObject);
}
compressed.Seek(0, SeekOrigin.Begin);
// use "compressed" as the InputStream for the S3 upload
// (wrap it in a decompressing GZipStream if S3 should store the plain JSON)
I'm looking for a way to store files in a zip file without a memory stream. My goal is to use as little system memory as possible; direct disk I/O is no problem.
I iterate over a database result set where I have collected some blobs. These are byte-arrays.
What I do is the following (System.IO.Compression):
using (var archive = ZipFile.Open("data.zip", ZipArchiveMode.Update))
{
    foreach (var result in results)
    {
        string fileName = $"{result.Id}.bin";
        using (var fileStream = new FileStream(fileName, FileMode.Create, FileAccess.Write))
        {
            // write the blob data from result.Value
            fileStream.Write(result.Value, 0, result.Value.Length);
            fileStream.Close();
        }
        archive.CreateEntryFromFile(fileName, fileName);
    }
}
There are 2 problems with this implementation.
I have my *.bin files AND the one *.zip (only need the zip)
I don't know why, but this uses a lot of RAM (~100MB for 15x1.5MB bin files)
Is there a way to completely bypass the memory?
UPDATE:
What I'm trying to achieve is to generate one ZIP file that contains single binary files generated from database blobs. This should happen inside an ASP.NET Web API controller. A user can request the data, but instead of sending the whole data in the HTTP response, I generate the ZIP file at the time of the request, save it to a local file server, and send a download link back to the user.
I think your >100 MB is coming from:
the results object, which holds at least 15 x 1.5 MB of blob data
holding the resulting data.zip open inside the foreach scope.
To minimize the RAM usage of the worker process:
create empty zip-file
do {
    (single BLOB query from DB)
    (write blob to new or overwrite file)
    (open zip file for append)
    (append file to zip)
    (close and dispose **both** file handles / objects)
}
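As a concrete sketch of the same idea with System.IO.Compression: write each blob straight into its own zip entry, so no intermediate .bin files are created. Note that ZipArchiveMode.Create streams entries to the output file, whereas Update keeps the whole archive in memory; fetching one blob per iteration (as outlined above) keeps the results memory down as well.
// Stream each blob directly into a zip entry: no temp .bin files, no Update-mode buffering.
using (var zipStream = new FileStream("data.zip", FileMode.Create, FileAccess.Write))
using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Create))
{
    foreach (var result in results) // ideally query one blob at a time, as in the outline above
    {
        var entry = archive.CreateEntry($"{result.Id}.bin");
        using (var entryStream = entry.Open())
        {
            entryStream.Write(result.Value, 0, result.Value.Length);
        }
    }
}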