Write to S3 using PutObjectRequest while still generating stream - c#

I am converting an application that currently uses the Windows file system to read and store files.
While reading each line of an input file, it modifies the data, and then writes it out to an output file:
using (var writer = new StreamWriter(@"C:\temp\out.txt", false))
{
using (var reader = new StreamReader(@"C:\temp\in.txt", Encoding.UTF8))
{
string line;
while ((line = reader.ReadLine()) != null)
{
//Create modifiedLine from line data
...
writer.WriteLine(modifiedLine);
}
}
}
So far, I have been able to write to S3 using a PutObjectRequest, but only with the entire file contents at once:
//Set up stream
var stream = new MemoryStream();
var writer = new StreamWriter(stream);
writer.Write(theEntireModifiedFileContents);
writer.Flush();
stream.Position = 0;
var putRequest = new PutObjectRequest()
{
BucketName = destinationBucket,
Key = destinationKey,
InputStream = stream
};
var response = await s3Client.PutObjectAsync(putRequest);
Given that these are going to be large files, I would prefer to keep the line-by-line approach rather than having to send the entire file contents at once.
Is there any way to maintain a similar behavior to the file system example above with S3?

S3 is an object store and does not support modifications in-place, appending, etc.
However, it is possible to meet your goals if certain criteria are met and understood:
1) Realize that it will take more code to do this than simply modifying your code to buffer the line output and then upload as a single object.
2) You can upload each line as part of the REST API PUT stream. This means that you will need to continuously upload data until complete. Basically you are doing a slow upload of a single S3 object while you process each line.
3) You can use the multipart upload API to upload the output as a series of parts, then complete the upload to combine them. Parts do not all have to be the same size, but every part except the last must be at least 5 MB, and an upload can have at most 10,000 parts. A single line is almost never that large, so you will need to buffer lines until you have at least 5 MB before uploading each part, or go back to method #1.
Unless you are a really motivated developer, realize that method #1 is going to be far easier to implement and test. Methods #2 and #3 will require you to understand how S3 works at a very low level using HTTP PUT requests.
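To make the buffering idea behind method #3 concrete, here is a minimal sketch. It only shows the buffering logic: PartBuffer is a hypothetical helper, not part of the AWS SDK, and the uploadPart delegate is an assumption standing in for whatever actually sends a part. S3 requires every part except the last to be at least 5 MB, so in real use the threshold would be 5 * 1024 * 1024 or larger.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: collects lines and hands them off in chunks that are
// large enough to be valid multipart parts.
public sealed class PartBuffer
{
    private readonly int _minPartBytes;
    private readonly Action<int, byte[]> _uploadPart; // (partNumber, data)
    private readonly MemoryStream _buffer = new MemoryStream();
    private int _nextPartNumber = 1; // S3 part numbers start at 1

    public PartBuffer(int minPartBytes, Action<int, byte[]> uploadPart)
    {
        _minPartBytes = minPartBytes;
        _uploadPart = uploadPart;
    }

    // Appends one line (UTF-8, "\n" terminated) and flushes once the
    // buffered data has reached the minimum part size.
    public void WriteLine(string line)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(line + "\n");
        _buffer.Write(bytes, 0, bytes.Length);
        if (_buffer.Length >= _minPartBytes)
        {
            Flush();
        }
    }

    // Call once after the last line; the final part may be smaller than the minimum.
    public void Complete()
    {
        if (_buffer.Length > 0)
        {
            Flush();
        }
    }

    private void Flush()
    {
        _uploadPart(_nextPartNumber++, _buffer.ToArray());
        _buffer.SetLength(0); // reuse the buffer for the next part
    }
}
```

In a real upload, the delegate would wrap a call such as UploadPartAsync, made between InitiateMultipartUploadAsync and CompleteMultipartUploadAsync, and would need to keep the returned ETag for each part number. Only one part is held in memory at a time, so memory use stays bounded regardless of file size.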


Azure Form Recognizer only analyzes the first file in a stream

I am testing some AI Document analysis stuff, and am currently trying to allow users to Upload Files to a WebApp, which in turn sends them to Azure Form Recognizer and processes the results.
I am however not able to do so in a single Request.
This is how the Files are represented:
[BindProperty] public List<IFormFile> Upload { get; set; }
I can iterate over these and get the expected results, but this makes the operation take quite long. I would like to just send all of the files in one request (as shown below), but it only ever analyzes the first one. I am using Azure.AI.FormRecognizer.DocumentAnalysis, so the client and StartAnalyzeDocument Method is from there.
using (var stream = new MemoryStream())
{
foreach (IFormFile formFile in Upload)
{
formFile.CopyTo(stream);
}
stream.Seek(0, SeekOrigin.Begin);
AnalyzeDocumentOperation operation = client.StartAnalyzeDocument(modelId, stream);
operation.WaitForCompletion();
Console.WriteLine("This many documents were analysed: " + operation.Value.Documents.Count);
result = operation.Value;
};
"result" is what I process later on. I am quite stumped on this, as I would have expected the appended stream to just work. If anyone has a solution or could point me in the right direction, it would be much appreciated.
Form Recognizer does not yet support processing multiple documents in a single analyze operation for prebuilt-invoice and custom models. Furthermore, most file formats cannot just be appended together to concatenate the content.
One way to speed up the analysis of multiple files in a batch is to call the analyze operation in parallel. Here is a sketch.
var results = Upload.AsParallel().Select(formFile =>
{
using (var stream = formFile.OpenReadStream())
{
var operation = client.StartAnalyzeDocument(modelId, stream);
operation.WaitForCompletion();
return operation.Value;
}
}).ToArray();

Download Large File from Azure Blob Storage, process it and send back to the Client

I have the following request flow where the customer can request to download a CSV file from the Server. The issue is that the blob file is too large and the customer has to wait a lot longer before the actual download starts (the customer thinks that there is some issue and closes the browser). How can the download be made more efficient using streams?
Current sequence is as below:
Request Sequence:
Client clicks the download button from the browser.
Backend receives the request.
Backend Server Downloads the Blob from the Azure Storage Account.
There is some custom processing that needs to be done.
Once the processing is completed, start sending the response back to the client.
Now the issue is that while using the DownloadTo(Stream) function of BlobBaseClient, the file is entirely downloaded to memory before I could do anything.
How can I download the blob file in chunks, do the processing and start sending it to the customer?
Part of Download Controller:
var contentDisposition = new ContentDispositionHeaderValue("attachment")
{
FileName = "customer-file.csv",
CreationDate = DateTimeOffset.UtcNow
};
Response.Headers.Add("Content-Disposition", contentDisposition.ToString());
var result = blobService.DownloadAndProcessContent();
foreach (var line in result)
{
yield return line ;
}
Response.BodyWriter.FlushAsync();
Part of DownloadAndProcessContent Function:
var stream = new MemoryStream();
var blob = container.GetAppendBlobClient(blobName);
blob.DownloadTo(stream);
// Processing is done on the Blob Data
var streamReader = new StreamReader(stream);
while (!streamReader.EndOfStream)
{
string currentLine= streamReader.ReadLine();
// process the line.
string processDataLine = ProcessData(currentLine);
yield return processDataLine;
}
Did you consider using the built-in OpenRead method, so you can apply the StreamReader directly to the blob stream without needing a MemoryStream in the middle? This should give you a way to process the blob line by line, as you do in the loop.
Also note that it's recommended to take the async-await approach all the way. Making your controller code async makes it much more scalable, because it no longer blocks on I/O and turns the .NET thread pool into a bottleneck for handling concurrent requests to your API.
This answer doesn't address returning an HTTP response with streaming, that's separate from streaming a downloaded blob.
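As a rough sketch of the OpenRead suggestion, the helper below reads lines lazily from any readable Stream, so nothing is copied into a MemoryStream first. LineProcessor and its parameter names are made up for illustration; the assumption is that you pass in the stream returned by the blob client's OpenRead and your own ProcessData function.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineProcessor
{
    // Lazily reads lines from the source stream and yields each processed line.
    // Data is pulled from the stream only as the caller enumerates the result.
    public static IEnumerable<string> ProcessLines(Stream source, Func<string, string> processLine)
    {
        using (var reader = new StreamReader(source))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return processLine(line);
            }
        }
    }
}
```

With the names from the question, the call would look something like LineProcessor.ProcessLines(blob.OpenRead(), ProcessData). Because the enumeration is lazy, the stream is disposed only once the caller finishes (or abandons) the foreach loop.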

How to remove Header from a CSV FileStream

I have to work on a certain set of limitations for csv file upload :
I will be working with 'large' CSV files (containing header row)
I need to remove the first header row from the CSV file
The file-upload code needs a FileStream (not containing the header) as input! (as I am restricted to do a lot of stream operations on top of this stream (containing headerless csv data))
Wrapper C# Code :
using (var stream = File.OpenRead("C:\~~~\~~~\~~~\SampleFile.csv"))
{
//CSV Header removal snippet - which gives me a new stream containing data without headers.
~
~
~
~
//All my stream handling code of chunking stream into 100mb and then uploading each chunk to azure storage (which is not part of this question)
}
Now I already know - that I can simply remove headers of a csv file using libraries like - CSVHelper (How to exclude header when writing data to CSV)
Using the above way I can create a header-less copy of a file and read the new file back as FileStream - but the problem is that I'm dealing with large files and making a copy of a file just to remove headers will be a space-consuming job.
So for the first time - I am asking a question in StackOverflow - to find a good solution to the above problem. I hope I was able to explain the problem clearly.
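This should work to seek to the end of the first line, provided the header contains only single-byte characters, the file has no byte order mark, and lines end with \r\n.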
using (var stream = File.OpenRead("~~filepath~~"))
using (var reader = new StreamReader(stream))
{
string line = null;
if ((line = reader.ReadLine()) != null)
{
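stream.Position = line.Length + 2;
// The 2 is for the newline (\r\n). line.Length counts characters, not bytes,
// so this is only correct when every header character is a single byte.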
}
//All my stream handling code of chunking stream into 100mb and then uploading each chunk to azure storage (which is not part of this question)
}
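If the header may contain multi-byte characters or the file may use \n line endings, a variant that works on raw bytes avoids the character-count arithmetic. This is a sketch, not the answer's original method: CsvHeader and SkipFirstLine are hypothetical names, and it assumes an encoding in which the byte 0x0A only ever means a line feed (true for ASCII, UTF-8, and Latin-1, not for UTF-16). It also assumes no header field contains a quoted newline.

```csharp
using System;
using System.IO;

public static class CsvHeader
{
    // Advances the stream to the byte just after the first '\n',
    // or to the end of the stream if no line feed is found.
    public static void SkipFirstLine(Stream stream)
    {
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            if (b == '\n')
            {
                return;
            }
        }
    }
}
```

After calling CsvHeader.SkipFirstLine(stream) on the FileStream from File.OpenRead, the existing chunking code can keep reading from the same stream and will see only the data rows. No StreamReader is involved, so there is no read-ahead buffer to leave the stream position somewhere unexpected.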

Amazon S3 Transferutility use FilePath or Stream?

When uploading a file to S3 using the TransferUtility class, there is an option to either use FilePath or an input stream. I'm using multi-part uploads.
I'm uploading a variety of things, of which some are files on disk and others are raw streams. I'm currently using the InputStream variety for everything, which works OK, but I'm wondering if I should specialize the method further. For the files on disk, I'm basically using File.OpenRead and passing that stream to the InputStream of the transfer request.
Are there any performance gains, or other reasons, to prefer the FilePath method over the InputStream one when the input is known to be a file?
In short: Is this the same thing
using (var fs = File.OpenRead("some path"))
{
var uploadMultipartRequest = new TransferUtilityUploadRequest
{
BucketName = "defaultBucket",
Key = "key",
InputStream = fs,
PartSize = partSize
};
using (var transferUtility = new TransferUtility(s3Client))
{
await transferUtility.UploadAsync(uploadMultipartRequest);
}
}
As:
var uploadMultipartRequest = new TransferUtilityUploadRequest
{
BucketName = "defaultBucket",
Key = "key",
FilePath = "some path",
PartSize = partSize
};
using (var transferUtility = new TransferUtility(s3Client))
{
await transferUtility.UploadAsync(uploadMultipartRequest);
}
Or is there any significant difference between the two? I know whether the files are large or not, and could prefer one method or the other based on that.
Edit: I've also done some decompiling of the S3Client, and there does indeed seem to be some difference in regards to the concurrency level of the transfer, as found in MultipartUploadCommand.cs
private int CalculateConcurrentServiceRequests()
{
int num = !this._fileTransporterRequest.IsSetFilePath() || this._s3Client is AmazonS3EncryptionClient ? 1 : this._config.ConcurrentServiceRequests;
if (this._totalNumberOfParts < num)
num = this._totalNumberOfParts;
return num;
}
From the TransferUtility documentation:
When uploading large files by specifying file paths instead of a
stream, TransferUtility uses multiple threads to upload multiple parts
of a single upload at once. When dealing with large content sizes and
high bandwidth, this can increase throughput significantly.
This suggests that using a file path will use the multipart upload, but using a stream won't.
But when I read through this Upload Method (stream, bucketName, key):
Uploads the contents of the specified stream. For large uploads, the
file will be divided and uploaded in parts using Amazon S3's multipart
API. The parts will be reassembled as one object in Amazon S3.
Which means that MultiPart is used on Streams as well.
Amazon recommends using multipart upload if the file size is larger than 100 MB: http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
Multipart upload allows you to upload a single object as a set of
parts. Each part is a contiguous portion of the object's data. You can
upload these object parts independently and in any order. If
transmission of any part fails, you can retransmit that part without
affecting other parts. After all parts of your object are uploaded,
Amazon S3 assembles these parts and creates the object. In general,
when your object size reaches 100 MB, you should consider using
multipart uploads instead of uploading the object in a single
operation.
Using multipart upload provides the following advantages:
Improved throughput: You can upload parts in parallel to improve throughput.
Quick recovery from any network issues: Smaller part size minimizes the impact of restarting a failed upload due to a network error.
Pause and resume object uploads: You can upload object parts over time. Once you initiate a multipart upload there is no expiry; you must explicitly complete or abort the multipart upload.
Begin an upload before you know the final object size: You can upload an object as you are creating it.
So based on the Amazon S3 documentation there is no difference between using a stream or a file path, but it might make a slight performance difference depending on your code and OS.
I think the difference may be that both use the multipart upload API, but using a FilePath allows for concurrent uploads, whereas:
When you're using a stream for the source of data, the TransferUtility
class does not do concurrent uploads.
https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingTheMPDotNetAPI.html

Overriding WebHostBufferPolicySelector for Non-Buffered File Upload

In an attempt to create a non-buffered file upload I have extended System.Web.Http.WebHost.WebHostBufferPolicySelector, overriding function UseBufferedInputStream() as described in this article: http://www.strathweb.com/2012/09/dealing-with-large-files-in-asp-net-web-api/. When a file is POSTed to my controller, I can see in trace output that the overridden function UseBufferedInputStream() is definitely returning FALSE as expected. However, using diagnostic tools I can see the memory growing as the file is being uploaded.
The heavy memory usage appears to be occurring in my custom MediaTypeFormatter (something like the FileMediaFormatter here: http://lonetechie.com/). It is in this formatter that I would like to incrementally write the incoming file to disk, but I also need to parse json and do some other operations with the Content-Type:multipart/form-data upload. Therefore I'm using the HttpContent method ReadAsMultipartAsync(), which appears to be the source of the memory growth. I have placed trace output before/after the "await", and it appears that while the task is blocking the memory usage is increasing fairly rapidly.
Once I find the file content in the parts returned by ReadAsMultipartAsync(), I am using Stream.CopyTo() in order to write the file contents to disk. This writes to disk as expected, but unfortunately the source file is already in memory by this point.
Does anyone have any thoughts about what might be going wrong? It seems that ReadAsMultipartAsync() is buffering the whole post data; if that is true, why do we require var fileStream = await fileContent.ReadAsStreamAsync() to get the file contents? Is there another way to accomplish the splitting of the parts without reading them into memory? The code in my MediaTypeFormatter looks something like this:
// save the stream so we can seek/read again later
Stream stream = await content.ReadAsStreamAsync();
var parts = await content.ReadAsMultipartAsync(); // <- memory usage grows rapidly
if (!content.IsMimeMultipartContent())
{
throw new HttpResponseException(HttpStatusCode.UnsupportedMediaType);
}
//
// pull data out of parts.Contents, process json, etc.
//
// find the file data in the multipart contents
var fileContent = parts.Contents.FirstOrDefault(
x => x.Headers.ContentDisposition.DispositionType.ToLower().Trim() == "form-data" &&
x.Headers.ContentDisposition.Name.ToLower().Trim() == "\"" + DATA_CONTENT_DISPOSITION_NAME_FILE_CONTENTS + "\"");
// write the file to disk
using (var fileStream = await fileContent.ReadAsStreamAsync())
{
using (FileStream toDisk = File.OpenWrite("myUploadedFile.bin"))
{
((Stream)fileStream).CopyTo(toDisk);
}
}
WebHostBufferPolicySelector only specifies if the underlying request is bufferless. This is what Web API will do under the hood:
IHostBufferPolicySelector policySelector = _bufferPolicySelector.Value;
bool isInputBuffered = policySelector == null ? true : policySelector.UseBufferedInputStream(httpContextBase);
Stream inputStream = isInputBuffered
? requestBase.InputStream
: httpContextBase.ApplicationInstance.Request.GetBufferlessInputStream();
So if your implementation returns false, then the request is bufferless.
However, ReadAsMultipartAsync() loads everything into MemoryStream - because if you don't specify a provider, it defaults to MultipartMemoryStreamProvider.
To get the files to save automatically to disk as every part is processed use MultipartFormDataStreamProvider (if you deal with files and form data) or MultipartFileStreamProvider (if you deal with just files).
There is an example on asp.net or here. In these examples everything happens in controllers, but there is no reason why you couldn't use it elsewhere, e.g. in a formatter.
Another option, if you really want to play with streams, is to implement a custom class inheriting from MultipartStreamProvider that would fire whatever processing you want as soon as it grabs part of the stream. The usage would be similar to the aforementioned providers: you'd need to pass it to the ReadAsMultipartAsync(provider) method.
Finally - if you are feeling suicidal - since the underlying request stream is bufferless theoretically you could use something like this in your controller or formatter:
Stream stream = HttpContext.Current.Request.GetBufferlessInputStream();
byte[] b = new byte[32*1024];
int n;
while ((n = stream.Read(b, 0, b.Length)) > 0)
{
//do stuff with stream bit
}
But of course that's very, for the lack of better word, "ghetto."
