In an attempt to create a non-buffered file upload, I have extended System.Web.Http.WebHost.WebHostBufferPolicySelector, overriding the UseBufferedInputStream() method as described in this article: http://www.strathweb.com/2012/09/dealing-with-large-files-in-asp-net-web-api/. When a file is POSTed to my controller, trace output confirms that the overridden UseBufferedInputStream() is definitely returning false as expected. However, diagnostic tools show memory growing as the file is being uploaded.
The heavy memory usage appears to be occurring in my custom MediaTypeFormatter (something like the FileMediaFormatter here: http://lonetechie.com/). In this formatter I would like to incrementally write the incoming file to disk, but I also need to parse JSON and do some other operations on the Content-Type: multipart/form-data upload. Therefore I'm using the HttpContent method ReadAsMultipartAsync(), which appears to be the source of the memory growth. I have placed trace output before/after the "await", and while the task is running the memory usage increases fairly rapidly.
Once I find the file content in the parts returned by ReadAsMultipartAsync(), I use Stream.CopyTo() to write the file contents to disk. This writes to disk as expected, but unfortunately the source file is already in memory by that point.
Does anyone have any thoughts about what might be going wrong? It seems that ReadAsMultipartAsync() is buffering the whole post data; if that is true, why do we need var fileStream = await fileContent.ReadAsStreamAsync() to get the file contents? Is there another way to split out the parts without reading them into memory? The code in my MediaTypeFormatter looks something like this:
// check the content type before reading anything
if (!content.IsMimeMultipartContent())
{
    throw new HttpResponseException(HttpStatusCode.UnsupportedMediaType);
}
// save the stream so we can seek/read again later
Stream stream = await content.ReadAsStreamAsync();
var parts = await content.ReadAsMultipartAsync(); // <- memory usage grows rapidly
//
// pull data out of parts.Contents, process json, etc.
//
// find the file data in the multipart contents
var fileContent = parts.Contents.FirstOrDefault(
    x => x.Headers.ContentDisposition.DispositionType.ToLower().Trim() == "form-data" &&
         x.Headers.ContentDisposition.Name.ToLower().Trim() == "\"" + DATA_CONTENT_DISPOSITION_NAME_FILE_CONTENTS + "\"");
// write the file to disk
using (var fileStream = await fileContent.ReadAsStreamAsync())
using (var toDisk = File.OpenWrite("myUploadedFile.bin"))
{
    fileStream.CopyTo(toDisk);
}
WebHostBufferPolicySelector only specifies if the underlying request is bufferless. This is what Web API will do under the hood:
IHostBufferPolicySelector policySelector = _bufferPolicySelector.Value;
bool isInputBuffered = policySelector == null ? true : policySelector.UseBufferedInputStream(httpContextBase);
Stream inputStream = isInputBuffered
? requestBase.InputStream
: httpContextBase.ApplicationInstance.Request.GetBufferlessInputStream();
So if your implementation returns false, then the request is bufferless.
However, ReadAsMultipartAsync() loads everything into a MemoryStream - because if you don't specify a provider, it defaults to MultipartMemoryStreamProvider.
To get the files saved to disk automatically as each part is processed, use MultipartFormDataStreamProvider (if you deal with files and form data) or MultipartFileStreamProvider (if you deal with just files).
There is an example on asp.net or here. In these examples everything happens in controllers, but there is no reason why you couldn't use the same approach in, for example, a formatter.
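A minimal sketch of the provider-based approach, adapted to a controller action (the App_Data root folder and action shape are my assumptions, not from the original post):

```csharp
public async Task<HttpResponseMessage> PostFile()
{
    if (!Request.Content.IsMimeMultipartContent())
    {
        throw new HttpResponseException(HttpStatusCode.UnsupportedMediaType);
    }

    // parts are streamed to files under this folder as they arrive,
    // instead of being buffered in memory
    string root = HttpContext.Current.Server.MapPath("~/App_Data");
    var provider = new MultipartFormDataStreamProvider(root);

    await Request.Content.ReadAsMultipartAsync(provider);

    // form fields end up in provider.FormData;
    // saved files are described by provider.FileData
    foreach (var file in provider.FileData)
    {
        Trace.WriteLine(file.LocalFileName);
    }

    return Request.CreateResponse(HttpStatusCode.OK);
}
```

The same ReadAsMultipartAsync(provider) call works inside a formatter as well, since all it needs is the HttpContent.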
Another option, if you really want to play with streams, is to implement a custom class inheriting from MultipartStreamProvider that fires whatever processing you want as soon as it grabs part of the stream. The usage would be similar to the aforementioned providers - you'd pass it to the ReadAsMultipartAsync(provider) method.
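A rough sketch of that idea (the class name and the file-per-part naming scheme are mine, purely for illustration):

```csharp
public class WriteThroughStreamProvider : MultipartStreamProvider
{
    private int _index;

    // called once per MIME part; whatever stream is returned here
    // receives that part's bytes as they are read from the request
    public override Stream GetStream(HttpContent parent, HttpContentHeaders headers)
    {
        // route each part straight to its own file on disk
        return File.Create("part_" + _index++ + ".dat");
    }
}

// usage:
// var provider = new WriteThroughStreamProvider();
// await content.ReadAsMultipartAsync(provider);
```

You could inspect the headers argument (e.g. ContentDisposition) to decide per part whether to write to disk or to some other processing stream.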
Finally - if you are feeling adventurous - since the underlying request stream is bufferless, theoretically you could use something like this in your controller or formatter:
Stream stream = HttpContext.Current.Request.GetBufferlessInputStream();
byte[] b = new byte[32 * 1024];
int n;
while ((n = stream.Read(b, 0, b.Length)) > 0)
{
//do stuff with stream bit
}
But of course that's, for lack of a better word, crude.
Related
I want to add a button that will download a dynamically generated CSV file.
I think I need to use FileStreamResult (or possibly FileContentResult) but I have been unable to find an example that shows how to do this.
I've seen examples that create a physical file, and then download that. But my ideal solution would write directly to the response stream, which would be far more efficient than creating a file or first building the string in memory.
Has anyone seen an example of dynamically generating a file for download in Razor Pages (not MVC)?
So here's what I came up with.
Markup:
<a class="btn btn-success" asp-page-handler="DownloadCsv">
Download CSV
</a>
Handler:
public IActionResult OnGetDownloadCsv()
{
using MemoryStream memoryStream = new MemoryStream();
using CsvWriter writer = new CsvWriter(memoryStream);
// Write to memoryStream using SoftCircuits.CsvParser
writer.Flush(); // This is important!
// ToArray() returns only the bytes actually written;
// GetBuffer() can include unused buffer capacity at the end
FileContentResult result = new FileContentResult(memoryStream.ToArray(), "text/csv")
{
    FileDownloadName = "Filename.csv"
};
return result;
}
This code works but I wish it used memory more efficiently. As is, it writes the entire file contents to memory, and then copies that memory to the result. So a large file would exist twice in memory before anything is written to the response stream. I was curious about FileStreamResult but wasn't able to get that working.
If someone can improve on this, I'd gladly mark your answer as the accepted one.
UPDATE:
So I realized I can adapt the code above to use FileStreamResult by replacing the last block with this:
memoryStream.Seek(0, SeekOrigin.Begin);
FileStreamResult result = new FileStreamResult(memoryStream, "text/csv")
{
FileDownloadName = "Filename.csv"
};
return result;
This works almost the same except that, instead of copying all the bytes out of the stream, it just passes the memory stream object itself. This is an improvement, as the bytes are not needlessly copied.
However, the downside is that I have to remove my two using statements or else I'll get an exception:
ObjectDisposedException: Cannot access a closed Stream.
Looks like it's a trade-off between copying the bytes an extra time and not cleaning up my streams and CSV writer.
In the end, I'm able to prevent the CSV writer from closing the stream when it's disposed, and since MemoryStream does not hold unmanaged resources, there should be no harm in leaving it open.
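For anyone after the shape of that final version: here's a sketch using a plain StreamWriter with leaveOpen (whether your CSV library exposes an equivalent option depends on the library; the CSV content here is obviously a placeholder):

```csharp
public IActionResult OnGetDownloadCsv()
{
    var memoryStream = new MemoryStream();

    // leaveOpen: true keeps memoryStream usable after the writer is disposed
    using (var writer = new StreamWriter(memoryStream, Encoding.UTF8, 1024, leaveOpen: true))
    {
        // write CSV rows here
        writer.Write("Id,Name\n1,Example\n");
    }

    memoryStream.Seek(0, SeekOrigin.Begin);

    // FileStreamResult disposes the stream for us after
    // it has been written to the response
    return new FileStreamResult(memoryStream, "text/csv")
    {
        FileDownloadName = "Filename.csv"
    };
}
```

The MemoryStream is intentionally not wrapped in a using block: FileStreamResult takes ownership and disposes it once the response has been written.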
I am converting an application that currently uses the Windows file system to read and store files.
While reading each line of an input file, it modifies the data, and then writes it out to an output file:
string line;
using (var writer = new StreamWriter(@"C:\temp\out.txt", false))
using (var reader = new StreamReader(@"C:\temp\in.txt", Encoding.UTF8))
{
    while ((line = reader.ReadLine()) != null)
    {
        // Create modifiedLine from line data
        ...
        writer.WriteLine(modifiedLine);
    }
}
So far, I have been able to write to S3 using a PutObjectRequest, but only with the entire file contents at once:
//Set up stream
var stream = new MemoryStream();
var writer = new StreamWriter(stream);
writer.Write(theEntireModifiedFileContents);
writer.Flush();
stream.Position = 0;
var putRequest = new PutObjectRequest()
{
BucketName = destinationBucket,
Key = destinationKey,
InputStream = stream
};
var response = await s3Client.PutObjectAsync(putRequest);
Given that these are going to be large files, I would prefer to keep the line-by-line approach rather than having to send the entire file contents at once.
Is there any way to maintain a similar behavior to the file system example above with S3?
S3 is an object store and does not support modifications in-place, appending, etc.
However, it is possible to meet your goals if certain criteria are met/understood:
1) Realize that it will take more code to do this than simply modifying your code to buffer the line output and then upload it as a single object.
2) You can upload each line as part of the REST API PUT stream. This means that you will need to continuously upload data until complete; basically you are doing a slow upload of a single S3 object while you process each line.
3) You can use the multipart API to upload batches of lines as parts of a multipart transfer, then combine the parts once complete. Note: every part except the last must be at least 5 MB, and a multipart upload is limited to 10,000 parts, so you will need to buffer lines into part-sized chunks rather than uploading one line per part.
Unless you are a really motivated developer, realize that method #1 is going to be far easier to implement and test. Methods #2 and #3 will require you to understand how S3 works at a very low level using HTTP PUT requests.
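A sketch of method #3 with buffering, using the low-level multipart API in the AWS SDK for .NET (bucket, key, input path, and Modify() are placeholders; error handling and AbortMultipartUpload on failure are omitted):

```csharp
private static async Task UploadLinesAsync(IAmazonS3 s3Client, string bucket, string key)
{
    const int MinPartSize = 5 * 1024 * 1024; // every part except the last must be >= 5 MB

    var init = await s3Client.InitiateMultipartUploadAsync(
        new InitiateMultipartUploadRequest { BucketName = bucket, Key = key });

    var partETags = new List<PartETag>();
    int partNumber = 1;
    var buffer = new MemoryStream();
    var bufferWriter = new StreamWriter(buffer);

    async Task FlushPartAsync()
    {
        buffer.Position = 0;
        var part = await s3Client.UploadPartAsync(new UploadPartRequest
        {
            BucketName = bucket,
            Key = key,
            UploadId = init.UploadId,
            PartNumber = partNumber,
            InputStream = buffer
        });
        partETags.Add(new PartETag(partNumber++, part.ETag));
        buffer.SetLength(0); // reuse the buffer for the next part
    }

    using (var reader = new StreamReader(@"C:\temp\in.txt", Encoding.UTF8))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            bufferWriter.WriteLine(Modify(line)); // Modify() = your per-line transform
            bufferWriter.Flush();
            if (buffer.Length >= MinPartSize)
                await FlushPartAsync();
        }
    }

    if (buffer.Length > 0)
        await FlushPartAsync(); // the last part may be smaller than 5 MB

    await s3Client.CompleteMultipartUploadAsync(new CompleteMultipartUploadRequest
    {
        BucketName = bucket,
        Key = key,
        UploadId = init.UploadId,
        PartETags = partETags
    });
}
```

This keeps memory bounded at roughly one part size while still processing the input line by line.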
When uploading a file to S3 using the TransportUtility class, there is an option to either use FilePath or an input stream. I'm using multi-part uploads.
I'm uploading a variety of things, of which some are files on disk and others are raw streams. I'm currently using the InputStream variety for everything, which works OK, but I'm wondering if I should specialize the method further. For the files on disk, I'm basically using File.OpenRead and passing that stream to the InputStream of the transfer request.
Are there any performance gains, or other benefits, to preferring the FilePath method over the InputStream one when the input is known to be a file?
In short: Is this the same thing
using (var fs = File.OpenRead("some path"))
{
var uploadMultipartRequest = new TransferUtilityUploadRequest
{
BucketName = "defaultBucket",
Key = "key",
InputStream = fs,
PartSize = partSize
};
using (var transferUtility = new TransferUtility(s3Client))
{
await transferUtility.UploadAsync(uploadMultipartRequest);
}
}
As:
var uploadMultipartRequest = new TransferUtilityUploadRequest
{
BucketName = "defaultBucket",
Key = "key",
FilePath = "some path",
PartSize = partSize
};
using (var transferUtility = new TransferUtility(s3Client))
{
await transferUtility.UploadAsync(uploadMultipartRequest);
}
Or is there any significant difference between the two? I know whether the input is a file and how large it is, and could prefer one method over the other based on that.
Edit: I've also done some decompiling of the S3Client, and there does indeed seem to be some difference with regard to the concurrency level of the transfer, as found in MultipartUploadCommand.cs:
private int CalculateConcurrentServiceRequests()
{
int num = !this._fileTransporterRequest.IsSetFilePath() || this._s3Client is AmazonS3EncryptionClient ? 1 : this._config.ConcurrentServiceRequests;
if (this._totalNumberOfParts < num)
num = this._totalNumberOfParts;
return num;
}
From the TransferUtility documentation:
When uploading large files by specifying file paths instead of a
stream, TransferUtility uses multiple threads to upload multiple parts
of a single upload at once. When dealing with large content sizes and
high bandwidth, this can increase throughput significantly.
Which suggests that using file paths gets the multithreaded multipart upload, but using a stream won't.
But when I read through this Upload Method (stream, bucketName, key):
Uploads the contents of the specified stream. For large uploads, the
file will be divided and uploaded in parts using Amazon S3's multipart
API. The parts will be reassembled as one object in Amazon S3.
Which means that multipart is used for streams as well.
Amazon recommends using multipart upload if the file size is larger than 100 MB: http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
Multipart upload allows you to upload a single object as a set of
parts. Each part is a contiguous portion of the object's data. You can
upload these object parts independently and in any order. If
transmission of any part fails, you can retransmit that part without
affecting other parts. After all parts of your object are uploaded,
Amazon S3 assembles these parts and creates the object. In general,
when your object size reaches 100 MB, you should consider using
multipart uploads instead of uploading the object in a single
operation.
Using multipart upload provides the following advantages:
- Improved throughput: You can upload parts in parallel to improve throughput.
- Quick recovery from any network issues: Smaller part size minimizes the impact of restarting a failed upload due to a network error.
- Pause and resume object uploads: You can upload object parts over time. Once you initiate a multipart upload there is no expiry; you must explicitly complete or abort the multipart upload.
- Begin an upload before you know the final object size: You can upload an object as you are creating it.
So as far as Amazon S3 is concerned, there is no difference between using a stream or a file path, but it might make a slight performance difference depending on your code and OS.
I think the difference is that both use the multipart upload API, but using a FilePath allows for concurrent uploads. However:
When you're using a stream for the source of data, the TransferUtility
class does not do concurrent uploads.
https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingTheMPDotNetAPI.html
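Given that, a reasonable rule of thumb is to branch on whether the source is a file on disk (sketch; the bucket/key names and flags are placeholders):

```csharp
var request = new TransferUtilityUploadRequest
{
    BucketName = "defaultBucket",
    Key = "key",
    PartSize = partSize
};

if (sourceIsFileOnDisk)
{
    // letting TransferUtility open the file itself enables concurrent
    // part uploads (governed by ConcurrentServiceRequests)
    request.FilePath = sourcePath;
}
else
{
    // streams are uploaded with one part in flight at a time
    request.InputStream = sourceStream;
}

using (var transferUtility = new TransferUtility(s3Client))
{
    await transferUtility.UploadAsync(request);
}
```

For large files on disk and plenty of bandwidth, the FilePath branch should win; for small files or non-seekable sources, the difference is negligible.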
I'm compressing a log file as data is written to it, something like:
using (var fs = new FileStream("Test.gz", FileMode.Create, FileAccess.Write, FileShare.None))
{
using (var compress = new GZipStream(fs, CompressionMode.Compress))
{
for (int i = 0; i < 1000000; i++)
{
// Clearly this isn't what is happening in production, just
// a simple example
byte[] message = RandomBytes();
compress.Write(message, 0, message.Length);
// Flush to disk (in production we will do this every x lines,
// or x milliseconds, whichever comes first)
if (i % 20 == 0)
{
compress.Flush();
}
}
}
}
What I want to ensure is that if the process crashes or is killed, the archive is still valid and readable. I had hoped that anything since the last flush would be safe, but instead I am just ending up with a corrupt archive.
Is there any way to ensure I end up with a readable archive after each flush?
Note: it isn't essential that we use GZipStream, if something else will give us the desired result.
An option is to let Windows handle the compression. Just enable compression on the folder where you're storing your log files. There are some performance considerations to be aware of when copying the compressed files, and I don't know how well NTFS compression performs in comparison to GZipStream or other compression options. You'll probably want to compare compression ratios and CPU load.
There's also the option of opening a compressed file, if you don't want to enable compression on the entire folder. I haven't tried this, but you might want to look into it: http://social.msdn.microsoft.com/forums/en-US/netfxbcl/thread/1b63b4a4-b197-4286-8f3f-af2498e3afe5
Good news: GZip is a streaming format. Therefore corruption at the end of the stream cannot affect the beginning which was already written.
So even if your streaming writes are interrupted at an arbitrary point, most of the stream is still good. You can write yourself a little tool that reads from it and just stops at the first exception it sees.
If you want an error-free solution, I'd recommend splitting the log into one file every x seconds (maybe x = 1 or 10?). Write into a file with the extension ".gz.tmp" and rename it to ".gz" after the file has been completely written and closed.
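A sketch of that rotation approach (file names and the rollover interval are arbitrary; the message source is a placeholder):

```csharp
void WriteRotatedLog(IEnumerable<byte[]> messages, TimeSpan rollInterval)
{
    int index = 0;
    var sw = Stopwatch.StartNew();
    FileStream fs = null;
    GZipStream gz = null;

    foreach (var message in messages)
    {
        if (gz == null || sw.Elapsed >= rollInterval)
        {
            // close the current archive cleanly, then atomically publish it;
            // only fully-written .gz files are ever visible to readers
            if (gz != null)
            {
                gz.Dispose();
                File.Move($"log_{index}.gz.tmp", $"log_{index}.gz");
                index++;
            }
            fs = new FileStream($"log_{index}.gz.tmp", FileMode.Create,
                                FileAccess.Write, FileShare.None);
            gz = new GZipStream(fs, CompressionMode.Compress);
            sw.Restart();
        }
        gz.Write(message, 0, message.Length);
    }

    // dispose and rename the last file the same way on shutdown
    if (gz != null)
    {
        gz.Dispose();
        File.Move($"log_{index}.gz.tmp", $"log_{index}.gz");
    }
}
```

A crash only ever loses the contents of the current ".gz.tmp" file; everything already renamed to ".gz" is a complete, valid archive.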
Yes, but it's more involved than just flushing. Take a look at gzlog.h and gzlog.c in the zlib distribution. It does exactly what you want, efficiently adding short log entries to a gzip file, and always leaving a valid gzip file behind. It also has protection against crashes or shutdowns during the process, still leaving a valid gzip file behind and not losing any log entries.
I recommend not using GZipStream. It is buggy and does not provide the necessary functionality. Use DotNetZip instead as your interface to zlib.
I'm trying to fix a bug where the following code results in a 0 byte file on S3, and no error message.
This code feeds in a Stream (from the poorly-named FileUpload4) which contains an image and the desired image path (from a database wrapper object) to Amazon's S3, but the file itself is never uploaded.
CloudUtils.UploadAssetToCloud(FileUpload4.FileContent, ((ImageContent)auxSRC.Content).PhysicalLocationUrl);
ContentWrapper.SaveOrUpdateAuxiliarySalesRoleContent(auxSRC);
The second line simply saves the database object which stores information about the (supposedly) uploaded picture. This save is going through, demonstrating that the above line runs without error.
The first line above calls in to this method, after retrieving an appropriate bucketname:
public static bool UploadAssetToCloud(Stream asset, string path, string bucketName, AssetSecurity security = AssetSecurity.PublicRead)
{
TransferUtility txferUtil;
S3CannedACL ACL = GetS3ACL(security);
using (txferUtil = new Amazon.S3.Transfer.TransferUtility(AWSKEY, AWSSECRETKEY))
{
TransferUtilityUploadRequest request = new TransferUtilityUploadRequest()
.WithBucketName(bucketName)
.WithTimeout(TWO_MINUTES)
.WithCannedACL(ACL)
.WithKey(path);
request.InputStream = asset;
txferUtil.Upload(request);
}
return true;
}
I have made sure that the stream is a good stream - I can save it anywhere else I have permissions for, the bucket exists and the path is fine (the file is created at the destination on S3, it just doesn't get populated with the content of the stream). I'm close to my wits end, here - what am I missing?
EDIT: One of my coworkers pointed out that it would be better to use the FileUpload's PostedFile property. I'm now pulling the stream off of that instead. It still isn't working.
Is the stream positioned correctly? Check asset.Position to make sure the position is set to the beginning of the stream.
asset.Seek(0, SeekOrigin.Begin);
Edit
OK, more guesses (I'm down to guesses, though):
(all of this is assuming that you can still read from your incoming stream just fine "by hand")
Just for testing, try one of the simpler Upload methods on the TransferUtility -- maybe one that just takes a file path string. If that works, then maybe there are additional properties to set on the UploadRequest object.
If you hook the UploadProgressEvent on the UploadRequest object, do you get any additional clues to what's going wrong?
I noticed that the UploadRequest's api includes both an InputStream property, and a WithInputStream fluent API. Maybe there's a bug with setting InputStream? Maybe try using the .WithInputStream API instead
Which stream are you using? Does it support seeking (repositioning)?
It may be that the upload method first calculates the MD5 for the given stream and then uploads it. If your stream doesn't support seeking, then by the time the MD5 calculation finishes, the stream is at end-of-file and can't be repositioned back to the beginning to upload the object.