HttpContent.CopyToAsync for large files - C#

Hope you're all doing well!
Let's say I'm downloading a file from an HTTP API endpoint and the file is quite large. The API returns application/octet-stream, i.e. I have an HttpContent in my download method.
When I use
using (FileStream fs = new FileStream(somepath, FileMode.Create))
{
    // this operation takes a few seconds to write to disk
    await httpContent.CopyToAsync(fs);
}
As soon as the using statement executes, I see the file created on the file system at the given path. It is 0 KB at that point, but once CopyToAsync() finishes, the file size is as expected.
The problem is that another service is constantly polling the folder where these files are saved, and it often picks up 0 KB files, or sometimes even partial files (the latter seems to happen when I use WriteAsync(byte[])).
Is there a way to not save the file on the file system until it's ready to be saved?
One weird work around I could think of was:
using (var memStream = new MemoryStream())
{
    await httpContent.CopyToAsync(memStream);
    using (FileStream file = new FileStream(destFilePath, FileMode.Create, FileAccess.Write))
    {
        memStream.Position = 0;
        await memStream.CopyToAsync(file);
    }
}
I copy the HttpContent over to a MemoryStream and then copy the MemoryStream over to the FileStream. This seems to work, but it comes at the cost of memory consumption.
Another workaround I could think of was to first save the files to a secondary location and, when the operation is complete, move the file over to the primary folder.
Thank you in advance,
Johny

I ended up saving the file to a temporary folder and, when the operation is complete, moving the downloaded file to my primary folder. Since Move is atomic, I do not have this issue anymore.
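For reference, a rough sketch of that approach (tempDir, destDir and fileName are placeholder values, not from my actual code): the file only appears in the watched folder once it is complete, because File.Move is an atomic rename when both paths are on the same volume.
string tempPath = Path.Combine(tempDir, fileName);
string finalPath = Path.Combine(destDir, fileName);

using (var fs = new FileStream(tempPath, FileMode.Create, FileAccess.Write))
{
    // stream the response body straight to the temp file
    await httpContent.CopyToAsync(fs);
}

// atomic rename; the polling service never sees a 0 KB or partial file
File.Move(tempPath, finalPath);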
Thank you to those who commented!

Related

Unzip a LARGE zip file in Azure File Storage w/o "Out of Memory" exception

Here's what I'm dealing with...
Some process (out of our control) will occasionally drop a zip file into a directory in Azure File Storage. That directory name is InBound. So let's say a file called bigbook.zip is dropped into the InBound folder.
I need to create an Azure Function App that runs every 5 minutes and looks for zip files in the InBound directory. If any exist, then one by one, we create a new directory with the same name as the zip file inside another directory (called InProcess). So in our example, I would create InProcess/bigbook.
Now, inside InProcess/bigbook, I need to unzip bigbook.zip. So by the time the process is done running, InProcess/bigbook will contain all the contents of bigbook.zip.
Please note: This function I am creating is a Console App that will run as an Azure Function App. So there will be no file system access (at least, as far as I'm aware, anyway.) There is no option to download the zip file, unzip it, and then move the contents.
I am having a devil of a time figuring out how to do this in memory only. No matter what I try, I keep running into an Out of Memory exception. For now, I am just running this on my localhost in debug in Visual Studio 2017, .NET 4.7. In that setting, I am not able to convert the test zip file, which is 515,069 KB.
This was my first attempt:
private async Task<MemoryStream> GetMemoryStreamAsync(CloudFile inBoundfile)
{
    MemoryStream memstream = new MemoryStream();
    await inBoundfile.DownloadToStreamAsync(memstream).ConfigureAwait(false);
    return memstream;
}
And this (with high hopes) was my second attempt, thinking that DownloadRangeToStream would work better than just DownloadToStream.
private MemoryStream GetMemoryStreamByRange(CloudFile inBoundfile)
{
    MemoryStream outPutStream = new MemoryStream();
    inBoundfile.FetchAttributes();
    int bufferLength = 1 * 1024 * 1024; // 1 MB chunk
    long blobRemainingLength = inBoundfile.Properties.Length;
    long offset = 0;
    while (blobRemainingLength > 0)
    {
        long chunkLength = (long)Math.Min(bufferLength, blobRemainingLength);
        using (var ms = new MemoryStream())
        {
            inBoundfile.DownloadRangeToStream(ms, offset, chunkLength);
            lock (outPutStream)
            {
                outPutStream.Position = offset;
                var bytes = ms.ToArray();
                outPutStream.Write(bytes, 0, bytes.Length);
            }
        }
        offset += chunkLength;
        blobRemainingLength -= chunkLength;
    }
    return outPutStream;
}
But either way, I am running into memory issues. I presume it's because the MemoryStream I am trying to create gets too large?
How else can I tackle this? And again, downloading the zip file is not an option, as the app will ultimately be an Azure Function App. I'm also pretty sure that using a FileStream isn't an option either, as that requires a local file path, which I don't have. (I only have a remote Azure URL)
Could I somehow create a temp file in the same Azure Storage account that the zip file is in, and stream the zip file to that temp file instead of to a memory stream? (Thinking out loud.)
The goal is to get the stream into a ZipArchive using:
ZipArchive archive = new ZipArchive(stream);
And from there I can extract all the contents. But getting to that point w/o memory errors is proving a real bugger.
Any ideas?
Using an Azure Storage file share, this is the only way it worked for me without loading the entire ZIP into memory. I tested it with a 3 GB ZIP file (with thousands of files, or with one big file inside) and memory/CPU stayed low and stable. I hope it helps!
var zipFiles = _directory.ListFilesAndDirectories()
    .OfType<CloudFile>()
    .Where(x => x.Name.ToLower().Contains(".zip"))
    .ToList();

foreach (var zipFile in zipFiles)
{
    using (var zipArchive = new ZipArchive(zipFile.OpenRead()))
    {
        foreach (var entry in zipArchive.Entries)
        {
            if (entry.Length > 0)
            {
                CloudFile extractedFile = _directory.GetFileReference(entry.Name);
                using (var entryStream = entry.Open())
                {
                    byte[] buffer = new byte[16 * 1024];
                    using (var ms = extractedFile.OpenWrite(entry.Length))
                    {
                        int read;
                        while ((read = entryStream.Read(buffer, 0, buffer.Length)) > 0)
                        {
                            ms.Write(buffer, 0, read);
                        }
                    }
                }
            }
        }
    }
}
I would suggest you use memory snapshots to see why you are running out of memory within Visual Studio. You can use the tutorial in this article to find the culprit. Doing local development with a smaller file may help you continue to work if your machine is simply running out of memory.
When it comes to doing this within Azure, a node in the Consumption plan is limited to 1.5GB of total memory. If you expect to receive files larger than that then you should look at one of the other App Service plans that give you more memory to work with.
It is possible to store files within the function's local directory, so that is an option. You can't guarantee that you will be using the same local directory between executions, but this should work as long as you are using the file you downloaded within the same execution.
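If you go that route, here is a rough sketch (the directory names and the inProcessDirectory reference are placeholders, not working code): download the zip to the function's temp folder, extract it on disk, then upload each extracted file to the InProcess directory.
string localZip = Path.Combine(Path.GetTempPath(), "bigbook.zip");
string workDir = Path.Combine(Path.GetTempPath(), "bigbook");

// download the zip to local disk instead of a MemoryStream
using (var fs = File.Create(localZip))
{
    inBoundfile.DownloadToStream(fs);
}

// unzip on disk; nothing beyond the extractor's buffers is held in memory
ZipFile.ExtractToDirectory(localZip, workDir);

// push each extracted file up to the InProcess directory
foreach (string path in Directory.GetFiles(workDir))
{
    CloudFile target = inProcessDirectory.GetFileReference(Path.GetFileName(path));
    using (var fileStream = File.OpenRead(path))
    {
        target.UploadFromStream(fileStream);
    }
}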

EPPlus Open File and lock file through multiple saves

I want to be able to open an Excel file (or create it if it doesn't exist) and add data to it asynchronously. I have the async component working quite well using a BlockingCollection, but if I want to save on every loop of my while statement I keep running into issues.
I either get file corruption, or the data never saves at all; sometimes it only saves the first or second data segment in my two-part test.
I have the following code to show a similar cut down version of my issue:
BlockingCollection<Excel_Data> collection = null;
FileStream fs = new FileStream(this.path, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.Read);
ExcelPackage excel = new ExcelPackage(fs);

int i = 0;
while (true) {
    //---- do some async operations
    Excel_Data dict_item = collection.Take();

    excel.Workbook.Worksheets.Add("sheet" + i.ToString());
    //excel.Save();
    excel.SaveAs(fs);

    if (++i == 2) {
        break;
    }
}
fs.Close();
In the above example, after simply creating 2 sheets, the file already becomes corrupted, and I am unsure how to fix this without switching from FileStream to FileInfo entirely. But then I would never be able to lock the file for writing for the duration of my app.
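One thing that may be worth trying (a minimal sketch, assuming your EPPlus version's SaveAs(Stream) writes at the current position and leaves the stream open) is to rewind and truncate the shared FileStream before each save, so an earlier, larger save cannot leave trailing bytes behind while the file stays locked by the open stream:
FileStream fs = new FileStream(this.path, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.Read);
ExcelPackage excel = new ExcelPackage();

int i = 0;
while (true) {
    excel.Workbook.Worksheets.Add("sheet" + i.ToString());

    fs.SetLength(0);   // drop whatever the previous save wrote
    fs.Position = 0;   // write the new package image from the start
    excel.SaveAs(fs);
    fs.Flush();

    if (++i == 2) {
        break;
    }
}
fs.Close();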

Generate files and ZIP without memory stream

I'm looking for a way to store files in a zip file without a memory stream. My goal is to use as little system memory as possible; direct disk I/O is no problem.
I iterate over a database result set from which I have collected some blobs; these are byte arrays.
What I do is the following (System.IO.Compression):
using (var archive = ZipFile.Open("data.zip", ZipArchiveMode.Update))
{
    foreach (var result in results)
    {
        string fileName = $"{result.Id}.bin";
        using (var fileStream = new FileStream(fileName, FileMode.Create, FileAccess.Write))
        {
            // write the blob data from result.Value
            fileStream.Write(result.Value, 0, result.Value.Length);
            fileStream.Close();
        }
        archive.CreateEntryFromFile(fileName, fileName);
    }
}
There are 2 problems with this implementation:
1. I have my *.bin files AND the one *.zip (I only need the zip)
2. I don't know why, but this uses a lot of RAM (~100 MB for 15 x 1.5 MB bin files)
Is there a way to bypass memory completely?
UPDATE:
What I'm trying to achieve is to generate one ZIP file that contains single binary files generated from database blobs. This should happen inside an ASP.NET Web API controller. A user can request the data, but instead of sending the whole data in the HTTP response, I generate the ZIP file at the time of the request, save it to a local file server and send a download link back to the user.
I think your >100 MB is coming from:
1. the results object, which should contain at least 15 x 1.5 MB of blob data
2. holding the resulting data.zip open inside the foreach scope
To minimize the RAM usage of the worker process:
create an empty zip file
do {
    (single BLOB query from DB)
    (write the blob to a new file, or overwrite the existing file)
    (open the zip file for append)
    (append the file to the zip)
    (close and dispose **both** file handles / objects)
}
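Alternatively, since the intermediate *.bin files exist only to feed CreateEntryFromFile, here is a sketch of writing the blobs straight into the archive (assuming the results items expose Id and Value as in the question). ZipArchiveMode.Create streams each entry directly to the output file, whereas Update mode keeps the archive contents buffered in memory, so neither the *.bin files nor the whole archive have to be held in RAM:
using (var zipStream = new FileStream("data.zip", FileMode.Create, FileAccess.Write))
using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Create))
{
    foreach (var result in results)
    {
        // create one entry per blob and write its bytes straight through to disk
        ZipArchiveEntry entry = archive.CreateEntry($"{result.Id}.bin", CompressionLevel.Optimal);
        using (Stream entryStream = entry.Open())
        {
            entryStream.Write(result.Value, 0, result.Value.Length);
        }
    }
}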

Reading file after writing it

I have a strange problem. My code works as follows:
1. The exe takes some data from the user.
2. It calls a web service to write the file (and create a CSV for the data) at a particular network location (say \\some-server\some-directory). Although this web service is hosted at the same location as that folder (i.e. I can also change it to be c:\some-directory), it returns after writing the file.
3. The exe checks that the file exists; if it does, further processing continues, else it quits with an error.
The problem I am having is at step 3. When I try to read the file immediately after it has been written, I always get a file-not-found exception (but the file is present). I do not get this exception when I am debugging (because stepping through the code introduces a delay) or when I call Thread.Sleep(3000) before reading the file.
This is really strange because I close the StreamWriter before I return the call to the exe. According to the documentation, Close should force the stream to flush. This is also not related to the size of the file. I am not making async calls for writing and reading the file either; they run serially, one after another (only the writing is done by a web service and the reading by the exe, but the calls are still serial).
It feels like there is some delay between calling Close() and the file actually being written to disk. However, this is baffling because it is not related to size at all; it happens for every file size. I have tried this with files of 10, 50, 100 and 200 lines of data.
Another thing I suspected was that, since I was writing this file to a network location, Windows might be optimizing the call by writing to a cache first and then to the network location. So I changed the code to write to a local drive (i.e. c:\some-directory) rather than the network location, but it resulted in the same error.
There is no error in the code (for reading or writing). As explained earlier, adding a delay makes it work fine. Some other useful information:
- The exe is .NET Framework 3.5
- Windows Server 2008 (64-bit, 4 GB RAM)
Edit 1
File.AppendAllText() is not the correct solution, as it creates a new file if one does not exist.
Edit 2
Code for writing:
using (FileStream fs = new FileStream(outFileName, FileMode.Create))
{
    using (StreamWriter writer = new StreamWriter(fs, Encoding.Unicode))
    {
        writer.WriteLine(someString);
    }
}
Code for reading:
StreamReader rdr = new StreamReader(File.OpenRead(CsvFilePath));
string header = rdr.ReadLine();
rdr.Close();
Edit 3
Used TextWriter, same error:
using (TextWriter writer = File.CreateText(outFileName))
{
}
Edit 4
Finally, as suggested by some users, I check for the file in a while loop a certain number of times before throwing the file-not-found exception.
int i = 1;
while (i++ < 10)
{
    bool fileExists = File.Exists(CsvFilePath);
    if (!fileExists)
        System.Threading.Thread.Sleep(500);
    else
        break;
}
So you are writing a stream to a file, then reading the file back to a stream? Do you need to write the file then post process it, or can you not just use the source stream directly?
If you need the file, I would use a loop that keeps checking if the file exists every second until it appears (or a silly amount of time has passed) - the writer would give you an error if you couldn't write the file, so you know it will turn up eventually.
Since you're writing over a network, the best solution would be to save the file on the local system first and then copy it to the network location. This way you can avoid network connection problems, and you also have a backup in case of a network failure.
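A minimal sketch of that suggestion (the temp file name and the UNC target are placeholders based on the paths in the question):
// write the CSV locally first; this step cannot fail because of the network
string localPath = Path.Combine(Path.GetTempPath(), "data.csv");
File.WriteAllText(localPath, someString, Encoding.Unicode);

// copy the finished file to the network share in one go
string networkPath = @"\\some-server\some-directory\data.csv";
File.Copy(localPath, networkPath, true);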
Based on your update, try this instead:
File.WriteAllText(outFileName, someString);

header = null;
using (StreamReader reader = new StreamReader(CsvFilePath)) {
    header = reader.ReadLine();
}
Have you tried reading after disposing the writer's FileStream?
Like this:
using (FileStream fs = new FileStream(outFileName, FileMode.Create))
{
    using (StreamWriter writer = new StreamWriter(fs, Encoding.Unicode))
    {
        writer.WriteLine(someString);
    }
}

using (StreamReader rdr = new StreamReader(File.OpenRead(CsvFilePath)))
{
    string header = rdr.ReadLine();
}

Reusing a filestream

In the past I've always used a FileStream object to write or rewrite an entire file, after which I would immediately close the stream. However, now I'm working on a program in which I want to keep a FileStream open in order to allow the user to retain access to the file while they are working in between saves (see my previous question).
I'm using XmlSerializer to serialize my classes to and from an XML file. But now I'm keeping the FileStream open so it can be used to save (re-serialize) my class instance later. Are there any special considerations when reusing the same FileStream over and over, versus creating a new one each time? Do I need to reset the stream to the beginning between saves? If a later save is smaller than the previous one, will the FileStream leave the remaining bytes from the old file and thus create a corrupted file? Do I need to do something to clear the file so it behaves as if I were writing an entirely new file each time?
Your suspicion is correct - if you reset the position of an open file stream and write content that's smaller than what's already in the file, it will leave trailing data and result in a corrupt file (depending on your definition of "corrupt", of course).
If you want to overwrite the file, you really should close the stream when you're finished with it and create a new stream when you're ready to re-save.
I notice from your linked question that you are holding the file open in order to prevent other users from writing to it at the same time. This probably wouldn't be my choice, but if you are going to do that, then I think you can "clear" the file by invoking stream.SetLength(0) between successive saves.
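For illustration, a minimal sketch of that pattern with the XmlSerializer mentioned in the question (MyData and instance are placeholder names):
var serializer = new XmlSerializer(typeof(MyData));

// before each save, truncate whatever the previous (possibly larger) save left behind
stream.SetLength(0);
stream.Position = 0;
serializer.Serialize(stream, instance);
stream.Flush();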
There are various ways to do this; if you are re-opening the file, perhaps set it to truncate:
using (var file = new FileStream(path, FileMode.Truncate)) {
    // write
}
If you are overwriting the file while already open, then just trim it after writing:
file.SetLength(file.Position); // assumes we're at the new end
I would try to avoid delete/recreate, since this loses any ACLs etc.
Another option might be to use SetLength(0) to truncate the file before you start rewriting it.
Recently ran into the same requirement. In fact, previously, I used to create a new FileStream within a using statement and overwrite the previous file. Seems like the simple and effective thing to do.
using (var stream = new FileStream(path, FileMode.Create, FileAccess.Write))
{
    ProtoBuf.Serializer.Serialize(stream, value);
}
However, I ran into locking issues where some other process is locking the target file. In my attempt to thwart this I retried the write several times before pushing the error up the stack.
int attempt = 0;
while (true)
{
    try
    {
        using (var stream = new FileStream(path, FileMode.Create, FileAccess.Write))
        {
            ProtoBuf.Serializer.Serialize(stream, value);
        }
        break;
    }
    catch (IOException)
    {
        // could be locked by another process
        // make up to X attempts to write the file
        attempt++;
        if (attempt >= X)
        {
            throw;
        }
        Thread.Sleep(100);
    }
}
That seemed to work for almost everyone. Then that problem machine came along and forced me down the path of maintaining a lock on the file the entire time. So instead of retrying the write when the file is already locked, I now acquire the stream once and hold it open so there are no locking issues with later writes.
int attempt = 0;
while (true)
{
    try
    {
        _stream = new FileStream(path, FileMode.Open, FileAccess.ReadWrite, FileShare.Read);
        break;
    }
    catch (IOException)
    {
        // could be locked by another process
        // make up to X attempts to open the file
        attempt++;
        if (attempt >= X)
        {
            throw;
        }
        Thread.Sleep(100);
    }
}
Now, when I write the file, the FileStream position must be reset to zero, as Aaronaught said. I opted to "clear" the file by calling _stream.SetLength(0), which seemed like the simplest choice. Then, using our serializer of choice, Marc Gravell's protobuf-net, I serialize the value to the stream.
_stream.SetLength(0);
ProtoBuf.Serializer.Serialize(_stream, value);
This works just fine most of the time and the file is completely written to the disk. However, on a few occasions I've observed the file not being immediately written to the disk. To ensure the stream is flushed and the file is completely written to disk I also needed to call _stream.Flush(true).
_stream.SetLength(0);
ProtoBuf.Serializer.Serialize(_stream, value);
_stream.Flush(true);
Based on your question I think you'd be better served closing/re-opening the underlying file. You don't seem to be doing anything other than writing the whole file. The value you can add by re-writing Open/Close/Flush/Seek will be next to 0. Concentrate on your business problem.
