Generate files and ZIP without memory stream - c#

I'm looking for a way to store files in a zip file without a memory stream. My goal is to save a maximum of system memory, while direct disk IO is no problem.
I iterate over a database result set where I have collected some blobs. These are byte-arrays.
What I do is the following (System.IO.Compression):
using (var archive = ZipFile.Open("data.zip", ZipArchiveMode.Update))
{
    foreach (var result in results)
    {
        string fileName = $"{result.Id}.bin";
        using (var fileStream = new FileStream(fileName, FileMode.Create, FileAccess.Write))
        {
            // write the blob data from result.Value
            fileStream.Write(result.Value, 0, result.Value.Length);
        }
        archive.CreateEntryFromFile(fileName, fileName);
    }
}
There are two problems with this implementation:
1. I end up with my *.bin files AND the one *.zip (I only need the zip).
2. I don't know why, but this uses a lot of RAM (~100 MB for 15 x 1.5 MB bin files).
Is there a way to completely bypass the memory?
UPDATE:
What I'm trying to achieve is to generate one ZIP file that contains single binary files generated from database blobs. This should happen inside an ASP.NET Web API controller. A user can request the data, but instead of sending all the data in the HTTP response, I generate the ZIP file at request time, save it to a local file server and send a download link back to the user.

I think your >100 MB comes from:
- the results object, which should contain at least 15 x 1.5 MB of blob data
- holding the resulting data.zip open inside the foreach scope.
To minimize the RAM used by the worker process:
create empty zip-file
do {
(single BLOB query from DB)
(write blob to new or overwrite File)
(open zip file for append)
(append file to zip)
(close and dispose **both** file handles / objects )
}
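A variation on the steps above, offered only as a minimal sketch: if the archive is created fresh for each request, the temporary .bin files can be skipped by writing every blob straight into a zip entry. The sketch assumes the archive can be opened in ZipArchiveMode.Create; according to the ZipArchive documentation, Update mode holds the content of the entire archive in memory until the archive is disposed, which may explain part of the RAM usage seen in the question. BlobZipWriter, WriteBlobs and the key/value shape of the input are hypothetical names, and the blobs sequence stands in for a result set that is read lazily, one row at a time.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class BlobZipWriter
{
    // Hypothetical helper: writes each (entry name, data) pair as one entry of a new zip file.
    public static void WriteBlobs(string zipPath, IEnumerable<KeyValuePair<string, byte[]>> blobs)
    {
        // Create mode writes entries through to the file as they are closed.
        // It throws an IOException if zipPath already exists.
        using (ZipArchive archive = ZipFile.Open(zipPath, ZipArchiveMode.Create))
        {
            foreach (KeyValuePair<string, byte[]> blob in blobs)
            {
                ZipArchiveEntry entry = archive.CreateEntry(blob.Key, CompressionLevel.Fastest);
                using (Stream entryStream = entry.Open())
                {
                    // The blob goes directly into the entry; no intermediate file is created.
                    entryStream.Write(blob.Value, 0, blob.Value.Length);
                }
            }
        }
    }
}
```

On .NET Framework, ZipFile lives in the System.IO.Compression.FileSystem assembly, so that reference is needed in addition to System.IO.Compression. Memory use still depends on how the blobs are fetched: if the whole result set is materialized first, all blobs remain in memory regardless of how the zip is written.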

Related

How do I process a large number of zipped files loaded in memory without having to write them to the disk?

I have a number of zipped files in a folder stored on a Samsung EVO 970 SSD. Each zip file is 2GB+ (while compressed) with 200K+ text files contained within, each file being between 5 to 1.5MB, essentially a large number of small text files.
Rather than extract the zip file and process each text file individually to an SSD, I'm trying to load each zip file in memory in full at the start of the processing and then read each file like shown at the end here.
My (maybe naive) thinking is that if I could figure out a way to hold the whole zip file contents in RAM and process the text contents without decompressing the zip contents to disk, I would see a material boost in processing performance.
Currently it is taking about 10 ms on average to process every single text file, even with the approach taken below.
var myMS = new MemoryStream();
using (var file = File.OpenRead(zipFile))
{
    // load the whole zip file into memory
    file.CopyTo(myMS);
    using (var zip = new ZipArchive(myMS, ZipArchiveMode.Read))
    {
        foreach (var entry in zip.Entries)
        {
            using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
            {
                string fileContents = reader.ReadToEnd();
                //do something with the file
            }
        }
    }
}
My question is: does this approach make sense? Given that the total number of files, including all zip files, is in the millions, I could be sat here for a week waiting for processing to finish.

Unzip a LARGE zip file in Azure File Storage w/o "Out of Memory" exception

Here's what I'm dealing with...
Some process (out of our control) will occasionally drop a zip file into a directory in Azure File Storage. That directory name is InBound. So let's say a file called bigbook.zip is dropped into the InBound folder.
I need to create an Azure Function App that runs every 5 minutes and looks for zip files in the InBound directory. If any exists, then one-by-one, we create a new directory by the same name as the zip file in another directory (called InProcess). So in our example, I would create InProcess/bigbook.
Now inside InProcess/bigbook, I need to unzip bigbook.zip. So by the time the process is done running InProcess/bigbook will contain all the contents of bigbook.zip.
Please note: This function I am creating is a Console App that will run as an Azure Function App. So there will be no file system access (at least, as far as I'm aware, anyway.) There is no option to download the zip file, unzip it, and then move the contents.
I am having a devil of a time figuring out how to do this in memory only. No matter what I try, I keep running into an Out Of Memory exception. For now, I am just doing this on my localhost running in debug in Visual Studio 2017, .NET 4.7. In that setting, I am not able to convert the test zip file, which is 515,069KB.
This was my first attempt:
private async Task<MemoryStream> GetMemoryStreamAsync(CloudFile inBoundfile)
{
MemoryStream memstream = new MemoryStream();
await inBoundfile.DownloadToStreamAsync(memstream).ConfigureAwait(false);
return memstream;
}
And this (with high hopes) was my second attempt, thinking that DownloadRangeToStream would work better than just DownloadToStream.
private MemoryStream GetMemoryStreamByRange(CloudFile inBoundfile)
{
MemoryStream outPutStream = new MemoryStream();
inBoundfile.FetchAttributes();
int bufferLength = 1 * 1024 * 1024;//1 MB chunk
long blobRemainingLength = inBoundfile.Properties.Length;
long offset = 0;
while (blobRemainingLength > 0)
{
long chunkLength = (long)Math.Min(bufferLength, blobRemainingLength);
using (var ms = new MemoryStream())
{
inBoundfile.DownloadRangeToStream(ms, offset, chunkLength);
lock (outPutStream)
{
outPutStream.Position = offset;
var bytes = ms.ToArray();
outPutStream.Write(bytes, 0, bytes.Length);
}
}
offset += chunkLength;
blobRemainingLength -= chunkLength;
}
return outPutStream;
}
But either way, I am running into memory issues. I presume it's because the MemoryStream I am trying to create gets too large?
How else can I tackle this? And again, downloading the zip file is not an option, as the app will ultimately be an Azure Function App. I'm also pretty sure that using a FileStream isn't an option either, as that requires a local file path, which I don't have. (I only have a remote Azure URL)
Could I somehow create a temp file in the same Azure Storage account that the zip file is in, and stream the zip file to that temp file instead of to a memory stream? (Thinking out loud.)
The goal is to get the stream into a ZipArchive using:
ZipArchive archive = new ZipArchive(stream)
And from there I can extract all the contents. But getting to that point w/o memory errors is proving a real bugger.
Any ideas?
Using Azure Storage File Share this is the only way it worked for me without loading the entire ZIP into Memory. I tested with a 3GB ZIP File (with thousands of files or with a big file inside) and Memory/CPU was low and stable. I hope it helps!
var zipFiles = _directory.ListFilesAndDirectories()
    .OfType<CloudFile>()
    .Where(x => x.Name.ToLower().Contains(".zip"))
    .ToList();
foreach (var zipFile in zipFiles)
{
    using (var zipArchive = new ZipArchive(zipFile.OpenRead()))
    {
        foreach (var entry in zipArchive.Entries)
        {
            if (entry.Length > 0)
            {
                CloudFile extractedFile = _directory.GetFileReference(entry.Name);
                using (var entryStream = entry.Open())
                {
                    byte[] buffer = new byte[16 * 1024];
                    using (var ms = extractedFile.OpenWrite(entry.Length))
                    {
                        int read;
                        while ((read = entryStream.Read(buffer, 0, buffer.Length)) > 0)
                        {
                            ms.Write(buffer, 0, read);
                        }
                    }
                }
            }
        }
    }
}
I would suggest you use memory snapshots to see why you are running out of memory within Visual Studio. You can use the tutorial in this article to find the culprit. Doing local development with a smaller file may help you continue to work if your machine is simply running out of memory.
When it comes to doing this within Azure, a node in the Consumption plan is limited to 1.5GB of total memory. If you expect to receive files larger than that then you should look at one of the other App Service plans that give you more memory to work with.
It is possible to store files within the function's local directory, so that is an option. You can't guarantee that you will be using the same local directory between executions, but this should work as long as you use the downloaded file within the same execution.
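The local-directory option could look roughly like the sketch below. It is only an illustration under assumptions: Path.GetTempPath() is assumed to point at a writable local directory, source stands in for a readable stream over the remote zip, and writeEntry stands in for whatever uploads one extracted entry. TempFileUnzip, ExtractViaTempFile and the callback shape are hypothetical names, not part of any Azure SDK.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class TempFileUnzip
{
    // Hypothetical helper: spools the zip to a temp file, then hands each entry
    // (full name, uncompressed length, content stream) to the caller.
    public static void ExtractViaTempFile(Stream source, Action<string, long, Stream> writeEntry)
    {
        string tempPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        try
        {
            // CopyTo moves the data through a small buffer, so the whole zip
            // is never held in memory at once.
            using (FileStream temp = File.Create(tempPath))
            {
                source.CopyTo(temp);
            }

            // ZipArchive needs a seekable stream in Read mode; a FileStream provides that.
            using (FileStream temp = File.OpenRead(tempPath))
            using (var archive = new ZipArchive(temp, ZipArchiveMode.Read))
            {
                foreach (ZipArchiveEntry entry in archive.Entries)
                {
                    if (entry.Length == 0) continue; // skips directory entries and empty files
                    using (Stream entryStream = entry.Open())
                    {
                        writeEntry(entry.FullName, entry.Length, entryStream);
                    }
                }
            }
        }
        finally
        {
            // Clean up within the same execution; File.Delete does nothing if the file is absent.
            File.Delete(tempPath);
        }
    }
}
```

In the setting of the question, source could be the stream returned by zipFile.OpenRead() and writeEntry could copy into extractedFile.OpenWrite(length), both as used in the answer above.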

Out Of Memory Exception in Foreach

I am trying to create a function that will retrieve all the uploaded files (which are now saved as bytes in the database) and download them in a single zip file. I currently have 6000 files to download (and the number could grow).
The functionality is already working (from retrieval to download) if I limit the number of files being downloaded, otherwise, I get an OutOfMemoryException on the ForEach loop.
Here's some pseudocode (the files variable is a list of byte arrays and file names):
var files = getAllFilesFromDb();
foreach (var file in files)
{
    var tempFilePath = Path.Combine(path, file.filename);
    using (FileStream stream = new FileStream(tempFilePath, FileMode.Create, FileAccess.ReadWrite))
    {
        stream.Write(file.fileData, 0, file.fileData.Length);
    }
}
private readonly IEntityRepository<File> fileRepository;
IEnumerable<FileModel> getAllFilesFromDb()
{
    return fileRepository.Select(f => new FileModel(){ fileData = f.byteArray, filename = f.fileName});
}
My question is, is there any other way to do this to avoid getting such errors?
To avoid this problem, you could avoid loading the contents of all the files in one go. Most likely you will need to split your database call into two database calls:
1. One that retrieves a list of all the files without their contents but with some identifier, like the PK of the table.
2. One that retrieves the contents of an individual file.
Then your (pseudo)code becomes
get list of all files
for each file
get the file contents
write the file to disk
Another possibility is to alter the way your query works currently, so that it uses deferred execution - this means it will not actually load all the files at once, but stream them one at a time from the database - but without seeing more code from your repository implementation, I cannot/ will not guess the right solution for you.
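The two-call approach described above can be sketched as follows. This is a minimal illustration, not the repository's real API: the two delegates are hypothetical stand-ins for a query that returns only ids and file names, and a query that returns the bytes of a single file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FileExporter
{
    // Hypothetical delegates: 'listFiles' returns (id, file name) pairs without contents,
    // 'loadContents' fetches the bytes of a single file by its id.
    public static int Export(
        string targetDirectory,
        Func<IEnumerable<KeyValuePair<int, string>>> listFiles,
        Func<int, byte[]> loadContents)
    {
        int written = 0;
        foreach (KeyValuePair<int, string> file in listFiles())
        {
            // Only one file's bytes are referenced at a time; the array can be
            // garbage collected once the iteration is over.
            byte[] contents = loadContents(file.Key);
            File.WriteAllBytes(Path.Combine(targetDirectory, file.Value), contents);
            written++;
        }
        return written;
    }
}
```

The list query stays cheap because it carries no blob data, and peak memory is bounded by the largest single file rather than by the sum of all 6000.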

Storing files in SQL Server database

I want to store files in my SQL Server database from C#, which I have already done without problems.
This is my code:
byte[] file;
using (var stream = new FileStream(letter.FilePath, FileMode.Open, FileAccess.Read))
{
using (var reader = new BinaryReader(stream))
{
file = reader.ReadBytes((int)stream.Length);
letter.ltr_Image = file;
}
}
LetterDB letterDB = new LetterDB();
id = letterDB.LetterActions(letter);
The SQL insert is performed in the LetterActions method. But I want to know: in order to reduce the size of the database (which grows daily), is there any way to compress the files before storing them in the database?
Yes, you can zip your files before storing them in the database, using the ZipFile class. Take a look here: https://msdn.microsoft.com/en-us/library/system.io.compression.zipfile(v=vs.110).aspx
Plenty of sample code out there too. See here: http://imar.spaanjaars.com/414/storing-uploaded-files-in-a-database-or-in-the-file-system-with-aspnet-20
You can compress the file like this, then insert the compressed file stream into the DB, but when you read it back you need to decompress it.
If you really need to store files in the DB, I suggest compressing and decompressing them on the client.
A better way to handle files is to store them on disk and store only the file path in the DB; when a client needs a file, use the file path to get it.
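As a rough sketch of the compress-before-insert idea from the answers above: the example below uses GZipStream rather than the ZipFile class mentioned in the first answer, because GZipStream works directly on byte arrays and needs no file on disk. BlobCompression and its method names are hypothetical.

```csharp
using System.IO;
using System.IO.Compression;

public static class BlobCompression
{
    // Compresses a byte array so the smaller result can be stored in the database column.
    public static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            // The GZipStream must be disposed before the result is read,
            // otherwise the final block and footer are not flushed.
            using (var gzip = new GZipStream(output, CompressionMode.Compress, leaveOpen: true))
            {
                gzip.Write(data, 0, data.Length);
            }
            return output.ToArray();
        }
    }

    // Reverses Compress after the bytes have been read back from the database.
    public static byte[] Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

With the question's code, this would amount to assigning BlobCompression.Compress(file) to letter.ltr_Image and calling Decompress when reading the column back. Formats that are already compressed (JPEG, PNG, most PDFs) are unlikely to shrink much, so it is worth measuring on real data first.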

Amazon S3 TransferUtility: use FilePath or Stream?

When uploading a file to S3 using the TransferUtility class, there is an option to either use FilePath or an input stream. I'm using multi-part uploads.
I'm uploading a variety of things, of which some are files on disk and others are raw streams. I'm currently using the InputStream variety for everything, which works OK, but I'm wondering if I should specialize the method further. For the files on disk, I'm basically using File.OpenRead and passing that stream to the InputStream of the transfer request.
Are there any performance gains, or other reasons, to prefer the FilePath method over the InputStream one where the input is known to be a file?
In short: Is this the same thing
using (var fs = File.OpenRead("some path"))
{
var uploadMultipartRequest = new TransferUtilityUploadRequest
{
BucketName = "defaultBucket",
Key = "key",
InputStream = fs,
PartSize = partSize
};
using (var transferUtility = new TransferUtility(s3Client))
{
await transferUtility.UploadAsync(uploadMultipartRequest);
}
}
As:
var uploadMultipartRequest = new TransferUtilityUploadRequest
{
BucketName = "defaultBucket",
Key = "key",
FilePath = "some path",
PartSize = partSize
};
using (var transferUtility = new TransferUtility(s3Client))
{
await transferUtility.UploadAsync(uploadMultipartRequest);
}
Or is there any significant difference between the two? I know whether files are large or not, and could prefer one method or the other based on that.
Edit: I've also done some decompiling of the S3Client, and there does indeed seem to be some difference in regards to the concurrency level of the transfer, as found in MultipartUploadCommand.cs
private int CalculateConcurrentServiceRequests()
{
int num = !this._fileTransporterRequest.IsSetFilePath() || this._s3Client is AmazonS3EncryptionClient ? 1 : this._config.ConcurrentServiceRequests;
if (this._totalNumberOfParts < num)
num = this._totalNumberOfParts;
return num;
}
From the TransferUtility documentation:
When uploading large files by specifying file paths instead of a
stream, TransferUtility uses multiple threads to upload multiple parts
of a single upload at once. When dealing with large content sizes and
high bandwidth, this can increase throughput significantly.
Which suggests that using file paths will use multipart upload, but using a stream won't.
But when I read through this Upload Method (stream, bucketName, key):
Uploads the contents of the specified stream. For large uploads, the
file will be divided and uploaded in parts using Amazon S3's multipart
API. The parts will be reassembled as one object in Amazon S3.
Which means that MultiPart is used on Streams as well.
Amazon recommends using multipart upload if the file size is larger than 100MB: http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
Multipart upload allows you to upload a single object as a set of
parts. Each part is a contiguous portion of the object's data. You can
upload these object parts independently and in any order. If
transmission of any part fails, you can retransmit that part without
affecting other parts. After all parts of your object are uploaded,
Amazon S3 assembles these parts and creates the object. In general,
when your object size reaches 100 MB, you should consider using
multipart uploads instead of uploading the object in a single
operation.
Using multipart upload provides the following advantages:
Improved throughput: You can upload parts in parallel to improve throughput.
Quick recovery from any network issues: Smaller part size minimizes the impact of restarting a failed upload due to a network error.
Pause and resume object uploads: You can upload object parts over time. Once you initiate a multipart upload there is no expiry; you must explicitly complete or abort the multipart upload.
Begin an upload before you know the final object size: You can upload an object as you are creating it.
So based on Amazon S3's documentation there is no difference between using a Stream and a FilePath, but it might make a slight performance difference depending on your code and OS.
I think the difference may be that they both use the Multipart Upload API, but using a FilePath allows for concurrent uploads, whereas:
When you're using a stream for the source of data, the TransferUtility
class does not do concurrent uploads.
https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingTheMPDotNetAPI.html
