How to stream an archive to S3 by parts in a Lambda function? - c#

I store a large number of a user's files in Amazon S3 and have to provide all of those files to the user on request.
For this purpose, I implemented a Lambda function that collects the paths to the user's files, creates a zip archive, and stores that archive back on S3, where the user can download it.
My code looks like this:
using (var s3Client = new AmazonS3Client())
{
    using (var memoryStream = new MemoryStream())
    {
        using (var zip = new ZipArchive(memoryStream, ZipArchiveMode.Create, true))
        {
            foreach (var file in m_filePathsOnS3)
            {
                var response = await s3Client.GetObjectAsync(m_sourceBucket, file);
                var name = file.Split('/').Last();
                ZipArchiveEntry entry = zip.CreateEntry(name);
                using (Stream entryStream = entry.Open())
                {
                    await response.ResponseStream.CopyToAsync(entryStream);
                }
            }
        }

        memoryStream.Position = 0;
        var putRequest = new PutObjectRequest
        {
            BucketName = m_resultBucket,
            Key = m_archivePath,
            InputStream = memoryStream
        };
        await s3Client.PutObjectAsync(putRequest);
    }
}
However, a Lambda function has a maximum memory allocation of 3,008 MB. So, if I understand correctly, I will run into problems when the archive grows beyond 3,008 MB.
I looked for a way to stream and archive the files on the fly.
Currently, I see only one option: move this Lambda function to an EC2 instance and run it as a service.

You are correct. There is also a limit of 500 MB of disk space (the /tmp directory), which would constrain a file-based approach as well.
Therefore, AWS Lambda is not a good use case for creating potentially very large zip files.
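If the work does move somewhere with more disk and memory headroom (the EC2 route mentioned in the question), one simple variation on the question's code is to write the archive to a temporary file instead of a MemoryStream and hand the upload to the AWS SDK's TransferUtility, which splits large uploads into parts automatically. A minimal sketch only, reusing the same fields as the question (m_filePathsOnS3, m_sourceBucket, m_resultBucket, m_archivePath) and assuming the temp location has enough free space:

    // Sketch: inside the async worker method. Assumes the AWSSDK.S3 package and
    // using Amazon.S3; using Amazon.S3.Transfer; using System.IO; using System.IO.Compression; using System.Linq;
    var tempZipPath = Path.Combine(Path.GetTempPath(), "archive.zip"); // assumption: enough disk space here

    using (var s3Client = new AmazonS3Client())
    {
        using (var zipFile = new FileStream(tempZipPath, FileMode.Create, FileAccess.ReadWrite))
        using (var zip = new ZipArchive(zipFile, ZipArchiveMode.Create))
        {
            foreach (var file in m_filePathsOnS3)
            {
                using (var response = await s3Client.GetObjectAsync(m_sourceBucket, file))
                {
                    var entry = zip.CreateEntry(file.Split('/').Last());
                    using (var entryStream = entry.Open())
                    {
                        // each source object is streamed straight into the zip on disk
                        await response.ResponseStream.CopyToAsync(entryStream);
                    }
                }
            }
        }

        // TransferUtility uploads large files in parts, so the archive never has to fit in memory.
        var transferUtility = new TransferUtility(s3Client);
        await transferUtility.UploadAsync(tempZipPath, m_resultBucket, m_archivePath);
    }

    File.Delete(tempZipPath);

The peak memory use is then roughly one entry's buffered data rather than the whole archive; the trade-off is that the archive has to fit on local disk, which is exactly the constraint the answer above points out for Lambda.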

Related

Read and Zip entries files in parallel

I am trying to create a zip from a list of files in parallel and stream it to the client.
I have working code where I iterate over the files sequentially, but I want them to be zipped in parallel instead (multiple files of more than 100 MB each).
using ZipArchive zipArchive = new(Response.BodyWriter.AsStream(), ZipArchiveMode.Create, leaveOpen: false);
for (int i = 0; i < arrLocalFilesPath.Length; i++) // iterate over files
{
    string strFilePath = arrLocalFilesPath[i]; // list of files path
    string strFileName = Path.GetFileName(strFilePath);

    ZipArchiveEntry zipEntry = zipArchive.CreateEntry(strFileName, CompressionLevel.Optimal);
    using Stream zipStream = zipEntry.Open();

    using FileStream fileStream = System.IO.File.Open(strFilePath, FileMode.Open, FileAccess.Read);
    fileStream.CopyTo(zipStream);
}
return new EmptyResult();
Parallel.For and Parallel.ForEach do not work with ZipArchive.
Since ZipArchive is not thread-safe, I am trying to use DotNetZip to accomplish this task.
I looked at the docs, and here's what I have so far using DotNetZip:
using Stream streamResponseBody = Response.BodyWriter.AsStream();
Parallel.For(0, arrLocalFilesPath.Length, i =>
{
    string strFilePath = arrLocalFilesPath[i]; // list of files path
    string strFileName = Path.GetFileName(strFilePath);
    string strCompressedOutputFile = strFilePath + ".compressed";

    byte[] arrBuffer = new byte[8192]; //[4096];
    int n = -1;

    using FileStream input = System.IO.File.OpenRead(strFilePath);
    using FileStream raw = new(strCompressedOutputFile, FileMode.Create, FileAccess.ReadWrite);
    using Stream compressor = new ParallelDeflateOutputStream(raw);

    while ((n = input.Read(arrBuffer, 0, arrBuffer.Length)) != 0)
    {
        compressor.Write(arrBuffer, 0, n);
    }

    input.CopyTo(streamResponseBody);
});
return new EmptyResult();
However, this doesn't zip the files and send them to the client (it only creates local compressed files on the server).
Using a MemoryStream or creating a local zip file is out of the question and not what I am looking for.
The server should seamlessly stream the bytes it reads from a file, zip them on the fly, and send them to the client as chunks (as in my ZipArchive version), but with the added benefit of reading those files in parallel and creating a zip from them.
I know that parallelism is usually not optimal for I/O (sometimes it's a bit worse), but zipping multiple big files in parallel should be faster in this case.
I also tried to use SharpZipLib, without success.
Using any other library is fine, as long as it reads and streams the files to the client seamlessly without impacting memory.
Any help is appreciated.
If these files are on the same drive, there won't be any speed-up. Parallelization helps with compressing and decompressing the data, but the disk I/O cannot be done in parallel.
Assuming the files are not on the same drive and there is a chance to speed this process up...
Are you sure Stream.CopyTo() is thread-safe? Either check the docs, use a single thread, or put a lock around it.
EDIT:
I've checked my old code, where I was packing a huge amount of data into a zip file using ZipArchive. I did it in parallel, but there was no I/O read involved.
You can use ZipArchive with Parallel.For or Parallel.ForEach, but you need a lock:
//create zip into stream
using (ZipArchive zipArchive = new ZipArchive(zipFS, ZipArchiveMode.Update, false))
{
    //use parallel foreach, but not for the IO read operation!
    Parallel.ForEach(listOfFiles, filename =>
    {
        //prepare memory for the entry
        MemoryStream ms = new MemoryStream();
        /* fill the memory stream here - I did another packing with BZip2OutputStream,
           because the zip was packed without compression to speed up random decompression */

        //only one thread can touch the (not thread-safe) archive at a time
        lock (zipArchive)
        {
            //create a file entry and open its stream for writing
            ZipArchiveEntry zipFileEntry = zipArchive.CreateEntry(filename);
            using (Stream zipEntryStream = zipFileEntry.Open())
            {
                ms.Position = 0; // rewind the stream
                StreamUtils.Copy(ms, zipEntryStream, new byte[4096]); //from ICSharpCode.SharpZipLib.Core, copy the memory stream data into the zip entry
            }
        }
    });
}
Anyway, if you need to read the files first, that is your performance bottleneck. You won't gain a lot (if anything) from a parallel approach here.
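If you want to stay inside System.IO.Compression, the same pattern as the snippet above can be sketched with the BCL alone: do the expensive compression in parallel with GZipStream, and only serialize the (stored, uncompressed) writes into the archive behind a lock. Note the trade-offs: the result is a zip of .gz entries that the client would have to gunzip individually, and the disk reads still compete for the same drive. listOfFiles and outputStream are assumptions, not names from the question:

    // Assumes: using System.IO; using System.IO.Compression; using System.Threading.Tasks;
    // listOfFiles (IEnumerable<string>) and outputStream (writable Stream) are placeholders.
    using (var zipArchive = new ZipArchive(outputStream, ZipArchiveMode.Create, leaveOpen: true))
    {
        Parallel.ForEach(listOfFiles, filePath =>
        {
            // CPU-heavy compression runs in parallel, outside the lock.
            var packed = new MemoryStream();
            using (var input = File.OpenRead(filePath))
            using (var gzip = new GZipStream(packed, CompressionLevel.Optimal, leaveOpen: true))
            {
                input.CopyTo(gzip);
            }
            packed.Position = 0;

            // ZipArchive is not thread safe: only one thread may write at a time,
            // and the pre-compressed bytes are stored without re-compression.
            lock (zipArchive)
            {
                var entry = zipArchive.CreateEntry(Path.GetFileName(filePath) + ".gz",
                                                   CompressionLevel.NoCompression);
                using (var entryStream = entry.Open())
                {
                    packed.CopyTo(entryStream);
                }
            }
        });
    }

Each entry's compressed bytes are buffered in memory before being written, so this parallelizes the compression but not the archive writes, which matches the point above about where the real bottleneck is.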

Does ZipArchive load entire zip file into memory

If I stream a zip file like so:
using var zip = new ZipArchive(fileStream, ZipArchiveMode.Read);
using var sr = new StreamReader(zip.Entries[0].Open());
var line = sr.ReadLine(); //etc..
Am I streaming the zip file entry, or is the entire zip file loaded into memory and then I am streaming the uncompressed file?
It depends on how the fileStream was created. Was it created from a file on disk? If so, then ZipArchive will read from disk as it needs data. It won't put the entire thing in memory and then read it; that would be incredibly inefficient.
I have a lot of experience with this: I worked on a project where I had to unarchive 25 GB .zip files. .NET's ZipArchive was very quick and very memory-efficient.
You can also have MemoryStreams containing data that ZipArchive can read from, so you aren't limited to zip files on disk.
Here is a reasonably efficient way to unzip a ZipArchive:
var di = new DirectoryInfo(Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.CommonApplicationData), "MyDirectoryToExtractTo"));

var filesToExtract = _zip.Entries.Where(x =>
    !string.IsNullOrEmpty(x.Name) &&
    !x.FullName.EndsWith("/", StringComparison.Ordinal));

foreach (var x in filesToExtract)
{
    var fi = new FileInfo(Path.Combine(di.FullName, x.FullName));
    if (!fi.Directory.Exists) { fi.Directory.Create(); }

    using (var i = x.Open())
    using (var o = fi.OpenWrite())
    {
        i.CopyTo(o);
    }
}
This will extract all the files to C:\ProgramData\MyDirectoryToExtractTo\ keeping directory structure.
If you'd like to see how ZipArchive was implemented to verify, take a look here.
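Building on the point above that the source does not have to be a file on disk, here is a minimal sketch of the same lazy, entry-by-entry reading over a MemoryStream. zipBytes and GetDownloadedArchiveBytes are hypothetical placeholders for an archive that was already downloaded into memory:

    // Sketch: read a zip that exists only in memory; each entry is decompressed on demand.
    // Assumes: using System.IO; using System.IO.Compression; using System.Linq;
    byte[] zipBytes = GetDownloadedArchiveBytes(); // hypothetical helper, not from the answer

    using (var zip = new ZipArchive(new MemoryStream(zipBytes), ZipArchiveMode.Read))
    {
        foreach (var entry in zip.Entries.Where(e => !string.IsNullOrEmpty(e.Name)))
        {
            using (var reader = new StreamReader(entry.Open()))
            {
                var firstLine = reader.ReadLine(); // only this entry's data is decompressed here
            }
        }
    }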

Out Of Memory Exception when zipping memory stream

I have a text box where a user can submit a list of document ids in order to download those files, zipped up, from an Azure blob.
The code currently builds a zip memory stream, and then for each document id submitted it builds another memory stream, fetches the file into that stream, and adds it to the zip file. The issue is that when we are building the memory stream and fetch a file larger than 180 MB, the program throws an out-of-memory exception.
Here is the code:
public async Task<byte[]> BuildZipStream(string valueDataUploadContainerName, IEnumerable<Document> docs)
{
    var zipMemStream = new MemoryStream();
    using (Ionic.Zip.ZipFile zip = new Ionic.Zip.ZipFile())
    {
        zip.Name = System.IO.Path.GetTempFileName();
        var insertedEntries = new List<string>();

        foreach (var doc in docs)
        {
            var EntryName = $"{doc.Name}{Path.GetExtension(doc.DocumentPath)}";
            if (insertedEntries.Contains(EntryName))
            {
                EntryName = $"{doc.Name} (1){Path.GetExtension(doc.DocumentPath)}";
                var i = 1;
                while (insertedEntries.Contains(EntryName))
                {
                    EntryName = $"{doc.Name} ({i.ToString()}){Path.GetExtension(doc.DocumentPath)}";
                    i++;
                }
            }
            insertedEntries.Add(EntryName);

            var file = await GetFileStream(blobFolderName, doc.DocumentPath);
            if (file != null)
                zip.AddEntry($"{EntryName}", file);
        }
        zip.Save(zipMemStream);
    }
    zipMemStream.Seek(0, 0);
    return zipMemStream.ToArray();
}
And then, for actually getting the file from blob storage:
public async Task<byte[]> GetFileStream(string container, string filename)
{
    var blobStorageAccount = _keyVaultService.GetSecret(new KeyVaultModel { Key = storageLocation });
    var storageAccount = CloudStorageAccount.Parse(blobStorageAccount ?? _config.Value.StorageConnection);
    var blobClient = storageAccount.CreateCloudBlobClient();
    var blobContainer = blobClient.GetContainerReference(container);
    await blobContainer.CreateIfNotExistsAsync();

    var blockBlob = blobContainer.GetBlockBlobReference(filename);
    if (blockBlob.Exists())
    {
        using (var mStream = new MemoryStream())
        {
            await blockBlob.DownloadToStreamAsync(mStream);
            mStream.Seek(0, 0);
            return mStream.ToArray();
        }
    }
    return null; // blob not found
}
The problem occurs when the program hits await blockBlob.DownloadToStreamAsync(mStream); it sits and spins for a while and then throws an out-of-memory exception.
I have read a few different solutions, none of which have worked for me; the most common is to change the Platform target under project properties to at least x64 (I am running this as x86). Another option would be to move the GetFileStream logic into BuildZipStream, but then I feel that method would be doing too much.
Any suggestions?
EDIT:
The problem is actually occurring when the program hits zip.Save(zipMemStream).
Your methodology here is flawed, because you do not know:
the number of files, or
the size of each file.
You therefore cannot determine whether the server will have enough RAM to hold all of those files in memory. What you are doing is collecting every Azure blob file the user lists and putting it into a zip file in memory, while also downloading each file into memory. It's no wonder you're getting an out-of-memory exception: even with 128 GB of RAM, if the user requests large enough files you'll run out of memory.
Your best solution, and the most common practice for downloading and zipping multiple Azure blob files, is to use a temporary blob file.
Instead of writing to a MemoryStream, write to a FileStream, place that zipped file onto the Azure blob, and then serve the zipped blob file. Once the file has been served, remove it from the blob container.
Hope this helps.
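A rough sketch of that advice against the question's code, still using DotNetZip and the same storage SDK; the temp path, the blob name, and the blobContainer reference (a CloudBlobContainer obtained the same way as in GetFileStream) are assumptions, and the duplicate-name handling from the question is omitted for brevity. The documents are still buffered one by one as byte arrays, as in the question, but the archive itself is written to disk rather than to the MemoryStream where the EDIT says the exception occurs:

    // Sketch only: save the zip to a temp file, upload that file as a blob, then clean up.
    var tempZipPath = Path.GetTempFileName();

    using (var zip = new Ionic.Zip.ZipFile())
    {
        foreach (var doc in docs)
        {
            var file = await GetFileStream(blobFolderName, doc.DocumentPath);
            if (file != null)
                zip.AddEntry($"{doc.Name}{Path.GetExtension(doc.DocumentPath)}", file);
        }
        zip.Save(tempZipPath); // compresses to disk, not into a MemoryStream
    }

    // blobContainer: assumed CloudBlobContainer for the container that holds the zipped results.
    var zipBlob = blobContainer.GetBlockBlobReference($"archives/{Guid.NewGuid()}.zip"); // placeholder name
    using (var zipFileStream = File.OpenRead(tempZipPath))
    {
        await zipBlob.UploadFromStreamAsync(zipFileStream);
    }
    File.Delete(tempZipPath);

The caller would then hand the client a link to zipBlob (for example a SAS URL) instead of returning a byte[], and delete the blob once it has been served.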

How to extract multi-volume archive within Azure Blob Storage?

I have a multi-volume archive stored in Azure Blob Storage that is split into a series of zips titled like this: Archive-Name.zip.001, Archive-Name.zip.002, etc., up to Archive-Name.zip.010. Each file is 250 MB and contains hundreds of PDFs.
Currently we are trying to iterate through each archive part and extract the PDFs. This works except when the last PDF in an archive part has been split across two parts: ZipFile in C# is unable to process the split file and throws an exception.
We tried reading all the archive parts into a single MemoryStream and then extracting the files, however we found that the memory stream exceeds 2 GB, which is the limit, so this method does not work either.
It is not feasible to download the archive into a machine's memory, extract it, and then upload the PDFs to a new file. The extraction needs to be done in Azure, where the program will run.
This is the code we are currently using; it is unable to handle PDFs split between two archive parts.
public static void UnzipTaxForms(TextWriter log, string type, string fiscalYear)
{
    var folderName = "folderName";
    var outPutContainer = GetContainer("containerName");
    CreateIfNotExists(outPutContainer);

    var fileItems = ListFileItems(folderName);
    fileItems = fileItems.Where(i => i.Name.Contains(".zip")).ToList();

    foreach (var file in fileItems)
    {
        using (var ziped = ZipFile.Read(GetMemoryStreamFromFile(folderName, file.Name)))
        {
            foreach (var zipEntry in ziped)
            {
                using (var outPutStream = new MemoryStream())
                {
                    zipEntry.Extract(outPutStream);
                    var blockblob = outPutContainer.GetBlockBlobReference(zipEntry.FileName);
                    outPutStream.Seek(0, SeekOrigin.Begin);
                    blockblob.UploadFromStream(outPutStream);
                }
            }
        }
    }
}
One more note: we are unable to change the way the multi-volume archive is generated. Any help would be appreciated.

Download file from one stream and save it to zip in another stream without putting all data in memory

I'm a complete beginner at dealing with streams or anything like this, so if you see anything obviously wrong... that's why. I have some files stored on Azure. I need to take those files, zip them up, and return the zip.
I've tested this with a 1 GB file and, although it works, it ends up using 2.5 GB of memory. Memory usage spikes between when the last line starts and when it completes. I'm not sure why this is loading everything into memory, so I'm not quite sure what I'm supposed to do to prevent that from happening. What's the correct way to do it? The only thing I can think of is to specify buffer sizes somewhere, but everywhere I've seen that allows it, the default is already small.
FileStream zipToOpen = new FileStream("test.zip", FileMode.Create);
ZipArchive archive = new ZipArchive(zipToOpen, ZipArchiveMode.Update, true);
ZipArchiveEntry zipEntry = archive.CreateEntry("entryName", CompressionLevel.Optimal);
// ... some azure code
file = await dir.GetFileReference(fileName);
fileContents = await file.OpenReadAsync().ConfigureAwait(false);
await fileContents.CopyToAsync(zipEntry.Open()).ConfigureAwait(false);
Just create the archive as
ZipArchive archive = new ZipArchive(zipToOpen, ZipArchiveMode.Create);
Your memory consumption will drop to a minimum (in my test case it dropped from 900 MB to 36 MB)...
The problem seems to be related to ZipArchiveMode.Update: in Update mode the archive contents are held in memory until the archive is disposed so that entries can still be modified, whereas Create mode writes each entry straight through to the underlying stream.
void Zip(IEnumerable<string> files, Stream outputStream) // destination stream the zip is written to
{
    using (var zip = new System.IO.Compression.ZipArchive(outputStream, System.IO.Compression.ZipArchiveMode.Create))
    {
        foreach (var file in files.Select(f => new FileInfo(f)))
        {
            var entry = zip.CreateEntry(file.Name, System.IO.Compression.CompressionLevel.Fastest);
            using (var s = entry.Open())
            using (var f = File.Open(file.FullName, FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                f.CopyTo(s);
            }
        }
    }
}
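As a usage note, the helper above can target any writable destination stream. A hypothetical call writing a few files into a zip on disk (all paths are placeholders):

    using (var output = File.Create(@"C:\temp\out.zip")) // placeholder output path
    {
        Zip(new[] { @"C:\data\a.bin", @"C:\data\b.bin" }, output); // placeholder input files
    }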
