I have a text box where a user can submit a list of document ids, to download those files zipped up from an Azure blob container.
The code currently builds a zip memory stream, and then for each document id submitted we build a memory stream, get the file into that stream, and add it to the zip file. The issue is that when we are building the memory stream and the file we are getting is larger than 180 MB, the program throws an out-of-memory exception.
Here is the code:
public async Task<byte[]> BuildZipStream(string valueDataUploadContainerName, IEnumerable<Document> docs)
{
    var zipMemStream = new MemoryStream();
    using (Ionic.Zip.ZipFile zip = new Ionic.Zip.ZipFile())
    {
        zip.Name = System.IO.Path.GetTempFileName();
        var insertedEntries = new List<string>();
        foreach (var doc in docs)
        {
            var EntryName = $"{doc.Name}{Path.GetExtension(doc.DocumentPath)}";
            if (insertedEntries.Contains(EntryName))
            {
                EntryName = $"{doc.Name} (1){Path.GetExtension(doc.DocumentPath)}";
                var i = 1;
                while (insertedEntries.Contains(EntryName))
                {
                    EntryName = $"{doc.Name} ({i.ToString()}){Path.GetExtension(doc.DocumentPath)}";
                    i++;
                }
            }
            insertedEntries.Add(EntryName);

            var file = await GetFileStream(blobFolderName, doc.DocumentPath);
            if (file != null)
                zip.AddEntry($"{EntryName}", file);
        }
        zip.Save(zipMemStream);
    }
    zipMemStream.Seek(0, 0);
    return zipMemStream.ToArray();
}
And here is the code for actually getting the file from blob storage:
public async Task<byte[]> GetFileStream(string container, string filename)
{
    var blobStorageAccount = _keyVaultService.GetSecret(new KeyVaultModel { Key = storageLocation });
    var storageAccount = CloudStorageAccount.Parse(blobStorageAccount ?? _config.Value.StorageConnection);
    var blobClient = storageAccount.CreateCloudBlobClient();
    var blobContainer = blobClient.GetContainerReference(container);
    await blobContainer.CreateIfNotExistsAsync();

    var blockBlob = blobContainer.GetBlockBlobReference(filename);
    if (await blockBlob.ExistsAsync())
    {
        using (var mStream = new MemoryStream())
        {
            await blockBlob.DownloadToStreamAsync(mStream);
            mStream.Seek(0, 0);
            return mStream.ToArray();
        }
    }
    return null;
}
The problem occurs when the program hits await blockBlob.DownloadToStreamAsync(mStream); it will sit and spin for a while and then throw an out of memory exception.
I have read a few different solutions, none of which have worked for me; the most common is to change the Platform target under properties to at least x64, whereas I am running this at x86. Another solution I could see would be to move the GetFileStream logic into the BuildZipStream method, but then I feel that method would be doing too much.
Any suggestions?
EDIT:
The problem is actually occurring when the program hits zip.Save(zipMemStream)
Your methodology here is flawed, because you do not know:
The number of files.
The size of each file.
You cannot accurately determine whether the server has enough RAM to actually house all the files in memory. What you are doing here is collecting every Azure blob file the user lists and putting it into a zip file in memory, while also downloading each file into memory. It's no wonder you're getting an Out-Of-Memory exception: even with 128 GB of RAM, if the user requests big enough files, you'll run out of memory.
Your best solution, and the most common practice for downloading and zipping multiple Azure blob files, is to use a temporary blob file.
Instead of writing to a MemoryStream, you write to a FileStream, place that zipped file onto Azure Blob Storage, and then serve the zipped blob file. Once the file is served, you remove it from blob storage.
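For illustration, here is a rough sketch of that approach, using the same WindowsAzure.Storage types and DotNetZip calls as the question (GetBlobContainer and the "downloads" container name are placeholders, and the duplicate-entry-name handling from the original code is left out for brevity):

public async Task<Uri> BuildZipBlob(string blobFolderName, IEnumerable<Document> docs)
{
    var workDir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
    Directory.CreateDirectory(workDir);
    var zipPath = Path.Combine(workDir, "download.zip");
    try
    {
        using (var zip = new Ionic.Zip.ZipFile())
        {
            foreach (var doc in docs)
            {
                // GetBlobContainer is a hypothetical helper wrapping the same
                // CloudStorageAccount/CloudBlobClient setup as GetFileStream above.
                var blockBlob = GetBlobContainer(blobFolderName).GetBlockBlobReference(doc.DocumentPath);
                if (!await blockBlob.ExistsAsync())
                    continue;

                // Download to a temp file on disk instead of a MemoryStream.
                var localPath = Path.Combine(workDir, $"{Guid.NewGuid():N}{Path.GetExtension(doc.DocumentPath)}");
                await blockBlob.DownloadToFileAsync(localPath, FileMode.Create);

                var entry = zip.AddFile(localPath, "");
                entry.FileName = $"{doc.Name}{Path.GetExtension(doc.DocumentPath)}";
            }

            // Saving to a path streams the archive to disk, not into RAM.
            zip.Save(zipPath);
        }

        // Park the finished zip in a temporary container and hand back its URI;
        // delete it (or let a lifecycle policy delete it) after it has been served.
        var zipBlob = GetBlobContainer("downloads").GetBlockBlobReference($"{Guid.NewGuid():N}.zip");
        await zipBlob.UploadFromFileAsync(zipPath);
        return zipBlob.Uri;
    }
    finally
    {
        Directory.Delete(workDir, recursive: true);
    }
}

Because both the individual files and the finished archive live on disk rather than in a MemoryStream, memory usage stays flat no matter how many files the user requests.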
Hope this helps.
Related
I'm trying to upload a file to a blob container via HTTP.
The file is received from the request like this:
public class UploadFileFunction
{
    // Own created wrapper on BlobContainerClient
    private readonly IBlobFileStorageClient _blobFileStorageClient;

    public UploadFileFunction(IBlobFileStorageClient blobFileStorageClient)
    {
        _blobFileStorageClient = blobFileStorageClient;
    }

    [FunctionName("UploadFile")]
    public async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Anonymous, "post", Route = "/equipments/{route}")]
        HttpRequest request,
        [FromRoute] string route)
    {
        IFormFile file = request.Form.Files["File"];
        if (file is null)
        {
            return new BadRequestResult();
        }

        string fileNamePath = $"{route}_{request.Query["fileName"]}_{request.Query["fileType"]}";
        BlobClient blob = _blobFileStorageClient.Container.GetBlobClient(fileNamePath);
        try
        {
            await blob.UploadAsync(file.OpenReadStream(), new BlobHttpHeaders { ContentType = file.ContentType });
        }
        catch (Exception)
        {
            return new ConflictResult();
        }

        return new OkResult();
    }
}
Then I make the request with the file.
On UploadAsync, the whole stream of the file is loaded into the process memory.
Is there some way to upload directly to the blob without loading the file into process memory?
Thank you in advance.
The best way to avoid this is not to upload your file through your own HTTP endpoint at all; asking how to keep the uploaded data out of process memory while still receiving it through your own HTTP endpoint makes no sense.
Simply use the Azure Blob Storage REST API to upload the file directly to Azure blob storage. Your own HTTP endpoint only needs to issue a Shared Access Signature (SAS) token for the upload, and the client can then upload the file directly to blob storage.
This pattern should be used for file uploads unless you have a very good reason not to. Your trigger function is only called after the HTTP runtime has finished with the HTTP request, so the trigger's HttpRequest object is allocated in process memory before it is passed to the trigger.
I also suggest block blobs if you want to upload in multiple stages.
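As a rough sketch of the SAS approach with the v12 Azure.Storage.Blobs SDK (this assumes _blobFileStorageClient.Container is a BlobContainerClient authorized with a shared key, which GenerateSasUri requires; the function name and route below are made up):

[FunctionName("GetUploadSas")]
public IActionResult GetUploadSas(
    [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "equipments/{route}/upload-sas")]
    HttpRequest request,
    [FromRoute] string route)
{
    string fileNamePath = $"{route}_{request.Query["fileName"]}_{request.Query["fileType"]}";
    BlobClient blob = _blobFileStorageClient.Container.GetBlobClient(fileNamePath);

    // Azure.Storage.Sas types: a short-lived, write-only SAS scoped to this one blob.
    var sasBuilder = new BlobSasBuilder
    {
        BlobContainerName = blob.BlobContainerName,
        BlobName = blob.Name,
        Resource = "b", // "b" = blob-level SAS
        ExpiresOn = DateTimeOffset.UtcNow.AddMinutes(15)
    };
    sasBuilder.SetPermissions(BlobSasPermissions.Create | BlobSasPermissions.Write);

    return new OkObjectResult(blob.GenerateSasUri(sasBuilder).ToString());
}

The client then issues a PUT directly against the returned URI with an x-ms-blob-type: BlockBlob header, so the file bytes never pass through the function's process memory.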
That's the default way UploadAsync works, and it is fine for small files. I ran into an out-of-memory issue with large files; the solution here is to use an append blob and append the file to it in blocks.
You will need to create the blob as an append blob, so you can keep appending to the end of it. The basic gist is:
Create an append blob
Go through the existing file and grab x MB (say 2 MB) chunks at a time
Append these chunks to the append blob until the end of file
Pseudo code, something like below:
var appendBlobClient = _blobFileStorageClient.GetAppendBlobClient(fileNamePath);
await appendBlobClient.CreateIfNotExistsAsync();
var appendBlobMaxAppendBlockBytes = appendBlobClient.AppendBlobMaxAppendBlockBytes;

using (var fileStream = file.OpenReadStream())
{
    int bytesRead;
    var buffer = new byte[appendBlobMaxAppendBlockBytes];
    while ((bytesRead = fileStream.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Wrap only the bytes actually read for this chunk.
        var chunk = new Span<byte>(buffer, 0, bytesRead).ToArray();
        using (var stream = new MemoryStream(chunk))
        {
            await appendBlobClient.AppendBlockAsync(stream);
        }
    }
}
If I stream a zip file like so:
using var zip = new ZipArchive(fileStream, ZipArchiveMode.Read);
using var sr = new StreamReader(zip.Entries[0].Open());
var line = sr.ReadLine(); //etc..
Am I streaming the zip file entry, or is the entire zip file loaded into memory and the uncompressed file then streamed from there?
It depends on how the fileStream was created. Was it created from a file on disk? If so, then ZipArchive will read from disk as it needs data. It won't put the entire thing in memory then read it. That would be incredibly inefficient.
I have a bunch of experience with this... I worked on a project where I had to unarchive 25 GB zip files. .NET's ZipArchive was very quick and very memory efficient.
You can have MemoryStreams that contain data that ZipArchive can read from, so you aren't limited to just Zip files on disk.
Here is a fairly efficient way to extract a ZipArchive:
var di = new DirectoryInfo(Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.CommonApplicationData), "MyDirectoryToExtractTo"));
var filesToExtract = _zip.Entries.Where(x =>
    !string.IsNullOrEmpty(x.Name) &&
    !x.FullName.EndsWith("/", StringComparison.Ordinal));

foreach (var x in filesToExtract)
{
    var fi = new FileInfo(Path.Combine(di.FullName, x.FullName));
    if (!fi.Directory.Exists) { fi.Directory.Create(); }

    using (var i = x.Open())
    using (var o = fi.OpenWrite())
    {
        i.CopyTo(o);
    }
}
This will extract all the files to C:\ProgramData\MyDirectoryToExtractTo\ keeping directory structure.
If you'd like to see how ZipArchive was implemented to verify, take a look here.
I have stored a lot of a user's files in my Amazon S3 storage, and I have to provide all of these files to the user on request.
For this purpose, I have implemented a lambda function that collects the paths to the user's files, creates a zip archive, and stores the archive back on S3, where the user can download it.
My code looks like:
using (var s3Client = new AmazonS3Client())
{
    using (var memoryStream = new MemoryStream())
    {
        using (var zip = new ZipArchive(memoryStream, ZipArchiveMode.Create, true))
        {
            foreach (var file in m_filePathsOnS3)
            {
                var response = await s3Client.GetObjectAsync(m_sourceBucket, file);
                var name = file.Split('/').Last();
                ZipArchiveEntry entry = zip.CreateEntry(name);
                using (Stream entryStream = entry.Open())
                {
                    await response.ResponseStream.CopyToAsync(entryStream);
                }
            }
        }

        memoryStream.Position = 0;
        var putRequest = new PutObjectRequest
        {
            BucketName = m_resultBucket,
            Key = m_archivePath,
            InputStream = memoryStream
        };
        await s3Client.PutObjectAsync(putRequest);
    }
}
But the lambda function has a limitation of 3008 MB of maximum memory allocation. So, if I understand correctly, I will have an issue when trying to build an archive larger than 3008 MB.
I looked for a way to stream and archive the files on the fly.
Currently, I see only one way: move this lambda function to an EC2 instance as a service.
You are correct. There is also a limit of 500MB of disk space, which would impact this.
Therefore, AWS Lambda is not a good use-case for creating potentially very large zip files.
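If you do move this onto an EC2 instance (or anywhere with enough local disk), a rough sketch of a disk-backed variant of the question's code could look like this: each object is streamed into a ZipArchive that writes to a temporary file, and the result is uploaded with TransferUtility, which takes care of multipart uploads for large files (the m_* field names are reused from the question):

var tempZipPath = Path.Combine(Path.GetTempPath(), $"{Guid.NewGuid():N}.zip");
using (var s3Client = new AmazonS3Client())
{
    using (var zipFileStream = File.Create(tempZipPath))
    using (var zip = new ZipArchive(zipFileStream, ZipArchiveMode.Create, leaveOpen: false))
    {
        foreach (var file in m_filePathsOnS3)
        {
            using (var response = await s3Client.GetObjectAsync(m_sourceBucket, file))
            {
                var entry = zip.CreateEntry(file.Split('/').Last());
                using (var entryStream = entry.Open())
                {
                    // Stream S3 -> zip entry -> temp file; only a small copy buffer is in memory.
                    await response.ResponseStream.CopyToAsync(entryStream);
                }
            }
        }
    }

    var transferUtility = new Amazon.S3.Transfer.TransferUtility(s3Client);
    await transferUtility.UploadAsync(tempZipPath, m_resultBucket, m_archivePath);
}
File.Delete(tempZipPath);

On Lambda itself this would still be capped by the temporary disk limit mentioned above, which is why a host with real disk space is the safer fit.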
I have a multi-volume archive stored in Azure Blob Storage that is split into a series of zips titled like this: Archive-Name.zip.001, Archive-Name.zip.002, etc., up to Archive-Name.zip.010. Each file is 250 MB and contains hundreds of PDFs.
Currently we iterate through each archive part and extract the PDFs. This works except when the last PDF in an archive part has been split across two parts; ZipFile in C# is unable to process the split file and throws an exception.
We tried reading all the archive parts into a single MemoryStream and then extracting the files, but we found the memory stream exceeds the 2 GB limit, so this method does not work either.
It is not feasible to download the archive into a machine's memory, extract it, and then upload the PDFs to a new file. The extraction needs to be done in Azure, where the program will run.
This is the code we are currently using - it is unable to handle PDFs split between two archive parts.
public static void UnzipTaxForms(TextWriter log, string type, string fiscalYear)
{
    var folderName = "folderName";
    var outPutContainer = GetContainer("containerName");
    CreateIfNotExists(outPutContainer);

    var fileItems = ListFileItems(folderName);
    fileItems = fileItems.Where(i => i.Name.Contains(".zip")).ToList();

    foreach (var file in fileItems)
    {
        using (var ziped = ZipFile.Read(GetMemoryStreamFromFile(folderName, file.Name)))
        {
            foreach (var zipEntry in ziped)
            {
                using (var outPutStream = new MemoryStream())
                {
                    zipEntry.Extract(outPutStream);
                    var blockblob = outPutContainer.GetBlockBlobReference(zipEntry.FileName);
                    outPutStream.Seek(0, SeekOrigin.Begin);
                    blockblob.UploadFromStream(outPutStream);
                }
            }
        }
    }
}
Another note. We are unable to change the way the multi-volume archive is generated. Any help would be appreciated.
This is my working code that I use to download multiple files as a zip file using the Ionic.Zip dll. The file contents are stored in a SQL database. This works if I download 1-2 files at a time, but it throws an OutOfMemory exception when I download many files, as some of the files can be very large.
The exception occurs when it is trying to write to the outputStream.
How can I improve this code, or is there a better way to download multiple files one by one rather than zipping them into one large file?
Code:
public ActionResult DownloadMultipleFiles()
{
    string connectionString = "MY DB CONNECTION STRING";
    List<Document> documents = new List<Document>();
    var query = "MY LIST OF FILES - FILE METADATA LIKE FILEID, FILENAME";
    documents = query.Query<Document>(connectionString).ToList();
    List<Document> DOCS = documents.GetRange(0, 50); // 50 FILES

    Response.Clear();
    var outputStream = new MemoryStream();
    using (var zip = new ZipFile())
    {
        zip.UseZip64WhenSaving = Zip64Option.AsNecessary; // edited
        foreach (var doc in DOCS)
        {
            byte[] content = GetFileContent(doc.FileContentId); // This method returns the file content
            zip.AddEntry(doc.FileName, content);
        }
        zip.Save(outputStream);
    }
    outputStream.Position = 0;
    return File(outputStream, "application/zip", "allFiles.zip");
}
Download the files to disc instead of to memory, then use Ionic to zip them from disc. This means you don't need to have all the files in memory at once.
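A rough sketch of that approach, reusing the question's Document and GetFileContent shapes (the temp-folder naming and cleanup strategy here are placeholders):

public ActionResult DownloadMultipleFiles()
{
    var workDir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
    Directory.CreateDirectory(workDir);
    var zipPath = Path.Combine(Path.GetTempPath(), $"{Guid.NewGuid():N}.zip");

    string connectionString = "MY DB CONNECTION STRING";
    var query = "MY LIST OF FILES - FILE METADATA LIKE FILEID, FILENAME";
    List<Document> documents = query.Query<Document>(connectionString).ToList(); // same data access as in the question

    // Write each file to disc as it is fetched, so only one file's bytes are in memory at a time.
    foreach (var doc in documents)
    {
        byte[] content = GetFileContent(doc.FileContentId);
        System.IO.File.WriteAllBytes(Path.Combine(workDir, doc.FileName), content);
    }

    // Ionic zips from disc to disc; nothing is buffered in a MemoryStream.
    using (var zip = new Ionic.Zip.ZipFile())
    {
        zip.UseZip64WhenSaving = Ionic.Zip.Zip64Option.AsNecessary;
        zip.AddFiles(Directory.GetFiles(workDir), "");
        zip.Save(zipPath);
    }

    // The framework disposes the FileStream after writing the response;
    // a scheduled cleanup job can delete workDir and zipPath afterwards.
    var zipStream = new FileStream(zipPath, FileMode.Open, FileAccess.Read);
    return File(zipStream, "application/zip", "allFiles.zip");
}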