I have a multi-volume archive stored in Azure Blob Storage that is split into a series of zips named like this: Archive-Name.zip.001, Archive-Name.zip.002, and so on up to Archive-Name.zip.010. Each file is 250 MB and contains hundreds of PDFs.
Currently we iterate through each archive part and extract the PDFs. This works except when the last PDF in an archive part has been split across two parts: ZipFile in C# is unable to process the split file and throws an exception.
We also tried reading all the archive parts into a single MemoryStream and then extracting the files, but the combined stream exceeds the 2 GB MemoryStream limit, so this method does not work either.
It is not feasible to download the archive into a machine's memory, extract it, and then upload the PDFs to a new file. The extraction needs to be done in Azure, where the program will run.
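Since reading all the parts into one MemoryStream evidently produces a valid zip (apart from the size limit), one option is a thin read-only Stream that chains the part streams in order, so the combined 2.5 GB never has to sit in a single buffer. This is only a sketch: how each part stream is opened (e.g. from the blobs) is left to the caller, and because the result is not seekable it would need a forward-only reader such as Ionic's ZipInputStream rather than ZipFile.Read, which expects to seek.
using System;
using System.Collections.Generic;
using System.IO;

// Read-only stream that presents the ordered archive parts as one logical stream.
public sealed class ConcatenatedStream : Stream
{
    private readonly Queue<Stream> _parts;

    public ConcatenatedStream(IEnumerable<Stream> parts)
    {
        _parts = new Queue<Stream>(parts);
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        while (_parts.Count > 0)
        {
            int read = _parts.Peek().Read(buffer, offset, count);
            if (read > 0) return read;
            _parts.Dequeue().Dispose(); // current part exhausted, move on to the next
        }
        return 0; // all parts consumed
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position { get => throw new NotSupportedException(); set => throw new NotSupportedException(); }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
An Ionic.Zip ZipInputStream should then be able to read entries one at a time from new ConcatenatedStream(partStreams) without the whole archive being buffered.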
This is the code we are currently using - it is unable to handle PDFs split between two archive parts.
public static void UnzipTaxForms(TextWriter log, string type, string fiscalYear)
{
    var folderName = "folderName";
    var outPutContainer = GetContainer("containerName");
    CreateIfNotExists(outPutContainer);

    var fileItems = ListFileItems(folderName);
    fileItems = fileItems.Where(i => i.Name.Contains(".zip")).ToList();

    foreach (var file in fileItems)
    {
        using (var zipped = ZipFile.Read(GetMemoryStreamFromFile(folderName, file.Name)))
        {
            foreach (var zipEntry in zipped)
            {
                using (var outPutStream = new MemoryStream())
                {
                    zipEntry.Extract(outPutStream);
                    var blockblob = outPutContainer.GetBlockBlobReference(zipEntry.FileName);
                    outPutStream.Seek(0, SeekOrigin.Begin);
                    blockblob.UploadFromStream(outPutStream);
                }
            }
        }
    }
}
Another note. We are unable to change the way the multi-volume archive is generated. Any help would be appreciated.
I am attempting to create a zip file using the Ionic.Zip library in .NET. My procedure iterates over a list of files from various sources and puts each file (file bytes, file name) into the zip file my procedure builds (as shown below).
The procedure below works great for smaller lists of files, but with larger lists it throws an OutOfMemory exception. I thought I was writing the zipped contents directly into the zip file as I build it, but since I am getting an OutOfMemory exception, it seems my procedure is loading everything into memory before saving it to disk.
This is my procedure:
using (var tempFileStream = new System.IO.FileStream(tempFileName, FileMode.OpenOrCreate))
using (var zipOutputStream = new ZipOutputStream(tempFileStream))
{
    foreach (var fileRec in dbFileRecs)
    {
        var fileToPutInZip = getFileData();
        if (fileToPutInZip != null)
        {
            var fileNameToDownload = removeIllegalChars(fileToPutInZip.FileName);
            fileNameToDownload = ensureUniqueFilename(fileNameToDownload);

            var entry = zipOutputStream.PutNextEntry(fileNameToDownload);
            using (var ms = new System.IO.MemoryStream(fileToPutInZip.FileData)) // FileData is a byte array
            {
                ms.CopyTo(zipOutputStream);
            }
            zipOutputStream.Flush();
        }
    }
    zipOutputStream.Flush();
    zipOutputStream.Close();
}
Am I doing something wrong here? How can I write directly to the zip file and not load up the whole file into memory (and avoid those OutOfMemory errors)?
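As an aside, one way to avoid materializing each file as a byte[] is to copy from a source Stream straight into the ZipOutputStream. A rough sketch, where getFileDataStream is a hypothetical variant of getFileData that returns an open, readable Stream, fileRec.FileName is assumed to exist, and the surrounding using blocks are the same as in the procedure above:
foreach (var fileRec in dbFileRecs)
{
    // Hypothetical: returns an open Stream (e.g. a FileStream) instead of a byte[].
    using (var source = getFileDataStream(fileRec))
    {
        if (source == null) continue;

        var name = ensureUniqueFilename(removeIllegalChars(fileRec.FileName));
        zipOutputStream.PutNextEntry(name);
        source.CopyTo(zipOutputStream); // copies in small buffers rather than one big array
    }
}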
I've created a zip file method in my web API which returns a zip file to the front end (Angular/TypeScript) that should be downloaded in the browser. The issue is that the file appears to have data, judging by its size in KB, but when I try to extract the files it says the archive is empty. From a bit of research this is most likely down to the file being corrupt, but I want to know where this is going wrong. Here's my code:
WebApi:
I won't show the controller, as it basically just takes the inputs and passes them to the method. The DownloadFileResult objects passed in each have a byte[] in their File property.
public FileContentResult CreateZipFile(IEnumerable<DownloadFileResult> files)
{
    using (var compressedFileStream = new MemoryStream())
    {
        using (var zipArchive = new ZipArchive(compressedFileStream, ZipArchiveMode.Update))
        {
            foreach (var file in files)
            {
                var zipEntry = zipArchive.CreateEntry(file.FileName);
                using (var entryStream = zipEntry.Open())
                {
                    entryStream.Write(file.File, 0, file.File.Length);
                }
            }
        }
        return new FileContentResult(compressedFileStream.ToArray(), "application/zip");
    }
}
This appears to work in that it generates a result with data. Here's my front end code:
let fileData = this._filePaths;
this._fileStorageProxy.downloadFile(Object.entries(fileData).map(([key, val]) => val), this._pId).subscribe(result => {
    let data = result.data.fileContents;
    const blob = new Blob([data], {
        type: 'application/zip'
    });
    const url = window.URL.createObjectURL(blob);
    window.open(url);
});
The front end code then shows a zip file being downloaded, which, as I say, appears to have data due to its size, but I can't extract it.
Update
I tried writing the compressedFileStream to a file on my local machine, and I can see that it creates a zip file and I can extract the files within it. This leads me to believe something is wrong with the front end, or at least with what the front end code is receiving.
2nd Update
Ok, it turns out this is specific to how we do things here. The request goes through our platform, but for downloads it can only handle a BinaryTransferObject, and I needed to hit a different endpoint. With a tweak to no longer return a FileContentResult, hitting the right endpoint, and making the URL a simple <a href> link, it's now working.
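Incidentally, when building a zip into a MemoryStream on the server as CreateZipFile does above, the pattern usually suggested is ZipArchiveMode.Create with leaveOpen: true, disposing the archive before reading the bytes out. A minimal sketch (unrelated to the endpoint issue, and reusing the DownloadFileResult type from the question):
public byte[] CreateZipBytes(IEnumerable<DownloadFileResult> files)
{
    using (var ms = new MemoryStream())
    {
        // Create mode writes entries sequentially; leaveOpen keeps ms usable after the archive is disposed.
        using (var archive = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
        {
            foreach (var file in files)
            {
                var entry = archive.CreateEntry(file.FileName, CompressionLevel.Optimal);
                using (var entryStream = entry.Open())
                {
                    entryStream.Write(file.File, 0, file.File.Length);
                }
            }
        }
        // The archive must be disposed first so the central directory is written, then the bytes can be read.
        return ms.ToArray();
    }
}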
If I stream a zip file like so:
using var zip = new ZipArchive(fileStream, ZipArchiveMode.Read);
using var sr = new StreamReader(zip.Entries[0].Open());
var line = sr.ReadLine(); //etc..
Am I streaming the zip file entry or is it loading the entire zip file into memory then I am streaming the uncompressed file?
It depends on how the fileStream was created. Was it created from a file on disk? If so, then ZipArchive will read from disk as it needs data. It won't put the entire thing in memory then read it. That would be incredibly inefficient.
I have a bunch of experience with this... I worked on a project where I had to unarchive 25 GB zip files. .NET's ZipArchive was very quick and very memory efficient.
You can have MemoryStreams that contain data that ZipArchive can read from, so you aren't limited to just Zip files on disk.
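For example, a minimal sketch of both cases (the path and downloadedZipBytes are placeholders):
// Backed by a file on disk: data is read on demand as entries are opened.
using (var fileStream = File.OpenRead(@"C:\temp\archive.zip"))
using (var zip = new ZipArchive(fileStream, ZipArchiveMode.Read))
using (var reader = new StreamReader(zip.Entries[0].Open()))
{
    var firstLine = reader.ReadLine();
}

// Backed by bytes already in memory (e.g. downloaded earlier).
using (var memoryStream = new MemoryStream(downloadedZipBytes))
using (var zip = new ZipArchive(memoryStream, ZipArchiveMode.Read))
using (var reader = new StreamReader(zip.Entries[0].Open()))
{
    var firstLine = reader.ReadLine();
}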
Here is a slightly more efficient way to unzip a ZipArchive:
var di = new DirectoryInfo(Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.CommonApplicationData), "MyDirectoryToExtractTo"));

var filesToExtract = _zip.Entries.Where(x =>
    !string.IsNullOrEmpty(x.Name) &&
    !x.FullName.EndsWith("/", StringComparison.Ordinal));

foreach (var x in filesToExtract)
{
    var fi = new FileInfo(Path.Combine(di.FullName, x.FullName));
    if (!fi.Directory.Exists) { fi.Directory.Create(); }

    using (var i = x.Open())
    using (var o = fi.OpenWrite())
    {
        i.CopyTo(o);
    }
}
This will extract all the files to C:\ProgramData\MyDirectoryToExtractTo\ keeping directory structure.
If you'd like to see how ZipArchive was implemented to verify, take a look here.
I have a text box that a user can submit a list of document ids to download those files zipped up from an Azure blob.
Here is how the code currently works: we build a zip memory stream, and then for each document id submitted we build another memory stream, get the file into that stream, and add it to the zip file. The issue is that when we are building the memory stream for a file larger than about 180 MB, the program throws an out of memory exception.
Here is the code:
public async Task<byte[]> BuildZipStream(string valueDataUploadContainerName, IEnumerable<Document> docs)
{
    var zipMemStream = new MemoryStream();
    using (Ionic.Zip.ZipFile zip = new Ionic.Zip.ZipFile())
    {
        zip.Name = System.IO.Path.GetTempFileName();
        var insertedEntries = new List<string>();

        foreach (var doc in docs)
        {
            var EntryName = $"{doc.Name}{Path.GetExtension(doc.DocumentPath)}";
            if (insertedEntries.Contains(EntryName))
            {
                EntryName = $"{doc.Name} (1){Path.GetExtension(doc.DocumentPath)}";
                var i = 1;
                while (insertedEntries.Contains(EntryName))
                {
                    EntryName = $"{doc.Name} ({i.ToString()}){Path.GetExtension(doc.DocumentPath)}";
                    i++;
                }
            }
            insertedEntries.Add(EntryName);

            var file = await GetFileStream(blobFolderName, doc.DocumentPath);
            if (file != null)
                zip.AddEntry($"{EntryName}", file);
        }
        zip.Save(zipMemStream);
    }
    zipMemStream.Seek(0, 0);
    return zipMemStream.ToArray();
}
And here is the code for actually getting the file from blob storage:
public async Task<byte[]> GetFileStream(string container, string filename)
{
    var blobStorageAccount = _keyVaultService.GetSecret(new KeyVaultModel { Key = storageLocation });
    var storageAccount = CloudStorageAccount.Parse(blobStorageAccount ?? _config.Value.StorageConnection);
    var blobClient = storageAccount.CreateCloudBlobClient();
    var blobContainer = blobClient.GetContainerReference(container);
    await blobContainer.CreateIfNotExistsAsync();

    var blockBlob = blobContainer.GetBlockBlobReference(filename);
    if (blockBlob.Exists())
    {
        using (var mStream = new MemoryStream())
        {
            await blockBlob.DownloadToStreamAsync(mStream);
            mStream.Seek(0, 0);
            return mStream.ToArray();
        }
    }
    return null; // blob not found; the caller checks for null
}
The problem occurs when the program hits await blockBlob.DownloadToStreamAsync(mStream); it will sit and spin for a while and then throw an out of memory exception.
I have read a few different solutions, none of which have worked for me, the most common being to change the Platform target under properties to at least x64, and I am running this at x86. Another solution would be to move the GetFileStream logic into the BuildZipStream method, but then I feel that method would be doing too much.
Any suggestions?
EDIT:
The problem is actually occurring when the program hits zip.Save(zipMemStream)
Your methodology here is flawed, because you do not know:
The number of files.
The size of each file.
You cannot accurately determine whether the server will have enough RAM to house all the files in memory. What you are doing here is collecting every Azure blob file the user lists and putting it into a zip file IN MEMORY, while also downloading each file IN MEMORY. It's no wonder you're getting an out-of-memory exception: even with 128 GB of RAM, if the user requests big enough files, you'll run out of memory.
Your best solution, and the most common practice for downloading and zipping multiple Azure blob files, is to use a temporary blob file.
Instead of writing to a MemoryStream, you write to a FileStream, place that zipped file into Azure Blob Storage, and then serve that zipped blob file. Once the file has been served, you remove it from blob storage.
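A rough sketch of that approach inside an async method, sticking to the Ionic.Zip and storage types already used in the question; zipContainer (a container for the temporary zip blobs) and the serving/cleanup step are placeholders:
// Build the zip on local temp disk instead of in a MemoryStream.
var tempPath = Path.GetTempFileName();
try
{
    using (var zip = new Ionic.Zip.ZipFile())
    {
        foreach (var doc in docs)
        {
            // Let Ionic pull each blob's read stream while saving, instead of buffering byte[]s.
            var blob = blobContainer.GetBlockBlobReference(doc.DocumentPath);
            zip.AddEntry($"{doc.Name}{Path.GetExtension(doc.DocumentPath)}", await blob.OpenReadAsync());
        }
        zip.Save(tempPath); // entry streams are read as the file on disk is written
    }

    // Upload the finished zip to a blob and serve that blob (e.g. via a SAS URL), then delete it later.
    var zipBlob = zipContainer.GetBlockBlobReference($"downloads/{Guid.NewGuid()}.zip");
    using (var fileStream = File.OpenRead(tempPath))
    {
        await zipBlob.UploadFromStreamAsync(fileStream);
    }
}
finally
{
    File.Delete(tempPath);
}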
Hope this helps.
This is my working code that I use to download multiple files as a zip file using the Ionic.Zip dll. File contents are stored in a SQL database. This program works if I try to download 1-2 files at a time, but throws an OutOfMemory exception if I try to download multiple files, as some of the files may be very large.
The exception occurs when it tries to write to outputStream.
How can I improve this code to download multiple files, or is there a better way to download multiple files one by one rather than zipping them into one large file?
Code:
public ActionResult DownloadMultipleFiles()
{
    string connectionString = "MY DB CONNECTION STRING";
    List<Document> documents = new List<Document>();
    var query = "MY LIST OF FILES - FILE METADATA LIKE FILEID, FILENAME";
    documents = query.Query<Document>(connectionString).ToList();
    List<Document> DOCS = documents.GetRange(0, 50); // 50 FILES

    Response.Clear();
    var outputStream = new MemoryStream();

    using (var zip = new ZipFile())
    {
        foreach (var doc in DOCS)
        {
            Stream stream = new MemoryStream();
            byte[] content = GetFileContent(doc.FileContentId); // This method returns file content
            stream.Write(content, 0, content.Length);

            zip.UseZip64WhenSaving = Zip64Option.AsNecessary; // edited
            zip.AddEntry(doc.FileName, content);
        }
        zip.Save(outputStream);
    }
    return File(outputStream, "application/zip", "allFiles.zip");
}
Download the files to disk instead of to memory, then use Ionic to zip them from disk. This way you don't need to have all the files in memory at once.
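A minimal sketch of that suggestion, reusing the names from the question; GetDocuments() stands in for the metadata query, and temp-folder cleanup and duplicate file names are left out for brevity:
public ActionResult DownloadMultipleFiles()
{
    // Per-request temp folder so only one file at a time has to pass through memory.
    var workDir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString());
    Directory.CreateDirectory(workDir);
    var zipPath = Path.Combine(workDir, "allFiles.zip");

    var documents = GetDocuments(); // placeholder for the file-metadata query from the question

    using (var zip = new ZipFile())
    {
        zip.UseZip64WhenSaving = Zip64Option.AsNecessary;
        foreach (var doc in documents)
        {
            // Write each file to disk, then let Ionic read it back from disk at Save time.
            var filePath = Path.Combine(workDir, doc.FileName);
            System.IO.File.WriteAllBytes(filePath, GetFileContent(doc.FileContentId));
            zip.AddFile(filePath, "");
        }
        zip.Save(zipPath); // streams from the files on disk, not from one big MemoryStream
    }

    // FilePathResult streams the zip from disk to the response.
    return File(zipPath, "application/zip", "allFiles.zip");
}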