If I stream a zip file like so:
using var zip = new ZipArchive(fileStream, ZipArchiveMode.Read);
using var sr = new StreamReader(zip.Entries[0].Open());
var line = sr.ReadLine(); //etc..
Am I streaming the zip file entry, or is the entire zip file loaded into memory first and then I am streaming the uncompressed file?
It depends on how the fileStream was created. Was it created from a file on disk? If so, then ZipArchive will read from disk as it needs data; it won't load the entire thing into memory and then read it. That would be incredibly inefficient.
I have a fair amount of experience with this: I worked on a project where I had to unarchive 25 GB zip files. .NET's ZipArchive was very quick and very memory efficient.
You can have MemoryStreams that contain data that ZipArchive can read from, so you aren't limited to just Zip files on disk.
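For example, here is a minimal sketch of reading an archive held entirely in memory (zipBytes is a hypothetical byte[] that already contains a complete zip file):
using var ms = new MemoryStream(zipBytes); // zipBytes: hypothetical in-memory zip data
using var zip = new ZipArchive(ms, ZipArchiveMode.Read);
using var sr = new StreamReader(zip.Entries[0].Open());
Console.WriteLine(sr.ReadLine()); // the entry is decompressed as it is read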
Here is a reasonably efficient way to unzip a ZipArchive:
var di = new DirectoryInfo(Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.CommonApplicationData), "MyDirectoryToExtractTo"));

var filesToExtract = _zip.Entries.Where(x =>
    !string.IsNullOrEmpty(x.Name) &&
    !x.FullName.EndsWith("/", StringComparison.Ordinal));

foreach (var x in filesToExtract)
{
    var fi = new FileInfo(Path.Combine(di.FullName, x.FullName));
    if (!fi.Directory.Exists) { fi.Directory.Create(); }

    using (var i = x.Open())
    using (var o = fi.OpenWrite())
    {
        i.CopyTo(o);
    }
}
This will extract all the files to C:\ProgramData\MyDirectoryToExtractTo\ keeping directory structure.
If you'd like to see how ZipArchive was implemented to verify, take a look here.
Related
I am trying to create a zip from a list of files in parallel and stream it to the client.
I have working code where I iterate over the files sequentially, but I want the zipping to happen in parallel instead (multiple files, each >100 MB).
using ZipArchive zipArchive = new(Response.BodyWriter.AsStream(), ZipArchiveMode.Create, leaveOpen: false);

for (int i = 0; i < arrLocalFilesPath.Length; i++) // iterate over files
{
    string strFilePath = arrLocalFilesPath[i]; // file path from the list
    string strFileName = Path.GetFileName(strFilePath);

    ZipArchiveEntry zipEntry = zipArchive.CreateEntry(strFileName, CompressionLevel.Optimal);
    using Stream zipStream = zipEntry.Open();

    using FileStream fileStream = System.IO.File.Open(strFilePath, FileMode.Open, FileAccess.Read);
    fileStream.CopyTo(zipStream);
}

return new EmptyResult();
Parallel.For and Parallel.ForEach do not work with ZipArchive.
Since ZipArchive is not thread-safe, I am trying to use DotNetZip to accomplish this task.
I looked at the docs, and here's what I have so far using DotNetZip:
using Stream streamResponseBody = Response.BodyWriter.AsStream();

Parallel.For(0, arrLocalFilesPath.Length, i =>
{
    string strFilePath = arrLocalFilesPath[i]; // file path from the list
    string strFileName = Path.GetFileName(strFilePath);
    string strCompressedOutputFile = strFilePath + ".compressed";

    byte[] arrBuffer = new byte[8192]; //[4096];
    int n = -1;

    using FileStream input = System.IO.File.OpenRead(strFilePath);
    using FileStream raw = new(strCompressedOutputFile, FileMode.Create, FileAccess.ReadWrite);
    using Stream compressor = new ParallelDeflateOutputStream(raw);

    while ((n = input.Read(arrBuffer, 0, arrBuffer.Length)) != 0)
    {
        compressor.Write(arrBuffer, 0, n);
    }

    input.CopyTo(streamResponseBody);
});

return new EmptyResult();
return new EmptyResult();
However, this doesn't zip the files and send them to the client (it only creates local .compressed files on the server).
Using MemoryStream or creating a local zip file is out of the question and not what I am looking for.
The server should seamlessly stream the bytes it reads from a file, zip them on the fly, and send them to the client in chunks (as in my ZipArchive code), but with the added benefit of reading those files in parallel and creating a zip of them.
I know that parallelism is usually not optimal for I/O (and sometimes makes it a bit worse), but zipping multiple big files in parallel should be faster in this case.
I also tried to use SharpZipLib without success.
Using any other library is fine as long as it reads and streams files to the client seamlessly without impacting memory.
Any help is appreciated.
If these files are on the same drive, there won't be any speed-up. Parallelization helps with compressing/decompressing the data, but the disk I/O operations cannot be done in parallel.
Assuming the files are not on the same drive and there is a chance to speed this process up...
Are you sure Stream.CopyTo() is thread-safe? Check the docs, or use a single thread, or put a lock around it.
EDIT:
I've checked my old code, where I was packing a huge amount of data into a zip file using ZipArchive. I did it in parallel, but there was no I/O read involved.
You can use ZipArchive with Parallel.ForEach, but you need a lock:
//create the zip into the stream
using (ZipArchive zipArchive = new ZipArchive(zipFS, ZipArchiveMode.Update, false))
{
    //use Parallel.ForEach, but not for the IO read operation!
    Parallel.ForEach(listOfFiles, filename =>
    {
        //prepare the entry's data in memory
        MemoryStream ms = new MemoryStream();
        /*fill the memory stream here - I did another packing with BZip2OutputStream,
          because the zip was packed without compression to speed up random decompression */

        //only one thread at a time may touch the zip, so lock on the shared archive
        lock (zipArchive)
        {
            //create a file entry (CreateEntry mutates the archive, so it stays inside the lock)
            ZipArchiveEntry zipFileEntry = zipArchive.CreateEntry(filename);

            //open the entry stream for writing
            using (Stream zipEntryStream = zipFileEntry.Open())
            {
                ms.Position = 0; // rewind the stream
                //from ICSharpCode.SharpZipLib.Core: copy the memory stream data into the zip entry
                StreamUtils.Copy(ms, zipEntryStream, new byte[4096]);
            }
        }
    });
}
Anyway, if you need to read the files first, that is your performance bottleneck. You won't gain a lot (if anything) from a parallel approach here.
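That said, here is a minimal sketch (not a drop-in implementation) of how the same lock pattern could be applied to the streaming scenario from the question; arrLocalFilesPath and Response come from the original code, and it assumes each file fits in memory. Note that the entry is created and compressed under the lock, so the only parallel part is the file read, which is exactly the bottleneck described above:

using var zipArchive = new ZipArchive(Response.BodyWriter.AsStream(), ZipArchiveMode.Create, leaveOpen: false);

Parallel.ForEach(arrLocalFilesPath, path =>
{
    // Read each file in parallel (the only parallel part of this sketch).
    byte[] data = System.IO.File.ReadAllBytes(path); // assumes the file fits in memory

    // Only one thread at a time may create and write a zip entry.
    lock (zipArchive)
    {
        ZipArchiveEntry entry = zipArchive.CreateEntry(Path.GetFileName(path), CompressionLevel.Optimal);
        using Stream entryStream = entry.Open();
        entryStream.Write(data, 0, data.Length);
    }
});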
I have stored a lot of a user's files in my Amazon S3 storage. I have to provide all of these files to the user on request.
For this purpose, I have implemented a Lambda function that collects the paths to the user's files, creates a zip archive, and stores that archive back on S3, where the user can download it.
My code looks like this:
using (var s3Client = new AmazonS3Client())
{
    using (var memoryStream = new MemoryStream())
    {
        using (var zip = new ZipArchive(memoryStream, ZipArchiveMode.Create, true))
        {
            foreach (var file in m_filePathsOnS3)
            {
                var response = await s3Client.GetObjectAsync(m_sourceBucket, file);
                var name = file.Split('/').Last();

                ZipArchiveEntry entry = zip.CreateEntry(name);
                using (Stream entryStream = entry.Open())
                {
                    await response.ResponseStream.CopyToAsync(entryStream);
                }
            }
        }

        memoryStream.Position = 0;

        var putRequest = new PutObjectRequest
        {
            BucketName = m_resultBucket,
            Key = m_archivePath,
            InputStream = memoryStream
        };
        await s3Client.PutObjectAsync(putRequest);
    }
}
But a Lambda function is limited to 3008 MB of memory, so if I understand correctly, I will have a problem when trying to build an archive larger than 3008 MB.
I looked for a way to stream and archive the files on the fly.
Currently, I see only one way: move this Lambda function to an EC2 instance running as a service.
You are correct. There is also a limit of 500MB of disk space, which would impact this.
Therefore, AWS Lambda is not a good use-case for creating potentially very large zip files.
I have a multi-volume archive stored in Azure Blob Storage that is split into a series of zips named like this: Archive-Name.zip.001, Archive-Name.zip.002, etc., up to Archive-Name.zip.010. Each file is 250 MB and contains hundreds of PDFs.
Currently we are iterating through each archive part and extracting the PDFs. This works except when the last PDF in an archive part has been split across two parts; ZipFile in C# is unable to process the split file and throws an exception.
We tried reading all the archive parts into a single MemoryStream and then extracting the files, but we found the memory stream exceeds 2 GB, which is the limit, so this method does not work either.
It is not feasible to download the archive into a machine's memory, extract it, and then upload the PDFs to a new file. The extraction needs to be done in Azure, where the program will run.
This is the code we are currently using; it is unable to handle PDFs split between two archive parts.
public static void UnzipTaxForms(TextWriter log, string type, string fiscalYear)
{
    var folderName = "folderName";
    var outPutContainer = GetContainer("containerName");
    CreateIfNotExists(outPutContainer);

    var fileItems = ListFileItems(folderName);
    fileItems = fileItems.Where(i => i.Name.Contains(".zip")).ToList();

    foreach (var file in fileItems)
    {
        using (var ziped = ZipFile.Read(GetMemoryStreamFromFile(folderName, file.Name)))
        {
            foreach (var zipEntry in ziped)
            {
                using (var outPutStream = new MemoryStream())
                {
                    zipEntry.Extract(outPutStream);

                    var blockblob = outPutContainer.GetBlockBlobReference(zipEntry.FileName);
                    outPutStream.Seek(0, SeekOrigin.Begin);
                    blockblob.UploadFromStream(outPutStream);
                }
            }
        }
    }
}
One more note: we are unable to change the way the multi-volume archive is generated. Any help would be appreciated.
I'm a complete beginner to dealing with streams or anything like this, so if you see anything obviously wrong... that's why. I have some files stored on azure. I need to take the files, zip them up, and return the zip.
I've tested this with a 1 GB file, and although it works, it ends up using 2.5 GB of memory. Memory usage spikes while the last line runs. I'm not sure why this is loading everything into memory, so I'm not quite sure what I'm supposed to do to prevent that from happening. What's the correct way to do it? The only thing I can think of is to specify buffer sizes somewhere, but everywhere I've seen that's possible has a small default.
FileStream zipToOpen = new FileStream("test.zip", FileMode.Create);
ZipArchive archive = new ZipArchive(zipToOpen, ZipArchiveMode.Update, true);
ZipArchiveEntry zipEntry = archive.CreateEntry("entryName", CompressionLevel.Optimal);
// ... some azure code
file = await dir.GetFileReference(fileName);
fileContents = await file.OpenReadAsync().ConfigureAwait(false);
await fileContents.CopyToAsync(zipEntry.Open()).ConfigureAwait(false);
Just create the archive as:
ZipArchive archive = new ZipArchive(zipToOpen, ZipArchiveMode.Create);
Your memory consumption will drop to a minimum (in my test case it dropped from 900 MB to 36 MB)...
It seems the problem is related to ZipArchiveMode.Update.
void Zip(IEnumerable<string> files, Stream outputStream)
{
    using (var zip = new System.IO.Compression.ZipArchive(outputStream, System.IO.Compression.ZipArchiveMode.Create))
    {
        foreach (var file in files.Select(f => new FileInfo(f)))
        {
            var entry = zip.CreateEntry(file.Name, System.IO.Compression.CompressionLevel.Fastest);
            using (var s = entry.Open())
            {
                using (var f = File.Open(file.FullName, FileMode.Open, FileAccess.Read, FileShare.Read))
                {
                    f.CopyTo(s);
                }
            }
        }
    }
}
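For reference, a hypothetical usage of the helper above, streaming the archive straight to a file on disk (the file names are made up):

using (var output = new FileStream("output.zip", FileMode.Create))
{
    Zip(new[] { @"C:\data\a.txt", @"C:\data\b.txt" }, output);
}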
I am able to compress a folder of up to 1 GB, but I need to compress more than 10 GB.
string filePath = @"C:\Work"; // path of the source folder
private void compressFile(string filePath)
{
    using (ZipFile zip = new ZipFile())
    {
        zip.AddDirectory(Path.Combine(filePath, "Demo"));

        if (File.Exists(Path.Combine(filePath, "Demo.zip")))
        {
            File.Delete(Path.Combine(filePath, "Demo.zip"));
            zip.CompressionLevel = Ionic.Zlib.CompressionLevel.BestCompression;
            zip.Save(Path.Combine(filePath, "Demo.zip"));
        }

        zip.Save(Path.Combine(filePath, "Demo.zip"));
    }
}
I'm assuming the issue here is out-of-memory due to the fact that everything is in memory until the zip.Save call.
Suggestion: don't use DotNetZip. If you use System.IO.Compression.ZipArchive instead, you start by giving it an output stream (in the constructor). Make that a FileStream and you should be set, without it needing to buffer everything in memory first.
You would need to use ZipArchiveMode.Create.
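A minimal sketch of that suggestion, assuming the same C:\Work\Demo layout as the question (Path.GetRelativePath requires .NET Core / .NET 5+):

// Stream the zip straight to disk instead of building it in memory.
string sourceDir = @"C:\Work\Demo";
string zipPath = @"C:\Work\Demo.zip";

using (var output = new FileStream(zipPath, FileMode.Create))
using (var archive = new ZipArchive(output, ZipArchiveMode.Create))
{
    foreach (var file in Directory.EnumerateFiles(sourceDir, "*", SearchOption.AllDirectories))
    {
        // Store each file under its path relative to the source folder.
        string entryName = Path.GetRelativePath(sourceDir, file);
        archive.CreateEntryFromFile(file, entryName, CompressionLevel.Optimal);
    }
}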