I am decompressing .gz files in memory using the SevenZipSharp library in C# and have run into strange behavior: for a file 2-3 times the usual size, decompression took disproportionately longer. More specifically, my .gz files average around 40 MB (700-800 MB decompressed) and normally decompress in tens of seconds at most, but one specific 90 MB .gz file (1.6 GB decompressed) took more than half an hour. Each .gz file was originally compressed from a single .txt file via 7-Zip. Here is the code:
for (int i = 0; i < fileNames.Length; i++)
{
    using (FileStream fs = File.OpenRead(fileNames[i]))
    using (var sze = new SevenZip.SevenZipExtractor(fs))
    using (var mem = new MemoryStream())
    {
        sze.ExtractFile(0, mem);   // decompress the single entry into memory
        mem.Position = 0;          // rewind before reading
        using (StreamReader sr = new StreamReader(mem))
        {
            // do something
        }
    }
}
Any idea why the decompression time exploded here? Is this just the overhead of the MemoryStream resizing itself as it grows?
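One quick way to test the resizing hypothesis is to pre-size the MemoryStream so it never has to grow; a minimal sketch based on the loop above (the capacity value is just an illustrative estimate of the decompressed size, not something from the original code):

// Pre-allocate roughly the expected decompressed size so MemoryStream never has
// to double-and-copy its internal buffer while ~1.6 GB is written into it.
using (var mem = new MemoryStream(capacity: 1_700_000_000))   // illustrative estimate
{
    sze.ExtractFile(0, mem);
    mem.Position = 0;   // rewind before reading
    using (var sr = new StreamReader(mem))
    {
        // do something
    }
}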
I am trying to create a zip from a list of files in parallel and stream it to the client.
I have working code that iterates over the files sequentially, but I want the files to be zipped in parallel instead (multiple files of >100 MB each).
using ZipArchive zipArchive = new(Response.BodyWriter.AsStream(), ZipArchiveMode.Create, leaveOpen: false);
for (int i = 0; i < arrLocalFilesPath.Length; i++) // iterate over files
{
string strFilePath = arrLocalFilesPath[i]; // list of files path
string strFileName = Path.GetFileName(strFilePath);
ZipArchiveEntry zipEntry = zipArchive.CreateEntry(strFileName, CompressionLevel.Optimal);
using Stream zipStream = zipEntry.Open();
using FileStream fileStream = System.IO.File.Open(strFilePath, FileMode.Open, FileAccess.Read);
fileStream.CopyTo(zipStream);
}
return new EmptyResult();
Parallel.For and Parallel.ForEach do not work with ZipArchive, since ZipArchive is not thread safe, so I am trying to use DotNetZip to accomplish this task.
I looked at the docs, and here's what I have so far using DotNetZip:
using Stream streamResponseBody = Response.BodyWriter.AsStream();
Parallel.For(0, arrLocalFilesPath.Length, i =>
{
string strFilePath = arrLocalFilesPath[i]; // list of files path
string strFileName = Path.GetFileName(strFilePath);
string strCompressedOutputFile = strFilePath + ".compressed";
byte[] arrBuffer = new byte[8192]; //[4096];
int n = -1;
using FileStream input = System.IO.File.OpenRead(strFilePath);
using FileStream raw = new(strCompressedOutputFile, FileMode.Create, FileAccess.ReadWrite);
using Stream compressor = new ParallelDeflateOutputStream(raw);
while ((n = input.Read(arrBuffer, 0, arrBuffer.Length)) != 0)
{
compressor.Write(arrBuffer, 0, n);
}
input.CopyTo(streamResponseBody);
});
return new EmptyResult();
However, this doesn't zip the files and send them to the client (it only creates local compressed files on the server).
Using a MemoryStream or creating a local zip file is out of the question and not what I am looking for.
The server should seamlessly stream the bytes it reads from each file, zip them on the fly, and send them to the client in chunks (as in my ZipArchive version), but with the added benefit of reading those files in parallel and creating a zip from them.
I know that parallelism is usually not optimal for I/O (and sometimes a bit worse), but zipping multiple big files in parallel should be faster in this case.
I also tried to use SharpZipLib, without success.
Using any other library is fine as long as it reads and streams the files to the client seamlessly without a heavy memory footprint.
Any help is appreciated.
If these files are on the same drive there won't be any speed-up. Parallelization is used to compress/decompress the data, but the disk I/O operations cannot be done in parallel.
Assuming the files are not on the same drive and there is a chance to speed this process up...
Are you sure Stream.CopyTo() is thread safe? Either check the docs, use a single thread, or put a lock around it.
EDIT:
I've checked my old code, where I was packing a huge amount of data into a zip file using ZipArchive. I did it in parallel, but there was no I/O read there.
You can use ZipArchive with Parallel.For, but you need a lock:
//create zip into stream
using (ZipArchive zipArchive = new ZipArchive(zipFS, ZipArchiveMode.Update, false))
{
//use parallel foreach instead of parallel, but not for IO read operation!
Parallel.ForEach(listOfFiles, filename =>
{
//create a file entry
ZipArchiveEntry zipFileEntry = zipArchive.CreateEntry(filename);
//prepare memory for the entry
MemoryStream ms = new MemoryStream();
/*fill the memory stream here - I did another packing with BZip2OutputStream, because the zip was packed without compression to speed up random decompression */
//only one thread can write to zip!
lock (zipFileEntry)
{
//open stream for writing
using (Stream zipEntryStream = zipFileEntry.Open())
{
ms.Position = 0; // rewind the stream
StreamUtils.Copy(ms, zipEntryStream, new byte[4096]); //from ICSharpCode.SharpZipLib.Core, copy memory stream data into zip entry with packing.
}
}
}
}
Anyway, if you need to read the files first, that is your performance bottleneck. You won't gain much (if anything) from a parallel approach here.
If I stream a zip file like so:
using var zip = new ZipArchive(fileStream, ZipArchiveMode.Read);
using var sr = new StreamReader(zip.Entries[0].Open());
var line = sr.ReadLine(); //etc..
Am I streaming the zip file entry, or is it loading the entire zip file into memory and then streaming the uncompressed file?
It depends on how the fileStream was created. Was it created from a file on disk? If so, then ZipArchive will read from disk as it needs data. It won't put the entire thing in memory and then read it; that would be incredibly inefficient.
I have a fair amount of experience with this: I worked on a project where I had to unarchive 25 GB zip files. .NET's ZipArchive was very fast and very memory efficient.
You can also have MemoryStreams containing data that ZipArchive can read from, so you aren't limited to zip files on disk.
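For example, a minimal sketch (GetZipBytesFromSomewhere is a hypothetical placeholder for zip data you already hold in memory; assumes using System.IO and System.IO.Compression):

byte[] zipBytes = GetZipBytesFromSomewhere();   // hypothetical: any zip data already in memory
using (var ms = new MemoryStream(zipBytes))
using (var zip = new ZipArchive(ms, ZipArchiveMode.Read))
using (var sr = new StreamReader(zip.Entries[0].Open()))
{
    // Entries are still decompressed on demand; only the compressed bytes sit in the MemoryStream.
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // process line
    }
}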
Here is a reasonably efficient way to extract a ZipArchive:
// Destination root: C:\ProgramData\MyDirectoryToExtractTo
var di = new DirectoryInfo(Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.CommonApplicationData), "MyDirectoryToExtractTo"));

// Skip directory entries (empty Name or FullName ending in "/")
var filesToExtract = _zip.Entries.Where(x =>
    !string.IsNullOrEmpty(x.Name) &&
    !x.FullName.EndsWith("/", StringComparison.Ordinal));

foreach (var x in filesToExtract)
{
    var fi = new FileInfo(Path.Combine(di.FullName, x.FullName));
    if (!fi.Directory.Exists) { fi.Directory.Create(); }   // recreate the entry's folder structure
    using (var i = x.Open())           // decompressed entry stream
    using (var o = fi.OpenWrite())     // destination file
    {
        i.CopyTo(o);                   // stream copy, no full buffering
    }
}
This will extract all the files to C:\ProgramData\MyDirectoryToExtractTo\, preserving the directory structure.
If you'd like to see how ZipArchive was implemented to verify, take a look here.
I'm a complete beginner when it comes to streams or anything like this, so if you see anything obviously wrong... that's why. I have some files stored on Azure. I need to take the files, zip them up, and return the zip.
I've tested this with a 1 GB file, and although it works, it ends up using 2.5 GB of memory. Memory usage spikes between the time the last line starts and when it completes. I'm not sure why this is loading everything into memory, so I'm not quite sure what I'm supposed to do to prevent that from happening. What's the correct way to do it? The only thing I can think of is to specify buffer sizes somewhere, but every place I've seen that allows it already has a small default.
FileStream zipToOpen = new FileStream("test.zip", FileMode.Create);
ZipArchive archive = new ZipArchive(zipToOpen, ZipArchiveMode.Update, true);
ZipArchiveEntry zipEntry = archive.CreateEntry("entryName", CompressionLevel.Optimal);
// ... some azure code
file = await dir.GetFileReference(fileName);
fileContents = await file.OpenReadAsync().ConfigureAwait(false);
await fileContents.CopyToAsync(zipEntry.Open()).ConfigureAwait(false);
Just create the archive as:
ZipArchive archive = new ZipArchive(zipToOpen, ZipArchiveMode.Create);
Your memory consumption will drop to a minimum (in my test case it dropped from 900 MB to 36 MB)...
It seems the problem is related to ZipArchiveMode.Update.
void Zip(IEnumerable<string> files, Stream outputStream)
{
    // Create mode writes entries straight to outputStream; nothing is buffered whole in memory.
    using (var zip = new System.IO.Compression.ZipArchive(outputStream, System.IO.Compression.ZipArchiveMode.Create))
    {
        foreach (var file in files.Select(f => new FileInfo(f)))
        {
            var entry = zip.CreateEntry(file.Name, System.IO.Compression.CompressionLevel.Fastest);
            using (var s = entry.Open())
            using (var f = File.Open(file.FullName, FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                f.CopyTo(s);   // stream each file directly into its zip entry
            }
        }
    }
}
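For example, a minimal usage sketch for the method above (the folder and output path are hypothetical; the destination stream could just as well be Response.BodyWriter.AsStream()):

var files = Directory.EnumerateFiles(@"C:\SomeFolder");   // hypothetical source folder
using (var output = File.Create(@"C:\SomeFolder.zip"))    // or a response/network stream
{
    Zip(files, output);
}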
I am able to compress folders of up to about 1 GB, but I need to compress more than 10 GB.
string filePath = @"C:\Work"; // path of the source folder
private void compressFile(string filePath)
{
using (ZipFile zip = new ZipFile())
{
zip.AddDirectory(Path.Combine(filePath, "Demo"));
if (File.Exists(Path.Combine(filePath, "Demo.zip")))
{
File.Delete(Path.Combine(filePath, "Demo.zip"));
zip.CompressionLevel = Ionic.Zlib.CompressionLevel.BestCompression;
zip.Save(Path.Combine(filePath, "Demo.zip"));
}
zip.Save(Path.Combine(filePath, "Demo.zip"));
}
}
I'm assuming the issue here is an out-of-memory condition, due to the fact that everything is held in memory until the zip.Save call.
Suggestion: don't use DotNetZip. If you use System.IO.Compression.ZipArchive instead, you start by giving it an output stream (in the constructor). Make that a FileStream and you should be set, without it needing to buffer everything in memory first.
You would need to use ZipArchiveMode.Create.
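A minimal sketch of that suggestion, assuming using System.IO and System.IO.Compression (CreateEntryFromFile comes from the System.IO.Compression.FileSystem assembly), with filePath and the "Demo" folder taken from the question's code:

// Entries are compressed and written to the FileStream as they are added,
// so the 10+ GB folder is never held in memory.
string sourceDir = Path.Combine(filePath, "Demo");
string zipPath = Path.Combine(filePath, "Demo.zip");

using (var output = new FileStream(zipPath, FileMode.Create))
using (var archive = new ZipArchive(output, ZipArchiveMode.Create))
{
    foreach (string file in Directory.EnumerateFiles(sourceDir, "*", SearchOption.AllDirectories))
    {
        // Path.GetRelativePath needs .NET Core / .NET 5+; on .NET Framework compute the
        // relative path manually, e.g. file.Substring(sourceDir.Length + 1).
        string entryName = Path.GetRelativePath(sourceDir, file);
        archive.CreateEntryFromFile(file, entryName, CompressionLevel.Optimal);
    }
}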
I am writing an XML file that is more than 1 GB in size, and while writing it I want to compress the file so that its size is reduced and xmlDoc.Load(fileName) loads the file in the minimum amount of time.
My code for writing the XML file is:
using (FileStream fileStream = new FileStream(_logFilePath, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.ReadWrite))
{
xmlDoc.Load(fileStream);
int byteLenght = fileStream.ReadByte();
byte[] intBytes = BitConverter.GetBytes(byteLenght);
intBytes = Compress(intBytes);
xmlDoc.DocumentElement.AppendChild(newelement);
fileStream.SetLength(0);
xmlDoc.Save(fileStream);
}
and for compression:
private static byte[] Compress(byte[] data)
{
    using (MemoryStream compressedMemoryStream = new MemoryStream())
    {
        // leaveOpen: true so the MemoryStream is still readable after the DeflateStream is closed
        using (DeflateStream compressStream = new DeflateStream(compressedMemoryStream, CompressionMode.Compress, true))
        {
            compressStream.Write(data, 0, data.Length);
        }
        return compressedMemoryStream.ToArray();
    }
}
but it does not compress the file.
Compressing the file on disk won't do much to improve the time spent loading the document, because the larger part of the time goes into building up the object graph for the XmlDocument. That is so slow that it can take as long as (or longer than) reading the uncompressed XML from disk. Compression can still save some time here, but it's only a minor gain if a fast medium like an internal HDD is used.
If you want to improve performance working with large XML files, you'll need to use something like an XmlReader that streams the file instead of loading it all at once.
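A minimal sketch of that approach, assuming using System.IO, System.IO.Compression and System.Xml, that the log file was written through a GZipStream, and that the "entry" element name is purely illustrative:

// Stream the (gzip-compressed) XML without ever building an XmlDocument in memory.
using (FileStream fs = File.OpenRead(_logFilePath))
using (GZipStream gz = new GZipStream(fs, CompressionMode.Decompress))
using (XmlReader reader = XmlReader.Create(gz))
{
    while (reader.Read())
    {
        // Only react to the nodes you actually need.
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "entry")   // "entry" is illustrative
        {
            // process the element, e.g. via reader.GetAttribute(...) or reader.ReadElementContentAsString()
        }
    }
}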