Performance issues at extracting 7z file to a directory c#

Performance issues at extracting 7z file to a directory c# - c#

I'm using SharpCompress library for extracting .7z files but it takes about 35 mins to extract 60mb .7z file. Is this normal or am I doing something wrong in terms of performance? .7z file is compressed in high compress mode and LZMA type.
using (var archive2 = ArchiveFactory.Open(source))
{
foreach (var entry in archive2.Entries)
{
if (!entry.IsDirectory)
{
entry.WriteToDirectory(destination, ExtractOptions.ExtractFullPath | ExtractOptions.Overwrite);
}
}
}
}

This is an old post, but I just had the same problem.
This line is the problem
foreach (var entry in archive2.Entries)
The problem is described here (ie. If there are 100 files, it decompresses the 1st file 100 times, 2nd file 99 times, and so on)
The solution is to use reader (forward-only). See the API.
But the sample code there doesn't support 7z.
For 7z you can use archive.ExtractAllEntries(), eg.
using (var archive = ArchiveFactory.Open(movedZipFile))
{
var reader = archive.ExtractAllEntries();
while (reader.MoveToNextEntry())
{
if (!reader.Entry.IsDirectory)
reader.WriteEntryToDirectory(extractDir, new ExtractionOptions() { ExtractFullPath = false, Overwrite = true });
}
}
It will be much faster.

Related

Why is my zip file not showing any contents?

I've created a zip file method in my web api which returns a zip file to the front end (Angular / typescript) that should download a zip file in the browser. The issue I have is the file shows it has data by the number of kbs it has but on trying to extract the files it says it's empty. From a bit of research this is most likely down to the file being corrupt, but I want to know where I can find this is going wrong. Here's my code:
WebApi:
I won't show the controller as it basically just takes the inputs and passes them to the method. The DownloadFileResults passed in basically have a byte[] in the File property.
public FileContentResult CreateZipFile(IEnumerable<DownloadFileResult> files)
{
using (var compressedFileStream = new MemoryStream())
{
using (var zipArchive = new ZipArchive(compressedFileStream, ZipArchiveMode.Update))
{
foreach (var file in files)
{
var zipEntry = zipArchive.CreateEntry(file.FileName);
using (var entryStream = zipEntry.Open())
{
entryStream.Write(file.File, 0, file.File.Length);
}
}
}
return new FileContentResult(compressedFileStream.ToArray(), "application/zip");
}
}
This appears to work in that it generates a result with data. Here's my front end code:
let fileData = this._filePaths;
this._fileStorageProxy.downloadFile(Object.entries(fileData).map(([key, val]) => val), this._pId).subscribe(result => {
let data = result.data.fileContents;
const blob = new Blob([data], {
type: 'application/zip'
});
const url = window.URL.createObjectURL(blob);
window.open(url);
});
The front end code then displays me a zip file being downloaded, which as I say appears to have data due to it's size, but I can't extract it.
Update
I tried writing the compressedFileStream to a file on my local and I can see that it creates a zip file and I can extract the files within it. This leads me to believe it's something wrong with the front end, or at least with what the front end code is receiving.
2nd Update
Ok, turns out this is specific to how we do things here. The request goes through platform, but for downloads it can only handle a BinaryTransferObject and I needed to hit a different end point. So with a tweak to no longer returning a FileContentResult and hitting the right end point and making the url simply an ahref it's now working.

Does ZipArchive load entire zip file into memory

If I stream a zip file like so:
using var zip = new ZipArchive(fileStream, ZipArchiveMode.Read);
using var sr = new StreamReader(zip.Entries[0].Open());
var line = sr.ReadLine(); //etc..
Am I streaming the zip file entry or is it loading the entire zip file into memory then I am streaming the uncompressed file?

It depends on how the fileStream was created. Was it created from a file on disk? If so, then ZipArchive will read from disk as it needs data. It won't put the entire thing in memory then read it. That would be incredibly inefficient.
I have a bunch of experience in this... I worked on a project where I had to unarchive 25 GB. Zip files. .NET's ZipArchive was very quick and very memory efficient.
You can have MemoryStreams that contain data that ZipArchive can read from, so you aren't limited to just Zip files on disk.
Here is a slightly efficient way to unzip a ZipArchive:
var di = new DirectoryInfo(Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.CommonApplicationData), "MyDirectoryToExtractTo"));
var filesToExtract = _zip.Entries.Where(x =>
!string.IsNullOrEmpty(x.Name) &&
!x.FullName.EndsWith("/", StringComparison.Ordinal));
foreach(var x in filesToExtract)
{
var fi = new FileInfo(Path.Combine(di.FullName, x.FullName));
if (!fi.Directory.Exists) { fi.Directory.Create(); }
using (var i = x.Open())
using (var o = fi.OpenWrite())
{
i.CopyTo(o);
}
}
This will extract all the files to C:\ProgramData\MyDirectoryToExtractTo\ keeping directory structure.
If you'd like to see how ZipArchive was implemented to verify, take a look here.

How to extract multi-volume archive within Azure Blob Storage?

I have a multi-volume archive stored in Azure Blob Storage that is split into a series of zips titled like this: Archive-Name.zip.001, Archive-Name.zip.002, etc. . . Archive-Name.zip.010. Each file is 250 MB and contains hundreds of PDFs.
Currently we were trying to iterate through each archive part and extract the PDFs. This works except when the past PDF in an archive has been split between two archive parts, ZipFile in C# is unable to process the split file and throws an exception.
We tried reading all the archive parts into a single MemoryStream and then extracting the files, however then we are finding the memory streams exceed 2GBs which is the limit - so this method does not work either.
It is not feasible to download the archive into a machines memory, extract, then upload the PDFs to a new file. The extraction needs to be done in Azure where the program will run.
This is the code we are currently using - it is unable to handle PDFs split between two archive parts.
public static void UnzipTaxForms(TextWriter log, string type, string fiscalYear)
{
var folderName = "folderName";
var outPutContainer = GetContainer("containerName");
CreateIfNotExists(outPutContainer);
var fileItems = ListFileItems(folderName);
fileItems = fileItems.Where(i => i.Name.Contains(".zip")).ToList();
foreach (var file in fileItems)
{
using (var ziped = ZipFile.Read(GetMemoryStreamFromFile(folderName, file.Name)))
{
foreach (var zipEntry in ziped)
{
using (var outPutStream = new MemoryStream())
{
zipEntry.Extract(outPutStream);
var blockblob = outPutContainer.GetBlockBlobReference(zipEntry.FileName);
outPutStream.Seek(0, SeekOrigin.Begin);
blockblob.UploadFromStream(outPutStream);
}
}
}
}
}
Another note. We are unable to change the way the multi-volume archive is generated. Any help would be appreciated.

Best way to read multiple very large files

I need help figuring out the fastest way to read through about 80 files with over 500,000 lines in each file, and write to one master file with each input file's line as a column in the master. The master file must be written to a text editor like notepad and not a Microsoft product because they can't handle the number of lines.
For example, the master file should look something like this:
File1_Row1,File2_Row1,File3_Row1,...
File1_Row2,File2_Row2,File3_Row2,...
File1_Row3,File2_Row3,File3_Row3,...
etc.
I've tried 2 solutions so far:
Create a jagged array to hold each files' contents into an array and then once reading all lines in all files, write the master file. The issue with this solution is that Windows OS memory throws an error that too much virtual memory is being used.
Dynamically create a reader thread for each of the 80 files that reads a specific line number, and once all threads finish reading a line, combine those values and write to file, and repeat for each line in all files. The issue with this solution is that it is very very slow.
Does anybody have a better solution for reading so many large files in a fast way?

The best way is going to be to open the input files with a StreamReader for each one and a StreamWriter for the output file. Then you loop through each reader and read a single line and write it to the master file. This way you are only loading one line at a time so there should be minimal memory pressure. I was able to copy 80 ~500,000 line files in 37 seconds. An example:
using System;
using System.Collections.Generic;
using System.IO;
using System.Diagnostics;
class MainClass
{
static string[] fileNames = Enumerable.Range(1, 80).Select(i => string.Format("file{0}.txt", i)).ToArray();
public static void Main(string[] args)
{
var stopwatch = Stopwatch.StartNew();
List<StreamReader> readers = fileNames.Select(f => new StreamReader(f)).ToList();
try
{
using (StreamWriter writer = new StreamWriter("master.txt"))
{
string line = null;
do
{
for(int i = 0; i < readers.Count; i++)
{
if ((line = readers[i].ReadLine()) != null)
{
writer.Write(line);
}
if (i < readers.Count - 1)
writer.Write(",");
}
writer.WriteLine();
} while (line != null);
}
}
finally
{
foreach(var reader in readers)
{
reader.Close();
}
}
Console.WriteLine("Elapsed {0} ms", stopwatch.ElapsedMilliseconds);
}
}
I've assume that all the input files have the same number of lines, but you should be add the logic to keep reading when at least one file has given you data.

Use Memory Mapped files seems what is suitable to you. Something that does not execute pressure on memory of your app contemporary maintaining good performance in IO operations.
Here complete documentation: Memory-Mapped Files

If you have enough memory on the computer, I would use the Parallel.Invoke construct and read each file into a pre-allocated array such as:
string[] file1lines = new string[some value];
string[] file2lines = new string[some value];
string[] file3lines = new string[some value];
Parallel.Invoke(
() =>
{
ReadMyFile(file1,file1lines);
},
() =>
{
ReadMyFile(file2,file2lines);
},
() =>
{
ReadMyFile(file3,file3lines);
}
);
Each ReadMyFile method should just use the following sample code which, according to these benchmarks, is the fastest way to read a text file:
int x = 0;
using (StreamReader sr = File.OpenText(fileName))
{
while ((file1lines[x] = sr.ReadLine()) != null)
{
x += 1;
}
}
If you need to manipulate the data from each file before writing your final output, read this article on the fastest way to do that.
Then you just need one method to write the contents to each string[] to the output as you desire.

Have an array of open file handles. Loop through this array and read a line from each file into a string array. Then combine this array into the master file, append a newline at the end.
This differs from your second approach that it is single threaded and doesn't read a specific line but always the next one.
Of course you need to be error proof if there are files with less lines than others.

How can I list the contents of a .zip folder in c#?

How can I list the contents of a zipped folder in C#? For example how to know how many items are contained within a zipped folder, and what is their name?

.NET 4.5 or newer finally has built-in capability to handle generic zip files with the System.IO.Compression.ZipArchive class (http://msdn.microsoft.com/en-us/library/system.io.compression.ziparchive%28v=vs.110%29.aspx) in assembly System.IO.Compression. No need for any 3rd party library.
string zipPath = #"c:\example\start.zip";
using (ZipArchive archive = ZipFile.OpenRead(zipPath))
{
foreach (ZipArchiveEntry entry in archive.Entries)
{
Console.WriteLine(entry.FullName);
}
}

DotNetZip - Zip file manipulation in .NET languages
DotNetZip is a small, easy-to-use class library for manipulating .zip files. It can enable .NET applications written in VB.NET, C#, any .NET language, to easily create, read, and update zip files.
sample code to read a zip:
using (var zip = ZipFile.Read(PathToZipFolder))
{
int totalEntries = zip.Entries.Count;
foreach (ZipEntry e in zip.Entries)
{
e.FileName ...
e.CompressedSize ...
e.LastModified...
}
}

If you are using .Net Framework 3.0 or later, check out the System.IO.Packaging Namespace. This will remove your dependancy on an external library.
Specifically check out the ZipPackage Class.

Check into SharpZipLib
ZipInputStream inStream = new ZipInputStream(File.OpenRead(fileName));
while (inStream.GetNextEntry())
{
ZipEntry entry = inStream.GetNextEntry();
//write out your entry's filename
}

Ick - that code using the J# runtime is hideous! And I don't agree that it is the best way - J# is out of support now. And it is a HUGE runtime, if all you want is ZIP support.
How about this - it uses DotNetZip (Free, MS-Public license)
using (ZipFile zip = ZipFile.Read(zipfile) )
{
bool header = true;
foreach (ZipEntry e in zip)
{
if (header)
{
System.Console.WriteLine("Zipfile: {0}", zip.Name);
if ((zip.Comment != null) && (zip.Comment != ""))
System.Console.WriteLine("Comment: {0}", zip.Comment);
System.Console.WriteLine("\n{1,-22} {2,9} {3,5} {4,9} {5,3} {6,8} {0}",
"Filename", "Modified", "Size", "Ratio", "Packed", "pw?", "CRC");
System.Console.WriteLine(new System.String('-', 80));
header = false;
}
System.Console.WriteLine("{1,-22} {2,9} {3,5:F0}% {4,9} {5,3} {6:X8} {0}",
e.FileName,
e.LastModified.ToString("yyyy-MM-dd HH:mm:ss"),
e.UncompressedSize,
e.CompressionRatio,
e.CompressedSize,
(e.UsesEncryption) ? "Y" : "N",
e.Crc32);
if ((e.Comment != null) && (e.Comment != ""))
System.Console.WriteLine(" Comment: {0}", e.Comment);
}
}

I'm relatively new here so maybe I'm not understanding what's going on. :-)
There are currently 4 answers on this thread where the two best answers have been voted down. (Pearcewg's and cxfx's) The article pointed to by pearcewg is important because it clarifies some licensing issues with SharpZipLib.
We recently evaluated several .Net compression libraries, and found that DotNetZip is currently the best aleternative.
Very short summary:
System.IO.Packaging is significantly slower than DotNetZip.
SharpZipLib is GPL - see article.
So for starters, I voted those two answers up.
Kim.

If you are like me and do not want to use an external component, here is some code I developed last night using .NET's ZipPackage class.
var zipFilePath = "c:\\myfile.zip";
var tempFolderPath = "c:\\unzipped";
using (Package package = ZipPackage.Open(zipFilePath, FileMode.Open, FileAccess.Read))
{
foreach (PackagePart part in package.GetParts())
{
var target = Path.GetFullPath(Path.Combine(tempFolderPath, part.Uri.OriginalString.TrimStart('/')));
var targetDir = target.Remove(target.LastIndexOf('\\'));
if (!Directory.Exists(targetDir))
Directory.CreateDirectory(targetDir);
using (Stream source = part.GetStream(FileMode.Open, FileAccess.Read))
{
source.CopyTo(File.OpenWrite(target));
}
}
}
Things to note:
The ZIP archive MUST have a [Content_Types].xml file in its root. This was a non-issue for my requirements as I will control the zipping of any ZIP files that get extracted through this code. For more information on the [Content_Types].xml file, please refer to: A New Standard For Packaging Your Data There is an example file below Figure 13 of the article.
This code uses the Stream.CopyTo method in .NET 4.0

The best way is to use the .NET built in J# zip functionality, as shown in MSDN: http://msdn.microsoft.com/en-us/magazine/cc164129.aspx. In this link there is a complete working example of an application reading and writing to zip files. For the concrete example of listing the contents of a zip file (in this case a Silverlight .xap application package), the code could look like this:
ZipFile package = new ZipFile(packagePath);
java.util.Enumeration entries = package.entries();
//We have to use Java enumerators because we
//use java.util.zip for reading the .zip files
while ( entries.hasMoreElements() )
{
ZipEntry entry = (ZipEntry) entries.nextElement();
if (!entry.isDirectory())
{
string name = entry.getName();
Console.WriteLine("File: " + name + ", size: " + entry.getSize() + ", compressed size: " + entry.getCompressedSize());
}
else
{
// Handle directories...
}
}
Aydsman had a right pointer, but there are problems. Specifically, you might find issues opening zip files, but is a valid solution if you intend to only create pacakges. ZipPackage implements the abstract Package class and allows manipulation of zip files. There is a sample of how to do it in MSDN: http://msdn.microsoft.com/en-us/library/ms771414.aspx. Roughly the code would look like this:
string packageRelationshipType = #"http://schemas.microsoft.com/opc/2006/sample/document";
string resourceRelationshipType = #"http://schemas.microsoft.com/opc/2006/sample/required-resource";
// Open the Package.
// ('using' statement insures that 'package' is
// closed and disposed when it goes out of scope.)
foreach (string packagePath in downloadedFiles)
{
Logger.Warning("Analyzing " + packagePath);
using (Package package = Package.Open(packagePath, FileMode.Open, FileAccess.Read))
{
Logger.OutPut("package opened");
PackagePart documentPart = null;
PackagePart resourcePart = null;
// Get the Package Relationships and look for
// the Document part based on the RelationshipType
Uri uriDocumentTarget = null;
foreach (PackageRelationship relationship in
package.GetRelationshipsByType(packageRelationshipType))
{
// Resolve the Relationship Target Uri
// so the Document Part can be retrieved.
uriDocumentTarget = PackUriHelper.ResolvePartUri(
new Uri("/", UriKind.Relative), relationship.TargetUri);
// Open the Document Part, write the contents to a file.
documentPart = package.GetPart(uriDocumentTarget);
//ExtractPart(documentPart, targetDirectory);
string stringPart = documentPart.Uri.ToString().TrimStart('/');
Logger.OutPut(" Got: " + stringPart);
}
// Get the Document part's Relationships,
// and look for required resources.
Uri uriResourceTarget = null;
foreach (PackageRelationship relationship in
documentPart.GetRelationshipsByType(
resourceRelationshipType))
{
// Resolve the Relationship Target Uri
// so the Resource Part can be retrieved.
uriResourceTarget = PackUriHelper.ResolvePartUri(
documentPart.Uri, relationship.TargetUri);
// Open the Resource Part and write the contents to a file.
resourcePart = package.GetPart(uriResourceTarget);
//ExtractPart(resourcePart, targetDirectory);
string stringPart = resourcePart.Uri.ToString().TrimStart('/');
Logger.OutPut(" Got: " + stringPart);
}
}
}
The best way seems to use J#, as shown in MSDN: http://msdn.microsoft.com/en-us/magazine/cc164129.aspx
There are pointers to more c# .zip libraries with different licenses, like SharpNetZip and DotNetZip in this article: how to read files from uncompressed zip in c#?. They might be unsuitable because of the license requirements.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Performance issues at extracting 7z file to a directory c# - c#

Related

Why is my zip file not showing any contents?

Does ZipArchive load entire zip file into memory

How to extract multi-volume archive within Azure Blob Storage?

Best way to read multiple very large files

How can I list the contents of a .zip folder in c#?

Categories

Resources