SharpCompress & LZMA2 7z archive - very slow extraction of specific file. Why? Alternatives? - c#

I have a 7zip archive created with LZMA2 compression (compression level: ultra).
The archive contains 1,749 files with a total uncompressed size of 661 MB.
The compressed archive is 39 MB.
Now I'm trying to use C# to extract a single, tiny (~200 KB) file from this archive.
I'm getting the corresponding IArchiveEntry from the IArchive (which works relatively fast),
but then calling IArchiveEntry.WriteToFile(targetPath) takes around 33 seconds! It's similarly slow if I write to a memory stream instead. (Edit: when I run this on a 7z LZMA2 archive with compression level = normal, it still takes 9 seconds.)
When I open the same archive in the actual 7zip application and extract the same file from there, it takes only around 2-3 seconds.
I suspected it's some sort of multicore (7zip) vs. single core (SharpCompress, probably?) thing, but I don't notice any CPU usage spike during decompression with 7zip. Maybe it's too fast to be noticeable, though.
Does anyone know what could cause such slow speeds with SharpCompress? Am I maybe missing some setting, or using the wrong factory (ArchiveFactory)?
If not: is there any C# library out there that might be significantly faster at decompressing this?
For reference, here's a sketch of how I'm using SharpCompress to extract:
private void Extract()
{
    using (var archive = GetArchive())
    {
        var entryPath = /* ... path to entry ... */;
        var entry = TryGetEntry(archive, entryPath);
        entry.WriteToFile(some_target_path);
    }
}

private IArchive GetArchive()
{
    string path = /* ... path to my .7z file ... */;
    return ArchiveFactory.Open(path);
}

private IArchiveEntry TryGetEntry(IArchive archive, string path)
{
    path = path.Replace("\\", "/");
    foreach (var entry in archive.Entries)
    {
        if (!entry.IsDirectory)
        {
            if (entry.Key == path)
                return entry;
        }
    }
    return null;
}
Update: as a temporary solution, I'm now shipping 7zr.exe from the 7zip SDK with my application and running it in a new process to extract a single file, reading the process' output into a binary stream.
This works in around ~3 seconds, compared to the ~33 seconds with SharpCompress. It works for now, but it's kind of ugly, so I'm still curious why SharpCompress seems to be so slow here.
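For reference, a rough sketch of that workaround; the exact 7zr.exe switches (in particular -so for writing the extracted data to stdout) are assumptions based on the regular 7z command line, so verify them against your 7-Zip version:

private byte[] ExtractWith7zr(string archivePath, string entryPath)
{
    // 7zr.exe is assumed to sit next to the application;
    // "e <archive> -so <entry>" should extract the entry to standard output.
    var psi = new ProcessStartInfo
    {
        FileName = "7zr.exe",
        Arguments = $"e \"{archivePath}\" -so \"{entryPath}\"",
        RedirectStandardOutput = true,
        UseShellExecute = false,
        CreateNoWindow = true
    };

    using (var process = Process.Start(psi))
    using (var buffer = new MemoryStream())
    {
        process.StandardOutput.BaseStream.CopyTo(buffer);
        process.WaitForExit();
        return buffer.ToArray();
    }
}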

This line is the problem:
foreach (var entry in archive.Entries)
The problem is described here (i.e. if there are 100 files, it decompresses the 1st file 100 times, the 2nd file 99 times, and so on).
You need to use the forward-only reader. See the API.
But the sample code there doesn't support 7z.
For 7z you can use archive.ExtractAllEntries(), e.g.:
var reader = archive.ExtractAllEntries();
while (reader.MoveToNextEntry())
{
    if (!reader.Entry.IsDirectory)
        reader.WriteEntryToDirectory(extractDir, new ExtractionOptions() { ExtractFullPath = false, Overwrite = true });
}
It will be much faster.
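Applied to the original question (pulling out one specific entry), a minimal sketch along those lines might look like this; MoveToNextEntry, Entry.Key and WriteEntryTo are standard SharpCompress reader members, but treat the exact overloads as assumptions to check against the version you use:

using (var archive = ArchiveFactory.Open(archivePath))
using (var reader = archive.ExtractAllEntries())
{
    while (reader.MoveToNextEntry())
    {
        // Forward-only: each entry is decompressed at most once.
        if (!reader.Entry.IsDirectory && reader.Entry.Key == entryPath)
        {
            using (var output = File.Create(targetPath))
            {
                reader.WriteEntryTo(output);
            }
            break;
        }
    }
}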

If you need all the files you could also do:
using var reader = archive.ExtractAllEntries();
reader.WriteAllToDirectory(targetPath, new ExtractionOptions() { ExtractFullPath = true, Overwrite = true });

Related

What would cause files being added to a zip file to not be included in the zip?

I work with a program that takes large amounts of data, turns the data into xml files, then takes those xml files and zips them for use in another program. Occasionally, during the zipping process, one or two xml files get left out. It is fairly rare, once or twice a month, but when it does happen it's a big mess. I am looking for help figuring out why the files don't get zipped and how to prevent it. The code is straightforward:
public string AddToZip(string outfile, string toCompress)
{
    if (!File.Exists(toCompress)) throw new FileNotFoundException("Could not find the file to compress", toCompress);
    string dir = Path.GetDirectoryName(outfile);
    if (!Directory.Exists(dir))
    {
        Directory.CreateDirectory(dir);
    }
    // The program that gets this data can't handle files over
    // 20 MB, so it splits it up into two or more files if it hits the
    // limit.
    if (File.Exists(outfile))
    {
        FileInfo tooBig = new FileInfo(outfile);
        int converter = 1024;
        float fileSize = tooBig.Length / converter; //bytes to KB
        fileSize = fileSize / converter; //KB to MB
        int limit = CommonTypes.Helpers.ConfigHelper.GetConfigEntryInt("zipLimit", "19");
        if (fileSize >= limit)
        {
            outfile = MakeNewName(outfile);
        }
    }
    using (ZipFile zf = new ZipFile(outfile))
    {
        zf.AddFile(toCompress, "");
        zf.Save();
    }
    return outfile;
}
Ultimately, what I want to do is have a check that sees if any xml files weren't added to the zip after the zip file is created, but stopping the problem in its tracks is best overall. Thanks for the help.
Make sure you have that code inside a try...catch statement. Also make sure that if you have done that, you do something with the exception. It would not be the first codebase with this type of exception handling:
try
{
    //...
}
catch { }
Given the code above, if any exception occurs in your process, you will never notice.
It's hard to judge from this function alone; here's a list of things that can go wrong:
- The toCompress file can be gone by the time zf.AddFile is called (but after the Exists test). Test the return value or add exception handling to detect this.
- The zip outfile can be just below the size limit; adding a new file can push it over the limit.
- AddToZip() may be called concurrently, which may cause adding to fail.
How is the removal of the toCompress file handled? I think adding locking to AddToZip() at function scope might also be a good idea.
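A minimal sketch of that function-scope locking, assuming all callers live in a single process (cross-process callers would need a Mutex or a file lock instead):

private static readonly object zipLock = new object();

public string AddToZip(string outfile, string toCompress)
{
    // Serialize all zip updates so two calls never touch the archive at the same time.
    lock (zipLock)
    {
        // ... existing AddToZip body from the question ...
        return outfile;
    }
}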
This could be a timing issue. You are checking to see if outfile is too big before trying to add the toCompress file. What you should be doing is:
1. Add toCompress to outfile.
2. Check to see if adding the file made outfile too big.
3. If outfile is now too big, remove toCompress, create a new outfile, and add toCompress to the new outfile.
I suspect that you occasionally have an outfile that is just under the limit, but adding toCompress puts it over. Then the receiving program does not process outfile because it is too big.
I could be completely off base, but it is something to check.
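A rough sketch of that order of operations, assuming the same DotNetZip (Ionic) ZipFile API and the MakeNewName helper from the question; ZipFile.Read, RemoveEntry and Save are standard DotNetZip members, but verify the overloads against your version:

public string AddToZipChecked(string outfile, string toCompress, int limitMb)
{
    // 1. Add the file first.
    using (ZipFile zf = File.Exists(outfile) ? ZipFile.Read(outfile) : new ZipFile(outfile))
    {
        zf.AddFile(toCompress, "");
        zf.Save(outfile);
    }

    // 2. Then check whether the archive grew past the limit.
    if (new FileInfo(outfile).Length >= (long)limitMb * 1024 * 1024)
    {
        // 3. Roll the last file over into a fresh archive.
        using (ZipFile zf = ZipFile.Read(outfile))
        {
            zf.RemoveEntry(Path.GetFileName(toCompress));
            zf.Save();
        }

        outfile = MakeNewName(outfile);   // hypothetical helper from the question
        using (ZipFile zf = new ZipFile(outfile))
        {
            zf.AddFile(toCompress, "");
            zf.Save();
        }
    }
    return outfile;
}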

Add Files Into Existing Zip - performance issue

I have a WCF webservice that saves files to a folder (about 200,000 small files).
After that, I need to move them to another server.
The solution I've found was to zip them then move them.
When I adopted this solution, I tested it with 20,000 files: zipping 20,000 files took only about 2 minutes, and moving the zip is really fast.
But in production, zipping 200,000 files takes more than 2 hours.
Here is my code to zip the folder:
using (ZipFile zipFile = new ZipFile())
{
    zipFile.UseZip64WhenSaving = Zip64Option.Always;
    zipFile.CompressionLevel = CompressionLevel.None;
    zipFile.AddDirectory(this.SourceDirectory.FullName, string.Empty);
    zipFile.Save(DestinationCurrentFileInfo.FullName);
}
I want to modify the WCF webservice, so that instead of saving to a folder, it saves to the zip.
I use the following code to test:
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories).Where(s => s.EndsWith(".aes")).Select(f => new FileInfo(f));
foreach (var additionFile in listAes)
{
    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        zip.AddFile(additionFile.FullName);
        zip.Save();
    }
    file.WriteLine("Delay for adding a file : " + sw.Elapsed.TotalMilliseconds);
    sw.Restart();
}
The first file to add to the zip takes only 5 ms, but the 10,000th file to add takes 800 ms.
Is there a way to optimize this? Or do you have other suggestions?
EDIT
The example shown above is only for testing; in the WCF webservice, I'll have different requests sending files that I need to add to the zip file.
As WCF is stateless, I will have a new instance of my class with each call, so how can I keep the zip file open to add more files?
I've looked at your code and immediately spotted problems. The problem with a lot of software developers nowadays is that they don't understand how stuff works, which makes it impossible to reason about it. In this particular case you don't seem to know how ZIP files work; therefore I would suggest you first read up on how they work and attempt to break down what happens under the hood.
Reasoning
Now that we're all on the same page about how ZIP files work, let's start the reasoning by breaking down your source code; we'll continue from there:
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories).Where(s => s.EndsWith(".aes")).Select(f => new FileInfo(f));
foreach (var additionFile in listAes)
{
    // (1)
    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        // (2)
        zip.AddFile(additionFile.FullName);
        // (3)
        zip.Save();
    }
    file.WriteLine("Delay for adding a file : " + sw.Elapsed.TotalMilliseconds);
    sw.Restart();
}
(1) Opens the ZIP file. You're doing this for every file you attempt to add.
(2) Adds a single file to the ZIP file.
(3) Saves the complete ZIP file.
On my computer this takes about an hour.
Now, not all of the file format details are relevant. We're looking for stuff that will get increasingly worse in your program.
Skimming over the file format specification, you'll notice that compression is based on Deflate, which doesn't require information on the other files that are compressed. Moving on, we'll look at how the 'file table' is stored in the ZIP file.
You'll notice that there's a 'central directory' which stores the files in the ZIP file. It's basically stored as a 'list'. Using this information, we can reason about the trivial way to update it when implementing steps (1-3) in this order:
1. Open the zip file, read the central directory.
2. Append the data for the (new) compressed file, store the pointer along with the filename in the new central directory.
3. Re-write the central directory.
Think about it for a moment: for file #1 you need 1 write operation; for file #2, you need to read (1 item), append (in memory) and write (2 items); for file #3, you need to read (2 items), append (in memory) and write (3 items). And so on. This basically means that your performance will go down the drain as you add more files. You've already observed this; now you know why.
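To put a rough number on it (a back-of-the-envelope estimate, not a figure from the original post): rewriting the ever-growing central directory on every add means the total number of directory entries written is roughly

\sum_{k=1}^{N} k = \frac{N(N+1)}{2} \approx 2 \times 10^{10} \quad \text{for } N = 200{,}000,

which is why the per-file delay keeps climbing (5 ms for the first file, 800 ms by the 10,000th).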
A possible solution
One solution (shown at the end of this answer, under "The easiest solution is the most practical") is to add all files at once. That might not work in your use case. Another solution is to implement a merge that basically merges 2 files together every time. This is more convenient if you don't have all files available when you start the compression process.
Basically the algorithm then becomes:
1. Add a few (say, 16) files. You can toy with this number. Store this in, say, 'file16.zip'.
2. Add more files. When you hit 16 files, you have to merge the two files of 16 items into a single file of 32 items.
3. Merge files until you cannot merge anymore. Basically, every time you have two files of N items, you create a new file of 2*N items.
4. Go to (2).
Again, we can reason about it. The first 16 files aren't a problem; we've already established that.
We can also reason about what will happen in our program. Because we're merging 2 files into 1 file, we don't have to do as many reads and writes. In fact, if you reason about it, you'll see that you have a file of 32 entries in 2 merges, 64 in 4 merges, 128 in 8 merges, 256 in 16 merges... hey, wait, we know this sequence: it's 2^N. Again, reasoning about it, we'll find that we need approximately 500 merges, which is much better than the 200,000 operations that we started with.
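One way to implement a single merge step is sketched below. It uses System.IO.Compression (ZipFile.Open / ZipArchiveMode) rather than the Ionic library used elsewhere in this answer, and simply copies the entries of two archives into a new one; since the entries are stored uncompressed, nothing is re-compressed:

static void MergeZips(string zipA, string zipB, string mergedPath)
{
    using (var output = ZipFile.Open(mergedPath, ZipArchiveMode.Create))
    {
        foreach (var source in new[] { zipA, zipB })
        {
            using (var input = ZipFile.Open(source, ZipArchiveMode.Read))
            {
                foreach (var entry in input.Entries)
                {
                    var target = output.CreateEntry(entry.FullName, CompressionLevel.NoCompression);
                    using (var from = entry.Open())
                    using (var to = target.Open())
                    {
                        from.CopyTo(to);   // stream the entry across without loading it into memory
                    }
                }
            }
        }
    }
}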
Hacking in the ZIP file
Yet another solution that might come to mind is to overallocate the central directory, creating slack space for future entries to add. However, this probably requires you to hack into the ZIP code and create your own ZIP file writer. The idea is that you basically overallocate the central directory to 200K entries before you get started, so that you can simply append in-place.
Again, we can reason about it: adding a file now means adding a file and updating some headers. It won't be as fast as the original solution because you'll need random disk IO, but it'll probably work fast enough.
I haven't worked this out, but it doesn't seem overly complicated to me.
The easiest solution is the most practical
What we haven't discussed so far is the easiest possible solution: one approach that comes to mind is to simply add all files at once, which we can again reason about.
Implementation is quite easy, because now we don't have to do any fancy things; we can simply use the ZIP handler (I use ionic) as-is:
static void Main()
{
    try { File.Delete(@"c:\tmp\test.zip"); }
    catch { }

    var sw = Stopwatch.StartNew();
    using (var zip = new ZipFile(@"c:\tmp\test.zip"))
    {
        zip.UseZip64WhenSaving = Zip64Option.Always;
        for (int i = 0; i < 200000; ++i)
        {
            string filename = "foo" + i.ToString() + ".txt";
            byte[] contents = Encoding.UTF8.GetBytes("Hello world!");
            zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
            zip.AddEntry(filename, contents);
        }
        zip.Save();
    }
    Console.WriteLine("Elapsed: {0:0.00}s", sw.Elapsed.TotalSeconds);
    Console.ReadLine();
}
Whop; that finishes in 4.5 seconds. Much better.
I can see that you just want to group the 200,000 files into one big single file, without compression (like a tar archive).
Two ideas to explore:
Experiment with file formats other than Zip, as it may not be the fastest. Tar (tape archive) comes to mind, with natural speed advantages due to its simplicity; it even has an append mode, which is exactly what you are after to ensure O(1) operations. SharpCompress is a library that will allow you to work with this format (and others).
If you have control over your remote server, you could implement your own file format. The simplest I can think of would be to zip each new file separately (to store the file metadata such as name, date, etc. in the file content itself), and then append each such zipped file to a single raw bytes file. You would just need to store the byte offsets (separated by columns in another txt file) to allow the remote server to split the huge file back into the 200,000 zipped files, and then unzip each of them to get the metadata. I guess this is also roughly what tar does behind the scenes :).
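A minimal sketch of that second idea using only the BCL; the file names and the one-line-per-entry index format are illustrative, not a real protocol:

static void AppendWithIndex(string payloadPath, string indexPath, string fileToAdd)
{
    using (var payload = new FileStream(payloadPath, FileMode.Append, FileAccess.Write))
    using (var index = File.AppendText(indexPath))
    using (var input = File.OpenRead(fileToAdd))
    {
        long offset = payload.Position;   // current end of the big file
        input.CopyTo(payload);            // O(1) append: no rewriting of earlier data
        // offset,length,name: enough for the remote side to split the blob again
        index.WriteLine($"{offset},{input.Length},{Path.GetFileName(fileToAdd)}");
    }
}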
Have you tried zipping to a MemoryStream rather than to a file, only flushing to a file when you are done for the day? Of course for back-up purposes your WCF service would have to keep a copy of the received individual files until you are sure they have been "committed" to the remote server.
If you do need compression, 7-Zip (and fiddling with the options) is well worth a try.
You are opening the file repeatedly; why not loop through and add them all to one zip, then save it?
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories)
                       .Where(s => s.EndsWith(".aes"))
                       .Select(f => new FileInfo(f));

using (var zip = ZipFile.Read(nameOfExistingZip))
{
    foreach (var additionFile in listAes)
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        zip.AddFile(additionFile.FullName);
    }
    zip.Save();
}
If the files aren't all available right away, you could at least batch them together. So if you're expecting 200k files, but you only have received 10 so far, don't open the zip, add one, then close it. Wait for a few more to come in and add them in batches.
If you are OK with the performance of 100 * 20,000 files, can't you simply partition your large ZIP into 100 "small" ZIP files? For simplicity, create a new ZIP file every minute and put a time-stamp in the name.
You can zip all the files using .Net TPL (Task Parallel Library) like this:
while (0 != (read = sourceStream.Read(bufferRead, 0, sliceBytes)))
{
    tasks[taskCounter] = Task.Factory.StartNew(() =>
        CompressStreamP(bufferRead, read, taskCounter, ref listOfMemStream, eventSignal)); // Line 1
    eventSignal.WaitOne(-1);           // Line 2
    taskCounter++;                     // Line 3
    bufferRead = new byte[sliceBytes]; // Line 4
}

Task.WaitAll(tasks); // Line 6
There is a compiled library and source code here:
http://www.codeproject.com/Articles/49264/Parallel-fast-compression-unleashing-the-power-of

Out of memory exception while updating zip

I am getting an OutOfMemoryException while trying to add files to a .zip file. I am using 32-bit architecture for building and running the application.
string[] filePaths = Directory.GetFiles(Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData) + "\\capture\\capture");
System.IO.Compression.ZipArchive zip = ZipFile.Open(filePaths1[c], ZipArchiveMode.Update);
foreach (String filePath in filePaths)
{
    string nm = Path.GetFileName(filePath);
    zip.CreateEntryFromFile(filePath, "capture/" + nm, CompressionLevel.Optimal);
}
zip.Dispose();
zip = null;
I am unable to understand the reason behind it.
The exact reason depends on a variety of factors, but most likely you are simply adding too much to the archive. Try using the ZipArchiveMode.Create option instead, which writes the archive directly to disk without caching it in memory.
If you are really trying to update an existing archive, you can still use ZipArchiveMode.Create. But it will require opening the existing archive, copying all of its contents to a new archive (using Create), and then adding the new content.
Without a good, minimal, complete code example, it would not be possible to say for sure where the exception is coming from, never mind how to fix it.
EDIT:
Here is what I mean by "…opening the existing archive, copying all of its contents to a new archive (using Create), and then adding the new content":
string[] filePaths = Directory.GetFiles(Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData) + "\\capture\\capture");

using (ZipArchive zipFrom = ZipFile.Open(filePaths1[c], ZipArchiveMode.Read))
using (ZipArchive zipTo = ZipFile.Open(filePaths1[c] + ".tmp", ZipArchiveMode.Create))
{
    foreach (ZipArchiveEntry entryFrom in zipFrom.Entries)
    {
        ZipArchiveEntry entryTo = zipTo.CreateEntry(entryFrom.FullName);
        using (Stream streamFrom = entryFrom.Open())
        using (Stream streamTo = entryTo.Open())
        {
            streamFrom.CopyTo(streamTo);
        }
    }

    foreach (String filePath in filePaths)
    {
        string nm = Path.GetFileName(filePath);
        zipTo.CreateEntryFromFile(filePath, "capture/" + nm, CompressionLevel.Optimal);
    }
}

File.Delete(filePaths1[c]);
File.Move(filePaths1[c] + ".tmp", filePaths1[c]);
Or something like that. Lacking a good, minimal, complete code example, I just wrote the above in my browser. I didn't try to compile it, never mind test it. And you may want to adjust some specifics (e.g. the handling of the temp file). But hopefully you get the idea.
The reason is simple. OutOfMemoryException means there is not enough memory for the execution.
Compression consumes a lot of memory. There is no guarantee that a change of logic can solve the problem. But you can consider different methods to alleviate it.
1. Since your main program must be 32-bit, you can consider starting another 64-bit process to do the compression (use System.Diagnostics.Process.Start). After the 64-bit process finishes its job and exits, your 32-bit main program can continue. You can simply use a tool already installed on the system, or write a simple program yourself (a sketch of this is shown after option 2 below).
2. Another method is to dispose each time you add an entry.
ZipArchive.Dispose saves the file. After each iteration, the memory allocated for the ZipArchive can be freed.
foreach (String filePath in filePaths)
{
    System.IO.Compression.ZipArchive zip = ZipFile.Open(filePaths1[c], ZipArchiveMode.Update);
    string nm = Path.GetFileName(filePath);
    zip.CreateEntryFromFile(filePath, "capture/" + nm, CompressionLevel.Optimal);
    zip.Dispose();
}
This approach is not straightforward, and it might not be as effective as the first approach.
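As referenced under option 1, here is a rough sketch of handing the compression off to an external 64-bit process. The path to 7z.exe, the target zip name, and the exact arguments ("a -tzip" adds files to a zip archive) are assumptions to adapt to your setup:

string capture = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
    "capture", "capture");
string targetZip = @"C:\temp\capture.zip";   // hypothetical output path

var psi = new ProcessStartInfo
{
    FileName = @"C:\Program Files\7-Zip\7z.exe",   // 64-bit 7-Zip assumed to be installed
    Arguments = $"a -tzip \"{targetZip}\" \"{capture}\\*\"",
    UseShellExecute = false,
    CreateNoWindow = true
};

using (var process = Process.Start(psi))
{
    process.WaitForExit();   // the 32-bit host resumes once the 64-bit child is done
}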

C# Is opening and reading from a Stream slow?

I have 22k text (rtf) files which I must append to one final one.
The code looks something like this:
using (TextWriter mainWriter = new StreamWriter(mainFileName))
{
    foreach (string currentFile in filesToAppend)
    {
        using (TextReader currentFileReader = new StreamReader(currentFile))
        {
            string fileContent = currentFileReader.ReadToEnd();
            mainWriter.Write(fileContent);
        }
    }
}
Clearly, this opens a stream 22k times to read from the files.
My questions are :
1) in general, is opening a stream a slow operation? Is reading from a stream a slow operation?
2) is there any difference if I read the file as byte[] and append it as byte[] rather than using the file text?
3) any better ideas to merge 22k files ?
Thanks.
1) in general, is opening a stream a slow operation?
No, not at all. Opening a stream is blazing fast; it's only a matter of reserving a handle from the underlying operating system.
2) is there any difference if I read the file as byte[] and append it as byte[] rather than using the file text?
Sure, it might be a bit faster to skip converting the bytes into strings using some encoding, but the improvement would be negligible (especially if you are dealing with really huge files) compared to what I suggest in the next point.
3) any ways to achieve this better? (merge 22k files)
Yes, don't load the contents of every single file in memory, just read it in chunks and spit it to the output stream:
using (var output = File.OpenWrite(mainFileName))
{
    foreach (string currentFile in filesToAppend)
    {
        using (var input = File.OpenRead(currentFile))
        {
            input.CopyTo(output);
        }
    }
}
The Stream.CopyTo method from the BCL will take care of the heavy lifting in my example.
Probably the best way to speed this up is to make sure that the output file is on a different physical disk drive than the input files.
Also, you can get some increase in speed by creating the output file with a large buffer. For example:
const int BufferSize = 64 * 1024; // for example, 64 KB; tune to taste

using (var fs = new FileStream(filename, FileMode.Create, FileAccess.Write, FileShare.None, BufferSize))
{
    using (var mainWriter = new StreamWriter(fs))
    {
        // do your file copies here
    }
}
That said, your primary bottleneck will be opening the files. That's especially true if those 22,000 files are all in the same directory. NTFS has some problems with large directories. You're better off splitting that one large directory into, say, 22 directories with 1,000 files each. Opening a file from a directory that contains tens of thousands of files is much slower than opening a file in a directory that has only a few hundred files.
What's slow about reading data from a file is the fact that you aren't moving around electrons which can propagate a signal at speeds that are...really fast. To read information in files you have to actually spin these metal disks around and use magnets to read data off of them. These disks are spinning at far slower than electrons can propagate signals through wires. Regardless of what mechanism you use in code to tell these disks to spin around, you're still going to have to wait for them to go a spinin' and that's going to take time.
Whether you treat the data as bytes or text isn't particularly relevant, no.

Faster file move method other than File.Move

I have a console application that is going to take about 625 days to complete. Unless there is a way to make it faster.
First off, I am working in a directory that has around 4,000,000 files in it, if not more. I'm working with a database that has a row for each file, and then some.
Now, working with the SQL is relatively fast; the bottleneck is File.Move(): each move takes 18 seconds to complete.
Is there a faster way than File.Move()?
This is the bottleneck:
File.Move(Path.Combine(location, fileName), Path.Combine(rootDir, fileYear, fileMonth, fileName));
All of the other code runs pretty fast. All I need to do is move one file to a new location and then update the database location field.
I can show other code if needed, but really the above is the only current bottleneck.
It turns out switching from File.Move to setting up a FileInfo and using .MoveTo increased the speed significantly.
It will run in about 35 days now as opposed to 625 days.
FileInfo fileinfo = new FileInfo(Path.Combine(location, fileName));
fileinfo.MoveTo(Path.Combine(rootDir, fileYear, fileMonth, fileName));
18 seconds isn't really unusual. NTFS does not perform well when you have a lot of files in a single directory. When you ask for a file, it has to do a linear search of its directory data structure. With 1,000 files, that doesn't take too long. With 10,000 files you notice it. With 4 million files . . . yeah, it takes a while.
You can probably do this even faster if you pre-load all of the directory entries into memory. Then rather than calling the FileInfo constructor for each file, you just look it up in your dictionary.
Something like:
var dirInfo = new DirectoryInfo(path);

// get list of all files
var files = dirInfo.GetFileSystemInfos();

var cache = new Dictionary<string, FileSystemInfo>();
foreach (var f in files)
{
    cache.Add(f.FullName, f);
}
Now when you get a name from the database, you can just look it up in the dictionary. That might very well be faster than trying to get it from the disk each time.
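A hypothetical usage of that cache, reusing the variable names from the question (location, fileName, rootDir, fileYear, fileMonth):

string sourcePath = Path.Combine(location, fileName);
if (cache.TryGetValue(sourcePath, out FileSystemInfo info) && info is FileInfo fileInfo)
{
    // No extra directory lookup on disk: the metadata is already in memory.
    fileInfo.MoveTo(Path.Combine(rootDir, fileYear, fileMonth, fileName));
    // ... update the database location field here ...
}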
You can move files in parallel, and using Directory.EnumerateFiles gives you a lazily loaded list of files (of course, I have not tested it with 4,000,000 files):
var numberOfConcurrentMoves = 2;
var moves = new List<Task>();
var sourceDirectory = "source-directory";
var destinationDirectory = "destination-directory";

foreach (var filePath in Directory.EnumerateFiles(sourceDirectory))
{
    var move = new Task(() =>
    {
        File.Move(filePath, Path.Combine(destinationDirectory, Path.GetFileName(filePath)));
        //UPDATE DB
    }, TaskCreationOptions.PreferFairness);
    move.Start();
    moves.Add(move);

    if (moves.Count >= numberOfConcurrentMoves)
    {
        Task.WaitAll(moves.ToArray());
        moves.Clear();
    }
}

Task.WaitAll(moves.ToArray());
