Add Files Into Existing Zip - performance issue - C#

I have a WCF webservice that saves files to a folder (about 200,000 small files).
After that, I need to move them to another server.
The solution I've found was to zip them then move them.
When I adopted this solution, I tested it with 20,000 files: zipping them took only about 2 minutes, and moving the zip is really fast.
But in production, zipping 200,000 files takes more than 2 hours.
Here is my code to zip the folder:
using (ZipFile zipFile = new ZipFile())
{
    zipFile.UseZip64WhenSaving = Zip64Option.Always;
    zipFile.CompressionLevel = CompressionLevel.None;
    zipFile.AddDirectory(this.SourceDirectory.FullName, string.Empty);
    zipFile.Save(DestinationCurrentFileInfo.FullName);
}
I want to modify the WCF webservice, so that instead of saving to a folder, it saves to the zip.
I use the following code to test:
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories)
                       .Where(s => s.EndsWith(".aes"))
                       .Select(f => new FileInfo(f));

foreach (var additionFile in listAes)
{
    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        zip.AddFile(additionFile.FullName);
        zip.Save();
    }
    // sw is a Stopwatch and file a StreamWriter, both declared elsewhere in the test
    file.WriteLine("Delay for adding a file : " + sw.Elapsed.TotalMilliseconds);
    sw.Restart();
}
The first file added to the zip takes only 5 ms, but the 10,000th file takes 800 ms.
Is there a way to optimize this? Or do you have other suggestions?
EDIT
The example shown above is only for testing; in the WCF webservice, I'll have different requests sending files that I need to add to the zip file.
As WCF is stateless, I will have a new instance of my class with each call, so how can I keep the zip file open to add more files?

I've looked at your code and immediately spotted problems. The problem with a lot of software developers nowadays is that they don't understand how stuff works, which makes it impossible to reason about it. In this particular case you don't seem to know how ZIP files work; therefore I would suggest you first read up on how they work and attempt to break down what happens under the hood.
Reasoning
Now that we're all on the same page on how they work, let's start the reasoning by breaking down how this works using your source code; we'll continue from there on forward:
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories)
                       .Where(s => s.EndsWith(".aes"))
                       .Select(f => new FileInfo(f));

foreach (var additionFile in listAes)
{
    // (1)
    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        // (2)
        zip.AddFile(additionFile.FullName);
        // (3)
        zip.Save();
    }
    file.WriteLine("Delay for adding a file : " + sw.Elapsed.TotalMilliseconds);
    sw.Restart();
}
(1) opens the ZIP file. You're doing this for every file you attempt to add.
(2) adds a single file to the ZIP file.
(3) saves the complete ZIP file.
On my computer this takes about an hour.
Now, not all of the file format details are relevant. We're looking for stuff that will get increasingly worse in your program.
Skimming over the file format specification, you'll notice that compression is based on Deflate, which doesn't require information about the other files being compressed. Moving on, we'll notice how the 'file table' is stored in the ZIP file:
You'll notice that there's a 'central directory' at the end of the archive which stores the files in the ZIP file; it's basically stored as a flat 'list'. Using this information we can reason about the trivial way to update it when implementing steps (1-3) in this order:
1. Open the zip file and read the central directory.
2. Append the data for the (new) compressed file, and store its pointer along with the filename in the new central directory.
3. Re-write the central directory.
Think about it for a moment: for file #1 you need 1 write operation; for file #2, you need to read (1 item), append (in memory) and write (2 items); for file #3, you need to read (2 items), append (in memory) and write (3 items). And so on. This basically means that your performance will go down the drain as you add more files. You've already observed this, now you know why.
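Concretely: with N files added one at a time, the central directory is written 1 + 2 + ... + N = N(N+1)/2 times in total, which for N = 200,000 works out to roughly 2 × 10^10 entry writes. That quadratic blow-up, not the compression, is what dominates the two hours.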
A possible solution
The solution shown below under 'The easiest solution is the most practical' adds all files at once. That might not work in your use case. Another solution is to implement a merge that basically merges 2 files together every time. This is more convenient if you don't have all files available when you start the compression process.
Basically the algorithm then becomes:
1. Add a few (say, 16) files. You can toy with this number. Store the result in, say, 'file16.zip'.
2. Add more files. When you hit 16 files again, you have to merge the two files of 16 items into a single file of 32 items.
3. Merge files until you cannot merge any more. Basically, every time you have two files of N items, you create a new file of 2*N items.
4. Go to (2).
Again, we can reason about it. The first 16 files aren't a problem, we've already established that.
We can also reason about what will happen in our program. Because we're merging 2 files into 1 file, we don't have to do as many reads and writes. In fact, if you reason about it, you'll see that you have a file of 32 entries in 2 merges, 64 in 4 merges, 128 in 8 merges, 256 in 16 merges... hey, wait, we know this sequence, it's 2^N. Again, reasoning about it, we'll find that we need approximately 500 merges -- which is much better than the 200,000 operations that we started with.
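A rough sketch of a single merge step using DotNetZip (my own illustration, not the answer's original code; sourceZipPath and targetZipPath are assumed names, and entry data is buffered in memory, which is fine for small files):
using (var source = ZipFile.Read(sourceZipPath))             // the smaller batch, e.g. "file16.zip"
using (var target = ZipFile.Read(targetZipPath))             // the archive we merge into
{
    target.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
    foreach (var entry in source.Entries.Where(e => !e.IsDirectory))
    {
        using (var ms = new MemoryStream())
        {
            entry.Extract(ms);                                // copy the entry's raw data out
            target.AddEntry(entry.FileName, ms.ToArray());    // re-add it under the same name
        }
    }
    target.Save();                                            // one rewrite per merge, not per file
}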
Hacking in the ZIP file
Yet another solution that might come to mind is to overallocate the central directory, creating slack space for future entries. However, this probably requires you to hack into the ZIP code and write your own ZIP file writer. The idea is that you basically overallocate the central directory to 200K entries before you get started, so that you can simply append in place.
Again, we can reason about it: adding a file now means adding the file and updating some headers. It won't be as fast as the original solution because you'll need random disk IO, but it'll probably work fast enough.
I haven't worked this out, but it doesn't seem overly complicated to me.
The easiest solution is the most practical
What we haven't discussed so far is the easiest possible solution: simply add all the files at once, which we can again reason about.
Implementation is quite easy, because now we don't have to do any fancy things; we can simply use the ZIP handler (I use Ionic's DotNetZip) as-is:
static void Main()
{
    try { File.Delete(@"c:\tmp\test.zip"); }
    catch { }

    var sw = Stopwatch.StartNew();

    using (var zip = new ZipFile(@"c:\tmp\test.zip"))
    {
        zip.UseZip64WhenSaving = Zip64Option.Always;
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        for (int i = 0; i < 200000; ++i)
        {
            string filename = "foo" + i.ToString() + ".txt";
            byte[] contents = Encoding.UTF8.GetBytes("Hello world!");
            zip.AddEntry(filename, contents);
        }
        zip.Save();
    }

    Console.WriteLine("Elapsed: {0:0.00}s", sw.Elapsed.TotalSeconds);
    Console.ReadLine();
}
Whoop; that finishes in 4.5 seconds. Much better.

I can see that you just want to group the 200,000 files into one big single file, without compression (like a tar archive).
Two ideas to explore:
Experiment with file formats other than Zip, as it may not be the fastest. Tar (tape archive) comes to mind (with natural speed advantages due to its simplicity); it even has an append mode, which is exactly what you are after to ensure O(1) operations. SharpCompress is a library that will allow you to work with this format (and others).
If you have control over your remote server, you could implement your own file format. The simplest I can think of would be to zip each new file separately (to store the file metadata such as name, date, etc. in the file content itself), and then append each such zipped file to a single raw-bytes file. You would just need to store the byte offsets (comma-separated in another txt file) to allow the remote server to split the huge file back into the 200,000 zipped files, and then unzip each of them to get the metadata. I guess this is also roughly what tar does behind the scenes :).
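A minimal sketch of that append-and-index idea (purely illustrative; containerPath, indexPath and sourceFile are assumed names, and each index line is "offset,length"):
static void AppendToContainer(string containerPath, string indexPath, string sourceFile)
{
    // Zip the single file on its own so its name/date metadata travels with it.
    byte[] zippedBytes;
    using (var ms = new MemoryStream())
    using (var zip = new ZipFile())
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        zip.AddFile(sourceFile, string.Empty);
        zip.Save(ms);
        zippedBytes = ms.ToArray();
    }

    // Append the zipped blob to the big container and record where it starts.
    using (var container = new FileStream(containerPath, FileMode.Append, FileAccess.Write))
    {
        long offset = container.Position;
        container.Write(zippedBytes, 0, zippedBytes.Length);
        File.AppendAllText(indexPath, offset + "," + zippedBytes.Length + Environment.NewLine);
    }
}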
Have you tried zipping to a MemoryStream rather than to a file, only flushing to a file when you are done for the day? Of course for back-up purposes your WCF service would have to keep a copy of the received individual files until you are sure they have been "committed" to the remote server.
If you do need compression, 7-Zip (and fiddling with the options) is well worth a try.
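If you do go the 7-Zip route, one simple way to drive it from C# is to shell out to the command-line tool. A rough sketch (the exe path, switches and variable names here are assumptions):
var psi = new ProcessStartInfo
{
    FileName = @"C:\Program Files\7-Zip\7z.exe",            // adjust to the local install
    Arguments = "a -t7z -mx=5 \"" + archivePath + "\" \"" + sourceFolder + "\\*\"",
    UseShellExecute = false,
    CreateNoWindow = true
};
using (var proc = Process.Start(psi))
{
    proc.WaitForExit();
}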

You are opening the file repeatedly; why not loop through and add them all to one zip, then save it?
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories)
                       .Where(s => s.EndsWith(".aes"))
                       .Select(f => new FileInfo(f));

using (var zip = ZipFile.Read(nameOfExistingZip))
{
    zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
    foreach (var additionFile in listAes)
    {
        zip.AddFile(additionFile.FullName);
    }
    zip.Save();
}
If the files aren't all available right away, you could at least batch them together. So if you're expecting 200k files but have only received 10 so far, don't open the zip, add one, then close it. Wait for a few more to come in and add them in batches.
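A hedged sketch of that batching idea (pendingFiles and batchSize are illustrative names, not part of the original service):
static readonly List<string> pendingFiles = new List<string>();
const int batchSize = 500;

static void QueueFile(string path, string nameOfExistingZip)
{
    pendingFiles.Add(path);
    if (pendingFiles.Count < batchSize) return;   // not enough yet, keep buffering

    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        foreach (var file in pendingFiles)
        {
            zip.AddFile(file);
        }
        zip.Save();                               // one rewrite per batch instead of per file
    }
    pendingFiles.Clear();
}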

If you are OK with the performance of 100 * 20,000 files, can't you simply partition your large ZIP into 100 "small" ZIP files? For simplicity, create a new ZIP file every minute and put a time stamp in the name.
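A minimal sketch of that per-minute partitioning (outputDir and incomingFile are assumed names; DotNetZip's ZipFile(string) constructor reopens the named archive if it already exists, so files arriving in the same minute land in the same small zip):
string zipName = Path.Combine(outputDir,
    "batch_" + DateTime.UtcNow.ToString("yyyyMMdd_HHmm") + ".zip");

using (var zip = new ZipFile(zipName))
{
    zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
    zip.AddFile(incomingFile, string.Empty);
    zip.Save();
}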

You can zip all the files using the .NET TPL (Task Parallel Library) like this:
// Fragment from the linked article; sourceStream, sliceBytes, tasks, taskCounter,
// CompressStreamP, listOfMemStream and eventSignal are all defined there.
while (0 != (read = sourceStream.Read(bufferRead, 0, sliceBytes)))
{
    tasks[taskCounter] = Task.Factory.StartNew(() =>
        CompressStreamP(bufferRead, read, taskCounter, ref listOfMemStream, eventSignal));
    eventSignal.WaitOne(-1);               // wait for the task to pick up its slice before reusing the captured variables
    taskCounter++;
    bufferRead = new byte[sliceBytes];     // fresh buffer so the next read doesn't clobber the task's data
}
Task.WaitAll(tasks);
There is a compiled library and source code here:
http://www.codeproject.com/Articles/49264/Parallel-fast-compression-unleashing-the-power-of

Related

SharpCompress & LZMA2 7z archive - very slow extraction of specific file. Why? Alternatives?

I have a 7zip archive created with LZMA2 compression (compression level: ultra).
The archive contains 1,749 files, which in total originally had a size of 661 MB.
The zipped file is 39 MB in size.
Now I'm trying to use C# to extract a tiny (~200 KB) single file from this archive.
I'm getting the corresponding IArchiveEntry from the IArchive (which works relatively fast),
but then calling IArchiveEntry.WriteToFile(targetPath) takes around 33 seconds! And similarly long if I write to a memory stream instead. (edit: When I'm running this on a 7z LZMA2 archive with compression level = normal, it still takes 9 seconds)
When I'm opening the same archive in the actual 7zip application and extract the same file from there, it takes around 2-3 seconds only.
I suspected it's some sort of multicore (7zip) vs single core (SharpCompress, probably?) thing, but I don't notice any CPU usage spike during decompression with 7zip... maybe it's too fast to be noticeable, though.
Does anyone know what could be the issue for such slow speeds with SharpCompress? Am I maybe missing some setting or using a wrong factory (ArchiveFactory)?
If not - is there any C# library out there that might be significantly faster at decompressing this?
For reference, here's a sketch of how I'm using SharpCompress to extract:
private void Extract()
{
    using (var archive = GetArchive())
    {
        var entryPath = /* ... path to entry ... */;
        var entry = TryGetEntry(archive, entryPath);
        entry.WriteToFile(some_target_path);
    }
}

private IArchive GetArchive()
{
    string path = /* ... path to my .7z file ... */;
    return ArchiveFactory.Open(path);
}

private IArchiveEntry TryGetEntry(IArchive archive, string path)
{
    path = path.Replace("\\", "/");
    foreach (var entry in archive.Entries)
    {
        if (!entry.IsDirectory)
        {
            if (entry.Key == path)
                return entry;
        }
    }
    return null;
}
Update: As a temporary solution, I'm now including the 7zr.exe from the 7zip SDK in my application and running it in a new process to extract a single file, reading the process' output into a binary stream.
This works in around ~3 seconds compared to the ~33 seconds with SharpCompress. It works for now, but it's kind of ugly... so I'm still curious why SharpCompress seems to be so slow there.
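For reference, a rough sketch of what that workaround can look like (the exe location, switches and variable names are my assumptions, not the asker's actual code; 'e' extracts and '-so' streams the entry to stdout):
var psi = new ProcessStartInfo
{
    FileName = "7zr.exe",
    Arguments = "e \"" + archivePath + "\" -so \"" + entryPath + "\"",
    UseShellExecute = false,
    RedirectStandardOutput = true,
    CreateNoWindow = true
};
using (var proc = Process.Start(psi))
using (var output = File.Create(targetFilePath))
{
    proc.StandardOutput.BaseStream.CopyTo(output);   // copy the raw bytes, not text
    proc.WaitForExit();
}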
This line is the problem
foreach (var entry in archive.Entries)
The problem is described here (i.e. if there are 100 files, it decompresses the 1st file 100 times, the 2nd file 99 times, and so on).
You need to use the forward-only reader instead. See the API.
But the sample code there doesn't support 7z.
For 7z you can use archive.ExtractAllEntries(), e.g.:
var reader = archive.ExtractAllEntries();
while (reader.MoveToNextEntry())
{
    if (!reader.Entry.IsDirectory)
        reader.WriteEntryToDirectory(extractDir, new ExtractionOptions() { ExtractFullPath = false, Overwrite = true });
}
It will be much faster.
If you need all the files you could also do:
using var reader = archive.ExtractAllEntries();
reader.WriteAllToDirectory(targetPath, new ExtractionOptions() { ExtractFullPath = true, Overwrite = true });
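And if the goal is still just the one small file, a hedged sketch that combines the forward-only reader with an early exit (wantedPath and targetFilePath are illustrative names):
using (var archive = ArchiveFactory.Open(pathTo7z))
using (var reader = archive.ExtractAllEntries())
{
    while (reader.MoveToNextEntry())
    {
        if (!reader.Entry.IsDirectory && reader.Entry.Key == wantedPath)
        {
            using (var entryStream = reader.OpenEntryStream())
            using (var output = File.Create(targetFilePath))
            {
                entryStream.CopyTo(output);
            }
            break;   // entries before the target are streamed past once; later ones are never touched
        }
    }
}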

Reading 2 GB file with C# takes too much time

I have an indented text file in the following pattern:
cl /FoD:\jnks\complire_flags /c legacy\roxapi\fjord\Module.c
Note: including file: d:\jnks\e\patchlevel.f
Note: including file: d:\3_4_2_patched4\release\include\pyconfig.f
Note: including file: C:\11.0\VC\INCLUDE\io.f
Using a StreamReader, I am able to read the above file, which requires the following processing:
1) Every line starting with cl and ending with .c is a parent file.
2) All lines starting with Note and ending with .f are child files.
3) An .f file is the parent of the .f file below it if the left indent (the spaces between file: and the drive name) increases; hence pyconfig.f is a parent of io.f.
Using Entity Framework, I am writing the above data into two tables in SQL Server:
a Parent table (for the .c files only) and a Child table (for the .f files only).
My big issue here is that it takes 6 hours to read the file (using StreamReader) and another 6 hours to write it to the database (using Entity Framework). I have tried reading the whole file first and then writing it. I have also tried reading one parent .c file at a time and writing its info along with its child .f files.
The file size might increase in the future to 5 GB, so I would really appreciate help in achieving better performance.
Below is a part of my read logic:
while (!isEndOfFile)
{
    // Read the next line conditionally
    if (readNextLine)
    {
        if (inputFile.Peek() > -1)
        {
            line = inputFile.ReadLine();
        }
        else
        {
            isEndOfFile = true;
            continue;
        }
    }

    // Get the name of the CPP file - condition is that it starts with cl
    if (isCPPFile(line))
    {
        // Regular expression match to extract the CPP file name
        Match match = cppFilePathRegex.Match(line);
        if (match.Success)
        {
            cppFileName = match.Value;
            addFileDetails = true;
        }
        readNextLine = true;
    }
    // Check if the line meets the header condition - "Note: including file:" - and we have a parent CPP file
    else if (addFileDetails && isHeaderFile(line))
    {
        // do something
    }
    // ... (rest of the loop omitted in the question)
}
1) Go and read "why GNU grep is fast". It gives a number of hints on how to process input text files fast, specifically when looking for patterns.
2) Use SqlBulkCopy to transfer the data into SQL Server. EF is definitely not an appropriate solution for bulk import (a sketch follows below).
But if I were you, I would do a del /q /s on my entire import solution and start from scratch using SQL Server Integration Services. SSIS is a dedicated solution for your task; it contains countless optimizations around file reads, record access, buffering, cache access and, ultimately, database writes.
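A hedged sketch of the SqlBulkCopy route (the table name and variable names are illustrative; the parsed rows are assumed to be collected into a DataTable whose columns match the destination table):
using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "ChildFiles";   // assumed table name
    bulk.BatchSize = 10000;                     // flush to the server in chunks
    bulk.WriteToServer(parsedRowsTable);        // parsedRowsTable is a DataTable built by the parser
}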
It seems the time was drastically reduced (to almost 2 hours) when I split the file before processing it.
The file is in a tree structure, so it has to be processed line by line, but I can split it at points where a certain character occurs to signify a new tree.
If I read a block at the new character, instead of splitting, the 2 GB file still eats up a lot of memory.
I have split the file using the following PowerShell and will later see how I can call both the PowerShell and my C# application (for processing and DB insertion). I am still working on reducing the time, but please find my PowerShell below for reference.
# My PowerShell
$Path = "D:\Parser\Test\"                        # path of input file
$PathSplit = "D:\Parser\Test\Cpp\"               # path of output
$InputFile = (Join-Path $Path "input_file.txt")  # input filename
$Reader = New-Object System.IO.StreamReader($InputFile)
$N = 1
While (($Line = $Reader.ReadLine()) -ne $null)
{
    If (($Line -match "^[cl].*") -and ($Line -match "/Fo"))
    {
        $OutputFile = $matches + $N + ".txt"
        Add-Content (Join-Path $PathSplit $OutputFile) $Line
        $N++
    }
}

What would cause files being added to a zip file to not be included in the zip?

I work with a program that takes large amounts of data, turns the data into XML files, then takes those XML files and zips them for use in another program. Occasionally, during the zipping process, one or two XML files get left out. It is fairly rare, once or twice a month, but when it does happen it's a big mess. I am looking for help figuring out why the files don't get zipped and how to prevent it. The code is straightforward:
public string AddToZip(string outfile, string toCompress)
{
    if (!File.Exists(toCompress)) throw new FileNotFoundException("Could not find the file to compress", toCompress);

    string dir = Path.GetDirectoryName(outfile);
    if (!Directory.Exists(dir))
    {
        Directory.CreateDirectory(dir);
    }

    // The program that gets this data can't handle files over
    // 20 MB, so it splits it up into two or more files if it hits the
    // limit.
    if (File.Exists(outfile))
    {
        FileInfo tooBig = new FileInfo(outfile);
        int converter = 1024;
        float fileSize = tooBig.Length / converter; // bytes to KB
        fileSize = fileSize / converter;            // KB to MB
        int limit = CommonTypes.Helpers.ConfigHelper.GetConfigEntryInt("zipLimit", "19");
        if (fileSize >= limit)
        {
            outfile = MakeNewName(outfile);
        }
    }

    using (ZipFile zf = new ZipFile(outfile))
    {
        zf.AddFile(toCompress, "");
        zf.Save();
    }

    return outfile;
}
Ultimately, what I want to do is add a check that sees whether any XML files weren't added to the zip after the zip file is created, but stopping the problem in its tracks would be best overall. Thanks for the help.
Make sure you have that code inside a try...catch statement. Also make sure that, if you have done that, you do something with the exception. It would not be the first case that has this type of exception handling:
try
{
    // ...
}
catch { }
Given the code above, if any exception occurs in your process, you will never notice it.
It's hard to judge from this function alone; here's a list of things that can go wrong:
- The toCompress file can be gone by the time zf.AddFile is called (but after the Exists test). Test the return value or add exception handling to detect this.
- The zip outfile can be just below the size limit; adding a new file can make it go over the limit.
- AddToZip() may be called concurrently, which may cause adding to fail.
How is removal of the toCompress file handled? I think adding function-scope locking to AddToZip() might also be a good idea.
This could be a timing issue. You are checking whether outfile is too big before trying to add the toCompress file. What you should be doing is the following (a sketch appears at the end of this answer):
1) Add toCompress to outfile.
2) Check whether adding the file made outfile too big.
3) If outfile is now too big, remove toCompress, create a new outfile, and add toCompress to the new outfile.
I suspect that you occasionally have an outfile that is just under the limit, but adding toCompress puts it over. Then the receiving program does not process outfile because it is too big.
I could be completely off base, but it is something to check.
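A hedged sketch of that add-first-then-check ordering, reusing the names from the question (the rollback branch is illustrative and untested):
public string AddToZipThenSplit(string outfile, string toCompress, int limitMb)
{
    using (ZipFile zf = new ZipFile(outfile))
    {
        zf.AddFile(toCompress, "");
        zf.Save();
    }

    long sizeMb = new FileInfo(outfile).Length / (1024 * 1024);
    if (sizeMb < limitMb)
        return outfile;                            // still under the limit, we're done

    // Over the limit: take the entry back out and start a new archive with it.
    using (ZipFile zf = ZipFile.Read(outfile))
    {
        zf.RemoveEntry(Path.GetFileName(toCompress));
        zf.Save();
    }

    string newOutfile = MakeNewName(outfile);      // MakeNewName comes from the question
    using (ZipFile zf = new ZipFile(newOutfile))
    {
        zf.AddFile(toCompress, "");
        zf.Save();
    }
    return newOutfile;
}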

Would reading/writing a file take the same time as moving the file?

I need to make modifications to a text file and keep the original intact. Right now I am using a reader/writer, reading the file and then writing it back sans the modifications.
Unfortunately the text files are huge, ~2 GB, and are taking about an hour to complete (as the text files are on a network drive).
Would using File.Move be faster than reading/writing? For example, move the text files to the local machine, do the modifications, then move them back?
Or, make a copy of the original and go through and modify it that way, instead of having to read/write?
My current code:
try
{
    using (StreamReader reader = new StreamReader(filePath))
    {
        using (StreamWriter writer = new StreamWriter(output))
        {
            // go through the whole txt file
            while (!reader.EndOfStream)
            {
                // gets the line
                line = reader.ReadLine();
                if (!lineMatchesModification(line)) // placeholder for the modification check
                {
                    writer.WriteLine(line);
                }
            }
        }
    }
}
catch { /* error handling omitted */ }
Any help would be appreciated!
With attached network drives it's always faster to copy the file locally, do whatever you need, and copy it back. That way you benefit from caching and read-ahead (pre-caching), which is implemented at the hardware level of the disk drive. Network throughput is roughly limited to Network_T_Rate / 10, expressed in megabytes per second. So, for example, if you have 100BASE-T, as in most corporate networks, you'll get ~10 MB/s whether you read or write, because the disk on the other end is faster than the network. In other words, your bottleneck is the network.
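A minimal sketch of that copy-locally-first flow (networkPath, outputNetworkPath and Modify are illustrative names; Modify stands for the existing read/write loop from the question):
string localCopy = Path.Combine(Path.GetTempPath(), Path.GetFileName(networkPath));
string localOutput = localCopy + ".out";

File.Copy(networkPath, localCopy, true);           // pull the file over the network once
Modify(localCopy, localOutput);                    // run the existing line-by-line pass locally
File.Copy(localOutput, outputNetworkPath, true);   // push the result back once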

C# Is opening and reading from a Stream slow?

I have 22k text (rtf) files which I must append to one final one.
The code looks something like this:
using (TextWriter mainWriter = new StreamWriter(mainFileName))
{
    foreach (string currentFile in filesToAppend)
    {
        using (TextReader currentFileReader = new StreamReader(currentFile))
        {
            string fileContent = currentFileReader.ReadToEnd();
            mainWriter.Write(fileContent);
        }
    }
}
Clearly, this opens a stream 22k times to read from the files.
My questions are:
1) in general, is opening a stream a slow operation? Is reading from a stream a slow operation?
2) is there any difference if I read the file as byte[] and append it as byte[] rather than using the file text?
3) any better ideas to merge 22k files?
Thanks.
1) in general, is opening a stream a slow operation?
No, not at all. Opening a stream is blazing fast, it's only a matter of reserving a handle from the underlying Operating System.
2) is there any difference if I read the file as byte[] and append it as byte[] than using the file text?
Sure, it might be a bit faster, rather than converting the bytes into strings using some encoding, but the improvement would be negligible (especially if you are dealing with really huge files) compared to what I suggest you in the next point.
3) any ways to achieve this better? (merge 22k files)
Yes, don't load the contents of every single file in memory, just read it in chunks and spit it to the output stream:
using (var output = File.OpenWrite(mainFileName))
{
    foreach (string currentFile in filesToAppend)
    {
        using (var input = File.OpenRead(currentFile))
        {
            input.CopyTo(output);
        }
    }
}
The Stream.CopyTo method from the BCL will take care of the heavy lifting in my example.
Probably the best way to speed this up is to make sure that the output file is on a different physical disk drive than the input files.
Also, you can get some increase in speed by creating the output file with a large buffer. For example:
// BufferSize is a constant you pick, e.g. 64 * 1024
using (var fs = new FileStream(filename, FileMode.Create, FileAccess.Write, FileShare.None, BufferSize))
{
    using (var mainWriter = new StreamWriter(fs))
    {
        // do your file copies here
    }
}
That said, your primary bottleneck will be opening the files. That's especially true if those 22,000 files are all in the same directory. NTFS has some problems with large directories. You're better off splitting that one large directory into, say, 22 directories with 1,000 files each. Opening a file from a directory that contains tens of thousands of files is much slower than opening a file in a directory that has only a few hundred files.
What's slow about reading data from a file is that you aren't just moving electrons around, which can propagate a signal at speeds that are... really fast. To read information in files you have to actually spin metal disks around and use magnets to read data off of them. These disks spin far slower than electrons can propagate signals through wires. Regardless of what mechanism you use in code to tell these disks to spin around, you're still going to have to wait for them to go a-spinnin', and that's going to take time.
Whether you treat the data as bytes or text isn't particularly relevant, no.
