OutOfMemoryException when adding large number of files in ZipFile - c#

I am running into an OutOfMemoryException when I add a large number of files to a ZipFile. The sample code is below:
ZipFile file = new ZipFile("E:\\test1.zip");
file.UseZip64WhenSaving = Zip64Option.AsNecessary;
file.ParallelDeflateThreshold = -1;
for (Int64 i = 0; i < 1000000; i++)
{
file.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
byte[] data = Encoding.ASCII.GetBytes("rama");
ZipEntry entry = file.AddEntry(@"myFolder1/test1/myhtml111.html" + i.ToString(), data);
}
file.Save();
I have downloaded the source code of the Ionic.Zip library and I see that every Add*() method (AddEntry(), AddFile(), etc.) adds an item to a dictionary called _entry.
This dictionary does not get cleared when we call Save() or Dispose() on the ZipFile object.
I suspect this is the root cause of the OutOfMemoryException.
How do I overcome this issue? Is there any other way to achieve the same result without running into OutOfMemoryException? Am I missing something?
I am open to using other open Source libraries too.

The dictionary holding the internal structure of the archive shouldn't be a problem.
Assuming your entry 'path' is a string of about 50 bytes, even 1,000,000 entries should amount to about 50 MB - a lot, but nowhere near the 2 GB limit. I haven't bothered checking the size of a ZipEntry, but I doubt it is large enough to matter either (each would need to be around 2 KB).
I also think your expectation that this entry dictionary should be cleared is wrong. Since it is the informational structure describing the contents of the zip file, it needs to hold all the entries.
From this point on I am going to assume that the posted code:
byte[] data = Encoding.ASCII.GetBytes("rama");
is a placeholder for actual file data in bytes (since 1M x 4 bytes should be under 4 MB).
The most likely issue here is that the declared byte[] data remains in memory until the entire ZipFile is disposed.
It makes sense for the library to keep this array until the data is saved.
The simplest way to work around this is to wrap the ZipFile in a using block, re-opening and closing the archive for every file you want to add.
var zipFileName = "E:\\test1.zip";
for (int i = 0; i < 1000000; ++i)
{
using (ZipFile zf = new ZipFile(zipFileName))
{
byte[] data = File.ReadAllBytes(file2Zip); // file2Zip: path of the file being added this iteration
ZipEntry entry = zf.AddEntry(@"myFolder1/test1/myhtml111.html" + i.ToString(), data);
zf.Save();
}
}
This approach might seem wasteful if you are saving a lot of small files, but since you are using byte[] directly, it would be quite simple to implement a buffering mechanism.
While it is true that you could sidestep this issue by compiling to 64-bit, unless you are really just barely going over the 2 GB limit, that would create a very memory-hungry app.
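One way to sketch that buffering idea (assuming the same DotNetZip API as above; the batch size is an illustrative tuning knob, not something from the original post) is to re-open the archive once per batch of entries rather than once per entry, so only one batch of byte[] payloads is alive at a time:

```csharp
using System.Text;
using Ionic.Zip;
using Ionic.Zlib;

class BatchedZipWriter
{
    const int BatchSize = 1000; // tune: bigger batches mean fewer saves but more memory

    static void Main()
    {
        var zipFileName = "E:\\test1.zip";
        for (int start = 0; start < 1000000; start += BatchSize)
        {
            // Disposing and re-opening here releases the entry data buffered
            // by the previous batch instead of holding all 1M payloads at once.
            using (var zf = new ZipFile(zipFileName))
            {
                zf.UseZip64WhenSaving = Zip64Option.AsNecessary;
                zf.CompressionLevel = CompressionLevel.None;
                for (int i = start; i < start + BatchSize; i++)
                {
                    byte[] data = Encoding.ASCII.GetBytes("rama"); // placeholder payload
                    zf.AddEntry(@"myFolder1/test1/myhtml111.html" + i, data);
                }
                zf.Save();
            }
        }
    }
}
```

This keeps peak memory proportional to the batch size rather than to the total number of files.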


C# System.OutOfMemoryException when Creating New Array

I am reading files into an array; here is the relevant code. A new DiskReader is created for each file, and path is determined using an OpenFileDialog.
class DiskReader{
// from variables section:
long MAX_STREAM_SIZE = 300 * 1024 * 1024; //300 MB
FileStream fs;
public Byte[] fileData;
...
// Get the file size, check it is within the allowed MAX_STREAM_SIZE, start the process including a progress bar.
using (fs = File.OpenRead(path))
{
if (fs.Length < MAX_STREAM_SIZE)
{
long NumBytes = (fs.Length < MAX_STREAM_SIZE ? fs.Length : MAX_STREAM_SIZE);
updateValues[0] = (NumBytes / 1024 / 1024).ToString("#,###.0");
result = LoadData(NumBytes);
}
else
{
// Need for something to handle big files
}
if (result)
{
mainForm.ShowProgress(true);
bw.RunWorkerAsync();
}
}
...
bool LoadData(long NumBytes)
{
try
{
fileData = new Byte[NumBytes];
fs.Read(fileData, 0, fileData.Length);
return true;
}
catch (Exception e)
{
return false;
}
}
The first time I run this, it works fine. The second time I run it, sometimes it works fine, but most times it throws a System.OutOfMemoryException at
fileData = new Byte[NumBytes];
There is no obvious pattern to when it runs and when it throws an exception.
[Edit: "first time I run this" was a bad choice of words. I meant that starting the programme and opening a file is fine; I get the problem when I try to open a different file without exiting the programme. When I open the second file, I set the DiskReader to a new instance, which means the fileData array is also a new instance. I hope that makes it clearer.]
I don't think it's relevant, but although the maximum file size is set to 300 MB, the files I am using to test this are between 49 and 64 MB.
Any suggestions on what is going wrong here and how I can correct it?
If the exception is being thrown at that line only, then my guess is that you've got a problem somewhere else in your code, as the comments suggest. Reading the documentation of that exception here, I'd bet you call this function one too many times somewhere and simply go over the limit on object length in memory, since there don't seem to be any problem spots in the code that you posted.
You can get the file's length from FileInfo without opening a stream at all. Try doing something like
byte[] result;
if (new FileInfo(path).Length < MAX_STREAM_SIZE)
{
result = File.ReadAllBytes(path);
}
Also, depending on your needs, you might avoid the byte array altogether and read the data directly from the file stream. That should have a much lower memory footprint.
If I understand what you want to do correctly, I have this proposal: allocate one static array of the defined MAX size at the beginning, then keep that array and only refill it with data from the next file. This way your memory usage should be absolutely fine. You just need to store the current file size in a separate variable, because the array will always have the same MAX size.
This is a common approach in systems with automatic memory management: allocating a constant amount of memory at the start and never allocating during the computation makes the program faster, because the garbage collector does not have to run as often.
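A minimal sketch of that idea (class and member names are illustrative, not from the original code): allocate the buffer once, reuse it for every file, and track how many bytes of it are valid for the current file:

```csharp
using System;
using System.IO;

class ReusableFileBuffer
{
    readonly byte[] buffer;   // allocated once, reused for every file
    int validLength;          // bytes of the buffer that belong to the current file

    public ReusableFileBuffer(int capacity) { buffer = new byte[capacity]; }

    // Returns false if the file does not fit into the fixed buffer.
    public bool Load(string path)
    {
        long size = new FileInfo(path).Length;
        if (size > buffer.Length) return false;
        using (var fs = File.OpenRead(path))
        {
            int offset = 0, read;
            // Read may return fewer bytes than requested, so loop until done.
            while (offset < size && (read = fs.Read(buffer, offset, (int)size - offset)) > 0)
                offset += read;
            validLength = offset;
        }
        return true;
    }

    public ArraySegment<byte> Data => new ArraySegment<byte>(buffer, 0, validLength);
}
```

Since the array is never reallocated, repeated opens can no longer fragment the large-object heap, which is a common cause of OutOfMemoryException well below the 2 GB limit.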

What would cause files being added to a zip file to not be included in the zip?

I work with a program that takes large amounts of data, turns the data into xml files, then takes those xml files and zips them for use in another program. Occasionally, during the zipping process, one or two xml files gets left out. It is fairly rare, once or twice a month, but when it does happen it's a big mess. I am looking for help figuring out why the files don't get zipped and how to prevent it. This code is straightforward:
public string AddToZip(string outfile, string toCompress)
{
if (!File.Exists(toCompress)) throw new FileNotFoundException("Could not find the file to compress", toCompress);
string dir = Path.GetDirectoryName(outfile);
if(!Directory.Exists(dir))
{
Directory.CreateDirectory(dir);
}
// The program that gets this data can't handle files over
// 20 MB, so it splits it up into two or more files if it hits the
// limit.
if (File.Exists(outfile))
{
FileInfo tooBig = new FileInfo(outfile);
int converter = 1024;
float fileSize = tooBig.Length / converter; //bytes to KB
fileSize = fileSize / converter; //KB to MB
int limit = CommonTypes.Helpers.ConfigHelper.GetConfigEntryInt("zipLimit", "19");
if (fileSize >= limit)
{
outfile = MakeNewName(outfile);
}
}
using (ZipFile zf = new ZipFile(outfile))
{
zf.AddFile(toCompress,"");
zf.Save();
}
return outfile;
}
Ultimately, what I want to do is have a check that sees whether any xml files weren't added to the zip after the zip file is created, but stopping the problem in its tracks would be best overall. Thanks for the help.
Make sure you have that code inside a try...catch statement, and make sure that if you do, you actually do something with the exception. It would not be the first codebase with this type of exception handling:
try
{
//...
}
catch { }
With code like the above, if any exception occurs in your process, you will never notice.
It's hard to judge from this function alone, here's a list of things that can go wrong:
- The toCompress file can be gone by the time zf.AddFile is called (but after the Exists test). Test the return value or add exception handling to detect this.
- The zip outFile can be just below the size limit; adding a new file can push it over the limit.
- AddToZip() may be called concurrently, which may cause adding to fail.
How is the removal of the toCompress file handled? I think adding function-scope locking to AddToZip() might also be a good idea.
This could be a timing issue. You are checking to see if outfile is too big before trying to add the toCompress file. What you should be doing is:
Add toCompress to outfile
Check to see if adding the file made outfile too big
If outfile is now too big, remove toCompress, create new outfile, add toCompress to new outfile.
I suspect that you occasionally have an outfile that is just under the limit, but adding toCompress puts it over. Then the receiving program does not process outfile because it is too big.
I could be completely off base, but it is something to check.
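That reordering could be sketched like this (assuming DotNetZip and the question's own MakeNewName and ConfigHelper helpers; the RemoveEntry step is an assumption about how to undo the add):

```csharp
// Sketch of the add-first, check-after approach.
public string AddToZip(string outfile, string toCompress)
{
    if (!File.Exists(toCompress))
        throw new FileNotFoundException("Could not find the file to compress", toCompress);

    // 1. Add toCompress to outfile unconditionally.
    using (ZipFile zf = new ZipFile(outfile))
    {
        zf.AddFile(toCompress, "");
        zf.Save();
    }

    // 2. Check whether adding the file made outfile too big.
    long limitBytes = CommonTypes.Helpers.ConfigHelper.GetConfigEntryInt("zipLimit", "19") * 1024L * 1024L;
    if (new FileInfo(outfile).Length >= limitBytes)
    {
        // 3. Pull the entry back out and start a fresh archive with it.
        using (ZipFile zf = ZipFile.Read(outfile))
        {
            zf.RemoveEntry(Path.GetFileName(toCompress));
            zf.Save();
        }
        outfile = MakeNewName(outfile);
        using (ZipFile zf = new ZipFile(outfile))
        {
            zf.AddFile(toCompress, "");
            zf.Save();
        }
    }
    return outfile;
}
```

This guarantees the receiving program never sees an archive that slipped over the limit because of the last file added.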

C# Loading text data into WPF control

I am searching for a very fast way of loading text content from a 1GB text file into a WPF control (ListView for example). I want to load the content within 2 seconds.
Reading the content line by line takes too long, so I think reading it as bytes will be faster. So far I have:
byte[] buffer = new byte[4096];
int bytesRead = 0;
using(FileStream fs = new FileStream("myfile.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite)) {
while((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0) {
Encoding.Unicode.GetString(buffer);
}
}
Is there any way of transforming the bytes into string lines and add those to a ListView/ListBox?
Is this the fastest way of loading file content into a WPF GUI control? There are various applications that can load file content from a 1GB file within 1 second.
EDIT: will it help by using multiple threads reading the file? For example:
var t1 = Task.Factory.StartNew(() =>
{
//read content/load into GUI...
});
EDIT 2: I am planning to use pagination/paging as suggested below, but when I scroll down or up, the file content has to be read again to get to the place being displayed, so I would like to use:
fs.Seek(bytePosition, SeekOrigin.Begin);
but would that be faster than reading line by line, in multiple threads? Example:
long fileLength = fs.Length;
long halfFile = (fileLength / 2);
FileStream fs2 = fs;
byte[] buffer2 = new byte[4096];
int bytesRead2 = 0;
var t1 = Task.Factory.StartNew(() =>
{
while((bytesRead += fs.Read(buffer, 0, buffer.Length)) < (halfFile -1)) {
Encoding.Unicode.GetString(buffer);
//convert bytes into string lines...
}
});
var t2 = Task.Factory.StartNew(() =>
{
fs2.Seek(halfFile, SeekOrigin.Begin);
while((bytesRead2 += fs2.Read(buffer2, 0, buffer2.Length)) < (fileLength)) {
Encoding.Unicode.GetString(buffer2);
//convert bytes into string lines...
}
});
Using a thread won't make it any faster (technically threads have a slight overhead, so loading may take slightly longer), though it may make your app more responsive. I don't know whether File.ReadAllText() is any faster.
Where you will have a problem, though, is data binding. Say you load your 1 GB file from a worker thread (regardless of technique); you will then have 1 GB worth of lines to data-bind to your ListView/ListBox. I recommend you don't loop around adding lines one by one to your control via, say, an ObservableCollection.
Instead, consider having the worker thread append batches of items to your UI thread where it can append the items to the ListView/ListBox per item in the batch.
This cuts down on the overhead of Invoke, which would otherwise flood the UI message pump.
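A sketch of that batched hand-off (assumes a WPF window whose ListView is bound to an ObservableCollection<string> named Lines; all names and the batch size are illustrative):

```csharp
// Worker thread reads lines and hands them to the UI thread in batches,
// so the dispatcher queue gets one message per 5,000 lines, not per line.
const int BatchSize = 5000;

void LoadFileInBackground(string path)
{
    Task.Run(() =>
    {
        var batch = new List<string>(BatchSize);
        foreach (string line in File.ReadLines(path))
        {
            batch.Add(line);
            if (batch.Count == BatchSize)
            {
                var toAppend = batch;              // hand off the full batch
                batch = new List<string>(BatchSize);
                Dispatcher.BeginInvoke(new Action(() =>
                {
                    foreach (var l in toAppend) Lines.Add(l);
                }));
            }
        }
        var rest = batch;                          // flush the final partial batch
        Dispatcher.BeginInvoke(new Action(() =>
        {
            foreach (var l in rest) Lines.Add(l);
        }));
    });
}
```

BeginInvoke rather than Invoke keeps the worker from blocking on the UI thread between batches.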
Since you want to read this fast I suggest using the System.IO.File class for your WPF desktop application.
MyText = File.ReadAllText("myFile.txt", Encoding.Unicode); // If you want to read as is
string[] lines = File.ReadAllLines("myFile.txt", Encoding.Unicode); // If you want to place each line of text into an array
Together with DataBinding, your WPF application should be able to read the text file and display it on the UI fast.
About performance, you can refer to this answer.
So use File.ReadAllText() instead of ReadToEnd() as it makes your code
shorter and more readable. It also takes care of properly disposing
resources as you might forget doing with a StreamReader (as you did in
your snippet). - Darin Dimitrov
Also, you must consider the specs of the machine that will run your application.
When you say "Reading the content line by line takes too long", what do you mean? How are you actually reading the content?
However, more than anything else, let's take a step back and look at the idea of loading 1 GB of data into a ListView.
Personally, I would use an IEnumerable to read the file, for example:
foreach (string line in File.ReadLines(path))
{
}
But more importantly, you should implement pagination in your UI and cut down what is visible and what is loaded at once. This will cut your resource use massively and make sure you have a usable UI. You can use IEnumerable methods such as Skip() and Take(), which help you use your resources efficiently (i.e. not load unused data).
You wouldn't need to use any extra threads either (aside from the background thread + UI thread), but I will suggest using MVVM and INotifyPropertyChanged to skip worrying about threading altogether.
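The Skip()/Take() idea can be sketched like this (the class name is illustrative); File.ReadLines is lazy, so the page is materialized without loading the whole file into memory:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class LinePager
{
    // Returns one page of lines without materializing the whole file.
    // Skip enumerates and discards the lines before the page rather than
    // buffering them, so memory use is bounded by the page size.
    public static List<string> GetPage(string path, int pageIndex, int pageSize)
    {
        return File.ReadLines(path)
                   .Skip(pageIndex * pageSize)
                   .Take(pageSize)
                   .ToList();
    }
}
```

Note that Skip still has to read past the earlier lines on disk; for truly random access into a 1 GB file you would need to build a byte-offset index once up front and then Seek, as the question's EDIT 2 hints.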

Add Files Into Existing Zip - performance issue

I have a WCF webservice that saves files to a folder(about 200,000 small files).
After that, I need to move them to another server.
The solution I've found was to zip them then move them.
When I adopted this solution, I tested it with 20,000 files: zipping them took only about 2 minutes, and moving the zip is really fast.
But in production, zipping 200,000 files takes more than 2 hours.
Here is my code to zip the folder :
using (ZipFile zipFile = new ZipFile())
{
zipFile.UseZip64WhenSaving = Zip64Option.Always;
zipFile.CompressionLevel = CompressionLevel.None;
zipFile.AddDirectory(this.SourceDirectory.FullName, string.Empty);
zipFile.Save(DestinationCurrentFileInfo.FullName);
}
I want to modify the WCF webservice, so that instead of saving to a folder, it saves to the zip.
I use the following code to test:
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories).Where(s => s.EndsWith(".aes")).Select(f => new FileInfo(f));
foreach (var additionFile in listAes)
{
using (var zip = ZipFile.Read(nameOfExistingZip))
{
zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
zip.AddFile(additionFile.FullName);
zip.Save();
}
file.WriteLine("Delay for adding a file : " + sw.Elapsed.TotalMilliseconds);
sw.Restart();
}
The first file added to the zip takes only 5 ms, but the 10,000th file takes 800 ms.
Is there a way to optimize this ? Or if you have other suggestions ?
EDIT
The example shown above is only for testing; in the WCF webservice, I'll have different requests sending files that I need to add to the zip file.
As WCF is stateless, I will get a new instance of my class with each call, so how can I keep the zip file open to add more files?
I've looked at your code and immediately spotted problems. A lot of software developers nowadays don't understand how stuff works under the hood, which makes it impossible to reason about it. In this particular case, you need to know how ZIP files work, so I would suggest you first read up on the format and then break down what happens under the hood.
Reasoning
Now that we're all on the same page on how they work, let's start the reasoning by breaking down how this works using your source code; we'll continue from there on forward:
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories).Where(s => s.EndsWith(".aes")).Select(f => new FileInfo(f));
foreach (var additionFile in listAes)
{
// (1)
using (var zip = ZipFile.Read(nameOfExistingZip))
{
zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
// (2)
zip.AddFile(additionFile.FullName);
// (3)
zip.Save();
}
file.WriteLine("Delay for adding a file : " + sw.Elapsed.TotalMilliseconds);
sw.Restart();
}
(1) opens a ZIP file. You're doing this for every file you attempt to add
(2) Adds a single file to the ZIP file
(3) Saves the complete ZIP file
On my computer this takes about an hour.
Now, not all of the file format details are relevant. We're looking for stuff that will get increasingly worse in your program.
Skimming over the file format specification, you'll notice that compression is based on Deflate which doesn't require information on the other files that are compressed. Moving on, we'll notice how the 'file table' is stored in the ZIP file:
You'll notice here that there's a 'central directory' which stores the files in the ZIP file. It's basically stored as a 'list'. So, using this information we can reason on what the trivial way is to update that when implementing steps (1-3) in this order:
Open the zip file, read the central directory
Append data for the (new) compressed file, store the pointer along with the filename in the new central directory.
Re-write the central directory.
Think about it for a moment: for file #1 you need 1 write operation; for file #2, you need to read (1 item), append (in memory) and write (2 items); for file #3, you need to read (2 items), append (in memory) and write (3 items). And so on. This basically means that your performance will go down the drain as you add more files. You've already observed this; now you know why.
A possible solution
Adding all files at once might not work in your use case. Another solution is to implement a merge that combines two zip files into one each time. This is more convenient if you don't have all the files available when you start the compression process.
Basically the algorithm then becomes:
Add a few (say, 16, files). You can toy with this number. Store this in -say- 'file16.zip'.
Add more files. When you hit 16 files, you have to merge the two files of 16 items into a single file of 32 items.
Merge files until you cannot merge anymore. Basically every time you have two files of N items, you create a new file of 2*N items.
Goto (2).
Again, we can reason about it. The first 16 files aren't a problem, we've already established that.
We can also reason what will happen in our program. Because we're merging 2 files into 1 file, we don't have to do as many read and writes. In fact, if you reason about it, you'll see that you have a file of 32 entries in 2 merges, 64 in 4 merges, 128 in 8 merges, 256 in 16 merges... hey, wait we know this sequence, it's 2^N. Again, reasoning about it we'll find that we need approximately 500 merges -- which is much better than the 200.000 operations that we started with.
Hacking in the ZIP file
Yet another solution that might come to mind is to overallocate the central directory, creating slack space for future entries. However, this probably requires you to hack into the ZIP code and write your own ZIP file writer. The idea is to overallocate the central directory to 200K entries before you get started, so that you can simply append in place.
Again, we can reason about it: adding file now means: adding a file and updating some headers. It won't be as fast as the original solution because you'll need random disk IO, but it'll probably work fast enough.
I haven't worked this out, but it doesn't seem overly complicated to me.
The easiest solution is the most practical
What we haven't discussed so far is the easiest possible solution: one approach that comes to mind is to simply add all files at once, which we can again reason about.
Implementation is quite easy, because now we don't have to do any fancy things; we can simply use the ZIP handler (I use ionic) as-is:
static void Main()
{
try { File.Delete(@"c:\tmp\test.zip"); }
catch { }
var sw = Stopwatch.StartNew();
using (var zip = new ZipFile(@"c:\tmp\test.zip"))
{
zip.UseZip64WhenSaving = Zip64Option.Always;
for (int i = 0; i < 200000; ++i)
{
string filename = "foo" + i.ToString() + ".txt";
byte[] contents = Encoding.UTF8.GetBytes("Hello world!");
zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
zip.AddEntry(filename, contents);
}
zip.Save();
}
Console.WriteLine("Elapsed: {0:0.00}s", sw.Elapsed.TotalSeconds);
Console.ReadLine();
}
Whop; that finishes in 4.5 seconds. Much better.
I can see that you just want to group the 200,000 files into one big single file, without compression (like a tar archive).
Two ideas to explore:
Experiment with other file formats than Zip, as it may not be the fastest. Tar (tape archive) comes to mind (with natural speed advantages due to its simplicity), it even has an append mode which is exactly what you are after to ensure O(1) operations. SharpCompress is a library that will allow you to work with this format (and others).
If you have control over your remote server, you could implement your own file format. The simplest I can think of would be to zip each new file separately (to store file metadata such as name, date, etc. in the file content itself), and then append each such zipped file to a single raw bytes file. You would just need to store the byte offsets (e.g. comma-separated in another txt file) so the remote server can split the huge file back into the 200,000 zipped files and unzip each of them to get the metadata. I guess this is also roughly what tar does behind the scenes :).
Have you tried zipping to a MemoryStream rather than to a file, only flushing to a file when you are done for the day? Of course for back-up purposes your WCF service would have to keep a copy of the received individual files until you are sure they have been "committed" to the remote server.
If you do need compression, 7-Zip (and fiddling with the options) is well worth a try.
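The MemoryStream idea could be sketched like this (assumes DotNetZip's parameterless constructor and its Save(Stream) overload; the class name is illustrative):

```csharp
using System.IO;
using Ionic.Zip;

// Keep one ZipFile instance alive across calls, add entries as they
// arrive, and only write the archive to disk when the batch is complete.
class InMemoryZipCollector
{
    readonly ZipFile zip = new ZipFile
    {
        CompressionLevel = Ionic.Zlib.CompressionLevel.None
    };

    public void Add(string entryName, byte[] content)
    {
        zip.AddEntry(entryName, content);   // cheap: no disk IO yet
    }

    public void FlushTo(string path)
    {
        using (var fs = File.Create(path))
        {
            zip.Save(fs);                   // single write of the whole archive
        }
    }
}
```

Because WCF creates a new service instance per call, the collector would have to live somewhere call-independent, such as a static field or a service running in single-instance mode, and as noted above you would keep the received files around until the flush is confirmed.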
You are opening the file repeatedly; why not loop through and add them all to one zip, then save it?
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories)
.Where(s => s.EndsWith(".aes"))
.Select(f => new FileInfo(f));
using (var zip = ZipFile.Read(nameOfExistingZip))
{
foreach (var additionFile in listAes)
{
zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
zip.AddFile(additionFile.FullName);
}
zip.Save();
}
If the files aren't all available right away, you could at least batch them together. So if you're expecting 200k files, but you only have received 10 so far, don't open the zip, add one, then close it. Wait for a few more to come in and add them in batches.
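That batching could be sketched like this (assumes DotNetZip; the queue is static so it survives across stateless WCF calls, and the batch size is an illustrative knob):

```csharp
using System.Collections.Concurrent;
using Ionic.Zip;

static class BatchingZipAppender
{
    static readonly ConcurrentQueue<string> pending = new ConcurrentQueue<string>();
    const int BatchSize = 64;

    // Called once per received file; only touches the archive once per batch.
    public static void OnFileReceived(string path, string zipPath)
    {
        pending.Enqueue(path);
        if (pending.Count < BatchSize) return;

        // Note: as written the count-check and drain are not atomic; a real
        // version would serialize this section with a lock.
        using (var zip = ZipFile.Read(zipPath))
        {
            zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
            string p;
            while (pending.TryDequeue(out p))
                zip.AddFile(p);
            zip.Save();   // one central-directory rewrite per batch, not per file
        }
    }
}
```

A timer or shutdown hook would still be needed to flush a final partial batch.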
If you are OK with performance of 100 * 20,000 files, can't you simply partition your large ZIP into a 100 "small" ZIP files? For simplicity, create a new ZIP file every minute and put a time-stamp in the name.
You can zip all the files using .Net TPL (Task Parallel Library) like this:
while(0 != (read = sourceStream.Read(bufferRead, 0, sliceBytes)))
{
tasks[taskCounter] = Task.Factory.StartNew(() =>
CompressStreamP(bufferRead, read, taskCounter, ref listOfMemStream, eventSignal)); // Line 1
eventSignal.WaitOne(-1); // Line 2
taskCounter++; // Line 3
bufferRead = new byte[sliceBytes]; // Line 4
}
Task.WaitAll(tasks); // Line 6
There is a compiled library and source code here:
http://www.codeproject.com/Articles/49264/Parallel-fast-compression-unleashing-the-power-of

C# Is opening and reading from a Stream slow?

I have 22k text (rtf) files which I must append to one final one.
The code looks something like this:
using (TextWriter mainWriter = new StreamWriter(mainFileName))
{
foreach (string currentFile in filesToAppend)
{
using (TextReader currentFileReader = new StreamReader(currentFile))
{
string fileContent = currentFileReader.ReadToEnd();
mainWriter.Write(fileContent);
}
}
}
Clearly, this opens a stream 22k times to read from the files.
My questions are :
1) in general, is opening a stream a slow operation? Is reading from a stream a slow operation ?
2) is there any difference if I read the file as byte[] and append it as byte[] than using the file text?
3) any better ideas to merge 22k files ?
Thanks.
1) in general, is opening a stream a slow operation?
No, not at all. Opening a stream is blazing fast, it's only a matter of reserving a handle from the underlying Operating System.
2) is there any difference if I read the file as byte[] and append it
as byte[] than using the file text?
Sure, it might be a bit faster, rather than converting the bytes into strings using some encoding, but the improvement would be negligible (especially if you are dealing with really huge files) compared to what I suggest you in the next point.
3) any better way to achieve this? (merge 22k files)
Yes, don't load the contents of every single file in memory, just read it in chunks and spit it to the output stream:
using (var output = File.OpenWrite(mainFileName))
{
foreach (string currentFile in filesToAppend)
{
using (var input = File.OpenRead(currentFile))
{
input.CopyTo(output);
}
}
}
The Stream.CopyTo method from the BCL will take care of the heavy lifting in my example.
Probably the best way to speed this up is to make sure that the output file is on a different physical disk drive than the input files.
Also, you can get some increase in speed by creating the output file with a large buffer. For example:
using (var fs = new FileStream(filename, FileMode.Create, FileAccess.Write, FileShare.None, BufferSize))
{
using (var mainWriter = new StreamWriter(fs))
{
// do your file copies here
}
}
That said, your primary bottleneck will be opening the files. That's especially true if those 22,000 files are all in the same directory. NTFS has some problems with large directories. You're better off splitting that one large directory into, say, 22 directories with 1,000 files each. Opening a file from a directory that contains tens of thousands of files is much slower than opening a file in a directory that has only a few hundred files.
What's slow about reading data from a file is that you aren't just moving electrons around, which can propagate a signal really fast. To read information from files you have to actually spin metal disks around and use magnets to read data off of them. These disks spin far more slowly than electrons can propagate signals through wires. Regardless of what mechanism you use in code to tell these disks to spin around, you're still going to have to wait for them to go a-spinnin', and that's going to take time.
Whether you treat the data as bytes or text isn't particularly relevant, no.
