Concatenate gzipped byte arrays in C#

I have gzipped data which is stored in a DB.
Is there a way to concatenate, say, 50 separate pieces of gzipped data into one gzipped output that can be decompressed? The result should be the same as decompressing those 50 items, concatenating them, and then gzipping the result.
I would like to avoid the decompression phase.
Is there also a performance benefit to merging already-gzipped data instead of gzipping the whole byte array?

I would assume that merely concatenating files in a zipped format would prove disastrous, as the compression algorithm has been run on the specific content of each file. I think you would have to unzip everything, concatenate, then zip again.

Yes, you can concatenate gzip streams, which when decompressed give you the same thing as if you had concatenated the uncompressed data and gzipped it all at once. Specifically:
gzip a
gzip b
cat a.gz b.gz > c.gz
gunzip c.gz
will give you the same c as:
cat a b > c
However, compression will be degraded compared to gzipping the whole thing at once, especially if each of your 50 pieces is small, e.g. less than several tens of kilobytes. The compressed result will always be different, and a little or a lot larger depending on the size of the pieces.
The comment in another answer about GZipStream should be heeded. I also recommend that you use DotNetZip instead.
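For illustration, a minimal sketch of the concatenation itself, assuming the pieces are already in memory as byte arrays (the class and method names are made up for this example). Whether GZipStream decompresses past the first gzip member varies by framework version, which is exactly the caveat raised below, so verify on your target runtime or do the decompression with DotNetZip.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;

    static class GzipConcat
    {
        // Concatenate raw gzip members without decompressing anything.
        // The result is a valid multi-member gzip stream.
        public static byte[] Concatenate(IEnumerable<byte[]> gzippedPieces)
        {
            using (var output = new MemoryStream())
            {
                foreach (var piece in gzippedPieces)
                    output.Write(piece, 0, piece.Length);
                return output.ToArray();
            }
        }

        // Decompress the combined stream back into the concatenated original bytes.
        // Caveat: some GZipStream versions stop after the first gzip member.
        public static byte[] Decompress(byte[] gzipped)
        {
            using (var input = new MemoryStream(gzipped))
            using (var gz = new GZipStream(input, CompressionMode.Decompress))
            using (var output = new MemoryStream())
            {
                gz.CopyTo(output);
                return output.ToArray();
            }
        }
    }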

GZipStream is buggy, and decompressing a gzip file which itself has multiple gzip members is especially buggy. Not all of the gzip bugs have been ironed out even in .NET 4.5.
Furthermore, consider how each gzip stream was created, i.e. is it BGZF ("Blocked GNU Zip Format")? That complicates the issue at hand.
Furthermore, the resulting gzip file can be bigger than if you had concatenated all the uncompressed individual files and gzipped them in one pass (gzip isn't a particularly strong compression format).
I recommend you use DotNetZip instead, if it isn't too late.
GZipStream is not really built to handle multiple files; however, you can use System.IO.BinaryWriter and System.IO.BinaryReader to gain full control, although it can get messy. DotNetZip just works! It is designed to handle multiple files.
P.S. GZipStream works for file sizes up to 8 GB with .NET 4, although earlier versions have a lower limit, e.g. GZipStream works for file sizes up to 4 GB with .NET 3.5.
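To make the BinaryWriter/BinaryReader suggestion concrete, here is a hedged sketch of one way to frame multiple gzip members with length prefixes, so a reader can pull them back out one by one instead of relying on GZipStream to walk a multi-member stream. The class and method names are illustrative, and the reader assumes a seekable stream.

    using System.Collections.Generic;
    using System.IO;

    static class GzipFraming
    {
        // Write each gzipped blob with a 4-byte length prefix.
        public static void WriteMembers(Stream destination, IEnumerable<byte[]> gzippedPieces)
        {
            var writer = new BinaryWriter(destination);
            foreach (var piece in gzippedPieces)
            {
                writer.Write(piece.Length);   // length prefix
                writer.Write(piece);          // raw gzip member
            }
            writer.Flush();
        }

        // Read the members back; assumes 'source' is seekable so Length/Position work.
        public static IEnumerable<byte[]> ReadMembers(Stream source)
        {
            var reader = new BinaryReader(source);
            while (source.Position < source.Length)
            {
                int length = reader.ReadInt32();
                yield return reader.ReadBytes(length);
            }
        }
    }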

Related

Compressing and decompressing very large files using System.IO.Compression.GZipStream

My problem can be described with the following statements:
I would like my program to be able to compress and decompress selected files.
I have very large files (20 GB+). It is safe to assume that the size will never fit into memory.
Even after compression, the compressed file might still not fit into memory.
I would like to use System.IO.Compression.GZipStream from the .NET Framework.
I would like my application to be parallel.
As I am a newbie to compression/decompression, I had the following idea of how to do it:
I could split the files into chunks and compress each of them separately, then merge them back into a whole compressed file.
Question 1 about this approach - Is compressing multiple chunks and then merging them back together going to give me the proper result, i.e. if I were to reverse the process (starting from the compressed file, back to decompressed), will I receive the same original input?
Question 2 about this approach - Does this approach make sense to you? Perhaps you could direct me towards some good reading about the topic? Unfortunately I could not find anything myself.
You do not need to chunk the compression just to limit memory usage. gzip is designed to be a streaming format, and requires on the order of 256KB of RAM to compress. The size of the data does not matter. The input could be one byte, 20 GB, or 100 PB -- the compression will still only need 256KB of RAM. You just read uncompressed data in, and write compressed data out until done.
The only reason to chunk the input as you describe is to make use of multiple cores for compression, which is a perfectly good reason for your amount of data. Then you can do exactly what you describe. So long as you combine the output in the correct order, the decompression will then reproduce the original input. You can always concatenate valid gzip streams to make a valid gzip stream. I would recommend that you make the chunks relatively large, e.g. megabytes, so that the compression is not noticeably impacted by the chunking.
Decompression cannot be chunked in this way, but it is much faster, so there would be little to no benefit even if you could. Decompression is usually I/O bound.
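A sketch of that chunk-and-merge approach, under the assumption that chunks of a few megabytes are read sequentially, gzipped in parallel, and the resulting gzip members are written out in their original order (all names here are illustrative, not a vetted implementation):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;
    using System.Linq;

    static class ParallelGzip
    {
        public static void CompressFile(string inputPath, string outputPath, int chunkSize = 4 * 1024 * 1024)
        {
            using (var input = File.OpenRead(inputPath))
            using (var output = File.Create(outputPath))
            {
                while (true)
                {
                    // Read a batch of chunks, gzip them in parallel, write them in order.
                    var chunks = ReadChunks(input, chunkSize, maxChunks: Environment.ProcessorCount).ToList();
                    if (chunks.Count == 0) break;

                    var members = chunks.AsParallel().AsOrdered().Select(GzipChunk).ToList();
                    foreach (var member in members)
                        output.Write(member, 0, member.Length);
                }
            }
        }

        static IEnumerable<byte[]> ReadChunks(Stream input, int chunkSize, int maxChunks)
        {
            for (int i = 0; i < maxChunks; i++)
            {
                var buffer = new byte[chunkSize];
                int read = input.Read(buffer, 0, chunkSize);
                if (read == 0) yield break;
                if (read < chunkSize) Array.Resize(ref buffer, read);   // short or final read
                yield return buffer;
            }
        }

        static byte[] GzipChunk(byte[] chunk)
        {
            using (var ms = new MemoryStream())
            {
                using (var gz = new GZipStream(ms, CompressionMode.Compress))
                    gz.Write(chunk, 0, chunk.Length);
                return ms.ToArray();   // one complete gzip member per chunk
            }
        }
    }

Decompression then just runs a single GZipStream (or another gzip reader) over the combined file, subject to the multi-member caveat mentioned earlier.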

Add data to a zip archive until it reaches a given size

I'm trying to zip some data, but also break up the data set into multiple archives so that no single zip file ends up being larger than some maximum.
Since my data is not sourced from the file system, it seems a good idea to use a streaming approach. I thought I could simply write one atomic piece of data at a time while keeping track of the stream position prior to writing each piece. Once I exceed the limit, I truncate the stream to the position before the piece that can't fit, and move on to create the next archive.
I've attempted this with the classes in System.IO.Compression - create an archive, create an entry, use ZipArchiveEntry.Open to get a stream, and write to that stream. The problem is that it seems impossible to tell from this how large the archive is at any point.
I can read the stream's position, but this is tracking uncompressed bytes. Truncating the stream works fine too, so I have this working as intended now with the important exception that the limit applies to how much uncompressed data there is per archive rather than how large the compressed archive becomes.
The data is part compressible text and various blobs (attachments from end users) that will sometimes be very compressible and sometimes not at all.
My questions:
1) Is there something about the deflate algorithm that inherently conflicts with my approach? I know it is a block-based compression scheme and I imagine the algorithm may not decide how to encode the compressed data until the entire archive has been specified.
2) If the answer to (1) above is "yes", what is a good strategy that doesn't introduce far too much overhead?
One idea I have is to assume that compressed data won't be larger than uncompressed data. I can then write to the stream until uncompressed data exceeds the threshold, and then save the archive, calculate the difference between the threshold and the current size, and repeat until full.
In case that wasn't clear, say the limit is 1MB. I write 1 MB of uncompressed data and save the archive. I then see that the resulting archive is 0.3MB. I open the archive (and its only entry) again and start over with a new limit of 0.7 MB, since I know I am able to add at least that much uncompressed data to it without overshooting. I imagine this approach is relatively simple to implement, and will test it, but am interested to hear if anyone has better ideas.
You can find out a lower bound on how big the compressed data is by looking at the Length or Position of the underlying FileStream. You can then decide to stop adding entries. The ZIP stream classes do not buffer very much, probably on the order of 64 KB.
Truncating the archive at a certain point should be possible. Try flushing the ZIP stream before measuring the Position of the base stream. This is always possible in theory but the actual library you are using might not support it. Test it or look at the source.
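A sketch of that measuring approach with System.IO.Compression, with the caveat that ZipArchive only flushes an entry when the archive is disposed, so the size check here happens between pieces rather than mid-entry; the rollback step (deleting the last entry when the limit is exceeded) and all names are illustrative:

    using System.IO;
    using System.IO.Compression;   // ZipFile also needs the System.IO.Compression.FileSystem assembly

    static class SizeLimitedZip
    {
        // Returns true if adding the piece kept the archive at or under maxBytes.
        public static bool TryAddPiece(string zipPath, string entryName, byte[] piece, long maxBytes)
        {
            using (var file = new FileStream(zipPath, FileMode.OpenOrCreate))
            using (var archive = new ZipArchive(file, ZipArchiveMode.Update))
            {
                var entry = archive.CreateEntry(entryName);
                using (var stream = entry.Open())
                    stream.Write(piece, 0, piece.Length);
            }   // disposing the archive flushes the entry data and central directory to disk

            if (new FileInfo(zipPath).Length <= maxBytes)
                return true;

            // Over the limit: remove the entry we just added and report failure,
            // so the caller can start a new archive with this piece.
            using (var archive = ZipFile.Open(zipPath, ZipArchiveMode.Update))
                archive.GetEntry(entryName).Delete();
            return false;
        }
    }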

Does DeflateStream "skip" decompression if the data was not originally compressed?

I'm not familiar with the internals of DeflateStream, but I need to store files in a Vendor's DB system that uses DeflateStream on binary attachments. The first thing I noticed was that all of my files were 10-50% BIGGER after compression, but I attribute that to a less sophisticated compression algo on top of files that are already highly compressed (in this case they were all PDFs). My question however relates to the fact that when I just wrote the original file into the BLOB the Vendor's application had no problem opening it (it opened the attachments I compressed with deflate as well). Is there a header on the compressed data that tells DeflateStream that the data's not compressed and basically pass it on as-is? This is the specification; can anyone familiar with it point where this is defined - or am I off base and the vendor is doing some magic behind the scenes?
No, there is no such magic in DeflateStream.
The built-in DeflateStream exhibits a compression anomaly in which previously-compressed data actually increases in size. This has been reported to Microsoft previously, but they declined to fix the problem. It has to do with a naive implementation in DeflateStream of the DEFLATE protocol.
Ways that I know of to avoid the problem:
Use an alternative DeflateStream that does not exhibit this problem. See DotNetZip for one example.
It includes a DeflateStream that just works.
Use the broken DeflateStream, compress the stream, compare sizes, and if the "compressed" stream is larger, fall back to using the "uncompressed" stream.
If you choose the former, you still have the condition where you are compressing stuff that has already been compressed - in other words, unnecessary double compression. So you may want to look into avoiding that, regardless of which you choose.
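A sketch of that second workaround, assuming a one-byte flag in front of the payload so the reader knows whether the bytes were deflated or stored as-is (the flag convention and the names are made up for this example):

    using System.IO;
    using System.IO.Compression;

    static class MaybeDeflate
    {
        const byte Stored = 0, Deflated = 1;

        public static byte[] Compress(byte[] data)
        {
            using (var ms = new MemoryStream())
            {
                using (var deflate = new DeflateStream(ms, CompressionMode.Compress))
                    deflate.Write(data, 0, data.Length);
                byte[] compressed = ms.ToArray();

                // Fall back to the raw bytes if "compression" made things bigger.
                bool useDeflated = compressed.Length < data.Length;
                byte[] payload = useDeflated ? compressed : data;

                var result = new byte[payload.Length + 1];
                result[0] = useDeflated ? Deflated : Stored;
                payload.CopyTo(result, 1);
                return result;
            }
        }

        public static byte[] Decompress(byte[] stored)
        {
            if (stored[0] == Stored)
            {
                var raw = new byte[stored.Length - 1];
                System.Array.Copy(stored, 1, raw, 0, raw.Length);
                return raw;
            }
            using (var input = new MemoryStream(stored, 1, stored.Length - 1))
            using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
            using (var output = new MemoryStream())
            {
                deflate.CopyTo(output);
                return output.ToArray();
            }
        }
    }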
Stream compression is different from file compression. When compressing a file, it's generally possible to make multiple passes over the entire file and determine which compression scheme to use before having to commit to one. When compressing a stream, it's often necessary to start outputting data before the compression routine has processed enough data to know what compression method is going to be optimal.
This effect can be somewhat mitigated by dividing data into blocks, deciding for each block how to represent the data, and including a header at the start of each block identifying how it is stored. Unfortunately, the extra block headers will add to the size of the resulting stream. Further, many compression schemes improve in effectiveness as they process a stream; it may well be that every 1k block in a file would expand if "compressed" individually, even if compressing the whole file would result in a considerable space savings (since the compresser could e.g. build up a dictionary of common byte sequences). It would be possible to design a compress/uncompress pair so that a block of data which would expand would be written out verbatim by the compresser (with a header byte indicating that's what it was), and have the uncompresser process that block the same way the compresser could have done, so as to add to the dictionary the same byte sequences that would have been added had the block been stored in "compressed" form. Such an approach would probably be a good one, though it would add considerably to the complexity of the uncompresser.
I suspect the biggest problem for DeflateStream, though, is that there may not be any way to improve the worst-case "compression" performance without producing compressed data that is incompatible with the existing "uncompress" code. Suppose one has a string of bytes Q, and one needs a sequence of bytes which, when fed to the "uncompress" code that shipped with .net 2.0, will yield that same sequence. It may well be that for some possible values of Q, there are no such input sequences which aren't a lot bigger than Q. If that's the case, there's no way Microsoft could "fix" the problem without a time machine.
It all depends on how the DEFLATE stream was created.
DEFLATE supports a "non-compressed block" (BTYPE=00) and all data in this block, should it be utilized, is stored verbatim with no compression -- just the block header, length, and raw data. However, a stream can be a valid DEFLATE stream and contain zero (or not enough) "non-compressed" blocks even if this resulted in a sub-par compression ratio.
The overall compression ratio will depend upon the data, compressor algorithm/implementation, and amount of effort it puts into performing the compression.
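As a quick experiment to illustrate the stored-block behaviour (not part of the original answer): on .NET 4.5+, DeflateStream accepts a CompressionLevel, and CompressionLevel.NoCompression emits only stored blocks, so random or already-compressed input grows by just a few header bytes per block rather than by a larger amount:

    using System;
    using System.IO;
    using System.IO.Compression;

    class StoredBlockDemo
    {
        static byte[] Deflate(byte[] data, CompressionLevel level)
        {
            using (var ms = new MemoryStream())
            {
                using (var ds = new DeflateStream(ms, level))
                    ds.Write(data, 0, data.Length);
                return ms.ToArray();
            }
        }

        static void Main()
        {
            var incompressible = new byte[1024 * 1024];
            new Random(42).NextBytes(incompressible);   // random bytes do not compress

            Console.WriteLine("stored blocks: {0} bytes", Deflate(incompressible, CompressionLevel.NoCompression).Length);
            Console.WriteLine("default level: {0} bytes", Deflate(incompressible, CompressionLevel.Optimal).Length);
        }
    }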
Happy coding.

Is it possible to recover corrupted Zlib data beyond the corrupted section?

I have a Zip archive with a large (important) file that will not extract. All Zip utilities that I've tried, including those that claim to recover/fix broken Zip archives are unable to extract the file containing the corrupted zlib compressed data. They get all the files in the archive except for the damaged entry, which gets skipped.
I've written a small utility app in C# that parses the zip archive, identifies each entry and parses out the fields, decrypts the data sections, and then decompresses them using a DeflateStream (from a .Net implementation of zlib). Everything works fine until I get to the damaged entry. The damaged entry decrypts successfully and fully (using AES in CTR mode), but the DeflateStream reader is only able to get through about 40MB into the decrypted data before throwing "Bad state (oversubscribed dynamic bit lengths tree)".
Is it possible to somehow 'seek' past the damaged section and continue decompressing the data? I'd like to recover as much of the file as possible, even if there are some holes. The DeflateStream doesn't implement a Seek method, and if I attempt to create a new DeflateStream with the underlying FileStream positioned to the last read position, it throws the same "Bad State" exception.
The deflate protocol adapts its tables based on a sliding window over the forward stream.
It is not made of standalone blocks - each section is not a self-contained unit of data, so there is no way to just "skip" a bad block - rather, a moving "view" of the stream is used to compute/restore the table information. That being said, my simple conclusion: not realistically possible :(
See An Explanation of the "Deflate" Protocol.
Less-than-grumpy data restorations.
The deflate protocol actually has 'blocks' which allow switching the compression method used. However, I doubt they are usable for any sort of recovery. It is a far cry from MPEG-4, which is actually rather recoverable.
From RFC 1951, which shows how complicated it would be:
Note that the header bits do not necessarily begin on a byte boundary, since a block does not necessarily occupy an integral number of bytes.
And, on the LZ77 compressor being able to span individual blocks (it is the entirety of the stream within a window that is used to build the compression information):
Note that a duplicated string reference may refer to a string in a previous block; i.e., the backward distance may cross one or more block boundaries.
However, there is a hint of hope:
The compressor terminates a block when it determines that starting a new block with fresh trees would be useful, or when the block size fills up the compressor's block buffer.
Not a ray of light I would chase, though :-/ Perhaps individual bit-seeking (not all bit sequences are valid block starts), then running the algorithm until it fails (provably "invalid" deflate), then backing off and trying again until a given starting bit doesn't generate invalid states within some 'acceptable' period.
This method would require modifications to a deflate protocol engine -- it is not a task a normal deflate stream processor could handle.
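For what it's worth, a crude byte-granular sketch of that brute-force probing idea; as the answer says, real recovery would need a modified bit-level inflater, and anything recovered mid-stream will be missing its back-references, so treat any output as fragments at best (all names are illustrative):

    using System.IO;
    using System.IO.Compression;

    static class DeflateSalvage
    {
        // Try restarting inflation at a given byte offset past the damage and keep
        // whatever decodes without throwing. Most offsets will fail immediately.
        public static byte[] TryInflateFrom(byte[] damaged, int startOffset)
        {
            try
            {
                using (var input = new MemoryStream(damaged, startOffset, damaged.Length - startOffset))
                using (var inflater = new DeflateStream(input, CompressionMode.Decompress))
                using (var output = new MemoryStream())
                {
                    inflater.CopyTo(output);
                    return output.ToArray();
                }
            }
            catch (InvalidDataException)
            {
                return new byte[0];
            }
        }
    }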

XML compression metrics

I have a client/server application that sends XML over TCP/IP from the client to the server, which then broadcasts it out to other clients. How do I know the minimum size of XML that would warrant a performance improvement by compressing the XML rather than sending it over the regular stream?
Are there any good metrics on this or examples?
XML usually compresses very well, as it tends to have a lot of repetition.
Another option would be to swap to a binary format; BinaryFormatter or NetDataContractSerializer are simple options, but both are notoriously incompatible (for example, with Java) compared with XML.
Another option would be a portable binary format such as Google's "protocol buffers". I maintain a .NET/C# version of this called protobuf-net. This is designed to be side-by-side compatible with regular .NET approaches (such as XmlSerializer / DataContractSerializer), but is much smaller than XML, and requires significantly less processing (CPU etc.) for both serialization and deserialization.
This page shows some numbers for XmlSerializer, DataContractSerializer and protobuf-net; I thought it included stats with/without compression, but they seem to have vanished...
[update] I should have said - there is a TCP/IP example in the QuickStart project.
A loose metric would be to compress anything larger than a single packet, but that's just nitpicking.
There is no reason to refrain from using a binary format internally in your application - no matter how much time compression will take, the network overhead will be several orders of magnitude slower than compressing (unless we're talking about very slow devices).
If these two suggestions don't put you at ease, you can always benchmark to find the spot to compress at.
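If you do benchmark, a minimal sketch that gzips XML payloads of increasing size and reports the compressed size and time taken; the payload generator is a stand-in, so substitute captures of your real messages:

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.IO.Compression;
    using System.Linq;
    using System.Text;

    class CompressionBenchmark
    {
        static byte[] Gzip(byte[] data)
        {
            using (var ms = new MemoryStream())
            {
                using (var gz = new GZipStream(ms, CompressionMode.Compress))
                    gz.Write(data, 0, data.Length);
                return ms.ToArray();
            }
        }

        static void Main()
        {
            // Stand-in payload; replace with your actual XML traffic.
            var xml = "<order><id>42</id><customer>Example Name</customer></order>";
            foreach (int repeats in new[] { 1, 10, 100, 1000 })
            {
                byte[] payload = Encoding.UTF8.GetBytes(string.Concat(Enumerable.Repeat(xml, repeats)));

                var sw = Stopwatch.StartNew();
                byte[] compressed = Gzip(payload);
                sw.Stop();

                Console.WriteLine("{0,8} bytes -> {1,8} bytes in {2} ms",
                    payload.Length, compressed.Length, sw.ElapsedMilliseconds);
            }
        }
    }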
By all means compress it always.
It will save you bandwidth for anything with more than 2 tags.
To decide whether compression has any benefit for you, you need to run some tests using the actual or expected amount and kind of data you expect will flow through your system.
Hope this helps.
In the tests that we did, we found a huge benefit; however, be aware of the CPU implications.
On one project that I worked on, we were sending large amounts of XML data (> 10 MB) to clients running .NET. (I'm not recommending this as a way to do things, it's just the situation we found ourselves in!) We found that as XML files got sufficiently large, the Microsoft XML libraries were unable to parse them (the machines ran out of memory, even machines with more than 1 GB). Changing the XML parsing libraries eventually helped, but before we did that we enabled GZIP compression on the data we transferred, which helped us parse the large documents.
On our two Linux-based WebSphere servers we were able to generate the XML and then gzip it fairly easily. I think that with 50 users doing this concurrently (loading about 10 to 20 of these files) we were able to handle it at about 50% CPU. The compression of the XML seemed to be better handled (i.e. parsing/CPU time) on the servers than on the .NET GUIs, but this was probably due to the above inadequacies of the Microsoft XML libraries being used. As I mentioned, there are better libraries available that are faster and use less memory.
In our case, we got massive improvements in size too -- we were compressing 50 MB XML files in some cases down to about 10 MB. This obviously helped network performance too.
Since we were concerned about the impact, and whether this would have other consequences (our users seemed to do things in large waves, so we were concerned we'd run out of CPU), we had a config variable which we could use to turn gzip on/off. I'd recommend that you do this too.
Another thing: we also zipped XML files before persisting them in databases, and this saved about 50% space (XML files ranging from a few KB to a few MB, but mostly fairly small). It's probably easier to compress everything than to pick a specific size threshold for deciding when to compress.
