GZipStream makes my text bigger than the original - C#

There is a post here, Compress and decompress string in c#, about compressing strings in C#.
I've implemented the same code myself, but the returned text is almost twice the size of the original :O
I've tried it on a JSON string of 87 characters, like this:
{"G":"82f88ff5-4143-46ef-86cc-a19910f4a6b5","U":"df39e3c7-ffd3-4829-a9cd-27bfcbd4403a"}
The result is 168 characters:
H4sIAAAAAAAEAC2NUQ6DIBQE5yx8l0QFqfQCnqAHqKCXaHr3jsaQ3TyYfcuXwKpeamHi0Bf9YCaSGVW6psLua5QWmifykVbPyCDJ3gube4GHet+tXZZM7Xrj6d7Z3u/W8896dVVpd5rMbCaa3k1k25M88OMPcjDew64AAAA=
I've changed Unicode to ASCII, but the result is still too big (128 characters):
H4sIAAAAAAAEAA3KyxGAMAgFwF44y0w+JAEbsAILICSvCcfedc/70EUnaYEq0FiyVJa+wdoj2LNZThDvs9FB918Xqu0ag4H1Vy3GbrG4jImYSyRVp/cDp8EZE1cAAAA=
public static string Compress(this string s)
{
    var bytes = Encoding.ASCII.GetBytes(s);
    using (var msi = new MemoryStream(bytes))
    using (var mso = new MemoryStream())
    {
        // the GZipStream must be disposed before reading mso,
        // otherwise the gzip trailer has not been written yet
        using (var gs = new GZipStream(mso, CompressionMode.Compress))
        {
            msi.CopyTo(gs);
        }
        return Convert.ToBase64String(mso.ToArray());
    }
}

Gzip is not just a compression algorithm but a complete file format, which means it adds extra structures whose size can usually be neglected.
When compressing small strings, however, these structures can blow up the overall gzip stream.
The standard gzip header, for example, is 10 bytes long, and its footer is another 8 bytes.
If you now take your gzip-compressed result in raw form (not the bloated Base64-encoded one), you will see that it is 95 bytes long.
So the 18 bytes of header and footer already make up nearly 20% of the output!
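You can see this yourself by decoding the Base64 text back to raw bytes. A small sketch, reusing the Compress extension method from the question; the byte counts are the ones reported above:

var json = "{\"G\":\"82f88ff5-4143-46ef-86cc-a19910f4a6b5\",\"U\":\"df39e3c7-ffd3-4829-a9cd-27bfcbd4403a\"}";
var base64 = json.Compress();
var raw = Convert.FromBase64String(base64);

Console.WriteLine(json.Length);    // 87 input bytes (ASCII)
Console.WriteLine(raw.Length);     // 95 raw gzip bytes (10 header + 8 footer + deflate data)
Console.WriteLine(base64.Length);  // 128 - Base64 inflates by 4/3: ceil(95 / 3) * 4 = 128

On top of the gzip framing, Base64 itself expands the raw bytes by a third, which is where the rest of the growth comes from.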

Related

Why is this ZipArchive oracle off by 5-10 bytes?

I'm using ZipArchive and I'm writing an oracle that determines the size of a zip file based on the zip specification. For simplicity, no compression is being used.
private long ZipSizeOracle(int numOfFiles, int totalLengthOfFilenames, int totalSizeOfFiles)
{
    return
        numOfFiles * (
            30  // Local file header
            +
            12  // Data descriptor
            +
            46  // Central directory file header
        )
        +
        2 * totalLengthOfFilenames  // Local file header name + Central directory file header name
        +
        totalSizeOfFiles            // Data size
        + 22                        // End of central directory record (EOCD)
        ;
}
Currently I have 4 tests. ZeroFiles correctly outputs 22 bytes, the appropriate size for an empty zip.
[TestMethod]
public void ZeroFiles()
{
    using (var memStream = new MemoryStream())
    {
        using (var archive = new ZipArchive(memStream, ZipArchiveMode.Create, true)) { }
        Assert.AreEqual(ZipSizeOracle(0, 0, 0), memStream.Length);
    }
}
One4ByteFile expects 130 bytes but the actual was 125 bytes
[TestMethod]
public void One4ByteFile()
{
    using (var memStream = new MemoryStream())
    {
        using (var archive = new ZipArchive(memStream, ZipArchiveMode.Create, true))
        {
            var entry1 = archive.CreateEntry("test.txt", CompressionLevel.NoCompression);
            using (var writer = new StreamWriter(entry1.Open()))
                writer.WriteLine("test");
        }
        Assert.AreEqual(ZipSizeOracle(1, 8, 4), memStream.Length);
    }
}
Two4ByteFiles expects 241 bytes but the actual was 231 bytes
[TestMethod]
public void Two4ByteFiles()
{
    using (var memStream = new MemoryStream())
    {
        using (var archive = new ZipArchive(memStream, ZipArchiveMode.Create, true))
        {
            var entry1 = archive.CreateEntry("test.txt", CompressionLevel.NoCompression);
            using (var writer = new StreamWriter(entry1.Open()))
                writer.WriteLine("test");

            var entry2 = archive.CreateEntry("test2.txt", CompressionLevel.NoCompression);
            using (var writer = new StreamWriter(entry2.Open()))
                writer.WriteLine("test2");
        }
        Assert.AreEqual(ZipSizeOracle(2, 17, 9), memStream.Length);
    }
}
OneFolder expects 118 bytes but the actual was 108 bytes
[TestMethod]
public void OneFolder()
{
    using (var memStream = new MemoryStream())
    {
        using (var archive = new ZipArchive(memStream, ZipArchiveMode.Create, true))
            archive.CreateEntry(@"test\", CompressionLevel.NoCompression);
        Assert.AreEqual(ZipSizeOracle(1, 4, 0), memStream.Length);
    }
}
What am I missing from the specification in order for the oracle to give me the correct file size?
You are missing the following:
The data descriptor block is optional and is included only if the zip file is written in a "streamed" manner (that is, you don't know the size of a file beforehand and write it "on the fly"). When you are streaming, the sizes of the compressed and uncompressed data, as well as the CRC, are not available when the file header is written (because the file header comes before the data), so all those bytes in the file header are set to 0 and a data descriptor block is included after the compressed data, when this information is available. In the examples you provided, the data descriptor is not included.
Level NoCompression in CreateEntry does not mean the data is included literally. Instead, the data is processed with the deflate algorithm (compression method 8 in the specification you linked) without actual compression. This deflate algorithm adds its own overhead, even in "no compression" mode:
1 byte defines whether this is the last block and the block type (stored, in this case)
2 bytes define the block size
2 bytes define the one's complement of the block size (for integrity)
then comes the data, with the size defined above
So for each block of data in the input (a stored block holds at most 2^16 - 1 bytes), 5 bytes of overhead are added. In your examples all files are smaller than that, so just 5 bytes are added for each of them.
You use writer.WriteLine, so the size of the data you write in the first example is not 4 bytes but 6, because \r\n (newline characters) are appended (and in the second example the total is 13, not 9).
If you take all this into account (remove the 12-byte data descriptor, add the 5 bytes of deflate overhead for each of your small files, and pass the correct totalSizeOfFiles), your examples will produce the expected output.
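For the seekable case, that means an oracle along these lines (a sketch under the assumptions above: every entry is smaller than one stored block, so deflate adds exactly one 5-byte block header, and no data descriptors are emitted):

private long ZipSizeOracle(int numOfFiles, int totalLengthOfFilenames, int totalSizeOfFiles)
{
    return
        numOfFiles * (
            30  // Local file header
            +
            46  // Central directory file header
            +
            5   // Deflate stored-block overhead (1 + 2 + 2 bytes per small entry)
        )
        +
        2 * totalLengthOfFilenames  // name appears in both headers
        +
        totalSizeOfFiles            // actual bytes written (WriteLine appends \r\n)
        + 22                        // End of central directory record (EOCD)
        ;
}

// ZipSizeOracle(1, 8, 6) == 125 and ZipSizeOracle(2, 17, 13) == 231, matching the tests.
// The folder test is a special case: the entry is never opened, so no deflate block is
// written, and the entry name @"test\" is 5 characters, not 4 - which is consistent
// with the observed 108 bytes (30 + 46 + 2*5 + 22).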
Update about data descriptor record. Specification says:
This descriptor SHOULD be used only when it was not possible to
seek in the output .ZIP file, e.g., when the output .ZIP file
was standard output or a non-seekable device
And ZipArchive class follows this. If you pass unseekable stream in constructor - it will emit data descriptor records. For example:
public class UnseekableStream : MemoryStream
{
    public override bool CanSeek => false;
}

using (var memStream = new UnseekableStream())
{
    using (var archive = new ZipArchive(memStream, ZipArchiveMode.Create, true))
    {
    }
}
Such unseekable streams are common in practice; an HTTP response stream is one example. But note that 12 bytes is not the only possible size for a data descriptor record:
4.3.9.3 Although not originally assigned a signature, the value
0x08074b50 has commonly been adopted as a signature value
for the data descriptor record. Implementers should be
aware that ZIP files may be encountered with or without this
signature marking data descriptors and SHOULD account for
either case when reading ZIP files to ensure compatibility.
4.3.9.4 When writing ZIP files, implementors SHOULD include the
signature value marking the data descriptor record. When
the signature is used, the fields currently defined for
the data descriptor record will immediately follow the
signature.
So a data descriptor may optionally start with a 4-byte signature, implementors are advised to include that signature when writing, and ZipArchive follows this recommendation: the data descriptor record it emits is 16 bytes (12 + 4 bytes of signature), not 12.
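So if you want the oracle to cover the unseekable case as well, the adjustment is straightforward (a sketch; ZipSizeOracleUnseekable is a hypothetical helper built on the corrected oracle above):

// Each entry gains a 16-byte data descriptor: the 4-byte 0x08074b50 signature
// plus CRC-32, compressed size, and uncompressed size (4 bytes each).
private long ZipSizeOracleUnseekable(int numOfFiles, int totalLengthOfFilenames, int totalSizeOfFiles)
{
    return ZipSizeOracle(numOfFiles, totalLengthOfFilenames, totalSizeOfFiles)
        + numOfFiles * 16;
}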

How can I find the start of a GZip stream in a MemoryStream?

byte[] httpDecompress(HttpDatagram http)
{
    int magicnum = 0x1f8b;  // note: declared but never used
    Stream str = http.Body.ToMemoryStream();
    using (var zipStream = new GZipStream(str, CompressionMode.Decompress))
    using (var resultStream = new MemoryStream())
    {
        zipStream.CopyTo(resultStream);
        return resultStream.ToArray();
    }
}
That is the code, but it gives a magic-number error. How can I find the beginning of the GZip stream? I think the source of the problem is there. Can anyone help?
Not knowing where the gzip stream starts may or may not be your problem. (In fact, probably not.) In any case, you can search for the three-byte sequence 1f 8b 08 to identify candidate gzip streams. Start decompressing from the 1f to see if it really is a gzip stream.
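A minimal sketch of that search (TryDecompressFromSignature is a hypothetical helper; it scans for the three-byte signature and tries to inflate from each candidate offset, moving on if GZipStream rejects it):

static byte[] TryDecompressFromSignature(byte[] data)
{
    for (int i = 0; i + 2 < data.Length; i++)
    {
        // gzip magic bytes 1f 8b followed by compression method 08 (deflate)
        if (data[i] != 0x1f || data[i + 1] != 0x8b || data[i + 2] != 0x08)
            continue;
        try
        {
            using (var src = new MemoryStream(data, i, data.Length - i))
            using (var gz = new GZipStream(src, CompressionMode.Decompress))
            using (var result = new MemoryStream())
            {
                gz.CopyTo(result);
                return result.ToArray();  // decompressed cleanly
            }
        }
        catch (InvalidDataException)
        {
            // false positive - keep scanning
        }
    }
    return null;  // no valid gzip stream found
}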

Decompress file with wrong size

I have a method that decompresses *.gz file:
using (FileStream originalFileStream = new FileStream(gztempfilename, FileMode.Open, FileAccess.Read))
{
    using (FileStream decompressedFileStream = new FileStream(outputtempfilename, FileMode.Create, FileAccess.Write))
    {
        using (GZipStream decompressionStream = new GZipStream(originalFileStream, CompressionMode.Decompress))
        {
            decompressionStream.CopyTo(decompressedFileStream);
        }
    }
}
It worked perfectly, but recently I received a pack of files with the wrong size:
When I open them with 7-zip they have Packed Size ~ 1,600,000 and Size = 7 (it should be ~20,000,000).
So when I extract them using this code I get only part of the file, but when I extract the same file using 7-zip I get the full file.
How can I handle this situation in my code?
My guess is that the other end makes a mistake when gzipping the files: it looks like it does not set the ISIZE bytes correctly.
The ISIZE bytes are the last four bytes of a valid GZip file and come after a 32-bit CRC value which in turn comes directly after the compressed data bytes.
7-Zip seems to be robust against such mistakes, whereas GZipStream is not. It is odd, however, that 7-Zip is not showing you any errors. It should show you (tested with 7-Zip 16.02 x64/Win7)...
CRC error in case the size is simply wrong,
"Unexpected end of data" in case some or all of the ISIZE bytes are cut off,
"There are some data after end of the payload data" in case there is more data following the ISIZE bytes.
7-Zip always uses the last four bytes of the packed file to determine the size of the original unpacked file without checking if the file is valid and whether the bytes read for that are actually the ISIZE bytes.
You can verify this by checking those last four bytes of the GZipped file with a hex viewer. For your example they should be exactly 07 00 00 00.
If you know the exact size of the unpacked original file you could replace those bytes so that they specify the correct size. For instance, if the unpacked file's size is 20,000,078, which is 01312D4E in hex (0-padded to eight digits), those bytes should be 4E 2D 31 01.
In case you don't know the exact size you can try replacing them with the maximum value, i.e. FF FF FF FF.
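A sketch of that patch (it assumes BitConverter's little-endian byte order, which matches what the gzip ISIZE field uses):

var bytes = File.ReadAllBytes(gztempfilename);

// ISIZE is the unpacked size modulo 2^32, stored little-endian in the last 4 bytes
var isize = BitConverter.GetBytes(20_000_078u);  // 4E 2D 31 01, as above
// var isize = BitConverter.GetBytes(0xFFFFFFFFu);  // if the real size is unknown

Array.Copy(isize, 0, bytes, bytes.Length - 4, 4);
File.WriteAllBytes(gztempfilename, bytes);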
After that try your unpack code again.
This is obviously only a hacky solution to your problem. Better try fixing the code that GZips the files you receive or try to find a library that is more robust than GZipStream.
I've used ICSharpCode.SharpZipLib.GZip.GZipInputStream from this library instead of System.IO.Compression.GZipStream and it helped.
Did you try checking the size like this?
byte[] bArray;
using (FileStream f = new FileStream(tempFile, FileMode.Open))
{
    bArray = new byte[f.Length];
    f.Read(bArray, 0, (int)f.Length);
}
try:
GZipStream uncompressed = new GZipStream(streamIn, CompressionMode.Decompress, true);
FileStream streamOut = new FileStream(tempDoc[0], FileMode.Create, FileAccess.Write, FileShare.None);
Looks like this is some sort of bug in GZipStream (it does not write the original file length into the end of the gz file).
You need to change the way you compress your files using GZipStream.
This way it will work:
byte[] inputBytes = Encoding.UTF8.GetBytes(output);
using (var outputStream = new MemoryStream())
{
    using (var gZipStream = new GZipStream(outputStream, CompressionMode.Compress))
        gZipStream.Write(inputBytes, 0, inputBytes.Length);

    // the GZipStream is closed here, so the gzip trailer has been written
    System.IO.File.WriteAllBytes("file.xml.gz", outputStream.ToArray());
}
And this way will cause the error you have (no matter whether you call Flush() or not):
byte[] inputBytes = Encoding.UTF8.GetBytes(output);
using (var outputStream = new MemoryStream())
{
    using (var gZipStream = new GZipStream(outputStream, CompressionMode.Compress))
    {
        gZipStream.Write(inputBytes, 0, inputBytes.Length);

        // reading outputStream here, while gZipStream is still open,
        // captures an incomplete gzip stream (no trailer yet)
        System.IO.File.WriteAllBytes("file.xml.gz", outputStream.ToArray());
    }
}
You might need to call decompressedStream.Seek() after closing the gZip stream.
As shown here.
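That is, something along these lines (a sketch; decompressedStream is presumed to be the MemoryStream the referenced answer copies into):

// after the GZipStream is closed, the position is at the end of the data -
// rewind before reading the result
decompressedStream.Seek(0, SeekOrigin.Begin);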

Convert a wav file to 8000Hz 16Bit Mono Wav

I need to convert a wav file to 8000Hz 16Bit Mono Wav. I already have a code, which works well with NAudio library, but I want to use MemoryStream instead of temporary file.
using System.IO;
using NAudio.Wave;

static void Main()
{
    var input = File.ReadAllBytes("C:/input.wav");
    var output = ConvertWavTo8000Hz16BitMonoWav(input);
    File.WriteAllBytes("C:/output.wav", output);
}

public static byte[] ConvertWavTo8000Hz16BitMonoWav(byte[] inArray)
{
    using (var mem = new MemoryStream(inArray))
    using (var reader = new WaveFileReader(mem))
    using (var converter = WaveFormatConversionStream.CreatePcmStream(reader))
    using (var upsampler = new WaveFormatConversionStream(new WaveFormat(8000, 16, 1), converter))
    {
        // todo: without saving to file using MemoryStream or similar
        WaveFileWriter.CreateWaveFile("C:/tmp_pcm_8000_16_mono.wav", upsampler);
        return File.ReadAllBytes("C:/tmp_pcm_8000_16_mono.wav");
    }
}
Not sure if this is the optimal way, but it works...
public static byte[] ConvertWavTo8000Hz16BitMonoWav(byte[] inArray)
{
    using (var mem = new MemoryStream(inArray))
    using (var reader = new WaveFileReader(mem))
    using (var converter = WaveFormatConversionStream.CreatePcmStream(reader))
    using (var upsampler = new WaveFormatConversionStream(new WaveFormat(8000, 16, 1), converter))
    {
        byte[] data;
        using (var m = new MemoryStream())
        {
            upsampler.CopyTo(m);
            data = m.ToArray();
        }
        using (var m = new MemoryStream())
        {
            // to create a proper WAV header (44 bytes), which begins with RIFF
            var w = new WaveFileWriter(m, upsampler.WaveFormat);
            // append the WAV data body
            w.Write(data, 0, data.Length);
            w.Flush();  // flush the writer so the header sizes are up to date
            return m.ToArray();
        }
    }
}
It might be added (sorry, I can't comment yet due to lack of points) that NAudio ALWAYS writes 46-byte headers, which in certain situations can cause crashes. I want to add this in case someone encounters it while searching for a clue as to why NAudio wav files suddenly start crashing certain programs.
I encountered this problem after figuring out how to convert and/or resample wav files with NAudio; I was stuck for 2 days and only figured it out with a hex editor.
(The 2 extra bytes are located at bytes 37 and 38, right before the data subchunk [d,a,t,a,size<4bytes>].)
Here is a comparison of two wave file headers: on the left, one saved by NAudio (46 bytes); on the right, one saved by Audacity (44 bytes).
You can check this by looking at the NAudio source in WaveFormat.cs at line 310, where instead of 16 bytes for the fmt chunk, 18+extra are reserved (+extra because some wav files contain even bigger headers than 46 bytes), yet NAudio always seems to write 46-byte headers and never 44 (the MS standard). It may also be noted that NAudio is in fact able to read 44-byte headers (line 210 in WaveFormat.cs).
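If you want to check a file programmatically instead of with a hex editor, you can read the fmt chunk size at byte offset 16 (a sketch that assumes the canonical layout where the fmt chunk immediately follows the RIFF/WAVE preamble):

var header = File.ReadAllBytes("out.wav");

// "RIFF" + size + "WAVE" + "fmt " = 16 bytes, then the 4-byte fmt chunk size
int fmtSize = BitConverter.ToInt32(header, 16);
Console.WriteLine(fmtSize == 16
    ? "canonical 44-byte header"
    : $"fmt chunk is {fmtSize} bytes, so the header is {28 + fmtSize} bytes (46 for NAudio's 18)");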

Compressing with GZipStream results in more bytes

I am using the following methods to compress my response's content:
(Consider _compression = CompressionType.GZip)
private async Task<HttpContent> CompressAsync(HttpContent content)
{
    if (content == null) return null;

    byte[] compressedBytes;
    using (MemoryStream outputStream = new MemoryStream())
    {
        using (Stream compressionStream = GetCompressionStream(outputStream))
        using (Stream contentStream = await content.ReadAsStreamAsync())
            await contentStream.CopyToAsync(compressionStream);

        // the compression stream is disposed at this point, so the
        // gzip/deflate trailer is complete before ToArray() is called
        compressedBytes = outputStream.ToArray();
    }
    content.Dispose();

    HttpContent compressedContent = new ByteArrayContent(compressedBytes);
    compressedContent.Headers.ContentEncoding.Add(GetContentEncoding());
    return compressedContent;
}
private Stream GetCompressionStream(Stream output)
{
    switch (_compression)
    {
        case CompressionType.GZip: return new GZipStream(output, CompressionMode.Compress);
        case CompressionType.Deflate: return new DeflateStream(output, CompressionMode.Compress);
        default: return null;
    }
}

private string GetContentEncoding()
{
    switch (_compression)
    {
        case CompressionType.GZip: return "gzip";
        case CompressionType.Deflate: return "deflate";
        default: return null;
    }
}
However, this method returns more bytes than the original content.
For example, my initial content is 42 bytes long, and the resulting compressedBytes array has a size of 62 bytes.
Am I doing something wrong here? How can compression generate more bytes?
You are not necessarily doing anything wrong. You have to take into account that these compressed formats always require a bit of space for header information. So that's probably why it grew by a few bytes.
Under normal circumstances, you would be compressing larger amounts of data. In that case, the overhead associated with the header data becomes unnoticeable when compared to the gains you make by compressing the data.
But because your uncompressed data is so small in this case, you are probably not gaining much from the compression, so this is one of the few instances where you can actually notice the header taking up space.
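You can measure that fixed cost directly (a sketch): compress zero bytes, and everything left in the output is framing, not data.

using (var ms = new MemoryStream())
{
    using (var gz = new GZipStream(ms, CompressionMode.Compress))
    {
        // write nothing
    }
    Console.WriteLine(ms.ToArray().Length);  // roughly 20 bytes of pure gzip overhead
}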
When compressing small files with gzip it is possible that the metadata (for the compressed file itself) causes an increase larger than the number of bytes saved by compression.
See Google's gzip tips:
Believe it or not, there are cases where GZIP can increase the size of the asset. Typically, this happens when the asset is very small and the overhead of the GZIP dictionary is higher than the compression savings, or if the resource is already well compressed.
For such a small size, the compression overhead can actually make the file larger; that's nothing unusual. Here it's explained in more detail.
