Compressing with GZipStream results in more bytes - C#

I am using the following methods to compress my response's content:
(Consider _compression = CompressionType.GZip)
private async Task<HttpContent> CompressAsync(HttpContent content)
{
    if (content == null) return null;

    byte[] compressedBytes;
    using (MemoryStream outputStream = new MemoryStream())
    {
        using (Stream compressionStream = GetCompressionStream(outputStream))
        using (Stream contentStream = await content.ReadAsStreamAsync())
            await contentStream.CopyToAsync(compressionStream);

        compressedBytes = outputStream.ToArray();
    }
    content.Dispose();

    HttpContent compressedContent = new ByteArrayContent(compressedBytes);
    compressedContent.Headers.ContentEncoding.Add(GetContentEncoding());
    return compressedContent;
}
private Stream GetCompressionStream(Stream output)
{
    switch (_compression)
    {
        case CompressionType.GZip: return new GZipStream(output, CompressionMode.Compress);
        case CompressionType.Deflate: return new DeflateStream(output, CompressionMode.Compress);
        default: return null;
    }
}
private string GetContentEncoding()
{
    switch (_compression)
    {
        case CompressionType.GZip: return "gzip";
        case CompressionType.Deflate: return "deflate";
        default: return null;
    }
}
However, this method returns more bytes than the original content.
For example, my initial content is 42 bytes long, and the resulting compressedBytes array has a size of 62 bytes.
Am I doing something wrong here? How can compression generate more bytes?

You are not necessarily doing anything wrong. You have to take into account that these compressed formats always require a bit of space for header information. So that's probably why it grew by a few bytes.
Under normal circumstances, you would be compressing larger amounts of data. In that case, the overhead associated with the header data becomes unnoticeable when compared to the gains you make by compressing the data.
But because, in this case, your uncompressed data is so small, you are probably not gaining much from the compression, so this is one of the few instances where you can actually notice the header taking up space.
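To make the overhead visible, here is a minimal standalone sketch (mine, not from the question) that gzips an incompressible 42-byte payload, roughly mirroring the sizes reported above:

using System;
using System.IO;
using System.IO.Compression;

class GZipOverheadDemo
{
    static void Main()
    {
        // 42 bytes of hard-to-compress (random) content, matching the
        // payload size mentioned in the question.
        byte[] original = new byte[42];
        new Random(1).NextBytes(original);

        byte[] compressed;
        using (var outputStream = new MemoryStream())
        {
            // The GZipStream must be disposed before reading the result;
            // otherwise the final block and the 8-byte footer are missing.
            using (var gzip = new GZipStream(outputStream, CompressionMode.Compress))
            {
                gzip.Write(original, 0, original.Length);
            }
            compressed = outputStream.ToArray();
        }

        // Prints something like "original: 42, compressed: 63" -- the exact
        // number varies by framework version, but it exceeds 42 because the
        // ~18 bytes of gzip framing outweigh any savings on random input.
        Console.WriteLine($"original: {original.Length}, compressed: {compressed.Length}");
    }
}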

When compressing small files with gzip, it is possible that the metadata (for the compressed file itself) causes an increase larger than the number of bytes saved by compression.
See Google's gzip tips:
Believe it or not, there are cases where GZIP can increase the size of the asset. Typically, this happens when the asset is very small and the overhead of the GZIP dictionary is higher than the compression savings, or if the resource is already well compressed.

For such a small size, the compression overhead can actually make the file larger; that's nothing unusual. Here it's explained in more detail.

Related

GZipStream makes my text bigger than original

There is a post here, Compress and decompress string in C#, about compressing a string in C#.
I've implemented the same code for myself, but the returned text is almost twice the size of the original :O
I've tried it on a JSON string of size 87, like this:
{"G":"82f88ff5-4143-46ef-86cc-a19910f4a6b5","U":"df39e3c7-ffd3-4829-a9cd-27bfcbd4403a"}
The result is 168
H4sIAAAAAAAEAC2NUQ6DIBQE5yx8l0QFqfQCnqAHqKCXaHr3jsaQ3TyYfcuXwKpeamHi0Bf9YCaSGVW6psLua5QWmifykVbPyCDJ3gube4GHet+tXZZM7Xrj6d7Z3u/W8896dVVpd5rMbCaa3k1k25M88OMPcjDew64AAAA=
I've changed Unicode to ASCII, but the result is still too big (128):
H4sIAAAAAAAEAA3KyxGAMAgFwF44y0w+JAEbsAILICSvCcfedc/70EUnaYEq0FiyVJa+wdoj2LNZThDvs9FB918Xqu0ag4H1Vy3GbrG4jImYSyRVp/cDp8EZE1cAAAA=
public static string Compress(this string s)
{
    var bytes = Encoding.ASCII.GetBytes(s);
    using (var msi = new MemoryStream(bytes))
    using (var mso = new MemoryStream())
    {
        using (var gs = new GZipStream(mso, CompressionMode.Compress))
        {
            msi.CopyTo(gs);
        }
        return Convert.ToBase64String(mso.ToArray());
    }
}
Gzip is not only a compression algorithm but a complete file format; this means it adds additional structures whose size can usually be neglected.
However, when compressing small strings, they can blow up the overall gzip stream.
The standard GZIP header, for example, is 10 bytes, and its footer is 8 bytes long.
If you now take your gzip-compressed result in raw format (not the bloated base64-encoded one), you will see that it is 95 bytes.
So the 18 bytes for header and footer already make up nearly 20% of the output!
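A standalone sketch of that arithmetic, reusing the JSON from the question (the sizes in the comments are the expected outcome, not something the original answer printed):

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class GZipSizeBreakdown
{
    static void Main()
    {
        string json = "{\"G\":\"82f88ff5-4143-46ef-86cc-a19910f4a6b5\",\"U\":\"df39e3c7-ffd3-4829-a9cd-27bfcbd4403a\"}";
        byte[] input = Encoding.ASCII.GetBytes(json);

        byte[] raw;
        using (var mso = new MemoryStream())
        {
            using (var gs = new GZipStream(mso, CompressionMode.Compress))
            {
                gs.Write(input, 0, input.Length);
            }
            raw = mso.ToArray();
        }
        string base64 = Convert.ToBase64String(raw);

        // Every gzip stream starts with the magic bytes 0x1F 0x8B.
        Console.WriteLine($"magic: {raw[0]:X2} {raw[1]:X2}");
        // Expected: input 87, raw roughly 95 (18 of which are header + footer),
        // base64 128 -- base64 always costs 4 output chars per 3 input bytes.
        Console.WriteLine($"input: {input.Length}, raw: {raw.Length}, base64: {base64.Length}");
    }
}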

Compress a string using GZip, the string is not shorter

I used the following code to compress a string, but the string is not shorter. Can you explain why?
private string Compress(string str)
{
try
{
String returnValue;
byte[] buffer = Encoding.ASCII.GetBytes(str);
using (MemoryStream ms = new MemoryStream())
{
using (GZipStream zip = new GZipStream(ms, CompressionMode.Compress, true))
{
zip.Write(buffer, 0, buffer.Length);
using (StreamReader sReader = new StreamReader(ms, Encoding.ASCII))
{
returnValue = sReader.ReadToEnd();
}
}
}
return returnValue;
}
catch
{
return str;
}
}
Ignoring issues in the code, there are multiple possible scenarios where this can happen.
Simplified explanation of the compression algorithm: compression is based on the fact that the data you are trying to compress contains redundant values - patterns which can be recognized by the compression algorithm and can be "shortened" by expressing the redundant values more concisely.
Some scenarios where the compressed result can be larger than the input (a short demonstration follows the list):
1) Input is too short - compression algorithms have some data overhead, and given the short input they are unable to compress it effectively. So you have the data overhead from the compression mechanism + the original data.
2) Input is already compressed - again, compression algorithms have some data overhead, and when the input is already compressed, they are unable to compress it effectively.
3) Input is too random - if the input is generated by some random generator, the compression algorithm is unable to compress it effectively - no patterns can be recognized.
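All three scenarios can be shown with a short standalone sketch (mine, not from the original answer; exact numbers vary by framework version):

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class WhenCompressionGrows
{
    // Gzip a byte array and return the raw compressed bytes.
    static byte[] GZip(byte[] input)
    {
        using (var output = new MemoryStream())
        {
            using (var gs = new GZipStream(output, CompressionMode.Compress))
            {
                gs.Write(input, 0, input.Length);
            }
            return output.ToArray();
        }
    }

    static void Main()
    {
        // 1) Too short: the ~18 bytes of gzip framing dominate.
        byte[] tooShort = Encoding.ASCII.GetBytes("hello");

        // 2) Already compressed: a second pass finds no redundancy left.
        byte[] once = GZip(Encoding.ASCII.GetBytes(new string('a', 10000)));

        // 3) Too random: no patterns for the algorithm to exploit.
        byte[] random = new byte[1000];
        new Random(1).NextBytes(random);

        Console.WriteLine($"short:  {tooShort.Length} -> {GZip(tooShort).Length}");
        Console.WriteLine($"double: {once.Length} -> {GZip(once).Length}");
        Console.WriteLine($"random: {random.Length} -> {GZip(random).Length}");
    }
}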

Reduce Quality of Image/Stream Before Saving

I'm trying to take an input stream (a zip file of images) and extract each file. But I must reduce the quality of each image before it is saved (if quality < 100). I have tried the following, but it never compresses the image:
public void UnZip(Stream inputStream, string destinationPath, int quality = 80) {
    using (var zipStream = new ZipInputStream(inputStream)) {
        ZipEntry entry;
        while ((entry = zipStream.GetNextEntry()) != null) {
            var directoryPath = Path.GetDirectoryName(destinationPath + Path.DirectorySeparatorChar + entry.Name);
            var fullPath = directoryPath + Path.DirectorySeparatorChar + Path.GetFileName(entry.Name);

            // Create the stream to unzip the file to
            using (var stream = new MemoryStream()) {
                // Write the zip stream to the stream
                if (entry.Size != 0) {
                    var size = 2048;
                    var data = new byte[2048];
                    while (true) {
                        size = zipStream.Read(data, 0, data.Length);
                        if (size > 0)
                            stream.Write(data, 0, size);
                        else
                            break;
                    }
                }

                // Compress the image and save it to the stream
                if (quality < 100) {
                    using (var image = Image.FromStream(stream)) {
                        var info = ImageCodecInfo.GetImageEncoders();
                        var @params = new EncoderParameters(1);
                        @params.Param[0] = new EncoderParameter(Encoder.Quality, quality);
                        image.Save(stream, info[1], @params);
                    }
                }

                // Save the stream to disk
                using (var fs = new FileStream(fullPath, FileMode.Create)) {
                    stream.WriteTo(fs);
                }
            }
        }
    }
}
I'd appreciate it if someone could show me what I'm doing wrong. Any advice on tidying it up would also be appreciated, as the code's grown a bit ugly. Thanks
You really shouldn't be using the same stream to save the compressed image. The MSDN documentation clearly says: "Do not save an image to the same stream that was used to construct the image. Doing so might damage the stream." (MSDN Article on Image.Save(...))
using (var compressedImageStream = new MemoryStream())
{
    image.Save(compressedImageStream, info[1], @params);
}
Also, what file format are you encoding into? You haven't specified. You're just getting the second encoder found. You shouldn't rely on the order of the results. Search for a specific codec instead:
var encoder = ImageCodecInfo.GetImageEncoders().Where(x => x.FormatID == ImageFormat.Jpeg.Guid).SingleOrDefault();
... and don't forget to check whether the encoder exists on your system:
if (encoder != null)
{ .. }
The Quality parameter doesn't have meaning for all file formats. I assume you might be working with JPEGs? Also, keep in mind that 100% JPEG Quality != Lossless Image. You can still encode with Quality = 100 and reduce space.
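Putting those pieces together, here is a rough sketch of the corrected saving step. It is a fragment meant to replace the tail of the question's using (var stream = ...) block, assuming using System.Linq; and using System.Drawing.Imaging; are in scope; variable names are illustrative:

// Look up the JPEG encoder explicitly instead of trusting info[1].
var jpegEncoder = ImageCodecInfo.GetImageEncoders()
    .FirstOrDefault(x => x.FormatID == ImageFormat.Jpeg.Guid);

if (jpegEncoder != null && quality < 100) {
    stream.Position = 0; // rewind the unzipped data before decoding it
    using (var image = Image.FromStream(stream))
    using (var compressedImageStream = new MemoryStream()) {
        var encoderParams = new EncoderParameters(1);
        encoderParams.Param[0] = new EncoderParameter(Encoder.Quality, (long)quality);
        // Save the re-encoded image to a *separate* stream, per the MSDN warning.
        image.Save(compressedImageStream, jpegEncoder, encoderParams);

        // Write the re-encoded image, not the original data, to disk.
        using (var fs = new FileStream(fullPath, FileMode.Create)) {
            compressedImageStream.WriteTo(fs);
        }
    }
}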
There is no code to compress the image after you've extracted it from the zip stream. All you seem to be doing is getting the unzipped data into a MemoryStream, then proceeding to write the image to the same stream based on quality information (which may or may not compress an image, depending on the codec). I would first recommend not writing to the same stream you're reading from.
Also, what "compression" you get out of the Encoder.Quality property depends on the type of image, and you haven't provided any detail on that. If the image type supports compression and the incoming image quality is already lower than 100 to start with, you won't get any reduction in size either. Long story short, you haven't provided enough information for anyone to give you a real answer.

Difference between Image.Save and FileStream.Write() in C#

I have to read image binary data from a database and save it as a TIFF image on the filesystem. I was using the following code:
private static bool SavePatientChartImageFileStream(byte[] ImageBytes, string ImageFilePath, string IMAGE_NAME)
{
    bool success = false;
    try
    {
        using (FileStream str = new FileStream(Path.Combine(ImageFilePath, IMAGE_NAME), FileMode.Create))
        {
            str.Write(ImageBytes, 0, Convert.ToInt32(ImageBytes.Length));
            success = true;
        }
    }
    catch (Exception ex)
    {
        success = false;
    }
    return success;
}
Since these image binaries are transferred through merge replication, it sometimes happens that a binary is not completely transferred, and we send the request to fetch it with a NOLOCK hint. This results in ImageBytes containing 1 byte of data, which gets saved as a 0 KB corrupted TIFF image.
I have changed the above code to:
private static bool SavePatientChartImage(byte[] ImageBytes, string ImageFilePath, string IMAGE_NAME)
{
    bool success = false;
    System.Drawing.Image newImage;
    try
    {
        using (MemoryStream stream = new MemoryStream(ImageBytes))
        {
            using (newImage = System.Drawing.Image.FromStream(stream))
            {
                newImage.Save(Path.Combine(ImageFilePath, IMAGE_NAME));
                success = true;
            }
        }
    }
    catch (Exception ex)
    {
        success = false;
    }
    return success;
}
In this case, if ImageBytes contains only 1 byte or is incomplete, it won't save the image and will return success as false.
I cannot remove NOLOCK, as we are experiencing extreme locking.
The second version is slower than the first: I tried it with 500 images, and there was a difference of 5 seconds.
I couldn't understand the difference between these two pieces of code and which one to use when. Please help me understand.
In the first version of the code, you are essentially taking a bunch of bytes and writing them to the filesystem. There's no verification of a valid TIFF file because the code neither knows nor cares it's a TIFF file. It's just a bunch of bytes without any business logic attached.
In the second code, you're taking the bytes, wrapping them in a MemoryStream, and then feeding them into an Image object, which parses the entire file and reads it as a TIFF file. This gives you the validation you need - it can tell when the data is invalid - but you're essentially going over the entire file twice, once to read it in (with additional overhead for parsing) and once to write it to disk.
Assuming you don't need any validation that requires deep parsing of the image file (number of colors, image dimensions, etc.), you can skip this overhead by simply checking whether the byte[] ImageBytes has length 1 (or finding any other good indicator of corrupt data) and skipping the write if it doesn't pass. In effect, do your own validation rather than using the Image class as a validator.
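As an illustration, here is a minimal sketch of such a lightweight check. The TIFF magic-number test is one example of "any other good indicator of corrupt data"; the helper name is hypothetical, not from the original code:

// Hypothetical lightweight validation: reject obviously truncated data and
// anything that doesn't start with a TIFF magic number, without parsing the image.
private static bool LooksLikeTiff(byte[] imageBytes)
{
    if (imageBytes == null || imageBytes.Length < 8)
        return false; // far too short to be a real TIFF

    // TIFF files start with "II" + 42 (little-endian) or "MM" + 42 (big-endian).
    bool littleEndian = imageBytes[0] == 0x49 && imageBytes[1] == 0x49 &&
                        imageBytes[2] == 0x2A && imageBytes[3] == 0x00;
    bool bigEndian    = imageBytes[0] == 0x4D && imageBytes[1] == 0x4D &&
                        imageBytes[2] == 0x00 && imageBytes[3] == 0x2A;
    return littleEndian || bigEndian;
}

Calling this before the plain FileStream write keeps the speed of the first version while filtering out the obvious corruption that the second version was added to catch.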
I think the main difference between the two is that in the second version you are writing the source byte[] to a MemoryStream object first, which means the data becomes essentially independent of the database. So you could potentially incorporate this MemoryStream into the first version to achieve the same results.

Bytes consumed by StreamReader

Is there a way to know how many bytes of a stream have been used by StreamReader?
I have a project where we need to read a file that has a text header followed by the start of the binary data. My initial attempt to read this file was something like this:
private int _dataOffset;

void ReadHeader(string path)
{
    using (FileStream stream = File.OpenRead(path))
    {
        StreamReader textReader = new StreamReader(stream);
        string line;
        do
        {
            line = textReader.ReadLine();
            handleHeaderLine(line);
        } while (line != "DATA"); // Yes, they used "DATA" to mark the end of the header
        _dataOffset = (int)stream.Position;
    }
}

private byte[] ReadDataFrame(string path, int frameNum)
{
    using (FileStream stream = File.OpenRead(path))
    {
        stream.Seek(_dataOffset + frameNum * cbFrame, SeekOrigin.Begin);
        byte[] data = new byte[cbFrame];
        stream.Read(data, 0, cbFrame);
        return data;
    }
}
The problem is that when I set _dataOffset to stream.Position, I get the position that the StreamReader has read to, not the end of the header. As soon as I thought about it this made sense, but I still need to be able to know where the end of the header is and I'm not sure if there's a way to do it and still take advantage of StreamReader.
You can find out how many bytes the StreamReader has actually returned (as opposed to read from the stream) in a number of ways, none of them too straightforward I'm afraid.
Get the result of textReader.CurrentEncoding.GetByteCount(allTextReadSoFar) and then seek to this position in the stream.
Use some reflection hackery to retrieve the value of the private variable of the StreamReader object that corresponds to the current byte position within the internal buffer (different from that of the stream - usually behind, but at most equal, of course). Judging by .NET Reflector, this variable seems to be named bytePos.
Don't bother using a StreamReader at all, but instead implement your own ReadLine function built on top of the Stream or even a BinaryReader (BinaryReader is guaranteed never to read further ahead than what you request). This custom function must read from the stream char by char, so you'd actually have to use the low-level Decoder object (unless the encoding is ASCII/ANSI, in which case things are a bit simpler, it being a single-byte encoding).
Option 1 is going to be the least efficient, I would imagine (since you're effectively re-encoding text you just decoded), and option 3 the hardest to implement, though perhaps the most elegant. I'd probably recommend against the ugly reflection hack (option 2), even though it looks tempting as the most direct solution, only taking a couple of lines. (To be quite honest, the StreamReader class really ought to expose this variable via a public property, but alas it does not.) So in the end it's up to you, but either method 1 or 3 should do the job nicely enough...
Hope that helps.
So the data is UTF-8 (the default encoding for StreamReader). This is a multibyte encoding, so IndexOf would be inadvisable. You could call:
Encoding.UTF8.GetByteCount(string)
on your data so far, adding 1 or 2 bytes for the missing line ending.
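A minimal sketch of that approach, assuming the header is plain UTF-8 text with \r\n line endings and no byte-order mark (a BOM would add 3 uncounted bytes, and a bare \n ending would make the + 2 a + 1):

long byteOffset = 0;
using (FileStream stream = File.OpenRead(path))
using (StreamReader textReader = new StreamReader(stream))
{
    string line;
    do
    {
        line = textReader.ReadLine();
        // Count the bytes this line occupied, plus 2 for the "\r\n" terminator.
        byteOffset += Encoding.UTF8.GetByteCount(line) + 2;
        handleHeaderLine(line);
    } while (line != "DATA");
}
// byteOffset now points at the first byte after the "DATA" line,
// regardless of how far ahead the StreamReader's buffer has read.
_dataOffset = (int)byteOffset;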
If you need to count bytes, I'd go with the BinaryReader. You can take the results and cast them about as needed, but I find its idea of its current position to be more reliable (since it reads in binary, it's immune to character-set problems).
So your last line contains 'DATA' plus an unknown number of data bytes. You could extract the position by using IndexOf() on your last read line, then readjust the stream.Position.
But I am not sure you should use ReadLine() at all in this case. Maybe it would be better to read byte by byte until you reach the 'DATA' marker.
The line breaks are easily identifiable without needing to decode the stream first (except for some encodings rarely used for text files like EBCDIC, UTF-16, UTF-32), so you can just read each line as bytes and then decode the entire line:
using (FileStream stream = File.OpenRead(path)) {
    List<byte> buffer = new List<byte>();
    bool hasCr = false;
    bool done = false;
    while (!done) {
        int b = stream.ReadByte();
        if (b == -1) throw new IOException("End of file reached in header.");
        if (b == 13) {
            hasCr = true;
        } else if (b == 10 && hasCr) {
            string line = Encoding.UTF8.GetString(buffer.ToArray(), 0, buffer.Count);
            if (line == "DATA") {
                done = true;
            } else {
                HandleHeaderLine(line);
            }
            buffer.Clear();
            hasCr = false;
        } else {
            if (hasCr) buffer.Add(13);
            hasCr = false;
            buffer.Add((byte)b);
        }
    }
    _dataOffset = (int)stream.Position;
}
Instead of closing the stream and opening it again, you could of course just keep on reading the data.
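For example, the tail of the snippet above could become something like this (a sketch; cbFrame is the frame size from the question), reading the first frame directly inside the same using block:

    // The stream is already positioned at the first data byte.
    _dataOffset = (int)stream.Position;

    byte[] firstFrame = new byte[cbFrame];
    int read = 0;
    // Stream.Read may return fewer bytes than requested, so loop until full.
    while (read < cbFrame) {
        int n = stream.Read(firstFrame, read, cbFrame - read);
        if (n == 0) throw new IOException("Unexpected end of file in data section.");
        read += n;
    }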
