Compress a string using GZip, the string is not shorter - c#

I used the following code to compress a string, but the string is not shorter. Can you explain why?
private string Compress(string str)
{
    try
    {
        String returnValue;
        byte[] buffer = Encoding.ASCII.GetBytes(str);
        using (MemoryStream ms = new MemoryStream())
        {
            using (GZipStream zip = new GZipStream(ms, CompressionMode.Compress, true))
            {
                zip.Write(buffer, 0, buffer.Length);
                using (StreamReader sReader = new StreamReader(ms, Encoding.ASCII))
                {
                    returnValue = sReader.ReadToEnd();
                }
            }
        }
        return returnValue;
    }
    catch
    {
        return str;
    }
}

Ignoring issues in the code - there are multiple possible scenarios when this can happen.
Simplified explanation of the compression algorithm: compression is based on the fact that the data you are trying to compress contains redundant values - patterns which can be recognized by the compression algorithm and "shortened" by expressing the redundant values more concisely.
Some scenarios when the compressed result can be larger than the input (a small demo follows the list):
1) Input is too short - compression algorithms have some data overhead, and given such a short input they cannot compress it effectively. So you end up with the overhead from the compression mechanism plus the original data.
2) Input is already compressed - again, compression algorithms have some data overhead, and when the input is already compressed they cannot compress it any further.
3) Input is too random - if the input is generated by some random generator, the compression algorithm cannot compress it effectively - no patterns can be recognized.
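As a rough illustration (a sketch added here, not part of the original question or answers), the following little program gzips each of the three kinds of input and prints the input and output sizes, so the growth is easy to see:
using System;
using System.IO;
using System.IO.Compression;
using System.Security.Cryptography;
using System.Text;

static class CompressionOverheadDemo
{
    // Compress a buffer with GZip and return the compressed bytes.
    static byte[] GZip(byte[] input)
    {
        using (var ms = new MemoryStream())
        {
            using (var zip = new GZipStream(ms, CompressionMode.Compress))
                zip.Write(input, 0, input.Length); // the GZipStream must be closed before the output is complete
            return ms.ToArray();
        }
    }

    static void Report(string label, byte[] input)
    {
        Console.WriteLine("{0}: {1} -> {2} bytes", label, input.Length, GZip(input).Length);
    }

    static void Main()
    {
        byte[] shortText = Encoding.ASCII.GetBytes("hello world"); // scenario 1: too short
        byte[] alreadyCompressed = GZip(new byte[100000]);         // scenario 2: already compressed
        byte[] random = new byte[100];                             // scenario 3: too random
        using (var rng = RandomNumberGenerator.Create())
            rng.GetBytes(random);

        Report("short input", shortText);
        Report("already compressed input", alreadyCompressed);
        Report("random input", random);
    }
}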

Compressing with GZipStream results in more bytes

I am using the following methods to compress my response's content:
(Consider _compression = CompressionType.GZip)
private async Task<HttpContent> CompressAsync(HttpContent content)
{
    if (content == null) return null;

    byte[] compressedBytes;
    using (MemoryStream outputStream = new MemoryStream())
    {
        using (Stream compressionStream = GetCompressionStream(outputStream))
        using (Stream contentStream = await content.ReadAsStreamAsync())
            await contentStream.CopyToAsync(compressionStream);

        compressedBytes = outputStream.ToArray();
    }

    content.Dispose();

    HttpContent compressedContent = new ByteArrayContent(compressedBytes);
    compressedContent.Headers.ContentEncoding.Add(GetContentEncoding());
    return compressedContent;
}

private Stream GetCompressionStream(Stream output)
{
    switch (_compression)
    {
        case CompressionType.GZip: { return new GZipStream(output, CompressionMode.Compress); }
        case CompressionType.Deflate: { return new DeflateStream(output, CompressionMode.Compress); }
        default: return null;
    }
}

private string GetContentEncoding()
{
    switch (_compression)
    {
        case CompressionType.GZip: { return "gzip"; }
        case CompressionType.Deflate: { return "deflate"; }
        default: return null;
    }
}
However, this method returns more bytes than the original content.
For example, my initial content is 42 bytes long, and the resulting compressedBytes array has a size of 62 bytes.
Am I doing something wrong here? How can compression generate more bytes?
You are not necessarily doing anything wrong. You have to take into account that these compressed formats always require a bit of space for header information, so that's probably why it grew by a few bytes.
Under normal circumstances, you would be compressing larger amounts of data. In that case, the overhead associated with the header data becomes unnoticeable when compared to the gains you make by compressing the data.
But because your uncompressed data is so small in this case, you are probably not gaining much from the compression, so this is one of the few instances where you can actually notice the header taking up space.
When compressing small files with gzip it is possible that the metadata (for the compressed file itself) causes an increase larger than the number of bytes saved by compression.
See Google's gzip tips:
Believe it or not, there are cases where GZIP can increase the size of the asset. Typically, this happens when the asset is very small and the overhead of the GZIP dictionary is higher than the compression savings, or if the resource is already well compressed.
For such a small size the compression overhead can actually make the file larger, and it's nothing unusual. Here it's explained in more detail.
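If you want to see just the fixed cost of the container, one quick sketch (my own, using .NET's GZipStream, not code from the answers above) is to compress zero bytes and look at what comes out - the header, trailer and block bookkeeping are still written:
using System;
using System.IO;
using System.IO.Compression;

class GZipFixedOverheadDemo
{
    static void Main()
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                // intentionally write nothing
            }
            // Prints a small but non-zero number (typically around 20 bytes on .NET):
            // the overhead every gzip payload pays before any compression savings kick in.
            Console.WriteLine("Empty payload compressed to {0} bytes", output.ToArray().Length);
        }
    }
}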

What is an efficient way in C# of doing MD5 and download all at once?

I'm working on downloading a file and then doing an MD5 check to ensure the download was successful. I have the following code, which should work but isn't the most efficient - especially for large files.
using (var client = new System.Net.WebClient())
{
    client.DownloadFile(url, destinationFile);
}
var fileHash = GetMD5HashAsStringFromFile(destinationFile);
var successful = expectedHash.Equals(fileHash, StringComparison.OrdinalIgnoreCase);
My concern is that the bytes are all streamed through to disk, and then the MD5 ComputeHash() has to open the file and read all the bytes again. Is there a good, clean way of computing the MD5 as part of the download stream? Ideally, the MD5 should just fall out of the DownloadFile() function as a side effect of sorts. A function with a signature like this:
string DownloadFileAndComputeHash(string url, string filename, HashTypeEnum hashType);
Edit: Adds code for GetMD5HashAsStringFromFile()
public string GetMD5HashAsStringFromFile(string filename)
{
    using (FileStream file = File.Open(filename, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        var md5er = System.Security.Cryptography.MD5.Create();
        var md5HashBytes = md5er.ComputeHash(file);
        return BitConverter
            .ToString(md5HashBytes)
            .Replace("-", string.Empty)
            .ToLower();
    }
}
Is there a good, clean way of computing the MD5 as part of the download stream? Ideally, the MD5 should just fall out of the DownloadFile() function as a side effect of sorts.
You could follow this strategy to do a "chunked" calculation and minimize memory pressure (and duplication):
1) Open the response stream on the web client.
2) Open the destination file stream.
3) Repeat while there is data available:
   - Read a chunk from the response stream into a byte buffer.
   - Write it to the destination file stream.
   - Use the TransformBlock method to add the bytes to the hash calculation.
4) Use TransformFinalBlock to get the calculated hash code.
The sample code below shows how this could be achieved.
public static byte[] DownloadAndGetHash(Uri file, string destFilePath, int bufferSize)
{
    using (var md5 = MD5.Create())
    using (var client = new System.Net.WebClient())
    {
        using (var src = client.OpenRead(file))
        using (var dest = File.Create(destFilePath, bufferSize))
        {
            md5.Initialize();
            var buffer = new byte[bufferSize];
            while (true)
            {
                var read = src.Read(buffer, 0, buffer.Length);
                if (read > 0)
                {
                    dest.Write(buffer, 0, read);
                    md5.TransformBlock(buffer, 0, read, null, 0);
                }
                else // reached the end
                {
                    md5.TransformFinalBlock(buffer, 0, 0);
                    return md5.Hash;
                }
            }
        }
    }
}
If you're talking about large files (I'm assuming over 1GB), you'll want to read the data in chunks, then process each chunk through the MD5 algorithm, and then store it to the disk. It's doable, but I don't know how much of the default .NET classes will help you with that.
One approach might be with a custom stream wrapper. First you get a Stream from WebClient (via GetWebResponse() and then GetResponseStream()), then you wrap it, and then pass it to ComputeHash(stream). When MD5 calls Read() on your wrapper, the wrapper would call Read on the network stream, write the data out when it's received, and then pass it back to MD5.
I don't know what problems would await you if you try and do this.
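A rough sketch of that wrapper idea (the class and member names below are mine, not an existing API): a read-only Stream that mirrors every chunk it serves into a destination stream, so that MD5.ComputeHash drives both the download and the file write in a single pass.
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical wrapper: reads from the source (network) stream and copies every
// chunk into a destination (file) stream before handing it back to the caller.
public sealed class TeeReadStream : Stream
{
    private readonly Stream _source;
    private readonly Stream _copyTo;

    public TeeReadStream(Stream source, Stream copyTo)
    {
        _source = source;
        _copyTo = copyTo;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int read = _source.Read(buffer, offset, count);
        if (read > 0)
            _copyTo.Write(buffer, offset, read); // persist the chunk as it streams past
        return read;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { _copyTo.Flush(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

// Usage: ComputeHash pulls from the wrapper, which writes the file as a side effect.
// using (var client = new System.Net.WebClient())
// using (var response = client.OpenRead(url))
// using (var file = File.Create(destinationFile))
// using (var tee = new TeeReadStream(response, file))
// using (var md5 = MD5.Create())
// {
//     byte[] hash = md5.ComputeHash(tee);
// }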
Something like this.
byte[] result;
using (var webClient = new System.Net.WebClient())
{
    result = webClient.DownloadData("http://some.url");
}
byte[] hash = ((HashAlgorithm)CryptoConfig.CreateFromName("MD5")).ComputeHash(result);

How to compress a random string?

I'm working on an encryptor application that works based on the RSA asymmetric algorithm.
It generates a key pair and the user has to keep it.
As key pairs are long random strings, I want to create a function that lets me compress the generated long random strings (key pairs) based on a pattern.
(For example, the function gets a string that contains 100 characters and returns a string that contains 30 characters.)
So when the user enters the compressed string, I can regenerate the key pair based on the pattern I compressed it with.
But a person told me that it is impossible to compress random things because they are random!
What is your idea?
Is there any way to do this?
Thanks
It's impossible to compress (nearly any) random data. Learning a bit about information theory, entropy, how compression works, and the pigeonhole principle will make this abundantly clear.
One exception to this rule is if by "random string", you mean, "random data represented in a compressible form, like hexadecimal". In this sort of scenario, you could compress the string or (the better option) simply encode the bytes as base 64 instead to make it shorter. E.g.
// base 16, 50 random bytes (length 100)
be01a140ac0e6f560b1f0e4a9e5ab00ef73397a1fe25c7ea0026b47c213c863f88256a0c2b545463116276583401598a0c36
// base 64, same 50 random bytes (length 68)
vgGhQKwOb1YLHw5KnlqwDvczl6H+JcfqACa0fCE8hj+IJWoMK1RUYxFidlg0AVmKDDY=
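For instance, here is a small self-contained sketch (the byte values are freshly generated, not the ones above) comparing the two textual encodings of the same 50 random bytes:
using System;
using System.Security.Cryptography;

class Base64VsHexDemo
{
    static void Main()
    {
        // 50 random bytes, standing in for raw key material
        byte[] keyBytes = new byte[50];
        using (var rng = RandomNumberGenerator.Create())
            rng.GetBytes(keyBytes);

        string hex = BitConverter.ToString(keyBytes).Replace("-", "").ToLower();
        string base64 = Convert.ToBase64String(keyBytes);

        Console.WriteLine("hex length:    {0}", hex.Length);    // 100 (2 characters per byte)
        Console.WriteLine("base64 length: {0}", base64.Length); // 68  (about 4 characters per 3 bytes)
    }
}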
You might instead give the user a shorter hash or fingerprint of the value (e.g. the last x bytes). Then by storing the full key and hash somewhere, you could give them the key when they give you the hash. You'd have to have this hash be long enough that security is not compromised. Depending on your application, this might defeat the purpose because the hash would have to be as long as the key, or it might not be a problem.
public static string ZipStr(String str)
{
    using (MemoryStream output = new MemoryStream())
    {
        using (DeflateStream gzip = new DeflateStream(output, CompressionMode.Compress))
        {
            using (StreamWriter writer = new StreamWriter(gzip, System.Text.Encoding.UTF8))
            {
                writer.Write(str);
            }
        }
        return Convert.ToBase64String(output.ToArray());
    }
}

public static string UnZipStr(string base64)
{
    byte[] input = Convert.FromBase64String(base64);
    using (MemoryStream inputStream = new MemoryStream(input))
    {
        using (DeflateStream gzip = new DeflateStream(inputStream, CompressionMode.Decompress))
        {
            using (StreamReader reader = new StreamReader(gzip, System.Text.Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
Take into account that this doesn't have to be shorter at all... depends on the contents of the string.
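A quick usage sketch of the two methods above (output values will vary with the input):
string original = "This is a short example string.";
string zipped = ZipStr(original);
string restored = UnZipStr(zipped);

// For an input this short, the Deflate overhead plus the Base64 expansion
// (about 4 characters per 3 bytes) usually makes the "compressed" string longer.
Console.WriteLine("original: {0} chars, zipped: {1} chars", original.Length, zipped.Length);
Console.WriteLine("round trip ok: {0}", restored == original);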
Try to use gzip compression and see if it helps you

Deflating a stream with SharpZipLib has different behaviors in C# vs. VB.NET

I'm having trouble writing a static Deflate extension method that I would use to deflate a string, using the BZip2 algorithm of the SharpZipLib library (runtime version: v2.0.50727).
I'm doing this with .NET Framework 4.
This is my VB.NET code:
Public Function Deflate(ByVal text As String)
    Try
        Dim compressedData As Byte() = Convert.FromBase64String(text)
        System.Diagnostics.Debug.WriteLine(String.Concat("Compressed text data size: ", text.Length.ToString()))
        System.Diagnostics.Debug.WriteLine(String.Concat("Compressed byte data size: ", compressedData.Length.ToString()))
        Using compressedStream As MemoryStream = New MemoryStream(compressedData)
            Using decompressionStream As BZip2OutputStream = New BZip2OutputStream(compressedStream)
                Dim cleanData() As Byte
                Using decompressedStream As MemoryStream = New MemoryStream()
                    decompressionStream.CopyTo(decompressedStream) ' HERE THE ERROR!
                    cleanData = decompressedStream.ToArray()
                End Using
                decompressionStream.Close()
                compressedStream.Close()
                Dim cleanText As String = Encoding.UTF8.GetString(cleanData, 0, cleanData.Length)
                System.Diagnostics.Debug.WriteLine(String.Concat("After decompression text data size: ", cleanText.Length.ToString()))
                System.Diagnostics.Debug.WriteLine(String.Concat("After decompression byte data size: ", cleanData.Length.ToString()))
                Return cleanText
            End Using
        End Using
    Catch
        Return String.Empty
    End Try
End Function
The strange thing is that I wrote a C# counterpart of the same method, and it works perfectly!!! This is the code:
public static string Deflate(this string text)
{
    try
    {
        byte[] compressedData = Convert.FromBase64String(text);
        System.Diagnostics.Debug.WriteLine(String.Concat("Compressed text data size: ", text.Length.ToString()));
        System.Diagnostics.Debug.WriteLine(String.Concat("Compressed byte data size: ", compressedData.Length.ToString()));
        using (MemoryStream compressedStream = new MemoryStream(compressedData))
        using (BZip2InputStream decompressionStream = new BZip2InputStream(compressedStream))
        {
            byte[] cleanData;
            using (MemoryStream decompressedStream = new MemoryStream())
            {
                decompressionStream.CopyTo(decompressedStream);
                cleanData = decompressedStream.ToArray();
            }
            decompressionStream.Close();
            compressedStream.Close();
            string cleanText = Encoding.UTF8.GetString(cleanData, 0, cleanData.Length);
            System.Diagnostics.Debug.WriteLine(String.Concat("After decompression text data size: ", cleanText.Length.ToString()));
            System.Diagnostics.Debug.WriteLine(String.Concat("After decompression byte data size: ", cleanData.Length.ToString()));
            return cleanText;
        }
    }
    catch (Exception e)
    {
        return String.Empty;
    }
}
In the VB.NET version I get this error: "Stream does not support reading." (see the code comment for where it comes from!)
Where is the mistake? I cannot see the difference between the two methods...
Thank you very much!
A game of spot the difference shows that in the first you are using BZip2OutputStream whereas the second is BZip2InputStream.
It seems reasonable that the output stream is used to write to and so as it says is not readable.
For what it's worth, there are a lot of good comparison tools out there. They won't cope with syntax differences, but the way the matching works, it shows up quite well when you are using totally different objects (in this case at least). The one I personally use and recommend is Beyond Compare.
You switched BZip2OutputStream and BZip2InputStream
In one version you are using a BZip2InputStream and in the other a BZip2OutputStream.

Best way to read a large file into a byte array in C#?

I have a web server which will read large binary files (several megabytes) into byte arrays. The server could be reading several files at the same time (different page requests), so I am looking for the most optimized way for doing this without taxing the CPU too much. Is the code below good enough?
public byte[] FileToByteArray(string fileName)
{
    byte[] buff = null;
    FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read);
    BinaryReader br = new BinaryReader(fs);
    long numBytes = new FileInfo(fileName).Length;
    buff = br.ReadBytes((int)numBytes);
    return buff;
}
Simply replace the whole thing with:
return File.ReadAllBytes(fileName);
However, if you are concerned about memory consumption, you should not read the whole file into memory all at once; read it in chunks instead.
I might argue that the answer here generally is "don't". Unless you absolutely need all the data at once, consider using a Stream-based API (or some variant of reader / iterator). That is especially important when you have multiple parallel operations (as suggested by the question) to minimise system load and maximise throughput.
For example, if you are streaming data to a caller:
Stream dest = ...
using (Stream source = File.OpenRead(path)) {
    byte[] buffer = new byte[2048];
    int bytesRead;
    while ((bytesRead = source.Read(buffer, 0, buffer.Length)) > 0) {
        dest.Write(buffer, 0, bytesRead);
    }
}
I would think this:
byte[] file = System.IO.File.ReadAllBytes(fileName);
Your code can be factored to this (in lieu of File.ReadAllBytes):
public byte[] ReadAllBytes(string fileName)
{
    byte[] buffer = null;
    using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    {
        buffer = new byte[fs.Length];
        fs.Read(buffer, 0, (int)fs.Length);
    }
    return buffer;
}
Note the Integer.MaxValue file-size limitation placed by the Read method; in other words, you can only read a 2 GB chunk at once.
Also note that the last argument to the FileStream constructor is a buffer size.
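(For reference, the overload presumably being referred to is the one that takes an explicit buffer size as its last argument - a sketch with an arbitrary 64 KB value:)
using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 64 * 1024))
{
    // reads now go through a 64 KB internal buffer
}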
I would also suggest reading about FileStream and BufferedStream.
As always a simple sample program to profile which is fastest will be most beneficial.
Also your underlying hardware will have a large effect on performance. Are you using server based hard disk drives with large caches and a RAID card with onboard memory cache? Or are you using a standard drive connected to the IDE port?
Depending on the frequency of operations, the size of the files, and the number of files you're looking at, there are other performance issues to take into consideration. One thing to remember is that each of your byte arrays will be released at the mercy of the garbage collector. If you're not caching any of that data, you could end up creating a lot of garbage and losing most of your performance to % Time in GC. If the chunks are larger than 85 KB, you'll be allocating on the Large Object Heap (LOH), which requires a collection of all generations to free up (this is very expensive, and on a server will stop all execution while it's going on). Additionally, if you have a ton of objects on the LOH, you can end up with LOH fragmentation (the LOH is never compacted), which leads to poor performance and out-of-memory exceptions. You can recycle the process once you hit a certain point, but I don't know whether that's a best practice.
The point is, you should consider the full life cycle of your app before necessarily just reading all the bytes into memory the fastest way possible or you might be trading short term performance for overall performance.
I'd say BinaryReader is fine, but can be refactored to this, instead of all those lines of code for getting the length of the buffer:
public byte[] FileToByteArray(string fileName)
{
    byte[] fileData = null;
    using (FileStream fs = File.OpenRead(fileName))
    {
        using (BinaryReader binaryReader = new BinaryReader(fs))
        {
            fileData = binaryReader.ReadBytes((int)fs.Length);
        }
    }
    return fileData;
}
This should be better than using .ReadAllBytes(), since I saw in the comments on the top response (the one using .ReadAllBytes()) that a commenter had problems with files > 600 MB, and a BinaryReader is meant for this sort of thing. Also, putting it in a using statement ensures the FileStream and BinaryReader are closed and disposed.
In case 'a large file' means beyond the 4 GB limit, then my following code logic is appropriate. The key issue to notice is the LONG data type used with the Seek method, as a LONG is able to point beyond the 2^32 data boundary.
In this example, the code first processes the large file in chunks of 1 GB; after the whole 1 GB chunks are processed, the leftover (<1 GB) bytes are processed. I use this code for calculating the CRC of files beyond the 4 GB size.
(This example uses https://crc32c.machinezoo.com/ for the crc32c calculation.)
private uint Crc32CAlgorithmBigCrc(string fileName)
{
    uint hash = 0;
    byte[] buffer = null;
    FileInfo fileInfo = new FileInfo(fileName);
    long fileLength = fileInfo.Length;
    int blockSize = 1024000000;
    decimal div = fileLength / blockSize;
    int blocks = (int)Math.Floor(div);
    // cast to long so blocks * blockSize doesn't overflow Int32 for files over ~2 GB
    int restBytes = (int)(fileLength - ((long)blocks * blockSize));
    long offsetFile = 0;
    uint interHash = 0;
    Crc32CAlgorithm Crc32CAlgorithm = new Crc32CAlgorithm();
    bool firstBlock = true;
    using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    {
        buffer = new byte[blockSize];
        using (BinaryReader br = new BinaryReader(fs))
        {
            while (blocks > 0)
            {
                blocks -= 1;
                fs.Seek(offsetFile, SeekOrigin.Begin);
                buffer = br.ReadBytes(blockSize);
                if (firstBlock)
                {
                    firstBlock = false;
                    interHash = Crc32CAlgorithm.Compute(buffer);
                    hash = interHash;
                }
                else
                {
                    hash = Crc32CAlgorithm.Append(interHash, buffer);
                }
                offsetFile += blockSize;
            }
            if (restBytes > 0)
            {
                Array.Resize(ref buffer, restBytes);
                fs.Seek(offsetFile, SeekOrigin.Begin);
                buffer = br.ReadBytes(restBytes);
                hash = Crc32CAlgorithm.Append(interHash, buffer);
            }
            buffer = null;
        }
    }
    //MessageBox.Show(hash.ToString());
    //MessageBox.Show(hash.ToString("X"));
    return hash;
}
Overview: if your image is added as an embedded resource (Build Action = Embedded Resource), use GetExecutingAssembly to retrieve the jpg resource into a stream, then read the binary data from the stream into a byte array.
public byte[] GetAImage()
{
    byte[] bytes = null;
    var assembly = Assembly.GetExecutingAssembly();
    var resourceName = "MYWebApi.Images.X_my_image.jpg";
    using (Stream stream = assembly.GetManifestResourceStream(resourceName))
    {
        bytes = new byte[stream.Length];
        stream.Read(bytes, 0, (int)stream.Length);
    }
    return bytes;
}
Use the BufferedStream class in C# to improve performance. A buffer is a block of bytes in memory used to cache data, thereby reducing the number of calls to the operating system. Buffers improve read and write performance.
See the following for a code example and additional explanation:
http://msdn.microsoft.com/en-us/library/system.io.bufferedstream.aspx
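As a minimal sketch (mine, not taken from the linked page) of wrapping a FileStream in a BufferedStream and reading it in chunks:
using System;
using System.IO;

class BufferedReadDemo
{
    static void Main(string[] args)
    {
        string fileName = args[0]; // path supplied on the command line

        using (FileStream fs = File.OpenRead(fileName))
        using (BufferedStream bs = new BufferedStream(fs, 64 * 1024)) // 64 KB internal buffer
        {
            byte[] chunk = new byte[4096];
            long total = 0;
            int read;
            while ((read = bs.Read(chunk, 0, chunk.Length)) > 0)
            {
                total += read; // process each chunk here instead of holding the whole file in memory
            }
            Console.WriteLine("Read {0} bytes", total);
        }
    }
}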
use this:
bytesRead = responseStream.ReadAsync(buffer, 0, buffer.Length).Result;
I would recommend trying the Response.TransmitFile() method, then Response.Flush() and Response.End(), for serving your large files.
If you're dealing with files above 2 GB, you'll find that the above methods fail.
It's much easier just to hand the stream off to MD5 and allow that to chunk your file for you:
private byte[] computeFileHash(string filename)
{
    MD5 md5 = MD5.Create();
    using (FileStream fs = new FileStream(filename, FileMode.Open))
    {
        byte[] hash = md5.ComputeHash(fs);
        return hash;
    }
}
