C# - GZipStream magic number incorrect?

So, I'm trying to make a program which turns a computer into a proxy using this. It all works well, except for gzip/deflate pages.
Whenever I try to decompress, I get an InvalidDataException stating that the magic number in the GZip header is incorrect.
I use this function:
private byte[] GZipUncompress(byte[] data)
{
    using (var input = new MemoryStream(data))
    {
        input.Seek(0, SeekOrigin.Begin);
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            output.Seek(0, SeekOrigin.Begin);
            gzip.CopyTo(output);
            return output.ToArray();
        }
    }
}
to decompress data. The error:
(screenshot of the InvalidDataException, source: gyazo.com)
Any help would be appreciated.
EDIT: I seem to have gotten somewhere!
As usr suggested, I should write an HTTP parser to get the body and decompress that.
Before parsing: http://pastebin.com/Cb0E8WtT
After parsing: http://pastebin.com/k9e8wMvr
This is the method I use to get to the body:
private byte[] HTTParse(byte[] data)
{
    string http = ascii.GetString(data);
    char[] lineBreak = crlf.ToCharArray();
    string[] parts = http.Split(lineBreak);
    List<byte> res = new List<byte>();
    for (int i = 1; i < parts.Length; i++)
    {
        if (i % 2 == 0)
        {
            Regex r = new Regex(@"(.)*: (.)*");
            Regex htt = new Regex(@"HTT(.)*/(.)*\.(.)* \d{1,50} (.)*");
            if (!r.IsMatch(parts[i]) && !htt.IsMatch(parts[i]))
            {
                //Console.WriteLine("[TEST] " + parts[i]);
                res.AddRange(ascii.GetBytes(parts[i]));
                res.AddRange(ascii.GetBytes("\r\n"));
            }
        }
    }
    return res.ToArray();
}
However, I still get an error saying "The magic number in GZip header is not correct. Make sure you are passing in a GZip stream."
EDIT (2): After copying an answer from here, I have managed to successfully uncompress the body.
The new problem: Firefox.
(screenshot of the garbled page rendered in Firefox, source: gyazo.com)
I'm now unsure whether or not I even needed to decompress gzip pages.
Where have I gone wrong now?

You said that you use this code for gzip/deflate. But deflate is not the same as gzip; in particular, it has no magic header like gzip does. Deflate is defined in RFC 1951, gzip in RFC 1952. Also, browsers like Firefox and Chrome (but not Internet Explorer) accept raw deflate (RFC 1951) in addition to the zlib-wrapped form (RFC 1950) that the "deflate" content encoding officially means.
So before you apply decompression to the body you must first check based on the "Content-Encoding" header which compression is used.
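As a rough sketch of that check (assuming the headers have already been parsed and the body isolated; DecompressBody and its parameters are hypothetical names, not code from the question), it could look like this:
// Needs: using System; using System.IO; using System.IO.Compression;
// Sketch: pick a decompressor based on the Content-Encoding header.
// Assumes `body` holds only the message body (no headers) and
// `contentEncoding` was read from the response, e.g. "gzip" or "deflate".
private static byte[] DecompressBody(byte[] body, string contentEncoding)
{
    using (var input = new MemoryStream(body))
    using (var output = new MemoryStream())
    {
        if (string.Equals(contentEncoding, "gzip", StringComparison.OrdinalIgnoreCase))
        {
            using (var gzip = new GZipStream(input, CompressionMode.Decompress))
                gzip.CopyTo(output);
        }
        else if (string.Equals(contentEncoding, "deflate", StringComparison.OrdinalIgnoreCase))
        {
            // DeflateStream expects raw deflate (RFC 1951). Servers that send
            // zlib-wrapped data (RFC 1950) prepend a 2-byte header; skipping it
            // before handing the stream to DeflateStream is a common workaround.
            if (body.Length > 2 && body[0] == 0x78)
                input.Position = 2;
            using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
                deflate.CopyTo(output);
        }
        else
        {
            return body; // no (known) compression applied
        }
        return output.ToArray();
    }
}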

It turns out I never even needed to unzip the compressed data.
However, as per the solution:
I separated the body with the help of this, and attempted to unzip that. What I hadn't realised was that I was sending around 500 blank bytes, which generated a bad request (with the html amongst the compressed data), so I couldn't unzip anyway.
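For reference, one straightforward way to do that separation is to find the first blank line (CRLF CRLF) in the raw bytes and keep everything after it. A minimal sketch (the GetBody helper is a made-up name, not the code I actually used):
// Sketch: find the end of the HTTP headers (the first CRLFCRLF) and
// return only the body bytes. Assumes `raw` contains the full response.
private static byte[] GetBody(byte[] raw)
{
    byte[] separator = { (byte)'\r', (byte)'\n', (byte)'\r', (byte)'\n' };
    for (int i = 0; i + separator.Length <= raw.Length; i++)
    {
        bool match = true;
        for (int j = 0; j < separator.Length; j++)
        {
            if (raw[i + j] != separator[j]) { match = false; break; }
        }
        if (match)
        {
            int bodyStart = i + separator.Length;
            byte[] body = new byte[raw.Length - bodyStart];
            Array.Copy(raw, bodyStart, body, 0, body.Length);
            return body;
        }
    }
    return raw; // no header/body separator found
}
Note that if the response uses Transfer-Encoding: chunked, the chunk-size framing also has to be stripped from the body before it will decompress.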

Related

HttpClient: Correct order to detect encoding

I'm using HttpClient to fetch some files. I put the content into a byte array (bytes). Now I need to detect the encoding. The content type will be either HTML, CSS, JavaScript or XML.
Currently I check the charset from headers, then check for a BOM (byte order mark) before I finally check the first part of the file for a charset meta tag.
Normally this works fine, because there are no conflicts.
But: Is that order correct (in case of conflict)?
The code I currently use:
Encoding encoding;
try
{
    encoding = Encoding.GetEncoding(responseMessage.Content.Headers.ContentType.CharSet);
}
catch
{
    using (MemoryStream ms = new MemoryStream(bytes))
    {
        using (StreamReader sr = new StreamReader(ms, Encoding.Default, true))
        {
            char[] chars = new char[1024];
            sr.Read(chars, 0, 1024);
            string textDefault = new string(chars);
            if (sr.CurrentEncoding == Encoding.Default)
            {
                encoding = Global.EncodingFraContentType(textDefault);
            }
            else
            {
                encoding = sr.CurrentEncoding;
            }
        }
    }
}
responseInfo.Text = encoding.GetString(bytes);
Global.EncodingFraContentType uses a regular expression to find the charset defined either in the XML declaration or in a meta tag.
What order is the correct to detect charset/encoding?
The correct answer depends not on order, but on which actually gives the correct result, and there's no perfect answer here.
If there is a conflict, then the server has given you something incorrect. Since it's incorrect there can't be a "correct" order because there isn't a correct way of being incorrect. And, maybe the header and the embedded metadata are both wrong!
No even slightly commonly-used encoding can begin with something that looks like a BOM would look in UTF-8 or UTF-16 and still be a valid example of the content types you mention, so if there's a BOM then that wins.
(The one exception is if the document is so badly edited that it switches encoding part-way through, which is not unheard of, but then the content is so very buggy as to have no real meaning.)
If the content contains no octet greater than 0x7F, and the header and the metadata merely claim it to be different encodings from among US-ASCII, UTF-8, any of the ISO-8859 family, or any other encoding for which those octets all map to the same code points, then it doesn't really matter which you consider it to be, as the net result is the same. Consider it to be whatever the metadata says, as then you don't need to rewrite it to match correctly.
If it's in UTF-16 without a BOM, that is likely to become obvious very quickly, because all of the content types you mention make heavy use of characters in the range U+0000 to U+00FF (indeed, generally U+0020 to U+007F), so you'll have long runs with a zero byte every other character.
If it has octets above 0x7F and is valid UTF-8, then it's almost certainly UTF-8. (By the same token if it's not UTF-8 and has octets above 0x7F then it almost certainly can't be mistaken for UTF-8).
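A minimal sketch of that validity test (assuming the whole payload fits in memory) is to decode with a UTF-8 decoder configured to throw instead of substituting characters:
// Needs: using System.Text;
// Sketch: returns true if `bytes` decodes as well-formed UTF-8.
static bool IsValidUtf8(byte[] bytes)
{
    try
    {
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true)
            .GetString(bytes);
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false; // at least one invalid UTF-8 sequence
    }
}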
The trickiest reasonably common case is when you have conflicting claims about it being in two different encodings which are both single-octet-per-character encodings, and an octet in the range 0x80-0xFF is present. This is the case you can't be sure about. If one encoding is a subset of the other (especially when C1 controls are excluded) then you could go for the superset, but that requires storing knowledge about those encodings and a considerable amount of work. Most of the time I'd be inclined to just throw an exception, and when it's found in the logs see if I can get the source to fix their bug, or special-case that source; but that doesn't work if you are dealing with a very large number of disparate sources that you may not have a relationship with. Alas, there is no perfect answer here.
It's also worth noting that sometimes both the header and the embedded metadata will agree with each other and still be wrong. A common case is content in CP-1252 claimed as being in ISO-8859-1.
According to the W3C FAQ:
If you have a UTF-8 byte-order mark (BOM) at the start of your file then recent browser versions other than Internet Explorer 10 or 11 will use that to determine that the encoding of your page is UTF-8. It has a higher precedence than any other declaration, including the HTTP header.
When it comes to the HTTP header vs. the meta tag, the BOM takes precedence; the meta tag can take precedence over the header as long as it appears within the first 1024 bytes, though there is no strict rule on that.
Conclusion - in order of importance:
1. Byte Order Mark (BOM): If present, this is authoritative, since it was added by the editor that actually saved the file (this can only be present with Unicode encodings).
2. Content-Type charset (in the header set by the server): For dynamically created/processed files it should be present (since the server knows), but it might not be for static files (the server just sends those).
3. Inline charset: For XML, HTML and CSS the encoding can be specified inside the document, in the XML prolog, an HTML meta tag or @charset in CSS. To read that you need to decode the first part of the document using, for instance, the 'Windows-1252' encoding.
4. Assume UTF-8. This is the standard of the web and is by far the most used today.
If the found encoding equals 'ISO-8859-1', use 'Windows-1252' instead (required in HTML5 - read more at Wikipedia).
Now try to decode the document using the found encoding. If error handling is turned on, that might fail! In that case:
5. Use 'Windows-1252'. This was the standard in old Windows files and works fine as a last try (there are still a lot of old files out there). This will never throw errors, though it might of course be wrong.
I have made a method that implements this. The regex I use is able to find encodings specified as:
Xml: <?xml version="1.0" encoding="utf-8"?> OR <?xml encoding="utf-8"?>
html: <meta charset="utf-8" /> OR <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
css: @charset "utf-8";
(It works with both single and double quotes.)
You will need:
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
Here's the method that returns the decoded string (parameters are the HttpClient and the Uri):
public static async Task<string> GetString(HttpClient httpClient, Uri url)
{
    byte[] bytes;
    Encoding encoding = null;
    Regex charsetRegex = new Regex(@"(?<=(<meta.*?charset=|^\<\?xml.*?encoding=|^@charset[ ]?)[""']?)[\w-]+?(?=[""';\r\n])",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture);
    using (HttpResponseMessage responseMessage = await httpClient.GetAsync(url).ConfigureAwait(false))
    {
        responseMessage.EnsureSuccessStatusCode();
        bytes = await responseMessage.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
        string headerCharset = responseMessage?.Content?.Headers?.ContentType?.CharSet;
        byte[] buffer = new byte[0x1000];
        Array.Copy(bytes, buffer, Math.Min(bytes.Length, buffer.Length));
        using (MemoryStream ms = new MemoryStream(buffer))
        {
            using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding("Windows-1252"), true, buffer.Length, true))
            {
                string testString = await sr.ReadToEndAsync().ConfigureAwait(false);
                if (!sr.CurrentEncoding.Equals(Encoding.GetEncoding("Windows-1252")))
                {
                    encoding = sr.CurrentEncoding;
                }
                else if (headerCharset != null)
                {
                    encoding = Encoding.GetEncoding(headerCharset, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                }
                else
                {
                    string inlineCharset = charsetRegex.Match(testString).Value;
                    if (!string.IsNullOrEmpty(inlineCharset))
                    {
                        encoding = Encoding.GetEncoding(inlineCharset, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                    }
                    else
                    {
                        encoding = new UTF8Encoding(false, true);
                    }
                }
                if (encoding.Equals(Encoding.GetEncoding("iso-8859-1")))
                {
                    encoding = Encoding.GetEncoding("Windows-1252", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                }
            }
        }
        using (MemoryStream ms = new MemoryStream(bytes))
        {
            try
            {
                using (StreamReader sr = new StreamReader(ms, encoding, false, 0x8000, true))
                {
                    return await sr.ReadToEndAsync().ConfigureAwait(false);
                }
            }
            catch (DecoderFallbackException)
            {
                ms.Position = 0;
                using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding("Windows-1252"), false, 0x8000, true))
                {
                    return await sr.ReadToEndAsync().ConfigureAwait(false);
                }
            }
        }
    }
}
You should wrap the method call in a try/catch, since HttpClient can throw errors if the request fails.
Update:
In .NET Core you don't have the 'Windows-1252' encoding out of the box (a big mistake IMHO), so there you must settle for 'ISO-8859-1'.
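That said, if pulling in a package is acceptable, the code-page encodings (including Windows-1252) can be restored on .NET Core by registering the provider from the System.Text.Encoding.CodePages NuGet package once at startup:
using System.Text;

// Requires the System.Text.Encoding.CodePages NuGet package.
// Register once (e.g. at application startup); after this,
// Encoding.GetEncoding("Windows-1252") works on .NET Core as well.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);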

Change StreamReader Encoding while reading from NetworkStream

I am trying to read an email from POP3 and change to the correct encoding when I find the charset in the headers.
I use a TCP Client to connect to the POP3 server.
Below is my code :
public string ReadToEnd(POP3Client pop3client, out System.Text.Encoding messageEncoding)
{
    messageEncoding = TCPStream.CurrentEncoding;
    if (EOF)
        return ("");
    System.Text.StringBuilder sb = new System.Text.StringBuilder(m_bytetotal * 2);
    string st = "";
    string tmp;
    do
    {
        tmp = TCPStream.ReadLine();
        if (tmp == ".")
            EOF = true;
        else
            sb.Append(tmp + "\r\n");
        //st += tmp + "\r\n";
        m_byteread += tmp.Length + 2; // CRLF discarded by read
        FireReceived();
        if (tmp.ToLower().Contains("content-type:") && tmp.ToLower().Contains("charset="))
        {
            try
            {
                string charSetFound = tmp.Substring(tmp.IndexOf("charset=") + "charset=".Length).Replace("\"", "").Replace(";", "");
                var realEnc = System.Text.Encoding.GetEncoding(charSetFound);
                if (realEnc != TCPStream.CurrentEncoding)
                {
                    TCPStream = new StreamReader(pop3client.m_tcpClient.GetStream(), realEnc);
                }
            }
            catch { }
        }
    } while (!EOF);
    messageEncoding = TCPStream.CurrentEncoding;
    return (sb.ToString());
}
If I remove this line:
TCPStream = new StreamReader(pop3client.m_tcpClient.GetStream(), realEnc);
Everything works fine, except that when the e-mail contains characters from a different charset I get question marks, since the initial encoding is ASCII.
Any suggestions on how to change the encoding while reading data from the Network Stream?
You're doing it wrong (tm).
Seriously, though, you are going about trying to solve this problem in completely the wrong way. Don't use a StreamReader for this. And especially don't read 1 byte at a time (as you said you needed to do in a comment on an earlier "solution").
For an explanation of why not to use a StreamReader, besides the obvious "because it isn't designed to switch between encodings during the process of reading", feel free to read over another answer I gave about the inefficiencies of using a StreamReader here: Reading an mbox file in C#
What you need to do is buffer your reads (a 4k buffer should be fine). Then, as you are already having to do anyway, scan for the '\n' byte to extract content on a line-by-line basis, combining header lines that were folded.
Each header may have multiple encoded-word tokens which may each be in a separate charset, assuming they are properly encoded, otherwise you'll have to deal with undeclared 8-bit data and try to massage that into unicode somehow (probably by having a set of fallback charsets). I'd recommend trying UTF-8 first followed by a selection of charsets that the user of your library has provided before finally trying iso-8859-1 (make sure not to try iso-8859-1 until you've tried everything else, because any sequence of 8-bit text will convert properly to unicode using the iso-8859-1 character encoding).
When you get to text content of the message, you'll want to check the Content-Type header for a charset parameter. If no charset parameter is defined, it should be US-ASCII, but in practice it could be anything. Even if the charset is defined, it might not match the actual character encoding used in the text body of the message, so once again you'll probably want to have a set of fallbacks.
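A rough sketch of such a fallback chain might look like the following (the helper name and the choice of candidate charsets are assumptions; the point is only the ordering and the exception-based probing):
// Needs: using System; using System.Text;
// Sketch: try a list of charsets in order and return the first clean decode.
// iso-8859-1 goes last because it maps every byte to a code point and
// therefore never throws, so it acts as the catch-all.
static string DecodeWithFallbacks(byte[] bytes, params string[] charsets)
{
    foreach (string charset in charsets)
    {
        try
        {
            var enc = Encoding.GetEncoding(charset,
                EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
            return enc.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // invalid byte sequence for this charset; try the next one
        }
        catch (ArgumentException)
        {
            // unknown charset name; try the next one
        }
    }
    return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
}
For the scenario above you might call it as DecodeWithFallbacks(bodyBytes, "utf-8", "windows-1252") and let iso-8859-1 be the implicit last resort.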
As you've probably guessed by this point, this is very clearly not a trivial task as it requires the parser to do on-the-fly character conversion as it goes (and the character conversion requires internal parser state about what the expected charset is at any given time).
Since I've already done the work, you should really consider using MimeKit which will parse the email and properly do charset conversion on the headers and the content using the appropriate charset encoding.
I've also written a Pop3Client class that is included in my MailKit library.
If your goal is to learn and write your own library, I'd still highly recommend reading over my code because it is highly efficient and does things in a proper way.
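For comparison, fetching a message with MailKit's Pop3Client looks roughly like this (written from memory of the MailKit API, so double-check the current documentation; the host, port and credentials are placeholders):
using System;
using MailKit.Net.Pop3;
using MimeKit;

// Sketch: download the first message from a POP3 mailbox with MailKit.
using (var client = new Pop3Client())
{
    client.Connect("pop.example.com", 995, true); // placeholders; true = use SSL
    client.Authenticate("user@example.com", "password");
    if (client.Count > 0)
    {
        MimeMessage message = client.GetMessage(0); // charset conversion handled by the library
        Console.WriteLine(message.Subject);
    }
    client.Disconnect(true);
}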
There are some ways you can detect the encoding by looking at the Byte Order Mark (BOM), the first few bytes of the stream, which tell you the encoding. However, the stream might not have a BOM, in which case it could be ASCII, UTF without a BOM, or something else.
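For example, a minimal BOM sniffer over the first few bytes could look like this sketch (only the common Unicode BOMs are handled; returning null means you fall back to other heuristics):
// Needs: using System.Text;
// Sketch: return the encoding indicated by a BOM, or null if there is none.
static Encoding DetectBom(byte[] b)
{
    if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return Encoding.UTF8;             // UTF-8 BOM
    if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return Encoding.Unicode;          // UTF-16 little-endian
    if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return Encoding.BigEndianUnicode; // UTF-16 big-endian
    return null;                          // no BOM detected
}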
You can convert your stream from one encoding to another with the Encoding Class:
Encoding textEncoding = Encoding.[your detected encoding here];
byte[] converted = Encoding.UTF8.GetBytes(textEncoding.GetString(TCPStream.GetBuffer()));
You may select your preferred encoding when converting.
Hope it answers your question.
EDIT:
You may use this code to read your stream in blocks.
MemoryStream st = new MemoryStream();
int numOfBytes = 1024;
int reads = 1;
while (reads > 0)
{
    byte[] bytes = new byte[numOfBytes];
    reads = yourStream.Read(bytes, 0, numOfBytes);
    if (reads > 0)
    {
        int writes = (reads < numOfBytes ? reads : numOfBytes);
        st.Write(bytes, 0, writes);
    }
}

Compressing a string in C# and uncompressing in Python

I am trying to compress a large string on a client program in C# (.net 4) and send it to a server (django, python 2.7) using a PUT request.
Ideally I want to use the standard library at both ends, so I am trying to use gzip.
My C# code is:
public static string Compress(string s) {
    var bytes = Encoding.Unicode.GetBytes(s);
    using (var msi = new MemoryStream(bytes))
    using (var mso = new MemoryStream()) {
        using (var gs = new GZipStream(mso, CompressionMode.Compress)) {
            msi.CopyTo(gs);
        }
        return Convert.ToBase64String(mso.ToArray());
    }
}
The python code is:
s = base64.standard_b64decode(request)
buff = cStringIO.StringIO(s)
with gzip.GzipFile(fileobj=buff) as gz:
    decompressed_data = gz.read()
It's almost working, but the output is: {▯"▯c▯h▯a▯n▯g▯e▯d▯"▯} when it should be {"changed"}, i.e. every other letter is something weird.
If I take out every other character by doing decompressed_data[::2], then it works, but it's a bit of a hack, and clearly there is something else wrong.
I'm wondering if I need to base64 encode it at all for a PUT request? Is this only necessary for POST?
I think the main problem might be that C# uses UTF-16 encoded strings (that is what Encoding.Unicode is). This may yield a problem similar to yours. As with any other encoding problem, we might need a little luck here, but I guess you can solve this by doing:
decompressed_data = gz.read().decode('utf-16')
There, decompressed_data should be Unicode and you can treat it as such for further work.
UPDATE: This worked for me:
C#
static void Main(string[] args)
{
    FileStream f = new FileStream("test", FileMode.CreateNew);
    using (StreamWriter w = new StreamWriter(f))
    {
        w.Write(Compress("hello"));
    }
}
public static string Compress(string s)
{
    var bytes = Encoding.Unicode.GetBytes(s);
    using (var msi = new MemoryStream(bytes))
    using (var mso = new MemoryStream())
    {
        using (var gs = new GZipStream(mso, CompressionMode.Compress))
        {
            msi.CopyTo(gs);
        }
        return Convert.ToBase64String(mso.ToArray());
    }
}
Python
import base64
import cStringIO
import gzip
f = open('test','rb')
s = base64.standard_b64decode(f.read())
buff = cStringIO.StringIO(s)
with gzip.GzipFile(fileobj=buff) as gz:
    decompressed_data = gz.read()
print decompressed_data.decode('utf-16')
Without decode('utf-16') it printed this in the console:
>>>h e l l o
With it, it printed correctly:
>>>hello
Good luck, hope this helps!
It's almost working, but the output is: {▯"▯c▯h▯a▯n▯g▯e▯d▯"▯} when it should be {"changed"}
That's because you're using Encoding.Unicode to convert the string to bytes to start with.
If you can tell Python which encoding to use, you could do that - otherwise you need to use an encoding on the C# side which matches what Python expects.
If you can specify it on both sides, I'd suggest using UTF-8 rather than UTF-16. Even though you're compressing, it wouldn't hurt to make the data half the size (in many cases) to start with :)
I'm also somewhat suspicious of this line:
buff = cStringIO.StringIO(s)
s really isn't text data - it's compressed binary data, and should be treated as such. It may be okay - it's just worth checking whether there's a better way.
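To make the UTF-8 suggestion concrete, the only change needed on the C# side is the encoding used before compressing (a sketch of the adjusted method; on the Python side you would then call decompressed_data.decode('utf-8')):
public static string Compress(string s)
{
    var bytes = Encoding.UTF8.GetBytes(s); // UTF-8 instead of Encoding.Unicode (UTF-16)
    using (var msi = new MemoryStream(bytes))
    using (var mso = new MemoryStream())
    {
        using (var gs = new GZipStream(mso, CompressionMode.Compress))
        {
            msi.CopyTo(gs);
        }
        return Convert.ToBase64String(mso.ToArray());
    }
}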

How do I parse the gzip file header from a gzip stream in C#?

I am forced to use an older version of SharpZipLib and the standard Microsoft libraries to perform this. I have a gzipped file whose name is different from the filename inside the archive. I need to parse the gzip file header to return the original filename. Here is documentation on the gzip website:
http://www.gzip.org/zlib/rfc-gzip.html#conventions
And here is a Java example that looks like it might be doing what I want. It looks like it checks for the file header, but doesn't actually read the file name.
(Sorry couldn't post more than 1 hyperlink)
http://www.java2s.com/Open-Source/Java-Document/6.0-JDK-Modules/j2me/java/util/zip/GZIPInputStream.java.htm
Any help on this problem would be much appreciated. Thanks!
Well, I finally figured it out. It's not the safest or best way, but I needed a quick and dirty way to do it and this works. So if anyone else needs to know this or wants to improve on it, here you go.
using (FileStream stream = File.OpenRead(filePath))
{
    int size = 2048;
    byte[] data = new byte[2048];
    size = stream.Read(data, 0, size);
    // data[3] is the FLG byte; 8 (the FNAME bit) means an original file name
    // follows the 10-byte fixed header, terminated by a zero byte.
    if (data[3] == 8)
    {
        List<byte> byteList = new List<byte>();
        int i = 10;
        while (data[i] != 0)
        {
            byteList.Add(data[i]);
            i++;
        }
        string test = System.Text.ASCIIEncoding.ASCII.GetString(byteList.ToArray());
        Console.WriteLine(test);
    }
}
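A slightly more defensive variation on the same idea checks the gzip magic bytes, tests the FNAME bit in the FLG byte instead of comparing the whole byte, and skips the optional FEXTRA field before reading the name. This is still only a sketch based on RFC 1952, and the helper name is made up:
// Needs: using System.Collections.Generic; using System.IO; using System.Text;
// Sketch: read the original file name (FNAME field) from a gzip header.
// Returns null if the file is not gzip or carries no stored file name.
static string ReadGzipFileName(string filePath)
{
    byte[] data = File.ReadAllBytes(filePath);
    if (data.Length < 10 || data[0] != 0x1F || data[1] != 0x8B || data[2] != 8)
        return null;                       // not a gzip stream with deflate compression
    byte flg = data[3];
    if ((flg & 0x08) == 0)                 // FNAME bit not set
        return null;
    int pos = 10;                          // fixed header is 10 bytes
    if ((flg & 0x04) != 0)                 // FEXTRA: 2-byte little-endian length + data
    {
        int xlen = data[pos] | (data[pos + 1] << 8);
        pos += 2 + xlen;
    }
    var name = new List<byte>();
    while (pos < data.Length && data[pos] != 0) // FNAME is zero-terminated
        name.Add(data[pos++]);
    // RFC 1952 says the name is Latin-1; ASCII is used here for simplicity.
    return Encoding.ASCII.GetString(name.ToArray());
}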

Using HttpWebRequest with dynamic URI causes "parameter is not valid" in Image.FromStream

I'm trying to obtain an image to encode to a WordML document. The original version of this function used files, but I needed to change it to get images created on the fly with an aspx page. I've adapted the code to use HttpWebRequest instead of a WebClient. The problem is that I don't think the page request is getting resolved and so the image stream is invalid, generating the error "parameter is not valid" when I invoke Image.FromStream.
public string RenderCitationTableImage(string citation_table_id)
{
    string image_content = "";
    string _strBaseURL = String.Format("http://{0}",
        HttpContext.Current.Request.Url.GetComponents(UriComponents.HostAndPort, UriFormat.Unescaped));
    string _strPageURL = String.Format("{0}{1}", _strBaseURL,
        ResolveUrl("~/Publication/render_citation_chart.aspx"));
    string _staticURL = String.Format("{0}{1}", _strBaseURL,
        ResolveUrl("~/Images/table.gif"));
    string _fullURL = String.Format("{0}?publication_id={1}&citation_table_layout_id={2}",
        _strPageURL, publication_id, citation_table_id);
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(_fullURL);
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        Stream image_stream = response.GetResponseStream();
        // Read the image data
        MemoryStream ms = new MemoryStream();
        int num_read;
        byte[] crlf = System.Text.Encoding.Default.GetBytes("\r\n");
        byte[] buffer = new byte[1024];
        for (num_read = image_stream.Read(buffer, 0, 1024); num_read > 0; num_read = image_stream.Read(buffer, 0, 1024))
        {
            ms.Write(buffer, 0, num_read);
        }
        // Base 64 Encode the image data
        byte[] image_bytes = ms.ToArray();
        string encodedImage = Convert.ToBase64String(image_bytes);
        ms.Position = 0;
        System.Drawing.Image image_original = System.Drawing.Image.FromStream(ms); // <--- error here: parameter is not valid
        image_stream.Close();
        image_content = string.Format("<w:p>{4}<w:r><w:pict><w:binData w:name=\"wordml://{0}\">{1}</w:binData>" +
            "<v:shape style=\"width:{2}px;height:{3}px\">" +
            "<v:imagedata src=\"wordml://{0}\"/>" +
            "</v:shape>" +
            "</w:pict></w:r></w:p>", _word_image_id, encodedImage, 800, 400, alignment.center);
        image_content = "<w:br w:type=\"text-wrapping\"/>" + image_content + "<w:br w:type=\"text-wrapping\"/>";
    }
    catch (Exception ex)
    {
        return ex.ToString();
    }
    return image_content;
}
Using a static URI it works fine. If I replace "staticURL" with "fullURL" in the WebRequest.Create method I get the error. Any ideas as to why the page request doesn't fully resolve?
And yes, the full URL resolves fine and shows an image if I paste it into the address bar.
UPDATE:
Just read your updated question. Since you're running into login issues, try doing this before you execute the request:
request.Credentials = CredentialCache.DefaultCredentials
If this doesn't work, then perhaps the problem is that authentication is not being enforced on static files, but is being enforced on dynamic files. In this case, you'll need to log in first (using your client code) and retain the login cookie (using HttpWebRequest.CookieContainer on the login request as well as on the second request) or turn off authentication on the page you're trying to access.
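A sketch of that cookie-sharing approach (the login URL and the way the credentials are posted are placeholders; the key point is reusing one CookieContainer for both requests):
// Needs: using System.IO; using System.Net;
// Sketch: share cookies between a login request and the image request.
CookieContainer cookies = new CookieContainer();

// 1) Log in (URL and POST body are placeholders for your real login page).
HttpWebRequest login = (HttpWebRequest)WebRequest.Create("http://example.com/login.aspx");
login.CookieContainer = cookies;
login.Method = "POST";
// ... write credentials to login.GetRequestStream() as your login page expects ...
using (login.GetResponse()) { }            // response cookies are captured in `cookies`

// 2) Request the dynamic image with the same container.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(_fullURL);
request.CookieContainer = cookies;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (Stream image_stream = response.GetResponseStream())
{
    // ... read the stream as in the original code ...
}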
ORIGINAL:
Since it works with one HTTP URL and doesn't work with another, the place to start diagnosing this is figuring out what's different between the two requests, at the HTTP level, which accounts for the difference in behavior in your code.
To figure out the difference, I'd use Fiddler (http://fiddlertool.com) to compare the two requests. Compare the HTTP headers. Are they the same? In particular, are they the same HTTP content type? If not, that's likely the source of your problem.
If headers are the same, make sure both the static and dynamic image are exactly the same content and file type on the server. (e.g. use File...Save As to save the image in a browser to your disk). Then use Fiddler's Hex View to compare the image content. Can you see any obvious differences?
Finally, I'm sure you've already checked this, but just making sure: /Publication/render_citation_chart.aspx refers to an actual image file, not an HTML wrapper around an IMG element, right? This would account for the behavior you're seeing, where a browser renders the image OK but your code doesn't.
