I need to download a webpage. I have the following code to determine the encoding:
System.IO.StreamReader sr = null;
mFrm.InfoShotcut("Henter webside...."); // "Fetching webpage...."
try
{
    if (response.ContentEncoding != null && response.ContentEncoding != "")
    {
        sr = new System.IO.StreamReader(srm, System.Text.Encoding.GetEncoding(response.ContentEncoding));
    }
    else
    {
        sr = new System.IO.StreamReader(srm, System.Text.Encoding.GetEncoding(response.CharacterSet));
    }
    if (sr != null)
    {
        result = sr.ReadToEnd();
        // If the charset from the headers differs from the one declared in the
        // page itself, re-request the page and decode with the detected charset.
        if (response.CharacterSet != GetCharatset(result))
        {
            System.Text.Encoding correctEncoding = System.Text.Encoding.GetEncoding(GetCharatset(result));
            HttpWebRequest client2 = (HttpWebRequest)HttpWebRequest.Create(Helper.value1);
            HttpWebResponse response2 = (HttpWebResponse)client2.GetResponse();
            System.IO.Stream srm2 = response2.GetResponseStream();
            sr = new System.IO.StreamReader(srm2, correctEncoding);
            result = sr.ReadToEnd();
        }
    }
    mFrm.InfoShotcut("Henter webside......");
}
catch (Exception ex)
{
    // handle error
    MessageBox.Show(ex.Message);
}
It had worked great, but now I have tried it with a site that states it uses
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
but really is in UTF-8. How do I detect that, so I can save the file with the right encoding?
First off, the Content-Encoding header does not describe the character set being used. As the RFC says:
Content codings are primarily used to allow a document to be compressed or otherwise usefully transformed without losing the identity of its underlying media type and without loss of information.
The character set used is described in the Content-Type header. For example:
Content-Type: text/html; charset=UTF-8
Your code above that uses the Content-Encoding header will not correctly identify the character set. You have to look at the Content-Type header, find the semicolon if it's there, and then parse the charset parameter.
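For illustration, here is a minimal sketch of that parsing step. The method name and the UTF-8 fallback are my own assumptions, not part of the original code:

// Hypothetical helper: extract the charset parameter from a Content-Type
// value such as "text/html; charset=UTF-8". Falls back to UTF-8 when the
// parameter is missing or unknown (an assumed default).
static System.Text.Encoding GetEncodingFromContentType(string contentType)
{
    const string marker = "charset=";
    int start = contentType.IndexOf(marker, System.StringComparison.OrdinalIgnoreCase);
    if (start < 0)
        return System.Text.Encoding.UTF8; // no charset parameter

    string charset = contentType.Substring(start + marker.Length);
    int end = charset.IndexOf(';');
    if (end >= 0)
        charset = charset.Substring(0, end);
    charset = charset.Trim().Trim('"');

    try
    {
        return System.Text.Encoding.GetEncoding(charset);
    }
    catch (System.ArgumentException)
    {
        return System.Text.Encoding.UTF8; // unrecognized charset name
    }
}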
And, as you've discovered, it can also be described in an HTML META tag.
Or, there might not be a character set definition at all, in which case you have to default to something. My experience has been that defaulting to UTF-8 is a good choice. It's not 100% reliable, but it seems that sites that don't include the charset parameter with the Content-Type field usually default to UTF-8. I've also found that META tags, when they exist, are wrong almost half the time.
As L.B mentioned in his comment, it's possible to download the bytes and examine them to determine the encoding. That can be done with a surprising degree of accuracy, but it requires a lot of code.
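As a small taste of that approach, here is a minimal sketch that only checks for a byte-order mark. Real detectors go much further and use statistical analysis of the bytes:

// Hypothetical helper: detect an encoding from a BOM alone. Returns null
// when there is no BOM, in which case you still need headers, META tags,
// or statistical detection.
static System.Text.Encoding DetectEncodingFromBom(byte[] data)
{
    if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return System.Text.Encoding.UTF8;
    if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return System.Text.Encoding.Unicode;           // UTF-16 little-endian
    if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return System.Text.Encoding.BigEndianUnicode;  // UTF-16 big-endian
    return null;
}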
Related
I have a question related to encoding on Microsoft Exchange servers. I have built an app that processes messages on Exchange, and one of its options is to force the encoding to always be "US-ASCII".
As long as the mail goes directly through Exchange protocols, there is no problem. I have noticed the issue with messages sent by third-party mail clients (e.g. Thunderbird) over the SMTP protocol.
Although the charset is visible in the source as US-ASCII, I find "3D" next to every "=" character, so the source is corrupted and some parts of the message do not display correctly (e.g. images).
To resolve this I have tried to force 7-bit content transfer encoding, but the issue still persists.
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
<html><head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dus-ascii"=
>
</head>
<body bgcolor=3D"#FFFFFF" text=3D"#000000">
dsadsadsadsdsdsadasdsadasdsad<b>dsa</b>
<p style=3D"FONT-FAMILY: Arial" id=3D"c1-id-6">Some signature with image.=
</p><p style=3D"FONT-FAMILY: Arial" id=3D"c1-id-7"><img alt=3D"" src=3D"cid=
:img1.jpg" id=3D"c1-id-8"></p><p style=3D"FONT-FAMILY: Arial" id=3D"c1-id-9=
"> </p></body>
</html>
As long as the message is processed by my app, the "3D" does not appear, even after changing the charset.
Your choice of content transfer encoding is causing this: Content-Transfer-Encoding: quoted-printable
Quoted printable uses the equals sign as an escape character, so the mail server has dutifully escaped all the 'raw' equals signs for you.
Quoted-Printable, or QP encoding, is an encoding using printable ASCII characters (alphanumeric and the equals sign "=") to transmit 8-bit data over a 7-bit data path or, generally, over a medium which is not 8-bit clean. It is defined as a MIME content transfer encoding for use in e-mail.
QP works by using the equals sign "=" as an escape character.
If you wanted to properly process this, look for all '=' characters in your content (not headers), read the next two characters, and replace the '=XX' triple with the character whose hex value you read. Under this scheme, "=3D" decodes to "=".
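A rough sketch of that decoding loop, assuming single-byte characters (a full decoder would collect the decoded bytes and run them through the declared charset):

// Hypothetical decoder for illustration only; real code should use an
// existing MIME library rather than hand-rolling quoted-printable.
static string DecodeQuotedPrintable(string input)
{
    var sb = new System.Text.StringBuilder();
    for (int i = 0; i < input.Length; i++)
    {
        if (input[i] == '=' && i + 2 < input.Length)
        {
            // "=\r\n" is a soft line break inserted by the encoder: drop it.
            if (input[i + 1] == '\r' && input[i + 2] == '\n')
            {
                i += 2;
                continue;
            }
            // "=XX" escapes the byte with hex value XX, e.g. "=3D" -> '='.
            sb.Append((char)System.Convert.ToInt32(input.Substring(i + 1, 2), 16));
            i += 2;
        }
        else
        {
            sb.Append(input[i]);
        }
    }
    return sb.ToString();
}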
For more information on Content-Transfer-Encoding refer to section 5 of RFC 1341, and RFC 1521 at least; consider reading the RFCs that obsolete the above RFCs.
I have a textarea where I type some unicode characters which become question marks by the time the string reaches the server.
On the input I typed the following:
Don’t “quote” me on that.
On the server I checked Request.Form["fieldID"] in Page_Load() and I saw:
"Don�t �quote� me on that."
I checked my web.config file and it says <globalization requestEncoding="utf-8" responseEncoding="utf-8" />. Anything else I should check to ensure UTF-8 is enabled?
Question marks like that generally show up when UTF-8 nulls are passed.
You need to HTML encode your strings.
Check the encoding of the Page where the form is, and/or the accept-charset of the form.
I can replicate what you are seeing with ISO-8859-1 - e.g.
<form action="foo" method="post" accept-charset="ISO-8859-1">
....
</form>
In VS watch window:
Inspecting Request.Form (before accessing the key itself):
message=Don%ufffdt+%ufffdquote%ufffd+me+on+that.
Inspecting Request.Form["message"] - accessing the collection keys which means ASP.Net has already automatically urldecoded:
"Don�t �quote� me on that."
It seems something is overriding your web.config settings on that specific page (?)
Hth...
Once again I solved my own problem. It is quite simple. The short answer is to add the following before sending any response back to the client:
Response.ContentType = "text/html; charset=utf-8";
The long answer is that a "feature" called Cache Mode circumvented all other response data by writing out a UTF-8 encoded file that is really just a cached response. Adding that line before it writes the file solved my problem.
if (cacheModeEnabled)
{
    Response.ContentType = "text/html; charset=utf-8"; // WriteFile doesn't know the file encoding
    Response.WriteFile(Server.MapPath("CacheForm.aspx"), true);
    Response.End();
}
else
{
    // perform normal response here
}
Thanks for all the answers and comments. They definitely helped me solve this issue. Most notably, Fiddler2 let me see what the heck is really in the request and response.
This might be different with other Korean encoding questions.
There is this site I have to scrape and it's Korean.
An example sentence in their site is this
"개인정보보호를 위해 뒤로가기 버튼 대신 검색결과 화면 상단과 하단의 이전 버튼을 사용하시기 바랍니다."
I am using HttpWebRequest and HttpWebResponse to scrape the site.
This is how I retrieve the HTML:
-- partial code --
using (Stream data = resp.GetResponseStream())
{
    response.Append(new StreamReader(data, Encoding.GetEncoding(code), true).ReadToEnd());
}
Now my problem is that I am not getting the correct Korean characters. For my "code" variable, I'm basing the code page on this MSDN list: http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx (let me narrow it down).
here are the Korean code pages:
51949, 50225, 20949, 20833, 10003, 949
But I am still not getting the correct Korean characters. What do you think is the problem?
It is very likely that the page is not in a specific Korean encoding, but one of the Unicode encodings.
Try Encoding.UTF8 or Encoding.Unicode (UTF-16) instead of the specific code pages. There are also Encoding.UTF7 and Encoding.UTF32, but they are not as common.
To be certain, examine the meta tags and headers for the content-type returned by the server.
Update (gleaned from comments):
Since the content-type header is EUC-KR, the corresponding codepage is 51949 and this is what you need to use to retrieve the page.
It was not clear that you are writing this out to a file: you need to use the same encoding when writing the file out, or convert the byte[] from the original encoding to the output file encoding (using Encoding.Convert).
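Putting that together, a minimal sketch might look like this (the URL and file path are placeholders):

// Read the page as EUC-KR (code page 51949), which decodes it into a
// normal .NET string, then write it out as UTF-8.
var eucKr = System.Text.Encoding.GetEncoding(51949);
string html;
var req = (System.Net.HttpWebRequest)System.Net.WebRequest.Create("http://example.com/korean-page");
using (var resp = req.GetResponse())
using (var reader = new System.IO.StreamReader(resp.GetResponseStream(), eucKr))
{
    html = reader.ReadToEnd();
}
System.IO.File.WriteAllText("page.html", html, System.Text.Encoding.UTF8);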
While having the exact same issue, I solved it with the code below:
Encoding.UTF8.GetString(DownloadData(URL));
This directly transforms the output of the WebClient GET request to UTF-8.
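Spelled out, that one-liner is roughly equivalent to the following sketch, assuming the call runs on a plain WebClient instance (url is a placeholder):

using (var client = new System.Net.WebClient())
{
    byte[] raw = client.DownloadData(url);                   // raw response bytes
    string page = System.Text.Encoding.UTF8.GetString(raw);  // decode as UTF-8
}

Note that this hard-codes UTF-8; it only yields correct text when the server actually sends UTF-8.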
I have written a mail-processing program, which basically slaps a template on incoming mail and forwards it on. Incoming mail goes to a Gmail account, which I download using POP, then I read the mail (both html and plain text multipart-MIME), make whatever changes I need to the template, then create a new mail with the appropriate plain+html text and send it on to another address.
Trouble is, when the mail gets to the other side, some of the mails have been mangled, with weird characters like à and  magically getting inserted. They weren't in the original mails, they're not in my template, and I can't find any sort of predictable pattern as to when these characters appear. I'm sure it's got something to do with the encoding properties of the mails, but I am making sure to set both the charset and the transfer encoding of the outgoing mail to be the same as the incoming mail. So what else do I need to do?
EDIT: Here's a snipped sample of an incoming mail:
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
=0A=0ASafari Special:=0A=0A=A0=0A=0ASafari in Thornybush Priv=
ate Game Reserve 9-12=0AJanuary 2012 (3nights)
After processing, this comes out as:
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
=0D=0A=0D=0ASafari Special:=0D=0A=0D=0A=C2=A0=0D=0A=0D=0A=
Safari in Thornybush Private Game Reserve 9-12=0D=0AJanuary=
2012 (3nights)
Notice the insertion of the =0D and =C2 characters (aside from a few =0A's that weren't in the original).
So what do you think is happening here?
ANOTHER CLUE: Here's my code that creates the alternate view:
var htmlView = AlternateView.CreateAlternateViewFromString(htmlBody, null, "text/html");
htmlView.ContentType.CharSet = charSet;
htmlView.TransferEncoding = transferEncoding;
m.AlternateViews.Add(htmlView);
Along the lines of what @mjwills suggested, perhaps the CreateAlternateViewFromString() method already assumes UTF-8, and changing the charset later to iso-8859-1 doesn't make a difference?
So every =0A is becoming =0D=0A.
And every =A0 is becoming =C2=A0.
The former looks like it might be related to Carriage Return / Line Feeds.
The latter looks like it might be related to What is "=C2=A0" in MIME encoded, quoted-printable text?.
My guess is that even though you have specified the charset, something along the line is treating it as UTF-8.
You may want to try using this form of CreateAlternateViewFromString, where the ContentType.CharSet is set appropriately.
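That overload takes a ContentType up front, so the string is encoded with the intended charset from the start. A sketch using the variables from the question's code:

// Build the ContentType first so CreateAlternateViewFromString encodes the
// string body with the intended charset, instead of defaulting to UTF-8
// and relabeling it afterwards.
var htmlType = new System.Net.Mime.ContentType("text/html");
htmlType.CharSet = charSet; // e.g. "iso-8859-1", copied from the incoming mail

var htmlView = AlternateView.CreateAlternateViewFromString(htmlBody, htmlType);
htmlView.TransferEncoding = transferEncoding;
m.AlternateViews.Add(htmlView);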
I'm using automatic conversion from WSDL to C#. Everything works apart from encoding: whenever I have native characters (like 'ł' or 'ó') I get '??' instead of them in string fields ('G????wny' instead of 'Główny'). How do I deal with this? The server sends the document with the correct encoding declared in its header.
EDIT: I noticed in Wireshark that packets sent FROM me have a BOM, but packets sent TO me don't have it. Maybe that's the root of the problem?
So maybe the following will help:
What I am sure I did is this. First, in the webservice PHP file, after connecting to the MySQL database, I call:
mysql_query("SET CHARSET utf8");
mysql_query("SET NAMES utf8 COLLATE utf8_polish_ci");
Second, in the same PHP file, I added utf8_encode to the service call on the $POST_DATA variable:
$server->service(utf8_encode($POST_DATA));
In class.nusoap_base.php I changed:
//var $soap_defencoding = 'ISO-8859-1';
var $soap_defencoding = 'UTF-8';
and also in nusoap.php, the same as above:
//var $soap_defencoding = 'ISO-8859-1';
var $soap_defencoding = 'UTF-8';
and in the nusoap.php file again:
var $decode_utf8 = true;
Now I can send and receive properly encoded data.
Hope this helps.
Regards,
The problem was on the server side, in the Content-Type header it sent (it was set to "text/xml"). It turns out that for UTF-8 it has to be "text/xml; charset=utf-8"; other methods such as placing a BOM aren't correct (according to RFC 3023). More info here: http://annevankesteren.nl/2005/03/text-xml