EDIT: The characters come through correctly until, in the middle of the page, there is this line: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">. After it, the special characters come as HTML entities, e.g. é as &eacute; (which browsers render fine), but they show up as eacute; (without the &) when downloaded via WebClient. END EDIT
I am extracting an excerpt from a web page using WebClient + Regex.
But even with the encoding set correctly, I still get é as eacute;, ç as ccedil;, í as iacute;, etc.
I followed the "DownloadString and Special Characters" example to set the charset correctly (ISO-8859-1):
System.Net.WebClient wc = new System.Net.WebClient();
wc.DownloadString("https://myurl"); //
var contentType = wc.ResponseHeaders["Content-Type"];
var charset = Regex.Match(contentType, "charset=([^;]+)").Groups[1].Value;
wc.Encoding = Encoding.GetEncoding(charset);
It does set the charset to the document's (ISO-8859-1), but when I do the follow-up DownloadString (I know I could set the encoding up front and make a single wc.DownloadString call, but I wanted to follow the accepted answer's example):
string result = wc.DownloadString("https://myurl");
The special characters still come wrong.
NOTE: I am using a non-English Windows 10 (if it's relevant)
NOTE 2: The page's special characters appear correctly in any browser
My question is: why doesn't WebClient download the content correctly even with the correct charset set?
using System.Text;
wc.Encoding = Encoding.UTF8; // decode the response body as UTF-8
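If the page really contains literal HTML entities after that DOCTYPE, the charset alone won't turn them back into characters; the entities have to be decoded as well. A minimal sketch under that assumption, combining the charset detection above with WebUtility.HtmlDecode (the URL is the placeholder from the question):
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

using (var wc = new WebClient())
{
    // Download raw bytes so no decoding happens before we know the charset.
    byte[] raw = wc.DownloadData("https://myurl");
    var contentType = wc.ResponseHeaders["Content-Type"];
    var match = Regex.Match(contentType ?? "", "charset=([^;]+)");
    var encoding = match.Success
        ? Encoding.GetEncoding(match.Groups[1].Value.Trim())
        : Encoding.GetEncoding("ISO-8859-1"); // fall back to the charset the page declares

    // Decode the bytes with the detected charset, then decode entities: &eacute; -> é.
    string html = WebUtility.HtmlDecode(encoding.GetString(raw));
}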
Completely stuck on a problem related to the inbound parse webhook functionality offered by SendGrid: https://sendgrid.com/docs/for-developers/parsing-email/setting-up-the-inbound-parse-webhook/
First off everything is working just fine with retrieving the mail sent to my application endpoint. Using Request.Form I'm able to retrieve the data and work with it.
The problem is that we started noticing question mark symbols instead of letters when receiving some mails (written in Swedish using Å, Ä and Ö). This occurred both when sending plain-text mails and mails with an HTML body.
However, this only happens every now and then. After a lot of searching I found out that if the mail is sent from e.g. Postbox or Outlook (or the like), and the application has the charset set to iso-8859-1, that's when Å Ä Ö are replaced by question marks.
To replicate the error and be able to debug it, I set up an HTML page with a form using the iso-8859-1 encoding, sending a payload similar to the default one shown in the link above. After that I've been testing a multitude of things trying to get it to work.
As of now I'm trying to recode the input, without success. Code I'm testing:
Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
byte[] wind1252Bytes = wind1252.GetBytes(Request.Form["html"]);
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8,wind1252Bytes);
string utf8String = Encoding.UTF8.GetString(utf8Bytes);
This only results in utf8String containing the same "???" where Å Ä Ö should be. My guess is that Request.Form["html"] returns a UTF-16 string whose content was already decoded with the wrong encoding (iso-8859-1), so the damage is done before my conversion runs.
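That guess is easy to check in isolation. A minimal sketch (not the SendGrid pipeline, just a demonstration of the assumption) showing that once iso-8859-1 bytes have been run through the wrong decoder, no later re-encoding can bring Å Ä Ö back:
using System;
using System.Text;

// "ÅÄÖ" encoded as ISO-8859-1 is the three bytes 0xC5 0xC4 0xD6.
byte[] latin1Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes("ÅÄÖ");

// Decoding those bytes as UTF-8 fails: each byte forms an invalid sequence,
// so each becomes the replacement character U+FFFD (displayed as � or ?).
string misDecoded = Encoding.UTF8.GetString(latin1Bytes);
Console.WriteLine(misDecoded); // the original letters are already gone here

// Re-encoding misDecoded afterwards cannot recover them; the bytes have to be
// decoded with ISO-8859-1 before they ever become a .NET string.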
The method for fetching the POST is as follows
public async Task<InboundParseModel> FetchMail(IFormCollection form)
{
InboundParseModel _em = new InboundParseModel
{
To = form["to"].SingleOrDefault(),
From = form["from"].SingleOrDefault(),
Subject = form["subject"].SingleOrDefault(),
Html = form["html"].SingleOrDefault(),
Text = System.Net.WebUtility.HtmlEncode(form["text"].SingleOrDefault()),
Envelope = form["envelope"].SingleOrDefault()
};
return _em; // hand the populated model back to the caller
}
It is called from the action method that receives the POST, like this: FetchMail(Request.Form);
Project info: ASP.NET Core 2.2, C#
So as stated earlier, I am completely stuck and don't really have any ideas on how to solve this. Any help would be much appreciated!
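One thing that may be worth inspecting is the charsets field that the Inbound Parse payload sends alongside the other fields (it appears in the default payload shown in the docs linked above); it states which encoding each field was submitted in. A hedged sketch of reading it, assuming Newtonsoft.Json (the default serializer in ASP.NET Core 2.2) and the same form parameter as in FetchMail:
using System;
using System.Collections.Generic;
using Newtonsoft.Json;

// e.g. {"to":"UTF-8","subject":"UTF-8","html":"iso-8859-1","text":"iso-8859-1"}
var charsets = JsonConvert.DeserializeObject<Dictionary<string, string>>(form["charsets"]);

if (charsets.TryGetValue("html", out var htmlCharset) &&
    !htmlCharset.Equals("UTF-8", StringComparison.OrdinalIgnoreCase))
{
    // Request.Form has already decoded this field as UTF-8, so the original bytes
    // may be lost by now; the field would need to be read from the raw request body
    // (or the multipart data re-parsed) using htmlCharset instead.
}
This doesn't repair the loss by itself, but it tells you exactly which fields need special handling.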
I am parsing some web content from the response of an HttpWebRequest.
The web content uses charset ISO-8859-1, and when parsing it and finally getting the word I need from the response, I receive a string with a question-mark character like this: �. I want to know the right way to transform it back into a readable string.
So, what I've tried is to convert the current word encoding into UTF-8 like this:
(I am wondering if UTF-8 could solve my problem)
string word = "ESPA�OL";
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf = Encoding.GetEncoding("UTF-8");
byte[] isoBytes = iso.GetBytes(word);
byte[] utfBytes = Encoding.Convert(iso, utf, isoBytes);
string utfWord = utf.GetString(utfBytes);
Console.WriteLine(utfWord);
However, the utfWord variable outputs ESPA?OL, which is still wrong. The correct output is supposed to be ESPAÑOL.
Can someone please give me the right directions to solve this, if possible?
The word in question is "ESPAÑOL". This can be encoded correctly in ISO-8859-1 since all characters in the word are represented in ISO-8859-1.
You can see this for yourself using the following simple program:
using System;
using System.Diagnostics;
using System.Text;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
Encoding enc = Encoding.GetEncoding("ISO-8859-1");
string original = "ESPAÑOL";
byte[] iso_8859_1 = enc.GetBytes(original);
string roundTripped = enc.GetString(iso_8859_1);
Debug.Assert(original == roundTripped);
Console.WriteLine(roundTripped);
}
}
}
What this tells you is that you need to properly diagnose where the erroneous character comes from. By the time that you have a � character, it is too late. The information has been lost. The presence of the � character indicates that, at some point, a conversion was performed into a character set that did not contain the character Ñ.
A conversion from ISO-8859-1 to a Unicode encoding will correctly handle "ESPAÑOL" because that word can be encoded in ISO-8859-1.
The most likely explanation is that somewhere along the way, the text "ESPAÑOL" is being converted to a character set that does not contain the letter Ñ.
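In other words, the fix is to avoid the lossy step rather than try to repair it afterwards. A minimal sketch of reading an ISO-8859-1 response without corrupting Ñ, assuming an HttpWebRequest as in the question (the URL is a placeholder):
using System.IO;
using System.Net;
using System.Text;

var request = (HttpWebRequest)WebRequest.Create("https://example.com/page");
using (var response = (HttpWebResponse)request.GetResponse())
{
    // Use the charset reported by the server, falling back to ISO-8859-1.
    var encoding = Encoding.GetEncoding(
        string.IsNullOrEmpty(response.CharacterSet) ? "ISO-8859-1" : response.CharacterSet);

    using (var reader = new StreamReader(response.GetResponseStream(), encoding))
    {
        string html = reader.ReadToEnd(); // "ESPAÑOL" arrives intact; no conversion needed
    }
}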
This might differ from other Korean encoding questions.
There is this site I have to scrape and it's Korean.
An example sentence in their site is this
"개인정보보호를 위해 뒤로가기 버튼 대신 검색결과 화면 상단과 하단의 이전 버튼을 사용하시기 바랍니다."
I am using HttpWebRequest and HttpWebResponse to scrape the site.
This is how I retrieve the HTML:
-- partial code --
using (Stream data = resp.GetResponseStream())
{
response.Append(new StreamReader(data, Encoding.GetEncoding(code), true).ReadToEnd());
}
Now my problem is, I'm not getting the correct Korean characters. For my code variable, I'm picking the code page from the encoding list on MSDN: http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx (let me narrow it down).
here are the Korean code pages:
51949, 50225, 20949, 20833, 10003, 949
But I'm still not getting the correct Korean characters. What do you think is the problem?
It is very likely that the page is not in a specific Korean encoding, but one of the Unicode encodings.
Try Encoding.UTF8 or Encoding.Unicode (UTF-16) instead of the specific code pages. There are also Encoding.UTF7 and Encoding.UTF32, but they are not as common.
To be certain, examine the meta tags and headers for the content-type returned by the server.
Update (gleaned from comments):
Since the content-type header is EUC-KR, the corresponding codepage is 51949 and this is what you need to use to retrieve the page.
It was not clear that you are writing this out to a file - you need to use the same encoding when writing the file out, or convert the byte[] from the original to the output file encoding (using Encoding.Convert).
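Putting those two points together, a minimal sketch (assuming the same resp response object as in the partial code above):
using System.IO;
using System.Text;

Encoding euckr = Encoding.GetEncoding(51949); // EUC-KR, matching the content-type header

using (Stream data = resp.GetResponseStream())
using (var reader = new StreamReader(data, euckr))
{
    string html = reader.ReadToEnd();

    // Write the file with an explicit encoding so whatever reads the file knows
    // what it is getting; UTF-8 round-trips the Korean text safely.
    File.WriteAllText("page.html", html, Encoding.UTF8);
}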
While having the exact same issue, I ended up solving it with the code below:
Encoding.UTF8.GetString(DownloadData(URL));
This directly decodes the output of the WebClient GET request as UTF-8.
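For completeness, the same idea with an explicit WebClient instance (a sketch, with url standing in for the page address; DownloadData returns the raw bytes, so you choose which encoding decodes them):
using System.Net;
using System.Text;

using (var client = new WebClient())
{
    byte[] raw = client.DownloadData(url);      // raw bytes, nothing decoded yet
    string html = Encoding.UTF8.GetString(raw); // or Encoding.GetEncoding(51949) for an EUC-KR page
}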
I'm using WebClient to get the HTML source code from websites and put the HTML in a textbox,
but for some reason I'm getting weird symbols in the textbox.
using (WebClient cliente = new WebClient())
{
textbox.Text = cliente.DownloadString(url);
}
I'm using c# .net 3.5
http://imageshack.us/photo/my-images/691/weirdssymbols.jpg/
Those are representations of non-printable new line characters.
Try
textBox.Multiline = true;
using (WebClient cliente = new WebClient())
{
textbox.Text = cliente.DownloadString(url);
}
I think that it's a problem connected to encoding.
Is your string utf-8 encoded?
You need to set the WebClient encoding equal to the web page encoding (if you manage the page, setting it to utf-8 is the better solution).
http://msdn.microsoft.com/en-us/library/system.net.webclient.encoding%28v=vs.80%29.aspx
Then I think you wouldn't get the bad squares anymore; however, I don't know what encoding textboxes use, which could be a problem (again, I suppose they use utf-8; I don't know if they are configurable).
EDIT:
Didn't see your comment; yes, I definitely think those squares are \r\n characters, which (maybe) are written on the page with an encoding different from utf-8 (so it's not your fault, it's a problem the webpage's developer created).
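Combining the two suggestions into one sketch (assuming a WinForms TextBox and url standing in for the page address): set the WebClient encoding to match the page, make the textbox multiline so line breaks display instead of squares, and normalize the line endings just in case.
using System;
using System.Net;
using System.Text;

textBox.Multiline = true;

using (var cliente = new WebClient())
{
    cliente.Encoding = Encoding.UTF8; // match the page's declared charset
    string html = cliente.DownloadString(url);

    // Normalize bare \n or \r so the TextBox renders line breaks rather than squares.
    textBox.Text = html.Replace("\r\n", "\n").Replace("\r", "\n").Replace("\n", Environment.NewLine);
}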
´ can't be converted; you must replace it yourself with String.Replace with whatever you want (´ is used by HTML to show some special characters).
When I do
WebClient wc = new WebClient();
string content = wc.DownloadString(url);
File.WriteAllText(path, content);
And when I open the file in path with Internet Explorer, special characters like ó appear as ó.
Is there a way to interpret those characters correctly?
You're downloading it in whatever content encoding is specified, but then saving it as UTF-8. If you want to save it to disk anyway, I suggest you use WebClient.DownloadFile directly instead. Then so long as the encoding is also specified in the HTML (correctly) it should be okay.
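A sketch of that suggestion, plus a byte-for-byte alternative in case the string is also needed in memory (url and path as in the question):
using System.IO;
using System.Net;
using System.Text;

using (var wc = new WebClient())
{
    // Option 1: let WebClient write the response to disk unchanged, byte for byte.
    wc.DownloadFile(url, path);

    // Option 2: download the raw bytes, save them unchanged, and decode separately
    // with whatever encoding the page actually uses.
    byte[] raw = wc.DownloadData(url);
    File.WriteAllBytes(path, raw);
    string content = Encoding.UTF8.GetString(raw); // assuming the page is UTF-8
}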