I'm using WebClient to download the HTML source of a web page and show it in a textbox,
but for some reason I'm getting weird symbols in the textbox.
using (WebClient cliente = new WebClient())
{
textbox.Text = cliente.DownloadString(url);
}
I'm using C# with .NET 3.5.
http://imageshack.us/photo/my-images/691/weirdssymbols.jpg/
Those are representations of non-printable new line characters.
Try
textbox.Multiline = true;
using (WebClient cliente = new WebClient())
{
textbox.Text = cliente.DownloadString(url);
}
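If the page uses bare line-feed characters, setting Multiline alone may not be enough, since a Windows Forms textbox generally expects \r\n line breaks. A minimal sketch of normalizing the line endings before assigning the text; the NewlineHelper class and its method name are made up for illustration:

```csharp
using System.Text.RegularExpressions;

public static class NewlineHelper
{
    // Rewrites every line ending (\r\n, bare \r, or bare \n) as \r\n.
    // The alternation tries \r\n first, so existing Windows line endings are not doubled.
    public static string ToWindowsNewlines(string text)
    {
        return Regex.Replace(text, "\r\n|\r|\n", "\r\n");
    }
}
```

With this in place the assignment would read textbox.Text = NewlineHelper.ToWindowsNewlines(cliente.DownloadString(url));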
I think the problem is related to encoding.
Is your string UTF-8 encoded?
You need to set the WebClient's Encoding property to the web page's encoding (if you control the page, serving it as UTF-8 is the better solution).
http://msdn.microsoft.com/en-us/library/system.net.webclient.encoding%28v=vs.80%29.aspx
Then I think you won't get the squares anymore. However, I don't know which encoding textboxes use, so that could still be a problem (I assume they use UTF-8, but I don't know whether it is configurable).
EDIT:
I didn't see your comment. Yes, I think those squares are definitely \r\n characters, which (maybe) are written on the page with an encoding other than UTF-8 (so it's not your fault; it's a problem created by the web page's developer).
&acute; can't be converted; you must replace it yourself with String.Replace with whatever you want (&acute; is an HTML entity, which HTML uses to show some special characters).
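Rather than replacing each entity by hand, the framework's HTML decoder can usually convert them in one call. A sketch, assuming System.Net.WebUtility is available (.NET 4.0 and later; on .NET 3.5 the similar HttpUtility.HtmlDecode in System.Web would be the candidate). The EntityDemo class is only for illustration:

```csharp
using System.Net;

public static class EntityDemo
{
    // Converts HTML character entities such as &eacute; or &#233;
    // into the characters they stand for.
    public static string DecodeEntities(string html)
    {
        return WebUtility.HtmlDecode(html);
    }
}
```

Decoding applies to the whole string, so it is best done on the extracted text rather than on markup that still has to be parsed.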
Completely stuck on a problem related to the inbound parse webhook functionality offered by SendGrid: https://sendgrid.com/docs/for-developers/parsing-email/setting-up-the-inbound-parse-webhook/
First off everything is working just fine with retrieving the mail sent to my application endpoint. Using Request.Form I'm able to retrieve the data and work with it.
The problem is that we started noticing question marks instead of letters when receiving some mails (written in Swedish, using Å, Ä and Ö). This occurred both with plain-text mails and with mails that have an HTML body.
However, this only happens every now and then. After a lot of searching I found out that Å, Ä and Ö are replaced by question marks when the mail is sent from e.g. Postbox or Outlook (or the like) and that application has its charset set to iso-8859-1.
To replicate the error and be able to debug it, I set up an HTML page with a form that uses the iso-8859-1 encoding and sends a payload similar to the one shown in the link above (the default one). Since then I have tested a multitude of things trying to get it to work.
As of now I'm trying to re-encode the input, without success. The code I'm testing:
Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
byte[] wind1252Bytes = wind1252.GetBytes(Request.Form["html"]);
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
string utf8String = Encoding.UTF8.GetString(utf8Bytes);
This only results in utf8String containing the same "???" where Å Ä Ö should be. My guess is that this is because Request.Form["html"] returns a UTF-16 string of content that was already decoded with the wrong encoding (the original being iso-8859-1).
The method for fetching the POST is as follows
public async Task<InboundParseModel> FetchMail(IFormCollection form)
{
    // Copy the posted form fields into the model.
    InboundParseModel _em = new InboundParseModel
    {
        To = form["to"].SingleOrDefault(),
        From = form["from"].SingleOrDefault(),
        Subject = form["subject"].SingleOrDefault(),
        Html = form["html"].SingleOrDefault(),
        Text = System.Net.WebUtility.HtmlEncode(form["text"].SingleOrDefault()),
        Envelope = form["envelope"].SingleOrDefault()
    };
    return _em;
}
It is called from the method that receives the POST, as FetchMail(Request.Form);
Project info: ASP.NET Core 2.2, C#
So as stated earlier, I am completely stuck and don't really have any ideas on how to solve this. Any help would be much appreciated!
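One way to see why re-encoding the string afterwards does not help: once bytes have been decoded with the wrong charset, the original values are gone, and only decoding the raw bytes with the right charset gives the intended text. A small sketch under that assumption; the CharsetDemo class is hypothetical, the three bytes stand in for "ÅÄÖ" as sent in iso-8859-1, and it does not show how to get at the raw request bytes in ASP.NET Core:

```csharp
using System.Text;

public static class CharsetDemo
{
    // "ÅÄÖ" as a client using iso-8859-1 would put it on the wire.
    public static readonly byte[] Latin1Bytes = { 0xC5, 0xC4, 0xD6 };

    // Decoding with the charset the sender actually used recovers the text.
    public static string DecodeAsLatin1(byte[] raw)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(raw);
    }

    // Decoding the same bytes as UTF-8 yields replacement characters (U+FFFD),
    // because these byte sequences are not valid UTF-8.
    public static string DecodeAsUtf8(byte[] raw)
    {
        return Encoding.UTF8.GetString(raw);
    }

    // Mirrors the attempt in the question: re-encode the already damaged string.
    // The replacement characters carry no trace of the original bytes.
    public static string ReencodeDamaged(string damaged)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] bytes = latin1.GetBytes(damaged);
        return Encoding.UTF8.GetString(Encoding.Convert(latin1, Encoding.UTF8, bytes));
    }
}
```

If this is what happens in the request pipeline, the fix has to be applied where the raw bytes are still available, not on the string returned by Request.Form.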
EDIT: The characters arrive correctly, but in the middle of the page there's this line: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">. After it, special characters are written as HTML entities such as &eacute; for é (which display fine in a browser), but they come out as eacute; (without the &) when downloaded via WebClient. END EDIT
I am extracting an excerpt from a web page using WebClient + Regex.
But even with the encoding set correctly, é still comes out as eacute;, ç as ccedil;, í as iacute;, etc.
I followed the "DownloadString and Special Characters" example to set the charset correctly (ISO-8859-1):
System.Net.WebClient wc = new System.Net.WebClient();
wc.DownloadString("https://myurl"); // first request, only to obtain the response headers
var contentType = wc.ResponseHeaders["Content-Type"];
var charset = Regex.Match(contentType, "charset=([^;]+)").Groups[1].Value;
wc.Encoding = Encoding.GetEncoding(charset);
This does set the charset to match the document's (ISO-8859-1), but when I do the follow-up DownloadString (I know I could set the encoding beforehand and call wc.DownloadString just once, but I wanted to follow the accepted answer's example):
string result = wc.DownloadString("https://myurl");
The special characters still come out wrong.
NOTE: I am using a non-English Windows 10 (if it's relevant)
NOTE 2: The page's special characters appear correctly in any browser
My question is: why doesn't WebClient download the page correctly even with the correct charset set?
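The header-parsing step above can be made a little more defensive. A sketch, with a hypothetical CharsetParser helper, that pulls the charset out of a Content-Type value, tolerates quotes and a missing charset, and falls back to a caller-supplied default:

```csharp
using System.Text.RegularExpressions;

public static class CharsetParser
{
    // Returns the charset named in a Content-Type header value,
    // or fallback when the header is null, empty, or has no charset parameter.
    public static string ExtractCharset(string contentType, string fallback)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return fallback;
        }

        // Accepts charset=ISO-8859-1 as well as charset="utf-8".
        Match m = Regex.Match(contentType, "charset\\s*=\\s*\"?([^;\"\\s]+)", RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : fallback;
    }
}
```

The result can then be passed to Encoding.GetEncoding before the download. Note that this only selects how the response bytes are decoded; it does not change what the page itself contains.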
using System.Text;
wc.Encoding = Encoding.UTF8;
I'm using the code below to read a text file that contains foreign characters. The file is ANSI-encoded and looks fine in Notepad. The code doesn't work: when the file's values are read and shown in the datagrid, the characters appear as squares. Could there be another problem elsewhere?
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.ANSI);
using (reader = File.OpenText(inputFilePath))
Thanks
Update 1: I have tried all the encodings found under System.Text.Encoding, and all of them fail to show the file correctly.
Update 2: I changed the file's encoding (re-saved the file) to Unicode and used System.Text.Encoding.Unicode, and it worked just fine. So why did Notepad read it correctly? And why didn't System.Text.Encoding.Unicode read the ANSI file?
You may also try the Default encoding, which uses the current system's ANSI codepage.
StreamReader reader = new StreamReader(inputFilePath, Encoding.Default, true);
When you use Notepad's "Save As" menu with the original file, look at the encoding combo box. It will tell you which encoding Notepad guessed the file uses.
Also, if it is an ANSI file, the detectEncodingFromByteOrderMarks parameter will probably not help much.
I had the same problem and my solution was simple: instead of
Encoding.ASCII
use
Encoding.GetEncoding("iso-8859-1")
The answer was found here.
Edit: more solutions. This may be a more accurate one:
Encoding.GetEncoding(1252);
Also, in some cases this will work too, if your OS default encoding matches the file's encoding:
Encoding.Default;
Yes, it could be a problem with the actual encoding of the file, which is probably Unicode. Try UTF-8, as that is the most common Unicode encoding. Otherwise, if the file is ASCII, standard ASCII encoding should work.
Using Encoding.Unicode won't accurately decode an ANSI file in the same way that a JPEG decoder won't understand a GIF file.
I'm surprised that Encoding.Default didn't work for the ANSI file if it really was ANSI - if you ever find out exactly which code page Notepad was using, you could use Encoding.GetEncoding(int).
In general, where possible I'd recommend using UTF-8.
Try a different encoding such as Encoding.UTF8. You can also try letting StreamReader find the encoding itself:
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.UTF8, true);
Edit: Just saw your update. Try letting StreamReader do the guessing.
For Swedish Å Ä Ö, the only solution from the ones above that worked was:
Encoding.GetEncoding("iso-8859-1")
Hopefully this will save someone time.
File.OpenText() always uses a UTF-8 StreamReader implicitly. Create your own StreamReader
instance instead and specify the desired encoding,
like
using (StreamReader reader = new StreamReader(@"C:\test.txt", Encoding.Default))
{
// ...
}
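As a fuller sketch of the same idea, the helper below (TextFileReader is a made-up name) reads a whole file with an explicitly chosen encoding and with byte-order-mark detection turned off, so the supplied encoding is always the one used:

```csharp
using System.IO;
using System.Text;

public static class TextFileReader
{
    // Reads the entire file, decoding its bytes with the given encoding.
    // Passing false disables BOM detection, so the encoding is never overridden.
    public static string ReadAll(string path, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(path, encoding, false))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For a Western European "ANSI" file, a call such as TextFileReader.ReadAll(inputFilePath, Encoding.GetEncoding("iso-8859-1")) would be one option to try.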
I solved my problem with reading Portuguese characters by changing the source file in Notepad++.
C#
var url = System.Web.HttpContext.Current.Server.MapPath(@"~/Content/data.json");
string s = string.Empty;
using (System.IO.StreamReader sr = new System.IO.StreamReader(url, System.Text.Encoding.UTF8,true))
{
s = sr.ReadToEnd();
}
I'm also reading an exported file that contains French and German text. I used Encoding.GetEncoding("iso-8859-1") with detectEncodingFromByteOrderMarks set to true, which worked without any problems.
For Arabic, I used Encoding.GetEncoding(1256). It works well.
I had a similar problem with ProcessStartInfo and its StandardOutputEncoding property. I set it to code page 850 for German-language console output. This way I could read output like ausführen instead of ausf�hren.
I'm working on a project for school. We are making a static code analyzer.
A requirement for this is to analyse C# code in Java, which has been going well so far with ANTLR.
I have made some example C# code in Visual Studio to scan with ANTLR. I analyse every C# file in the solution. But it does not work. I am getting a memory leak and this error message:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.antlr.runtime.Lexer.emit(Lexer.java:151)
at org.antlr.runtime.Lexer.nextToken(Lexer.java:86)
at org.antlr.runtime.CommonTokenStream.fillBuffer(CommonTokenStream.java:119)
at org.antlr.runtime.CommonTokenStream.LT(CommonTokenStream.java:238)
After a while I thought it was an issue with encoding, because all the files are in UTF-8. I think it can't read the encoded stream. So I opened Notepad++ and changed the encoding of every file to ANSI, and then it worked. I don't really understand what ANSI means: is it a character set or some kind of organisation?
I want to change the encoding from any encoding (probably UTF-8) to this ANSI encoding so I won't get memory leaks anymore.
This is the code that makes the Lexer and Parser:
InputStream inputStream = new FileInputStream(new File(filePath));
CharStream charStream = new ANTLRInputStream(inputStream);
CSharpLexer cSharpLexer = new CSharpLexer(charStream);
CommonTokenStream commonTokenStream = new CommonTokenStream(cSharpLexer);
CSharpParser cSharpParser = new CSharpParser(commonTokenStream);
Does anyone know how to change the encoding of the InputStream to the right encoding?
And what does Notepad++ do when I change the encoding to ANSI?
When reading text files you should set the encoding explicitly. Try your example with the following change:
CharStream charStream = new ANTLRInputStream(inputStream, "UTF-8");
I solved this issue by wrapping the InputStream in a BufferedStream and then removing the byte order mark.
I guess my parser didn't like that encoding, because I had also tried setting the encoding explicitly.
This might be different from other Korean encoding questions.
There is a site I have to scrape, and it's in Korean.
An example sentence from the site is this:
"개인정보보호를 위해 뒤로가기 버튼 대신 검색결과 화면 상단과 하단의 이전 버튼을 사용하시기 바랍니다."
I am using HttpWebRequest and HttpWebResponse to scrape the site.
This is how I retrieve the HTML:
-- partial code --
using (Stream data = resp.GetResponseStream())
{
response.Append(new StreamReader(data, Encoding.GetEncoding(code), true).ReadToEnd());
}
Now my problem is that I am not getting the correct Korean characters. For my "code" variable, I'm using the code pages listed on MSDN at http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx (let me narrow it down).
Here are the Korean code pages:
51949, 50225, 20949, 20833, 10003, 949
But I am still not getting the correct Korean characters. What do you think the problem is?
It is very likely that the page is not in a specific Korean encoding, but one of the Unicode encodings.
Try Encoding.UTF8 or Encoding.Unicode (UTF-16) instead of the specific code pages. There are also Encoding.UTF7 and Encoding.UTF32, but they are not as common.
To be certain, examine the meta tags and headers for the content-type returned by the server.
Update (gleaned from comments):
Since the content-type header is EUC-KR, the corresponding codepage is 51949 and this is what you need to use to retrieve the page.
It was not clear that you are writing this out to a file - you need to use the same encoding when writing the file out, or convert the byte[] from the original to the output file encoding (using Encoding.Convert).
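The conversion step mentioned above can be sketched with Encoding.Convert. The Transcoder helper is hypothetical; for the page in question the source encoding would be Encoding.GetEncoding(51949), which on .NET Core additionally requires registering CodePagesEncodingProvider from the System.Text.Encoding.CodePages package:

```csharp
using System.Text;

public static class Transcoder
{
    // Re-encodes raw bytes from the encoding they were received in
    // into the encoding the output file should use.
    public static byte[] Transcode(byte[] input, Encoding source, Encoding target)
    {
        return Encoding.Convert(source, target, input);
    }
}
```

The returned bytes can be written out as they are (for example with File.WriteAllBytes), so the file ends up in the target encoding without a second decode.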
I had the exact same issue and solved it with the code below:
Encoding.UTF8.GetString(DownloadData(URL));
This decodes the raw bytes returned by the WebClient GET request directly as UTF-8.