Handle special chars - C#

When I do
WebClient wc = new WebClient();
string content = wc.DownloadString(url);
File.WriteAllText(path, content);
And when I open the file at path with Internet Explorer, special characters like ó appear as Ã³.
Is there a way to interpret those characters correctly?

You're downloading it in whatever content encoding is specified, but then saving it as UTF-8. If you want to save it to disk anyway, I suggest you use WebClient.DownloadFile directly instead. Then so long as the encoding is also specified in the HTML (correctly) it should be okay.
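For instance, a minimal sketch of that approach (url and path are the variables from the question):
// Sketch: let WebClient write the raw response bytes straight to disk,
// so the file keeps whatever encoding the server actually sent.
WebClient wc = new WebClient();
wc.DownloadFile(url, path);
The browser can then apply the charset declared in the page, provided that declaration matches the real encoding.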

C# .csv-file in WinForm with Ä, Ö, Ü [duplicate]

I'm using the code below to read a text file that contains foreign characters. The file is ANSI-encoded and looks fine in Notepad. The code below doesn't work: when the file's values are read and shown in the DataGrid, the characters appear as squares. Could there be another problem elsewhere?
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.ANSI);
using (reader = File.OpenText(inputFilePath))
Thanks
Update 1: I have tried all the encodings found under System.Text.Encoding and they all fail to show the file correctly.
Update 2: I've changed the file's encoding (re-saved the file) to Unicode, used System.Text.Encoding.Unicode, and it worked just fine. So why did Notepad read it correctly? And why didn't System.Text.Encoding.Unicode read the ANSI file?
You may also try the Default encoding, which uses the current system's ANSI codepage.
StreamReader reader = new StreamReader(inputFilePath, Encoding.Default, true);
When you use the Notepad "Save As" menu with the original file, look at the encoding combo box. It will tell you which encoding Notepad guessed the file uses.
Also, if it is an ANSI file, the detectEncodingFromByteOrderMarks parameter will probably not help much.
I had the same problem and my solution was simple: instead of
Encoding.ASCII
use
Encoding.GetEncoding("iso-8859-1")
The answer was found here.
Edit: more solutions. This may be the more accurate one:
Encoding.GetEncoding(1252);
Also, in some cases this will work for you too, if your OS default encoding matches the file's encoding:
Encoding.Default;
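For illustration, a rough sketch of how these encodings plug into a StreamReader (inputFilePath as in the question; the choice of code page 1252 is an assumption for a Western "ANSI" file):
// Sketch: read with Windows-1252; swap in Encoding.GetEncoding("iso-8859-1")
// or Encoding.Default as needed for your file.
using (StreamReader reader = new StreamReader(inputFilePath, Encoding.GetEncoding(1252)))
{
    string text = reader.ReadToEnd();
    // bind text (or lines parsed from it) to the DataGrid here
}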
Yes, it could be an issue with the actual encoding of the file, probably Unicode. Try UTF-8, as that is the most common form of Unicode encoding. Otherwise, if the file is ASCII, then the standard ASCII encoding should work.
Using Encoding.Unicode won't accurately decode an ANSI file in the same way that a JPEG decoder won't understand a GIF file.
I'm surprised that Encoding.Default didn't work for the ANSI file if it really was ANSI - if you ever find out exactly which code page Notepad was using, you could use Encoding.GetEncoding(int).
In general, where possible I'd recommend using UTF-8.
Try a different encoding such as Encoding.UTF8. You can also try letting StreamReader find the encoding itself:
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.UTF8, true);
Edit: Just saw your update. Try letting StreamReader do the guessing.
For Swedish Å Ä Ö the only solution from the ones above that worked was:
Encoding.GetEncoding("iso-8859-1")
Hopefully this will save someone time.
File.OpenText() always uses a UTF-8 StreamReader implicitly. Create your own StreamReader instance instead and specify the desired encoding, like:
using (StreamReader reader = new StreamReader(@"C:\test.txt", Encoding.Default))
{
    // ...
}
I solved my problem reading Portuguese characters by changing the encoding of the source file in Notepad++.
C#:
var url = System.Web.HttpContext.Current.Server.MapPath(@"~/Content/data.json");
string s = string.Empty;
using (System.IO.StreamReader sr = new System.IO.StreamReader(url, System.Text.Encoding.UTF8, true))
{
s = sr.ReadToEnd();
}
I'm also reading an exported file which contains French and German text. I used Encoding.GetEncoding("iso-8859-1") (with detectEncodingFromByteOrderMarks set to true), which worked without any issues.
For Arabic, I used Encoding.GetEncoding(1256). It works well.
I had a similar problem with ProcessStartInfo and its StandardOutputEncoding property. For German-language console output I set it to code page 850. That way I could read output like ausführen correctly instead of ausf�hren.
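A rough sketch of that setup (the cmd.exe command and its arguments here are only placeholders):
// Sketch: code page 850 is the OEM code page commonly used for German console output.
// (On .NET Core / .NET 5+, code page 850 additionally needs the System.Text.Encoding.CodePages package.)
ProcessStartInfo psi = new ProcessStartInfo("cmd.exe", "/c dir")
{
    RedirectStandardOutput = true,
    UseShellExecute = false,
    StandardOutputEncoding = Encoding.GetEncoding(850)
};
using (Process p = Process.Start(psi))
{
    string output = p.StandardOutput.ReadToEnd();  // e.g. "ausführen" instead of "ausf�hren"
    p.WaitForExit();
}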

Handling Special Characters (¦)

I'm a bit lost on how to read and write to/from text files in C# when special characters are present. I'm writing a simple script that does some cleanup on a .txt data file which contains the '¦' character as its delimiter.
foreach (string file in Directory.EnumerateFiles(@"path\raw txt", "*.txt"))
{
string contents = File.ReadAllText(file);
contents = contents.Replace("¦", ",");
File.WriteAllText(file.Replace("raw txt", "txt"), contents);
}
However, when I open the txt file in Notepad++, the delimiter is now �. What exactly is going on? What is this character's (¦) encoding, and how would I determine that? I've tried adding things like:
string contents = File.ReadAllText(file, Encoding.UTF8);
File.WriteAllText(file.Replace("raw txt", "txt"), contents, Encoding.UTF8);
Everything now works correctly after switching the encoding to Encoding.Default for both reading and writing:
string contents = File.ReadAllText(file, Encoding.Default);
File.WriteAllText(file.Replace("raw txt", "txt"), contents, Encoding.Default);
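Note that Encoding.Default depends on the machine's ANSI code page. If the files are known to be Windows-1252 (where ¦ is byte 0xA6), naming the code page explicitly is a bit more predictable; a sketch under that assumption:
// Sketch: read and write with an explicit Windows-1252 encoding instead of
// relying on whatever Encoding.Default happens to be on the current machine.
Encoding win1252 = Encoding.GetEncoding(1252);
string contents = File.ReadAllText(file, win1252);
File.WriteAllText(file.Replace("raw txt", "txt"), contents.Replace("¦", ","), win1252);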
Try changing the encoding of the file to UTF-8 in Notepad.

Download and encode HTML page into file

I'd like to download some web pages which use charset="UTF-8".
This page is a sample: http://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2003
I always end up with special characters like this:
BeyoncÃ© instead of Beyoncé
I tried the following code:
WebClient webClient = new WebClient();
webClient.Encoding = System.Text.Encoding.UTF8;
webClient.DownloadFile(url, fileName);
or this one:
WebClient client = new WebClient();
Byte[] pageData = client.DownloadData(url);
string pageHtml = Encoding.UTF8.GetString(pageData);
System.IO.File.WriteAllText(fileName, pageHtml);
What am I doing wrong?
I just want an easy way to download web pages and write them to files. After that is done I will extract data from these files, and obviously I want "normal" characters like I see on the original web page, not garbled ones.
The problem is that the WriteAllText method doesn't write the text to the file as UTF-8.
You should add the encoding:
System.IO.File.WriteAllText(fileName, pageHtml, Encoding.UTF8);
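Putting it together, a minimal sketch that combines the question's first attempt with an explicit encoding on both ends (url and fileName as in the question):
// Sketch: download as UTF-8 and write the file back out as UTF-8.
WebClient webClient = new WebClient();
webClient.Encoding = System.Text.Encoding.UTF8;
string pageHtml = webClient.DownloadString(url);
System.IO.File.WriteAllText(fileName, pageHtml, System.Text.Encoding.UTF8);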

Unable to print languages other than English in System.Windows.Forms.WebBrowser

I am trying to use System.Windows.Forms.WebBrowser to display content in languages other than English, but the resulting encoding is incorrect. What should I do to display, for example, Russian?
I am downloading and displaying a string as follows:
System.Net.WebClient wc = new System.Net.WebClient();
webBrsr.DocumentText = wc.DownloadString(url);
The problem is with the WebClient and how it is interpreting the string encoding. One solution is to download the data as raw bytes and parse it out manually:
byte[] bytes = wc.DownloadData("http://news.google.com/news?edchanged=1&ned=ru_ru");
//You should really inspect the headers from the response to determine the exact encoding to use,
// this example just assumes UTF-8 which might work in most scenarios
String t = System.Text.Encoding.UTF8.GetString(bytes);
webBrsr.DocumentText = t;
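If you want to avoid hard-coding UTF-8, one option is to look at the charset in the response's Content-Type header. A rough sketch of that idea (it assumes the header is present and reasonably well-formed; url and webBrsr are the names from the question):
// Sketch: pick the encoding advertised in the Content-Type response header,
// falling back to UTF-8 when no charset is given.
byte[] bytes = wc.DownloadData(url);
string contentType = wc.ResponseHeaders[HttpResponseHeader.ContentType] ?? "";
Encoding enc = Encoding.UTF8;
int i = contentType.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
if (i >= 0)
{
    enc = Encoding.GetEncoding(contentType.Substring(i + "charset=".Length).Trim(' ', '"', ';'));
}
webBrsr.DocumentText = enc.GetString(bytes);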

System.IO.File.ReadAllText(path) does not read the html file

I want to read an HTML file, and for that I use System.IO.File.ReadAllText(path). It can read all of the HTML files except one file, which this function does not read correctly.
I have also used
using (StreamReader reader = File.OpenText(fileName)) {
    text = reader.ReadToEnd();
}
but there is still the same problem.
What could be the reason, and what could be the solution? Or is there another way to read the file?
I'll take a wild guess:
The file contains Unicode sequences for extended chars and the diagnosis is based on (mismatched) length.
If I debug the code, the text looks like
"<\0h\0t\0m\0l\0>\0<\0h\0e\0a\0d\0>\0\r\0\n\0<\0M\0E\0T\0A\0
\0h\0t\0t\0p\0-\0e\0q\0u\0i\0v\0=\0\"\0C\0o\0n\0t\0e\0n
That is a valid beginning of an HTML file except for the very first char. The file is probably damaged by a missing Unicode byte order mark at the start. This damage was probably caused when it was written and is not (easily) repairable now.
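If the file really is UTF-16 that simply lacks its byte order mark, a sketch that may recover it is to name the encoding explicitly instead of relying on detection (path as in the question):
// Sketch: the "<\0h\0t\0m\0l\0>" pattern suggests UTF-16 little-endian text
// without a BOM, so force Encoding.Unicode rather than letting the reader guess.
string text = File.ReadAllText(path, Encoding.Unicode);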
You could try setting WebClient.Encoding to UTF8 (and try ASCII and a few others as well).
Does MsgBox show anything? Any error? What does varText.Length show?
string varText = File.ReadAllText(varFile, Encoding.Default);
MessageBox.Show(varFile + " Text: " + varText + " Length: " + varText.Length);
Verify in the MessageBox that the path to the file is correct, and verify that the access rights from inside your application are the same as when you read the file with Notepad.
Came across this on Google recently. The correct way to do it is via WebClient:
WebClient client = new WebClient();
String guestMsg = client.DownloadString("C:\\temp\\TheBarGuestDetailsEmail.htm");
File.ReadAllText will mess up the HTML when it reads the file, and characters like £ or ' will get mangled.
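If you would rather stay with File.ReadAllText, passing the file's actual encoding explicitly may also avoid the mangled characters. A sketch, assuming the file was saved as Windows-1252 (which is a common cause of £ and ' breaking when read as UTF-8):
// Sketch: read the local HTML file with an explicit Windows-1252 encoding.
string guestMsg = File.ReadAllText(@"C:\temp\TheBarGuestDetailsEmail.htm", Encoding.GetEncoding(1252));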
