System.IO.File.ReadAllText(path) does not read the html file - c#

I want to read an HTML file, and for that I use System.IO.File.ReadAllText(path). It reads all my HTML files except one, which this function fails to read.
I have also tried
using (StreamReader reader = File.OpenText(fileName)) {
    text = reader.ReadToEnd();
}
but the problem is the same.
What could be the reason, and what is the solution? Or is there another way to read the file?

I'll take a wild guess:
The file contains Unicode sequences for extended chars and the diagnosis is based on a (mismatched) length.
If I debug the code, the text looks like
"<\0h\0t\0m\0l\0>\0<\0h\0e\0a\0d\0>\0\r\0\n\0<\0M\0E\0T\0A\0
\0h\0t\0t\0p\0-\0e\0q\0u\0i\0v\0=\0\"\0C\0o\0n\0t\0e\0n
Which is a valid beginning of an HTML file except for the very first char. The file is probably damaged by a missing Unicode marker at the start. This damage was probably caused when it was written and is not easily repairable now.
You could try setting the WebClient.Encoding to UTF8 (and try a few ASCII as well).
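The \0 after every character in the debugger output is what UTF-16 little-endian text looks like when it is decoded one byte per character, so decoding explicitly with Encoding.Unicode is worth a try. This is only a minimal sketch, assuming the file really is UTF-16 LE without a byte order mark; Utf16Reader and ReadUtf16 are hypothetical names, and the temp file merely simulates the problem file.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf16Reader
{
    // Decodes the file as UTF-16 little-endian (Encoding.Unicode).
    // Assumption: the file has no byte order mark, so nothing overrides this choice.
    public static string ReadUtf16(string path)
    {
        return File.ReadAllText(path, Encoding.Unicode);
    }

    public static void Main()
    {
        // Simulate the problem file: UTF-16 LE bytes written without a BOM.
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, Encoding.Unicode.GetBytes("<html><head>"));

        // Without a BOM, ReadAllText(path) falls back to UTF-8,
        // so every zero byte shows up as a '\0' character.
        string wrong = File.ReadAllText(path);
        string right = ReadUtf16(path);

        Console.WriteLine("Default read contains NULs: " + wrong.Contains("\0"));
        Console.WriteLine("Encoding.Unicode read: " + right);

        File.Delete(path);
    }
}
```

If the second line prints readable markup for the real file, the encoding was the whole problem and the file is not damaged beyond the missing marker.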

Does the MessageBox show anything? Any error? What does varText.Length show?
string varText = File.ReadAllText(varFile, Encoding.Default);
MessageBox.Show(varFile + " Text: " + varText + " Length: " + varText.Length);
Verify in the MessageBox that the path to the file is correct. Also verify that your application has the same access rights as you would have when reading the file with Notepad.

I came across this on Google recently. The correct way to do it is via WebClient:
WebClient client = new WebClient();
String guestMsg = client.DownloadString("C:\\temp\\TheBarGuestDetailsEmail.htm");
File.ReadAllText can garble the HTML as it reads: without an explicit encoding it assumes UTF-8, so characters like £ or ' get mangled when the file was saved in another encoding.

Related

C# .csv-file in WinForm with Ä, Ö, Ü [duplicate]

I'm using the code below to read a text file that contains foreign characters, the file is encoded ANSI and looks fine in notepad. The code below doesn't work, when the file values are read and shown in the datagrid the characters appear as squares, could there be another problem elsewhere?
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.ANSI);
using (reader = File.OpenText(inputFilePath))
Thanks
Update 1: I have tried all encodings found under System.Text.Encoding. and all fail to show the file correctly.
Update 2: I've changed the file encoding (resaved the file) to unicode and used System.Text.Encoding.Unicode and it worked just fine. So why did notepad read it correctly? And why didn't System.Text.Encoding.Unicode read the ANSI file?
You may also try the Default encoding, which uses the current system's ANSI codepage.
StreamReader reader = new StreamReader(inputFilePath, Encoding.Default, true);
When you try using the Notepad "Save As" menu with the original file, look at the encoding combo box. It will tell you which encoding notepad guessed is used by the file.
Also, if it is an ANSI file, the detectEncodingFromByteOrderMarks parameter will probably not help much.
I had the same problem and my solution was simple: instead of
Encoding.ASCII
use
Encoding.GetEncoding("iso-8859-1")
The answer was found here.
Edit: more solutions. This may be a more accurate one:
Encoding.GetEncoding(1252);
Also, in some cases this will work for you too if your OS default encoding matches file encoding:
Encoding.Default;
Yes, the problem could be the actual encoding of the file, which is probably Unicode. Try UTF-8, as that is the most common form of Unicode encoding. Otherwise, if the file is ASCII, the standard ASCII encoding should work.
Using Encoding.Unicode won't accurately decode an ANSI file in the same way that a JPEG decoder won't understand a GIF file.
I'm surprised that Encoding.Default didn't work for the ANSI file if it really was ANSI - if you ever find out exactly which code page Notepad was using, you could use Encoding.GetEncoding(int).
In general, where possible I'd recommend using UTF-8.
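To illustrate why the wrong decoder cannot recover the text, here is a small sketch that decodes the same byte with two encodings. It assumes the file stores "Ä" as the single byte 0xC4, as ISO-8859-1 and Windows-1252 both do; EncodingMismatchDemo and Decode are hypothetical names.

```csharp
using System;
using System.Text;

public static class EncodingMismatchDemo
{
    // Decodes raw bytes with the given encoding.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        // Assumption: "Ä" stored as one byte, 0xC4 (ISO-8859-1 / Windows-1252).
        byte[] ansiBytes = { 0xC4 };

        // iso-8859-1 is available without extra setup. Code page 1252 may need
        // a registered code page provider on .NET Core and later.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // The matching encoding yields U+00C4 (Ä).
        Console.WriteLine(Decode(ansiBytes, latin1) == "\u00C4");

        // UTF-8 treats a lone 0xC4 as an incomplete sequence and
        // substitutes the replacement character U+FFFD.
        Console.WriteLine(Decode(ansiBytes, Encoding.UTF8) == "\uFFFD");
    }
}
```

Once the replacement character has been substituted, the original byte is gone from the string, which is why the fix has to happen at read time rather than afterwards.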
Try a different encoding such as Encoding.UTF8. You can also try letting StreamReader find the encoding itself:
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.UTF8, true);
Edit: Just saw your update. Try letting StreamReader do the guessing.
For Swedish Å Ä Ö, the only solution from the ones above that worked was:
Encoding.GetEncoding("iso-8859-1")
Hopefully this will save someone time.
File.OpenText() always uses a UTF-8 StreamReader implicitly. Create your own StreamReader instance instead and specify the desired encoding, like this:
using (StreamReader reader = new StreamReader(@"C:\test.txt", Encoding.Default))
{
// ...
}
I solved my problem reading Portuguese characters by changing the source file's encoding in Notepad++.
C#
var url = System.Web.HttpContext.Current.Server.MapPath(@"~/Content/data.json");
string s = string.Empty;
using (System.IO.StreamReader sr = new System.IO.StreamReader(url, System.Text.Encoding.UTF8, true))
{
    s = sr.ReadToEnd();
}
I'm also reading an exported file which contains French and German text. I used Encoding.GetEncoding("iso-8859-1"), true and it worked without any problems.
For Arabic, I used Encoding.GetEncoding(1256). It works well.
I had a similar problem with ProcessStartInfo and the property StandardOutputEncoding. I set it for German language console output to code page 850. This way I could read the output like ausführen instead of ausf�hren.

Read an HTML/Text file and Send it as an HTML Formatted Email in WPF/C#

I have a WPF application that sends out an HTML-formatted email when a button is clicked. The entire email message is in HTML format and it does work.
However, I was wondering if there was a way to read an HTML file and send it out rather than writing the whole message in the code-behind, keeping all the HTML formatting intact.
I tried something like this:
string MessageTosend = File.ReadAllText("path to txt/html file");
But that just sent out an email that only has text (no styling, no html...just the plain text found in the file).
Then I thought, I may have to convert everything:
string MessageTosend = Convert.ToString(File.ReadAllText("path to txt/html file"));
But that does the same thing as before.
Is there a way to achieve this? Or will I have to stick to having
string MessageTosend = @"<html> ... lots of html stuff ... </html>";
for every button that sends an email?
Note: the .txt and .html files I tried to read contained the same contents as the string above (which, again, works as expected), both with and without the doubled double quotes (for example, width=""100"" and width="100").
Try adding an encoding to your file read:
string MessageTosend = File.ReadAllText("path to txt/html file", Encoding.UTF8);
Try reading a file simply containing < and compare it to the string "<". Repeat for any special characters until you find a mismatch. Then find the character number like this:
(int)MessageTosend[0] // '<' should be 60 (0x3C in hexadecimal)
Find out what the offending characters are, and we may be able to help. If I read a file, I do not see this problem.
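The character-by-character check described above can be automated by dumping the code point of every character read from the file. A minimal sketch; CharDump and Describe are hypothetical names, and the literal string stands in for the text read from the file.

```csharp
using System;
using System.Text;

public static class CharDump
{
    // Returns the UTF-16 code unit of each character as U+XXXX, space separated.
    public static string Describe(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            sb.AppendFormat("U+{0:X4} ", (int)c);
        }
        return sb.ToString().TrimEnd();
    }

    public static void Main()
    {
        // Stand-in for the string returned by File.ReadAllText.
        string messageToSend = "<p>";
        Console.WriteLine(Describe(messageToSend));
    }
}
```

Comparing this dump for the file contents against the dump for the hard-coded string should show the first position where the two differ.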

Strip Out Illegal Characters For Excel Sheet

I wrote a program that crawls a website to get data and outputs it to an Excel sheet. The program is written in C# using Microsoft Visual Studio 2010.
Most of the time, I have no problem getting content from the website, parsing it, and storing the data in Excel.
However, once in a while I run into an issue saying that there are illegal characters (such as ▶) that prevent output to the Excel file, which crashes the program.
I also went onto the website manually and found other illegal characters such as Ú.
I tried to do a .Replace() but the code can't seem to find those characters.
string htmlContent = getResponse(url); //get full html from given url
string newHtml = htmlContent.Replace("▶", "?").Replace("Ú", "?");
So my question is, is there a way to strip out all characters of those types from a html string? (the html of the web page) Below is the error message I got.
I tried Anthony and woz's solution and that didn't work...
See System.Text.Encoding.Convert
Example usage:
string htmlText = "▶Hello World"; // the text you're trying to convert
string convertedText = System.Text.Encoding.ASCII.GetString(
    System.Text.Encoding.Convert(
        System.Text.Encoding.Unicode,
        System.Text.Encoding.ASCII,
        System.Text.Encoding.Unicode.GetBytes(htmlText)));
I tested this with the string ▶Hello World and it gave me ?Hello World.
You could try stripping all non-ASCII characters.
string htmlContent = getResponse(url);
string newHtml = Regex.Replace(htmlContent, @"[^\u0000-\u007F]", "?");
Thank you for the replies and thanks for the help.
After a couple more hours of googling I have found the solution to my question. The problem was that I had to "sanitize" my HTML string.
http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/
Above is the helpful article I found, which also provides code example.
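The linked article's title suggests the crash came from characters that are not legal in XML. The following is a rough sketch of that kind of sanitizing, not the article's code: it assumes the output format rejects characters outside the XML 1.0 character set, and XmlSanitizer and Sanitize are hypothetical names.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlSanitizer
{
    // Drops every character that XML 1.0 does not allow,
    // while keeping valid surrogate pairs together.
    public static string Sanitize(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            else if (i + 1 < text.Length && char.IsSurrogatePair(c, text[i + 1]))
            {
                // A high/low surrogate pair encodes one valid supplementary character.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            // Anything else (NUL, most control characters, lone surrogates) is skipped.
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // NUL and vertical tab are not allowed in XML 1.0.
        string dirty = "a\u0000b\u000Bc";
        Console.WriteLine(Sanitize(dirty));
    }
}
```

Note that ▶ and Ú are themselves valid XML characters and pass through unchanged, so if they still cause trouble, the encoding used to fetch the page is the more likely culprit.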

Irregular character/text encoding issue with writing back to file

I'm using this function to read text lines from a file:
string[] postFileLines = System.IO.File.ReadAllLines(pstPathTextBox.Text);
I insert a few additional lines at strategic spots, then write the text lines back to a file with:
TextWriter textW = new StreamWriter(filePath);
for (int i = 0; i < linesToWrite.Count; i++)
{
textW.WriteLine(linesToWrite[i]);
}
textW.Close();
This works perfectly well until the text file I am reading in contains an international or special character. When writing back to the file, I don't get the same character - it is a box.
Ex:
Before = W:\Contrat à faire aujourdhui\
After = W:\Contrat � faire aujourdhui\
This webpage is portraying it as a question mark, but in the text file it's a white rectangular box.
Is there a way to include the correct encoding in my application to be able to handle these characters? Or, if not, throw a warning saying it was not able to properly write given line?
Add an encoding like this:
File.ReadAllLines(path, Encoding.UTF8);
and
new StreamWriter(filePath, false, Encoding.UTF8);
Hope it helps.
Use this; it works for me:
string txt = System.IO.File.ReadAllText(inpPath, Encoding.GetEncoding("iso-8859-1"));
You can try UTF-8 encoding while writing to the file as well. WriteLine has no encoding parameter, so pass the encoding to the StreamWriter constructor:
TextWriter textW = new StreamWriter(filePath, false, Encoding.UTF8);
You may need to write using a single-byte character set.
Using Encoding.GetEncodings() you can easily list all available encodings (the "DOS" encodings are of type System.Text.SBCSCodePageEncoding).
In your case you may need to use
File.ReadAllLines(path, Encoding.GetEncoding("IBM850"));
and
new StreamWriter(filePath, false, Encoding.GetEncoding("IBM850"));
Bonne journée! ;)
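Whichever encoding turns out to match the file, the common point of the answers above is to use the same one for reading and for writing. A minimal sketch of that round trip, assuming for illustration that the file is ISO-8859-1; RoundTrip and InsertLine are hypothetical names, and the temp file stands in for the real one.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class RoundTrip
{
    // Reads all lines, inserts one, and writes everything back
    // with the same encoding that was used for reading.
    public static void InsertLine(string path, int index, string newLine, Encoding encoding)
    {
        var lines = new List<string>(File.ReadAllLines(path, encoding));
        lines.Insert(index, newLine);

        // StreamWriter(path, append, encoding): append = false overwrites the file.
        using (var writer = new StreamWriter(path, false, encoding))
        {
            foreach (string line in lines)
            {
                writer.WriteLine(line);
            }
        }
    }

    public static void Main()
    {
        // Assumption for this demo: the file is ISO-8859-1 encoded.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, latin1.GetBytes("W:\\Contrat \u00E0 faire aujourdhui\\\r\n"));

        InsertLine(path, 0, "; inserted line", latin1);

        // The accented character survives because both directions agree on the encoding.
        Console.WriteLine(File.ReadAllText(path, latin1).Contains("\u00E0"));
        File.Delete(path);
    }
}
```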

Writing PHP in C# with a String Builder problem

I have the following C# code to produce a small PHP file. The reason I am doing this is to update 400 plus sites automatically. The sites are in PHP on a Windows Environment so using C# for utility apps is the easiest for me.
fileContents.AppendFormat("<?php{0}",Environment.NewLine);
fileContents.AppendFormat("# FileName=\"clientsite.php\"{0}",Environment.NewLine);
fileContents.AppendFormat("# HTTP=\"true\"{0}",Environment.NewLine);
fileContents.AppendFormat("$clientname = \"{0}\";{1}", clientsiteName, Environment.NewLine);
fileContents.AppendFormat("$version = \"v6.2i\";{0}",Environment.NewLine);
fileContents.Append("?>");
The end result of this file causes a strange character to appear on the PHP page that includes this page. When I manually open the created PHP file - press backspace on the last line then enter it works. Is there something better than Environment.NewLine to use for this? Or is there another problem I am missing?
EDIT: The character looks like something I can't reproduce on the keyboard (a squiggly line) but ends with ?
You could just try "\n", I believe Environment.NewLine is "\r\n".
But it could also be about how you write the StringBuilder (I assume fileContents is a StringBuilder) to the file. If you e.g. use WriteAllText, you could try using different encoding.
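One encoding-related possibility worth ruling out is a UTF-8 byte order mark: writing with Encoding.UTF8 puts three extra bytes (EF BB BF) at the start of the file, and PHP sends anything before <?php to the page as output. Whether that is the cause here is an assumption, since the question does not show how the file is written. PhpFileWriter and WriteWithoutBom are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

public static class PhpFileWriter
{
    // Writes the text as UTF-8 without a byte order mark.
    public static void WriteWithoutBom(string path, string contents)
    {
        File.WriteAllText(path, contents, new UTF8Encoding(false));
    }

    public static void Main()
    {
        string contents = "<?php" + Environment.NewLine + "$version = \"v6.2i\";" + Environment.NewLine + "?>";
        string path = Path.GetTempFileName();

        // Encoding.UTF8 emits the BOM bytes before the text.
        File.WriteAllText(path, contents, Encoding.UTF8);
        Console.WriteLine("With Encoding.UTF8, first byte: 0x" + File.ReadAllBytes(path)[0].ToString("X2"));

        // UTF8Encoding(false) starts directly with '<'.
        WriteWithoutBom(path, contents);
        Console.WriteLine("Without BOM, first byte: 0x" + File.ReadAllBytes(path)[0].ToString("X2"));

        File.Delete(path);
    }
}
```

Opening the generated file in a hex viewer and checking its first three bytes would confirm or rule out this explanation.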
