Parsing web pages via C#, XmlDocument.LoadXml - c#

I'm trying to download a web page and parse it. I need to reach every node of the HTML document. I used WebClient to download the page, which works perfectly. Then I use the following code to parse the document:
WebClient client = new WebClient();
Stream data = client.OpenRead("http://web.cs.hacettepe.edu.tr/~bil339/");
StreamReader reader = new StreamReader(data);
string xml = reader.ReadToEnd();
data.Close();
reader.Close();
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xml);
On the last line, the program waits for some time, then crashes. It reports errors in the HTML code: this wasn't expected, that shouldn't be here, and so on.
Any suggestions to fix this? Other techniques to parse HTML are welcome (in C#, of course).

Use the HTML Agility Pack to parse HTML. HTML is generally not well-formed XML and can't be parsed as such. For instance, HTML allows unclosed elements such as <br>, unquoted attribute values, and named entities such as &nbsp; that an XML parser rejects. The HTML Agility Pack is more forgiving.
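To reach every node, as the question asks, something along these lines may work. It is only a sketch: it assumes the HtmlAgilityPack NuGet package is installed, and the HtmlWalker class and ElementNames method are hypothetical names chosen for illustration.

```csharp
using System.Linq;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is referenced

public static class HtmlWalker
{
    // Parses an HTML string leniently and returns the name of every element node.
    public static string[] ElementNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // accepts markup that XmlDocument.LoadXml would reject

        return doc.DocumentNode
                  .Descendants() // every node below the document root
                  .Where(n => n.NodeType == HtmlNodeType.Element)
                  .Select(n => n.Name)
                  .ToArray();
    }
}
```

With the code from the question, the downloaded string would be passed in directly, e.g. HtmlWalker.ElementNames(xml).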

Related

Read XML Value from a XML String with DOCTYPE [duplicate]

I want to extract a couple of links from an HTML page downloaded from the internet, and I think that LINQ to XML would be a good solution for my case.
My problem is that I can't create an XmlDocument from the HTML. Using Load(string url) didn't work, so I downloaded the HTML to a string using:
public static string readHTML(string url)
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse res = (HttpWebResponse)req.GetResponse();
    StreamReader sr = new StreamReader(res.GetResponseStream());
    string html = sr.ReadToEnd();
    sr.Close();
    return html;
}
When I try to load that string using LoadXml(string xml) I get the exception
'--' is an unexpected token. The expected token is '>'
Which approach should I take to read the HTML file as parsable XML?
HTML simply isn't the same as XML (unless the HTML actually happens to be conforming XHTML or HTML5 in XML mode). The best way is to use an HTML parser to read the HTML. Afterwards you may transform it to LINQ to XML, or process it directly.
I haven't used it myself, but I suggest you take a look at SgmlReader. Here's a sample from their home page:
// setup SgmlReader
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader()
{
    DocType = "HTML",
    WhitespaceHandling = WhitespaceHandling.All,
    CaseFolding = Sgml.CaseFolding.ToLower,
    InputStream = reader
};
// create document
XmlDocument doc = new XmlDocument()
{
    PreserveWhitespace = true,
    XmlResolver = null
};
doc.Load(sgmlReader);
return doc;
If you want to extract some links from a page, as you mentioned, try using HTML Agility Pack.
This code gets a page from the web and extracts all links:
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("http://www.stackoverflow.com");
HtmlNode[] links = document.DocumentNode.SelectNodes("//a").ToArray();
Open an HTML file from disk and get the URL for a specific link:
HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\page.html");
HtmlNode link = document2.DocumentNode.SelectSingleNode("//a[@id='myLink']");
Console.WriteLine(link.Attributes["href"].Value);
HTML is not XML. HTML is based on SGML, and as such does not ensure that the markup is well-formed XML (XML is a subset of SGML itself). You can only parse XHTML, i.e. XML compatible HTML, as XML. But of course that is not the case for most of the websites.
To work with HTML, you need to use an HTML parser.
If you know the nodes you're interested in, I would use a regex to extract the links from the string.

Insert double quotes around all html tag attribute

I am trying to convert HTML to XML, but the attribute values in the HTML tags are not wrapped in double quotes, so the conversion gives me an error.
How can I add double quotes around all attribute values in my file?
I am using a VB.NET Windows Forms application.
Converting HTML to XML directly is unreliable: there are various corner cases where the conversion may fail.
The best way to convert HTML to XML would be to:
1. Extract the relevant data from the HTML using a parser such as HtmlAgilityPack.
2. Store the extracted data as XML using an XML API such as XmlWriter or LINQ to XML.
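The two steps above could be sketched roughly as follows, in C#; the same libraries are callable from VB.NET. It assumes the HtmlAgilityPack NuGet package is installed. The LinkExporter class, the LinksToXml method, and the "links"/"link" element names are all made up for illustration; substitute whatever data you actually need to extract.

```csharp
using System.Xml.Linq;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is referenced

public static class LinkExporter
{
    // Step 1: pull the anchors out of lenient HTML.
    // Step 2: store them in a LINQ to XML tree, which serializes as well-formed XML.
    public static XElement LinksToXml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var root = new XElement("links"); // hypothetical element name

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return root;

        foreach (HtmlNode a in anchors)
        {
            root.Add(new XElement("link",
                new XAttribute("href", a.GetAttributeValue("href", "")),
                a.InnerText.Trim()));
        }
        return root;
    }
}
```

Because the XML is produced by an XML API rather than by patching the HTML text, attribute values come out quoted and escaped without any manual string handling.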
I wonder what method you use to convert; you say nothing about that. Nevertheless, that method is obviously the core problem. It would also help to know what you plan to do once the HTML is converted into XML.
To tell the truth, no conversion is needed if the HTML is already well-formed XML (that is, XHTML). In that case, simply load it into an XDocument, for example, and that's it. Arbitrary HTML, such as markup with unquoted attribute values, will not load this way.
Try this, please:
Install SgmlReader from NuGet.
If your HTML is in a string variable, as below, you will have to wrap it in a TextReader object.
Then use the installed package:
static XmlDocument HTMLTEST()
{
    string html = "<table frame=all><tgroup></tgroup></table>";
    TextReader reader = new StringReader(html);
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All;
    sgmlReader.InputStream = reader;
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true; // false if you don't want whitespace
    doc.XmlResolver = null;
    doc.Load(sgmlReader);
    return doc;
}
The input is an HTML string, and the return value is an XmlDocument.
The frame=all from your HTML will become frame="all".

Is there a quick way to format an XmlDocument for display in C#?

I want to output my InnerXml property for display in a web page. I would like to see indentation of the various tags. Is there an easy way to do this?
Here's a little class that I put together some time ago to do exactly this.
It assumes that you're working with the XML in string format.
public static class FormatXML
{
    public static string FormatXMLString(string sUnformattedXML)
    {
        XmlDocument xd = new XmlDocument();
        xd.LoadXml(sUnformattedXML);
        StringBuilder sb = new StringBuilder();
        StringWriter sw = new StringWriter(sb);
        XmlTextWriter xtw = null;
        try
        {
            xtw = new XmlTextWriter(sw);
            xtw.Formatting = Formatting.Indented;
            xd.WriteTo(xtw);
        }
        finally
        {
            if (xtw != null)
                xtw.Close();
        }
        return sb.ToString();
    }
}
You should be able to do this with code formatters. You would have to html encode the xml into the page first.
Google has a nice prettifier that is capable of visualizing XML as well as several programming languages.
Basically, reference the stylesheet and script, then put your HTML-encoded XML into a pre tag like this:
<link href="prettify.css" type="text/css" rel="stylesheet" />
<script type="text/javascript" src="prettify.js"></script>
<pre class="prettyprint">
<!-- your HTML-encoded XML goes here -->
</pre>
Use the XML Web Server Control to display the content of an xml document on a web page.
EDIT: You should pass the entire XmlDocument to the Document property of the XML Web Server Control to display it. You don't need to use the InnerXml property.
If indentation is your only concern and you can afford to launch an external process, you can process the XML file with the HTML Tidy console tool (~100K).
The command is:
tidy --input-xml y --output-xhtml y --indent "1" $(FilePath)
Then you can display the indented string on the web page once you escape the special characters.
It would also be easy to write a recursive function that produces such output: iterate over the nodes starting from the root and recurse into each child node, passing the indentation level as a parameter to each recursive call.
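The recursive approach might look something like this. It is a rough sketch rather than a complete serializer: XmlIndenter, Indent, and Write are hypothetical names, it uses two spaces per level, and it trims text nodes, so it is suitable for display only.

```csharp
using System.Text;
using System.Xml;

public static class XmlIndenter
{
    // Loads the XML string and returns an indented rendering of it.
    public static string Indent(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        var sb = new StringBuilder();
        Write(doc.DocumentElement, 0, sb);
        return sb.ToString();
    }

    // Writes one node at the given depth, then recurses into its children
    // with depth + 1, as described above.
    private static void Write(XmlNode node, int depth, StringBuilder sb)
    {
        string pad = new string(' ', depth * 2);

        if (node.NodeType != XmlNodeType.Element)
        {
            // Text, CDATA, comments: emit the serialized form on its own line.
            sb.Append(pad).AppendLine(node.OuterXml.Trim());
            return;
        }

        sb.Append(pad).Append('<').Append(node.Name);
        foreach (XmlAttribute attr in node.Attributes)
            sb.Append(' ').Append(attr.OuterXml); // name="value", already escaped

        if (!node.HasChildNodes)
        {
            sb.AppendLine(" />");
            return;
        }

        sb.AppendLine(">");
        foreach (XmlNode child in node.ChildNodes)
            Write(child, depth + 1, sb);
        sb.Append(pad).Append("</").Append(node.Name).AppendLine(">");
    }
}
```

Before placing the result in a page, it would still need to be HTML-encoded and wrapped in a pre element so the browser keeps the whitespace.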
Check out the free Actipro CodeHighlighter for ASP.NET - it can neatly display XML and other formats.
Or are you more interested in actually formatting your XML? Then have a look at the XmlTextWriter: you can set Formatting (indented or not) and Indentation (the indent size), then write your XML to, for example, a MemoryStream and read it back from there into a string for display.
Marc
Use an XmlWriter created through XmlWriter.Create with an XmlWriterSettings object whose Indent property is set to true. You can use a StringWriter as "temporary storage" if you want to write the resulting string onto the screen.
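A minimal sketch of that suggestion, using XmlWriter.Create with XmlWriterSettings and a StringWriter as the temporary storage. The XmlPretty class and Format method are hypothetical names, and omitting the XML declaration is just a choice made here for display purposes.

```csharp
using System.IO;
using System.Xml;

public static class XmlPretty
{
    // Serializes the document with indentation into a string.
    public static string Format(XmlDocument doc)
    {
        var settings = new XmlWriterSettings
        {
            Indent = true,
            IndentChars = "  ",
            OmitXmlDeclaration = true // display only; drop this line to keep the declaration
        };

        using (var sw = new StringWriter())
        {
            using (XmlWriter writer = XmlWriter.Create(sw, settings))
            {
                doc.Save(writer);
            } // disposing the writer flushes it into the StringWriter
            return sw.ToString();
        }
    }
}
```

Indentation is only applied where the document has no preserved whitespace, so load the XmlDocument with PreserveWhitespace left at its default of false.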

UTF-8 encoding issue

I am trying to fetch data from an RSS feed (feed location is http://www.bgsvetionik.com/rss/ ) in a C# WinForms application. Take a look at the following code:
public static XmlDocument FromUri(string uri)
{
    XmlDocument xmlDoc;
    WebClient webClient = new WebClient();
    using (Stream rssStream = webClient.OpenRead(uri))
    {
        XmlTextReader reader = new XmlTextReader(rssStream);
        xmlDoc = new XmlDocument();
        xmlDoc.XmlResolver = null;
        xmlDoc.Load(reader);
    }
    return xmlDoc;
}
Although xmlDoc.InnerXml contains an XML declaration with UTF-8 encoding, I get &scaron; instead of š etc.
How can I solve it?
The feed's data is incorrect. The &scaron; is inside a CDATA section, so it isn't being treated as an entity by the XML parser.
If you look at the source XML, you'll find that there's a mixture of entities and "raw" characters, e.g. within the word čišćenja in the middle of the first title.
If you need to correct that, you'll have to do it yourself with a Replace call - the XML parser is doing exactly what it's meant to.
EDIT: For the replacement, you could get hold of all the HTML entities and replace them one by one, or just find out which ones are actually being used. Then do:
string text = element.Value.Replace("&scaron;", "š")
.Replace(...);
Of course, this means that anything which is actually correctly escaped and should really be that text will get accidentally replaced... but such is the problem with broken data :(
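As an alternative to a hand-written chain of Replace calls, the framework's WebUtility.HtmlDecode could be applied to every text and CDATA node after loading. This swaps in a different technique from the one in the answer, and it assumes the stray sequences in the feed are standard HTML entities. EntityFixer and DecodeTextNodes are hypothetical names.

```csharp
using System.Linq;
using System.Net;
using System.Xml;

public static class EntityFixer
{
    // Decodes HTML entities left as literal text inside the document,
    // for example inside CDATA sections.
    public static void DecodeTextNodes(XmlDocument doc)
    {
        // In XPath, text() matches both plain text nodes and CDATA sections.
        // Snapshot the list first so we are not modifying nodes mid-query.
        var nodes = doc.SelectNodes("//text()").Cast<XmlNode>().ToList();

        foreach (XmlNode node in nodes)
        {
            node.Value = WebUtility.HtmlDecode(node.Value);
        }
    }
}
```

It could be called as EntityFixer.DecodeTextNodes(xmlDoc) just before FromUri returns. The same caveat applies as with Replace: text that was correctly escaped and really meant to contain an entity sequence gets decoded too.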
