How to read HTML as XML? - C#

I want to extract a couple of links from an HTML page downloaded from the internet, and I think that LINQ to XML would be a good solution for my case.
My problem is that I can't create an XmlDocument from the HTML; using Load(string url) didn't work, so I downloaded the HTML into a string using:
public static string readHTML(string url)
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse res = (HttpWebResponse)req.GetResponse();
    StreamReader sr = new StreamReader(res.GetResponseStream());
    string html = sr.ReadToEnd();
    sr.Close();
    return html;
}
When I try to load that string using LoadXml(string xml) I get the exception:
'--' is an unexpected token. The expected token is '>'
What approach should I take to read the HTML into parsable XML?

HTML simply isn't the same as XML (unless the HTML happens to be conforming XHTML or HTML5 in XML mode). The best way is to use an HTML parser to read the HTML. Afterwards you can transform it to LINQ to XML, or process it directly.

I haven't used it myself, but I suggest you take a look at SgmlReader. Here's a sample from their home page:
// set up SgmlReader ("reader" is a TextReader over the HTML source)
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader()
{
    DocType = "HTML",
    WhitespaceHandling = WhitespaceHandling.All,
    CaseFolding = Sgml.CaseFolding.ToLower,
    InputStream = reader
};

// create document
XmlDocument doc = new XmlDocument()
{
    PreserveWhitespace = true,
    XmlResolver = null
};
doc.Load(sgmlReader);
return doc;
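
Since the question mentions LINQ to XML: SgmlReader derives from XmlReader, so you can feed it straight into an XDocument as well. A minimal sketch (the sample HTML and the link query are illustrative, not from the SgmlReader docs):

using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

class SgmlToLinq
{
    static void Main()
    {
        // Deliberately sloppy HTML (unquoted attributes); SgmlReader repairs it.
        string html = "<html><body><a href=page1.html>One</a><br><a href=page2.html>Two</a></body></html>";

        Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader()
        {
            DocType = "HTML",
            CaseFolding = Sgml.CaseFolding.ToLower,
            InputStream = new StringReader(html)
        };

        // XDocument.Load accepts any XmlReader, including SgmlReader.
        XDocument doc = XDocument.Load(sgmlReader);

        // LINQ to XML query: the href of every <a> element.
        foreach (string href in doc.Descendants("a")
                                   .Select(a => (string)a.Attribute("href"))
                                   .Where(h => h != null))
        {
            Console.WriteLine(href);
        }
    }
}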

If you want to extract some links from a page, as you mentioned, try using HTML Agility Pack.
This code gets a page from the web and extracts all links:
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("http://www.stackoverflow.com");
HtmlNode[] links = document.DocumentNode.SelectNodes("//a").ToArray();
Open an HTML file from disk and get the URL for a specific link:
HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\page.html");
HtmlNode link = document2.DocumentNode.SelectSingleNode("//a[@id='myLink']");
Console.WriteLine(link.Attributes["href"].Value);
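
The first snippet collects the link nodes but not their targets. If what you actually want is the URL list itself, a small sketch of the extra step (note that SelectNodes returns null, not an empty collection, when nothing matches):

using System;
using HtmlAgilityPack;

class ListLinks
{
    static void Main()
    {
        HtmlWeb web = new HtmlWeb();
        HtmlDocument document = web.Load("http://www.stackoverflow.com");

        // Guard against SelectNodes returning null when there are no matches.
        var anchors = document.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return;

        foreach (HtmlNode a in anchors)
            Console.WriteLine(a.GetAttributeValue("href", ""));
    }
}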

HTML is not XML. HTML is based on SGML, and as such does not guarantee that the markup is well-formed XML (XML is itself a restricted subset of SGML). You can only parse XHTML, i.e. XML-compatible HTML, as XML, and that is not the case for most websites.
To work with HTML, you need to use an HTML parser.

If you know which nodes you're interested in, I would use a regex to extract the links from the string.
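
For completeness, a minimal sketch of the regex route; be warned that regular expressions are fragile against real-world HTML (unquoted attributes, line breaks inside tags), so prefer a parser for anything non-trivial:

using System;
using System.Text.RegularExpressions;

class RegexLinks
{
    static void Main()
    {
        string html = "<a href=\"http://example.com/a\">A</a> <a href='http://example.com/b'>B</a>";

        // Match href attributes quoted with either double or single quotes.
        Regex hrefPattern = new Regex(@"href\s*=\s*([""'])(?<url>.*?)\1", RegexOptions.IgnoreCase);

        foreach (Match m in hrefPattern.Matches(html))
            Console.WriteLine(m.Groups["url"].Value);
    }
}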

Related


Insert double quotes around all HTML tag attributes

I am trying to convert HTML to XML, but the HTML tag attributes are missing their double quotes, so converting to XML gives me an error.
How can I add double quotes to all the attributes in my XML file?
I am using a VB.NET Windows Forms application.
Converting HTML to XML directly would not work; there are various corner cases where the conversion may fail. The best way to convert HTML to XML is to:
1. Extract the relevant data from the HTML using a parser like HtmlAgilityPack.
2. Store the extracted data as XML using one of the XML APIs, such as XmlWriter or LINQ to XML (a sketch of both steps follows).
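
A minimal sketch of both steps, assuming the data of interest is the text and href of each link (the element names and output file are made up for illustration):

using System.Xml.Linq;
using HtmlAgilityPack;

class HtmlToXml
{
    static void Main()
    {
        // Step 1: extract the relevant data with HtmlAgilityPack.
        var html = new HtmlDocument();
        html.LoadHtml("<html><body><a href='a.html'>A</a><a href='b.html'>B</a></body></html>");
        var anchors = html.DocumentNode.SelectNodes("//a[@href]");

        // Step 2: write well-formed XML with LINQ to XML; attributes are quoted automatically.
        var doc = new XDocument(new XElement("links"));
        if (anchors != null)
        {
            foreach (HtmlNode a in anchors)
            {
                doc.Root.Add(new XElement("link",
                    new XAttribute("href", a.GetAttributeValue("href", "")),
                    a.InnerText));
            }
        }
        doc.Save("links.xml");
    }
}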
I wonder what method you use to convert; you say nothing about that. Nevertheless, it's obviously this method which is the core problem. And maybe also what you plan to do once the HTML is converted into XML?
To tell the truth, no conversion is needed, given that well-formed HTML is already XML. Simply load your HTML into an XDocument, for example, and that's it. Nothing special to do.
Try this, please:
Install SgmlReader from NuGet.
If you have a string variable like the one below, you will have to wrap it in a TextReader object.
Now we are going to use the installed package:
static XmlDocument HTMLTEST()
{
    string html = "<table frame=all><tgroup></tgroup></table>";
    TextReader reader = new StringReader(html);

    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All;
    sgmlReader.InputStream = reader;

    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true; // false if you don't want whitespace
    doc.XmlResolver = null;
    doc.Load(sgmlReader);
    return doc;
}
The input is a string of HTML, and the return value is an XmlDocument.
The frame=all from your HTML will become frame="all".
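
A short usage sketch for the method above:

XmlDocument doc = HTMLTEST();
// The unquoted attribute now comes back quoted, e.g. frame="all".
Console.WriteLine(doc.OuterXml);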

C# URL Crawler not getting enough links?

I have the following code; however, when I launch it I only ever seem to get a few URLs returned.
while (stopFlag != true)
{
    WebRequest request = WebRequest.Create(urlList[i]);
    using (WebResponse response = request.GetResponse())
    {
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            string sitecontent = reader.ReadToEnd();

            // add links to the list
            // process the content
            // clear the text box ready for the HTML code
            //Regex urlRx = new Regex(@"((https?|ftp|file)\://|www.)[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*", RegexOptions.IgnoreCase);
            Regex urlRx = new Regex(@"(?<url>(http:[/][/]|www.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)", RegexOptions.IgnoreCase);
            MatchCollection matches = urlRx.Matches(sitecontent);
            foreach (Match match in matches)
            {
                string cleanMatch = cleanUP(match.Value);
                urlList.Add(cleanMatch);
                updateResults(theResults, "\"" + cleanMatch + "\",\n");
            }
        }
    }
}
I think the error is within the regex.
What I am trying to achieve is to pull a web page, grab all the links from that page, add them to a list, and then fetch the page behind each list item and repeat the process.
Instead of trying to use a regex to parse HTML, I suggest using a good HTML parser - the HTML Agility Pack is a great choice:
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
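
A sketch of what the crawl loop might look like with the Agility Pack doing the link extraction instead of the regex; the queue bookkeeping and the 100-page cap are illustrative, not a production crawler:

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class Crawler
{
    static void Main()
    {
        var pending = new Queue<string>();
        var seen = new HashSet<string>();
        pending.Enqueue("http://www.stackoverflow.com");

        var web = new HtmlWeb();
        while (pending.Count > 0 && seen.Count < 100) // cap so the loop terminates
        {
            string url = pending.Dequeue();
            if (!seen.Add(url))
                continue; // already visited

            var document = web.Load(url);
            var anchors = document.DocumentNode.SelectNodes("//a[@href]");
            if (anchors == null)
                continue;

            foreach (HtmlNode a in anchors)
            {
                string href = a.GetAttributeValue("href", "");
                // Only follow absolute http(s) links in this sketch.
                if (Uri.TryCreate(href, UriKind.Absolute, out Uri uri) &&
                    (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
                {
                    pending.Enqueue(uri.ToString());
                    Console.WriteLine(uri);
                }
            }
        }
    }
}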

Get HTML elements of <tr>....</tr> by creating an HttpWebRequest in C#

I want to extract some HTML elements from the table-row contents of an HTML page and build an automated application. Can HttpWebRequest and HttpWebResponse help me do that? If yes, could anyone show me a sample of how? Thanking you in advance.
I would go get HtmlAgilityPack from NuGet. WebClient is easier, but HttpWebRequest is more powerful and allows for more control. Regex can work, but is generally a pain. If you think the document will be well enough formed, a quick XPath to the elements in question is usually much easier and cleaner, so try something like this:
var client = new WebClient();
//var html = client.DownloadString("YOURURL");
var html = "<html><body><table><tr><td></td></tr></table></body></html>";
var document = new HtmlDocument();
document.LoadHtml(html);
var nodes = document.DocumentNode.SelectNodes("//body/table");
Console.WriteLine(nodes[0].InnerHtml);
Console.ReadLine();
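
Since the question asks specifically for the <tr> contents, here is a variation of the snippet above that walks the rows and cells (the sample markup is made up):

using System;
using HtmlAgilityPack;

class TableRows
{
    static void Main()
    {
        var html = "<html><body><table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table></body></html>";
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var rows = document.DocumentNode.SelectNodes("//tr");
        if (rows == null)
            return;

        foreach (HtmlNode row in rows)
        {
            // "td" is relative to the row, so this picks up only that row's cells.
            var cells = row.SelectNodes("td");
            if (cells == null)
                continue;
            foreach (HtmlNode cell in cells)
                Console.Write(cell.InnerText + " ");
            Console.WriteLine();
        }
    }
}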

Parsing web pages via C#, XmlDocument.LoadXml

I'm trying to download a web page and parse it. I need to reach every node of the HTML document, so I used WebClient for the download, which works perfectly. Then I use the following code segment to parse the document:
WebClient client = new WebClient();
Stream data = client.OpenRead("http://web.cs.hacettepe.edu.tr/~bil339/");
StreamReader reader = new StreamReader(data);
string xml = reader.ReadToEnd();
data.Close();
reader.Close();
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xml);
At the last line, the program waits for some time, then crashes, reporting errors in the HTML code: this wasn't expected, that shouldn't be here, and so on.
Any suggestions to fix this? Other techniques for parsing HTML are welcome (in C#, of course).
Use the HTML Agility Pack to parse the HTML. Even well-formed HTML is not XML and can't be parsed as such: HTML allows unclosed tags like <br>, unquoted attributes, and entities such as &nbsp; that XML does not define. The HTML Agility Pack is forgiving about all of that.
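
Since the goal stated above is to reach every node of the document, a minimal sketch with the Html Agility Pack that downloads the page and visits every element node (the URL is the one from the question):

using System;
using HtmlAgilityPack;

class WalkAllNodes
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://web.cs.hacettepe.edu.tr/~bil339/");

        // Descendants() enumerates every node in document order.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Element)
                Console.WriteLine(node.Name);
        }
    }
}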
