HTML to RichTextBox as Plaintext with Hyperlinks - c#

Having read so much about not using regexes to strip HTML, I am wondering how to get some links into my RichTextBox without also getting all the messy HTML in the content that I download from a newspaper site.
What I have: HTML from a newspaper website.
What I want: the article as plain text in a RichTextBox, but with links (that is, replacing <a href="foo">bar</a> with <Hyperlink NavigateUri="foo">bar</Hyperlink>).
HtmlAgilityPack gives me HtmlNode.InnerText (stripped of all HTML tags) and HtmlNode.InnerHtml (with all tags). I can get the URL and text of the link(s) with articlenode.SelectNodes(".//a"), but how do I know where to insert them in the plain text from HtmlNode.InnerText?
Any hint would be appreciated.

Here is how you can do it (with a sample console app but the idea is the same for Silverlight):
Let's suppose you have this HTML:
<html>
<head></head>
<body>
Link 1: <a href="foo1">bar</a>
Link 2: <a href="foo2">bar2</a>
</body>
</html>
Then this code:
HtmlDocument doc = new HtmlDocument();
doc.Load(myFileHtm);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
    // replace the HREF element in the DOM at the exact same place
    // by a deep cloned one, with a different name
    HtmlNode newNode = node.ParentNode.ReplaceChild(node.CloneNode("Hyperlink", true), node);
    // modify some attributes
    newNode.SetAttributeValue("NavigateUri", newNode.GetAttributeValue("href", null));
    newNode.Attributes.Remove("href");
}
doc.Save(Console.Out);
will output this:
<html>
<head></head>
<body>
Link 1: <hyperlink navigateuri="foo1">bar</hyperlink>
Link 2: <hyperlink navigateuri="foo2">bar2</hyperlink>
</body>
</html>
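If the goal is plain text plus links rather than rewritten markup, one option is to walk the tree in document order and collect runs of text and links as you go, so each link keeps its position. The following is only a sketch under stated assumptions: the LinkAwareText and Segment names are hypothetical (they are not part of HtmlAgilityPack), and text inside script or style elements is not filtered out.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkAwareText
{
    // Hypothetical holder for one run of output:
    // plain text when Href is null, otherwise a link.
    public sealed class Segment
    {
        public string Text;
        public string Href;
    }

    // Walk the tree in document order so links stay where they were.
    public static List<Segment> Flatten(HtmlNode root)
    {
        var segments = new List<Segment>();
        Collect(root, segments);
        return segments;
    }

    private static void Collect(HtmlNode node, List<Segment> segments)
    {
        foreach (HtmlNode child in node.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                // Plain text run; decode entities such as &amp;.
                segments.Add(new Segment { Text = HtmlEntity.DeEntitize(child.InnerText) });
            }
            else if (child.Name == "a")
            {
                // Link run: keep the visible text and the target URL.
                segments.Add(new Segment
                {
                    Text = HtmlEntity.DeEntitize(child.InnerText),
                    Href = child.GetAttributeValue("href", null)
                });
            }
            else if (child.NodeType == HtmlNodeType.Element)
            {
                // Any other element: drop the tag, keep its contents.
                Collect(child, segments);
            }
        }
    }
}
```

Each segment could then be added to the RichTextBox as either a plain run or a Hyperlink, depending on whether Href is null.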

Related

How to construct an XPath to identify and click on an element using Selenium

I want to access and click on the following HTML code elements.
I tried:
driver.FindElement(By.ClassName("all_excel")).Click();
But an error occurs.
I'd appreciate it if you could give me a solution.
<html>
<body>
<span class="blind">all excel download</span>
</body>
</html>
You can use the locator below:
driver.FindElement(By.XPath("//a[.='all excel download']")).Click();
// or use a CSS selector
driver.FindElement(By.CssSelector("a[class*='_excelDownloadBtn']")).Click();

CsQuery replace tags

I am using CsQuery to parse HTML documents. What I'm trying to do is replace all the "br" HTML tags with a "." character.
Assuming that this is my input HTML:
<html>
<body>
Hello
<br>
World
</body>
</html>
The requested output will be:
<html>
<body>
Hello
.
World
</body>
</html>
Pseudo code:
CQ dom = CQ.CreateFromUrl("http://my.url");
dom.ReplaceTag("<br>", ".");
Is this possible?
Thanks for any advice.
That's pretty easy, just replace the <br> elements by setting their OuterHTML.
The relevant selector is just "br":
foreach (var br in dom["br"])
    br.OuterHTML = ".";
Call dom.Render() to see the result.

HTML agility parsing error

HTML
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<form action="demo_form.asp" id="form1" method="get">
First name: <input type="text" name="fname"><br>
Last name: <input type="text" name="lname"><br>
<input type="submit" value="Submit">
</form>
</body>
</html>
Code
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(File.ReadAllText(@"C:\sample.html"));
HtmlNode nd = doc.DocumentNode.SelectSingleNode("//form[@id='form1']");
//nd.InnerHtml is "".
//nd.InnerText is "".
Problem
nd.ChildNodes //Collection(to get all nodes in form) is always null.
nd.SelectNodes("/input") //returns null.
nd.SelectNodes("./input") //returns null.
"//form[@id='form1']/input" //returns null.
What I want is to access the child nodes of the form tag with id=form1 one by one, in order of occurrence. I tried the same XPath in the Chrome developer console and it works exactly the way I wanted. Is HtmlAgilityPack having a problem reading HTML from a file or the web?
Your html is invalid and may be preventing the html agility pack from working properly.
Try adding a doctype (and an xml namespace) to the start of your document and change your input element's closing tags from > to />
Try adding the following statement before loading the document:
HtmlNode.ElementsFlags.Remove("form");
HtmlAgilityPack's default behaviour adds all the form's inner elements as siblings instead of children. The statement above alters that behaviour so that they (meaning the input tags) will appear as child nodes.
Your code would look like this:
HtmlNode.ElementsFlags.Remove("form");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(File.ReadAllText(@"C:\sample.html"));
HtmlNode nd = doc.DocumentNode.SelectSingleNode("//form[@id='form1']");
etc...
references:
bug issue & fix: http://htmlagilitypack.codeplex.com/workitem/23074
codeplex forum post: http://htmlagilitypack.codeplex.com/discussions/247206
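As a rough, self-contained sketch of the ElementsFlags approach, the helper below parses an HTML string, selects the form and lists its direct input children. FormChildrenDemo and CountInputs are hypothetical names used only for illustration. Note that ElementsFlags is static, so removing the entry affects every document parsed afterwards in the same process.

```csharp
using System;
using HtmlAgilityPack;

public static class FormChildrenDemo
{
    // Returns the number of <input> elements that are direct children
    // of the form with id "form1", printing each name attribute.
    public static int CountInputs(string html)
    {
        // Must run before parsing; Remove is harmless if the key is absent.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode form = doc.DocumentNode.SelectSingleNode("//form[@id='form1']");
        if (form == null)
        {
            return 0;
        }

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection inputs = form.SelectNodes("./input");
        if (inputs == null)
        {
            return 0;
        }

        foreach (HtmlNode input in inputs)
        {
            Console.WriteLine(input.GetAttributeValue("name", "(no name)"));
        }
        return inputs.Count;
    }
}
```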

How can I extract just text from the html

I have a requirement to extract all the text that is present in the <body> of the html. Sample Html input :-
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src="abc.jpg"/>
</body>
</html>
The output should be :-
This is a big title. How are doing you? I am fine
I want to use only HtmlAgility for this purpose. No regular expressions please.
I know how to load an HtmlDocument and then get the body contents using an XPath query like '//body'. But how do I strip the HTML as I have shown in the output?
Thanks in advance :)
You can use the body's InnerText:
string html = @"
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src=""abc.jpg""/>
</body>
</html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
Next, you may want to collapse spaces and new lines:
text = Regex.Replace(text, @"\s+", " ").Trim();
Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, because the tags are simply removed. That issue is difficult to solve, as the display is often determined by the CSS, not just by the markup.
How about using the XPath expression '//body//text()' to select all text nodes?
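A minimal sketch of that idea, assuming HtmlAgilityPack: select the text nodes under body, trim each one, drop the empty ones and join the rest with single spaces. BodyText and Extract are hypothetical names. Whitespace inside a single text node is not collapsed here, and text inside script or style elements in the body would be included.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class BodyText
{
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Every text node below <body>, in document order.
        HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (textNodes == null)
        {
            return string.Empty; // no body, or no text inside it
        }

        var parts = new List<string>();
        foreach (HtmlNode node in textNodes)
        {
            // Decode entities, then skip whitespace-only nodes.
            string piece = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (piece.Length > 0)
            {
                parts.Add(piece);
            }
        }
        return string.Join(" ", parts);
    }
}
```

Because every text node becomes its own piece, hello<br>world comes out with a space between the words, but so does hello<i>world</i>, which a browser would render without one. Which trade-off is acceptable depends on the input.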
You can use NUglify that supports text extraction from HTML:
var result = Uglify.HtmlToText("<div> <p>This is <em> a text </em></p> </div>");
Console.WriteLine(result.Code); // prints: This is a text
As it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors) and is very fast (no regexes involved, just a pure recursive descent parser, faster than HtmlAgilityPack and more GC friendly).
Normally for parsing html I would recommend a HTML parser, however since you want to remove all html tags a simple regex should work.

Html Agility Pack - Get html fragment from an html document

Using the html agility pack; how would I extract an html "fragment" from a full html document? For my purposes, an html "fragment" is defined as all content inside of the <body> tags.
For example:
Sample Input:
<html>
<head>
<title>blah</title>
</head>
<body>
<p>My content</p>
</body>
</html>
Desired Output:
<p>My content</p>
Ideally, I'd like to return the content unaltered if it didn't contain an <html> or <body> element (eg. assume that I was passed a fragment in the first place if it wasn't a full html document)
Can anyone point me in the right direction?
I think you need to do it in pieces.
You can call SelectSingleNode on the document for body or html as follows:
doc.DocumentNode.SelectSingleNode("//body") // returns body with entire contents :)
Then check the result for null. If it is null, the input was already a fragment and you can take the string as it is.
Hope it helps :)
The following will work:
public string GetFragment(HtmlDocument document)
{
    // Look up <body> once; a null result means the input was already a fragment.
    HtmlNode body = document.DocumentNode.SelectSingleNode("//body");
    return body == null ? document.DocumentNode.InnerHtml : body.InnerHtml;
}
