How can I extract just text from the html

I have a requirement to extract all the text that is present in the <body> of the HTML. Sample HTML input:
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src="abc.jpg"/>
</body>
</html>
The output should be:
This is a big title. How are doing you? I am fine
I want to use only Html Agility Pack for this purpose. No regular expressions, please.
I know how to load an HtmlDocument and then use an XPath query like '//body' to get the body contents. But how do I strip the HTML tags as shown in the output?
Thanks in advance :)

You can use the body's InnerText:
string html = @"
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src=""abc.jpg""/>
</body>
</html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
Next, you may want to collapse spaces and new lines:
text = Regex.Replace(text, @"\s+", " ").Trim();
Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, because the tags are simply removed. That issue is difficult to solve, as the display is often determined by the CSS, not just by the markup.
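For the <br> case specifically, one possible workaround is to swap each <br> element for a whitespace text node before reading InnerText. The following is only a sketch under that assumption: the class and method names (BreakAwareText, Extract) are made up for illustration, it does nothing for block-level elements or CSS-driven layout, and it collapses whitespace with Split/Join rather than a regex.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class BreakAwareText
{
    // Hypothetical helper: returns the body text with <br> treated as a space.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var breaks = doc.DocumentNode.SelectNodes("//br");
        if (breaks != null)
        {
            // Copy to a list so the DOM can be modified while iterating.
            foreach (var br in breaks.ToList())
            {
                br.ParentNode.ReplaceChild(doc.CreateTextNode(" "), br);
            }
        }

        // Fall back to the whole document if there is no <body> element.
        var body = doc.DocumentNode.SelectSingleNode("//body") ?? doc.DocumentNode;

        // Splitting on a null separator splits on any whitespace run.
        var words = body.InnerText.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    public static void Main()
    {
        Console.WriteLine(Extract("<html><body>hello<br>world</body></html>"));
    }
}
```

Inline markup such as hello<i>world</i> is deliberately left alone here, since a browser would also render it without a gap.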

How about using the XPath expression '//body//text()' to select all text nodes?
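A minimal sketch of that idea with Html Agility Pack might look like the following. The names BodyText and Extract are hypothetical, and joining the text nodes with a single space is an assumption made to reproduce the output in the question; it also puts a space between inline fragments such as hello<i>world</i>, and any <script> or <style> content inside <body> would be included as text.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class BodyText
{
    // Hypothetical helper: collects every text node under <body>.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (textNodes == null)
        {
            return string.Empty;
        }

        // Decode entities such as &amp; and join the pieces with a space.
        var joined = string.Join(" ", textNodes.Select(n => HtmlEntity.DeEntitize(n.InnerText)));

        // Collapse whitespace runs without a regex: a null separator means "any whitespace".
        var words = joined.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    public static void Main()
    {
        const string html =
            "<html><title>title</title><body><h1> This is a big title.</h1>\n" +
            "How are doing you?\n<h3> I am fine </h3><img src=\"abc.jpg\"/></body></html>";
        Console.WriteLine(Extract(html));
    }
}
```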

You can use NUglify, which supports text extraction from HTML:
var result = Uglify.HtmlToText("<div> <p>This is <em> a text </em></p> </div>");
Console.WriteLine(result.Code); // prints: This is a text
As it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors) and it is very fast (no regexp involved, just a pure recursive descent parser; faster than HtmlAgilityPack and more GC friendly).

Normally, for parsing HTML I would recommend an HTML parser; however, since you only want to remove all HTML tags, a simple regex should work.

Related

Clear raw HTML from malicious data in C#

I'm writing an ASP.NET MVC app. Some pieces of HTML come from users and some from third-party sources. Is there an easy and fast enough way to clean HTML without heavy artillery like HAP (Html Agility Pack) or Tidy?
I just need to remove scripts, styles, <object>/<embed>, href="javascript:", style= and onclick, and I don't think that removing them manually via .Remove/.Replace is a good way, even with StringBuilder.
For example, if I have the following input:
<html>
<style src="http://harmyourpage.com"/>
<script src="http://killyourdog.com"/>
<div>
Good link
Bad link
<p>Some text <b>to</b> test</p><br/>
<h1 style="position:absolute;">Damage your layout</h1>
And an image there <img src="http://co.com/a.jpg"/><br>
<span onclick="harm()">Good span with bad attribute</span>
<object>Your lovely java can be there</object>
</div>
</html>
which must be converted into the following:
<div>
Good link
<a>Bad link</a>
<p>Some text <b>to</b> test</p><br/>
<h1>Damage your layout</h1>
And an image there <img src="http://co.com/a.jpg"/><br>
<span>Good span with bad attribute</span>
</div>
So, how do I do this the right way, with a whitelist of tags and attributes?
UPD: I tried the StackExchange HtmlHelpers library, but it removes needed tags such as div, a and img.

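The fastest way to achieve this is to use a regular expression:
var regex = new Regex(
"(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)|(\\<object(.+?)\\</object\\>)",
RegexOptions.Singleline | RegexOptions.IgnoreCase
);
string output = regex.Replace(input, "");
Note that this pattern only removes paired script, style and object elements. It does not match self-closing tags such as <script src="..."/>, and it leaves attributes such as onclick and style untouched.
You can also use the Microsoft Web Protection Library (http://wpl.codeplex.com/) for the same purpose:
Sanitizer.GetSafeHtmlFragment(input);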

CsQuery replace tags

I am using CsQuery to parse HTML documents. What I'm trying to do is replace all the "br" HTML tags with a "." character.
Assuming that this is my input HTML:
<html>
<body>
Hello
<br>
World
</body>
</html>
The requested output will be:
<html>
<body>
Hello
.
World
</body>
</html>
Pseudo code:
CQ dom = CQ.CreateFromUrl("http://my.url");
dom.ReplaceTag("<br>", ".");
Is this possible?
Thanks for any advice.

That's pretty easy, just replace the <br> elements by setting their OuterHTML.
The relevant selector is just "br":
foreach (var br in dom["br"])
br.OuterHTML = ".";
Call dom.Render() to see the result.

AntiXSS doesn't sanitize unclosed html tag

Why are unclosed html tags not sanitized with Microsoft AntiXSS?
string untrustedHtml = "<img src=x onmouseover=confirm(foo) y=";
string trustedHtml = AntiXSS.Sanitizer.GetSafeHtmlFragment(untrustedHtml); // returns "<img src=x onmouseover=confirm(foo) y="
Closed tags are sanitized:
string untrustedHtml = "<img src=x onmouseover=confirm(foo) y=a>";
string trustedHtml = AntiXSS.Sanitizer.GetSafeHtmlFragment(untrustedHtml); // returns ""

It is recommended to use HTML encoding instead of HTML sanitization whenever possible.
Sanitization should only be used if you actually need to allow some HTML but want to remove any unsafe code. 99% of the time you don't need your users to insert any HTML, and that should be prevented with encoding.
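As a rough illustration of the encoding approach, the sketch below uses System.Net.WebUtility.HtmlEncode from the .NET base class library (not AntiXSS) on the unclosed tag from the question. The wrapper names EncodeDemo and ToSafeText are made up for the example.

```csharp
using System;
using System.Net;

public static class EncodeDemo
{
    // Encodes untrusted input so a browser renders it as text, not as markup.
    public static string ToSafeText(string untrusted)
    {
        return WebUtility.HtmlEncode(untrusted);
    }

    public static void Main()
    {
        // The unclosed tag from the question is neutralised because '<' becomes "&lt;".
        Console.WriteLine(ToSafeText("<img src=x onmouseover=confirm(foo) y="));
        // prints: &lt;img src=x onmouseover=confirm(foo) y=
    }
}
```

Because nothing is parsed, an unclosed tag is handled the same way as a closed one.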
Having said that, if you still want to perform sanitization, AntiXSS is not the best solution, both because of the example above and because it also removes totally safe HTML that it falsely recognizes as unsafe, which makes the AntiXSS sanitizer ineffective.
The Ajax Control Toolkit has a better internal sanitizer you can use, but note that it is less secure because it partly works with blacklists (searching for dangerous code instead of permitting only safe code).
If you still want to use AntiXSS sanitization, you can check whether the inserted HTML is valid before you send it to the sanitizer. You can do this, for example, with an XML document class of some kind, provided the input is well-formed XHTML (ordinary HTML is not always valid XML).
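A well-formedness pre-check along those lines could be sketched with System.Xml.Linq as below. The names MarkupCheck and IsWellFormed and the wrapping <root> element are assumptions for illustration. Keep in mind that this check is stricter than HTML: it also rejects legitimate markup such as an unclosed <br> or unquoted attribute values.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class MarkupCheck
{
    // Hypothetical helper: true only if the fragment parses as well-formed XML.
    public static bool IsWellFormed(string fragment)
    {
        try
        {
            // Wrap in a dummy root so fragments with several top-level nodes can parse.
            XDocument.Parse("<root>" + fragment + "</root>");
            return true;
        }
        catch (XmlException)
        {
            // Unclosed tags, unquoted attributes, unknown entities and so on.
            return false;
        }
    }

    public static void Main()
    {
        string untrustedHtml = "<img src=x onmouseover=confirm(foo) y=";
        Console.WriteLine(IsWellFormed(untrustedHtml)
            ? "well-formed, pass it to the sanitizer"
            : "rejected before sanitizing");
    }
}
```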
Hope this helps.

What version of the AntiXSS library are you using?
I used version 4.3.0.0, and when I ran this through Sanitizer.GetSafeHtmlFragment()
the output was the following value: "<img src=x onmouseover=test(1) y="
As you can see, it automatically encoded the non-HTML values.
Here is the code I used:
protected void Page_Load(object sender, EventArgs e)
{
var testValue = "<img src=x onmouseover=test(1) y=";
litFirst.Text = testValue;
litSecond.Text = Sanitizer.GetSafeHtml(testValue);
litThird.Text = Sanitizer.GetSafeHtmlFragment(testValue);
}
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
<title></title>
<script>
function test(x) {
alert(x);
}
</script>
</head>
<body>
<form id="form1" runat="server">
<div>
First: <asp:Literal ID="litFirst" runat="server"/>
<br/>
Second: <asp:Literal ID="litSecond" runat="server"/>
<br/>
Third: <asp:Literal ID="litThird" runat="server"/>
</div>
</form>
</body>
</html>
But I also agree with Gil Cohen that you really should not allow users to enter HTML.
Like Gil Cohen, I would recommend that instead of allowing them to enter HTML directly, you do it through an intermediate language such as Markdown, Textile or wiki markup, to name a few. This gives users more control over their output but still does not let them write HTML directly.
There are JavaScript WYSIWYG editors that will output the markup and a preview for the user, then allow you to save the markup language for later use (to be converted to HTML during output, not before you save it to your data store).

HTML to RichTextBox as Plaintext with Hyperlinks

Having read so much about not using regexes for stripping HTML, I am wondering how to get some links into my RichTextBox without getting all the messy HTML that is also in the content I download from a newspaper site.
What I have: HTML from a newspaper website.
What I want: The article as plain text in a RichTextBox, but with links (that is, replacing <a href="foo">bar</a> with <Hyperlink NavigateUri="foo">bar</Hyperlink>).
HtmlAgilityPack gives me HtmlNode.InnerText (stripped of all HTML tags) and HtmlNode.InnerHtml (with all tags). I can get the URL and text of the link(s) with articlenode.SelectNodes(".//a"), but how should I know where to insert them in the plain text of HtmlNode.InnerText?
Any hint would be appreciated.

Here is how you can do it (with a sample console app but the idea is the same for Silverlight):
Let's suppose you have this HTML:
<html>
<head></head>
<body>
Link 1: <a href="foo1">bar</a>
Link 2: <a href="foo2">bar2</a>
</body>
</html>
Then this code:
HtmlDocument doc = new HtmlDocument();
doc.Load(myFileHtm);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
// replace the <a> element in the DOM, at the exact same place,
// with a deep clone that has a different name
HtmlNode newNode = node.ParentNode.ReplaceChild(node.CloneNode("Hyperlink", true), node);
// modify some attributes
newNode.SetAttributeValue("NavigateUri", newNode.GetAttributeValue("href", null));
newNode.Attributes.Remove("href");
}
doc.Save(Console.Out);
will output this:
<html>
<head></head>
<body>
Link 1: <hyperlink navigateuri="foo1">bar</hyperlink>
Link 2: <hyperlink navigateuri="foo2">bar2</hyperlink>
</body>
</html>

Html Agility Pack - Get html fragment from an html document

Using the html agility pack; how would I extract an html "fragment" from a full html document? For my purposes, an html "fragment" is defined as all content inside of the <body> tags.
For example:
Sample Input:
<html>
<head>
<title>blah</title>
</head>
<body>
<p>My content</p>
</body>
</html>
Desired Output:
<p>My content</p>
Ideally, I'd like to return the content unaltered if it didn't contain an <html> or <body> element (eg. assume that I was passed a fragment in the first place if it wasn't a full html document)
Can anyone point me in the right direction?

I think you need to do it in pieces.
You can select the body (or html) node of the document as follows:
doc.DocumentNode.SelectSingleNode("//body") // returns body with entire contents :)
Then you can check for a null result, and if there is no body element, take the string as it is.
Hope it helps :)

The following will work:
public string GetFragment(HtmlDocument document)
{
// Fall back to the whole document when there is no <body> element.
HtmlNode body = document.DocumentNode.SelectSingleNode("//body");
return body == null ? document.DocumentNode.InnerHtml : body.InnerHtml;
}
