CsQuery replace tags - c#

I using CsQuery in order to parse HTML documents. What I'm trying to do is to replace all the "br" HTML tags with "." character.
Assuming that this is my input HTML:
<html>
<body>
Hello
<br>
World
</body>
</html>
The requested output will be:
<html>
<body>
Hello
.
World
</body>
</html>
Pseudo code:
CQ dom = CQ.CreateFromUrl("http://my.url");
dom.ReplaceTag("<br>", ".");
Is this possible?
Thanks for advices.

That's pretty easy, just replace the <br> elements by setting their OuterHTML.
The relevant selector is just "br":
foreach (var br in dom["br"])
br.OuterHTML = ".";
Call dom.Render() to see the result.

Related

MailMessage from MailBee conversion from HTML to TEXT adds extra space

I am using MailBee to convert HTML to Text but it adds an extra space in the beginning of each line except from the first one.
For example I have this HTML
<!DOCTYPE html>
<html>
<head>
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">
</head>
<body>
<div style=\"font-size:13px;font-family:Arial;\"><br></div>
<div style=\"font-size:13px;font-family:Arial;\">test</div>
<div style=\"font-size:13px;font-family:Arial;\">test2</div>
<div style=\"font-size:13px;font-family:Arial;\">test3</div>
<div style=\"font-size:13px;font-family:Arial;\">test</div>
</body>
<html>
(The html is in one line. I have changed it to multi line just for readability.)
When I use this code to get text
MailMessage message = new MailMessage
{
BodyHtmlText = Html
};
message.MakePlainBodyFromHtmlBody();
return message.BodyPlainText;
I get this result
\r\ntest \r\n test2 \r\n test3 \r\n test \r\n
As you can see, before test2, test3 and test, there is an extra space added.
Is this a bug or am I doing something wrong?
Can someone help me?
Thanks
I suggest you use a simple regex to remove Spaces at start or end of line.
The regex:
^[ ]*|[ ]*$
It simply matches zero or more Spaces at either start or end of line.
You need to set the 'Multiline' option.
Then replace the Spaces with an empty string.
How to use:
message.BodyPlainText = Regex.Replace(message.BodyPlainText, "^[ ]*|[ ]*$", "", RegexOptions.Multiline);
Now your message will have Spaces removed.

Remove HTML nodes from HTTP Request

I have some HTML code stored into a string variable, resulting from a HttpWebRequest:
<html>
<head>
<div>Lots of scripts and libraries</div>
</head>
<body>
<div>Some very useful data</div>
</body>
<footer>
<div>Not interesting struff</div>
</footer>
<html>
How can I do to remove all unecesary nodes and get into this:
<body>
<div>Some very useful data</div>
</body>
The easiest way is to use HtmlAgilityPack to grab just the body tag.
var document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);
HtmlNode body = document.DocumentNode.SelectSingleNode("//body");
From there, you can use HtmlAgilityPack to further parse the body node for more detail.

HTML to RichTextBox as Plaintext with Hyperlinks

Reading so much about not using RegExes for stripping HTML, I am wondering about how to get some Links into my RichTextBox without getting all the messy html that is also in the content that i download from some newspaper site.
What i have: HTML from a newspaper website.
What i want: The article as plain text in a RichTextBox. But with links (that is, replacing the bar with <Hyperlink NavigateUri="foo">bar</Hyperlink>).
HtmlAgilityPack gives me HtmlNode.InnerText (stripped of all HTML tags) and HtmlNode.InnerHtml (with all tags). I can get the Url and text of the link(s) with articlenode.SelectNodes(".//a"), but how should i know where to insert that in the plain text of HtmlNode.InnerText?
Any hint would be appreciated.
Here is how you can do it (with a sample console app but the idea is the same for Silverlight):
Let's suppose you have this HTML:
<html>
<head></head>
<body>
Link 1: bar
Link 2: bar2
</body>
</html>
Then this code:
HtmlDocument doc = new HtmlDocument();
doc.Load(myFileHtm);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
// replace the HREF element in the DOM at the exact same place
// by a deep cloned one, with a different name
HtmlNode newNode = node.ParentNode.ReplaceChild(node.CloneNode("Hyperlink", true), node);
// modify some attributes
newNode.SetAttributeValue("NavigateUri", newNode.GetAttributeValue("href", null));
newNode.Attributes.Remove("href");
}
doc.Save(Console.Out);
will output this:
<html>
<head></head>
<body>
Link 1: <hyperlink navigateuri="foo1">bar</hyperlink>
Link 2: <hyperlink navigateuri="foo2">bar2</hyperlink>
</body>
</html>

How can I extract just text from the html

I have a requirement to extract all the text that is present in the <body> of the html. Sample Html input :-
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src="abc.jpg"/>
</body>
</html>
The output should be :-
This is a big title. How are doing you? I am fine
I want to use only HtmlAgility for this purpose. No regular expressions please.
I know how to load HtmlDocument and then using xquery like '//body' we can get body contents. But how do I strip the html as I have shown in output?
Thanks in advance :)
You can use the body's InnerText:
string html = #"
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src=""abc.jpg""/>
</body>
</html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
Next, you may want to collapse spaces and new lines:
text = Regex.Replace(text, #"\s+", " ").Trim();
Note, however, that while it is working in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld - removing the tags. It is difficult to solve that issue, as display is ofter determined by the CSS, not just by the markup.
How about using the XPath expression '//body//text()' to select all text nodes?
You can use NUglify that supports text extraction from HTML:
var result = Uglify.HtmlToText("<div> <p>This is <em> a text </em></p> </div>");
Console.WriteLine(result.Code); // prints: This is a text
As it is using a HTML5 custom parser, it should be quite robust (specially if the document doesn't contain any errors) and is a very fast (no regexp involved but a pure recursive descent parser, faster than HtmlAgilityPack and more GC friendly)
Normally for parsing html I would recommend a HTML parser, however since you want to remove all html tags a simple regex should work.

Html Agility Pack - Get html fragment from an html document

Using the html agility pack; how would I extract an html "fragment" from a full html document? For my purposes, an html "fragment" is defined as all content inside of the <body> tags.
For example:
Sample Input:
<html>
<head>
<title>blah</title>
</head>
<body>
<p>My content</p>
</body>
</html>
Desired Output:
<p>My content</p>
Ideally, I'd like to return the content unaltered if it didn't contain an <html> or <body> element (eg. assume that I was passed a fragment in the first place if it wasn't a full html document)
Can anyone point me in the right direction?
I think you need to do it in pieces.
you can do selectnodes of document for body or html as follows
doc.DocumentNode.SelectSingleNode("//body") // returns body with entire contents :)
then you can check for null values for criteria and if that is provided, you can take the string as it is.
Hope it helps :)
The following will work:
public string GetFragment(HtmlDocument document)
{
return doc.DocumentNode.SelectSingleNode("//body") == null ? doc.DocumentNode.InnerHtml : doc.DocumentNode.SelectSingleNode("//body").InnerHtml;
}

Categories