Convert HTML to plain text with .NET Core [duplicate] - c#

This question already has answers here:
How can I Convert HTML to Text in C#?
(20 answers)
Closed 2 years ago.
If an HTML email is sent, a plain-text alternative should be attached as well (at least some spam-detection software checks for one). How can I convert HTML to plain text?
HtmlDocument document = new HtmlDocument();
document.Load(htmlBody);
string plainBody = document.DocumentNode.InnerText;
This returns plain text, but all links are lost.
E.g.:
HTML Version
<a href="#">Hello World</a>
should result in
Hello World (#)
But it results in
Hello World

As far as I know, InnerText returns the text between an element's start and end tags; it does not include attribute values.
If you want the attribute values, you have to extract them yourself. You could select the href attribute of every a tag and append it to the link text within the inner text.
For more details, refer to the code below.
I used the HtmlAgilityPack package, which you can install from NuGet: https://www.nuget.org/packages/HtmlAgilityPack/
// Requires: using System.Linq; using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><div id='foo'>text<a href='#'>Hello World</a> <a href='#'>test</a></div></body></html>");
var innerText = doc.DocumentNode.InnerText;
// SelectNodes returns null when nothing matches, so fall back to an empty sequence.
var nodes = doc.DocumentNode.SelectNodes("//a[@href]") ?? Enumerable.Empty<HtmlNode>();
foreach (var item in nodes)
{
    var href = item.GetAttributeValue("href", "");
    // Append the link target after the link text, e.g. "Hello World (#)".
    innerText = innerText.Replace(item.InnerText, $"{item.InnerText} ({href})");
}
Result: textHello World (#) test (#)
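A variation on the approach above, offered as a sketch rather than a definitive implementation: instead of running string.Replace over the combined inner text (which can also rewrite unrelated text that happens to equal a link's text), each a element is swapped in the tree for a text node of the form "text (href)" before InnerText is read. The names HtmlToText and Convert are made up for illustration, and the code assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlToText // hypothetical helper name
{
    public static string Convert(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no node matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            // Copy to a list because the tree is modified inside the loop.
            foreach (var link in links.ToList())
            {
                var href = link.GetAttributeValue("href", "");
                var replacement = doc.CreateTextNode($"{link.InnerText} ({href})");
                link.ParentNode.ReplaceChild(replacement, link);
            }
        }

        // Decode entities such as &amp; in the remaining text.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

Calling HtmlToText.Convert(htmlBody) would then give the plain-text body with the link targets kept in parentheses. Block-level elements are still joined without line breaks, so real email bodies will likely need extra handling for that.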


How to identify html tags in html string

I have the HTML string below and am trying to identify the <br> tags at the start and at the end of the whole text inside it, using the following code:
var htmlString = "<p><span><br> text <b>text <br></b>text <br></span></p>";
var document = new HtmlDocument();
document.LoadHtml(htmlString);
var nodes = document.DocumentNode.SelectNodes("//br");
But this returns all <br> nodes, whereas I only want the ones at the start and at the end of the whole text in the HTML string below:
<p><span><br> text <b> text <br></b>text <br></span></p>
I expect 2 nodes but get 3, because the <br> tag in the middle of the text is counted as well.
Could anyone please help me achieve this? Many thanks in advance.
You can use the Split method to solve your problem. My suggestion is below: it prints the text between the first and the last <br> tag. You can adjust the output to your requirements. A regex pattern might also work.
// Requires: using System; using System.Text;
const string tag = "<br>";
var parts = htmlString.Split(tag);
var builder = new StringBuilder();
// Skip the first and last parts, which lie outside the outermost <br> tags.
for (int i = 1; i < parts.Length - 1; i++)
{
    builder.Append(parts[i]);
    builder.Append(tag);
}
// Drop the trailing <br> appended by the last iteration (if anything was appended).
if (builder.Length >= tag.Length)
{
    builder.Remove(builder.Length - tag.Length, tag.Length);
}
Console.WriteLine(builder.ToString());
Output: text <b>text <br></b>text
You can also load your string into an HtmlDocument and filter by nodes, using the HtmlAgilityPack library:
HtmlDocument document = new HtmlDocument();
document.LoadHtml("your html code");
var htmlTag = document.DocumentNode.SelectNodes("//br");
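Building on the SelectNodes call above, here is a minimal sketch of picking out only the outermost <br> tags. It assumes that "start and end" means the first and the last <br> in document order; BrFinder and OuterBreaks are hypothetical names, and HtmlAgilityPack must be referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class BrFinder // hypothetical helper name
{
    // Returns the first and last <br> in document order,
    // or (null, null) when the HTML contains no <br> at all.
    public static (HtmlNode First, HtmlNode Last) OuterBreaks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no node matches.
        var brs = doc.DocumentNode.SelectNodes("//br");
        if (brs == null) return (null, null);

        return (brs[0], brs[brs.Count - 1]);
    }
}
```

If the HTML contains a single <br>, First and Last refer to the same node, so callers that need exactly two distinct nodes should check for that case.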

How to find HTML attributes that starts with specific word using Regex in C#? [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 7 years ago.
I have HTML to which jQuery adds random attributes, like:
<td style='font-size: x-large;' jquery9202340423042='22423423424'>
Using C# Regex, I want to find and remove any attribute whose name starts with jquery.
I have the code below, but it removes all attributes:
public static void Main(string[] args)
{
    string before = "<td style='font-size: x-large;' jquery9202340423042='22423423424'>";
    //string after = Regex.Replace(before, regexImgSrc, "<$1>");
    //string regexImgSrc = @"<(table|tr|td)[^>]*?" + "jquery9202340423042" + @"\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";
    string after = Regex.Replace(before, @"(?i)<(table|tr|td)(?:\s+(?:""[^""]*""|'[^']*'|[^""'>])*)?>", "<$1>");
    Console.WriteLine(after);
}
You need to use this:
Regex.Replace(before, @"(jquery\d*=[""']\d*[""'])", "");
This replaces anything that follows the pattern jqueryXXX='XXX', where XXX is any sequence of digits.
Why are you trying to do this with Regex?
Regex is absolutely the wrong tool for the job (even though at a cursory glance, this might not be obvious to you).
Using Regex might work for specific cases, but will always be a brittle solution.
Use an HTML parser like HtmlAgilityPack and you can approach this far more sensibly. Now you can do something like this:
string before ="<td style='font-size: x-large;' jquery9202340423042='22423423424'>";
var doc = new HtmlDocument();
doc.LoadHtml(before);
var el = doc.DocumentNode.FirstChild;
var attrsToRemove = el.Attributes.Where(att => att.Name.StartsWith("jquery")).ToList();
attrsToRemove.ForEach(a => a.Remove());
Console.WriteLine(el.OuterHtml);

OpenXML Find Variables within Word doc and replace them

I need to search a document for strings enclosed in <>. So if the application finds the variable within the document, it replaces that variable with DateTime.Today.ToShortDateString(). For instance:
string filename = "C:\\Temp\\" + appNum + "_ReceiptOfApplicationLtr.docx";
if (File.Exists(filename))
{
    using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(filename, true))
    {
        var body = wordDoc.MainDocumentPart.Document.Body;
        foreach (var text in body.Descendants<Text>())
        {
            if (text.Text == "<TodaysDate>")
            {
                text.Text = text.Text.Replace("<TodaysDate>", DateTime.Today.ToShortDateString());
            }
        }
        using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
        {
            sw.Write(filename);
        }
    }
}
When it iterates over the Text descendants, it finds the first <, then TodaysDate, and finally >. The issue is that it never finds the whole string <TodaysDate>. Can anyone help me out?
Open XML can store text in different text tags inside the same run. What I would do if I were you is just find the Run where your string is stored and use the InnerText property to find all the text inside that run.
For example:
Run runToFind = body.Descendants<Run>()
    .FirstOrDefault(r => r.InnerText.Contains("<TodaysDate>"));
Then you can replace the Run with another one (check runToFind for null first, since FirstOrDefault returns null when nothing matches):
runToFind.Parent.ReplaceChild(new Run(new Text(DateTime.Now.ToShortDateString())), runToFind);
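The question describes the placeholder being split into <, TodaysDate and >, and those pieces may well sit in separate runs, in which case no single Run contains the whole string. A rough sketch of a paragraph-level fallback follows: it concatenates all Text elements of a paragraph, performs the replacement on the combined string, and writes the result back into the first Text element. PlaceholderReplacer and ReplaceInParagraph are hypothetical names, and the Open XML SDK (DocumentFormat.OpenXml) is assumed to be referenced.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class PlaceholderReplacer // hypothetical helper name
{
    // Replaces a placeholder even when it is spread over several Text elements.
    // Returns true if the paragraph contained the placeholder.
    public static bool ReplaceInParagraph(Paragraph paragraph, string placeholder, string value)
    {
        var texts = paragraph.Descendants<Text>().ToList();
        var combined = string.Concat(texts.Select(t => t.Text));
        if (texts.Count == 0 || !combined.Contains(placeholder)) return false;

        // Put the whole replaced string into the first Text element and empty the rest.
        texts[0].Text = combined.Replace(placeholder, value);
        texts[0].Space = SpaceProcessingModeValues.Preserve; // keep leading/trailing spaces
        foreach (var t in texts.Skip(1)) t.Text = string.Empty;
        return true;
    }
}
```

This could be called for each paragraph in body.Descendants<Paragraph>(), followed by wordDoc.MainDocumentPart.Document.Save(). The trade-off is that all text in an affected paragraph ends up with the formatting of its first run, so it suits plain template paragraphs better than richly formatted ones.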
For anyone still struggling with this - you can check out this library
https://github.com/antonmihaylov/OpenXmlTemplates
With it, instead of searching for special tags in the text (because of the problems specified in the comment of Thomas Barnekow), you add a content control to the document and specify the name of the variable you want to replace in the content control's tag name.
You can then feed it JSON data or a regular C# dictionary object, and the text will be replaced.
Note - I am the maker of that library, but I have no financial gain from it - it is open source and under active development (and always looking for contributors!)

How to get text from html nodes and solve character encoding issue?

I'm trying to get the inner text from this page, http://www.hurriyet.com.tr/yazarlar/22933964.asp, with HtmlAgilityPack.
The HTML structure is
<div class="detailText">
<span class="yzrArticleDate">30 Mart 2014</span>
<h1 class="yazarArticleTitle">31 Mart sabahı için acil ihtiyaç listesi</h1>
<p></p><p><p >Akıl.<br />Sağduyu.<br />Barış.<br />
Özgürlük.<br />Kardeşlik.<br />Vicdan.<br />Huzur.............
and my current code is
string htmlContent = getsource(s);
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(htmlContent);
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").InnerText;
The problem is that the result includes the heading and the date, i.e. "30 Mart 2014" and "31 Mart sabahı için acil ihtiyaç listesi".
I only want the part that begins with
<p></p><p><p >Akıl.<br />
I tried different variations:
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").InnerHtml;
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").NextSibling.NextSibling.InnerText;
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").LastSibling.InnerText;
My second question: once I manage to get this text, I will face a character-encoding problem. How can I fix that?
The easiest solution would be to remove the nodes you don't want and then get InnerHtml/InnerText, as covered in "remove html node from htmldocument :HTMLAgilityPack".
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']");
noa.RemoveChild(noa.SelectSingleNode("span"));
// remove the rest too...
var result = noa.InnerText;
There should be no encoding problem unless the site reports an invalid encoding, since C# strings are Unicode (UTF-16).

Prevent HTMLAgilityPack from connecting words when using InnerText

I'm trying to do a simple task: getting the text from an HTML document.
So I'm using HTMLdoc.DocumentNode.InnerText for that.
The problem is that some sites don't put spaces between words when they are in different tags. In those cases DocumentNode.InnerText joins those words into one, and the result becomes useless.
For example, I'm trying to read a site that contains this line:
<span>İstanbul</span><ul><li>Adana</li>
I'm getting "İstanbulAdana", which is meaningless.
I couldn't find any solution in the HTMLAgilityPack documentation or on Google.
Am I missing something?
Thanks,
That should be rather easy to do.
const string html = @"<span>İstanbul</span><ul><li>Adana</li>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
string result = string.Join(" ", doc.DocumentNode.Descendants()
    .Where(n => !n.HasChildNodes && !string.IsNullOrWhiteSpace(n.InnerText))
    .Select(n => n.InnerText));
Console.WriteLine(result); // prints "İstanbul Adana"
Well, the code snippet hangs for this example:
const string html = @"<td><font size=""2"">abc </font><font size=""2"">(</font><font size=""2"">abc</font><font size=""2"">) </font>abc, abc<br><font size=""2"">abc </font>abc, abc, abc, abc<br><font size=""2"">abc </font>abc abc, abc abc<br></td>";
It doesn't hang without the Join call (but then it doesn't put the spaces in correctly either).
