How to get text from html nodes and solve character encoding issue?

I'm trying to get the inner text from this page, http://www.hurriyet.com.tr/yazarlar/22933964.asp, with HtmlAgilityPack.
The HTML structure is:
<div class="detailText">
<span class="yzrArticleDate">30 Mart 2014</span>
<h1 class="yazarArticleTitle">31 Mart sabahı için acil ihtiyaç listesi</h1>
<p></p><p><p >Akıl.<br />Sağduyu.<br />Barış.<br />
Özgürlük.<br />Kardeşlik.<br />Vicdan.<br />Huzur.............
And my current code is:
string htmlContent = getsource(s);
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(htmlContent);
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").InnerText;
The problem is that the result includes the date and the heading, that is, "30 Mart 2014" and "31 Mart sabahı için acil ihtiyaç listesi".
I want only the part that begins with
<p></p><p><p >Akıl.<br />
I tried different variations:
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").InnerHtml;
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").NextSibling.NextSibling.InnerText;
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").LastSibling.InnerText;
My second question: once I manage to extract this text, I will face a character encoding problem. How can I fix this?

The easiest solution would be to remove the nodes you don't want and then get InnerHtml/InnerText, as covered in "remove html node from htmldocument: HTMLAgilityPack".
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']");
noa.RemoveChild(noa.SelectSingleNode("span"));
// remove the rest too...
var result = noa.InnerText;
There should be no encoding problem unless the site reports an invalid encoding, because C# strings are Unicode (UTF-16).
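A minimal sketch of the "remove the rest too" step, assuming the date and heading are the only unwanted children and that they carry the class names shown in the question. The class name DetailTextSketch, the method ExtractBody, and the shortened sample HTML are hypothetical and used only for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class DetailTextSketch
{
    // Hypothetical helper: returns the text of div.detailText without the date and heading.
    public static string ExtractBody(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var detail = document.DocumentNode.SelectSingleNode("//div[@class='detailText']");
        if (detail == null)
        {
            return string.Empty; // the expected container is not present
        }

        // Assumption: these two children are the only parts to drop.
        foreach (var xpath in new[] { "span[@class='yzrArticleDate']", "h1[@class='yazarArticleTitle']" })
        {
            var unwanted = detail.SelectSingleNode(xpath);
            unwanted?.Remove();
        }

        // DeEntitize turns entities such as &amp; back into characters.
        return HtmlEntity.DeEntitize(detail.InnerText).Trim();
    }

    public static void Main()
    {
        // Shortened, well-formed stand-in for the page fragment in the question.
        const string html = @"<div class=""detailText"">
<span class=""yzrArticleDate"">30 Mart 2014</span>
<h1 class=""yazarArticleTitle"">31 Mart sabahı için acil ihtiyaç listesi</h1>
<p>Akıl.<br />Sağduyu.<br />Barış.</p>
</div>";
        Console.WriteLine(ExtractBody(html));
    }
}
```

Note that InnerText does not insert anything for the br elements, so the sentences come out run together unless the markup is processed further.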

Related

How to identify html tags in html string

I have the HTML string below, and I am trying to identify the <br> tags at the start and at the end of the whole text inside it, using the following code:
var htmlString = "<p><span><br> text <b>text <br></b>text <br></span></p>";
var document = new HtmlDocument();
document.LoadHtml(htmlString);
var nodes = document.DocumentNode.SelectNodes("//br");
But this returns all the <br> nodes, whereas I want only the ones at the start and at the end of the whole text in this HTML string:
<p><span><br> text <b> text <br></b>text <br></span></p>
I expect 2 nodes, but I get 3, because the <br> tag in the middle of the text is counted as well.
Could anyone please help me achieve this? Many thanks in advance.

You can use the Split method to solve your problem. My suggestion is below. It prints the text between the first and the last <br> tags, and you can modify the output according to your requirements. A regex pattern might also work.
const string tag = "<br>";
var splitedHtmlString = htmlString.Split(tag);
StringBuilder builder = new StringBuilder();
// Skip the first and last pieces: they lie outside the outermost <br> tags.
for (int i = 1; i < splitedHtmlString.Length - 1; i++)
{
    builder.Append(splitedHtmlString[i]);
    builder.Append(tag);
}
// Drop the trailing tag appended by the last iteration.
builder.Remove(builder.Length - tag.Length, tag.Length);
Console.WriteLine(builder.ToString());
Output: text <b>text <br></b>text

You can load your string into an HtmlDocument and select nodes with the HtmlAgilityPack library:
HtmlDocument document = new HtmlDocument();
document.LoadHtml("your html code");
var htmlTag = document.DocumentNode.SelectNodes("//br");

Convert Html to plain text with .net core [duplicate]

If an HTML message is sent by email, a plain-text alternative has to be attached as well (at least some spam detection software checks for one). How can I convert HTML to plain text?
HtmlDocument document = new HtmlDocument();
document.LoadHtml(htmlBody);
string plainBody = document.DocumentNode.InnerText;
This returns plain text, but all links are lost.
E.g.:
HTML version
<a href="#">Hello World</a>
should result in
Hello World (#)
But it results in
Hello World

As far as I know, InnerText returns the text between the start and end tags of an element; it does not return attribute values.
If you want the attribute value, you have to get it yourself. You could select the href attribute value of every a tag and then replace the corresponding inner text.
For more details, see the code below.
I used the HtmlAgilityPack package, which you can install from NuGet: https://www.nuget.org/packages/HtmlAgilityPack/
var doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><div id='foo'>text<a href='#'>Hello World</a> <a href='#'>test</a></div></body></html>");
var innertext = doc.DocumentNode.InnerText;
var nodes = doc.DocumentNode.SelectNodes("//a");
foreach (var item in nodes)
{
    // Read the href attribute, falling back to an empty string if it is missing.
    var href = item.GetAttributeValue("href", string.Empty);
    innertext = innertext.Replace(item.InnerText, item.InnerText + string.Format("({0})", href));
}
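A variation on the same idea, sketched under the assumption that HtmlAgilityPack is installed: instead of doing a string replace on the extracted text, each a element is swapped for a text node before InnerText is read, so link text that also appears elsewhere in the document is not touched. The names LinkTextSketch and ToPlainText are hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class LinkTextSketch
{
    // Hypothetical helper: plain text in which every link becomes "text (href)".
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
        {
            // Copy to a list because the tree is modified inside the loop.
            foreach (var anchor in anchors.ToList())
            {
                string href = anchor.GetAttributeValue("href", string.Empty);
                var replacement = doc.CreateTextNode(anchor.InnerText + " (" + href + ")");
                anchor.ParentNode.ReplaceChild(replacement, anchor);
            }
        }

        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    public static void Main()
    {
        Console.WriteLine(ToPlainText("<p><a href='#'>Hello World</a></p>"));
    }
}
```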

Using C# and Regex to find and surround all words and numbers within some html text with a span

I need to surround every word in loaded HTML text with a span that uniquely identifies the word. The problem is that some content is not handled by my regex pattern. My current problems include:
1) Special HTML characters such as &rdquo; and &ldquo; are treated as words.
2) Currency values, e.g. $2,500, end up as "2" and "500" (I need "$2,500").
3) Double-hyphenated words, e.g. one-legged-man, end up as "one-legged" and "man".
I'm new to regular expressions and, after looking at various other posts, have derived the following pattern, which seems to work for everything except the cases above.
What I have so far is:
string pattern = @"(?<!<[^>]*?)\b('\w+)|(\w+['-]\w+)|(\w+')|(\w+)\b(?![^<]*?>)";
string newText = Regex.Replace(oldText, pattern, delegate(Match m) {
    wordCnt++;
    return "<span data-wordno='" + wordCnt.ToString() + "'>" + m.Value + "</span>";
});
How can I fix or extend the above pattern to cater for these problems, or should I be using a different approach altogether?

A fundamental problem that you're up against here is that HTML is not a "regular language". This means that HTML is complex enough that you will always be able to come up with valid HTML that a given regular expression does not handle. It isn't a matter of writing a better regular expression; this is a problem that regex can't solve.
What you need is a dedicated HTML parser. You could try the HtmlAgilityPack NuGet package. There are many others, but HtmlAgilityPack is quite popular.
Edit: Below is an example program using HtmlAgilityPack. When an HTML document is parsed, the result is a tree (the DOM). In the DOM, text is stored inside text nodes. So something like <p>Hello World</p> is parsed into a node that represents the p tag, with a child text node that holds "Hello World". What you want to do is find all the text nodes in your document and then, for each node, split the text into words and surround the words with spans.
You can search for all the text nodes using an XPath query. The XPath I use below is /html/body//*[not(self::script)]/text(), which avoids the HTML head and any script tags in the body.
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var doc = new HtmlDocument();
        doc.Load(args[0]); // path to the HTML file
        var wordCount = 0;
        // Text nodes in the body, skipping those directly inside <script>.
        var nodes = doc.DocumentNode
            .SelectNodes("/html/body//*[not(self::script)]/text()");
        foreach (var node in nodes)
        {
            var words = node.InnerHtml.Split(' ');
            var surroundedWords = words.Select(word =>
                String.IsNullOrWhiteSpace(word)
                    ? word
                    : $"<span data-wordno={wordCount++}>{word}</span>");
            // Join with a space so the separators removed by Split are restored.
            node.InnerHtml = String.Join(" ", surroundedWords);
        }
        Console.WriteLine(doc.DocumentNode.InnerHtml);
    }
}

Fix 1) by adding a negative look-behind assertion, (?<!\&). I believe it is needed at the beginning of the 1st, 3rd, and 4th alternatives in the original pattern above.
Fix 2) by adding a new alternative, |(\$?(\d+[,.])+\d+), at the end of the pattern. This is also meant to handle non-dollar and decimal-pointed numbers.
Fix 3) by changing the (\w+['-]\w+) alternative to ((\w+['-])+\w+).
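As an illustration of the three cases (entities, currency values, multi-hyphenated words), here is a sketch that uses a different, simpler pattern than the patched one above. It assumes the input is the text of a single text node, with no tags in it, for example as obtained from an HTML parser. The pattern and the names WordSpanSketch and WrapWords are hypothetical and have not been tested beyond the cases listed in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordSpanSketch
{
    // Alternatives are tried left to right:
    //   1. an HTML entity such as &rdquo; or &#8221;
    //   2. a number with optional $ and , or . separators, e.g. $2,500
    //   3. a word, optionally continued by ' or - parts, e.g. one-legged-man
    private const string Pattern = @"&#?\w+;|\$?\d+(?:[,.]\d+)*\b|\w+(?:['-]\w+)*";

    // Hypothetical helper: wraps each word of tag-free text in a numbered span.
    public static string WrapWords(string text)
    {
        int wordNo = 0;
        return Regex.Replace(text, Pattern, m =>
        {
            if (m.Value[0] == '&')
            {
                return m.Value; // entities are matched only so they can be left untouched
            }
            wordNo++;
            return "<span data-wordno='" + wordNo + "'>" + m.Value + "</span>";
        });
    }

    public static void Main()
    {
        Console.WriteLine(WrapWords("A one-legged-man paid $2,500 &rdquo;"));
    }
}
```

Matching entities explicitly and returning them unchanged keeps their letters from being picked up by the word alternative.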

Prevent HTMLAgilityPack from connecting words when using InnerText

I'm trying to do the simple task of getting the text from an HTML document.
So I'm using HTMLdoc.DocumentNode.InnerText for that.
The problem is that some sites don't put spaces between words when the words are in different tags. In those cases DocumentNode.InnerText joins the words into one, and the result becomes useless.
For example, I'm trying to read a site that contains this line:
<span>İstanbul</span><ul><li>Adana</li>
I'm getting "İstanbulAdana", which is meaningless.
I couldn't find a solution in the HTMLAgilityPack documentation or on Google.
Am I missing something?
Thanks,

That should be rather easy to do.
const string html = @"<span>İstanbul</span><ul><li>Adana</li>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Join the text of every leaf node with a space.
string result = string.Join(" ", doc.DocumentNode.Descendants()
    .Where(n => !n.HasChildNodes && !string.IsNullOrWhiteSpace(n.InnerText))
    .Select(n => n.InnerText));
Console.WriteLine(result); // prints "İstanbul Adana"

Well, the code snippet hangs for this example:
const string html = @"<td><font size=""2"">abc </font><font size=""2"">(</font><font size=""2"">abc</font><font size=""2"">) </font>abc, abc<br><font size=""2"">abc </font>abc, abc, abc, abc<br><font size=""2"">abc </font>abc abc, abc abc<br></td>";
It doesn't hang without the join clause (but it doesn't put the spaces in correctly either).
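One possible alternative is to select the text nodes directly with the XPath text() node test and join their trimmed contents. This is only a sketch: the names TextJoinSketch and JoinText are hypothetical, and whether it avoids the hang reported above has not been verified.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class TextJoinSketch
{
    // Hypothetical helper: joins the content of all non-empty text nodes with single spaces.
    public static string JoinText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when the document has no text nodes.
        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
        {
            return string.Empty;
        }

        var parts = textNodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0);

        return string.Join(" ", parts);
    }

    public static void Main()
    {
        Console.WriteLine(JoinText("<span>İstanbul</span><ul><li>Adana</li>"));
    }
}
```

Because every text node is separated by a space, inline markup that splits a single word, such as a b tag in the middle of it, would also produce a space.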

remove all HTML formatting from a string

I am trying to compare two strings, but I just realized that one of them already contains some HTML formatting.
How can I get these two strings to match when doing string1 == string2? (Note: I don't know upfront what the HTML formatting is going to be.)
string1 = "This is a test";
string2 = "<font color=\"black\" size=\"1\">This is a test</font>";

Load the HTML into Html Agility Pack and extract only the text.
string html = "<html><body><div>test</div></body></html>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
string text = document.DocumentNode.InnerText;
This will not remove the content of <script> nodes, but you can easily remove the script nodes first.
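Removing the script nodes first could look like the sketch below. It assumes only script elements need to be dropped; the names ScriptFreeTextSketch and GetText are hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ScriptFreeTextSketch
{
    // Hypothetical helper: InnerText of the document after dropping all <script> elements.
    public static string GetText(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null when there are no matches.
        var scripts = document.DocumentNode.SelectNodes("//script");
        if (scripts != null)
        {
            // Copy to a list because nodes are removed from the tree inside the loop.
            foreach (var script in scripts.ToList())
            {
                script.Remove();
            }
        }

        return document.DocumentNode.InnerText;
    }

    public static void Main()
    {
        const string html = "<html><body><script>var x = 1;</script><div>test</div></body></html>";
        Console.WriteLine(GetText(html));
    }
}
```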

string newText = System.Text.RegularExpressions.Regex.Replace(OldHtmlTextHere, "<[^>]*>", string.Empty);

Check out System.Web.HttpUtility.HtmlDecode.
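The regex replacement and the decoding step can be combined. The sketch below uses System.Net.WebUtility.HtmlDecode, which does the same job as HttpUtility.HtmlDecode without a reference to System.Web. The names StripTagsSketch and ToPlainText are hypothetical, and the tag-stripping regex is the simple one quoted above, so it is not a robust HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class StripTagsSketch
{
    // Hypothetical helper: strips anything that looks like a tag, then decodes entities.
    public static string ToPlainText(string html)
    {
        string withoutTags = Regex.Replace(html, "<[^>]*>", string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }

    public static void Main()
    {
        string string1 = "This is a test";
        string string2 = "<font color=\"black\" size=\"1\">This is a test</font>";

        Console.WriteLine(ToPlainText(string2) == string1); // prints "True"
    }
}
```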
