Regex with conditional HTML tag - C#

I need to write a regex that captures what's inside a specific HTML tag:
<span class="sentence">CAPTURE HERE</span>
So I wrote, in C#:
<span class=\"sentence\">((.|\\\\s)*?)</span>
The problem, which I'm not sure how to solve, is that there is another span nested inside that span. It also ends with </span>, so the capture stops at the wrong closing tag. How do I write a condition in a regex that checks whether there is a nested span whose class is not "sentence" and, if there is, ends the capture at the next </span> instead?
The input string for the regex:
<span class="sentence">O que a história da escravidão tem a dizer sobre <span class="CharOverride-15">experiências religiosas</span>?</span><span class="sentence"> Só silêncios,</span>
What I want to ideally capture:
O que a história da escravidão tem a dizer sobre <span class="CharOverride-15">experiências religiosas</span>? Só silêncios,

Don't use regex to parse HTML. Use a real HTML parser like HtmlAgilityPack:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlstring);
var span = doc.DocumentNode.SelectSingleNode("//span[@class='sentence']");
var text = span.InnerText;
var html = span.InnerHtml;
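SelectSingleNode returns only the first matching span, while the input in the question contains two. A minimal sketch that collects all of them, assuming the HtmlAgilityPack NuGet package is installed; the class and method names (SentenceExtractor, ExtractSentences) are made up for illustration:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is referenced

public static class SentenceExtractor
{
    // Hypothetical helper: concatenates the inner HTML of every
    // <span class="sentence">, keeping any nested spans intact.
    public static string ExtractSentences(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var spans = doc.DocumentNode.SelectNodes("//span[@class='sentence']");
        if (spans == null) return string.Empty;

        return string.Concat(spans.Select(s => s.InnerHtml));
    }

    public static void Main()
    {
        const string input =
            "<span class=\"sentence\">O que a história da escravidão tem a dizer sobre " +
            "<span class=\"CharOverride-15\">experiências religiosas</span>?</span>" +
            "<span class=\"sentence\"> Só silêncios,</span>";
        Console.WriteLine(ExtractSentences(input));
    }
}
```

Use InnerText instead of InnerHtml if the nested tags should be dropped from the result.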

As an exercise (an HTML parsing library is still preferred), here is a regex that can parse the span with or without a nested tag:
<([^>]+)(?:\s+[^>]*)?>[^<>]*?(?:<([^>]+)(?:\s+[^>]*)?>)?(?<capture>[^<>]+)(?:<\/\2>)?[^<>]*?<\/\1>
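The pattern above allows for one optional nested tag. If a regex really must be used, .NET balancing groups can track nesting depth instead. The following is a sketch only, assuming the opening tag is written exactly as `<span class="sentence">` and that every nested span is properly closed; the names NestedSpanRegex and ExtractSentences are hypothetical:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NestedSpanRegex
{
    // "depth" is pushed on every nested <span ...> and popped on every
    // nested </span>, so the final </span> is only accepted once all
    // nested spans have been closed.
    private static readonly Regex SentencePattern = new Regex(
        @"<span class=""sentence"">" +
        @"(?<content>(?>" +
            @"(?:(?!</?span\b).)+" +        // text and non-span tags
            @"|<span\b[^>]*>(?<depth>)" +   // nested opening tag: push
            @"|</span>(?<-depth>)" +        // nested closing tag: pop
        @")*)" +
        @"(?(depth)(?!))" +                 // fail if a nested span is still open
        @"</span>",
        RegexOptions.Singleline);

    // Hypothetical helper: joins the content of every matched sentence span.
    public static string ExtractSentences(string html)
    {
        return string.Concat(
            SentencePattern.Matches(html)
                           .Cast<Match>()
                           .Select(m => m.Groups["content"].Value));
    }

    public static void Main()
    {
        const string input =
            "<span class=\"sentence\">O que a história da escravidão tem a dizer sobre " +
            "<span class=\"CharOverride-15\">experiências religiosas</span>?</span>" +
            "<span class=\"sentence\"> Só silêncios,</span>";
        Console.WriteLine(ExtractSentences(input));
    }
}
```

This still breaks on attribute order changes, extra whitespace in the tag, or malformed markup, which is why a parser remains the safer choice.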

Related

Remove all style tags from HTML using C# and regex [duplicate]

Here is my html code:
<p><span style="background:lime;Color:Red;">Contrary to popular belief, <b><u>Lorem Ipsum is not simply</u></b> random text. It has roots in a piece of classical Latin literature from <span style="background:blue;">45 BC, making it over 2000 years</span> old. Richard McClintock, </span><b>
From the code above, I need to remove the background property and its value from the style attribute of every span, using C#. The other values in the style attribute should remain. For example:
<span style="background:lime;Color:Red;">Contrary to popular belief,.....</span>
should become
<span style="Color:Red;">Contrary to popular belief,.....</span>
Please help!
Using HtmlAgilityPack (the Where call needs using System.Linq;):
string html = @"<span style=""background:lime;Color:Red;"">Contrary to popular belief,.....</span>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
foreach (var span in doc.DocumentNode.Descendants("span"))
{
    // Attributes["style"] is null for spans without a style attribute, so skip those.
    var styleAttr = span.Attributes["style"];
    if (styleAttr == null) continue;
    // Keep every declaration except the ones that start with "background:".
    styleAttr.Value = String.Join(";", styleAttr.Value.Split(';').Where(s => !s.ToLower().Trim().StartsWith("background:")));
}
var newHtml = doc.DocumentNode.InnerHtml;
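The StartsWith("background:") filter compares raw text, so it misses a declaration written as "background : blue". A possible variant compares only the property name in front of the first colon. This is a sketch using only the base class library; StyleCleaner and RemoveProperty are hypothetical names, and the simple split on ';' is an assumption that fails for values containing semicolons (for example data URLs):

```csharp
using System;
using System.Linq;

public static class StyleCleaner
{
    // Hypothetical helper: removes every declaration whose property name
    // equals `property` (case-insensitive) from an inline style string.
    public static string RemoveProperty(string style, string property)
    {
        var kept = style
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(d => d.Trim())
            .Where(d => d.Length > 0)
            .Where(d =>
            {
                // The property name is everything before the first colon.
                int colon = d.IndexOf(':');
                string name = colon < 0 ? d : d.Substring(0, colon).Trim();
                return !name.Equals(property, StringComparison.OrdinalIgnoreCase);
            });

        return string.Join(";", kept);
    }

    public static void Main()
    {
        Console.WriteLine(RemoveProperty("background:lime;Color:Red;", "background"));
    }
}
```

Unlike the snippet above, this drops the trailing semicolon, and it deliberately leaves background-color alone because only exact name matches are removed.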
Try this (jQuery):
$('span').css("background", "");

How can I get all HTML tags that contains certain text using regular expressions? [duplicate]

This question already has answers here:
Parsing HTML page with HtmlAgilityPack
(2 answers)
Closed 6 years ago.
I'm new to regex and I'm not able to do what I need.
Let's suppose we have this text:
<h1>Título</h1>
<h2>Los gatos felices</h2>
Existen una serie de gatos...
<h2 style="color:red" class="grande">los gatos: curiosidades</h2>
<p style='text-align: justify;' align='justify'>De por si
<strong>los gatos</strong> saben saltar y además
<strong>los perros odian a los gatos</strong>
</p>
I need to get all tags that contain the text "los gatos".
It should produce 4 matches:
- <h2>Los gatos felices</h2>
- <h2 style="color:red" class="grande">los gatos: curiosidades</h2>
- <strong>los gatos</strong>
- <strong>los perros odian a los gatos</strong>
How can I solve it with a regular expression?
Edit:
I finally found what I need! I share it for anyone who might need it:
<(.*)([^<]*)>([^<]*)los gatos([^<]*)<\/\1>
Instead of regex, use a real HTML parser like HtmlAgilityPack:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(yourhtmlstring);
var h2s = doc.DocumentNode.SelectNodes("//h2").Select(x => x.InnerText).ToList();
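The snippet above lists the h2 headings but does not filter on the text. One way to get the four elements the question asks for is to keep only elements that have a direct text child containing the phrase, compared case-insensitively. A sketch assuming the HtmlAgilityPack NuGet package; TagFinder and FindElementsWithText are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is referenced

public static class TagFinder
{
    // Hypothetical helper: returns the outer HTML of every element that has
    // a direct text-node child containing `text` (case-insensitive).
    public static List<string> FindElementsWithText(string html, string text)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .Where(n => n.ChildNodes.Any(c =>
                c.NodeType == HtmlNodeType.Text &&
                c.InnerText.IndexOf(text, StringComparison.OrdinalIgnoreCase) >= 0))
            .Select(n => n.OuterHtml)
            .ToList();
    }

    public static void Main()
    {
        const string html =
            "<h2>Los gatos felices</h2>" +
            "<p>De por si <strong>los gatos</strong> saben saltar</p>";
        foreach (var match in FindElementsWithText(html, "los gatos"))
            Console.WriteLine(match);
    }
}
```

Checking only direct text children keeps the enclosing p element out of the result, since its own text does not contain the phrase.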

Get value of specific HTML tag (span tag) in C#

I am developing a Google Translate app for Windows Phone 8 in C#. I want to get the text of all the span tags nested inside the span with id="result_box".
<html>
.
.
<span id="result_box" class="short_text" lang="pt">
<span class="hps">
Olá
</span>
<span class="">
.
</span>
<span class="hps">
oi
</span>
</span>
.
.
</html>
I tried this, but it is not working:
html = e.Result;
var r = new Regex(@"(?i)<span[^>]*?>\s*", RegexOptions.IgnoreCase);
string capture = r.Match(html).Groups[1].Value;
MessageBox.Show(capture);
Please suggest a regex. If possible, please give me a full function that returns the text.
What about this?
Regex r = new Regex(@"<span[^>].*?>([^<]*)<\/span>", RegexOptions.IgnoreCase);
foreach (Match matchedSpan in r.Matches(html))
{
string capture = matchedSpan.Groups[1].Value;
MessageBox.Show(capture);
}
OK, since @mason didn't like the previous answer, here goes another approach:
XmlDocument htmlXML = new XmlDocument();
htmlXML.LoadXml(html);
foreach (XmlNode spanElement in htmlXML.SelectNodes("//span[@class='short_text']/span"))
{
MessageBox.Show(spanElement.InnerText);
}
Remember to add:
using System.Xml;
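Note that XmlDocument.LoadXml throws an XmlException when the markup is not well-formed XML, which is common for real web pages. A more forgiving sketch with HtmlAgilityPack follows, assuming the NuGet package is installed; it selects by the id from the question's markup, and the names ResultBoxReader and GetResultBoxTexts are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is referenced

public static class ResultBoxReader
{
    // Hypothetical helper: returns the trimmed text of each span directly
    // inside <span id="result_box">, skipping empty ones.
    public static List<string> GetResultBoxTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var spans = doc.DocumentNode.SelectNodes("//span[@id='result_box']/span");
        if (spans == null) return new List<string>();

        return spans
            .Select(s => HtmlEntity.DeEntitize(s.InnerText).Trim())
            .Where(t => t.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        const string html =
            "<span id=\"result_box\" class=\"short_text\" lang=\"pt\">" +
            "<span class=\"hps\"> Olá </span><span class=\"\"> . </span>" +
            "<span class=\"hps\"> oi </span></span>";
        Console.WriteLine(string.Join(" ", GetResultBoxTexts(html)));
    }
}
```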

How to get text from html nodes and solve character encoding issue?

I'm trying to get the inner text from this page, http://www.hurriyet.com.tr/yazarlar/22933964.asp, with HtmlAgilityPack.
The HTML structure is:
<div class="detailText">
<span class="yzrArticleDate">30 Mart 2014</span>
<h1 class="yazarArticleTitle">31 Mart sabahı için acil ihtiyaç listesi</h1>
<p></p><p><p >Akıl.<br />Sağduyu.<br />Barış.<br />
Özgürlük.<br />Kardeşlik.<br />Vicdan.<br />Huzur.............
And my current code:
string htmlContent = getsource(s);
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(htmlContent);
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").InnerText;
The problem is that the result includes the date and the heading, i.e. "30 Mart 2014" and "31 Mart sabahı için acil ihtiyaç listesi".
I only want the part that begins with
<p></p><p><p >Akıl.<br />
I tried different variations:
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").InnerHtml;
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").NextSibling.NextSibling.InnerText;
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").LastSibling.InnerText;
My second question: once I manage to extract this text, I expect to run into a character encoding problem. How can I fix that?
The easiest solution would be to remove the nodes you don't want and then get InnerHtml/InnerText, as covered in "remove html node from htmldocument :HTMLAgilityPack".
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']");
noa.RemoveChild(noa.SelectSingleNode("span"));
// remove the rest too...
var result = noa.InnerText;
There should be no encoding problem unless the site reports an invalid encoding, since C# strings are Unicode (UTF-16).
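The "remove the rest too" step could look like the sketch below, which drops the date span and the h1 title before reading the text. It assumes the HtmlAgilityPack NuGet package and the structure shown in the question; ArticleReader and GetArticleBody are hypothetical names:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is referenced

public static class ArticleReader
{
    // Hypothetical helper: returns the text of div.detailText without the
    // date span and the h1 title that sit directly inside it.
    public static string GetArticleBody(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var container = doc.DocumentNode.SelectSingleNode("//div[@class='detailText']");
        if (container == null) return string.Empty;

        // Copy to a list first: removing nodes while enumerating ChildNodes
        // would modify the collection being iterated.
        var unwanted = container.ChildNodes
            .Where(n => n.Name == "span" || n.Name == "h1")
            .ToList();
        foreach (var node in unwanted)
            node.Remove();

        // DeEntitize turns entities such as &amp; back into characters.
        return HtmlEntity.DeEntitize(container.InnerText).Trim();
    }

    public static void Main()
    {
        const string html =
            "<div class=\"detailText\">" +
            "<span class=\"yzrArticleDate\">30 Mart 2014</span>" +
            "<h1 class=\"yazarArticleTitle\">31 Mart sabahı için acil ihtiyaç listesi</h1>" +
            "<p>Akıl.<br />Sağduyu.</p></div>";
        Console.WriteLine(GetArticleBody(html));
    }
}
```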

Grab some text from a markup string using jQuery

I have the following markup:
<span>
Some text blalablalalbal<br/><br/>
"You are Awesome" <br/><br/>
----What can I do for you?<br/><br/>
</span>
Now I want to hide the first line and modify the last line.
How can I grab that text using jQuery?
Notes:
1: There are multiple instances of similar code blocks, each with different text, so I can't hardcode anything. I am wondering if I can split it on the tags somehow?
2: If it can be done in server-side C# code, that is fine as well.
You can work with the text nodes directly, doing something like this:
var $nodes = $('span').contents().map(function(a,b){
return (b.nodeType===3?b:null);
});
// hide first line
$($nodes.get(0)).wrap('<span style="display:none;" />');
$nodes.get(2).data = "foo";
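CSS has a ::first-line pseudo-element but no :last-line, and ::first-line only accepts a limited set of properties (fonts, colors, backgrounds, text decoration and similar); display is not one of them. So a rule like this has no effect:
span::first-line {
display: none;
}
Likewise, jQuery has no :last-line selector, so the following throws a syntax error instead of modifying the last line:
var content = $('span:last-line').html();
content += " - Modified";
$('span:last-line').html(content);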
You can split the line breaks and modify the array:
var elm = $('span'),
lines = elm.html().split('<br>');
lines.shift(); // removes the first
lines[lines.length-3] += '-modified'; // modify the last line (third from the end because of the trailing <br><br>)
elm.html(lines.join('<br>'));
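Since the question allows a server-side C# solution, the same split-on-br idea might be sketched as follows. It uses only the base class library; LineEditor and DropFirstAndTagLast are hypothetical names, and it assumes the span's inner HTML has already been extracted and that lines are separated by br tags:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineEditor
{
    // Hypothetical helper: splits on <br>, <br/> or <br />, removes the
    // first non-empty line and appends `suffix` to the last one.
    public static string DropFirstAndTagLast(string innerHtml, string suffix)
    {
        var lines = Regex.Split(innerHtml, @"<br\s*/?>", RegexOptions.IgnoreCase)
                         .Select(l => l.Trim())
                         .Where(l => l.Length > 0)
                         .ToList();

        if (lines.Count == 0) return string.Empty;

        lines.RemoveAt(0);                                   // drop the first line
        if (lines.Count > 0)
            lines[lines.Count - 1] += suffix;                // modify the last line

        return string.Join("<br/><br/>", lines);
    }

    public static void Main()
    {
        const string inner =
            "Some text blalablalalbal<br/><br/>\n" +
            "\"You are Awesome\" <br/><br/>\n" +
            "----What can I do for you?<br/><br/>\n";
        Console.WriteLine(DropFirstAndTagLast(inner, " - Modified"));
    }
}
```

Filtering out the empty pieces first avoids having to count how many trailing br tags the markup happens to contain.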
Or you can put the three lines in three different spans with ids.
Something like:
<span>
<span id="sentence1">Some text blalablalalbal</span><br/><br/>
<span id="sentence2">"You are Awesome" </span><br/><br/>
<span id="sentence3">----What can I do for you?</span><br/><br/>
</span>
Then hide the first sentence with jQuery:
$("span#sentence1").hide();
