Match nested HTML tags - c#

In a C# app, I want to match every HTML "font" tag with "color" attribute.
I have the following text:
1<font color="red">2<font color="blue">3</font>4</font>56
And I want a MatchCollection containing the following items:
[0] <font color="red">234</font>
[1] <font color="blue">3</font>
But when I use this code:
Regex.Matches(result, "<font color=\"(.*)\">(.*)</font>");
The MatchCollection I get is the following one:
[0] <font color="red">2<font color="blue">3</font>4</font>
How can I get the MatchCollection I want using C#?
Thanks.

Regex on "HTML" is an antipattern. Just don't do it.
To steer you on the right path, look at what you can do with HTML Agility Pack:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"1<font color=""red"">2<font color=""blue"">3</font>4</font>56");
var fontElements = doc.DocumentNode.Descendants("font");
var newNodes = fontElements.Select(fe => {
    var newNode = fe.Clone();
    newNode.InnerHtml = fe.InnerText;
    return newNode;
});
var collection = newNodes.Select(n => n.OuterHtml);
Now, in collection we have the following strings:
<font color="red">234</font>
<font color="blue">3</font>
mmm... lovely.

MatchCollection m = Regex.Matches(result, "<font color=\"(.*?)\">(.*?)</font>");
// Add a ? after each * to make the quantifiers lazy, then print the result to see what gets matched.
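For reference, here is a minimal sketch of what the lazy pattern returns for the sample text from the question. The class and method names are made up for illustration. It yields a single match that stops at the first </font>, so the nested tag still does not come back as a separate item:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answers.
public static class LazyFontDemo
{
    public const string Sample =
        "1<font color=\"red\">2<font color=\"blue\">3</font>4</font>56";

    // Returns the text of every match of the lazy pattern.
    public static string[] LazyMatches(string input)
    {
        return Regex.Matches(input, "<font color=\"(.*?)\">(.*?)</font>")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Print each match on its own line.
        foreach (var value in LazyMatches(Sample))
        {
            Console.WriteLine(value);
        }
    }
}
```

Because a regular expression cannot pair up nested opening and closing tags in general, a parser-based approach is the safer choice for this input.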

A way with Html Agility Pack and an XPath query to ensure that the color attribute is present:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
String html = "1<font color=\"red\">2<font color=\"blue\">3</font>4</font>56";
htmlDoc.LoadHtml(html);
HtmlNodeCollection fontTags = htmlDoc.DocumentNode.SelectNodes(".//font[@color]");
foreach (HtmlNode fontTag in fontTags)
{
    Console.WriteLine(fontTag.InnerText);
}

Related

HtmlAgilityPack issue

Suppose I have the following HTML code:
<div class="MyDiv">
<h2>Josh</h2>
</div>
<div class="MyDiv">
<h2>Anna</h2>
</div>
<div class="MyDiv">
<h2>Peter</h2>
</div>
And I want to get the names, so this is what I did (C#):
string url = "https://...";
var web = new HtmlWeb();
HtmlNode[] nodes = null;
HtmlDocument doc = null;
doc = web.Load(url);
nodes = doc.DocumentNode.SelectNodes("//div[@class='MyDiv']").ToArray() ?? null;
foreach (HtmlNode n in nodes){
    var name = n.SelectSingleNode("//h2");
    Console.WriteLine(name.InnerHtml);
}
Output:
Josh
Josh
Josh
and it is so strange because n contains only the desired <div>. How can I resolve this issue?
Fixed by writing .//h2 instead of //h2
It's because of your XPath expression "//h2". You should simply change it to "h2". A path that starts with "//" is evaluated from the top of the document, not from the current node, so it selects "Josh" every time because that is the first h2 node.
You could also do it like this:
List<string> names =
    doc.DocumentNode.SelectNodes("//div[@class='MyDiv']/h2")
        .Select(dn => dn.InnerText)
        .ToList();
foreach (string name in names)
{
    Console.WriteLine(name);
}
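The difference between "//h2" and ".//h2" can be reproduced without Html Agility Pack. The sketch below is only an illustration: it uses the standard System.Xml classes instead, wraps the three divs in a made-up root element so the markup is well-formed XML, and uses hypothetical class and method names.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical demo class; uses System.Xml instead of Html Agility Pack.
public static class RelativeXPathDemo
{
    public const string Markup =
        "<root>" +
        "<div class=\"MyDiv\"><h2>Josh</h2></div>" +
        "<div class=\"MyDiv\"><h2>Anna</h2></div>" +
        "<div class=\"MyDiv\"><h2>Peter</h2></div>" +
        "</root>";

    // Evaluates h2Path against each MyDiv element and collects the h2 text.
    public static List<string> Names(string h2Path)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);
        var names = new List<string>();
        foreach (XmlNode div in doc.SelectNodes("//div[@class='MyDiv']"))
        {
            names.Add(div.SelectSingleNode(h2Path).InnerText);
        }
        return names;
    }

    public static void Main()
    {
        // Absolute path: searched from the document root on every iteration.
        Console.WriteLine(string.Join(", ", Names("//h2")));
        // Relative path: searched inside the current div only.
        Console.WriteLine(string.Join(", ", Names(".//h2")));
    }
}
```

The first line printed is "Josh, Josh, Josh" and the second is "Josh, Anna, Peter", which mirrors the behaviour described in the question and its fix.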

Find text between known pattern

I have a webpage source code which has several occurrences of
<div class="detName">some unpredictable text</div>
I want to be able to get a collection of all the "some unpredictable text" values.
I tried something like:
var match = Regex.Match(pageSourceCode, @"<div class='detName'>/(A-Za-z0-9\-]+)\</div>", RegexOptions.IgnoreCase);
But I had no success. What would be a good solution for this issue?
Don't use regex to parse HTML. You can use HTML Agility Pack instead:
string html = "<div class=\"detName\">some unpredictable text</div>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[contains(@class,'detName')]");
foreach (var node in nodes)
{
    Console.WriteLine(node.InnerText);
}
// Regex alternative: the source uses double quotes, and the lazy .*? keeps each match inside one div.
var matches = Regex.Matches(pageSourceCode, @"(?<=<div class=""detName"">)(.*?)(?=</div>)", RegexOptions.IgnoreCase);

remove only some html tags on c#

I have a string:
string hmtl = "<DIV><B> xpto </B></DIV>";
and I need to remove the <DIV> and </DIV> tags, giving the result: <B> xpto </B>
Only <DIV> and </DIV> should be removed; other HTML tags must be kept, so that <B> xpto </B> is preserved.
Use Html Agility Pack:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<html>yourHtml</html>");
foreach (var item in doc.DocumentNode.SelectNodes("//div")) // "//div" is an XPath expression that selects div nodes anywhere in the HTML
{
    var divContent = item.InnerHtml; // your div content
}
If you want only the B tags:
foreach (var item in doc.DocumentNode.SelectNodes("//b")) // Html Agility Pack exposes element names in lower case
{
    var bTag = item.OuterHtml; // your B tag and its content
}
If you are just removing div tags, this will get div tags as well as any attributes they may have.
var html =
    "<DIV><B> xpto <div text='abc'/></B></DIV><b>Other text <div>test</div>";
var pattern = @"(\</?DIV(.*?)/?\>)";
// Replace any match with nothing/empty string
var result = Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);
Result
<B> xpto </B><b>Other text test
Use Regex:
var result = Regex.Replace(html, @"</?DIV>", "");
UPDATED
As you mentioned, with this code the regex removes all tags except B:
var hmtl = "<DIV><B> xpto </B></DIV>";
var remainTag = "B";
var pattern = String.Format("(</?(?!{0})[^<>]*(?<!{0})>)", remainTag);
var result = Regex.Replace(hmtl, pattern, "");
You can use this regular expression:
<[(/body|html)\s]*>
In C#:
var result = Regex.Replace(html, @"<[(/body|html)\s]*>", "");
<html>
<body>
< / html>
< / body>
html = Regex.Replace(html, @"</?DIV>", String.Empty);

How can I loop through a string, replacing sections that match a pattern?

I have the following HTML markup
<p>xxxx</p>
<pre>xxx</pre>
<p>xxxx</p>
<pre>yyy</pre>
I need to be able to change this to:
<p>xxxx</p>
<pre>ABC xxx ABC</pre>
<p>xxxx</p>
<pre>ABC yyy ABC </pre>
I had a suggestion to use:
var loDoc = XDocument.Parse(lcHTML);
foreach (XElement loElement in loDoc.Descendants("pre"))
This does extract all the pre elements but it doesn't give me a way to tie things together and reinsert code into the original string.
Is there another way I could do this that would allow me to make the change I need? I was thinking of splitting on <pre>...</pre>, but that would not really give me what I need, because I have to replace the content inside the <pre>...</pre>.
One possibility is to use XDocument but it has to be valid XHTML and you need to introduce a root node:
using System;
using System.Xml.Linq;

public class Program
{
    static void Main()
    {
        var doc = XDocument.Parse(
            @"<html>
                <p>xxxx</p>
                <pre>xxx</pre>
                <p>xxxx</p>
                <pre>yyy</pre>
            </html>"
        );
        foreach (var pre in doc.Descendants("pre"))
        {
            pre.Value = string.Format("ABC {0} ABC", pre.Value);
        }
        Console.WriteLine(doc);
    }
}
Another possibility is to use Html Agility Pack:
using System;
using HtmlAgilityPack;

public class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            @"<p>xxxx</p>
            <pre>xxx</pre>
            <p>xxxx</p>
            <pre>yyy</pre>"
        );
        foreach (var pre in doc.DocumentNode.Descendants("pre"))
        {
            pre.InnerHtml = string.Format("ABC {0} ABC", pre.InnerHtml);
        }
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
Get an XML document from the HTML as described in the question "How to read HTML as XML?".
Then, using that document, try:
XmlElement root = doc.DocumentElement;
XmlNodeList nodes = root.SelectNodes("pre");
foreach (XmlNode node in nodes) {
    // Value is null for element nodes, so read and write InnerText instead.
    node.InnerText = "ABC " + node.InnerText + " ABC";
}
How about using Strings instead of xml?
String xmlString = ... // get string representation from somewhere
xmlString = xmlString.Replace( "<pre>", "<pre>ABC " );
xmlString = xmlString.Replace( "</pre>", " ABC </pre>" );

HTML Agility Pack Question (Attempting to parse string from source)

I am attempting to use the Html Agility Pack to parse certain bits of info from various pages. I am worried that it might be overkill for what I need; if that is the case, feel free to let me know. Anyway, I am attempting to parse a page from Motley Fool to get the name of a company based on its ticker. I will be parsing several pages to get stock info in a similar way.
The HTML that I want to parse looks like:
<h1 class="subHead">
Microsoft Corp <span>(NASDAQ:MSFT)</span>
</h1>
Also, the page I want to parse is: http://caps.fool.com/Ticker/MSFT.aspx
So, I guess my question is how do I simply get the Microsoft Corp from the html and should I even be using the agility pack to do things like this?
Edit: Current code
public String getStockName(String ticker)
{
    String text = "";
    HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load("http://caps.fool.com/Ticker/" + ticker + ".aspx");
    var node = doc.DocumentNode.SelectSingleNode("/h1[@class='subHead']");
    text = node.FirstChild.InnerText.Trim();
    return text;
}
This would give you a list of all stock names; for your sample HTML, that is just Microsoft:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("test.html");
var nodes = doc.DocumentNode.SelectNodes("//h1[@class='subHead']");
foreach (var node in nodes)
{
    string text = node.FirstChild.InnerText; //output: "Microsoft Corp"
    string textAll = node.InnerText; //output: "Microsoft Corp (NASDAQ:MSFT)"
}
Edit based on updated question - this should work for you:
string text = "";
HtmlWeb web = new HtmlWeb();
string url = string.Format("http://caps.fool.com/Ticker/{0}.aspx", ticker);
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
var node = doc.DocumentNode.SelectSingleNode("//h1[@class='subHead']");
text = node.FirstChild.InnerText.Trim();
return text;
Use an XPath expression to select the element, then pick up the text.
foreach (var element in doc.DocumentNode.SelectNodes("//h1[@class='subHead']/span"))
{
    Console.WriteLine(element.InnerText);
}
