C# regex to extract text inside HTML tags

I have spent a few hours trying to use a regex to extract the text inside some HTML tags:
<div class="ewok-rater-header-section">
<ul class="header">
<li><h1>meow</h1></li>
<li><h1>meow2</h1></li>
<li><h1>Time = <span class="work-weight">9.0 minutes</span></h1></li>
</ul>
</div>
I can get meow with
var regexpost = new System.Text.RegularExpressions.Regex(@"<h1(.*?)>(.*?)</h1>");
var mpost = regexpost.Match(reqpost);
string lechat = mpost.Groups[2].Value;
but not the others.
I would like to put meow in one textbox, meow2 in a second textbox, and 9.0 (minutes) in a third.

In these situations an HTML parser helps a lot, and it is also more precise and robust than a regex.
Html Agility Pack
Example
var html = @"<div class=""ewok-rater-header-section"">
<li><h1>meow</h1></li>
<li><h1>meow2</h1></li>
<li><h1>Time = <span class=""work-weight"">9.0 minutes</span></h1></li>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
// you can search for the heading
foreach (var node in doc.DocumentNode.SelectNodes("//li//h1"))
{
Console.WriteLine("Found heading : " + node.InnerText);
}
// or you can be more specific
var someSpan = doc.DocumentNode
.SelectNodes("//span[@class='work-weight']")
.FirstOrDefault();
Console.WriteLine("Found span : " + someSpan.InnerText);
Output
Found heading : meow
Found heading : meow2
Found heading : Time = 9.0 minutes
Found span : 9.0 minutes
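If you prefer to stay with a regex for a small, predictable fragment like this one, the reason the original code only finds meow is that Regex.Match returns just the first match, while Regex.Matches returns all of them. A minimal sketch, assuming the headings never contain a nested <h1>; HeadingExtractor and ExtractHeadings are illustrative names, not part of any library:

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class HeadingExtractor
{
    // Returns the text of every <h1> element, with any nested tags stripped.
    public static string[] ExtractHeadings(string html)
    {
        return Regex.Matches(html, @"<h1[^>]*>(.*?)</h1>", RegexOptions.Singleline)
            .Cast<Match>()
            // Group 1 is the heading body; remove inner tags such as <span ...>.
            .Select(m => Regex.Replace(m.Groups[1].Value, "<[^>]+>", "").Trim())
            .ToArray();
    }
}
```

Called on the HTML from the question, ExtractHeadings should return meow, meow2 and Time = 9.0 minutes, which can then be assigned to the three textboxes.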

This is for parsing an HTTP response. Isn't it slow to use an HTML parser to build a whole document?

Related

Wrong string found while parsing HTML

Here is my regular expression for getting the version number from the Play Store HTML content:
var content = responseMsg.Content == null
? null
: await responseMsg.Content.ReadAsStringAsync();
var versionMatch = Regex.Match(
content,
"<div[^>]*>Current Version</div><span[^>]*><div><span[^>]*>(.*?)<").Groups[1];
if (versionMatch.Success)
{
version = versionMatch.Value.Trim();
}
With this I get VersionMatch = "{}".
How do I get the proper version, e.g. VersionMatch = "1.9"?
The HTML content is very large, so here is the relevant excerpt:
<div class="hAyfc">
<div class="BgcNfc">Current Version</div>
<span class="htlgb">
<div class="IQ1z0d">
<span class="htlgb">1.9</span>
</div>
To skip over the intermediate text between Current Version</div> and the <span> that contains the version number, you can use a non-greedy .*?. The dot also matches \r\n if RegexOptions.Singleline is given. To land on the correct span, specify its content as "digits and dots" ([\d\.]+) instead of "anything" (.*?):
var content = @"<div class=""hAyfc"">
<div class=""BgcNfc"">Current Version</div>
<span class=""htlgb"">
<div class=""IQ1z0d"">
<span class=""htlgb"">1.9</span>
</div>";
var versionMatch = Regex.Match(
content,
@"<div[^>]*>Current Version</div>.*?<span[^>]*>([\d\.]+)<", RegexOptions.Singleline).Groups[1];
versionMatch.Value is then "1.9"
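The same pattern can be wrapped in a small helper so that callers get null, rather than an empty group, when the page markup changes and nothing matches. This is only a sketch; VersionParser and ExtractVersion are hypothetical names introduced for illustration:

```csharp
using System.Text.RegularExpressions;

public static class VersionParser
{
    // Hypothetical helper: returns the version text, or null when the pattern does not match.
    public static string ExtractVersion(string content)
    {
        if (string.IsNullOrEmpty(content)) return null;

        // Singleline lets '.' cross line breaks; [\d\.]+ forces the captured span
        // to start with digits/dots, so the outer wrapper span is skipped.
        var match = Regex.Match(
            content,
            @"<div[^>]*>Current Version</div>.*?<span[^>]*>([\d\.]+)<",
            RegexOptions.Singleline);

        return match.Success ? match.Groups[1].Value : null;
    }
}
```

With the excerpt from the question as input, ExtractVersion should return 1.9.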
You could try using HtmlAgilityPack with Fizzler.Systems.HtmlAgilityPack so you can basically do something like this:
var web = new HtmlWeb();
var html = web.Load(uri);
var documentNode = html.DocumentNode;
var version = documentNode.QuerySelector(".htlgb").InnerHtml;
And you don't have to worry about the regex

HTML Agility Pack Parsing div

I'm trying to parse HTML, I need to get "text" from this part:
<div class="_gdf kno-fb-ctx">
<span data-ved="0ahUKEwjIr9brjO7UAhUnYZoKHda-ALgQ2koIogEoAjAT"> text</span>
</div>
Here's my C# code:
var message = doc.DocumentNode.SelectSingleNode("//div[@class='_gdf kno-fb-ctx']").InnerText;
Console.WriteLine(message);
What am I doing wrong?
You are selecting the div rather than the span, and then reading InnerText from the div. Select the span node itself and read its InnerText instead:
HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<div class='_gdf kno-fb-ctx'><span data-ved = '0ahUKEwjIr9brjO7UAhUnYZoKHda-ALgQ2koIogEoAjAT'> text </span ></div >");
var text = doc.DocumentNode.SelectSingleNode("//div[@class=\"_gdf kno-fb-ctx\"]//span").InnerText;

Get URLs inside a HTML page with HTML Agility Pack

I have this code:
foreach (HtmlNode node in hd.DocumentNode.SelectNodes("//div[@class='compTitle options-toggle']//a"))
{
string s=("node:" + node.GetAttributeValue("href", string.Empty));
}
I want to get urls in tags like this:
<div class="compTitle options-toggle">
<a class=" ac-algo fz-l ac-21th lh-24" href="http://www.bestbuy.com">
<b>Huawei</b> Products - Best Buy
</a>
</div>
I want to get "http://www.bestbuy.com" and "Huawei Products - Best Buy".
What should I do? Is my code correct?
Here is a working example:
var document = new HtmlDocument();
document.LoadHtml("<div class=\"compTitle options-toggle\"><a class=\" ac-algo fz-l ac-21th lh-24\" href=\"http://www.bestbuy.com\"><b>Huawei</b> Products - Best Buy</a></div>");
var tags = document.DocumentNode.SelectNodes("//div[@class='compTitle options-toggle']//a").ToList();
foreach (var tag in tags)
{
var link = tag.Attributes["href"].Value; // http://www.bestbuy.com
var text = tag.InnerText; // Huawei Products - Best Buy
}
Adding the closing double quote should fix the selection (it worked for me).
Get the plain text as
string contentText = node.InnerText;
or having the Huawei word in bold, like this:
string contentHtml = node.InnerHtml;

AngleSharp Parsing

Can't find many examples of using AngleSharp for parsing when you don't have a class name or id to use.
HTML
<span><span class="icon icon_none"></span></span>
<span><a href="bing.com" title="Bing"><span class="icon icon_none"></span></a></span>
<span><span class="icon icon_none"></span></span>
I want to find the href from any <a> tags that have a title = Bing
In Python BeautifulSoup I would use
item_needed = a_row.find('a', {'title': 'Bing'})
and then grab the href attribute
or jQuery
a[title='Bing']
But, I'm stuck using AngleSharp
eg. following example
https://github.com/AngleSharp/AngleSharp/wiki/Examples#getting-certain-elements
c# AngleSharp
var parser = new AngleSharp.Parser.Html.HtmlParser();
var document = parser.Parse(@"<span><span class=""icon icon_none""></span></span>< span >< a href = ""bing.com"" title = ""Bing"" >< span class=""icon icon_none""></span></a></span><span><span class=""icon icon_none""></span></span>");
//Do something with LINQ
var blueListItemsLinq = document.All.Where(m => m.LocalName == "a" && //stuck);
It looks like a problem in your HTML markup caused AngleSharp to miss the target element, namely the spaces around the angle brackets:
< span >< a href = ""bing.com"" title = ""Bing"" >< span class=""icon icon_none"">
With the HTML fixed, both the LINQ query and the CSS selector find the target link:
// AngleSharp 0.10+ API; older versions use AngleSharp.Parser.Html.HtmlParser and parser.Parse(...)
var parser = new AngleSharp.Html.Parser.HtmlParser();
var document = parser.ParseDocument(@"<span><span class=""icon icon_none""></span></span><span><a href=""bing.com"" title=""Bing""><span class=""icon icon_none""></span></a></span><span><span class=""icon icon_none""></span></span>");
//LINQ example
var blueListItemsLinq = document.All
.Where(m => m.LocalName == "a" &&
m.GetAttribute("title") == "Bing"
);
//LINQ equivalent CSS selector example
var blueListItemsCSS = document.QuerySelectorAll("a[title='Bing']");
//print href attributes value to console
foreach (var item in blueListItemsCSS)
{
Console.WriteLine(item.GetAttribute("href"));
}

Remove whole div with specific class name

Is it possible to remove the whole div with a specific class name? For example;
<body>
<div class="head">...</div>
<div class="container">...</div>
<div class="foot">...</div>
</body>
I would like to remove the div with the "container" class.
A C# code example would be very useful, thank you.
The proper way (I suppose) to do this is via built in Gecko DOM classes and methods.
So, in your case something like:
var containers = yourDocument.GetElementsByClassName("container");
// This returns an IEnumerable of elements with this class. If there will only ever be one, you can do it like this:
var yourContainer = containers.FirstOrDefault();
yourContainer.Parent.RemoveChild(yourContainer);
Obviously, you can also do loops etc.
If you want to parse HTML in C#, the best way is to use Html Agility Pack:
https://htmlagilitypack.codeplex.com/
HtmlDocument document = new HtmlDocument();
document.Load(@"C:\yourfile.html");
// SelectNodes returns null when nothing matches, so guard before iterating
var nodesToRemove = document.DocumentNode.SelectNodes("//div[@class='container']");
if (nodesToRemove != null)
foreach (var node in nodesToRemove.ToList())
node.Remove();
Well, with the help of regex, you can remove your desired div
var data = "<body>\n<div class=\"head\">...</div>\n" +
"<div class=\"container\">...</div>\n" +
"<div class=\"foot\">...</div>\n</body>";
var rxStr = "<div[^<]+class=([\"'])container\\1.*</div>";
var rx = new System.Text.RegularExpressions.Regex (rxStr,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
var nStr = rx.Replace (data, "");
Console.WriteLine (nStr);
This will reduce your string to
<body>
<div class="head">...</div>
<div class="foot">...</div>
</body>
