Wrong string found while parsing HTML - c#

Here is my regular expression for getting the version number from the Play Store HTML content:
var content = responseMsg.Content == null
? null
: await responseMsg.Content.ReadAsStringAsync();
var versionMatch = Regex.Match(
content,
"<div[^>]*>Current Version</div><span[^>]*><div><span[^>]*>(.*?)<").Groups[1];
if (versionMatch.Success)
{
version = versionMatch.Value.Trim();
}
Here versionMatch comes back empty (shown as "{}").
How can I get the proper version, e.g. versionMatch = "1.9"?
The HTML content is very large, so here is only the relevant excerpt:
<div class="hAyfc">
<div class="BgcNfc">Current Version</div>
<span class="htlgb">
<div class="IQ1z0d">
<span class="htlgb">1.9</span>
</div>

To skip over the intermediate markup between Current Version</div> and the <span> that holds the version number, you can use a non-greedy .*?. The dot also matches \r\n if RegexOptions.Singleline is given. To land on the correct span, specify its content as "digits and dots" ([\d\.]+) instead of "anything" (.*?):
var content = @"<div class=""hAyfc"">
<div class=""BgcNfc"">Current Version</div>
<span class=""htlgb"">
<div class=""IQ1z0d"">
<span class=""htlgb"">1.9</span>
</div>";
var versionMatch = Regex.Match(
content,
@"<div[^>]*>Current Version</div>.*?<span[^>]*>([\d\.]+)<", RegexOptions.Singleline).Groups[1];
versionMatch.Value is then "1.9"

You could try using HtmlAgilityPack with Fizzler.Systems.HtmlAgilityPack so you can basically do something like this:
var web = new HtmlWeb();
var html = web.Load(uri);
var documentNode = html.DocumentNode;
var version = documentNode.QuerySelector(".htlgb").InnerHtml;
That way you don't have to worry about the regex.

Related

Getting InnerText ignoring script node by using Html Agility Pack in C#

I have the following page, from which I want to get a list of proxy servers from a table:
http://proxy-list.org/spanish/search.php?search=&country=any&type=any&port=any&ssl=any
Each row in the table is a ul element. My problem is with the first li element of each ul, the one whose class is "proxy". I want to obtain the IP and port, so I read InnerText, but because the li element has a script child node, it returns the text of the script node.
Below is an image of the structure of the page:
I have tried the code below, using Html Agility Pack and LINQ:
WebClient webClient = new WebClient();
string page = webClient.DownloadString("http://proxy-list.org/spanish/search.php?search=&country=any&type=any&port=any&ssl=any");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
List<List<string>> table = doc.DocumentNode.SelectSingleNode("//div[@class='table']")
.Descendants("ul")
.Where(ul => ul.Elements("li").Count() > 1)
.Select(ul => ul.Elements("li").Select(li =>
{
string result = string.Empty;
if (li.HasClass("proxy"))
{
HtmlNode liTmp = li.Clone();
liTmp.RemoveAllChildren();
result = liTmp.InnerText.Trim();
}
else
{
result = li.InnerText.Trim();
}
return result;
}).ToList()).ToList();
I can obtain a list in which each item is a list containing the fields (Proxy, País, Tipo, Velocidad, HTTPS/SSL), but the Proxy field is always empty. I am also not getting the "País" and "Ciudad" columns at all.
That is because those values are injected into the DOM by JavaScript after the page loads. The value inside Proxy() is actually a Base64 representation of what you are looking for.
In the image you have posted above the value MTQ4LjI0My4zNy4xMDE6NTMyODE= decodes to 148.243.37.101:53281
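As a quick check of that claim, here is a minimal sketch that pulls the Base64 payload out of a Proxy('...') call and decodes it. It uses only the .NET base class library and is written with top-level statements, so it assumes C# 9 or later. The scriptText variable is a hypothetical stand-in for the script node's text, and UTF-8 is an assumed encoding (the payload here is plain ASCII).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the <script> text inside <li class="proxy">.
string scriptText = "Proxy('MTQ4LjI0My4zNy4xMDE6NTMyODE=')";

// Capture whatever sits between the single quotes of Proxy('...').
Match match = Regex.Match(scriptText, @"Proxy\('(?<value>[^']+)'\)");

// Decode only when the pattern was found; otherwise fall back to an empty string.
string decoded = match.Success
    ? Encoding.UTF8.GetString(Convert.FromBase64String(match.Groups["value"].Value))
    : string.Empty;

Console.WriteLine(decoded); // prints 148.243.37.101:53281
```

Checking match.Success first avoids passing an empty capture to Convert.FromBase64String when a list item has no Proxy(...) call.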
The raw parsed string you are feeding to the Agility pack only contains the Proxy field...
<div class="table-wrap">
<div class="table">
<ul>
<li class="proxy">
<script type="text/javascript">
Proxy('MTM4Ljk3LjkyLjI0OTo1MzgxNg==')
</script>
</li>
<li class="https">HTTP</li>
<li class="speed">29.5kbit</li>
<li class="type">
<strong>Elite</strong>
</li>
<li class="country-city">
<div>
<span class="country" title="Brazil">
<span class="country-code">
<span class="flag br"></span>
<span class="name">BR Brasil</span>
</span>
</span>
<!-- -->
<span class="city">
<span>Rondon</span>
</span>
</div>
</li>
</ul>
<div class="clear"></div>
Using the following code:
HttpClient client = new HttpClient();
var docResult = client.GetStringAsync("http://proxy-list.org/spanish/search.php?search=&country=any&type=any&port=any&ssl=any").Result;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(docResult);
Regex reg = new Regex(@"Proxy\('(?<value>.*?)'\)", RegexOptions.Compiled | RegexOptions.IgnoreCase);
var stuff = doc.DocumentNode.SelectSingleNode("//div[@class='table']")
.Descendants("li")
.Where(x => x.HasClass("proxy"))
.Select(li =>
{
return li.InnerText;
}).ToList();
foreach (var item in stuff)
{
var match = reg.Match(item);
var proxy = Encoding.Default.GetString(System.Convert.FromBase64String(match.Groups["value"].Value));
Console.WriteLine($"{item}\t\tproxy = {proxy}");
}
I get:

c# Regex inside some html tag

I have been trying for hours to use a regex to extract the text inside some HTML tags:
<div class="ewok-rater-header-section">
<ul class="header">
<li><h1>meow</h1></li>
<li><h1>meow2</h1></li>
<li><h1>Time = <span class="work-weight">9.0 minutes</span></h1></li>
</ul>
</div>
I get meow with
var regexpost = new System.Text.RegularExpressions.Regex(@"<h1(.*?)>(.*?)</h1>");
var mpost = regexpost.Match(reqpost);
string lechat = (mpost.Groups[2].Value).ToString();
but not the others.
I would like to put meow in one textbox, meow2 in a second textbox, and 9.0 (minutes) in a third one.
In these situations an HTML parser can help a lot, and it can also be much more precise and robust.
Html Agility Pack
Example
var html = @"<div class=""ewok-rater-header-section"">
<li><h1>meow</h1></li>
<li><h1>meow2</h1></li>
<li><h1>Time = <span class=""work-weight"">9.0 minutes</span></h1></li>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
// you can search for the heading
foreach (var node in doc.DocumentNode.SelectNodes("//li//h1"))
{
Console.WriteLine("Found heading : " + node.InnerText);
}
// or you can be more specific
var someSpan = doc.DocumentNode
.SelectNodes("//span[@class='work-weight']")
.FirstOrDefault();
Console.WriteLine("Found span : " + someSpan.InnerText);
Output
Found heading : meow
Found heading : meow2
Found heading : Time = 9.0 minutes
Found span : 9.0 minutes
It is for parsing an HTTP response. Isn't it slow to use an HTML parser to build a document?
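For a fragment this small, a regex-only route is also workable. The original attempt returned only meow because Regex.Match stops at the first match, whereas Regex.Matches returns all of them. The sketch below assumes the HTML from the question is held in reqpost and that each h1 sits on a single line. The headings list is a hypothetical name, and the tag-stripping step is an illustrative addition rather than part of the original code. It uses top-level statements (C# 9 or later) and makes no claim about speed relative to a parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sample input mirroring the markup in the question.
string reqpost = @"<ul class=""header"">
<li><h1>meow</h1></li>
<li><h1>meow2</h1></li>
<li><h1>Time = <span class=""work-weight"">9.0 minutes</span></h1></li>
</ul>";

var headings = new List<string>();

// Matches (plural) yields every <h1>...</h1> pair, not just the first one.
foreach (Match m in Regex.Matches(reqpost, @"<h1[^>]*>(.*?)</h1>"))
{
    // Remove nested tags such as <span ...> so only the text remains.
    string text = Regex.Replace(m.Groups[1].Value, @"<[^>]+>", "").Trim();
    headings.Add(text);
}

foreach (string h in headings)
{
    Console.WriteLine(h);
}
// prints:
// meow
// meow2
// Time = 9.0 minutes
```

Each entry in headings could then be assigned to its own textbox. This approach is brittle if the markup changes (for example an h1 spanning several lines), which is where a parser is the safer choice.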

HTML Agility Pack Parsing div

I'm trying to parse HTML, I need to get "text" from this part:
<div class="_gdf kno-fb-ctx">
<span data-ved="0ahUKEwjIr9brjO7UAhUnYZoKHda-ALgQ2koIogEoAjAT"> text</span>
</div>
Here's my C# code:
var message = doc.DocumentNode.SelectSingleNode("//div[@class='_gdf kno-fb-ctx']").InnerText;
Console.WriteLine(message);
What am I doing wrong?
You are not selecting the actual span node before reading InnerText. You selected the div and read its InnerText, which won't give you the desired result "text". Instead you can do the following:
HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<div class='_gdf kno-fb-ctx'><span data-ved = '0ahUKEwjIr9brjO7UAhUnYZoKHda-ALgQ2koIogEoAjAT'> text </span ></div >");
var text = doc.DocumentNode.SelectSingleNode("//div[@class=\"_gdf kno-fb-ctx\"]//span").InnerText;

html agility pack getting same output twice c#

<div class="header">
<span id="content">test1</span>
</div>
<div class="header">
<span id="content">test2</span>
</div>
var web = new HtmlWeb();
var doc = web.Load(url);
var value = doc.DocumentNode.SelectNodes("//div[@class='header']");
foreach (var v in value)
{
var name = v.SelectSingleNode("//span[@id='content']");
Console.WriteLine(name.OuterHtml);
}
The code above outputs <span id="content">test1</span> twice, instead of <span id="content">test2</span> as the second output. So it gets the correct number of nodes but not the correct output.
An XPath expression that starts with // or / is evaluated from the document root, even when you call it on the current node.
Please see my fix in your code.
var value = doc.DocumentNode.SelectNodes("//div[@class='header']");
foreach (var v in value)
{
var name = v.SelectSingleNode("span[@id='content']");
Console.WriteLine(name.OuterHtml);
}
See this fiddle. https://dotnetfiddle.net/nih2lw
As a side note, an id attribute should always be unique within the document. Use class instead.
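The difference between an absolute and a relative XPath can be reproduced without any extra package. The sketch below deliberately swaps in System.Xml's XmlDocument for Html Agility Pack so that it stays self-contained. Both evaluate XPath location paths, but this is an illustration of the rule, not the Html Agility Pack API. The sample markup, the root wrapper element, the use of class instead of id, and the lines list are all assumptions made for the example. It uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml(@"<root>
  <div class=""header""><span class=""content"">test1</span></div>
  <div class=""header""><span class=""content"">test2</span></div>
</root>");

var lines = new List<string>();

foreach (XmlNode div in doc.SelectNodes("//div[@class='header']"))
{
    // Leading "//": searched from the document root, so this is always the first span.
    XmlNode fromRoot = div.SelectSingleNode("//span[@class='content']");

    // No leading slash: searched only among the children of the current div.
    XmlNode fromDiv = div.SelectSingleNode("span[@class='content']");

    lines.Add($"{fromRoot.InnerText} / {fromDiv.InnerText}");
}

foreach (string line in lines)
{
    Console.WriteLine(line);
}
// prints:
// test1 / test1
// test1 / test2
```

If the span were nested deeper inside each div, .//span[@class='content'] (note the leading dot) would search all descendants of the current node while still staying relative to it.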

Remove whole div with specific class name

Is it possible to remove the whole div with a specific class name? For example;
<body>
<div class="head">...</div>
<div class="container">...</div>
<div class="foot">...</div>
</body>
I would like to remove the div with the "container" class.
A C# code example would be very useful, thank you.
The proper way (I suppose) to do this is via built in Gecko DOM classes and methods.
So, in your case something like:
var containers = yourDocument.GetElementsByClassName("container");
// This returns an IEnumerable of elements with this class. If you will only ever have one, you can do it like this:
var yourContainer = containers.FirstOrDefault();
yourContainer.Parent.RemoveChild(yourContainer);
Obviously, you can also do loops etc.
If you want to parse HTML in C#, the best way is to use Html Agility Pack:
https://htmlagilitypack.codeplex.com/
HtmlDocument document = new HtmlDocument();
document.Load(@"C:\yourfile.html");
// ToList() (from System.Linq) snapshots the matches so nodes can be removed while iterating.
var nodesToRemove = document.DocumentNode.SelectNodes("//div[@class='container']").ToList();
foreach (var node in nodesToRemove)
node.Remove();
Well, with the help of a regex, you can remove the desired div:
var data = "<body>\n<div class=\"head\">...</div>\n" +
"<div class=\"container\">...</div>\n" +
"<div class=\"foot\">...</div>\n</body>";
var rxStr = "<div[^<]+class=([\"'])container\\1.*</div>";
var rx = new System.Text.RegularExpressions.Regex (rxStr,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
var nStr = rx.Replace (data, "");
Console.WriteLine (nStr);
This will reduce your string to
<body>
<div class="head">...</div>
<div class="foot">...</div>
</body>
