Get href from html using mshtml in C# - c#

I am trying to get the href link out of the following HTML code using mshtml in C# (WPF).
<a class="button_link" href="https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&sig=b0dbd522380a21007d8c375iuc583f46a90365d9&iid=am-130280753913638201274485430&ac=1&uid=1284488216&nid=18+308" style="border:none;color:#0084b4;text-decoration:none;color:#ffffff;font-size:13px;font-weight:bold;font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;">Confirm your account now</a>
I tried the following code with mshtml in C# (WPF), but it does not work.
HTMLDocument mdoc = (HTMLDocument)browser.Document;
string innerHtml = mdoc.body.outerText;
string str = "https://rhystowey.com/account/confirm_email/";
int index = innerHtml.IndexOf(str);
innerHtml = innerHtml.Remove(0, index + str.Length);
int startIndex = innerHtml.IndexOf("\"");
string str3 = innerHtml.Remove(startIndex, innerHtml.Length - startIndex);
string thelink = "https://rhystowey.com/account/confirm_email/" + str3;
Can someone please help me get this to work?

Use this:
var ex = new Regex("href=\"(.*)\" style");
var tag = "<a class=\"button_link\" href=\"https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&sig=b0dbd522380a21007d8c375iuc583f46a90365d9&iid=am-130280753913638201274485430&ac=1&uid=1284488216&nid=18+308\" style=\"border:none;color:#0084b4;text-decoration:none;color:#ffffff;font-size:13px;font-weight:bold;font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;\">Confirm your account now</a>";
var address = ex.Match(tag).Groups[1].ToString();
But you should add checks: if the pattern does not match, Groups[1].Value is an empty string, so test Match.Success before using the result.
In your example, read the markup (innerHTML) rather than the rendered text (outerText), which contains no tags:
HTMLDocument mdoc = (HTMLDocument)browser.Document;
string innerHtml = mdoc.body.innerHTML;
var ex = new Regex("href=\"([^\"]+)\"");
var address = ex.Match(innerHtml).Groups[1].ToString();
This matches the first href="...". Or you can select all occurrences:
var matches = (from Match match in ex.Matches(innerHtml) select match.Groups[1].Value).ToList();
This will give you a List<string> with all the links in your HTML. To filter this, you can either go this way
var wantedMatches = matches.Where(m => m.StartsWith("https://rhystowey.com/account/confirm_email/"));
which is more flexible because you could check against a list of start strings or whatever. Or you do it in your regex, which performs better:
var ex = new Regex("href=\"(https://rhystowey\\.com/account/confirm_email/[^\"]+)\"");
Bringing it all together into what you want, as far as I understand:
var ex = new Regex("href=\"(https://rhystowey\\.com/account/confirm_email/[^\"]+)\"");
var matches = (from Match match in ex.Matches(innerHtml)
where match.Groups[1].Success
select match.Groups[1].Value).ToList();
var firstAddress = matches.FirstOrDefault();
firstAddress holds your link if there is one; otherwise it is null.

If your link will always start with the same path and isn't repeated on the page, you can use this (untested):
var match = Regex.Match(html, @"href=""(?<href>https\:\/\/rhystowey\.com\/account\/confirm_email\/[^""]+)""");
if (match.Success)
{
var href = match.Groups["href"].Value;
....
}
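The regex approaches in the answers above can be combined into one small helper. This is only a sketch: the class name HrefExtractor, the method FirstHref, and the shortened sample URL are made up for illustration, and a regex remains a fragile way to read HTML compared with a real parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Hypothetical helper: returns the first href value that starts with
    // the given prefix, or null when no such href is present.
    public static string FirstHref(string html, string prefix)
    {
        // Regex.Escape keeps characters such as '.' in the prefix literal.
        var pattern = "href=\"(" + Regex.Escape(prefix) + "[^\"]*)\"";
        var match = Regex.Match(html, pattern);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Shortened sample markup, not the full tag from the question.
        var html = "<a class=\"button_link\" "
                 + "href=\"https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&ac=1\" "
                 + "style=\"border:none;\">Confirm your account now</a>";

        var link = FirstHref(html, "https://rhystowey.com/account/confirm_email/");
        Console.WriteLine(link ?? "(no link found)");
    }
}
```

Checking Match.Success before reading the group makes the "no link on the page" case explicit instead of silently producing an empty string.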

Related

regex variable from script 32/34 characters

From the following code I am trying to get the value of the script variable. I'm interested in the text between the double quotes.
var code = "a37965dcd8421328a767c697448ed735";
XPathResult xpathResult = geckoWebBrowser1.Document.EvaluateXPath("/html/body/table[3]/tbody/tr[1]/td[2]/script");
var foundNodes = xpathResult.GetNodes();
foreach (var node in foundNodes)
{
var x = node.TextContent; // get text text contained by this node (including children)
GeckoHtmlElement element = node as GeckoHtmlElement; //cast to access.. inner/outerHtml
string inner = element.InnerHtml;
string outer = element.OuterHtml;
String pattent = ".[0-9a-zA-Z]{34}$.";
Match match = Regex.Match(inner, pattent);
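Is the regex correct? What am I doing wrong?
Try the pattern [0-9a-zA-Z]{32,34} instead of .[0-9a-zA-Z]{34}$.
The leading and trailing . can be removed. In the original pattern the final . has to match a character after the end-of-string anchor $, so it can never match.
To test whether a string consists only of such a code, anchor the pattern:
bool result = Regex.Match(inner, @"^[0-9a-zA-Z]{32,34}$").Success;
Console.WriteLine(result);
If result is true, the match succeeded. The anchored pattern only matches when inner contains nothing but the code; drop ^ and $ to find the code inside a longer script.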
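Since the question asks for the text between the double quotes, a capture group can return the value directly. The following is a minimal sketch, assuming the script text looks like the var code = "..." line in the question; the class name ScriptCode and the method Extract are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptCode
{
    // Hypothetical helper: returns the 32-34 character alphanumeric value
    // found between double quotes, or null when there is none.
    public static string Extract(string script)
    {
        var match = Regex.Match(script, "\"([0-9a-zA-Z]{32,34})\"");
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Sample script content taken from the question.
        var inner = "var code = \"a37965dcd8421328a767c697448ed735\";";
        Console.WriteLine(Extract(inner) ?? "(no code found)");
    }
}
```

Including the quotes in the pattern, but outside the group, keeps the match from landing on some other long alphanumeric run in the script.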

Get the titles and URLs of Yahoo result page in c#

I want to get the titles and URLs from a Yahoo results page with HtmlAgilityPack:
HtmlWeb w = new HtmlWeb();
string SearchResults = "https://en-maktoob.search.yahoo.com/search?p=" + query.querytxt;
var hd = w.Load(SearchResults);
var nodes = hd.DocumentNode.SelectNodes("//a[@cite and @href]");
if (nodes != null)
{
foreach (var node in nodes)
{
{
string Text = node.Attributes["title"].Value;
string Href = node.Attributes["href"].Value;
}
}
It works, but not all links in the search results are appropriate. How can I omit ad links, Yahoo's own links, and so on?
I want to access only the actual result links.
What about this:
HtmlWeb w = new HtmlWeb();
string search = "https://en-maktoob.search.yahoo.com/search?p=veverke";
//ac-algo ac-21th lh-15
var hd = w.Load(search);
var titles = hd.DocumentNode.CssSelect(".title a").Select(n => n.InnerText);
var links = hd.DocumentNode.CssSelect(".fz-15px.fw-m.fc-12th.wr-bw.lh-15").Select(n => n.InnerText);
for (int i = 0; i < titles.Count() - 1; i++)
{
var title = titles.ElementAt(i);
string link = string.Empty;
if (links.Count() > i)
link = links.ElementAt(i);
Console.WriteLine("Title: {0}, Link: {1}", title, link);
}
Keep in mind that I am using the extension method CssSelect from the ScrapySharp NuGet package. Install it just as you installed HtmlAgilityPack, then add using ScrapySharp.Extensions; at the top of the file. (I use it because CSS selectors are easier to write than XPath expressions.)
Regarding skipping ads: in these Yahoo search results, ads seem to appear only as the last record. Assuming that is correct, simply skip the last one (the loop above stops one short of titles.Count()).

HTMLAgilityPack selects nodes from first iteration through divs

I'm trying to use HtmlAgilityPack to parse a website for the first time. Everything works as expected, but only for the first iteration. On each iteration I get a unique div with its own data, but SelectNodes() always returns data from the first iteration.
The code listed below shows the problem.
All the properties of station get their values from the first iteration.
static void Main(string[] args)
{
List<Station> stations = new List<Station>();
wClient = new WebClient();
wClient.Proxy = null;
wClient.Encoding = encode;
for (int i = 1; i <= 1; i++)
{
HtmlDocument html = new HtmlDocument();
string link = string.Format("http://energybase.ru/powerPlant/index?PowerPlant_page={0}&pageSize=20&q=/powerPlant", i);
html.LoadHtml(wClient.DownloadString(link));
var stationList = html.DocumentNode.SelectNodes("//div[@class='items']").First().ChildNodes.Where(x=>x.Name=="div").ToList();//get list of nodes with PowerStation Data
foreach (var item in stationList) //each iteration returns Item with unique InnerHTML
{
Station st = new Station();
st.Name = item.SelectNodes("//div[@class='col-md-20']").First().SelectNodes("//div[@class='name']").First().ChildNodes["a"].InnerText;//gets name from first iteration
st.Url = item.SelectNodes("//div[@class='col-md-20']").First().SelectNodes("//div[@class='name']").First().ChildNodes["a"].Attributes["href"].Value;//gets url from first iteration and so on
st.Company = item.SelectNodes("//div[@class='col-md-20']").First().SelectNodes("//div[@class='name']").First().ChildNodes["small"].ChildNodes["em"].ChildNodes["a"].InnerText;
stations.Add(st);
}
}
Maybe I am missing some of the essentials of OOP?
Your code can be greatly simplified by using the full power of XPath.
var stationList = html.DocumentNode.SelectNodes("//div[@class='items']/div");
// The XPath expression could also be "//div[@class='items'][1]/div"
// where [1] means first node
foreach (var item in stationList)
{
Station st = new Station();
st.Name = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']/a").InnerText;
st.Url = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']/a").Attributes["href"].Value;
string rawText = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']/small/em").InnerText;
st.Company = HttpUtility.HtmlDecode(rawText.Trim());
stations.Add(st);
}
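Your mistake was using the XPath descendant axis, //div. An expression that starts with // is evaluated from the document root even when it is called on item, so every iteration returned the first match in the whole document.
Even better, rewrite the code like this:
var divName = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']");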
var nodeA = divName.SelectSingleNode("a");
st.Name = nodeA.InnerText;
st.Url = nodeA.Attributes["href"].Value;
string rawText = divName.SelectSingleNode("small/em").InnerText;
st.Company = HttpUtility.HtmlDecode(rawText.Trim());
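The difference between a root-anchored and a relative XPath can be seen on a tiny document. This is a sketch with made-up markup and a hypothetical helper named ReadNames; it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Hypothetical helper: for every item div, evaluates linkXPath on that
    // item and collects the inner text of the node it selects.
    public static List<string> ReadNames(string html, string linkXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var names = new List<string>();
        var items = doc.DocumentNode.SelectNodes("//div[@class='items']/div");
        if (items == null) return names; // SelectNodes returns null when nothing matches

        foreach (var item in items)
        {
            var link = item.SelectSingleNode(linkXPath);
            if (link != null) names.Add(link.InnerText);
        }
        return names;
    }

    public static void Main()
    {
        // Made-up sample markup with two items.
        var html = "<div class='items'>"
                 + "<div><div class='name'><a href='/a'>Alpha</a></div></div>"
                 + "<div><div class='name'><a href='/b'>Beta</a></div></div>"
                 + "</div>";

        // "//..." starts at the document root, so each item yields the same node.
        Console.WriteLine(string.Join(", ", ReadNames(html, "//div[@class='name']/a")));
        // ".//..." starts at the current item, so each item yields its own link.
        Console.WriteLine(string.Join(", ", ReadNames(html, ".//div[@class='name']/a")));
    }
}
```

A path without a leading slash, as used in the answer above, is also relative to the current node; the .// form is useful when the target is not a direct child.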
This article contains some good examples on various aspects of HtmlAgilityPack.
Have a look at this article; it will give you a quick start.

Regex expression to search upto nested level

How can I search a string down to a nested level using a regular expression?
For example, I have a string like
var str = "samir patel {samirpatel@test1.com{sam@somedomain.com}}";
The output should be sam@somedomain.com
You could simply use this pattern:
{([^{}]*)}
This will match any string like {some content} which does not itself contain another group like {some content}.
You can capture this using:
var str = "samir patel {samirpatel@test1.com{sam@somedomain.com}}";
var regex = new Regex("{([^{}]*)}");
var matches = regex.Matches(str);
var output = matches[0].Groups[1].Value;
// output == "sam@somedomain.com"
Or more simply:
var str = "samir patel {samirpatel@test1.com{sam@somedomain.com}}";
var output = Regex.Match(str, "{([^{}]*)}").Groups[1].Value;
// output == "sam@somedomain.com"
You could get this result using (?<=\{)[^{}]*(?=\}), assuming a language other than JavaScript. In C#, for example, that's
result = Regex.Match(str, @"(?<=\{)[^{}]*(?=\})").Value;
If you're using JavaScript, use \{([^{}]*)\} and access $1 for the match result:
var myregexp = /\{([^{}]*)\}/;
var match = myregexp.exec(subject);
if (match != null) {
result = match[1];
}
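When the input can contain several innermost groups, Regex.Matches returns all of them. A minimal C# sketch, using a made-up input string and a hypothetical helper named Find:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class InnermostBraces
{
    // Hypothetical helper: returns the content of every {...} group
    // that does not itself contain another brace.
    public static string[] Find(string input)
    {
        return Regex.Matches(input, @"\{([^{}]*)\}")
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Made-up input with one nested group and one top-level group.
        foreach (var value in Find("a {b {c} d} {e}"))
        {
            Console.WriteLine(value);
        }
    }
}
```

For this input the helper yields c and e: the outer group around b ... d is skipped because its content contains braces.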

Working with HtmlAgilityPack

I'm trying to get a link and another element from an HTML page, but I don't really know what to do. This is what I have right now:
var client = new HtmlWeb(); // Initialize HtmlAgilityPack's functions.
var url = "http://p.thedgtl.net/index.php?tag=-1&title={0}&author=&o=u&od=d&page=-1&"; // The site/page we are indexing.
var doc = client.Load(string.Format(url, textBox1.Text)); // Index the whole DB.
var nodes = doc.DocumentNode.SelectNodes("//a[@href]"); // Get every url.
string authorName = "";
string fileName = "";
string fileNameWithExt;
foreach (HtmlNode link in nodes)
{
string completeUrl = link.Attributes["href"].Value; // The complete plugin download url.
#region Get all jars
if (completeUrl.Contains(".jar")) // Check if the url contains .jar
{
fileNameWithExt = completeUrl.Substring(completeUrl.LastIndexOf('/') + 1); // Get the filename with extension.
fileName = fileNameWithExt.Remove(fileNameWithExt.LastIndexOf('.')); // Get the filename without extension.
Console.WriteLine(fileName);
}
#endregion
#region Get all Authors
if (completeUrl.Contains("?author=")) // Check if the url contains ?author=
{
authorName = completeUrl.Substring(completeUrl.LastIndexOf('=') + 1); // Get the author name after the last '='.
Console.WriteLine(authorName);
}
#endregion
}
I am trying to get all the file names and authors next to each other, but the output appears in a seemingly random order. Why?
Can someone help me with this? Thanks!
If you look at the HTML, it is unfortunately not well-formed. Many tags are left open, and HAP does not repair the structure the way a browser does; it interprets most of the document as deeply nested. So you can't simply iterate through the rows of the table as you would in a browser.
When dealing with such documents, you have to change your queries quite a bit: rather than searching through child elements, search through descendants and adjust for the difference.
var title = System.Web.HttpUtility.UrlEncode(textBox1.Text);
var url = String.Format("http://p.thedgtl.net/index.php?title={0}", title);
var web = new HtmlWeb();
var doc = web.Load(url);
// select the rows in the table
var xpath = "//div[@class='content']/div[@class='pluginList']/table[2]";
var table = doc.DocumentNode.SelectSingleNode(xpath);
// unfortunately the `tr` tags are not closed so HAP interprets
// this table having a single row with multiple descendant `tr`s
var rows = table.Descendants("tr")
.Skip(1); // skip header row
var query =
from row in rows
// there may be a row with an embedded ad
where row.SelectSingleNode("td/script") == null
// each row has 6 columns so we need to grab the next 6 descendants
let columns = row.Descendants("td").Take(6).ToList()
let titleText = columns[1].Elements("a").Select(a => a.InnerText).FirstOrDefault()
let authorText = columns[2].Elements("a").Select(a => a.InnerText).FirstOrDefault()
let downloadLink = columns[5].Elements("a").Select(a => a.GetAttributeValue("href", null)).FirstOrDefault()
select new
{
Title = titleText ?? "",
Author = authorText ?? "",
FileName = Path.GetFileName(downloadLink ?? ""),
};
So now you can just iterate through the query and write out what you want for each of the rows.
foreach (var item in query)
{
Console.WriteLine("{0} ({1})", item.FileName, item.Author);
}
