HTMLAgilityPack selects nodes from first iteration through divs - c#

I'm trying to use HTMLAgilityPack to parse some website for the first time. Everything works as expected but only for first iteration. On each iteration I get unique div with its data, but SelectNodes() always gets data from first iteration.
The code listed below explains the problem
All the properties for station get values from first iteration.
static void Main(string[] args)
{
List<Station> stations = new List<Station>();
wClient = new WebClient();
wClient.Proxy = null;
wClient.Encoding = encode;
for (int i = 1; i <= 1; i++)
{
HtmlDocument html = new HtmlDocument();
string link = string.Format("http://energybase.ru/powerPlant/index?PowerPlant_page={0}&pageSize=20&q=/powerPlant", i);
html.LoadHtml(wClient.DownloadString(link));
var stationList = html.DocumentNode.SelectNodes("//div[#class='items']").First().ChildNodes.Where(x=>x.Name=="div").ToList();//get list of nodes with PowerStation Data
foreach (var item in stationList) //each iteration returns Item with unique InnerHTML
{
Station st = new Station();
st.Name = item.SelectNodes("//div[#class='col-md-20']").First().SelectNodes("//div[#class='name']").First().ChildNodes["a"].InnerText;//gets name from first iteration
st.Url = item.SelectNodes("//div[#class='col-md-20']").First().SelectNodes("//div[#class='name']").First().ChildNodes["a"].Attributes["href"].Value;//gets url from first iteration and so on
st.Company = item.SelectNodes("//div[#class='col-md-20']").First().SelectNodes("//div[#class='name']").First().ChildNodes["small"].ChildNodes["em"].ChildNodes["a"].InnerText;
stations.Add(st);
}
}
Maybe I am not getting some of essentials of OOP?

Your code can be greatly simplified by using the full power of XPath.
var stationList = html.DocumentNode.SelectNodes("//div[#class='items']/div");
// XPath-expression may be so: "//div[#class='items'][1]/div"
// where [1] means first node
foreach (var item in stationList)
{
Station st = new Station();
st.Name = item.SelectSingleNode("div[#class='col-md-20']/div[#class='name']/a").InnerText;
st.Url = item.SelectSingleNode("div[#class='col-md-20']/div[#class='name']/a").Attributes["href"].Value;
string rawText = item.SelectSingleNode("div[#class='col-md-20']/div[#class='name']/small/em").InnerText;
st.Company = HttpUtility.HtmlDecode(rawText.Trim());
stations.Add(st);
}
Your mistake was to use XPath descendants axis: //div.
Even better rewrite code like this:
var divName = item.SelectSingleNode("div[#class='col-md-20']/div[#class='name']");
var nodeA = divName.SelectSingleNode("a");
st.Name = nodeA.InnerText;
st.Url = nodeA.Attributes["href"].Value;
string rawText = divName.SelectSingleNode("small/em").InnerText;
st.Company = HttpUtility.HtmlDecode(rawText.Trim());

This article contains some good exaples on various aspects of html agility pack.
have a look into this article, it would give you a quick start.

Related

Get the titles and URLs of Yahoo result page in c#

I want to get titles and URLs of Yahoo result page with htmlagility pack
HtmlWeb w = new HtmlWeb();
string SearchResults = "https://en-maktoob.search.yahoo.com/search?p=" + query.querytxt;
var hd = w.Load(SearchResults);
var nodes = hd.DocumentNode.SelectNodes("//a[#cite and #href]");
if (nodes != null)
{
foreach (var node in nodes)
{
{
string Text = node.Attributes["title"].Value;
string Href = node.Attributes["href"].Value;
}
}
It works but all links in search result are not Appropriate links how to omit ads link , Yahoo links and etc .
I want to access the correct links
What about this:
HtmlWeb w = new HtmlWeb();
string search = "https://en-maktoob.search.yahoo.com/search?p=veverke";
//ac-algo ac-21th lh-15
var hd = w.Load(search);
var titles = hd.DocumentNode.CssSelect(".title a").Select(n => n.InnerText);
var links = hd.DocumentNode.CssSelect(".fz-15px.fw-m.fc-12th.wr-bw.lh-15").Select(n => n.InnerText);
for (int i = 0; i < titles.Count() - 1; i++)
{
var title = titles.ElementAt(i);
string link = string.Empty;
if (links.Count() > i)
link = links.ElementAt(i);
Console.WriteLine("Title: {0}, Link: {1}", title, link);
}
Keep in mind that I am using the extension method CssSelect, from nuget package's ScrapySharp. Install it just like you installed HtmlAgilityPack, then add a using statement at the top of the code like using ScrapySharp.Extensions; and you are good to go. (I use it because its easier to refer to css selectors instead of xpath expressions...)
Regarding skipping ads, I noticed ads in these yahoo search results will come at the last record only ? Assuming I am correct, simply skip the last one.
Here's the output I get for running the code above:

Why this difference between foreach vs Parallel.ForEach?

Can anyone explain to me in simple langauage why I get a file about 65 k when using foreach and more then 3 GB when using Parallel.ForEach?
The code for the foreach:
// start node xml document
var logItems = new XElement("log", new XAttribute("start", DateTime.Now.ToString("yyyy-MM-ddTHH:mm:ss")));
var products = new ProductLogic().SelectProducts();
var productGroupLogic = new ProductGroupLogic();
var productOptionLogic = new ProductOptionLogic();
// loop through all products
foreach (var product in products)
{
// is in a specific group
var id = Convert.ToInt32(product["ProductID"]);
var isInGroup = productGroupLogic.GetProductGroups(new int[] { id }.ToList(), groupId).Count > 0;
// get product stock per option
var productSizes = productOptionLogic.GetProductStockByProductId(id).ToList();
// any stock available
var stock = productSizes.Sum(ps => ps.Stock);
var hasStock = stock > 0;
// get webpage for this product
var productUrl = string.Format(url, id);
var htmlPage = Html.Page.GetWebPage(productUrl);
// check if there is anything to log
var addToLog = false;
XElement sizeElements = null;
// if has no stock or in group
if (!hasStock || isInGroupNew)
{
// page shows => not ok => LOG!
if (!htmlPage.NotFound) addToLog = true;
}
// if page is ok
if (htmlPage.IsOk)
{
sizeElements = GetSizeElements(htmlPage.Html, productSizes);
addToLog = sizeElements != null;
}
if (addToLog) logItems.Add(CreateElement(productUrl, htmlPage, stock, isInGroup, sizeElements));
}
// save
var xDocument = new XDocument(new XDeclaration("1.0", "utf-8", "yes"), new XElement("log", logItems));
xDocument.Save(fileName);
Use of the parallel code is a minor change, just replaced the foreach with Parallel.ForEach:
// loop through all products
Parallel.ForEach(products, product =>
{
... code ...
};
The methods GetSizeElements and CreateElements are both static.
update1
I made the methods GetSizeElements and CreateElements threadsafe with a lock, also doesn't help.
update2
I get answer to solve the problem. That's nice and fine. But I would like to get some more insigths on why this codes creates a file that is so much bigger then the foreach solutions. I am trying get some more sense in how the code is working when using threads. That way I get more insight and can I learn to avoid the pitfalls.
One thing stands out:
if (addToLog)
logItems.Add(CreateElement(productUrl, htmlPage, stock, isInGroup, sizeElements));
logItems is not tread-safe. That could be your core problem but there are lots of other possibilities.
You have the output files, look for the differences.
Try to define the following parameters inside the foreach loop.
var productGroupLogic = new ProductGroupLogic();
var productOptionLogic = new ProductOptionLogic();
I think the only two is used by all of your threads inside the parallel foreach loop and the result is multiplied unnecessaryly.

Get href from html using mshtml in C#

I am trying to get the href link out of the following HTML code using mshtml in C# (WPF).
<a class="button_link" href="https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&sig=b0dbd522380a21007d8c375iuc583f46a90365d9&iid=am-130280753913638201274485430&ac=1&uid=1284488216&nid=18+308" style="border:none;color:#0084b4;text-decoration:none;color:#ffffff;font-size:13px;font-weight:bold;font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;">Confirm your account now</a>
I have tried using the following code to make this work by using mshtml in C# (WPF) but I have failed miserably.
HTMLDocument mdoc = (HTMLDocument)browser.Document;
string innerHtml = mdoc.body.outerText;
string str = "https://rhystowey.com/account/confirm_email/";
int index = innerHtml.IndexOf(str);
innerHtml = innerHtml.Remove(0, index + str.Length);
int startIndex = innerHtml.IndexOf("\"");
string str3 = innerHtml.Remove(startIndex, innerHtml.Length - startIndex);
string thelink = "https://rhystowey.com/account/confirm_email/" + str3;
Can someone please help me to get this to work.
Use this:
var ex = new Regex("href=\"(.*)\" style");
var tag = "<a class=\"button_link\" href=\"https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&sig=b0dbd522380a21007d8c375iuc583f46a90365d9&iid=am-130280753913638201274485430&ac=1&uid=1284488216&nid=18+308\" style=\"border:none;color:#0084b4;text-decoration:none;color:#ffffff;font-size:13px;font-weight:bold;font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;\">Confirm your account now</a>";
var address = ex.Match(tag).Groups[1].ToString();
But you should extend it with checks because for instance Groups[1] could be out of range.
In your example
HTMLDocument mdoc = (HTMLDocument)browser.Document;
string innerHtml = mdoc.body.outerText;
var ex = new Regex("href=\"([^\"\"]+)\"");
var address = ex.Match(innerHtml).Groups[1].ToString();
will match the first href="...". Or you select all occurrences:
var matches = (from Match match in ex.Matches(innerHtml) select match.Groups[1].Value).ToList();
This will give you a List<string> with all the links in your HTML. To filter this, you can either go this way
var wantedMatches = matches.Where(m => m.StartsWith("https://rhystowey.com/account/confirm_email/"));
which is more flexible because you could check against a list of start strings or whatever. Or you do it in your regex, which will lead in better performance:
var ex = new Regex("href=\"(https://rhystowey\\.com/account/confirm_email/[^\"\"]+)\"");
Bringing it all together to what you want as far as I understand
var ex = new Regex("href=\"(https://rhystowey\\.com/account/confirm_email/[^\"\"]+)\"");
var matches = (from Match match in ex.Matches(innerHTML)
where match.Groups.Count >= 1
select match.Groups[1].Value).ToList();
var firstAddress = matches.FirstOrDefault();
firstAddress holds your link, if there is one.
If your link will always start with the same path and isn't repeated on the page, you can use this (untested):
var match = Regex.Match(html, #"href=""(?<href>https\:\/\/rhystowey\.com\/account\/confirm_email\/[^""]+)""");
if (match.Success)
{
var href = match.Groups["href"].Value;
....
}

Working with HtmlAgilityPack

I'm trying to get a link and another element from an HTML page, but I don't really know what to do. This is what I have right now:
var client = new HtmlWeb(); // Initialize HtmlAgilityPack's functions.
var url = "http://p.thedgtl.net/index.php?tag=-1&title={0}&author=&o=u&od=d&page=-1&"; // The site/page we are indexing.
var doc = client.Load(string.Format(url, textBox1.Text)); // Index the whole DB.
var nodes = doc.DocumentNode.SelectNodes("//a[#href]"); // Get every url.
string authorName = "";
string fileName = "";
string fileNameWithExt;
foreach (HtmlNode link in nodes)
{
string completeUrl = link.Attributes["href"].Value; // The complete plugin download url.
#region Get all jars
if (completeUrl.Contains(".jar")) // Check if the url contains .jar
{
fileNameWithExt = completeUrl.Substring(completeUrl.LastIndexOf('/') + 1); // Get the filename with extension.
fileName = fileNameWithExt.Remove(fileNameWithExt.LastIndexOf('.')); ; // Get the filename without extension.
Console.WriteLine(fileName);
}
#endregion
#region Get all Authors
if (completeUrl.Contains("?author=")) // Check if the url contains .jar
{
authorName = completeUrl.Substring(completeUrl.LastIndexOf('=') + 1); // Get the filename with extension.
Console.WriteLine(authorName);
}
#endregion
}
I am trying to get all the filenames and authors next to each other, but now everything is like randomly placed, why?
Can someone help me with this? Thanks!
If you look at the HTML, it's very unfortunate it is not well-formed. There's a lot of open tags and the way HAP structures it is not like a browser, it interprets the majority of the document as deeply nested. So you can't just simply iterate through the rows of the table like you would in the browser, it gets a lot more complicated than that.
When dealing with such documents, you have to change your queries quite a bit. Rather than searching through child elements, you have to search through descendants adjusting for the change.
var title = System.Web.HttpUtility.UrlEncode(textBox1.Text);
var url = String.Format("http://p.thedgtl.net/index.php?title={0}", title);
var web = new HtmlWeb();
var doc = web.Load(url);
// select the rows in the table
var xpath = "//div[#class='content']/div[#class='pluginList']/table[2]";
var table = doc.DocumentNode.SelectSingleNode(xpath);
// unfortunately the `tr` tags are not closed so HAP interprets
// this table having a single row with multiple descendant `tr`s
var rows = table.Descendants("tr")
.Skip(1); // skip header row
var query =
from row in rows
// there may be a row with an embedded ad
where row.SelectSingleNode("td/script") == null
// each row has 6 columns so we need to grab the next 6 descendants
let columns = row.Descendants("td").Take(6).ToList()
let titleText = columns[1].Elements("a").Select(a => a.InnerText).FirstOrDefault()
let authorText = columns[2].Elements("a").Select(a => a.InnerText).FirstOrDefault()
let downloadLink = columns[5].Elements("a").Select(a => a.GetAttributeValue("href", null)).FirstOrDefault()
select new
{
Title = titleText ?? "",
Author = authorText ?? "",
FileName = Path.GetFileName(downloadLink ?? ""),
};
So now you can just iterate through the query and write out what you want for each of the rows.
foreach (var item in query)
{
Console.WriteLine("{0} ({1})", item.FileName, item.Author);
}

how to get count of the repeated values xml element in wpf

In above figure the element Chandru is repeated as two times.
so i have to count the repeated element.
But i don't know to get count of repeated element.
please help me.
Here the code what i wrote
public XML_3()
{
this.InitializeComponent();
XmlDocument doc = new XmlDocument();
doc.Load("D:/student_2.xml");
XmlNodeList student_list = doc.GetElementsByTagName("Student");
foreach (XmlNode node in student_list)
{
XmlElement student = (XmlElement)node;
string sname = student.GetElementsByTagName("Chandru")[0].InnerText;
string fname = student.GetElementsByTagName("FName")[0].InnerText;
string id = student.GetElementsByTagName("Chandru")[0].Attributes["ID"].InnerText;
Window.Content = sname + fname + id;
}
}
var count = student.GetElementsByTagName("Chandru").Count;
I think it would be easier using LINQ to XML and X-Classes:
var doc = XDocument.Load("D:/student_2.xml");
var results = (from s in doc.Root.Elements()
group s by s.Name into g
select new { Name = g.Key, Count = g.Count() }).ToList();
this will give you a list of elements with two properties each: Name and Count. If you want to receive only that names, which occurred more than once you can add where g.Count() > 1 between group and select statement.
I am not a C# programmer, but hence, I can give you the algorithm then you can try to apply it on your program.
Define an array of struct, the struct must have two fields: ElmntName and NbOccurance.
struct Elmnt
{ String ElmntName;
Int NbOccurance;
} MyElement;
Then each element you pass thorugh, pass through your array, if the elementwas not found, save it as new element with NbOccurance=0;
Else, if it was found increment the number of occursnces.
At the end of reading your xml, you will get a list containing the names of the elements and their occurances.

Categories