HtmlAgilityPack - Getting div content - c#

I am trying to get a list of the users currently online on a forum. Here is what the HTML looks like:
<!-- logged-in users -->
<div id="wgo_onlineusers" class="wgo_subblock section">
<h3 class="blocksubhead"><img src="images/metro/red/misc/users_online.png" alt="Currently Active Users" />Currently Active Users</h3>
<div>
<p>There are currently 3 users online. <span class="shade">3 members and 0 guests</span></p>
<p>Most users ever online was 23, 01-06-2013 at <span class="time">12:09 PM</span>.</p>
<ol class="commalist" id="wgo_onlineusers_list">
<li><a class="username" href="http://website.com/member.php?u=13"><span class="vip_username">Duncanrp</span></a>, </li>
<li><a class="username" href="http://website.com/member.php?u=17"><span class="regular_username">Jessica</span></a></li>
</ol>
</div>
</div>
<!-- end logged-in users -->
Is it possible to get each individual user that is online using the HtmlAgilityPack? The users are formatted using the <li> tags.
Code I have attempted:
HtmlAgilityPack.HtmlDocument htmlDocument = new HtmlAgilityPack.HtmlDocument();
htmlDocument.LoadHtml("http://www.vizor.us/forum.php");
List<string> onlineUsers = new List<string>();
foreach (HtmlAgilityPack.HtmlNode selectNode in htmlDocument.DocumentNode.SelectNodes("//li/a[@class='username']"))
{
onlineUsers.Add(selectNode.InnerText);
}
Thanks.

Try
HtmlWeb web = new HtmlWeb();
// Load downloads and parses the page; LoadHtml expects markup, not a URL
HtmlDocument htmlDocument = web.Load("http://vizor.us/forum.php");
List<string> onlineUsers = new List<string>();
// SelectNodes returns null when nothing matches, so check before iterating
var nodes = htmlDocument.DocumentNode.SelectNodes("//li/a[@class='username']");
if (nodes != null) {
foreach (HtmlNode selectNode in nodes) {
onlineUsers.Add(selectNode.InnerText);
}
}
where the string passed to Load is the URL of the website you are parsing.
For an explanation of the code, please review the documentation at http://htmlagilitypack.codeplex.com/
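
The XPath can also be tried offline against the markup from the question. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name OnlineUsersDemo and the helper ExtractUserNames are illustrative names, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class OnlineUsersDemo
{
    // Hypothetical helper: returns the text of every <a class="username"> directly inside an <li>.
    public static List<string> ExtractUserNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // LoadHtml parses a string of markup, not a URL

        var users = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//li/a[@class='username']");
        if (nodes == null)
        {
            return users;
        }

        foreach (HtmlNode node in nodes)
        {
            users.Add(node.InnerText.Trim());
        }
        return users;
    }

    public static void Main()
    {
        // Trimmed copy of the list from the question.
        const string html = @"
<ol class=""commalist"" id=""wgo_onlineusers_list"">
<li><a class=""username"" href=""http://website.com/member.php?u=13""><span class=""vip_username"">Duncanrp</span></a>, </li>
<li><a class=""username"" href=""http://website.com/member.php?u=17""><span class=""regular_username"">Jessica</span></a></li>
</ol>";

        foreach (string user in ExtractUserNames(html))
        {
            Console.WriteLine(user);
        }
    }
}
```

Keeping the parsing in a helper that takes a string makes it easy to test the XPath on saved markup before pointing the code at the live page.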

Related

Selenium C# List

Hello Stack Overflow users,
I have a website with 99 list elements.
The only difference between the elements is the name.
<li class="_6e4x5">
<div class="_npuc5">
<div class="_f5wpw">
<div class="_eryrc">
<div class="_2nunc">
<a class="_2g7d5 notranslate _o5iw8" title="Name1" href="/Name1/">Name1</a>
</div>
</div>
</div>
</div>
</li>
[...]
<li class="_6e4x5">
<div class="_npuc5">
<div class="_f5wpw">
<div class="_eryrc">
<div class="_2nunc">
<a class="_2g7d5 notranslate _o5iw8" title="Name99" href="/Name99/">Name99</a>
</div>
</div>
</div>
</div>
</li>
What I want:
I want to take the "title" of each list element and put it in a new list.
What I tried:
List<string> following = new List<string>();
By name = By.XPath("//div[@class='_2nunc']");
IJavaScriptExecutor js = driver as IJavaScriptExecutor;
IList<IWebElement> displayedOptions = driver.FindElements(name);
foreach (IWebElement option in displayedOptions)
{
string temp = displayedOptions[i].ToString();
following.Add(temp);
i++;
}
If I run the code, I only get the element ID, and not the "title" (Name34, for example). I hope you have enough information to help me with my problem. Thanks in advance for any help!
To take the title of each list element and put it in a new list, you can use the following code block:
List<string> following = new List<string>();
IList<IWebElement> displayedOptions = driver.FindElements(By.XPath("//li[@class='_6e4x5']//a[@class='_2g7d5 notranslate _o5iw8']"));
foreach (IWebElement option in displayedOptions)
{
string temp = option.GetAttribute("title");
following.Add(temp);
}
You're looking to get the a element's title attribute. The Selenium IWebElement interface has a GetAttribute method that you can use to get the title of each element.
foreach (IWebElement option in displayedOptions)
{
following.Add(option.GetAttribute("title"));
}

fetching span value from html document

I have the following XPath, fetched using a Firefox XPath plugin:
id('some_id')/x:ul/x:li[4]/x:span
Using HtmlAgilityPack I'm able to fetch id('some_id')/x:ul/x:li[4]:
htmlDoc.DocumentNode.SelectNodes(@"//div[@id='some_id']/ul/li[4]").FirstOrDefault();
but I don't know how to get the span value.
Update:
<div id="some_id">
<ul>
<li></li>
<li></li>
<li></li>
<li>
Some text
<span>text I want to grab</span>
</li>
</ul>
</div>
You don't need to parse the HTML with LINQ2XML; HtmlAgilityPack is made for this, and it is easier to obtain the node in the following way:
var html = @" <div id=""some_id"">
<ul>
<li></li>
<li></li>
<li></li>
<li>
Some text
<span>text I want to grab</span>
</li>
</ul>
</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var value = doc.DocumentNode.SelectSingleNode("div[@id='some_id']/ul/li/span").InnerText;
Console.WriteLine(value);
An alternative approach (without html-agility-pack) would be to use LINQ2XML. You can use the XDocument.Descendants method to take the span element and read its value:
var xml = @" <div id=""some_id"">
<ul>
<li></li>
<li></li>
<li></li>
<li>
Some text
<span>text I want to grab</span>
</li>
</ul>
</div>";
var doc = XDocument.Parse(xml);
Console.WriteLine(doc.Root.Descendants("span").FirstOrDefault().Value);
The code can be extended to check whether the div element has the matching id, using the XElement.Attribute method:
var doc = XDocument.Parse(xml);
Console.WriteLine(doc.Elements("div").Where (e => e.Attribute("id").Value == "some_id").Descendants("span").FirstOrDefault().Value);
One drawback of this solution is that the XML structure (HTML, XHTML) needs to be properly closed or else the parsing will fail.
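
To illustrate that drawback, here is a minimal sketch using only System.Xml.Linq. The helper name TryGetFirstSpan is hypothetical; it returns null when XDocument.Parse rejects the markup, which happens as soon as a tag such as <br> is left unclosed.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class WellFormedDemo
{
    // Hypothetical helper: returns the value of the first <span>,
    // or null if the markup has no span or is not well-formed XML.
    public static string TryGetFirstSpan(string markup)
    {
        try
        {
            XDocument doc = XDocument.Parse(markup);
            XElement span = doc.Descendants("span").FirstOrDefault();
            return span == null ? null : span.Value;
        }
        catch (XmlException)
        {
            // Typical real-world HTML (unclosed tags, bare ampersands) ends up here.
            return null;
        }
    }

    public static void Main()
    {
        string wellFormed = "<div id=\"some_id\"><ul><li>Some text <span>text I want to grab</span></li></ul></div>";
        string notWellFormed = "<div><span>text I want to grab</span><br></div>"; // <br> is never closed

        Console.WriteLine(TryGetFirstSpan(wellFormed) ?? "(parse failed)");
        Console.WriteLine(TryGetFirstSpan(notWellFormed) ?? "(parse failed)");
    }
}
```

If the input comes from arbitrary web pages rather than XHTML you control, a tolerant HTML parser is usually the safer choice.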

C# - Convert HTML unordered list to array

My HTML string looks like this and is stored in a variable named sourceCode:
<ul class="yom-list col first" style="width:33.333333333333%">
<li class="first">
<a href="/india/andaman-and-nicobar-islands/">
<span>Andaman and Nicobar Islands</span>
</a>
</li>
<li>
<a href="/india/jammu-and-kashmir/">
<span>Jammu and Kashmir</span>
</a>
</li>
<li class="last">
<a href="/india/andhra-pradesh/">
<span>Andhra Pradesh</span>
</a>
</li>
<li>
<a href="/india/jammu-and-kashmir/">
<span>Jammu and Kashmir</span>
</a>
</li>
</ul>
I want to convert it into a generic List
so that I can access the data inside it in my code, such as the href, the name, etc.
I have tried something like this:
foreach (Match match in Regex.Matches(sourceCode, @"<li><a href=""(?<url>[^""])</a></li>"))
items.Add(new Item()
{
name = match.Groups["span"].Value, // i don't know how to get value inside that span
url = match.Groups["url"].Value,
});
But it does not work; probably the regex is wrong. Can anyone tell me what I am doing wrong?
Note: I can't use HTMLAgilityPack in this project
Try the regex below, which captures the href value and the <span> text only when they appear inside an <li> tag.
/<li>\s*<a href=\"(?<url>[^"]*)\">\s*<span>(?<span>[^<]*)<\/span>/m
Your C# code would be:
Regex rgx = new Regex(@"<li>\s*<a href=""(?<url>[^""]*)"">\s*<span>(?<span>[^<]*)</span>");
foreach (Match m in rgx.Matches(input))
{
Console.WriteLine(m.Groups["url"].Value);
Console.WriteLine(m.Groups["span"].Value);
}
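
Note that the pattern above requires a bare <li>, so list items written as <li class="first"> or <li class="last"> are not matched. The sketch below is one possible variation that tolerates attributes on the <li> and <a> tags and fills a generic list. The Item class with Name and Url properties and the ListParser.Parse helper are assumptions made for illustration, modeled on the question's code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical item type, modeled on the one used in the question.
public class Item
{
    public string Name { get; set; } = "";
    public string Url { get; set; } = "";
}

public static class ListParser
{
    // <li\b[^>]*> accepts <li> with or without attributes (but not <link>);
    // [^>]* after the href does the same for extra attributes on the <a> tag.
    private static readonly Regex ItemPattern = new Regex(
        @"<li\b[^>]*>\s*<a\s+href=""(?<url>[^""]*)""[^>]*>\s*<span>(?<name>[^<]*)</span>",
        RegexOptions.IgnoreCase);

    public static List<Item> Parse(string sourceCode)
    {
        var items = new List<Item>();
        foreach (Match m in ItemPattern.Matches(sourceCode))
        {
            items.Add(new Item
            {
                Name = m.Groups["name"].Value.Trim(),
                Url = m.Groups["url"].Value
            });
        }
        return items;
    }

    public static void Main()
    {
        // Shortened version of the markup from the question.
        string sourceCode = @"<ul class=""yom-list col first"">
<li class=""first"">
<a href=""/india/andaman-and-nicobar-islands/"">
<span>Andaman and Nicobar Islands</span>
</a>
</li>
<li>
<a href=""/india/jammu-and-kashmir/"">
<span>Jammu and Kashmir</span>
</a>
</li>
</ul>";

        foreach (Item item in Parse(sourceCode))
        {
            Console.WriteLine(item.Name + " -> " + item.Url);
        }
    }
}
```

This still assumes the href is the first attribute of the <a> tag and that the <span> contains no nested markup; a regex of this kind is only reasonable for small, predictable fragments.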

c# - reading HTML?

I'm developing a program in C# and I need some help. I'm trying to create an array or a list of the items that are displayed on a certain website. What I'm trying to do is read the anchor text and its href. So for example, this is the HTML:
<div class="menu-1">
<div class="items">
<div class="minor">
<ul>
<li class="menu-item">
<a class="menu-link" title="Item-1" id="menu-item-1"
href="/?item=1">Item 1</a>
</li>
<li class="menu-item">
<a class="menu-link" title="Item-1" id="menu-item-2"
href="/?item=2">Item 2</a>
</li>
<li class="menu-item">
<a class="menu-link" title="Item-1" id="menu-item-3"
href="/?item=3">Item 3</a>
</li>
<li class="menu-item">
<a class="menu-link" title="Item-1" id="menu-item-4"
href="/?item=4">Item 4</a>
</li>
<li class="menu-item">
<a class="menu-link" title="Item-1" id="menu-item-5"
href="/?item=5">Item 5</a>
</li>
</ul>
</div>
</div>
</div>
So from that HTML I would like to read this:
string[,] array = {{"Item 1", "/?item=1"}, {"Item 2", "/?item=2"},
{"Item 3", "/?item=3"}, {"Item 4", "/?item=4"}, {"Item 5", "/?item=5"}};
The HTML is an example I had written, the actual site does not look like that.
As others have said, HtmlAgilityPack is the best choice for HTML parsing. Also be sure to download HAP Explorer from the HtmlAgilityPack site and use it to test your selects. The following SelectNodes call gets all anchors whose id starts with menu-item-:
HtmlDocument doc = new HtmlDocument();
doc.Load(htmlFile);
var myNodes = doc.DocumentNode.SelectNodes("//a[starts-with(@id,'menu-item-')]");
foreach (HtmlNode node in myNodes)
{
Console.WriteLine(node.Id);
}
If the HTML is valid XML, you can load it using the XmlDocument class and then access the pieces you want using XPath, or you can use an XmlReader as Adriano suggests (a bit more work).
If the HTML is not valid XML I'd suggest to use some existing HTML parsers - see for example this - that worked OK for us.
You can also use HtmlAgilityPack.
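
For the valid-XML route mentioned above, a minimal sketch with XmlDocument and XPath might look like this. It assumes the markup is well-formed, and the names MenuReader and ReadMenu are illustrative only. Each result is a two-element array holding the anchor text and its href.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class MenuReader
{
    // Hypothetical helper: returns { text, href } pairs for every anchor
    // whose id starts with "menu-item-". Throws XmlException on malformed input.
    public static List<string[]> ReadMenu(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        var result = new List<string[]>();
        XmlNodeList anchors = doc.SelectNodes("//a[starts-with(@id,'menu-item-')]");
        foreach (XmlNode anchor in anchors)
        {
            XmlAttribute href = anchor.Attributes["href"];
            result.Add(new[] { anchor.InnerText.Trim(), href == null ? "" : href.Value });
        }
        return result;
    }

    public static void Main()
    {
        // Shortened, well-formed version of the menu from the question.
        string xhtml = @"<ul>
<li class=""menu-item""><a class=""menu-link"" title=""Item-1"" id=""menu-item-1"" href=""/?item=1"">Item 1</a></li>
<li class=""menu-item""><a class=""menu-link"" title=""Item-1"" id=""menu-item-2"" href=""/?item=2"">Item 2</a></li>
</ul>";

        foreach (string[] pair in ReadMenu(xhtml))
        {
            Console.WriteLine(pair[0] + " | " + pair[1]);
        }
    }
}
```

A list of pairs is used here instead of the string[,] from the question because the number of anchors is not known in advance.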
I think this case is simple enough to use a regular expression, like <a.*title="([^"]*)".*href="([^"]*)":
string strRegex = @"<a.*title=""([^""]*)"".*href=""([^""]*)""";
RegexOptions myRegexOptions = RegexOptions.None;
Regex myRegex = new Regex(strRegex, myRegexOptions);
string strTargetString = ...;
foreach (Match myMatch in myRegex.Matches(strTargetString))
{
if (myMatch.Success)
{
// Use the groups matched
}
}

Parsing HTML and counting tags with C#

Suppose I have a block of HTML in a string:
<div class="nav mainnavs">
<ul>
<li><a id="nav-questions" href="/questions">Questions</a></li>
<li><a id="nav-tags" href="/tags">Tags</a></li>
<li><a id="nav-users" href="/users">Users</a></li>
<li><a id="nav-badges" href="/badges">Badges</a></li>
<li><a id="nav-unanswered" href="/unanswered">Unanswered</a></li>
</ul>
</div>
How can I parse the HTML and count the number of instances of a specific type of tag, such as <div> or <li>?
You can use HtmlAgilityPack for this. The latest version supports LINQ, so this is straightforward:
For a local html file:
HtmlDocument doc = new HtmlDocument();
doc.Load(@"test.html");
int liCount = doc.DocumentNode.Descendants("li").Count(); //returns 5
From the web:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://stackoverflow.com");
int liCount = doc.DocumentNode.Descendants("li").Count();
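
Since the question starts from HTML held in a string, the same count can be taken with LoadHtml instead of loading a file or a URL. This is a minimal sketch assuming the HtmlAgilityPack NuGet package is referenced; TagCounter and CountTags are illustrative names.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class TagCounter
{
    // Hypothetical helper: counts every element with the given tag name in an HTML string.
    public static int CountTags(string html, string tagName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse markup that is already in memory
        return doc.DocumentNode.Descendants(tagName).Count();
    }

    public static void Main()
    {
        // The navigation block from the question.
        string html = @"<div class=""nav mainnavs"">
<ul>
<li><a id=""nav-questions"" href=""/questions"">Questions</a></li>
<li><a id=""nav-tags"" href=""/tags"">Tags</a></li>
<li><a id=""nav-users"" href=""/users"">Users</a></li>
<li><a id=""nav-badges"" href=""/badges"">Badges</a></li>
<li><a id=""nav-unanswered"" href=""/unanswered"">Unanswered</a></li>
</ul>
</div>";

        Console.WriteLine("li:  " + CountTags(html, "li"));
        Console.WriteLine("div: " + CountTags(html, "div"));
    }
}
```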
