I'm trying to find the second div with the same class on a page. I only retrieve the first one when fetching the data and cannot figure out how to get the second, third, and so on.
HtmlAgilityPack.HtmlDocument data = web.Load(URL);
var res = data.DocumentNode.SelectSingleNode("//div[@class='col-sm-5']");
Also, I'm using two slashes at the start. I don't know why, but it worked. I've seen numerous different variants ("/", "./", "//", ".//"). Could someone explain the difference, please?
Thanks in advance,
xolo
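Try SelectNodes, which returns every matching node instead of only the first:
var res = data.DocumentNode.SelectNodes("//div[@class='col-sm-5']");
The second div is then res[1], the third res[2], and so on.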
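This is the difference between single and double slash:
/
starts selection from the document node
allows you to create 'absolute' path expressions
e.g. "/html/body/p" matches the paragraph elements that are direct children of body
//
starts selection matching anywhere in the document
allows you to create path expressions that do not depend on where in the tree the element sits
e.g. "//p" matches all the paragraph elements
A leading dot ("./" or ".//") applies the same step relative to the current node instead of the document root.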
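To make the difference concrete, here is a minimal sketch. It assumes well-formed sample markup (made up for this example) so that the built-in XmlDocument can stand in for HtmlAgilityPack; the XPath expressions themselves are the same ones you would pass to HtmlAgilityPack's SelectNodes.

```csharp
using System;
using System.Xml;

public static class XPathSlashDemo
{
    // Hypothetical sample markup, kept well-formed so XmlDocument can parse it.
    public const string Sample =
        "<html><body>" +
        "<div class='col-sm-5'>first</div>" +
        "<section><div class='col-sm-5'>second</div></section>" +
        "<div class='col-sm-5'>third</div>" +
        "</body></html>";

    // Runs an XPath query against the sample, starting at the document node.
    public static XmlNodeList Query(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        return doc.SelectNodes(xpath);
    }

    public static void Main()
    {
        // "/" walks an exact path from the root: only divs directly under body.
        Console.WriteLine(Query("/html/body/div[@class='col-sm-5']").Count); // 2

        // "//" matches at any depth, in document order.
        XmlNodeList all = Query("//div[@class='col-sm-5']");
        Console.WriteLine(all.Count);        // 3
        Console.WriteLine(all[1].InnerText); // second

        // XPath positions are 1-based, so this also picks the second match.
        Console.WriteLine(Query("(//div[@class='col-sm-5'])[2]")[0].InnerText); // second

        // ".//" searches only below the context node (here: the section).
        XmlNode section = Query("//section")[0];
        Console.WriteLine(section.SelectNodes(".//div").Count); // 1

        // "//" ignores the context node and searches the whole document again.
        Console.WriteLine(section.SelectNodes("//div").Count);  // 3
    }
}
```

Note the parentheses in (//div[...])[2]: without them, [2] would mean "the second matching div within its own parent" rather than the second match in the whole document.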
Related
So I have a SharePoint site and I have users who submit new items into a SharePoint List. Some fields in the list item contain URLs that reference files or images, e.g. "http://host/abc.jpg" or "/abc.jpg".
In another field, users edit HTML code which may contain any tags such as <a href="/abc.jpg">, <img src="/abc.jpg"> and so on.
My goal is to find fields that contain links/URLs, and extract those URLs that point to something that has a filename plus extension. I have no problem extracting this from the SharePoint fields which may contain either some irrelevant information or the URL (and the URL only) using these two regexes:
//this will match full url e.g. http://localhost/path/a.jpg
var fullUrlRegex =
new Regex(@"^https?:\/\/(?:.*)[\.]+(?:[a-z0-9]{1,4})$");
//this will match an absolute path like //test/files to upload/222.jpg
var absolutePathRegex =
new Regex(@"^\/.*[\.]+(?:[a-z0-9]{1,4})$");
var fullUrlRegexMatch = fullUrlRegex.Match(value);
var absolutePathRegexMatch = absolutePathRegex.Match(value);
//now check which one matched and save the value
However, I am not sure how to approach extracting URLs (both relative and full URLs) from HTML code that users enter in the other field.
Suppose this is the user's input, and I need to extract both links to files from that HTML code.
<p>This is a picture!
And this is a pic too: <img src="/abc.jpg"></p>
The tags can really be anything, not just <a> and <img>. One way I thought I could approach this is to use HTML Agility Pack, but this seems like overkill. Would it be sufficient to regex-search for src="(match this)" and href="(match this)"? Is there anything I might miss?
Your regexes should not contain ^ at the start and $ at the end. These are anchors, so the pattern only matches when the entire input is a URL. See: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx.
Also use the Matches method to get all matches.
Try this regex:
(?<=(href="|src="))[/]*(?:[A-Za-z0-9-._~!$&'()*+,;=:@]|%[0-9a-fA-F]{2})*(?:/(?:[A-Za-z0-9-._~!$&'()*+,;=:@]|%[0-9a-fA-F]{2})*)*
Just add any other valid tags to the list in (href="|src=")
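As a rough sketch of how the Matches approach could look end to end: the class and method names below are made up, the attribute pattern is a simpler one than the regex above, and it assumes attribute values are wrapped in double quotes. It collects every href/src value and keeps only those whose path ends in a filename with an extension.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Captures the double-quoted value of any href or src attribute.
    private static readonly Regex AttributeUrl = new Regex(
        @"(?:href|src)\s*=\s*""(?<url>[^""]*)""",
        RegexOptions.IgnoreCase);

    // Strips "http://host" or "//host" so only the path part is inspected.
    private static readonly Regex SchemeAndHost = new Regex(
        @"^(?:https?:)?//[^/?#]*",
        RegexOptions.IgnoreCase);

    // Last path segment must look like "name.ext" with a 1 to 4 character extension.
    private static readonly Regex FileName = new Regex(
        @"[^/]+\.[a-z0-9]{1,4}$",
        RegexOptions.IgnoreCase);

    public static bool PointsToFile(string url)
    {
        string path = SchemeAndHost.Replace(url, "");
        int cut = path.IndexOfAny(new[] { '?', '#' });
        if (cut >= 0)
            path = path.Substring(0, cut); // ignore query string and fragment
        return FileName.IsMatch(path);
    }

    public static List<string> ExtractFileUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match match in AttributeUrl.Matches(html))
        {
            string url = match.Groups["url"].Value;
            if (PointsToFile(url))
                urls.Add(url);
        }
        return urls;
    }

    public static void Main()
    {
        string html = "<p><a href=\"/docs/\">folder</a> " +
                      "<a href=\"http://host/abc.jpg\">pic</a> " +
                      "<img src=\"/abc.jpg\"></p>";
        foreach (string url in ExtractFileUrls(html))
            Console.WriteLine(url);
    }
}
```

Single-quoted or unquoted attribute values are not covered by this sketch; if the user-entered HTML can contain those, an HTML parser is the safer route.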
I'm working in Xamarin on an Android app that parses XML from this website: http://video.cazin.net/rss.php and populates a ListView. In particular, I have a problem getting the value from this tag:
<media:thumbnail url="http://video.cazin.net/uploads/thumbs/2d07f1e49-1.jpg" width="480" height="360"/>
I created namespace:
xmlNameSpaceManager.AddNamespace("ab", "http://search.yahoo.com/mrss/");
and then tried to get the value from the url attribute:
XmlNodeList xmlNode = document.SelectNodes("rss/channel/item");
if (xmlNode[i].SelectSingleNode("//ab:thumbnail[@url='http://video.cazin.net/rss.php']", xmlNameSpaceManager) != null)
{
var thumbnail = xmlNode[i].SelectSingleNode("//ab:thumbnail[@url='http://video.cazin.net/rss.php']", xmlNameSpaceManager);
feedItem.Thumbnail = thumbnail.Value;
}
I also tried something like this:
//ab:thumbnail/@url
but then I got the value of just the first image. I'm sure the problem is here somewhere, because I have the same code parsing images from another XML tag without a colon in its name and it's working correctly. Has anyone had a similar experience and knows what I should put in those braces? Thanks
Your current query is searching for a thumbnail element where the url attribute is equal to http://video.cazin.net/rss.php - there are none that match this.
Your 'I also tried' query of //ab:thumbnail/@url is closer, but the // means that the query will start from the root of the document, so you get all the urls (but you only take the first).
If you want the element that matches taking the current node context into consideration, you need to include the current node context in the query. This is represented by a dot (.). So .//ab:thumbnail/@url would find all url attributes in a thumbnail element contained by the current node. You can see the result in this fiddle.
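A minimal sketch of the relative query with XmlDocument follows. The two-item feed is made up to mirror the shape of the real one, and the class and method names are only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class ThumbnailReader
{
    // Hypothetical two-item feed shaped like the one in the question.
    public const string Feed =
        "<rss xmlns:media='http://search.yahoo.com/mrss/'><channel>" +
        "<item><title>A</title><media:thumbnail url='http://example.com/a.jpg'/></item>" +
        "<item><title>B</title><media:thumbnail url='http://example.com/b.jpg'/></item>" +
        "</channel></rss>";

    public static List<string> ReadThumbnails(string xml)
    {
        var document = new XmlDocument();
        document.LoadXml(xml);

        // The prefix used in the query ("ab") only has to map to the same
        // namespace URI; it does not have to equal the prefix in the feed.
        var ns = new XmlNamespaceManager(document.NameTable);
        ns.AddNamespace("ab", "http://search.yahoo.com/mrss/");

        var urls = new List<string>();
        foreach (XmlNode item in document.SelectNodes("rss/channel/item"))
        {
            // The leading "." keeps the search inside the current item.
            XmlNode url = item.SelectSingleNode(".//ab:thumbnail/@url", ns);
            if (url != null)
                urls.Add(url.Value);
        }
        return urls;
    }

    public static void Main()
    {
        foreach (string url in ReadThumbnails(Feed))
            Console.WriteLine(url);
    }
}
```

With "//ab:thumbnail/@url" in place of the relative query, every item would report the first thumbnail in the document, which is the symptom described in the question.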
I would strongly suggest you use LINQ to XML instead, however. It's a lot nicer to work with than the old XmlDocument API. For example, you could find all item thumbnail urls using this code:
var doc = XDocument.Load("http://video.cazin.net/rss.php");
XNamespace media = "http://search.yahoo.com/mrss/";
var thumbnailUrls = doc.Descendants("item")
.Descendants(media + "thumbnail")
.Attributes("url");
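If you need each thumbnail together with its own item (for a list view row, say), something along these lines might work. It is only a sketch: the sample feed and the FeedItems name are made up, and it assumes each item has at most one thumbnail as a direct child.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class FeedItems
{
    // Hypothetical two-item feed shaped like the one in the question.
    public const string Feed =
        "<rss xmlns:media='http://search.yahoo.com/mrss/'><channel>" +
        "<item><title>A</title><media:thumbnail url='http://example.com/a.jpg'/></item>" +
        "<item><title>B</title><media:thumbnail url='http://example.com/b.jpg'/></item>" +
        "</channel></rss>";

    public static (string Title, string Thumbnail)[] Read(string xml)
    {
        XNamespace media = "http://search.yahoo.com/mrss/";
        return XDocument.Parse(xml)
            .Descendants("item")
            .Select(item => (
                Title: (string)item.Element("title"),
                // Casting a missing attribute to string yields null instead of throwing.
                Thumbnail: (string)item.Element(media + "thumbnail")?.Attribute("url")))
            .ToArray();
    }

    public static void Main()
    {
        foreach (var entry in Read(Feed))
            Console.WriteLine(entry.Title + ": " + entry.Thumbnail);
    }
}
```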
As we all know, Regex patterns will make your stomach turn the first time you see them (or the 10th time, since you never went in head first and truly learned them. Guilty.). I'm currently reading up on it, but since I'm on a tight deadline I'll check here to see if I can get a quicker and better answer/explanation in the meantime.
I have some url to a forum thread, and I want to scan through the html and find the last page for the thread.
So say I have one of the following urls identifying the thread in question:
https://www.somesite.com/forum/thread-93912 (absolute url to the thread)
/forum/thread-93912 (relative url to the thread)
and I want to get all values (integers) that appear directly (next path) after any of the above "partial" match in the html-document.
So from any of the following hrefs located anywhere in the html-document (the doc is represented as a single string):
https://www.somesite.com/forum/thread-93912/34
https://www.somesite.com/forum/thread-93912/34/morestuffhere/whatevs
/forum/thread-93912/34
/forum/thread-93912/34/somethingheretoo
I want to extract the number 34 (only 34), so I can parse it to int.
EDIT
Okay, to make it simpler:
Say I have all the html in htmlString, and in this string I want to find all numbers x that appear after my inputString /forum/thread-93912.
These all appear in the htmlString, and I want to extract the numbers:
thread-93912/34
thread-93912/14
thread-93912/84
thread-93912/64
thread-93912/4
You don't need regex. Just use System.Uri.Segments
Uri url = new Uri("your url here");
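// For https://www.somesite.com/forum/thread-93912/34 the segments are
// "/", "forum/", "thread-93912/" and "34", so the page number is at index 3.
Console.WriteLine(url.Segments[3].TrimEnd('/'));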
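Here is a small sketch of the Segments approach applied to the URLs from the question. The PageFromUri helper is hypothetical, and it assumes the page number always sits right after the thread segment, which puts it at index 3 for this site's URL layout.

```csharp
using System;

public static class PageFromUri
{
    // Returns the fourth path segment of an absolute URL, without the
    // trailing slash that Segments keeps on every segment but the last.
    public static string PageSegment(string absoluteUrl)
    {
        var uri = new Uri(absoluteUrl);
        return uri.Segments[3].TrimEnd('/');
    }

    public static void Main()
    {
        Console.WriteLine(PageSegment("https://www.somesite.com/forum/thread-93912/34"));
        Console.WriteLine(PageSegment("https://www.somesite.com/forum/thread-93912/34/morestuffhere/whatevs"));

        // Relative links need the site root prepended before they can be parsed this way.
        Console.WriteLine(PageSegment("https://www.somesite.com" + "/forum/thread-93912/34"));
    }
}
```

This only works on one URL at a time, so the hrefs still have to be pulled out of the html-document first.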
\b(\d+)\b(?=[^\d]*$)
Try this. See the demo and grab the capture.
http://regex101.com/r/sU3fA2/55
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main()
{
Regex regex = new Regex(@"\b\d+\b(?=[^\d]*$)");
Match match = regex.Match("/forum/thread-93912/34");
if (match.Success)
{
Console.WriteLine(match.Value);
}
}
}
Since my question was a little hard to explain thoroughly (and since I "changed" my problem a little), I thought I'd add my own answer with the exact code I went with (which I came up with thanks to the other answers here, so I'll give you all an upvote!).
I'm sure this can be made prettier and more compact, but I went for clarity since I'm new to regex!
First, get all strings matching the url + some number (separated with a slash "/"), then extract that number to a group called "page".
Regex regex = new Regex(urlToThread + @"/(?<page>\d+)");
MatchCollection matches = regex.Matches(htmlString);
Then iterate over all matches and extract the "page" value (guaranteed to be an integer), and parse it to an integer. Add all parsed integers to a list and sort when done. The last one will be the greatest (last page).
List<int> pages = new List<int>();
foreach(Match match in matches)
pages.Add(int.Parse(match.Groups["page"].Value));
pages.Sort();
// And here we get the last page
int nrOfPages = pages[pages.Count-1];
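For comparison, a more compact variant of the same idea might look like the sketch below. Two things in it are my own assumptions rather than part of the code above: Regex.Escape is applied so that characters such as the dots in the url are matched literally, and a thread without any page links is treated as having one page.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LastPage
{
    public static int Find(string htmlString, string urlToThread)
    {
        // Escape the url so "." and similar characters are taken literally.
        var regex = new Regex(Regex.Escape(urlToThread) + @"/(?<page>\d+)");

        return regex.Matches(htmlString)
            .Cast<Match>()
            .Select(match => int.Parse(match.Groups["page"].Value))
            .DefaultIfEmpty(1) // assumption: no page links means a single page
            .Max();
    }

    public static void Main()
    {
        string html = "<a href='/forum/thread-93912/34'>34</a> " +
                      "<a href='/forum/thread-93912/84/somethingheretoo'>84</a> " +
                      "<a href='/forum/thread-93912/4'>4</a>";
        Console.WriteLine(Find(html, "/forum/thread-93912")); // 84
    }
}
```

Max avoids building and sorting a list when only the largest value is needed.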
I am trying to get the data between the html (span) provided (in this case 31)
Here is the original code (from inspect elements in chrome)
<span id="point_total" class="tooltip" oldtitle="Note: If the number is black, your points are actually a little bit negative. Don't worry, this just means you need to start subbing again." aria-describedby="ui-tooltip-0">31</span>
I have a rich textbox which contains the source of the page, here is the same code but in line 51 of the rich textbox:
<DIV id=point_display>You have<BR><SPAN id=point_total class=tooltip jQuery16207621750175125325="23" oldtitle="Note: If the number is black, your points are actually a little bit negative. Don't worry, this just means you need to start subbing again.">17</SPAN><BR>Points </DIV><IMG style="FLOAT: right" title="Gain subscribers" border=0 alt="When people subscribe to you, you lose a point" src="http://static.subxcess.com/images/page/decoration/remove-1-point.png"> </DIV>
How would I go about doing this? I have tried several methods and none of them seem to work for me.
I am trying to retrieve the point value from this page: http://www.subxcess.com/sub4sub.php
The number changes depending on who subs you.
You could be incredibly specific about it:
var regex = new Regex(@"<span id=""point_total"" class=""tooltip"" oldtitle="".*?"" aria-describedby=""ui-tooltip-0"">(.*?)</span>");
var match = regex.Match(@"<span id=""point_total"" class=""tooltip"" oldtitle=""Note: If the number is black, your points are actually a little bit negative. Don't worry, this just means you need to start subbing again."" aria-describedby=""ui-tooltip-0"">31</span>");
var result = match.Groups[1].Value;
You'll want to use HtmlAgilityPack to do this, it's pretty simple:
HtmlDocument doc = new HtmlDocument();
doc.Load("filepath");
HtmlNode node = doc.DocumentNode.SelectSingleNode("//span"); //Here, you can also use a predicate such as ("//span[@id='point_total']") to select a specific span
string value = node.InnerText; //this string will contain the value of span, i.e. <span>***value***</span>
Regex, while a viable option, is something you generally would want to avoid if at all possible for parsing html (see Here)
In terms of sustainability, you'll want to make sure that you understand the page source (i.e., refresh it a few times and see if your target span is nested within the same parents after every refresh, make sure the page is in the same general format, etc..., then navigate to the span using the above principle).
There are multiple possibilities.
Regex
Let HTML be parsed as XML and get the value via XPath
Iterate through all elements. If you get on a span tag, skip all characters until you find the closing '>'. Then the value you need is everything before the next opening '<'
Also look at System.Windows.Forms.HtmlDocument
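As an illustration of the regex option, here is a sketch that keys on the id only, so it tolerates both the quoted markup from Chrome and the unquoted, upper-case markup from the rich textbox. The PointTotal class is made up for this example, and the pattern assumes no attribute value inside the tag contains a '>' character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PointTotal
{
    // Matches <span ... id=point_total ...>VALUE</span>, with or without
    // quotes around the id, in any letter case.
    private static readonly Regex SpanById = new Regex(
        @"<span\b[^>]*\bid\s*=\s*[""']?point_total[""']?[^>]*>(?<value>[^<]*)</span>",
        RegexOptions.IgnoreCase);

    // Returns the point total, or null if the span is missing or not a number.
    public static int? Read(string html)
    {
        Match match = SpanById.Match(html);
        int points;
        if (match.Success && int.TryParse(match.Groups["value"].Value.Trim(), out points))
            return points;
        return null;
    }

    public static void Main()
    {
        string fromChrome =
            "<span id=\"point_total\" class=\"tooltip\" aria-describedby=\"ui-tooltip-0\">31</span>";
        string fromTextBox =
            "<DIV id=point_display>You have<BR><SPAN id=point_total class=tooltip>17</SPAN><BR>Points </DIV>";

        Console.WriteLine(Read(fromChrome));  // 31
        Console.WriteLine(Read(fromTextBox)); // 17
    }
}
```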
I know it may be because of my noobness in XPath, but let me ask to make sure, cuz I've googled enough.
I have a website and wanna get the news headings from it: www.farsnews.com (it is Persian)
Using the FireBug and FireXpath extensions under Firefox, and by hand, I extracted and tested multiple XPath expressions that match the headings, such as:
* html/body/div[2]/div[2]/div[2]/div[*]/div[2]/a/div[2]
* .//*[@class="topnewsinfotitle "]
* .//div[@class="topnewsinfotitle "]
I also tested these using XPather extension and they seem to work pretty well, but when I get to test them... the SelectNodes returns null!
Any clue or hint?
here is a chunk of the code:
listBox2.ResetText();
HtmlAgilityPack.HtmlWeb w = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = w.Load("http://www.farsnews.com");
HtmlAgilityPack.HtmlNodeCollection nc = doc.DocumentNode.SelectNodes(".//div[@class=\"topnewsinfotitle \"]");
listBox2.Items.Add(nc.Count+" Items selected!");
foreach (HtmlAgilityPack.HtmlNode node in nc) {
listBox2.Items.Add(node.InnerText);
}
Thanks.
I have tested your expressions. As mentioned by Dialecticus in a comment, you have a trailing space which shouldn't be there.
//div[@class='topnewsinfotitle ']/text()
Returns 'empty sequence', see evaluation: http://xmltools.dk/EQA-ACA6
//div[@class='topnewsinfotitle']/text()
Returns a list of your headlines, see: http://xmltools.dk/EgA2APAj
However, if there could be other classes, you can use this ( http://xmltools.dk/EwA8AJAW ):
//div[contains(@class, 'topnewsinfotitle')]/text()
(I see there is an encoding issue in the links I've provided; however, it shouldn't matter for the meaning. For all the XPath expressions, you can remove /text() to get the nodes instead of only the text.)
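To show how these class tests differ, here is a small sketch on made-up markup. It uses the built-in XmlDocument on a well-formed sample instead of HtmlAgilityPack, but the XPath expressions carry over unchanged. The last expression is a common XPath 1.0 idiom (not taken from the answer above) that matches the class as a whole word.

```csharp
using System;
using System.Xml;

public static class ClassMatchDemo
{
    // Hypothetical sample: one exact class, one with an extra class,
    // and one whose class merely starts with the same letters.
    public const string Sample =
        "<body>" +
        "<div class='topnewsinfotitle'>one</div>" +
        "<div class='topnewsinfotitle big'>two</div>" +
        "<div class='topnewsinfotitlex'>three</div>" +
        "</body>";

    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // Exact comparison: a stray trailing space matches nothing.
        Console.WriteLine(Count("//div[@class='topnewsinfotitle ']")); // 0
        Console.WriteLine(Count("//div[@class='topnewsinfotitle']"));  // 1

        // Substring test: also picks up 'topnewsinfotitlex'.
        Console.WriteLine(Count("//div[contains(@class, 'topnewsinfotitle')]")); // 3

        // Whole-word test: pad the attribute with spaces and look for ' name '.
        Console.WriteLine(Count(
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' topnewsinfotitle ')]")); // 2
    }
}
```

One difference to keep in mind when moving this to HtmlAgilityPack: its SelectNodes returns null when nothing matches, whereas XmlDocument returns an empty list, so check for null before reading Count.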
BUT, if you own this site, you should provide the headlines as XML (maybe RSS or ATOM) or JSON, which will perform better and, most importantly, be more bullet-proof.