XPATH query, HtmlAgilityPack and Extracting Text - c#

I had been trying to extract links from a class called "tim_new" . I have been given a solution as well.
Both the solution, snippet and necessary information is given here
The said XPATH query was "//a[#class='tim_new'], my question is, how did this query differentiate between the first line of the snippet (given in the link above and the second line of the snippet).
More specifically, what is the literal translation (in English) of this XPATH query.
Furthermore, I want to write a few lines of code to extract the text written against NSE:
<div class="FL gL_12 PL10 PT15">BSE: 523395 | NSE: 3MINDIA | ISIN: INE470A01017</div>
Would appreciate help in forming the necessary selection query.
My code is written as:
IEnumerable<string> NSECODE = doc.DocumentNode.SelectSingleNode("//div[#NSE:]");
But this doesnt look right. Would appreciate some help.

The XPath in the first selection reads "select all document elements that have an attribute named class with a value of tim_new". The stuff in brackets is not what you're returning, it's the criteria you're applying to the search.
I don't have the HTML Agility pack, but if you are trying to query the divs that have "NSE:" as its text, your XPath for the second query should just be "//div" then you'll want to filter using LINQ.
Something like
var nodes =
doc.DocumentNode.SelectNodes("//div[text()]").Where(a => a.InnerText.IndexOf("NSE:") > -1);
So in English, "Return all the div elements that immediately contain text to LINQ, then check that the inner text value contains NSE:".
Again, I'm not sure the syntax is perfect, but that's the idea.
The XPath "//div[#NSE:]" would return all divs that have and attribute named, NSE:, which would be illegal anyway because ":" isn't allowed in an attribute name. Youre looking for the text of the element, not one of its attributes.
Hope that helps.'
Note: If you have nested divs that both contain text as in <div>NSE: some text<div>NSE: more text</div></div> you're going to get duplicate results.

Related

Parsing Html tags using c#

I have html code:
<p>Answer1</p>
<h2>Category1</h2>
<p>Answer2</p>
<p>Answer3</p>
I need to do parsing so that each answer (p) belongs to the category(h2) above.
If nothing is above, then the category will be null.
Look like this :
obj1.category = null; obj1.answer = "Answer1";
obj2.category ="Category1"; obj2.answer = "Answer2";
obj3.category ="Category1"; obj3.answer = "Answer3";
I tried to solve this, but it was useless.
Use HTMLAgilityPack. It will parse HTML and allow you do use LINQ to SELECT whatever you need from the DOM structure.
In addition to HTMLAgilityPack, I've also written a light weight HTML parse for C#.
There's no big secret to the technique, but it's sort of detailed work. You just go through the text character by character and pull out HTML elements.
My parser is on Github as HtmlMonkey.
UPDATE:
I just added support for fairly advanced selectors to easily find nodes within a parsed document.

Select "src" value with XPath to HtmlAgilityPack

I'm on a development process of a crawling engine. My program crawls websites through Xpath with HtmlAgilityPack. I need to get some image src tag's directly. You can see my simple code below which is not working correctly, thanks in advice!
PS: Please ignore " char problem, XPath patterns are provided by database.
Agility.DocumentNode.SelectSingleNode("//img[#id="product_photo"]/#src");
And this is the line i need to crawl (the *...* part shows block to extract
<img id="product_photo" src="*/images/thumb/4400/10280/st.jpg*">
Some pages provide image in meta tags so .Attributes["src"] wont work.
UPDATE: You can see my query and result here
You cann't get the value of "src" or any other attributes in using:
Agility.DocumentNode.SelectSingleNode(yourXpath);
Just by using:
string s=Agility.DocumentNode.SelectSingleNode(yourXpath).value;
It's because XPath cann't return value of an attribute by SelectSingleNode() func in HtmlAgilityPack class. So you must use SelectSingleNode(yourXpath).value or use Regex after the pharsing to get just the "src" without the outerText.

How do I find a HTML div contains specific text after a text prefix?

I have following string:
<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4
and want to know wether it contains text3 inside divs that go after prefix:
prefix<div>...text3...</div>
but I don't know how ta make regex for that, since I can't use [^<]+ because div's can contain strong tag inside.
Please help
EDIT:
Div tags after prefix are guaranted to be not nested
Language is C#
Text4 is very long, so regex must not look after closing div
EDIT2: I don't want to use html parser, it can be easily (and MUCH faster) achieved with Regex. HTML there is simple: no attributes in tags; no nesting div's. And even some % of wrong answers are acceptable in my case.
If you turn off the "greedy" option, you should be able to just use something like prefix<div>.*text3.*</div>. (If the <div> is allowed to have attributes, use prefix<div[^>]*>.*text3.*</div> instead.)
Numerous improvements could be made to this in order to take account of unusual spacing, >s within quotes, </div> within quotes, etc.
Patterns like prefix<div>...<div></div>text3</div> would be more difficult. You might have to capture all of the occurrences of the div tag so that you could count how many div tags were open at a given time.
EDIT: Oops, turning off the greedy option won't always give the right result, even in examples other than the one above. Probably better just to capture all occurrences of the div tag and go from there. As noted above by Peter, HTML is not a regular language and so you can't use regular expressions to do everything you might want with it.
this is my new regex:
prefix<div>([^<]*<(?!/div>))*[^<]*text3([^<]*<(?!/div>))*[^<]*</div>
seems to work ok.
For C# + HtmlAgilityPack you can do something like:
InputString = Regex.Replace(InputString,"^(?:[^<]+?|<[^>]*>)*?prefix","");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(InputString);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[contains('text3')]");
The prefix removal is still not a good way of dealing with it. Ideally you'd do something like using HtmlAgilityPack to find where prefix occurs in the DOM, translate that to provide the position in the string, then do a substring(pos,len) (or equivalent) to look at only the relevant text (you can also avoid looking at text4 using similar method).
I'm afraid I can't translate all that into code right now; hopefully someone else can help there.
(original answer, before extra details provided)
Here is a JavaScript + jQuery solution:
var InputString = '<div>text0 </div> prefix <div>text1 <strong>text2</strong> text3 </div> text4';
InputString = InputString.replace(/^.*?prefix/,'');
var MatchingDivs = jQuery('div:contains(text3)','<div>'+InputString+'</div>')
console.log(MatchingDivs.get());
This makes use of jQuery's ability to accept a context as second argument (though it appears this needs to be wrapped in div tags to actually work).

How can I parse the information from this XML?

this is an example of the XML I want to scrape:
http://www.dreamincode.net/forums/xml.php?showuser=335389
Notice that the contactinformation tag has many contact elements, each similar but with different values.
For example, the element that has the AIM content in it, how can I get the content of the Value tag that's in the same family as the AIM content element?
That's where I'm stuck. Thanks!
Basically: I need to find the AIM content tag, make a note of where it is, and find the Value element within that same family. Hope this makes the question clearer
LINQToXML
var doc = XDocument.Load(#"http://www.dreamincode.net/forums/xml.php?showuser=335389");
var aimElements = doc.Descendants("contact").Where(a=>a.Element("title").Value == "AIM").Select(a=>a.Element("value").Value);
this will give you a list of strings that hold the value of the value element for a contact that has the title AIM, you can do a First() or a FirstOrDefault if you believe there should only be 1
Using an xpath like the one below will get you the contact/value node where contact/title is "AIM":
/ipb/profile/contactinformation/contact[title='AIM']/value
Have you tried to parse the XML rather than "scraping" it?

How to find a repeated string and the value between them using regexes?

How would you find the value of string that is repeated and the data between it using regexes? For example, take this piece of XML:
<tagName>Data between the tag</tagName>
What would be the correct regex to find these values? (Note that tagName could be anything).
I have found a way that works that involves finding all the tagNames that are inbetween a set of < > and then searching for the first instance of the tagName from the opening tag to the end of the string and then finding the closing </tagName> and working out the data from between them. However, this is extremely inefficient and complex. There must be an easier way!
EDIT: Please don't tell me to use XMLReader; I doubt I will ever use my custom class for reading XML, I am trying to learn the best way to do it (and the wrong ways) through attempting to make my own.
Thanks in advance.
You can use: <(\w+)>(.*?)<\/\1>
Group #1 is the tag, Group #2 is the content.
Using regular expressions to parse XML is a terrible error.
This is efficient (it doesn't parse the XML into a DOM) and simple enough:
string s = "<tagName>Data between the tag</tagName>";
using (XmlReader xr = XmlReader.Create(new StringReader(s)))
{
xr.Read();
Console.WriteLine(xr.ReadElementContentAsString());
}
Edit:
Since the actual goal here is to learn something by doing, and not to just get the job done, here's why using regular expressions doesn't work:
Consider this fairly trivial test case:
<a><b><a>text1<b>CDATA<![<a>text2</a>]]></b></a></b>text3</a>
There are two elements with a tag name of "a" in that XML. The first has one text-node child with a value of "text1", and the second has one text-node child with a value of "text3". Also, there's a "b" element that contains a string of text that looks like an "a" element but isn't because it's enclosed in a CDATA section.
You can't parse that with simple pattern-matching. Finding <a> and looking ahead to find </a> doesn't begin to do what you need. You have to put start tags on a stack as you find them, and pop them off the stack as you reach the matching end tag. You have to stop putting anything on the stack when you encounter the start of a CDATA section, and not start again until you encounter the end.
And that's without introducing whitespace, empty elements, attributes, processing instructions, comments, or Unicode into the problem.
You can use a backreference like \1 to refer to an earlier match:
#"<([^>]*)>(.*)</\1>"
The \1 will match what was captured by the first parenthesized group.
with Perl:
my $tagName = 'some tag';
my $i; # some line of XML
$i =~ /\<$tagName\>(.+)\<\/$tagname\>/;
where $1 is now filled with the data you captured
Going forward, if you get stuck check out regexlib.com
It's the first place I go when i get stuck on regex

Categories