Is there any way to use DOMXPath in C# like PHP for web scraping? Or is there any alternative way to do the same?
You can use the HTML Agility Pack - it is an HTML parser and supports querying using XPath.
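For example, here is a minimal sketch of the equivalent of PHP's DOMXPath::query, assuming the HtmlAgilityPack NuGet package (the URL and XPath expression are placeholders):

    using HtmlAgilityPack;

    // HtmlWeb downloads and parses a page in one step.
    var doc = new HtmlWeb().Load("http://example.com/");

    // SelectNodes takes an XPath expression, much like DOMXPath::query in PHP.
    var links = doc.DocumentNode.SelectNodes("//a[@href]");
    if (links != null)                     // null when nothing matched
        foreach (var link in links)
            System.Console.WriteLine(link.GetAttributeValue("href", ""));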
Related
We are moving an e-commerce website to a new platform. Because all of their pages are static HTML and they do not have all their product information in a database, we must scrape their current website for the product descriptions.
Here is one of the pages: http://www.cabinplace.com/accrugsbathblackbear.htm
What is the best way to get the description into a string? Should I use the HTML Agility Pack? If so, how would this be done? I am new to the HTML Agility Pack and XHTML in general.
Thanks
The HTML Agility Pack is a good library to use for this kind of work.
You did not indicate whether all of the content is structured this way, nor whether you have already extracted the kind of fragment you posted from the HTML files, so it is difficult to advise further.
In general, if all pages are structured similarly, I would use an XPath expression to extract the paragraph and pick the InnerHtml or InnerText from each page.
Something like the following:
    var description = htmlDoc.DocumentNode.SelectNodes("//p[@class='content_txt']")[0].InnerText;
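As a fuller sketch for the page above, assuming the description really does sit in a p element with class content_txt (verify this against the actual markup first):

    using HtmlAgilityPack;

    var doc = new HtmlWeb().Load("http://www.cabinplace.com/accrugsbathblackbear.htm");

    // SelectSingleNode returns null when nothing matches, so guard against that.
    var node = doc.DocumentNode.SelectSingleNode("//p[@class='content_txt']");
    string description = node != null ? node.InnerText : string.Empty;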
Also, if you need a good tool for testing or finding the XPath to use with the HAP, try HTML-Agility-xpath-finder. It is built with the same library, so if you find an XPath in this tool you can be confident it will work in your code.
I have written a few programs over the last few months that load HTML pages into a string and do various things, like extracting bits and pieces. I was basically writing my own GUI for some websites which have no API.
I've done this by stringing together many String.Substring(), String.IndexOf(), and String.LastIndexOf() statements.
I realise this is probably not the best way to do it - I was just writing a few "quick-and-dirty" trials to begin with.
What is the proper way to extract tokens from a web page?
Thanks :)
For XHTML, load it into XmlDocument or XDocument.
For (non-X)HTML, load it into the HTML Agility Pack's HtmlDocument - the API is almost the same as XmlDocument, so it should be familiar.
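For example (the file names here are placeholders):

    using System.Xml.Linq;    // XDocument, for well-formed XHTML
    using HtmlAgilityPack;    // HtmlDocument, for real-world HTML

    // XHTML is valid XML, so the standard XML APIs load it directly.
    XDocument xhtml = XDocument.Load("page.xhtml");

    // Plain HTML is rarely well-formed; the Agility Pack tolerates that,
    // and the querying API mirrors XmlDocument (SelectSingleNode, etc.).
    var html = new HtmlDocument();
    html.Load("page.html");
    var title = html.DocumentNode.SelectSingleNode("//title");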
Use Html Agility Pack
This is just a general question. Currently I am doing webpage scraping using regex, but I find it is sometimes too difficult to figure out the regular expression, so I am wondering: is XSL/XPath an alternative to regex in C#?
Also, I would like to know if there are more advanced techniques for webpage scraping other than the two listed above. Thanks.
You may take a look at SgmlReader or Html Agility Pack which are HTML parsing libraries for .NET.
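As a sketch of the SgmlReader route (assuming the SgmlReader NuGet package), which turns loose HTML into well-formed XML so XPath or LINQ to XML can replace the regular expressions:

    using System.IO;
    using System.Xml.Linq;
    using Sgml;

    using (var input = new StreamReader("page.html"))
    using (var sgmlReader = new SgmlReader())
    {
        sgmlReader.DocType = "HTML";
        sgmlReader.CaseFolding = CaseFolding.ToLower;
        sgmlReader.InputStream = input;

        // The reader now behaves like any XmlReader over cleaned-up HTML.
        XDocument doc = XDocument.Load(sgmlReader);
        // From here, query with LINQ to XML or System.Xml.XPath instead of regex.
    }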
An easy way to gather data from a web page is WebsiteParser. It is based on the Html Agility Pack, and you simply describe your properties using attributes and CSS selectors.
Github here
It seems to me that just using the HTML Agility Pack would work to prevent XSS (parse, then get InnerText). Would it be redundant to use AntiXSS after using the HAP?
Thanks,
rod.
Apples and oranges.
The HTML Agility Pack is a tool to parse HTML and work with the resulting parsed document.
The AntiXSS Library is a tool to use on your HTML and website to prevent XSS.
Comparing the two does not make much sense to me.
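To illustrate the division of labour, here is a sketch assuming both packages are referenced (Encoder.HtmlEncode is from Microsoft.Security.Application in the AntiXSS library):

    using HtmlAgilityPack;
    using Microsoft.Security.Application;   // AntiXSS

    string untrustedInput = "<b onmouseover=alert(1)>hi</b>";

    // Parsing: the Agility Pack reads the markup and hands you the text.
    var doc = new HtmlDocument();
    doc.LoadHtml(untrustedInput);
    string text = doc.DocumentNode.InnerText;

    // Output encoding: AntiXSS makes the value safe to render as HTML.
    string safeHtml = Encoder.HtmlEncode(text);

Parsing alone is not an output-encoding step, which is why the two are not interchangeable.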
Do you know of any extension for the HTML Agility Pack that allows querying the HtmlDocument object (created by HAP) in jQuery style (instead of XPath)?
Yes, check http://code.google.com/p/fizzler/
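A minimal sketch of Fizzler's jQuery-style selectors over an HAP document (the extension methods live in the Fizzler.Systems.HtmlAgilityPack namespace):

    using HtmlAgilityPack;
    using Fizzler.Systems.HtmlAgilityPack;   // adds QuerySelectorAll to HtmlNode

    var doc = new HtmlDocument();
    doc.LoadHtml("<div class='content'><p>Hello</p></div>");

    // CSS selectors instead of XPath.
    foreach (var node in doc.DocumentNode.QuerySelectorAll("div.content > p"))
        System.Console.WriteLine(node.InnerText);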
It seems unlikely that such a library exists. You can always go the LINQ to XML route, though.