This question already has answers here:
RegEx match open tags except XHTML self-contained tags
I want a regular expression to remove the following:
<a class="a" href="a.com">string</a>
If the tag has a class attribute, I want the whole tag removed (<a class="a" href="a.com"></a>) and the string between the tags kept (string); otherwise the tag should be left as it is.
I suggest using an HTML parser like the HTML Agility Pack instead of trying to do this with RegEx - RegEx is not a good tool for parsing general HTML, as this answer explains.
The download comes with a bunch of Visual Studio projects as examples for usage.
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
Html Agility Pack now supports Linq to Objects (via a LINQ to Xml Like interface). Check out the new beta to play with this feature
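For the question above, a parser-based version might look like the following. This is only a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name AnchorUnwrapper and the method name UnwrapClassedAnchors are made up for illustration.

```csharp
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class AnchorUnwrapper // hypothetical helper, not part of the library
{
    // Replaces every <a> element that has a class attribute with its inner text.
    // Anchors without a class attribute are left untouched.
    public static string UnwrapClassedAnchors(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@class]");
        if (anchors != null)
        {
            foreach (var a in anchors)
            {
                // Swap the whole element for a text node holding only its text.
                var text = doc.CreateTextNode(a.InnerText);
                a.ParentNode.ReplaceChild(text, a);
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling AnchorUnwrapper.UnwrapClassedAnchors on the snippet from the question should leave just the text between the tags, while an anchor that has no class attribute stays in the output.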
Since you want to parse HTML, it is far better to use a proper HTML or XML parser, as others have already recommended.
But since you want regex, I've come up with this: http://regexr.com?2vuqs
<([^ ]+)([ \t]+[a-zA-Z-]+=(["'])[^\3]+?\3)*[ \t]+class=(["'])[^\4]+?\4([ \t]+[a-zA-Z-]+=(["'])[^\6]+?\6)*>([^<]+)</(\1)>
It is not foolproof, but it should handle most situations. Check the link to see it working.
Related
We are moving an e-commerce website to a new platform and because all of their pages are static html and they do not have all their product information in a database, we must scrape their current website for the product descriptions.
Here is one of the pages: http://www.cabinplace.com/accrugsbathblackbear.htm
What is the best way to get the description into a string? Should I use the HTML Agility Pack, and if so, how would this be done? I am new to the HTML Agility Pack and to XHTML in general.
Thanks
The HTML Agility Pack is a good library to use for this kind of work.
You did not indicate if all of the content is structured this way nor if you have already gotten the kind of fragment you posted from the HTML files, so it is difficult to advise further.
In general, if all pages are structured similarly, I would use an XPath expression to select the paragraph and read its InnerHtml or InnerText on each page.
Something like the following:
var description = htmlDoc.DocumentNode.SelectNodes("//p[@class='content_txt']")[0].InnerText;
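A slightly fuller version of the same idea, as a sketch only: it assumes the HtmlAgilityPack NuGet package is referenced and that the description really sits in a paragraph with the class content_txt, as in the line above. The class and method names are made up for illustration.

```csharp
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class DescriptionExtractor // hypothetical helper
{
    // Returns the text of the first <p class="content_txt">, or null if there is none.
    public static string GetDescription(string html)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing,
        // so check before reading InnerText.
        var node = htmlDoc.DocumentNode.SelectSingleNode("//p[@class='content_txt']");
        if (node == null)
        {
            return null;
        }

        // InnerText keeps entities such as &amp; as written; DeEntitize decodes them.
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

To work from a live page rather than a string, the document can be loaded with new HtmlWeb().Load(url) instead of LoadHtml; the rest stays the same.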
Also,
If you need a good tool for testing or finding XPath expressions for the HAP, you can use this one:
HTML-Agility-xpath-finder. It is built on the same library, so an XPath expression that works in this tool can safely be used in your code.
I have been struggling with this for a while. I can't figure out how to write a regex that returns only the value attribute of a particular HTML tag. Any help is greatly appreciated.
Are you trying to parse HTML with regex? Don't, or you will continue to struggle. Use SgmlReader or the HTML Agility Pack for this purpose.
RegEx is not a good solution for parsing unstructured (or unknown) HTML.
See this SO post for compelling reasons why this is the case.
I suggest using a parser such as the HTML Agility Pack and querying the parsed document.
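Querying the parsed document for an attribute could look roughly like this. It is a sketch under assumptions: the HtmlAgilityPack NuGet package is referenced, and the tag in question is an input element identified by its name attribute, which the question does not actually state. The class and method names are hypothetical.

```csharp
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class AttributeReader // hypothetical helper
{
    // Returns the value attribute of the first <input> whose name matches,
    // an empty string if the element has no value attribute,
    // or null if no such element exists.
    // Assumption: inputName contains no quote characters, since it is
    // spliced directly into the XPath expression.
    public static string GetInputValue(string html, string inputName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var input = doc.DocumentNode.SelectSingleNode("//input[@name='" + inputName + "']");

        // GetAttributeValue returns the supplied default when the attribute is missing.
        return input?.GetAttributeValue("value", string.Empty);
    }
}
```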
This is just a general question. Currently I am scraping web pages using regex, but it is sometimes too difficult to work out the regular expression. Is XSL/XPath an alternative to regex in C#?
Also, I would like to know if there are more advanced techniques for webpage scraping other than the two listed above. Thanks.
You may take a look at SgmlReader or Html Agility Pack which are HTML parsing libraries for .NET.
An easy way to gather data from a web page is WebsiteParser. It is based on the Html Agility Pack, and you simply describe your properties using attributes and CSS selectors.
GitHub here
I am using HttpWebRequest to put a remote web page into a string, and I want to make a list of all its script tags (and their contents) for parsing.
What is the best method to do this?
The best method is to use an HTML parser such as the HTML Agility Pack.
From the site:
It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
Sample applications:
Page fixing or generation. You can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it.
Web scanners. You can easily get to img/src or a/hrefs with a bunch XPATH queries.
Web scrapers. You can easily scrap any existing web page into an RSS feed for example, with just an XSLT file serving as the binding. An example of this is provided.
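For the script-tag question, a minimal sketch could look like this, assuming the HtmlAgilityPack NuGet package is referenced and the page is already in a string. ScriptCollector and GetScriptTags are made-up names.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class ScriptCollector // hypothetical helper
{
    // Returns the full markup (tag plus contents) of every <script> element in the page.
    public static List<string> GetScriptTags(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var scripts = doc.DocumentNode.SelectNodes("//script");
        if (scripts == null)
        {
            return result;
        }

        foreach (var script in scripts)
        {
            // Use InnerHtml instead of OuterHtml to get only the script body.
            result.Add(script.OuterHtml);
        }

        return result;
    }
}
```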
Use an XML parser to get all the script tags with their content.
Like this one: simple xml
Basically I want to extract keywords or words or tokens that are present in the webpage after removing the stopwords. Does anybody know how to do this? Code in C# would be appreciated.
Use an HTML parsing library like the HTML Agility Pack.
Once you load an HTML document with it, you can query it with XPath syntax; it exposes the HTML in a similar way to an XmlDocument.
The HTML Agility Pack that Oded mentions will help you get at the plain text inside the HTML, but to extract keywords from the webpage after removing the stopwords you'll need to do more work. There's a good informative answer from Joseph Turian to this question: How do I extract keywords used in text?
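As a starting point, the "get the plain text, then drop the stopwords" part might be sketched as below. Assumptions: the HtmlAgilityPack NuGet package is referenced, the stop-word list is a tiny illustrative one rather than a real list, and "keyword" here just means any remaining distinct word. No ranking is attempted. The names KeywordExtractor and ExtractKeywords are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class KeywordExtractor // hypothetical helper
{
    // Tiny illustrative stop-word list; a real one would be much longer.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on", "with" };

    // Returns the distinct lower-cased words of the page text, minus stop words.
    public static List<string> ExtractKeywords(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Drop script and style elements so their contents are not treated as text.
        var nonText = doc.DocumentNode.SelectNodes("//script|//style");
        if (nonText != null)
        {
            foreach (var node in nonText)
            {
                node.Remove();
            }
        }

        // Join text nodes with spaces so words in adjacent elements do not run together.
        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
        {
            return new List<string>();
        }
        var text = HtmlEntity.DeEntitize(string.Join(" ", textNodes.Select(n => n.InnerText)));

        // Tokenize on runs of letters and digits, then filter out the stop words.
        return Regex.Matches(text.ToLowerInvariant(), @"[\p{L}\p{Nd}]+")
            .Cast<Match>()
            .Select(m => m.Value)
            .Where(w => !StopWords.Contains(w))
            .Distinct()
            .ToList();
    }
}
```

Choosing which of the remaining words actually count as keywords (by frequency, TF-IDF, or similar) is the extra work mentioned above and is not covered by this sketch.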