Regular Expressions select whole outer DIV

I've been trying for hours to solve this problem. I want to use regular expressions to select whole divs, including any nested divs. See the example string below:
AA <div> Text1 </div> BB <div style=\"vertical-align : middle;\"> Text2 <div>Text 3</div> </div> CC
I want to return the following values:
<div> Text1 </div>
<div style=\"vertical-align : middle;\"> Text2 <div>Text 3</div> </div>
The closest I've got is the following pattern, but it just gives me each DIV tag separately:
(?<BeginTag><\s*div.*?>)|(?<EndTag><\s*/\s*div.*?>)
Any help would be great.

To expand on my rather snarky comment, a regex is not a good tool for parsing any kind of HTML. It is feasible only in the simplest of scenarios, and even then I would not recommend it.
What you need is a good tool for parsing HTML. In the .NET world, a nice library for this is HtmlAgilityPack, or perhaps the SgmlReader project.
You do need to invest a little bit of time in learning the API, but it will be worth it.
For the little fragment you are showing, I think the easiest API for you will be SgmlReader. It can read HTML as if it were XML, which means you can load it into an XDocument and use a much nicer API. The code for that could look like this:
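// Requires: using System; using System.IO; using System.Xml.Linq;
// plus a reference to the third-party SgmlReader library.
string markup = "<html>AA <div> Text1 </div> BB <div style=\"vertical-align : middle;\"> Text2 <div>Text 3</div> </div> CC</html>";
XDocument doc;
using (var reader = Sgml.SgmlReader.Create(new StringReader(markup)))
    doc = XDocument.Load(reader);
// Elements() returns direct children only, so the nested div stays inside its parent.
var rootLevelDivs = doc.Root.Elements("div");
foreach (var div in rootLevelDivs)
    Console.WriteLine(div);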
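For a fragment that happens to be well-formed XML, like the one in the question, the same idea can be tried with nothing but System.Xml.Linq. This is only a sketch under that assumption: the OuterDivDemo class, the TopLevelDivs helper and the wrapping <root> element are made up for illustration, and real-world HTML will usually need SgmlReader or HtmlAgilityPack instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class OuterDivDemo
{
    // Returns the markup of each top-level <div> in the fragment.
    // Assumption: the fragment is well-formed XML (all tags closed, attributes quoted).
    public static List<string> TopLevelDivs(string fragment)
    {
        // Wrap in a hypothetical <root> so loose text and several divs can sit side by side.
        XElement root = XElement.Parse("<root>" + fragment + "</root>", LoadOptions.PreserveWhitespace);

        // Elements("div") yields direct children only; nested divs stay inside their parent.
        return root.Elements("div")
                   .Select(div => div.ToString(SaveOptions.DisableFormatting))
                   .ToList();
    }

    public static void Main()
    {
        string markup = "AA <div> Text1 </div> BB <div style=\"vertical-align : middle;\"> Text2 <div>Text 3</div> </div> CC";
        foreach (string div in TopLevelDivs(markup))
            Console.WriteLine(div);
    }
}
```

XElement.Parse throws an XmlException on markup that is not well-formed, which is a quick way to find out whether a real HTML parser is needed.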

Related

Scraping from a div

I am experimenting with web scraping and I am having trouble scraping a particular value out of some nested div classes. I am using the .NET HtmlAgilityPack class library in a .NET Framework C# Console App. Here is the div code:
<div class="ds-nearby-schools-list">
<div class="ds-school-row">
<div class="ds-school-rating">
<div class="ds-gs-rating-8">
<span class="ds-hero-headline ds-schools-display-rating">8</span>
<span class="ds-rating-denominator ds-legal">/10</span>
</div>
</div>
<div class="ds-nearby-schools-info-section">
<a class="ds-school-name ds-standard-label notranslate" href="https://www.greatschools.org/school?id=00870&state=MD" rel="nofollow noopener noreferrer" target="_blank">Candlewood Elementary School</a>
<ul class="ds-school-info-section">
<li class="ds-school-info">
<span class="ds-school-key ds-body-small">Grades:</span>
<span class="ds-school-value ds-body-small">K-5</span>
</li>
<li class="ds-school-info">
<span class="ds-school-key ds-body-small">Distance:</span>
<span class="ds-school-value ds-body-small">0.8 mi</span>
</li>
</ul>
</div>
</div>
</div>
I want to scrape the "8" from the span with the ds-hero-headline ds-schools-display-rating classes. I am having trouble formulating the selector for the SelectNodes method on the DocumentNode object of the HtmlDocument class.

I guess you are having trouble writing the XPath to select the node. Try //*[contains(@class, 'ds-hero-headline') and contains(@class, 'ds-schools-display-rating')] with the SelectNodes method.
However, this XPath could be a problem if the page you're targeting also has a class name like ds-hero-headline-content, which ds-hero-headline partially matches. In that case, see the solution in How can I find an element by CSS class with XPath?
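The partial-match problem can be seen with a small experiment. The sketch below uses System.Xml on a tiny well-formed snippet rather than HtmlAgilityPack, and the HasClass and Count helpers are hypothetical names. It compares the plain contains() test with the common XPath 1.0 idiom of padding the class attribute with spaces so that only whole class tokens match.

```csharp
using System;
using System.Xml;

public static class ClassTokenXPathDemo
{
    // Builds a predicate that matches a whole class token, not a substring.
    // Assumption: className contains no quote characters.
    public static string HasClass(string className) =>
        "contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')";

    // Counts the nodes an XPath expression selects in a well-formed XML string.
    public static int Count(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        string xml =
            "<root>" +
            "<span class=\"ds-hero-headline ds-schools-display-rating\">8</span>" +
            "<span class=\"ds-hero-headline-content\">other</span>" +
            "</root>";

        // Substring test: also hits ds-hero-headline-content.
        Console.WriteLine(Count(xml, "//*[contains(@class, 'ds-hero-headline')]"));

        // Token test: only the span that really carries the class.
        Console.WriteLine(Count(xml, "//*[" + HasClass("ds-hero-headline") + "]"));
    }
}
```

The predicate string is plain XPath 1.0, so it should be usable inside a SelectNodes call in HtmlAgilityPack as well.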

I would use this XPath to extract "0.8 mi":
//div[@class='ds-nearby-schools-list']/div[@class='ds-school-row']/div[@class='ds-nearby-schools-info-section']/ul[@class='ds-school-info-section']/li[@class='ds-school-info']/span[@class='ds-school-value ds-body-small' and preceding-sibling::span[@class='ds-school-key ds-body-small' and text()='Distance:']]/text()
Then this regex to group the data:
^[0-9\.]+ (.*)$
Finally, you can convert the matched text and store the distance in an object.
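The grouping and conversion step might look like the sketch below. Note that it uses a slightly different pattern from the one above: it adds a capture group around the number so that both the value and the unit can be read. The DistanceDemo class and its Parse method are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class DistanceDemo
{
    // Splits text such as "0.8 mi" into a numeric value and a unit.
    // Variant of the pattern above with an extra group around the number.
    public static (double Value, string Unit) Parse(string text)
    {
        Match m = Regex.Match(text.Trim(), @"^([0-9.]+) (.*)$");
        if (!m.Success)
            throw new FormatException("Unexpected distance format: " + text);

        // InvariantCulture so that "0.8" parses the same way on every machine.
        double value = double.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
        return (value, m.Groups[2].Value);
    }

    public static void Main()
    {
        var (value, unit) = Parse("0.8 mi");
        Console.WriteLine(value.ToString(CultureInfo.InvariantCulture) + " | " + unit);
    }
}
```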

Have you tried the following to get the 8? You can search for the specific span element by its class name and read its inner text.
Note: I loaded the HTML from your question from a text file.
string htmlFile = File.ReadAllText(@"TempFile.html");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlFile);
HtmlNode htmlDoc = doc.DocumentNode;
HtmlNode node = htmlDoc.SelectSingleNode("//span[@class='ds-hero-headline ds-schools-display-rating']");
Console.WriteLine(node.InnerText);
// output: 8
Alternate:
Another way is to specify the path that you want the value from, starting from the div element.
HtmlNode node2 = htmlDoc.SelectSingleNode("//div[@class='ds-gs-rating-8']//span[@class='ds-hero-headline ds-schools-display-rating']");
Console.WriteLine(node2.InnerText);
// output: 8

Selenium : xpath following-sibling where siblings have more children

I hope I can describe my problem in a comprehensible way.
I have HTML that looks like this:
<div class="class-div">
<label class="class-label">
<span class="class-span">AAAA</span>
</label>
<div class="class-div-a">
<textarea class="class-textarea">
</textarea>
</div>
</div>
<div class="class-div">
<label class="class-label">
<span class="class-span">BBBB</span>
</label>
<div class="class-div-a">
<textarea class="class-textarea">
</textarea>
</div>
</div>
I want the XPath for the textarea whose label text is AAAA, so that I can populate it with a value in Selenium.
So something like this...
wait.Until(ExpectedConditions.ElementIsVisible(
    By.XPath("//div[@class='class-div']/label[@class='class-label'][span[@class='class-span' and text()='AAAA']]/following-sibling::div[@class='class-div-a']/textarea[@class='class-textarea']"))).SendKeys(valueTextArea);

The problem could be in the wait condition, ExpectedConditions.ElementIsVisible.
The thing is that your <textarea> may not be 'visible' in the Selenium sense. Visibility means that the element is present in the DOM (which is true) and that its size is greater than 0px, which could be false for your <textarea> element. In Java you would use ExpectedConditions.presenceOfElementLocated() instead of ExpectedConditions.visibilityOfElementLocated(); I'm not sure what the C# equivalent is, but you get the picture.
Try it and see if it solves your problem.

Let me quickly rephrase the question to make sure I understand: you need an XPath to find the textarea associated with the label whose text is AAAA.
You'll have to go back up the tree in this case. Here are a couple of ways I might do that, although your XPath looks correct:
Using ancestor to be clear about which element you're moving up to (better IMO):
By.XPath("//label/span[text()='AAAA']/ancestor::div[@class='class-div']//textarea");
Or just moving back up the tree with ..:
By.XPath("//label/span[text()='AAAA']/../../..//textarea");
If your XPath already finds the element, use asikojevic's answer. The C# method is ExpectedConditions.ElementExists(By).
****UPDATE****
Based on your comment about a trailing space after the text value, here is another XPath that should find the textarea in that case, using contains instead of text()=.
By.XPath("//label/span[contains(text(),'AAAA')]/ancestor::div[@class='class-div']//textarea");

Clear raw HTML from malicious data in C#

I'm writing an ASP.NET MVC app. Some pieces of HTML come from users and some from third-party sources. Is there an easy and fast enough way to clean HTML without heavy artillery like HAP (Html Agility Pack) or Tidy?
I just need to remove scripts, styles, <object>/<embed>, href="javascript:", style= and onclick, and I don't think removing them manually via .Remove/.Replace is a good approach, even with StringBuilder.
For example, if I have the following input
<html>
<style src="http://harmyourpage.com"/>
<script src="http://killyourdog.com"/>
<div>
Good link
Bad link
<p>Some text <b>to</b> test</p><br/>
<h1 style="position:absolute;">Damage your layout</h1>
And an image there <img src="http://co.com/a.jpg"/><br>
<span onclick="harm()">Good span with bad attribute</span>
<object>Your lovely java can be there</object>
</div>
</html>
which must be converted into the following:
<div>
Good link
<a>Bad link</a>
<p>Some text <b>to</b> test</p><br/>
<h1>Damage your layout</h1>
And an image there <img src="http://co.com/a.jpg"/><br>
<span>Good span with bad attribute</span>
</div>
So how do I do this the right way, with a whitelist of tags and attributes?
UPD: I tried the StackExchange HtmlHelpers library, but it removes needed tags such as div, a and img.

The fastest way to achieve this is to use a regular expression:
var regex = new Regex(
    "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)|(\\<object(.+?)\\</object\\>)",
    RegexOptions.Singleline | RegexOptions.IgnoreCase
);
string output = regex.Replace(input, "");
You can also use the Microsoft Web Protection Library (http://wpl.codeplex.com/) for the same purpose:
Sanitizer.GetSafeHtmlFragment(input);

Get data from HTML child class

I’m attempting to create a tool, in C#, which gathers and analyses data from a web page/form. There are basically 2 different types of data. Data entered by a user and data created by the system (I don’t have access to).
The data created by the user is kept in fields and the form uses IDs - so GetElementByID is used.
The problem I’m running into is obtaining the data created by the system. It shows on the form, but isn’t associated to an ID. I may be reading/interpreting the HTML incorrectly, but it appears to be a child class (I don’t have much HTML experience). I’m attempting to get the “Date Submitted” data (near the bottom of the code). Sample of the HTML code:
<div class="bottomSpace">
<div class="importfromanotherorder">
<div class="level2Panel" >
<div class="left">
<span id="if error" class="error"></span>
</div>
<div class="right">
Enter Submission ID
<input name="Submission$ID" type="text" id="Submission_ID" class="textbox" />
<input type="submit" name="SumbitButton" value="Import" id="SubmitButton" />
</div>
</div>
</div>
</div>
<div class="bottomSpace">
<div class="detailsinfo">
<div class="level2Panel" >
<div class="left">
<h5>Product ID</h5>
1234567
<h5>Sub ID</h5>
Not available
<h5>Product Type</h5>
Type 1
</div>
<div class="right">
<h5>Order Number</h5>
0987654
<h5>Status</h5>
Ordered
<h5>Date Submitted</h5>
7 17 2012 5 45 09 AM
</div>
</div>
</div>
</div>
Using GetElementsByTagName (searching for “div”) and then using GetAttribute(“className”) (searching for “right”) generates some results, but as there are 2 “right” classes, it’s not working as intended.
I’ve tried searching by className = “detailsinfo”, which I can find, but I’m not sure how I could go about getting down to the “right” class. I tried sibling and children, but the results don't appear to be working. The next possible problem is that it appears the date data is actually text belonging to class “right” and not element “Date Submitted” .
So basically, I'm curious as to how the best approach would be to get the data I'm looking for. Would I need to get all of the class “right” text and then try and extract the date string?
Apologies if there is too much info or not enough of the required info :) Thanks in advance!
EDIT: Added how GetElementsByTagName is called using C# - per Icarus's comment.
HtmlDocument doc = webBrowser1.Document;
HtmlElementCollection elemColl = doc.GetElementsByTagName("div");

This will do it if the 'right' instance you want is the second one. Two approaches are given:
The commented-out approach indexes the node list, which is zero-based, so it uses index 1.
The second approach uses XPath positions, which are one-based, so it uses position 2.
private string ReadHTML(string html)
{
    // Note: LoadXml requires well-formed XML with a single root element.
    System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
    doc.LoadXml(html);
    System.Xml.XmlElement element = doc.DocumentElement;
    // This commented-out approach works and might be preferred if you want to iterate
    // over a node set instead of choosing just one node:
    //string key = "//div[@class='right']";
    //System.Xml.XmlNodeList setting = element.SelectNodes(key);
    //return setting[1].LastChild.InnerText;
    // This XPath approach lets you select exactly one node:
    string key = "((//div[@class='right'])[2])/child::text()[last()]";
    System.Xml.XmlNode setting = element.SelectSingleNode(key);
    return setting.InnerText;
}
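If relying on the position of the 'right' div feels brittle, another option is to anchor on the heading text and take the text node that follows it. The sketch below is only an illustration under a few assumptions: the markup is well-formed XML, the fragment is wrapped in a made-up <root> element so that several top-level divs can be loaded, and DateSubmittedDemo and ReadValueAfterHeading are hypothetical names.

```csharp
using System;
using System.Xml;

public static class DateSubmittedDemo
{
    // Returns the text that directly follows <h5>heading</h5>, or null if not found.
    // Assumptions: well-formed markup, and a heading without quote characters.
    public static string ReadValueAfterHeading(string fragment, string heading)
    {
        var doc = new XmlDocument();
        // Wrap in a hypothetical <root> because LoadXml needs a single root element.
        doc.LoadXml("<root>" + fragment + "</root>");

        string xpath = "//h5[normalize-space(.)='" + heading + "']/following-sibling::text()[1]";
        XmlNode text = doc.SelectSingleNode(xpath);
        return text?.Value.Trim();
    }

    public static void Main()
    {
        string fragment =
            "<div class=\"right\">" +
            "<h5>Order Number</h5> 0987654 " +
            "<h5>Status</h5> Ordered " +
            "<h5>Date Submitted</h5> 7 17 2012 5 45 09 AM " +
            "</div>";

        Console.WriteLine(ReadValueAfterHeading(fragment, "Date Submitted"));
    }
}
```

This reads the date as the text node that follows the heading, which matches the observation in the question that the date belongs to the 'right' div and not to the h5 element itself.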

How can I extract just text from the html

I have a requirement to extract all the text that is present in the <body> of the html. Sample Html input :-
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src="abc.jpg"/>
</body>
</html>
The output should be :-
This is a big title. How are doing you? I am fine
I want to use only HtmlAgility for this purpose. No regular expressions please.
I know how to load an HtmlDocument and get the body contents with an XPath query like '//body'. But how do I strip the HTML as shown in the output?
Thanks in advance :)

You can use the body's InnerText:
string html = @"
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src=""abc.jpg""/>
</body>
</html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
Next, you may want to collapse spaces and new lines:
text = Regex.Replace(text, @"\s+", " ").Trim();
Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, because the tags are simply removed. That issue is difficult to solve, as display is often determined by the CSS, not just by the markup.
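The whitespace-collapsing step can be tried on its own, without HtmlAgilityPack. In the sketch below the input string is a hand-written stand-in for what InnerText might return for the sample body, and WhitespaceDemo and Collapse are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Replaces every run of whitespace (spaces, tabs, newlines) with one space
    // and trims the ends.
    public static string Collapse(string text) =>
        Regex.Replace(text, @"\s+", " ").Trim();

    public static void Main()
    {
        // Hand-written approximation of the body's InnerText, not real parser output.
        string innerText = "\n This is a big title.\n How are doing you?\n I am fine \n\n";
        Console.WriteLine(Collapse(innerText));
    }
}
```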

How about using the XPath expression '//body//text()' to select all text nodes?

You can use NUglify, which supports text extraction from HTML:
var result = Uglify.HtmlToText("<div> <p>This is <em> a text </em></p> </div>");
Console.WriteLine(result.Code); // prints: This is a text
As it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors) and it is very fast (no regexp involved, just a pure recursive descent parser; faster than HtmlAgilityPack and more GC friendly).

Normally for parsing html I would recommend a HTML parser, however since you want to remove all html tags a simple regex should work.
