I have been trying to use the HTML Agility Pack to parse HTML into valid XHTML to go into a larger XML file. For the most part this works; however, lists end up formatted like this:
<ul>
<li>item1
<li>item2
</li></li>
</ul>
As opposed to what I would expect:
<ul>
<li>item1</li>
<li>item2</li>
</ul>
Unfortunately, this format with nested li tags doesn't pass the schema validation, which I have no control over. Does anyone know a simple way to correct this, either through the HTML Agility Pack or an alternative, preferably in .NET?
I found an alternative to the Agility Pack called HTML Tidy (http://tidy.sourceforge.net/). I actually used the .NET port, Tidy.NET (http://sourceforge.net/projects/tidynet/), and it seemed to fix my issue.
I found your question on other sites as well. The HTML you are trying to parse is:
<UL>
<LI>NVQ Level 3 in Fabrication and Welding Engineering
<LI>Level 3 Certificate in Engineering
<LI>Level 2 Key Skill in Application of Number
<LI>Level 2 Key Skill in Communication
<LI>Level 2 Key Skill in Information Technology
<LI>Level 2 Key Skill in Working with Others
<LI>Level 2 Key Skill in Improving Own Learning & Performance</LI></UL>
What I notice is that the first <li> is the parent of the other <li> elements.
One approach I would take is to take the first <li> and its text (a text node, as far as HAP is concerned), save its <li> children, remove them from the parent, and then insert them (formatting them along the way) after the parent node.
You might have to do this recursively. Here is a peek at my solution for an HTML sanitizer class: HTML Agility Pack strip tags NOT IN whitelist
HtmlNode ul = _sourceForm.SelectSingleNode("//ul");
HtmlNodeCollection childList = ul.ChildNodes;
Then you can loop through the child list to grab the text elements you are interested in.
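The repair discussed in this thread (close each <li> before the next one starts, and drop the leftover closing tags) can be sketched outside .NET as well. The following is a minimal Python sketch using only the standard library's html.parser; it is not the HTML Agility Pack or Tidy.NET API, and the names ListItemCloser and close_list_items are made up for illustration. It only handles list items; comments and doctypes are dropped, and void elements such as <br> are not closed, so the result is not full XHTML.

```python
from html import escape
from html.parser import HTMLParser


class ListItemCloser(HTMLParser):
    """Re-emit markup so that every <li> is closed before the next one starts."""

    def __init__(self):
        super().__init__()      # convert_charrefs=True: text arrives decoded
        self.out = []
        self.li_open = []       # one flag per open <ul>/<ol>: is an <li> open?

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.li_open.append(False)
        elif tag == "li" and self.li_open:
            if self.li_open[-1]:
                self.out.append("</li>")    # close the previous sibling first
            self.li_open[-1] = True
        rendered = "".join(' %s="%s"' % (k, escape(v or "")) for k, v in attrs)
        self.out.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        if tag == "li" and self.li_open:
            if not self.li_open[-1]:
                return                      # stray </li>, item already closed
            self.li_open[-1] = False
        elif tag in ("ul", "ol") and self.li_open:
            if self.li_open.pop():
                self.out.append("</li>")    # item left open at end of list
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))  # re-escape &, < and >


def close_list_items(html):
    parser = ListItemCloser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


fixed = close_list_items("<UL><LI>item1<LI>item2</LI></LI></UL>")
```

Tag names come back lowercased because html.parser normalizes them, and a bare & in the text is written out as &amp;, which an XML validator also requires.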
I'm trying to have my program sit on a webpage and wait for a specific tagName within an article to appear. The problem is that I need Selenium to check that the article contains two tagNames before clicking it, and that's where I'm stumped. The way I have my code set up right now, it doesn't click anywhere. It just sits on the page, I suspect because there's more than one article with the same main tagName that I'm trying to find. Here's the HTML:
<article>
<div class ="inner-article">
<a href ="/shop/shirts/iycbmgtqw/x9vdawcjg" style="height:150px;">
<img alt="Xrtqh7ar444" height="150" src="//d17ol771963kd3.cloudfront.net/120885/vi/xrTQH7Ar444.jpg" width="150">
</a>
<h1>
EXAMPLE_CODE
</h1>
<p>
EXAMPLE_COLOUR
</p>
</div>
</article>
All other items on this page have an identical class, and some have identical tagNames. I want to search for when there's a specific combination of two tagNames in an article. I realize xPath is an option, but I would like to code it before knowing an xPath, where the name of the item is the only available information.
And here's the code I'm working with at the moment:
driver.Manage().Timeouts().ImplicitlyWait(TimeSpan.FromMinutes(10));
IWebElement test = driver.FindElement(By.TagName(textBox12.Text));
test.Click();
where textBox12.Text is "EXAMPLE_CODE". Am I correct in assuming that WebDriver doesn't click anything because there is more than one element with the tagName "EXAMPLE_CODE", and is there a possible way to make it first look for "EXAMPLE_CODE" and then check the secondary: "EXAMPLE_COLOUR"?
Thanks!!
You are using By.TagName incorrectly. The tag name refers to the type of element you are trying to find: for a link it is 'a', and for a div it is 'div'. The correct way to find a link by tag name would be By.TagName("a").
You need to match on text, and for that you will need to use XPath. Assuming that the code is unique, you could try the following.
XPath to get the code element -- //div[@class='inner-article']/h1[normalize-space(.)='EXAMPLE_CODE']
XPath to get the link of the article that has both the code and the colour -- //div[@class='inner-article'][h1[normalize-space(.)='EXAMPLE_CODE']][p[normalize-space(.)='EXAMPLE_COLOUR']]/a
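The "match the code first, then check the colour" idea can also be written as a plain loop rather than a single XPath. Below is a minimal Python sketch using the standard library's ElementTree on a simplified, well-formed copy of the markup from the question. It is not Selenium code; the href values, the second article and the function name find_article_link are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the page; hrefs are hypothetical.
PAGE = """
<div>
  <article>
    <div class="inner-article">
      <a href="/shop/shirts/aaa/111"/>
      <h1>EXAMPLE_CODE</h1>
      <p>OTHER_COLOUR</p>
    </div>
  </article>
  <article>
    <div class="inner-article">
      <a href="/shop/shirts/aaa/222"/>
      <h1>
        EXAMPLE_CODE
      </h1>
      <p>
        EXAMPLE_COLOUR
      </p>
    </div>
  </article>
</div>
""".strip()


def find_article_link(root, code, colour):
    """Return the href of the first article whose <h1> and <p> both match."""
    for article in root.iter("article"):
        h1 = article.find(".//h1")
        p = article.find(".//p")
        if h1 is None or p is None:
            continue
        # strip() plays the role of normalize-space() in the XPath version
        if (h1.text or "").strip() == code and (p.text or "").strip() == colour:
            link = article.find(".//a")
            return link.get("href") if link is not None else None
    return None


href = find_article_link(ET.fromstring(PAGE), "EXAMPLE_CODE", "EXAMPLE_COLOUR")
```

The first article shares the code but not the colour, so it is skipped; only an article that satisfies both conditions is returned.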
Selenium, NUnit testing, C#, Visual Studio.
How, in Selenium WebDriver, can I locate an element in a page source that looks like the following, and set some text in its <p> tag:
<body contenteditable="true" class="cke_editable cke_editable_themed cke_contents_ltr cke_show_borders" spellcheck="false">
<p></p>
</body>
This is the body tag from a CKEditor component on the page (not the main page <body> element).
Specifically, I need to set some text in the <p> element. What confuses me is that the class attribute is complicated and consists of several strings. I am aware of the command driver.findElement( By.className( "some_class_name" )); but how do I use it in this case to set some text in the <p> element?
If you give the p tag an ID like so
<p id="derp">Text here</p>
You can send text to it using Selenium like this
driver.findElement(By.id("derp")).sendKeys("herp");
Hope this helps!
EDIT: Without adding an ID to the element, you might be able to do something like this
driver.findElement(By.className("some_class_name")).findElement(By.tagName("p")).sendKeys("herp");
If you want the p element then this relative XPath should work.
//body[@class='cke_editable cke_editable_themed cke_contents_ltr cke_show_borders']/p
That is assuming that there is only a single body element with this class attribute.
As you are saying, there is no id usable for location, so you have to come up with a different solution.
Selenium is capable of using css selectors - it's the same scheme found in CSS files to specify to which elements the following styling rules should apply.
One possible locator would be the following:
body.cke_editable.cke_editable_themed.cke_contents_ltr.cke_show_borders > p
Advantage over XPath: CSS selectors treat the classes in the attribute as separate tokens rather than as a single string. An XPath expression that compares against the exact class attribute value would stop matching if another class were added to the attribute. With CSS selectors, you can identify the element by individual classes.
Simplified and boiled down to the class that really describes your editable element, the following should be sufficient:
body.cke_editable > p
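The difference between the two matching styles can be illustrated without a browser. The following minimal Python sketch compares an exact-string check (what an XPath test against the whole class attribute does) with a token check (what a CSS class selector does). The helper names and the extra class cke_focus are hypothetical, chosen only for the example.

```python
ORIGINAL = "cke_editable cke_editable_themed cke_contents_ltr cke_show_borders"
# Same element after some script adds one more (hypothetical) class.
CHANGED = ORIGINAL + " cke_focus"


def exact_attribute_match(class_attr, expected):
    """Exact string comparison, like //body[@class='...'] in XPath."""
    return class_attr == expected


def class_token_match(class_attr, required_classes):
    """Token comparison, like body.cke_editable.cke_show_borders in CSS."""
    return set(required_classes) <= set(class_attr.split())


still_matches_exact = exact_attribute_match(CHANGED, ORIGINAL)
still_matches_tokens = class_token_match(CHANGED, ["cke_editable", "cke_show_borders"])
```

Once the extra class appears, the exact comparison fails while the token comparison still succeeds, which is why the class-based selector is the more robust locator here.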
Imagine the following HTML:
<div>
<b></b>
<div>
<table>...</table>
</div>
</div> <!-- this one -->
...
How could I find the matching closing tag for the first opening div tag? Is there a regex that could find it? I guess this is quite a common requirement, but I'm struggling to find anything straightforward, just full-blown HTML parsers.
No.
Use a full blown HTML parser. There's a reason they exist.
Use Html Agility Pack.
I'm assuming that you have tokenized the HTML tags. Now create a stack: every time you see an opening tag, push it, and every time you see a closing tag, pop, and check that the tag you pop matches the closing tag.
But there are already HTML parsers for this so search for one on codeplex.
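The push/pop idea from this answer can be sketched with the tokenizer that ships with Python's standard library. This is a minimal sketch, not a full HTML parser: it assumes the tag of interest is explicitly closed, treats the usual void elements as having no end tag, and the names MatchingCloseFinder and find_matching_close are made up for illustration.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}

SAMPLE = """<div>
<b></b>
<div>
<table>...</table>
</div>
</div> <!-- this one -->
"""


class MatchingCloseFinder(HTMLParser):
    """Find where the first <target> element is closed, using a tag stack."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []
        self.target_depth = None    # stack depth of the first target tag
        self.close_pos = None       # (line, column) of the tag that closes it

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.stack.append(tag)
        if tag == self.target and self.target_depth is None:
            self.target_depth = len(self.stack)

    def handle_endtag(self, tag):
        if tag in VOID or tag not in self.stack:
            return                  # ignore stray closing tags
        while self.stack:           # pop, tolerating unclosed inner tags
            depth = len(self.stack)
            popped = self.stack.pop()
            if depth == self.target_depth and self.close_pos is None:
                self.close_pos = self.getpos()
            if popped == tag:
                break


def find_matching_close(html, tag):
    finder = MatchingCloseFinder(tag)
    finder.feed(html)
    finder.close()
    return finder.close_pos


position = find_matching_close(SAMPLE, "div")
```

getpos() reports a 1-based line number and a 0-based column for the tag currently being handled, so for the sample the result should point at the start of the last line that carries a closing div.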
Well, you need to have a clear view of the syntax! However, regular expressions are very limited in scope, and I wouldn't recommend using them for multi-line, multi-tag syntax.
Rather, you need to track each tag (open/close) and use a handler to deal with your request. You could use Lex/Yacc tools, but this may be overkill. Depending on the language you use, you may already have modules for this purpose (like HTMLParser in Python).
There's always LINQ to XML if the HTML is well-formed XML and you don't need every little detail.
I need to extract text from web pages, mostly related to business news.
Say the HTML page is as follows:
<html>
<body>
<div>
<p> <span>Desired Content - 1</span></p>
<p> <span>Desired Content - 2</span></p>
<p> <span>Desired Content - 3</span></p>
</div>
</body>
</html>
I have a sample stored in a string that takes me directly to Desired Content - 1, so I can collect that content. But I also need to collect Desired Content - 2 and 3.
What I tried was to start from the current location, i.e. within the span node of Desired Content - 1, use the parent to move out to the enclosing paragraph node, and get its content. But I actually need all of the desired content in the div. How can I do that? You might tell me to go to the div directly using the parent of the parent of the span, but that would be specific to this example; I need a general approach.
News articles mostly have the desired content in a div, and I will land directly on some nested inner node of that div. I need to walk out of those inner nodes only until I encounter a div, and then get its InnerText.
I am using XPath and the HTML Agility Pack.
The XPath I am using is:
variable = doc.DocumentNode.SelectSingleNode("//*[contains(text(),'" + searchData + "')]").ParentNode.ParentNode.InnerText;
Here searchData is a variable holding a sample of Desired Content - 1, used to search the entire body of the web page for the node that contains the news.
What I am thinking of is cleaning up the web pages so they keep only the main tags like HTML, BODY, tables, divs and paragraphs, with no spans or other formatting elements. But some other website might use only spans instead of divs, so I am not sure how to implement this requirement.
The basic requirement is to extract the news content from different web pages (almost 250 different websites), so I cannot write code specific to each page; I need a generic method.
Any ideas appreciated. Thank you.
This XPath expression selects the innermost div element whose string value contains the value of the $searchData variable reference.
//div[contains(.,$searchData)]
[not(.//div[contains(.,$searchData)])]
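To make the "innermost div" condition concrete, here is a minimal Python sketch of the same selection logic using the standard library's ElementTree: a div qualifies if its text contains the search string and none of its descendant divs does. It assumes well-formed markup; the id values and the function name innermost_divs are made up for the example.

```python
import xml.etree.ElementTree as ET

# Well-formed sample; the id attributes are hypothetical labels.
PAGE = """
<html><body><div id="outer"><div id="story">
<p><span>Desired Content - 1</span></p>
<p><span>Desired Content - 2</span></p>
<p><span>Desired Content - 3</span></p>
</div><div id="sidebar"><p>Other links</p></div></div></body></html>
""".strip()


def innermost_divs(root, search):
    """Divs whose text contains `search` and that have no such div inside."""
    def contains(element):
        return search in "".join(element.itertext())

    result = []
    for div in root.iter("div"):
        if not contains(div):
            continue
        inner = list(div.iter("div"))[1:]   # iter() yields the div itself first
        if not any(contains(d) for d in inner):
            result.append(div)
    return result


matches = innermost_divs(ET.fromstring(PAGE), "Desired Content - 1")
story_text = "".join(matches[0].itertext())
```

The outer div also contains the search text, but it is rejected because the story div inside it contains the text too, so the result is the closest enclosing div and its text includes all three content paragraphs.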
Found the answer myself...
Using a while loop to move up until I find a div parent and then getting its InnerText works.
{ // Select the desired node, move up until you reach a div, and then get its inner text.
    node = hd.DocumentNode.SelectSingleNode("//*[contains(text(),'" + searchData + "')]"); // Find the desired node.
    while (node != null && node.Name != "div") // Move up until the node itself is the enclosing div.
    {
        node = node.ParentNode;
    }
    Body = node != null ? node.InnerText : null; // null when nothing matched or no enclosing div exists.
}
I'm importing some data from another test/bug tracking tool into TFS, and I would like to convert its description, which is in simple HTML, to a plain string in which the 'layout' of the HTML is preserved.
For example:
<body>
<ol>
<li>Log on with user Acme &amp; Co.</li>
<li>Navigate to the details tab</li>
<li>Check the official name</li>
</ol>
<br>
<br>
Expected Result:<br>
official name is filled in<br>
<br>
Actual Result:<br>
The &amp;-sign is not shown correctly<br>
See attachement.
</body>
Would become plain text with newlines inserted and HTML-entities translated like:
1. Log on with user Acme & Co.
2. Navigate to the details tab
3. Check the official name
Expected Result:
official name is filled in
Actual Result:
The &-sign is not shown correctly
See attachment
I can currently replace some tags with newlines using a regex and strip the rest, but replacing the HTML entities and handling things like <ol> and <ul> feels like re-inventing something (a browser?). So I was wondering whether someone has done this before me; I couldn't find anything using Google.
Rather than regex, you could try loading it into the HTML Agility Pack. If it were XHTML, then an XSLT transformation might be a good option.
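As a rough illustration of the parser-based route, here is a minimal Python sketch using the standard library's html.parser instead of the HTML Agility Pack. It numbers <ol> items, turns <br> and the end of list items and paragraphs into newlines, and relies on the parser to decode entities. The class name TextRenderer, the "* " bullet for unordered lists and the whitespace handling are assumptions of this sketch, not an established conversion.

```python
import re
from html.parser import HTMLParser


class TextRenderer(HTMLParser):
    """Render simple HTML as plain text, keeping list numbers and line breaks."""

    def __init__(self):
        super().__init__()      # convert_charrefs=True decodes &amp; and friends
        self.parts = []
        self.counters = []      # per open list: item count for <ol>, None for <ul>

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self.counters.append(0)
        elif tag == "ul":
            self.counters.append(None)
        elif tag == "li":
            if self.counters and self.counters[-1] is not None:
                self.counters[-1] += 1
                self.parts.append("%d. " % self.counters[-1])
            else:
                self.parts.append("* ")
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("ol", "ul") and self.counters:
            self.counters.pop()
        elif tag in ("li", "p"):
            self.parts.append("\n")

    def handle_data(self, data):
        # Source line breaks are layout only; collapse them to single spaces.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    renderer = TextRenderer()
    renderer.feed(html)
    renderer.close()
    lines = "".join(renderer.parts).split("\n")
    return "\n".join(line.strip() for line in lines).strip()


SAMPLE = """<body>
<ol>
<li>Log on with user Acme &amp; Co.</li>
<li>Navigate to the details tab</li>
</ol>
<br>
Expected Result:<br>
official name is filled in<br>
</body>"""

text = html_to_text(SAMPLE)
```

For the shortened sample this yields the two numbered steps, a blank line, and then the two "Expected Result" lines, with the ampersand decoded.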
In the end, once I got more comfortable with TFS, I customized the work item type to include a new HTML Field, and just copied the contents into that field.
This solution was so much better, because we could now see the intended formatting of the field.