Parsing HTML "Visually" - c#

Okay, I am at a loss how to name this question. I have some HTML files, probably written by Lord Lucifer himself, that I need to parse. They consist of many segments like this, among other HTML tags:
<p>HeadingNumber</p>
<p style="text-indent:number;margin-top:neg_num ">Heading Text</p>
<p>Body</p>
Notice that the heading number and text are in separate p tags, aligned on one horizontal line by CSS. The CSS may be whatever Lucifer fancies: a mixture of indents, paddings, margins and positions.
However, that line is a single object in my business model and should be kept as such. So how do I detect whether two p elements are visually on a single line and process them accordingly? I believe the HTML files are well formed, if it helps.

You didn't specify how you were parsing, but this is possible in jQuery, since offset() gives you the position of any element relative to the document.
The code:
$(function() {
    // True when the tops of the two elements are within `tolerance` pixels of each other
    function sameHorizon(obj1, obj2, tolerance) {
        tolerance = tolerance || 0;
        var obj1top = obj1.offset().top;
        var obj2top = obj2.offset().top;
        return Math.abs(obj1top - obj2top) <= tolerance;
    }
    $('p').each(function(i, obj) {
        // css() returns a computed value such as "-18px"; parseFloat drops the unit
        if (parseFloat($(obj).css('margin-top')) < 0) {
            var p1 = $(obj).prev('p');
            var p2 = $(obj);
            var pTol = 4; // pixel tolerance within which elements are considered aligned
            // skip paragraphs that have no preceding <p> sibling
            if (p1.length && sameHorizon(p1, p2, pTol)) {
                // put what you want to do with these objects here
                // I just highlighted them for example
                p1.css('background', '#cc0');
                p2.css('background', '#c0c');
                // but you can manipulate their contents
                console.log(p1.html(), p2.html());
            }
        }
    });
});
This code assumes that a <p> with a negative margin-top is meant to be aligned with the previous <p>. If you know jQuery, it should be apparent how to alter it to meet different criteria.
If you can't use jQuery for your problem, then hopefully this is useful to someone else who can, or you can set something up in jQuery to parse the files and output new markup.

You may run irobotsoft web scraper and have a test:
Open the page in its browser window
Select and mark the line
Use menu: Design -> Practice HTQL and see if it can extract the line.

I don't have a ton of experience using it, but if the HTML is well formed and depending on what format you need your parsed data in, you may be able to treat it as an XML doc and use XQuery to parse out your data.
Also open the HTML in Firefox and use Firebug to see which CSS styles are being applied. It may give you a better clue as to how the HTML is being lined up, although it looks like it's being done with 'margin-top:negative_number'. If that's the case, I think XQuery should be able to find the elements with that particular style applied.
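As a rough C# sketch of the idea in this answer, assuming the files really are well formed enough to load as XML and that the negative margin-top is written in an inline style attribute (as in the question's sample), LINQ to XML can pair each such paragraph with the one before it. The class name HeadingPairs and the regular expression are illustrative assumptions. Rules that come from a stylesheet, or alignment done with positioning, would not be detected this way; those need a real rendering engine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class HeadingPairs
{
    // Matches an inline declaration such as "margin-top:-12px" or "margin-top: -.5em".
    private static readonly Regex NegativeMarginTop =
        new Regex(@"margin-top\s*:\s*-\s*[\d.]", RegexOptions.IgnoreCase);

    // Returns (number, text) pairs for every <p> whose inline style has a negative
    // margin-top and whose immediately preceding sibling element is also a <p>.
    // Assumes elements are in no XML namespace.
    public static List<Tuple<string, string>> Find(string wellFormedHtml)
    {
        XDocument doc = XDocument.Parse(wellFormedHtml);
        var pairs = new List<Tuple<string, string>>();
        foreach (XElement p in doc.Descendants("p"))
        {
            string style = (string)p.Attribute("style") ?? "";
            if (!NegativeMarginTop.IsMatch(style)) continue;

            XElement prev = p.ElementsBeforeSelf().LastOrDefault();
            if (prev != null && prev.Name.LocalName == "p")
            {
                pairs.Add(Tuple.Create(prev.Value.Trim(), p.Value.Trim()));
            }
        }
        return pairs;
    }
}
```

Each returned pair can then be treated as one heading object; the body paragraph that follows is left alone because it has no negative margin.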

Related

Combining Regex with Selenium in C#

I have an automation suite, currently testing against WordPress (a test site to practice against). I am attempting to verify that when a user edits an existing Page they are taken to the correct screen. Previously the following code snippet was working fine, however the ID mentioned below is no longer present (it was an image).
public static bool IsInEditMode()
{
    return Driver.Instance.FindElement(By.Id("icon-edit-pages")) != null;
}
Assert.IsTrue(NewPostPage.IsInEditMode(), "You are not in edit mode");
The HTML I am targeting is...
<h2>
Edit Page
<a>Add New</a>
</h2>
I would like to extract the value of the h2 tag, 'Edit Page'. Currently I am also getting the value of the anchor, 'Add New', which I need to ignore.
Using a CssSelector with "h2:first-child" returns both values.
I think I need to use a regular expression. If anyone has any suggestions, that would be great.
I attempted something similar in JSFiddle but require the C# equivalent:
var myString = document.getElementsByTagName('h2')[0].innerHTML;
var newString = myString.replace(/<([^>]+?)([^>]*?)>(.*?)<\/\1>/ig, "");
console.log(newString);
You can also get the parent element's text and remove the child element's text from it:
var parent = Driver.FindElement(By.TagName("h2"));
var child = parent.FindElement(By.TagName("a"));
var text = parent.Text.Replace(child.Text, "").Trim();
You can use StringAssert to verify that the string being checked contains the expected string. I think this is better because you do not need a regex.
Example:
StringAssert.Contains(message, expectedmessage);
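For readers who still want the regex route the question asks about, here is a minimal C# sketch of the JSFiddle snippet. It assumes you already have the heading's inner HTML as a string (in Selenium that would typically come from something like GetAttribute("innerHTML") on the h2 element). The class and method names are made up for illustration, and the pattern does not cope with nested elements of the same tag name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadingText
{
    // Removes every child element (opening tag through its closing tag)
    // from the fragment and returns the remaining text, trimmed.
    public static string OwnText(string innerHtml)
    {
        string stripped = Regex.Replace(
            innerHtml,
            @"<(\w+)[^>]*>.*?</\1\s*>",
            "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return stripped.Trim();
    }
}
```

With the heading from the question, OwnText would be given the text plus the anchor and return only the text that belongs to the h2 itself.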

Edit an existing PDF file using iTextSharp

I have a PDF file which I am processing by converting it into text using the following code:
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
During processing, if I find any ambiguity in the content (that is, an error in the data of the PDF file), I have to mark the entire line of the PDF file (color that line red), but I can't work out how to achieve that. Please help me.
As already mentioned in comments: What you essentially need is a SimpleTextExtractionStrategy replacement which not only returns text but instead text with positions. The LocationTextExtractionStrategy would be a good starting point for that as it collects the text with positions (to put it in the right order).
If you look into the source of LocationTextExtractionStrategy you'll see that it keeps its text pieces in a member List<TextChunk> locationalResult. A TextChunk (inner class in LocationTextExtractionStrategy) represents a text piece (originally drawn by a single text drawing operation) with location information. In GetResultantText this list is sorted (top-to-bottom, left-to-right, all relative to the text base line) and reduced to a string.
What you need, is something like this LocationTextExtractionStrategy with the difference that you retrieve the (sorted) text pieces including their positions.
Unfortunately the locationalResult member is private. If it was at least protected, you could simply have derived your new strategy from LocationTextExtractionStrategy. Instead you now have to copy its source to add to it (or do some introspection/reflection magic).
Your addition would be a new method similar to GetResultantText. This method might recognize all the text on the same line (just like GetResultantText does) and either
do the analysis / search for ambiguities itself and return a list of the locations (start and end) of any found ambiguities; or
put the text found for the current line into a single TextChunk instance together with the effective start and end locations of that line and eventually return a List<TextChunk> each of which represents a text line; if you do this, the calling code would do the analysis to find ambiguities, and if it finds one, it has the start and end location of the line the ambiguity is on. Beware, TextChunk in the original strategy is protected but you need to make it public for this approach to work.
Either way, you eventually have the start and end location of the ambiguities or at least of the lines the ambiguities are on. Now you have to highlight the line in question (as you say, you have to mark the entire line of the pdf(Color that line with Red)).
To manipulate a given PDF you use a PdfStamper. You can mark a line on a page by either
getting the UnderContent for that page from the PdfStamper and filling a rectangle in red there using your position data; the disadvantage of this approach is that if the original PDF already has filled areas underneath the line, your mark will be hidden under them; or by
getting the OverContent for that page from the PdfStamper and fill a somewhat transparent rectangle in red; or by
adding a highlight annotation to the page.
To make things even smoother, you might want to extend your copy of TextChunk (inner class in your copy of LocationTextExtractionStrategy) to keep not only the base line coordinates but also the maximal ascent and descent of the glyphs used. Obviously you'd have to fill in that information in RenderText...
Doing so you know exactly the height required for your marking rectangle.
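The line-grouping step described above can be sketched independently of iTextSharp. The Chunk and Line classes below are hypothetical stand-ins, not iTextSharp types: they assume you have already collected each text piece with its left edge, baseline, width, ascent and descent (descent given as a positive distance below the baseline). The sketch groups chunks whose baselines are within a tolerance and returns one bounding box per line, which is the rectangle you would then fill through the PdfStamper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical positioned text piece (not an iTextSharp class).
public sealed class Chunk
{
    public string Text;
    public float X, Y, Width;      // left edge, baseline, width
    public float Ascent, Descent;  // distances above / below the baseline
}

// One visual line with the rectangle that encloses it.
public sealed class Line
{
    public string Text;
    public float Left, Bottom, Right, Top;
}

public static class LineGrouper
{
    public static List<Line> Group(IEnumerable<Chunk> chunks, float tolerance = 1f)
    {
        var lines = new List<Line>();
        var current = new List<Chunk>();
        // PDF y coordinates grow upward, so descending Y is top-to-bottom.
        foreach (Chunk chunk in chunks.OrderByDescending(c => c.Y))
        {
            // A baseline further than the tolerance from the line's first
            // chunk starts a new line.
            if (current.Count > 0 && Math.Abs(current[0].Y - chunk.Y) > tolerance)
            {
                lines.Add(ToLine(current));
                current = new List<Chunk>();
            }
            current.Add(chunk);
        }
        if (current.Count > 0) lines.Add(ToLine(current));
        return lines;
    }

    private static Line ToLine(List<Chunk> chunks)
    {
        // Within a line, read left to right.
        List<Chunk> ordered = chunks.OrderBy(c => c.X).ToList();
        return new Line
        {
            Text = string.Concat(ordered.Select(c => c.Text)),
            Left = ordered[0].X,
            Right = ordered.Max(c => c.X + c.Width),
            Bottom = ordered.Min(c => c.Y - c.Descent),
            Top = ordered.Max(c => c.Y + c.Ascent)
        };
    }
}
```

The calling code can run its ambiguity check on Line.Text and, for each failing line, draw a rectangle from (Left, Bottom) to (Right, Top).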
Too long to be a comment; added as answer.
My good fellow and peer Adi, It depends a lot on your PDF contents. It's kind of hard to do a generic solution to something like this. What does currentText contain? Can you give an example of it? Also, if you have a lot of these PDFs to check, you need to get currentText of a few of them, just to make sure that your current PDF to string conversion produces the same result every time. If it is same every time from different PDFs; then you can start to automate.
The automation also depends a lot on your content. For example, if currentText is something like this: Value: 10\nValue: 11\nValue: 9Value\n15 then what I recommend is going through every line, extracting the value and checking it against what you need it to be. This is untested semi-pseudo code that gives you an idea of what I mean:
var lines = new List<string>(currentText.Split('\n'));
var newLines = new List<string>();
foreach (var line in lines) {
    if (line != "Value: 10") {
        newLines.Add(line); // This line is correct, no marking needed
    } else {
        newLines.Add("THIS IS WRONG: " + line); // Mark as incorrect; use whatever you need here
    }
}
// Next, return newLines to the user showing them which lines are bad so they can edit the PDF
If you need to automatically edit the existing PDF, this will be very, very, very hard. I think it's beyond the scope of my answer - I was answering how to identify the wrong lines and not how to mark them - sorry! Someone else please add that answer.
By the way; PDF is NOT a good format for doing something like this. If you have access to any other source of information, most likely the other one will be better.

How do I scrape a website for information?

I want my program to automatically download only certain information off a website. After finding out that this is nearly impossible I figured it would be best if the program would just download the entire web page and then find the information that I needed inside of a string.
How can I find certain words/numbers after specific words? The word before the number I want to have is always the same. The number varies and that is the number I need in my program.
Sounds like screen scraping. I recommend using CsQuery https://github.com/jamietre/CsQuery (or HtmlAgilityPack if you want). Get the source, parse it into an object, loop over all text nodes and do your string comparison there. The actual way of doing this depends a lot on how the source HTML is structured.
Maybe something like this untested example written from memory (CsQuery):
var dom = CQ.Create(stringWithHtml);
dom["*"].Each((i, e) =>
{
    // handle only text nodes
    if (e.NodeType == NodeType.TEXT_NODE) {
        // do your check here
    }
});
I've used HTML Agility Pack for multiple applications and it works well. Lots of options too.
It's a lovely HTML parser that is commonly recommended for this. It will take malformed HTML and massage it into XHTML and then a traversable DOM, like the XML classes. So it is very useful for the kind of markup you find in the wild.
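For the simplest form of the question (a fixed word followed by a varying number in a downloaded string), a regular expression may be enough once you have the text. The sketch below is an assumption-laden illustration, not a general scraper: the class name, the optional colon after the label and the decimal format are all guesses about what the page text looks like, and any markup between the label and the number would have to be stripped first.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class LabelledNumber
{
    // Returns the first number that directly follows the label
    // (optionally separated by whitespace and a colon), or null if none.
    public static decimal? After(string text, string label)
    {
        string pattern = Regex.Escape(label) + @"\s*:?\s*(-?\d+(?:\.\d+)?)";
        Match m = Regex.Match(text, pattern);
        if (!m.Success) return null;
        return decimal.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
    }
}
```

Regex.Escape keeps characters such as "(" or "." in the label from being read as pattern syntax, and a null result tells the caller the label was not found rather than returning a misleading zero.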

Get text between 2 html tags c#

I am trying to get the data between the html (span) provided (in this case 31)
Here is the original code (from inspect elements in chrome)
<span id="point_total" class="tooltip" oldtitle="Note: If the number is black, your points are actually a little bit negative. Don't worry, this just means you need to start subbing again." aria-describedby="ui-tooltip-0">31</span>
I have a rich textbox which contains the source of the page, here is the same code but in line 51 of the rich textbox:
<DIV id=point_display>You have<BR><SPAN id=point_total class=tooltip jQuery16207621750175125325="23" oldtitle="Note: If the number is black, your points are actually a little bit negative. Don't worry, this just means you need to start subbing again.">17</SPAN><BR>Points </DIV><IMG style="FLOAT: right" title="Gain subscribers" border=0 alt="When people subscribe to you, you lose a point" src="http://static.subxcess.com/images/page/decoration/remove-1-point.png"> </DIV>
How would I go about doing this? I have tried several methods and none of them seem to work for me.
I am trying to retrieve the point value from this page: http://www.subxcess.com/sub4sub.php
The number changes depending on who subs you.
You could be incredibly specific about it:
var regex = new Regex(@"<span id=""point_total"" class=""tooltip"" oldtitle="".*?"" aria-describedby=""ui-tooltip-0"">(.*?)</span>");
var match = regex.Match(@"<span id=""point_total"" class=""tooltip"" oldtitle=""Note: If the number is black, your points are actually a little bit negative. Don't worry, this just means you need to start subbing again."" aria-describedby=""ui-tooltip-0"">31</span>");
var result = match.Groups[1].Value;
You'll want to use HtmlAgilityPack to do this, it's pretty simple:
HtmlDocument doc = new HtmlDocument();
doc.Load("filepath");
HtmlNode node = doc.DocumentNode.SelectSingleNode("//span"); // To select a specific span, add a predicate, e.g. "//span[@id='point_total']"
string value = node.InnerText; //this string will contain the value of span, i.e. <span>***value***</span>
Regex, while a viable option, is something you generally want to avoid for parsing HTML if at all possible.
In terms of sustainability, you'll want to make sure that you understand the page source (i.e., refresh it a few times and see if your target span is nested within the same parents after every refresh, make sure the page is in the same general format, etc..., then navigate to the span using the above principle).
There are multiple possibilities.
Regex
Let HTML be parsed as XML and get the value via XPath
Iterate through all elements. If you get on a span tag, skip all characters until you find the closing '>'. Then the value you need is everything before the next opening '<'
Also look at System.Windows.Forms.HtmlDocument
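If you do go with the regex option, a looser pattern than the very specific one shown earlier may hold up better, because the two copies of the markup in the question differ in case, quoting and attribute order. The following is only a sketch: the class name is invented, it keys on the id attribute alone, and it assumes no attribute value inside the tag contains a ">" character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpanValue
{
    // Returns the inner text of the first <span> whose id equals the given
    // value (quoted or unquoted, any letter case in the tag), or null.
    public static string ById(string html, string id)
    {
        string pattern =
            @"<span\b[^>]*\bid\s*=\s*[""']?" + Regex.Escape(id) +
            @"[""']?(?=[\s>])[^>]*>(.*?)</span\s*>";
        Match m = Regex.Match(
            html, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }
}
```

The lookahead after the id value stops "point" from matching "point_total". The result is still a string, so parse it with int.TryParse before using it as a number.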

Most efficient way to add missing alt tags for images in a large html document

In order to comply with accessibility standards, I need to ensure that all images in some dynamically-generated html (which I don't control) have an empty alt tag if none is specified.
Example input:
<html>
<body>
<img src="foo.gif" />
<p>Some other content</p>
<img src="bar.gif" alt="" />
<img src="blah.gif" alt="Blah!" />
</body>
</html>
Desired output:
<html>
<body>
<img src="foo.gif" alt="" />
<p>Some other content</p>
<img src="bar.gif" alt="" />
<img src="blah.gif" alt="Blah!" />
</body>
</html>
The html could be quite large and the DOM heavily-nested, so using something like the Html Agility Pack is out.
Can anyone suggest an efficient way to accomplish this?
Update:
It is a safe assumption that the html I'm dealing with is well-formed, so a potential solution need not account for that at all.
Your problem seems very specific, you need to alter some output, but you don't want to parse the whole thing with (something general-purpose like) HTMLAgilityPack for performance reasons. The best solution would seem to be to do it the hard way.
I would just brute force it. It would be hard to do it more efficiently than something like this (untested, so check the index arithmetic before relying on it):
string addAltTag(string html) {
    StringBuilder sb = new StringBuilder();
    int pos = 0;      // where to start looking for the next <img
    int lastPos = 0;  // first character not yet copied to the output
    while (true) {
        pos = html.IndexOf("<img", pos, StringComparison.Ordinal);
        if (pos < 0) {
            break;
        }
        // images can't have children, and there should not be any angle brackets
        // anywhere in the attributes, so this should work fine
        int nextPos = html.IndexOf('>', pos);
        if (nextPos < 0) {
            // unclosed image -- just quit
            break;
        }
        // back up if the tag is self-closed (XML style "/>")
        if (html[nextPos - 1] == '/') {
            nextPos--;
        }
        // output everything from the last position up to, but not including,
        // the closing ">" or "/>"
        sb.Append(html, lastPos, nextPos - lastPos);
        // can't just look for "alt": it could be in the image url or class name
        if (html.IndexOf(" alt=\"", pos, nextPos - pos, StringComparison.Ordinal) < 0) {
            sb.Append(" alt=\"\"");
        }
        lastPos = nextPos;
        pos = nextPos;
    }
    // copy whatever follows the last image
    sb.Append(html, lastPos, html.Length - lastPos);
    return sb.ToString();
}
You may need to do things like convert to lowercase before testing, parse or test for variants e.g alt = " (that is, with spaces), etc. depending on the consistency you can expect from your HTML.
By the way, there is no way this would be faster, but if you want to use something a little more general for some reason, you can also give a shot to CsQuery. This is my own C# implementation of jQuery which would do something like this very easily, e.g.
obj.Select("img").Not("[alt]").Attr("alt",String.Empty);
Since you say that HTML agility pack performs badly on deeply-nested HTML, this may work better for you, because the HTML parser I use is not recursive and should perform linearly regardless of nesting. But it would be far slower than just coding to your exact need since it does, of course, parse the entire document into an object model. Whether that is fast enough for your situation, who knows.
I just tested this on an 8 MB HTML file with about 250,000 lines. It did take a few seconds for the document to load, but the select method was very fast. Not sure how big your file is or what you are expecting. I even edited the HTML file to remove some closing tags, such as </body> and some random </div>. It was still able to parse correctly.
HtmlDocument doc = new HtmlDocument();
doc.Load(@"c:\test.html");
HtmlNodeCollection col = doc.DocumentNode.SelectNodes("//img[not(@alt)]");
I had a total of 54,322 nodes. The select took milliseconds.
If the above will not work, and you can reliably predict the output, you could stream the file in and break it into manageable chunks.
pseudo-code
stream file in
parse in HtmlAgilityPack
loop until end of stream
I imagine you could incorporate Parallel.ForEach() in there as well, although I can't find documentation on whether this is safe with HtmlAgilityPack.
Well, if I review your content for Section 508 compliance, I will fail your web site or content - unless the blank alt text is for decorative (not needed for comprehension of content) only.
Blank alt text is only for decoration. Inserting it might fool some automated reporting tools, but you certainly are not meeting Section 508 compliance.
From a project management standpoint, you are better off leaving it failing so the end-users creating the content become responsible and the automated tool accurately reports it as non-compliant.
Assuming you can get hold of the generated HTML markup wherever you need it, here is a quick way to check (for example, for an SEO report) whether any images are missing the alt attribute:
private static bool HasImagesWithoutAltTags(string htmlContent)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(htmlContent);
    // SelectNodes returns null when nothing matches, so test for null instead of calling Any()
    return doc.DocumentNode.SelectNodes("//img[not(@alt)]") != null;
}
