Parsing unlabeled HTML with "HTML Agility Pack" in C#

Using HTML Agility Pack, I want to extract untagged text from an HTML document.
The HTML below is an example of the structure I need to handle, and the text below the last div is an example of the text that I want to extract
(the part that begins with "I am selling..." and ends with "...services or offers").
<div class="mapbox">
<div id="map" class="viewposting" data-latitude="32.965732" data-longitude="-96.882528" data-accuracy="22"></div>
<p class="mapaddress">
<small>
(<a target="_blank" href="https://maps.google.com/maps/preview/#32.965732,-96.882528,16z">google map</a>)
</small>
</p>
</div>
<p class="attrgroup">
<span><b>2012 jeep grand cherokee laredo</b></span>
<br>
</p>
<p class="attrgroup">
<span>VIN: <b>ask me</b></span>
<br>
<span>condition: <b>excellent</b></span>
<br>
<span>cylinders: <b>6 cylinders</b></span>
<br>
<span>drive: <b>rwd</b></span>
<br>
<span>fuel: <b>gas</b></span>
<br>
<span>odometer: <b>98000</b></span>
<br>
<span>title status: <b>clean</b></span>
<br>
<span>transmission: <b>automatic</b></span>
<br>
</p>
<div class="print-information print-qrcode-container">
<p class="print-qrcode-label">QR Code Link to This Post</p>
<div class="print-qrcode" data-location="east"></div>
</div>
I am selling my 2012 Jeep Grand Cherokee. The Jeep runs and drives great. Zero issues. Always been well maintained and serviced on time. Very dependable car has never left me stranded. Very healthy. Everything works like it should. This Grand Cherokee would make a great family car or First car.<br>
<br>
*3.6 V6 <br>
*Automatic Transmission <br>
*98,000 Original Miles<br>
*Leather and Heated Seats<br>
*Navigation<br>
*Back Up Camera <br>
*Good Tires<br>
*Cold A/C Hot Heater <br>
*Clean Texas Title<br>
*Clean Carfax<br>
Much More!!<br>
<br>
Call or Text me for anymore information. <br>
show contact info
<li>do NOT contact me with unsolicited services or offers</li>
Can anyone tell me how to extract that text using HTML Agility Pack in .NET?
Thanks in advance

After you load the document, use XPath to select the text node that follows a specific node.
const string xpath = "//div[@class='print-information print-qrcode-container']/following-sibling::text()[1]";
string text = doc.DocumentNode.SelectSingleNode(xpath).InnerText;
returns:
I am selling my 2012 Jeep Grand Cherokee. The Jeep runs and drives
great. Zero issues. Always been well maintained and serviced on time.
Very dependable car has never left me stranded. Very healthy.
Everything works like it should. This Grand Cherokee would make a
great family car or First car.
and visca catalunya!
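The XPath above returns only the first text node after the div, which stops at the first <br>. A minimal sketch of collecting every following text node is shown here, assuming the HtmlAgilityPack NuGet package is referenced; the class name UntaggedText and the method After are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the "HtmlAgilityPack" NuGet package is installed

public static class UntaggedText
{
    // Returns every non-empty text node that follows, at the same level,
    // the first <div> whose class attribute equals divClass.
    public static List<string> After(string html, string divClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        string xpath = "//div[@class='" + divClass + "']/following-sibling::text()";
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (nodes == null) return new List<string>();

        return nodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(s => s.Length > 0)
            .ToList();
    }
}
```

Calling UntaggedText.After(html, "print-information print-qrcode-container") and joining the result with newlines should give the description line by line. Text wrapped in an element, such as the final <li>, is not a sibling text node and would need a separate query.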

Related

Cleaning up HTML created by contentEditable in c#

I've written a document editor which uses contentEditable to create HTML content. In some larger documents the markup style seems to be all over the place. This is most likely a result of content pasted in from WordPad and earlier versions of the editor.
The problem is, now I'm left with a lot of very inconsistent documents.
It starts off fairly normal. Simple <p> tags for each line
<p>It is a truth</p>
<p>universally acknowledged</p>
<p>that a single man</p>
The only "bad" HTML up to this point is a few empty <i></i> tags and the occasional &nbsp; instead of whitespace (anyone know why?)
Then, about halfway down the document, the line breaks switch to this format.
<div>
<br>
CHAPTER 1<br>
<br>
The sky above the port
<br>
was the color of a television
<br>
tuned to a dead channel.
</div>
<div>
<br>
</div>
Then about 3/4 down the page, we get this. It seems to have reverted to <p></p> tags, but now embeds them randomly in <span> tags with empty lang attributes
<div>
<span lang="">
<p>It was the best of times,</p>
<p>it was the worst of times,</p>
</span>
<p>it was the age of wisdom,</p>
<p>it was the age of foolishness,</p>
</div>
Note: some lines are inside a <span>, others are outside.
Worse, later on we get nested <span> tags
<span lang="">
<div>
<span lang="EN-GB">
<p>Stately, plump </p>
<p>Buck Mulligan came </p>
<span lang="EN-GB">
<p>from the stairhead, </p>
<p>bearing a bowl of lather </p>
<span lang="EN-GB">
<p> on which a mirror and a razor lay crossed</p>
</span>
</span>
</span>
</div>
</span>
You may also notice the parentage of the <span> and <div> tags is now reversed at the outset, with the <div> now a child of the <span>
I've noticed other oddities. <i></i> is used at the start but later <em></em> is used.
What's the best way to clean this HTML up?
Should I try and surround orphaned lines with <p> tags?
How do I remove only those <div> tags which contain <p> tags themselves? And how do I avoid leaving orphaned text in the document?
This is a hard question; I had the same problem when editing HTML generated from text.
I found this free, pure HTML + JS editor: TinyMCE
http://www.tinymce.com/
It includes text-cleanup options, and you can choose which tags to strip from the text.
It is very powerful if you have the chance to change the editor you are using.

Max Value with Substring with HTML Agility Pack

I can't seem to get this xpath query to work with the HTMLAgilityPack with this code and I was wondering if anyone had any suggestions.
This is the query I have so far, but I can't seem to get it to return a number.
DocumentNode.GetAttributeValue("max(a[(@class='shackmsg')]/@href/substring-after(.,?id='))", "");
I'm trying to get the MAX value in the href attribute after the = sign on all hrefs with a class of shackmsg.
How long is the beta live before it goes retail? No one knows. We do know t</span> : </a><span class="oneline_user ">legsbrogan</span>
</div>
</li>
<li id="item_31218936" class="">
<div class="oneline oneline3 op olmod_ontopic olauthor_189801">
<a class="shackmsg" rel="nofollow" href="?id=31218936" onclick="return clickItem( 31218933, 31218936);"><span class="oneline_body"><b><u><span class="jt_yellow">Current Multiplayer Servers</span>!</u></b>
<span class="jt_sample"><span class="jt_green">Nighteyes's Japan Server: </span> <span class="jt_lime">(PvE)</span>: <b>211.15.2.34</b></span>
<span class="jt_sample"><span class="jt_green">zolointo's Canada Server: </span> <span class="jt_lime">(</span></span></span> : </a><span class="oneline_user ">legsbrogan</span>
</div>
</li>
<li id="item_31218938" class="last">
<div class="oneline oneline2 op olmod_ontopic olauthor_189801">
<div class="treecollapse">
<a class="open" rel="nofollow" href="#" onclick="toggle_collapse(31218938); return false;" title="Toggle">toggle</a>
</div>
<a class="shackmsg" rel="nofollow" href="?id=31218938" onclick="return clickItem( 31218933, 31218938);"><span class="oneline_body">Had fun freezing my ass off last night with a bunch of shackers. Not sure who started the big tower we f...</span> : </a><span class="oneline_user ">legsbrogan</span>
</div>
<ul>
<li id="item_31218966" class="">
<div class="oneline oneline1 olmod_ontopic olauthor_128401">
<a class="shackmsg" rel="nofollow" href="?id=31218966" onclick="return clickItem( 31218933, 31218966);"><span class="oneline_body">wasn't me. I hung out on my ship for a bit listening to your kid play Christmas songs for a bit and then ...</span> : </a><span class="oneline_user ">jonin</span><a class="lightningbolt" rel=\"nofollow\" href="http://www.shacknews.com/user/jonin/posts?result_sort=postdate_asc"><img src="http://cf.shacknews.com/images/bolt.gif" alt="This person is cool!" /></a>
</div>
</li>
<li id="item_31219008" class="last">
<div class="oneline oneline0 olmod_ontopic olauthor_8618">
<a class="shackmsg" rel="nofollow" href="?id=31219008" onclick="return clickItem( 31218933, 31219008);"><span class="oneline_body">haha i heard you guys booby trapped some poor sap's space ship</span> : </a><span class="oneline_user ">Break</span><a class="lightningbolt" rel=\"nofollow\" href="http://www.shacknews.com/user/Break/posts?result_sort=postdate_asc"><img src="http://cf.shacknews.com/images/bolt.gif" alt="This person is cool!" /></a>
</div>
</li>
</ul>
Any suggestions?
There are two problems as far as I can see:
You're only scanning for anchor tags in the current context. You probably want to extend to scan everywhere (use // in the beginning of your query):
//a[@class='shackmsg']/@href/substring-after(., '?id=')
Note that I removed a pair of unnecessary parentheses.
As far as I know, HTML Agility Pack only supports XPath 1.0. While System.Xml.XPath says it implements the XPath 2.0 data model, it does not actually implement XPath 2.0 (probably this is done so third party APIs can implement this API and offer XPath 2.0/XQuery support at the same time). Also have a look at this discussion on .NET's XPath 2.0 support.
Missing XPath 2.0 support would show up as two problems:
The aggregate function max(...) does not exist in XPath 1.0. (substring-after(...) itself is part of XPath 1.0, but only as an ordinary function call, not as a path step; see the second point.)
To strip the prefix from a single value you could also use string-length($string) and substring($string, $start, $length) to extract the last n digits, or translate(...) to remove some characters:
translate('?id=31219008', '?id=', '')
Note that translate(...) does not match the string '?id=' but removes every occurrence of the individual characters ?, i, d and =; that is harmless here only because the remainder consists of digits.
You cannot apply functions in axis steps. This means, you cannot find the maximum value of substrings.
Possible solution: Only fetch all substrings and find the maximum from outside XPath.
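The suggestion to fetch the values with XPath and compute the maximum outside of it could be sketched as follows. This is a minimal example, assuming the HtmlAgilityPack NuGet package is referenced and that the ids fit in a long; the class name ShackIds and the method MaxId are hypothetical.

```csharp
using System;
using HtmlAgilityPack; // assumption: the "HtmlAgilityPack" NuGet package is installed

public static class ShackIds
{
    // Returns the largest numeric id found in <a class="shackmsg" href="?id=N"> links,
    // or null when no link has a parsable id.
    public static long? MaxId(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath 1.0 only selects the anchors; the string handling happens in C#.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@class='shackmsg']");
        if (links == null) return null; // SelectNodes returns null when nothing matches

        const string prefix = "?id=";
        long? max = null;
        foreach (HtmlNode a in links)
        {
            string href = a.GetAttributeValue("href", "");
            if (!href.StartsWith(prefix, StringComparison.Ordinal)) continue;

            // Compare as numbers: a string comparison would rank "999" above "31219008".
            if (long.TryParse(href.Substring(prefix.Length), out long id) && (max == null || id > max))
                max = id;
        }
        return max;
    }
}
```

Links without an href, or with an href in a different shape, are skipped rather than causing an exception.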
You can combine XPath with HTML Agility Pack and LINQ, using the following code:
// Parse the ids as numbers so that Max() compares them numerically, not as strings.
var value = doc.DocumentNode.SelectNodes("//a[@class='shackmsg']").Select(
x => long.Parse(x.Attributes["href"].Value.Substring(4))).Max();
Console.WriteLine(value);
This produces the following output:
31219008
This code assumes that the href attribute always exists and always has the following structure:
"?id=XXXX"

Regular Expressions select whole outer DIV

I've been trying for hours to solve this problem. I want to use regular expressions to select whole divs, including nested divs; see the example string below:
AA <div> Text1 </div> BB <div style=\"vertical-align : middle;\"> Text2 <div>Text 3</div> </div> CC
Want to return the following values
<div> Text1 </div>
<div style=\"vertical-align : middle;\"> Text2 <div>Text 3</div> </div>
The closest I've got is the following pattern, but it just gives me each DIV tag separately:
(?<BeginTag><\s*div.*?>)|(?<EndTag><\s*/\s*div.*?>)
Any help would be great.
To expand on my rather snarky comment, a Regex is not a good tool for parsing any kind of HTML. Only in the simplest of scenarios will it be feasible, and even then, I would not recommend it.
What you need is a good tool for parsing HTML. In the .NET world, a nice library for this is the HTMLAgilityPack or perhaps the SGMLReader project.
You do need to invest a little bit of time in learning the API, but it will be worth it.
For the little fragment you are showing, I think the easiest API for you will be SGMLReader. It can read HTML as if it were XML, which means you can convert it to an XDocument and use a much nicer API. The code for that could look like this:
// Requires: using System; using System.IO; using System.Xml.Linq; plus the SgmlReader library.
string markup = "<html>AA <div> Text1 </div> BB <div style=\"vertical-align : middle;\"> Text2 <div>Text 3</div> </div> CC</html>";
XDocument doc;
using (var reader = Sgml.SgmlReader.Create(new StringReader(markup)))
{
    doc = XDocument.Load(reader);
}
var rootLevelDivs = doc.Root.Elements("div");
foreach (var div in rootLevelDivs)
{
    Console.WriteLine(div);
}
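For comparison, the same idea with the other library mentioned above, HTML Agility Pack, might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; the class name OuterDivs and the method Find are hypothetical. The XPath selects every div that has no div ancestor, so nested divs are returned only as part of their outermost parent.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the "HtmlAgilityPack" NuGet package is installed

public static class OuterDivs
{
    // Returns the markup of every outermost <div>, nested divs included.
    public static List<string> Find(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // A div with no div ancestor is an outermost div.
        HtmlNodeCollection divs = doc.DocumentNode.SelectNodes("//div[not(ancestor::div)]");

        // SelectNodes returns null when nothing matches.
        return divs == null
            ? new List<string>()
            : divs.Select(d => d.OuterHtml).ToList();
    }
}
```

For the example string in the question this should yield two entries, the second of which still contains the inner <div>Text 3</div>.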

Get data from HTML child class

I’m attempting to create a tool, in C#, which gathers and analyses data from a web page/form. There are basically 2 different types of data. Data entered by a user and data created by the system (I don’t have access to).
The data created by the user is kept in fields and the form uses IDs - so GetElementByID is used.
The problem I’m running into is obtaining the data created by the system. It shows on the form, but isn’t associated to an ID. I may be reading/interpreting the HTML incorrectly, but it appears to be a child class (I don’t have much HTML experience). I’m attempting to get the “Date Submitted” data (near the bottom of the code). Sample of the HTML code:
<div class="bottomSpace">
<div class="importfromanotherorder">
<div class="level2Panel" >
<div class="left">
<span id="if error" class="error"></span>
</div>
<div class="right">
Enter Submission ID
<input name="Submission$ID" type="text" id="Submission_ID" class="textbox" />
<input type="submit" name="SumbitButton" value="Import" id="SubmitButton" />
</div>
</div>
</div>
</div>
<div class="bottomSpace">
<div class="detailsinfo">
<div class="level2Panel" >
<div class="left">
<h5>Product ID</h5>
1234567
<h5>Sub ID</h5>
Not available
<h5>Product Type</h5>
Type 1
</div>
<div class="right">
<h5>Order Number</h5>
0987654
<h5>Status</h5>
Ordered
<h5>Date Submitted</h5>
7 17 2012 5 45 09 AM
</div>
</div>
</div>
</div>
Using GetElementsByTagName (searching for “div”) and then using GetAttribute(“className”) (searching for “right”) generates some results, but as there are 2 “right” classes, it’s not working as intended.
I’ve tried searching by className = “detailsinfo”, which I can find, but I’m not sure how I could go about getting down to the “right” class. I tried sibling and children, but the results don't appear to be working. The next possible problem is that it appears the date data is actually text belonging to class “right” and not element “Date Submitted” .
So basically, I'm curious what the best approach would be to get the data I'm looking for. Would I need to get all of the class “right” text and then try to extract the date string?
Apologies if there is too much info or not enough of the required info :) Thanks in advance!
EDIT: Added how GetElementsByTagName is called using C# - per Icarus's comment.
HtmlDocument doc = webBrowser1.Document;
HtmlElementCollection elemColl = doc.GetElementsByTagName("div");
This will do it if the 'right' instance you want is the second one. Two approaches are given:
The commented-out approach indexes a zero-based node list, so it uses instance 1.
The second approach is XPath, which is one-based, so it uses instance 2.
private string ReadHTML(string html)
{
    // Note: XmlDocument.LoadXml requires well-formed XML with a single root element.
    System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
    doc.LoadXml(html);
    System.Xml.XmlElement element = doc.DocumentElement;
    // This commented-out approach works and might be preferred if you want to iterate
    // over a node set instead of choosing just one node:
    //string key = "//div[@class='right']";
    //System.Xml.XmlNodeList setting = element.SelectNodes(key);
    //return setting[1].LastChild.InnerText;
    // This XPath approach lets you select exactly one node:
    string key = "((//div[@class='right'])[2])/child::text()[last()]";
    System.Xml.XmlNode setting = element.SelectSingleNode(key);
    return setting.InnerText;
}
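Because the value is the text that directly follows the "Date Submitted" heading, another option is to anchor on the heading text instead of the position of the div. The following is a minimal sketch using only System.Xml; it assumes the markup is well-formed XML with a single root element, and the class name LabelledText and the method ValueAfterHeading are hypothetical.

```csharp
using System;
using System.Xml;

public static class LabelledText
{
    // Returns the text that directly follows <h5>label</h5>, or null when the label is absent.
    // Assumption: label contains no apostrophe, since it is embedded in the XPath literal.
    public static string ValueAfterHeading(string xml, string label)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml); // throws XmlException if the markup is not well-formed

        // First text node after the matching heading, at the same level.
        string xpath = "//h5[normalize-space()='" + label + "']/following-sibling::text()[1]";
        XmlNode node = doc.SelectSingleNode(xpath);
        return node?.Value.Trim();
    }
}
```

This does not depend on how many divs with class "right" precede the one of interest, so it keeps working if the page layout gains or loses a panel.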

How to generate PDF on the fly with same layout as HTML code in C#

I am generating HTML code on the fly for a catalog, and I would like to generate a PDF as well. I considered just printing the HTML page to a PDF doc, but I lose some of the background shading and things, and it splits content across pages.
I've read a bit about iText, but I haven't figured out how to format it properly, and I don't know how to make it so it doesn't split my content across pages.
This is the beginning of my HTML page; I included several items so you can see how the content is broken down. I apologize for the ugly HTML, I cannot for the life of me get a div table to look right!
<style type="text/css">
<!--
tr#odd {
background-color:#e2e2e2;
vertical-align:top;
}
tr#even {
vertical-align:top;
}
div#title {
font-size:16px;
font-weight:bold;
}
div#mpaa {
font-size:10px;
}
div#genre {
font-size:12px;
font-style:italic;
}
div#plot {
height: 63px;
font-size:12px;
overflow:hidden;
}
-->
</style>
<html>
<title>Movie Catalog</title>
<body>
718 Movies
<br />
<br />
<table>
<tr id="odd">
<td>
<img src=".\images\10,000BCDVDrip.jpg" width="75" height="110">
</td>
<td>
<div id="title">10,000 BC</div>
<div id="mpaa"> </div>
<div id="genre">Adventure, Drama</div>
<div id="plot">A prehistoric epic that follows a young mammoth hunter's journey through uncharted territory to secure the future of his tribe.</div>
</td>
</tr>
<tr id="even">
<td>
<img src=".\images\101Dalmatians1961PlatinumEditionDVDRipXviD.jpg" width="75" height="110">
</td>
<td>
<div id="title">101 Dalmatians (Platinum Edition)</div>
<div id="mpaa">G </div>
<div id="genre">Comedy, Family, Disney</div>
<div id="plot">The Live action adaptation of a Disney Classic. When a litter of dalmatian puppies are abducted by the minions of Cruella De Vil, the parents must find them before she uses them for a diabolical fashion statement.</div>
</td>
</tr>
<tr id="odd">
<td>
<img src=".\images\102DalmationsDVDrip.jpg" width="75" height="110">
</td>
<td>
<div id="title">102 Dalmations</div>
<div id="mpaa">G </div>
<div id="genre">Family</div>
<div id="plot">After a spot of therapy Cruella De Vil is released from prison a changed woman. Devoted to dogs and good causes, she is delighted that Chloe, her parole officer, has a dalmatian family and connections with a dog charity. But the sound of Big Ben can reverse the treatment so it is only a matter of time before Ms De Vil is back to her incredibly ghastly ways, using her new-found connections with Chloe and friends</div>
</td>
</tr>
<tr id="even">
<td>
<img src=".\images\127Hours2010720pBluRayx264.jpg" width="75" height="110">
</td>
<td>
<div id="title">127 Hours</div>
<div id="mpaa">R Rated R for language and some disturbing violent content/bloody images.</div>
<div id="genre">Action, Adventure, Drama, Suspense, Thriller</div>
<div id="plot">127 Hours is the true story of mountain climber Aron Ralston's (James Franco) remarkable adventure to save himself after a fallen boulder crashes on his arm and traps him in an isolated canyon in Utah. Over the next five days Ralston examines his life and survives the elements to finally discover he has the courage and the wherewithal to extricate himself by any means necessary, scale a 65 foot wall and hike over eight miles before he is finally rescued. Throughout his journey, Ralston recalls friends, lovers (Clemence Poesy), family, and the two hikers (Amber Tamblyn and Kate Mara) he met before his accident. Will they be the last two people he ever had the chance to meet?</div>
</td>
</tr>
<tr id="odd">
<td>
<img src=".\images\13GoingOn30DVDrip.jpg" width="75" height="110">
</td>
<td>
<div id="title">13 Going On 30</div>
<div id="mpaa">PG-13 for some sexual content and brief drug references</div>
<div id="genre">Comedy, Fantasy, Romance</div>
<div id="plot">After total humiliation at her thirteenth birthday party, Jenna Rink wants to just hide until she's thirty. Thanks to some wishing dust, Jenna's prayer has been answered. With a knockout body, a dream apartment, a fabulous wardrobe, an athlete boyfriend, a dream job, and superstar friends, this can't be a better life. Unfortunetly, Jenna realizes that this is not what she wanted. The only one that she needs is her childhood best friend, Matt, a boy that she thought destroyed her party. But when she finds him, he's a grown up, and not the same person that she knew.</div>
</td>
</tr>
...
...
</table>
</body>
</html>
You can see what it looks like at: http://timelessdesigncafe.com/movies/catalog.html
Notice that the background shading alternates. When I print to PDF I lose the shading, and more importantly, it splits a "row"/movie over two pages, and I need to avoid that.
Thanks in advance!!
Nobody has mentioned wkhtmltopdf? :)
You can use the OpenOffice API to do this conversion, following these steps in your code:
Load the OpenOffice API
Open the desired HTML file
Save it as PDF
I know it works for VB (I have already used it in VBScripts), C++ and Java; you should be able to do the same thing with C#.
Links:
http://www.kalitech.fr/clients/doc/VB_APIOOo_en.html
http://wiki.services.openoffice.org/wiki/API/Tutorials/PDF_export
There are many ways to do this. Please check this topic.
If you want a free library or tool you can use iTextSharp, but the free version doesn't cover every requirement, so you may need another tool such as ABCpdf.
Laying out HTML properly is a non-trivial task. My estimate is it would probably take me one or two years to get it right.
So this is not the way to go. Instead, you should filter the HTML for the data and then write a small, dedicated PDF formatter which does exactly what you need and which breaks with even the smallest changes in the input HTML.
That should take a week or so. When you're done with that, make it more resilient to changes in the input HTML.
If you are in a position to use WPF you might want to consider using FixedDocument and doing your layout for print in XAML. You can then render the XAML (taking advantage of data-binding if appropriate) to XPS, Microsoft's XML Paper Specification for document layout (essentially their version of PDF).
The advantage of this approach is the ability to leverage data-binding and XAML's (IMHO) superior (to HTML) layout functionality. I have been using this stack as a lightweight reporting solution for a while now. (You need to generate the report on an STA thread).
The next step (yes, this is perhaps getting a bit complicated) would be to pass your XPS stream through a converter to PDF format, though I am not sure whether such a thing exists. You would otherwise be relying on your clients having an XPS reader (although this is built into recent versions of Windows & Office).
If you don't mind spending a bit of money you could invest in PrinceXML, which formats any Xml document (including XHtml) into a .pdf document, applying full layout rules to the Html content. In fact Prince is more compliant with web standards when doing its layout pass than many web browsers are :)
Take a look at WebToPDF.NET, a .NET component written in C# that converts HTML to PDF. You will get a PDF file which looks exactly the same as your HTML file. I believe you can specify the page size, so you could specify a very long page to get everything on one page.
The converter supports HTML 4.01, XHTML 1.0, XHTML 1.1 and CSS 2.1 including page breaks, forms and links. It passes all W3C tests (except BIDI).
