I've written a document editor which uses contentEditable to create HTML content. In some larger documents the markup style is all over the place. This is most likely a result of content pasted in from WordPad and of earlier versions of the editor.
The problem is, now I'm left with a lot of very inconsistent documents.
It starts off fairly normal. Simple <p> tags for each line
<p>It is a truth</p>
<p>universally acknowledged</p>
<p>that a single man</p>
The only "bad" HTML up to this point is a few empty <i></i> tags, and the occasional &nbsp; instead of whitespace (anyone know why?)
Then, about halfway down the document, the line breaks switch to this format.
<div>
<br>
CHAPTER 1<br>
<br>
The sky above the port
<br>
was the color of a television
<br>
tuned to a dead channel.
</div>
<div>
<br>
</div>
Then about 3/4 down the page, we get this. It seems to have reverted to <p></p> tags, but now embeds them randomly in <span> tags with empty lang attributes
<div>
<span lang="">
<p>It was the best of times,</p>
<p>it was the worst of times,</p>
</span>
<p>it was the age of wisdom,</p>
<p>it was the age of foolishness,</p>
</div>
Note: some lines are inside a <span>, others are outside.
Worse, later on we get nested <span> tags
<span lang="">
<div>
<span lang="EN-GB">
<p>Stately, plump </p>
<p>Buck Mulligan came </p>
<span lang="EN-GB">
<p>from the stairhead, </p>
<p>bearing a bowl of lather </p>
<span lang="EN-GB">
<p> on which a mirror and a razor lay crossed</p>
</span>
</span>
</span>
</div>
</span>
You may also notice the parentage of the <span> and <div> tags is now reversed at the outset, with the <div> now a child of the <span>
I've noticed other oddities. <i></i> is used at the start but later <em></em> is used.
What's the best way to clean this HTML up?
Should I try and surround orphaned lines with <p> tags?
How do I remove only those <div> tags which contain <p> tags themselves? And how do I avoid leaving orphaned text in the document?
This is a hard question; I had the same problem when editing HTML that came from pasted text.
I found this free, pure HTML + JS editor: TinyMCE
http://www.tinymce.com/
It includes options for cleaning up text, and you can choose which tags to strip from the text.
It is very powerful, if you have the chance to change the editor you are using.
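If switching editors is not an option, the existing documents could also be cleaned in a one-off batch pass. The following is only a minimal sketch, assuming the documents can be processed server-side in .NET with HTML Agility Pack; HtmlCleaner and Clean are hypothetical names. It removes empty <i>/<em> tags and unwraps every <span> plus every <div> that contains a <p>. It assumes the lang attributes on the spans carry nothing worth keeping, and it does not wrap orphaned text in <p> tags.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlCleaner
{
    // Hypothetical helper for illustration; not part of HTML Agility Pack.
    public static string Clean(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Remove inline tags that have no content at all, e.g. <i></i>.
        foreach (var node in Select(doc, "//i[not(node())] | //em[not(node())]"))
            node.Remove();

        // Unwrap every <span>, and every <div> that has a <p> descendant,
        // keeping the children in place. Reverse document order means inner
        // wrappers are handled before the elements that contain them.
        foreach (var node in Select(doc, "//span | //div[.//p]").Reverse())
            node.ParentNode.RemoveChild(node, true);

        return doc.DocumentNode.OuterHtml;
    }

    // SelectNodes returns null when nothing matches, so normalise to a list.
    private static IEnumerable<HtmlNode> Select(HtmlDocument doc, string xpath)
    {
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? new List<HtmlNode>() : nodes.ToList();
    }
}
```

A <div> that holds only text and <br> tags (like the "CHAPTER 1" block) is left untouched by this pass, so converting those lines to <p> tags would need a separate step.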
Related
Using HTML Agility Pack, I want to extract text that is not wrapped in any tag from an HTML document.
The following HTML is an example of the structure I have to handle, and the text below the last div is an example of the text that I want to extract.
(The one that begins with "I am selling..." and ends in "...services or offers")
<div class="mapbox">
<div id="map" class="viewposting" data-latitude="32.965732" data-longitude="-96.882528" data-accuracy="22"></div>
<p class="mapaddress">
<small>
(<a target="_blank" href="https://maps.google.com/maps/preview/#32.965732,-96.882528,16z">google map</a>)
</small>
</p>
</div>
<p class="attrgroup">
<span><b>2012 jeep grand cherokee laredo</b></span>
<br>
</p>
<p class="attrgroup">
<span>VIN: <b>ask me</b></span>
<br>
<span>condition: <b>excellent</b></span>
<br>
<span>cylinders: <b>6 cylinders</b></span>
<br>
<span>drive: <b>rwd</b></span>
<br>
<span>fuel: <b>gas</b></span>
<br>
<span>odometer: <b>98000</b></span>
<br>
<span>title status: <b>clean</b></span>
<br>
<span>transmission: <b>automatic</b></span>
<br>
</p>
<div class="print-information print-qrcode-container">
<p class="print-qrcode-label">QR Code Link to This Post</p>
<div class="print-qrcode" data-location="east"></div>
</div>
I am selling my 2012 Jeep Grand Cherokee. The Jeep runs and drives great. Zero issues. Always been well maintained and serviced on time. Very dependable car has never left me stranded. Very healthy. Everything works like it should. This Grand Cherokee would make a great family car or First car.<br>
<br>
*3.6 V6 <br>
*Automatic Transmission <br>
*98,000 Original Miles<br>
*Leather and Heated Seats<br>
*Navigation<br>
*Back Up Camera <br>
*Good Tires<br>
*Cold A/C Hot Heater <br>
*Clean Texas Title<br>
*Clean Carfax<br>
Much More!!<br>
<br>
Call or Text me for anymore information. <br>
show contact info
<li>do NOT contact me with unsolicited services or offers</li>
Can anyone tell me how to extract that text using HTML Agility Pack in .NET?
Thanks in advance
After you load the document, use xpath for selecting the text following a specific node.
const string xpath = "//div[@class='print-information print-qrcode-container']/following-sibling::text()[1]";
string text = doc.DocumentNode.SelectSingleNode(xpath).InnerText;
returns:
I am selling my 2012 Jeep Grand Cherokee. The Jeep runs and drives
great. Zero issues. Always been well maintained and serviced on time.
Very dependable car has never left me stranded. Very healthy.
Everything works like it should. This Grand Cherokee would make a
great family car or First car.
and visca catalunya!
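The XPath above returns only the first text node after the QR-code div. If the whole description is wanted, one possible extension is to walk every following sibling and keep the non-empty text. This is a sketch under assumptions: PostingText and ExtractLines are hypothetical names, and it assumes the description and the marker div share the same parent, as in the sample.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class PostingText
{
    // Hypothetical helper: collects the text of every node that follows
    // the QR-code <div> under the same parent.
    public static List<string> ExtractLines(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        const string xpath =
            "//div[@class='print-information print-qrcode-container']/following-sibling::node()";
        var siblings = doc.DocumentNode.SelectNodes(xpath);
        if (siblings == null)
            return new List<string>();   // marker <div> missing, or nothing after it

        return siblings
            .Where(n => n.NodeType != HtmlNodeType.Comment)
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(line => line.Length > 0)   // drops <br> elements and whitespace-only nodes
            .ToList();
    }
}
```

Joining the result with string.Join("\n", lines) gives the description as one block of text.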
I'm trying to figure out how I can select only the main (top-level) nodes from a loaded HTML document, as in the following example:
<div id="main">
<p>paragraph 1</p>
<p>paragraph 2</p>
<img src="exzample.jpg" />
</div>
<div id="main2">
<div>some text</div>
<p>some text</p>
<img src="exzample.jpg" />
</div>
<p class="a_class">
<div>some text</div>
<span>some text</span>
</p>
I know I can iterate over all elements, but in my case I need only the 3 blocks (in this example) from the loaded HTML. I do not know how to select such nodes using the SelectNodes function or any other function.
I'm using HtmlAgilityPack library.
Note: Main nodes can be any html tag (div, p, span and so on...)
The XPath /* will select all immediate children of the root node (which the document posted in this question is lacking).
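With HTML Agility Pack, a fragment without a single root element is still loaded under the library's own DocumentNode, so the top-level blocks can also be read directly from its children. A minimal sketch, where TopLevel and Blocks are hypothetical names; how the parser nests invalid markup such as a <div> inside a <p> is not something this sketch controls.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TopLevel
{
    // Hypothetical helper: returns only the outermost elements of a fragment.
    public static List<HtmlNode> Blocks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // DocumentNode is the synthetic root; its element children are the
        // "main" nodes. Whitespace text nodes between them are skipped.
        return doc.DocumentNode.ChildNodes
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .ToList();
    }
}
```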
I have some HTML code, shown below, which was supplied by our graphics developers. The issue is that when I import it into an ASP.NET (C#) page I see a lot of orphan divs. It looks as if there are no opening divs for several of the closing divs. Here is a code snippet.
<div class="col-lg-2 col-lg-3 quick-launch">
<div class="thumbnail">
<a href=""> <img src="assets/img/app_images/app_7.jpg" width="115" height="114">
<div class="caption">
<h3>TEST</h3>
</a></div>
</div>
</div>
Could someone please let me know if there is something in Visual Studio that is causing this?
You've swapped the closing </div> and </a> tags. Browsers generally tolerate this in HTML (it would be rejected outright as XHTML, so you'd better check your DOCTYPE), but the tags are not properly nested and this may confuse the Visual Studio editor:
<a href=""> <img src="assets/img/app_images/app_7.jpg" width="115" height="114">
<div class="caption">
<h3>TEST</h3>
</a>
</div>
Should be:
<a href=""> <img src="assets/img/app_images/app_7.jpg" width="115" height="114">
<div class="caption">
<h3>TEST</h3>
</div>
</a>
Edit: what's wrong with that? It works because an HTML parser doesn't complain about <a><div></a></div> (if the DOCTYPE isn't XHTML), but you should complain about it. Let me explain: the </div> closing tag isn't optional, so in theory the parser won't just silently add it. In practice, browsers handle this in different ways. Some of them silently close the <div> when </a> is reached (so the following </div> closes the outer one); others don't (again, because it's not an optional closing tag), so the </div> closes the inner, correct one. IMO, with such unreliable behavior you should ask your developer/graphics designer to fix that code. In general (with a few exceptions such as <hr> and <br>) I would write HTML code as if it were XHTML.
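One way to locate such mis-nested tags before importing the markup is to let a parser report them. The sketch below assumes HTML Agility Pack is available and simply prints whatever its ParseErrors collection contains; NestingCheck and Report are hypothetical names, and which errors the library raises for a particular input is not guaranteed here.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class NestingCheck
{
    // Hypothetical helper: prints the parser's complaints and returns their count.
    public static int Report(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ParseErrors is filled in while the document is loaded.
        var errors = doc.ParseErrors.ToList();
        foreach (var e in errors)
            Console.WriteLine($"line {e.Line}, col {e.LinePosition}: {e.Code} ({e.Reason})");

        return errors.Count;
    }
}
```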
I can't seem to get this xpath query to work with the HTMLAgilityPack with this code and I was wondering if anyone had any suggestions.
This is the query I have so far, but I can't seem to get it to return a number.
DocumentNode.GetAttributeValue("max(a[(@class='shackmsg')]/@href/substring-after(.,'?id='))", "");
I'm trying to get the MAX value in the href attribute after the = sign on all hrefs with a class of shackmsg.
How long is the beta live before it goes retail? No one knows. We do know t</span> : </a><span class="oneline_user ">legsbrogan</span>
</div>
</li>
<li id="item_31218936" class="">
<div class="oneline oneline3 op olmod_ontopic olauthor_189801">
<a class="shackmsg" rel="nofollow" href="?id=31218936" onclick="return clickItem( 31218933, 31218936);"><span class="oneline_body"><b><u><span class="jt_yellow">Current Multiplayer Servers</span>!</u></b>
<span class="jt_sample"><span class="jt_green">Nighteyes's Japan Server: </span> <span class="jt_lime">(PvE)</span>: <b>211.15.2.34</b></span>
<span class="jt_sample"><span class="jt_green">zolointo's Canada Server: </span> <span class="jt_lime">(</span></span></span> : </a><span class="oneline_user ">legsbrogan</span>
</div>
</li>
<li id="item_31218938" class="last">
<div class="oneline oneline2 op olmod_ontopic olauthor_189801">
<div class="treecollapse">
<a class="open" rel="nofollow" href="#" onclick="toggle_collapse(31218938); return false;" title="Toggle">toggle</a>
</div>
<a class="shackmsg" rel="nofollow" href="?id=31218938" onclick="return clickItem( 31218933, 31218938);"><span class="oneline_body">Had fun freezing my ass off last night with a bunch of shackers. Not sure who started the big tower we f...</span> : </a><span class="oneline_user ">legsbrogan</span>
</div>
<ul>
<li id="item_31218966" class="">
<div class="oneline oneline1 olmod_ontopic olauthor_128401">
<a class="shackmsg" rel="nofollow" href="?id=31218966" onclick="return clickItem( 31218933, 31218966);"><span class="oneline_body">wasn't me. I hung out on my ship for a bit listening to your kid play Christmas songs for a bit and then ...</span> : </a><span class="oneline_user ">jonin</span><a class="lightningbolt" rel=\"nofollow\" href="http://www.shacknews.com/user/jonin/posts?result_sort=postdate_asc"><img src="http://cf.shacknews.com/images/bolt.gif" alt="This person is cool!" /></a>
</div>
</li>
<li id="item_31219008" class="last">
<div class="oneline oneline0 olmod_ontopic olauthor_8618">
<a class="shackmsg" rel="nofollow" href="?id=31219008" onclick="return clickItem( 31218933, 31219008);"><span class="oneline_body">haha i heard you guys booby trapped some poor sap's space ship</span> : </a><span class="oneline_user ">Break</span><a class="lightningbolt" rel=\"nofollow\" href="http://www.shacknews.com/user/Break/posts?result_sort=postdate_asc"><img src="http://cf.shacknews.com/images/bolt.gif" alt="This person is cool!" /></a>
</div>
</li>
</ul>
Any suggestions?
There are two problems as far as I can see:
You're only scanning for anchor tags in the current context. You probably want to extend to scan everywhere (use // in the beginning of your query):
//a[@class='shackmsg']/@href/substring-after(., '?id=')
Note that I removed a pair of unnecessary parentheses.
If I'm not completely mistaken, HTML Agility Pack only supports XPath 1.0 (yet I'm not totally sure). While System.Xml.XPath says it implements the XPath 2.0 data model, it does not actually implement XPath 2.0 (probably this is done so third party APIs can implement this API and offer XPath 2.0/XQuery support at the same time). Also have a look at this discussion on .NET's XPath 2.0 support.
The lack of XPath 2.0 support shows up as two problems:
Function max(...) does not exist in XPath 1.0 (substring-after(...) itself is available there).
If you would rather not rely on substring-after(...), you could use string-length($string) and substring($string, $start, $length) to extract the last n digits, or translate(...) to remove some characters:
translate('?id=31219008', '?id=', '')
removes every occurrence of the individual characters ?, i, d and = (none occur after the prefix here; I just want to highlight that translate does not match strings, but individual characters of this set!).
You cannot use a function call as a step in a path expression. This means you cannot compute the substrings inside the path and take their maximum.
Possible solution: fetch all the values and find the maximum outside XPath.
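That last suggestion could be sketched as follows: select the anchors with plain XPath 1.0, then parse and compare the ids in C#. This is an illustration under assumptions, not a definitive implementation; ShackIds and MaxId are hypothetical names, and the href is assumed to carry the id as an "id=" query parameter. Ids are compared as numbers, and anchors without a usable href are skipped.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class ShackIds
{
    // Hypothetical helper: largest numeric id among a.shackmsg hrefs, or null.
    public static long? MaxId(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no anchor matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@class='shackmsg']");
        if (anchors == null)
            return null;

        var ids = anchors
            .Select(a => Regex.Match(a.GetAttributeValue("href", ""), @"[?&]id=(\d+)"))
            .Where(m => m.Success)                       // skip hrefs without an id
            .Select(m => long.Parse(m.Groups[1].Value))  // numeric, not string, comparison
            .ToList();

        return ids.Count > 0 ? ids.Max() : (long?)null;
    }
}
```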
You can combine XPath with HTML Agility Pack and LINQ in the following code:
// Parse each id as a number so that Max() compares numerically, not as strings.
var value = doc.DocumentNode.SelectNodes("//a[@class='shackmsg']").Select(
x => int.Parse(x.Attributes["href"].Value.Substring(4))).Max();
Console.WriteLine(value);
This outputs:
31219008
This code assumes that the href attribute always exists and always has the following structure:
"?id=XXXX"
<div id="bulletinContents">
<div class="headingArea">
<div id="heading">
Critical Notice Description: <br>
Notice Effective Date: <br>
</div>
<div id="headingData">Critical notice<br>
10/16/2013<br>
</div>
</div>
<br>
<div id="bulletin">
<br>
<div id="Div1">Notice Text:</div>
<br>
To notify
<br>
<br>
</div>
</div>
There are also other <div> tags on the page, but I want to extract this particular section.
Can anyone please suggest a proper regular expression for this?
I have used this regex:
<div[^>]*>(?<Value>[^<]*(?:(?!</div)<[^<]*)*)[</div>]*
but it does not return the content I need. It returns only the <div> elements with id heading and Div1.
I need to complete this task using only a regular expression and nothing else. Please suggest a suitable regex.
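If the regex runs on the .NET engine (an assumption; the question does not say which engine is used), balancing groups offer one way to match a <div> together with its nested <div> tags. The sketch below is not a general HTML parser: it assumes the outer element is identified by id="bulletinContents" written with double quotes, and that every nested <div> is properly closed. BulletinRegex and Extract are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class BulletinRegex
{
    private static readonly Regex Pattern = new Regex(
        @"<div\b[^>]*\bid\s*=\s*""bulletinContents""[^>]*>" + // the outer opening tag
        @"(?<Value>(?:" +
        @"(?<open><div\b[^>]*>)" +      // nested opening tag: push onto the stack
        @"|(?<-open></div>)" +          // closing tag of a nested div: pop
        @"|(?!</?div\b)." +             // any character that does not start a div tag
        @")*)" +
        @"(?(open)(?!))" +              // fail if a nested <div> is still open
        @"</div>",                      // the outer closing tag
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the inner HTML of the bulletinContents div, or null if not found.
    public static string Extract(string html)
    {
        var m = Pattern.Match(html);
        return m.Success ? m.Groups["Value"].Value : null;
    }
}
```

The named group Value then holds everything between the outer opening and closing tags, including the nested heading, headingData, and bulletin blocks.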