Suggestions on how build an HTML Diff tool?

Suggestions on how build an HTML Diff tool? - c#

In this post I asked if there were any tools that compare the structure (not actual content) of 2 HTML pages. I ask because I receive HTML templates from our designers, and frequently miss minor formatting changes in my implementation. I then waste a few hours of designer time sifting through my pages to find my mistakes.
The thread offered some good suggestions, but there was nothing that fit the bill. "Fine, then", thought I, "I'll just crank one out myself. I'm a halfway-decent developer, right?".
Well, once I started to think about it, I couldn't quite figure out how to go about it. I can crank out a data-driven website easily enough, or do a CMS implementation, or throw documents in and out of BizTalk all day. Can't begin to figure out how to compare HTML docs.
Well, sure, I have to read the DOM, and iterate through the nodes. I have to map the structure to some data structure (how??), and then compare them (how??). It's a development task like none I've ever attempted.
So now that I've identified a weakness in my knowledge, I'm even more challenged to figure this out. Any suggestions on how to get started?
clarification: the actual content isn't what I want to compare -- the creative guys fill their pages with lorem ipsum, and I use real content. Instead, I want to compare structure:
<div class="foo">lorem ipsum<div>
is different that
<div class="foo"><p>lorem ipsum<p><div>

The DOM is a data structure - it's a tree.

Run both files through the following Perl script, then use diff -iw to do a case-insensitive, whitespace-ignoring diff.
#! /usr/bin/perl -w
use strict;
undef $/;
my $html = <STDIN>;
while ($html =~ /\S/) {
if ($html =~ s/^\s*<//) {
$html =~ s/^(.*?)>// or die "malformed HTML";
print "<$1>\n";
} else {
$html =~ s/^([^<]+)//;
print "(text)\n";
}
}

#Mike - that would compare everything, including the content of the page, which isn't want the original poster wanted.
Assuming that you have access to the browser's DOM (by writing a Firefox/IE plugin or whatever), I would probably put all of the HTML elements into a tree, then compare the two trees. If the tag name is different, then the node is different. You might want to stop enumerating at a certain point (you probably don't care about span, bold, italic, etc. - maybe only worry about divs?), since some tags are really the content, rather than the structure, of the page.

If i was to tacke this issue I would do this:
Plan for some kind of a DOM for html pages. starts at lightweight and then add more as needed. I would use composite pattern for the data structure. i.e. every element has children collection of the base class type.
Create a parser to parse html pages.
Using the parser load html element to the DOM.
After the pages' been loaded up to the DOM, you have the hierachical snapshot of your html pages structure.
Keep iterating through every element on both sides till the end of the DOM. You'll find the diff in the structure, when you hit a mismatched of element type.
In your example you would have only a div element object loaded on one side, on the other side you would have a div element object loaded with 1 child element of type paragraph element. fire up your iterator, first you'll match up the div element, second iterator you'll match up paragraph with nothing. You've got your structural difference.

I think some of the suggestions above don't take into account that there are other tags in the HTML between two pages which would be textually different, but the resulting HTML markup is functionally equivalent. Danimal lists control IDs as an example.
The following two markups are functionlly identical, but would show up as different if you simply compared tags:
<div id="ctl00_TopNavHome_DivHeader" class="header4">foo</div>
<div class="header4">foo</div>
I was going to suggest Danimal write an HTML translation which looks for the HTML tags and converts both docs into a simplified version of both which omits ID tags and any other tags you designate as irrelevant. This’d likely have to be a work in progress, as you ignore certain attributes/tags and then run into new ones which you also want to ignore.
However, I like the idea of using the XmlSchemaInterface to boil it down to the XML schema, then use a diff tool which understands XML rules.

See http://www.semdesigns.com/Products/SmartDifferencer/index.html for a tool that is parameterized by langauge grammar, and produces deltas in terms of language elements (identifiers, expressions, statements, blocks, methods, ...) inserted, deleted, moved, replaced, or has identifiers substituted across it consistently. This tool ignores whitespace reformatting (e.g., different linebreaks or layouts) and semantically indistinguishable values (e.g., it knows that 0x0F and 15 are the same value).
This can be applied to HTML using an HTML parser.
EDIT: 9/12/2009. We've built an experimental SmartDiff tool using an HTML editor.

http://www.mugo.ca/Products/Dom-Diff
Works with FF 3.5. I haven't tested FF 3.6 yet.

This has been an excellent start. A few more clarifications/comments:
I probably don't care about IDs, since .net will mangle them
some of the structure will be in a repeater or other such control, so I might end up having more or fewer repeating elements
further thought:
I think a good start would be to assume the html is XHTML compliant. I could then infer the schema (using the new .net XmlSchemaInference methods), then diff the schemata. I can then look at the differences and consider whether or not they're significant.

My suggestion is just the basic way of doing it... Of course to tackle the issue you mentioned additional rules must be applied here... Which is in your case, we got a matching div element, and then apply attributes/property matching rules and what not...
To be honest, there are many and complicated rules that need to be applied for the comparison, and its not just a simple matching element to another element. For example what happens if you have duplicates.
e.g. 1 div element on one side, and 2 div element on the other side. How are you gonna match up which div elements matches together?
There are alot other complicated issues that you will find in the comparison word. Im speaking based of experience (part of my job is to maitain my company text comparison engine).

Take a look at beyond compare. It has an XML comparison feature that can help you out.

You may also have to consider that the 'content' itself could contain additional mark-up so it's probably worth stripping out everything within certain elements (like <div>s with certain IDs or classes) before you do your comparison. For example:
<div id="mainContent">
<p>lorem ipsum etc..</p>
</div>
and
<div id="mainContent">
<p>Here is some real content<img class="someImage" src="someImage.jpg" /></p>
<ul>
<li>and</li>
<li>some</li>
<li>more..</li>
</ul>
</div>

Pretty Diff can do this. It will compare the code structure only regardless of differences to white space, comments, or even content. Just be sure to check the option "Normalize Content and String Literals".
http://prettydiff.com/

I would use (or contribute to) html5lib and its SAX output. Just zip through the 2 SAX streams looking for mismatches and highlight the whole corresponding subtree.

I don't know any tool but I know there is a simple way to do this:
First, use a regular expression tool to strip off all the text in your HTML file. You can use this regular expression to search for the text (?<=^|>)[^><]+?(?=<|$) and replace them with an empty string (""), i.e. delete all the text. After this step, you will have all HTML markup tags. There are a lot of free regular expression tools out there.
Then, you repeat the first step for the original HTML file.
Last, you use a diff tool to compare the two sets of HTML markups. This will show what is missing between one set and the other.

If i were to do this, first i would learn HTML. (^-^) Then i would build a tool that strips out all of the actual content and then saves that as a file so it can be piped through WinDiff (or other merge tool).

Open each page in the browser and save them as .htm files. Compare the two using windiff.

Related

Finding beginning and end tags in custom HTML-like structure

I am working on a HTML templating engine in C#. I want to achieve some of the same functionality as libraries such as Handlebars.Net (the C# implementation of Handlebars.js), except using basic string manipulation (find/replace) rather than a full-out compiler.
The syntax would be something such as:
{{#each item in items}}
<li>{{item.Name}}</li>
{{/each}}
I am looking to do simple string replacements with Regex, but I realize Regex would have shortcomings in this area, such as finding the end tag in the example below (it would find the first embedded {{/each}} rather than the last when parsing the first tag):
{{#each item in items}} //while parsing this tag...
<p>{{item.Name}}'s Hobbies:</p>
<ul>
{{#each hobby in item}}
<li>{{hobby.Name}}</li>
{{/each}} //this end tag would be found first
</ul>
{{/each}} //rather than this one
I need to find the beginning and end of any given "tag" (such as {{#each}}...{{/each}}) and create a DOM-like structure out of it in order to parse the contents of each tag. The "DOM" could have multiple tags embedded inside of each other (think embedded foreach loops x4). What is a good method for achieving this?

In my opinion, parsing is looking at things very functionally: regex some characters <>, parse as tag, iterate characters, regex the next <>, get content inbetween tags, covert to string...etc. etc. etc.
This functional, parse-each-time-you-add-a-character approach in my opinion is a byproduct of asking a text editor to perform such tasks as tag recognition, syntax highlighting, etc. Most simply put, a text editor's "purpose" was to expect a "model" object to be a panel full of unicode characters that it could be blissfully ignorant of, and we have grown used to forcing it to 'become intelligent' regarding an object model.
I've been designing my own version of what you are working on after playing around with the "hurt me plenty" HTML tagging inside of Monarch editor ( which is 1. possibly the best we can do in 2017 and 2. stupidly complex.)
The way I am approaching it is not as a text editor, but as a view in the MVC sense that is superficially a text editing surface but is actually a templated, ordered model that is reactive to user behaviors.

Why do certain tags go a new line when using HtmlTextWriter?

I'm having a bit of a hard time trying to format my HTML when using HtmlTextWriter. It seems that some tags will automatically go to a new line and some won't.
Is there a way to stop this from happening so all tags are created equally and leave the formatting completely up to me?
In my particular case I'm building out a <ul>'s and <li>'s for a custom HTML Sitemap.
The immediate tag that comes after an <ul> will wrap to a new line.
This is not the case for a <li> tag.
If anyone needs clarification please ask question.

Are you using RenderBeginTag? That method will handle some things automatically, and among other things, it will put line breaks (and indentation) in elements that aren't inline (e.g. ul versus span).
If you want to do the rendering manually, use WriteBeginTag("ul") or WriteFullBeginTag("ul") instead.
Note that WriteBeginTag will still handle indentation. However, you have full control over that if you only use the WriteXXX methods.
In the end, though, do those endlines really bother you at all? You do use compression, right? The overhead usually isn't very significant...

lucene.net/examine weight html tags

I've got this project where we are implementing Examine / Lucene.net. And I'm look for some guidance from you guys.
As far as I have been able to find out from the knowledge of google, is that if I want to boost the weight, I need to boost the weight on the Field, right ?
But could I get something like this: Is it able to give a boost to a term if the term is inside a h1-tag or the title for that matter. When giving a complete site-html, and do a frequent term search.
the thing i would like to do, is no make a service which gets a html document, and from that is able to find what words in a this document optimised after depending on which terms are used in the text and if they are in the important places, like in a title-tag or h2-tag and so forward.
Is this possible to achieve ? its so the editors live can know, "what they are writing are best found with which searchwords.
Big thanks in advance.

I don't think it quite works that way. Yes, you can boost a field but you cannot boost a term dependent on it's location in some markup because you don't know that at the time of the search.
I think what you could do is create an Umbraco event handler that fires when a page is published. This event could:
Utilise the GatheringNodeData event of an Index
Take the contents of the rich text editor-based field and using regex or something like HtmlUtility extract specific text based upon it's markup location, e.g. H1, H2 and H3 text.
For each piece of text in a heading found, add it into a string variable
Add the whole string into the Lucene index as a new field, e.g. "Headings"
You can now boost on the "Headings" field separately to the field containing the field containing the HTML.

Writing xml with at most two tags per line

I am saving xml from .NET's XElement. I've been using the method ToString, but the formatting doesn't look how I'd like (examples below). I'd like at most two tags per line. How can I achieve that?
Saving XElement.Parse("<a><b><c>one</c><c>two</c></b><b>three<c>four</c><c>five</c></b></a>").ToString() gives me
<a>
<b>
<c>one</c>
<c>two</c>
</b>
<b>three<c>four</c><c>five</c></b>
</a>
But for readability I would rather 'three', 'four' and 'five' were on separate lines:
<a>
<b>
<c>one</c>
<c>two</c>
</b>
<b>three
<c>four</c>
<c>five</c>
</b>
</a>
Edit: Yes I understand this is syntactically different and "not in the spirit of xml", but I'm being pragmatic. Recently I've seen megabyte-size xml files with as few as 3 lines—these are challenging to text editors, source control, and diff tools. Something needs to be done! I've tested that changing the formatting above is compatible with our application.

If you want exactly that output, you'll need to do it manually, adding whitespace around nodes as necessary.
Almost all whitespace in XML documents is significant, even if we only think of it as indenting. When we ask the serializer to indent the document for us, it is making changes to the content that can get extracted, so they try to be as conservative as possible. The elements
<tag>foo</tag>
and
<tag>
foo
</tag>
have different content, and if an serializer changed the former into the latter, it would change what you get back from your XML API when asking for the contents of <tag>.
The usual rule of thumb is that no indenting will be applied if there's any existing non-whitespace between the elements. In this case, your three between the tags would be modified if a serializer applied the indenting you desire, so nothing will do it for you automatically.
If you have control over the XML format, it's inadvisable to mix element and text children like this, where <b> has both text (three) and element (<c>) children, as it causes issues like what you're seeing.

The formatting isn't working the way you want because of the naked "three". Is there a reason it's not in it's own tag? Should it be an attribute of "b" instead?

Explained reasons to colleagues - we're going to change the file format. I recommend you try to do the same. It's nigh impossible to do what I wanted, because most xml tools assume whitespace is significant.

XML is an information exchange format, intended for computers. The whitespace is irrelevant (depending on location and schema, really) and as such, it would be arbitrary to use one or the other.
You could use XmlTextWriter with XElement.Save and see whether you can tweak it to your liking with the XmlWriter.Settings Property

I've had to do something similar before (for a client request). All I ended up doing was writing a custom .ToString() method only used for either displaying the XML in a browser(ugh, i know) or for their use in downloading an xml file of the content. Because the code did not have to be computationally efficient, it was merely a matter of checking the children of each tag and arranging the 'hanging' text as such.
Eventually we were able to convince the user that the text should be an attribute instead.

C# regex html table inside a table

I am using the follow regex:
(<(table|h[1-6])[^>]*>(?<op>.+?)<\/(table|h[1-6])>)
to extract tables (and headings) from a html document.
I've found it to work quite well in the documents we are using (documents converted with word save as filtered html), however I have a problem that if the table contains a table inside it the regex will match the initial table start tag and the second table end tag rather than the initial table end tag.
Is there a way in regex to specify that if it finds another table tag within the match to keep to ignore the next match of and go for the next one and so on?

Don't do this.
HTML is not a regular grammar and so a regular expression is not a good tool with which to parse it. What you are asking in your last sentence is for a contextual parser, not a regular expression. Bare regular expression parsing it is too likely fail to parse HTML correctly to be responsible coding.
HtmlAgilityPack is a MsPL-licensed solution I've used in the past that has widely acceptable license terms and provides a well-formed DOM which can be probed with XPath or manipulated in other useful ways ("Extract all text, dropping out tags" being a popular one for importing HTML mail for search, for example, that is nigh trivial after letting a DOM parser rip through the HTML and only coding the part that adds value for your specific business case).

Is there a way in regex to specify
that if it finds another table tag
within the match to keep to ignore the
next match of and go for the next one
and so on?
Since nobody's actually answered this part, I will—No.
This is part of what makes regular languages "regular". A regular language is one that can be recognized by a certain regular grammar, often described in syntax that looks very much like basic regular expressions (10* to match 1 followed by any number of 0s), or a DFA. "Regular Expressions" are based strongly off of these regular languages, as their name implies, but add some functions such as lookaheads and lookbehinds. As a general rule, a regular language knows nothing about what's around it or what it's seen, only what it's looking at currently, and which of its finite states it's in.
TLDNR: Why does this matter to you? Since a regular language cannot "count" elements in that way, it is impossible to keep a tally of the number of <table> and </table> elements you have seen. An HTML Parser does just that - since it is not trying to emulate a regular language, it can count the number of opening and closing tags it sees.
This is the prime example of why it's best not to use regular expressions to parse HTML; even though you know how it may be formed, you cannot parse it since there may be nested elements. If you could guarantee there would be no nested tables, it may be feasible to do this, but even then, using a parser would be much simpler.
Plea to the theoretical computer scientists: I did my best to explain what I know from the CS Theory classes I've taken in a way that most people here should be able to understand. I know that regular languages can "count" finite numbers of things. Feel free to correct me, but please be kind!

Regular expressions are not really suited for this as what you're trying to do contains knowledge about the fact that this is a nested language. Without this knowledge it will be really hard (and also hard to read and maintain) to extract this information.
Maybe do something with an XPath navigator?

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.