My SQL database is outputting a blob of html that looks something like this:
<ul><li>Test 1</li><li>Test 2</li></ul>
Which I would like to transform into a list of textboxes which look something like this:
<textarea>Test 1</textarea><textarea>Test 2</textarea>
How do I do this? I've tried parsing into XML but this doesn't seem to work and seems a lengthy way of doing things.
I recently discovered the HTML Agility Pack which makes it very easy to do things like this:
var nodes= doc.DocumentNode.Descendants().Where(n => n.Name.StartsWith("li")).ToList();
You could then iterate over the collection and build your TextArea
foreach (HTML.HtmlNode node in nodes)
{
// ... build your text area here
}
From the wording of your question, I am inferring that you are primarily looking for a way to access the DOM of a snippet of HTML. There may be other tools for this, but I can recommend Html Agility Pack which is available as a Nuget package (http://htmlagilitypack.codeplex.com/).
Once you have the DOM structure loaded, you can access and modify it as you need, and you can generate output html as well.
This library is quite forgiving of poorly written HTML, so if your database values are less controlled (entered by users, for example), it is a good choice.
I am saving xml from .NET's XElement. I've been using the method ToString, but the formatting doesn't look how I'd like (examples below). I'd like at most two tags per line. How can I achieve that?
Saving XElement.Parse("<a><b><c>one</c><c>two</c></b><b>three<c>four</c><c>five</c></b></a>").ToString() gives me
<a>
<b>
<c>one</c>
<c>two</c>
</b>
<b>three<c>four</c><c>five</c></b>
</a>
But for readability I would rather 'three', 'four' and 'five' were on separate lines:
<a>
<b>
<c>one</c>
<c>two</c>
</b>
<b>three
<c>four</c>
<c>five</c>
</b>
</a>
Edit: Yes I understand this is syntactically different and "not in the spirit of xml", but I'm being pragmatic. Recently I've seen megabyte-size xml files with as few as 3 lines—these are challenging to text editors, source control, and diff tools. Something needs to be done! I've tested that changing the formatting above is compatible with our application.
If you want exactly that output, you'll need to do it manually, adding whitespace around nodes as necessary.
Almost all whitespace in XML documents is significant, even if we only think of it as indenting. When we ask the serializer to indent the document for us, it is making changes to the content that can get extracted, so they try to be as conservative as possible. The elements
<tag>foo</tag>
and
<tag>
foo
</tag>
have different content, and if an serializer changed the former into the latter, it would change what you get back from your XML API when asking for the contents of <tag>.
The usual rule of thumb is that no indenting will be applied if there's any existing non-whitespace between the elements. In this case, your three between the tags would be modified if a serializer applied the indenting you desire, so nothing will do it for you automatically.
If you have control over the XML format, it's inadvisable to mix element and text children like this, where <b> has both text (three) and element (<c>) children, as it causes issues like what you're seeing.
The formatting isn't working the way you want because of the naked "three". Is there a reason it's not in it's own tag? Should it be an attribute of "b" instead?
Explained reasons to colleagues - we're going to change the file format. I recommend you try to do the same. It's nigh impossible to do what I wanted, because most xml tools assume whitespace is significant.
XML is an information exchange format, intended for computers. The whitespace is irrelevant (depending on location and schema, really) and as such, it would be arbitrary to use one or the other.
You could use XmlTextWriter with XElement.Save and see whether you can tweak it to your liking with the XmlWriter.Settings Property
I've had to do something similar before (for a client request). All I ended up doing was writing a custom .ToString() method only used for either displaying the XML in a browser(ugh, i know) or for their use in downloading an xml file of the content. Because the code did not have to be computationally efficient, it was merely a matter of checking the children of each tag and arranging the 'hanging' text as such.
Eventually we were able to convince the user that the text should be an attribute instead.
Yes, I really am going to ask about parsing XML with regexes... here goes.
I have some XML-ish data, and I need to parse it. I can't do it completely with an XMLDocument or similar because it's not proper XML, and I'm not sure I can (or want to) change the format. The main problem is tags which have special meaning, and look like this:
<$ something_here $>
C#'s XmlDocument falls over parsing that, and I assume other methods will too. I could, with a lot of work, change the above to something like
<some_special_tag><![CDATA[ something_here ]]></some_special_tag>
But that's ugly, and I don't really want to. The reason it would be time consuming to change is that I have hundreds, maybe thousands of XML documents which would need to be changed.
At the moment, I'm parsing the document with regexes. I only need to pick out a couple of specific tags (not the ones above), and it seems to be working, but I'm uncomfortable with it. I'm doing something like this at the moment:
...
MatchCollection mc = Regex.Matches(Template, "<tagname.*?/tagname>"); // or similar
foreach (Match m in mc) {
try {
XmlDocument xd = new XmlDocument();
xd.LoadXml(m.Value);
...
This at least means I'm not using regexes exclusively :)
Can anyone think of a better way? Is there some way of getting XmlDocument to politely ignore the $ character that causes it to fall over? It doesn't seem likely, but I thought I should at least get some opinions.
No, there is no way to get XmlDocument to parse a document which isn't xml, no matter how close to xml it might look!
If its possible to do then I would definitely recommend that you convert your documents to be actual xml (or at least some recognised document format). Trying to create and maintain a reliable working parser for any format is quite a lot of work, let alone a format that doesn't appear to be rigeriously defined.
Using a some_special_tag element to identify special sections seems like a good idea to me. If necessary you can use a different namespace to ensure no clashes with other elements in your document - this is in fact exactly the way that xslt works ("special" tags are used to mean special things, like templates or nodes that should be replaced) and exactly what xml was designed to support.
Also I don't understand why you would need to place the something_here bit in CDATA sections. All characters that "break" xml can be escaped fairly easily (for example by writing < as <). CDATA sections are generally only used when the contents of a node needs so much escaping that its easier and less messy to just to use CDATA sections instead.
Update: Regarding migration to a new format, can you not use both methods? Attempt to parse the document as an XML document (or if there are performance concerns then perform some other test to quickly determine if the document is in the "old" or "new" format such as checking for a version attribute in the root element) - if it doesn't work then fall back to the old method.
This way as long as everything is working fine (which is will be as long as nothing changes) users don't need to modify their documents, however if they run into problems or want to use any new features then explain to them that they must update their document to the new format.
Depending on how well your current "parser" works, you may even be able to provide an upgrade utility that automatically performns the conversion (as best it can).
Can't you replace <$ something_here $> to that big CDATA section at run-time and then load the XML document as usual?
Is it possible to use variables like <%=person.LastName %> in XML string this way?
XElement letters = new XElement("Letters");
XElement xperson = XElement.Parse("<Table><Row><Cell><Text>
<Segment>Dear <%=person.Title%> <%=person.FirstName%> <%=person.LastName%>,
</Segment></Text></Cell></Row>...");
foreach (Person person in persons){
letters.Add(xperson)
}
If it's possible, it would be a lifesaver since I can't use XElement and XAttribute to add the nodes manually. We have multiple templates and they frequently change (edit on the fly).
If this is not doable, can you think of another way so that I can use templates for the XML?
Look like it's possible in VB
http://aspalliance.com/1534_Easy_SQL_to_XML_with_LINQ_and_Visual_Basic_2008.6
This is an exclusive VB.NET feature known as XML Literals. It was added in VB 9.0. C# does not currently support this feature. Although Microsoft has stated its intent to bridge the gap between the languages in the future, it's not clear whether this feature will make it to C# any time soon.
Your example doesn't seem clear to me. You would want to have the foreach loop before parsing the actual XML since the values are bound to the current Person object. For example, here's a VB example of XML literals:
Dim xml = <numbers>
<%= From i In Enumerable.Range(1, 5)
Select <number><%= i %></number>
%>
</numbers>
For Each e In xml.Elements()
Console.WriteLine(e.Value)
Next
The above snippet builds the following XML:
<numbers>
<number>1</number>
<number>2</number>
<number>3</number>
</numbers>
Then it writes 1, 2, 3 to the console.
If you can't modify your C# code to build the XML dynamically then perhaps you could write code that traverses the XML template and searches for predetermined fields in attributes and elements then sets the values as needed. This means you would have to iterate over all the attributes and elements in each node and have some switch statement that checks for the template field name. If you encounter a template field and you are currently iterating attributes you would set it the way attributes are set, whereas if you were iterating elements you would set it the way elements are set. This is likely not the most efficient approach but is one solution.
The simplest solution would be to use VB.NET. You can always develop it as a stand-alone project, add a reference to the dll from the C# project, pass data to the VB.NET class and have it return the final XML.
EDIT: to clarify, using VB.NET doesn't bypass the need to update the template. It allows you to specify the layout easier as an XML literal. So code still needs to be updated in VB once the layout changes. You can't load an XML template from a text file and expect the fields to be bound that way. For a truly dynamic solution that allows you to write the code once and read different templates my first suggestion is more appropriate.
In this post I asked if there were any tools that compare the structure (not actual content) of 2 HTML pages. I ask because I receive HTML templates from our designers, and frequently miss minor formatting changes in my implementation. I then waste a few hours of designer time sifting through my pages to find my mistakes.
The thread offered some good suggestions, but there was nothing that fit the bill. "Fine, then", thought I, "I'll just crank one out myself. I'm a halfway-decent developer, right?".
Well, once I started to think about it, I couldn't quite figure out how to go about it. I can crank out a data-driven website easily enough, or do a CMS implementation, or throw documents in and out of BizTalk all day. Can't begin to figure out how to compare HTML docs.
Well, sure, I have to read the DOM, and iterate through the nodes. I have to map the structure to some data structure (how??), and then compare them (how??). It's a development task like none I've ever attempted.
So now that I've identified a weakness in my knowledge, I'm even more challenged to figure this out. Any suggestions on how to get started?
clarification: the actual content isn't what I want to compare -- the creative guys fill their pages with lorem ipsum, and I use real content. Instead, I want to compare structure:
<div class="foo">lorem ipsum<div>
is different that
<div class="foo"><p>lorem ipsum<p><div>
The DOM is a data structure - it's a tree.
Run both files through the following Perl script, then use diff -iw to do a case-insensitive, whitespace-ignoring diff.
#! /usr/bin/perl -w
use strict;
undef $/;
my $html = <STDIN>;
while ($html =~ /\S/) {
if ($html =~ s/^\s*<//) {
$html =~ s/^(.*?)>// or die "malformed HTML";
print "<$1>\n";
} else {
$html =~ s/^([^<]+)//;
print "(text)\n";
}
}
#Mike - that would compare everything, including the content of the page, which isn't want the original poster wanted.
Assuming that you have access to the browser's DOM (by writing a Firefox/IE plugin or whatever), I would probably put all of the HTML elements into a tree, then compare the two trees. If the tag name is different, then the node is different. You might want to stop enumerating at a certain point (you probably don't care about span, bold, italic, etc. - maybe only worry about divs?), since some tags are really the content, rather than the structure, of the page.
If i was to tacke this issue I would do this:
Plan for some kind of a DOM for html pages. starts at lightweight and then add more as needed. I would use composite pattern for the data structure. i.e. every element has children collection of the base class type.
Create a parser to parse html pages.
Using the parser load html element to the DOM.
After the pages' been loaded up to the DOM, you have the hierachical snapshot of your html pages structure.
Keep iterating through every element on both sides till the end of the DOM. You'll find the diff in the structure, when you hit a mismatched of element type.
In your example you would have only a div element object loaded on one side, on the other side you would have a div element object loaded with 1 child element of type paragraph element. fire up your iterator, first you'll match up the div element, second iterator you'll match up paragraph with nothing. You've got your structural difference.
I think some of the suggestions above don't take into account that there are other tags in the HTML between two pages which would be textually different, but the resulting HTML markup is functionally equivalent. Danimal lists control IDs as an example.
The following two markups are functionlly identical, but would show up as different if you simply compared tags:
<div id="ctl00_TopNavHome_DivHeader" class="header4">foo</div>
<div class="header4">foo</div>
I was going to suggest Danimal write an HTML translation which looks for the HTML tags and converts both docs into a simplified version of both which omits ID tags and any other tags you designate as irrelevant. This’d likely have to be a work in progress, as you ignore certain attributes/tags and then run into new ones which you also want to ignore.
However, I like the idea of using the XmlSchemaInterface to boil it down to the XML schema, then use a diff tool which understands XML rules.
See http://www.semdesigns.com/Products/SmartDifferencer/index.html for a tool that is parameterized by langauge grammar, and produces deltas in terms of language elements (identifiers, expressions, statements, blocks, methods, ...) inserted, deleted, moved, replaced, or has identifiers substituted across it consistently. This tool ignores whitespace reformatting (e.g., different linebreaks or layouts) and semantically indistinguishable values (e.g., it knows that 0x0F and 15 are the same value).
This can be applied to HTML using an HTML parser.
EDIT: 9/12/2009. We've built an experimental SmartDiff tool using an HTML editor.
http://www.mugo.ca/Products/Dom-Diff
Works with FF 3.5. I haven't tested FF 3.6 yet.
This has been an excellent start. A few more clarifications/comments:
I probably don't care about IDs, since .net will mangle them
some of the structure will be in a repeater or other such control, so I might end up having more or fewer repeating elements
further thought:
I think a good start would be to assume the html is XHTML compliant. I could then infer the schema (using the new .net XmlSchemaInference methods), then diff the schemata. I can then look at the differences and consider whether or not they're significant.
My suggestion is just the basic way of doing it... Of course to tackle the issue you mentioned additional rules must be applied here... Which is in your case, we got a matching div element, and then apply attributes/property matching rules and what not...
To be honest, there are many and complicated rules that need to be applied for the comparison, and its not just a simple matching element to another element. For example what happens if you have duplicates.
e.g. 1 div element on one side, and 2 div element on the other side. How are you gonna match up which div elements matches together?
There are alot other complicated issues that you will find in the comparison word. Im speaking based of experience (part of my job is to maitain my company text comparison engine).
Take a look at beyond compare. It has an XML comparison feature that can help you out.
You may also have to consider that the 'content' itself could contain additional mark-up so it's probably worth stripping out everything within certain elements (like <div>s with certain IDs or classes) before you do your comparison. For example:
<div id="mainContent">
<p>lorem ipsum etc..</p>
</div>
and
<div id="mainContent">
<p>Here is some real content<img class="someImage" src="someImage.jpg" /></p>
<ul>
<li>and</li>
<li>some</li>
<li>more..</li>
</ul>
</div>
Pretty Diff can do this. It will compare the code structure only regardless of differences to white space, comments, or even content. Just be sure to check the option "Normalize Content and String Literals".
http://prettydiff.com/
I would use (or contribute to) html5lib and its SAX output. Just zip through the 2 SAX streams looking for mismatches and highlight the whole corresponding subtree.
I don't know any tool but I know there is a simple way to do this:
First, use a regular expression tool to strip off all the text in your HTML file. You can use this regular expression to search for the text (?<=^|>)[^><]+?(?=<|$) and replace them with an empty string (""), i.e. delete all the text. After this step, you will have all HTML markup tags. There are a lot of free regular expression tools out there.
Then, you repeat the first step for the original HTML file.
Last, you use a diff tool to compare the two sets of HTML markups. This will show what is missing between one set and the other.
If i were to do this, first i would learn HTML. (^-^) Then i would build a tool that strips out all of the actual content and then saves that as a file so it can be piped through WinDiff (or other merge tool).
Open each page in the browser and save them as .htm files. Compare the two using windiff.