C# XmlDocument with custom formatting

This code:
XmlNode columnNode = null;
columnNode = xmlDoc.CreateElement("SYSID");
columnNode.InnerText = ""; // Empty string
newRowNode.AppendChild(columnNode);
...does this:
<SYSID>
</SYSID>
And I would like to have this, when string is empty:
<SYSID></SYSID>
Is there any solution?

If you have another tool that requires that format, then the other tool is wrong - it is incapable of reading XML. So if you have control over the other tool, I'd suggest fixing it rather than trying to coerce your code into matching it.
If you can't fix the other tool...
If you're just building a Document to write it out to disk, then you can use a stream and write the elements directly yourself (as simple text). This will be faster (and may well be easier) than using an XmlDoc.
As an improvement on that, you may be able to use an XmlWriter to write elements, but when you go to write an empty element, write raw text to the stream (i.e. writer.WriteRaw("<SYSID></SYSID>\n")) so that you control the formatting for those particular elements.
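A minimal sketch of the WriteRaw approach, written as C# 9 top-level statements. The WriteRow helper and the ROW wrapper element are assumptions made up for illustration; only the SYSID element comes from the question.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper: writes one <ROW> containing a single SYSID column.
static string WriteRow(string sysId)
{
    var output = new StringWriter();
    var settings = new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true };
    using (var writer = XmlWriter.Create(output, settings))
    {
        writer.WriteStartElement("ROW");
        if (string.IsNullOrEmpty(sysId))
        {
            // Raw text bypasses the writer's own formatting of empty elements.
            writer.WriteRaw("<SYSID></SYSID>");
        }
        else
        {
            writer.WriteElementString("SYSID", sysId);
        }
        writer.WriteEndElement();
    }
    return output.ToString();
}

Console.WriteLine(WriteRow(""));
```

WriteRaw performs no escaping or well-formedness checks, so it is best kept to fixed literals like this one.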
If you need to build an in-memory XmlDocument, then to a large extent you have to put up with the formatting that it uses when you ask it to serialize to disk (aside from basic settings like PreserveWhitespace, you're asking the document to deal with storing the information, and so you lose a lot of control over the functionality that the XmlDocument encapsulates). The best suggestion I can think of in this case would be to write the XmlDocument to a MemoryStream and then post-process that memory stream to remove newlines from within empty elements. (Yuck!)
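The post-processing idea could look roughly like the sketch below. It saves to a StringWriter rather than a MemoryStream to keep the example short, and CollapseEmptyElements is a hypothetical helper, not part of any library. The regular expression is an assumption about what "empty" looks like: it also collapses elements that intentionally contain only whitespace, and it does not handle a '>' inside an attribute value.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper: removes whitespace between an opening tag and its own closing tag.
static string CollapseEmptyElements(string xml) =>
    Regex.Replace(
        xml,
        @"<([^\s<>/!?]+)((?:\s[^<>]*)?)(?<!/)>\s+</\1\s*>",
        "<$1$2></$1>");

var xmlDoc = new XmlDocument();
var row = xmlDoc.AppendChild(xmlDoc.CreateElement("ROW"));
var column = xmlDoc.CreateElement("SYSID");
column.InnerText = ""; // Empty string, as in the question
row.AppendChild(column);

var buffer = new StringWriter();
xmlDoc.Save(buffer);
Console.WriteLine(CollapseEmptyElements(buffer.ToString()));
```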

Related

How to get the XmlTextWriter to actually write the & (without CDATA)

I am using XmlTextWriter to serialize a bunch of my objects into HTML (since HTML is basically XML), and all of my objects are able to read/write themselves as XML anyway. The method works great except for one small snag. HTML has some entities that are not valid XML, such as &nbsp; for a space. The XmlTextWriter always converts this to &amp;nbsp;. I cannot wrap this in a CDATA section because the browser will simply ignore it; I literally need the XmlTextWriter to leave my & alone.

Have you tried XmlTextWriter.WriteRaw() to write those values?
I'm pretty sure this doesn't get escaped - not sure how this ties in with the code you've got though...
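A small sketch contrasting the two calls, written as C# 9 top-level statements. The Render helper and the p element are made up for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper: writes the same text either escaped or raw.
static string Render(bool raw)
{
    var output = new StringWriter();
    using (var writer = new XmlTextWriter(output))
    {
        writer.WriteStartElement("p");
        if (raw)
            writer.WriteRaw("a&nbsp;b");    // written verbatim
        else
            writer.WriteString("a&nbsp;b"); // the & is escaped
        writer.WriteEndElement();
    }
    return output.ToString();
}

Console.WriteLine(Render(false)); // <p>a&amp;nbsp;b</p>
Console.WriteLine(Render(true));  // <p>a&nbsp;b</p>
```

Because WriteRaw skips escaping entirely, any user-supplied text passed through it has to be escaped by hand.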

StringBuilder or XMLDocument?

I have written a console application to fetch some information from a web server, convert it into XML and save it. I have manually created XML (append string using StringBuilder). As the XML might be very large is it better to use StringBuilder or XMLDocument class etc as far as memory is concerned?
To be precise my question is that if XML is like 10mb text is it memory efficient to use StringBuilder.append("") or System.XML namespace?
I think a more efficient way would be to use StringBuilder but saving the XML to a file on HD after every iteration and clearing the stringbuilder object. Any comments?
Thanks in advance. :)

Neither; I'd use an XmlWriter:
using(var file = File.Create(path))
using(var writer = XmlWriter.Create(file))
{
// write to writer here
}
This avoids having to buffer a lot of data in memory, which both StringBuilder and XmlDocument would do, and avoids all the encoding problems you will face if creating the xml manually (not a good idea, to be honest).
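To make that concrete, here is a sketch that streams records to the output one at a time. The WriteItems helper, the items/item element names and the temp-file path are assumptions for illustration; in the console application the items would come from the web server responses.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper: each item is written as soon as it is enumerated,
// so the whole document never has to be held in memory.
static void WriteItems(Stream output, IEnumerable<string> items)
{
    var settings = new XmlWriterSettings { Indent = true };
    using (var writer = XmlWriter.Create(output, settings))
    {
        writer.WriteStartDocument();
        writer.WriteStartElement("items");
        foreach (var item in items)
        {
            // WriteElementString escapes <, > and & for us.
            writer.WriteElementString("item", item);
        }
        writer.WriteEndElement();
        writer.WriteEndDocument();
    }
}

string path = Path.Combine(Path.GetTempPath(), "items.xml");
using (var file = File.Create(path))
{
    WriteItems(file, new[] { "a < b & c", "plain" });
}
Console.WriteLine(File.ReadAllText(path));
```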

I wouldn't manually create XML using a StringBuilder; there is just too much room for error (proper escaping of strings, for one).
To write an XML file of larger size you should use the XmlWriter class.

If you have the XSD for the XML, I'd use xsd.exe (the XML Schema Definition Tool), which can generate C# classes from it. With those classes it's quite easy to serialize and deserialize the XML, so you don't have to work with long strings. Just build up your object and save it as valid XML text.

Loading XML Document - Name cannot begin with the zero character

I am trying to load something which claims to be an XML document into any type of .net XML object: XElement, XmlDocument, or XmlTextReader. All of them throw an exception :
Name cannot begin with the '0' character, hexadecimal value 0x30
The error relates to this bit of 'XML':
<chart_value
color="ff4400"
alpha="100"
size="12"
position="cursor"
decimal_char="."
0=""
/>
I believe the problem is the author should not have named an attribute as 0.
If I could change this I would, but I do not have control of this feed. I suppose those who use it are using more permissive tools. Is there any way I can load this as XML without throwing an error?
There is no XML declaration either, nor namespace or contract definition. I was thinking I might have to turn it into a string and do a replace, but this is not very elegant. I was wondering if there were any other options.

As many have said, this is not XML.
Having said that, it's almost XML and WANTS to be XML, so I don't think you should use a regex to screw around inside of it (here's why).
Wherever you're getting the stream, dump it into a string, change 0= to something like zero= and try parsing it.
Don't forget to reverse the operation if you have to return-to-sender.
If you're reading from a file, you can do something like this:
var txt = File.ReadAllText(@"\path\to\wannabe.xml");
var clean = txt.Replace("0=", "zero=");
var doc = new XmlDocument();
doc.LoadXml(clean);
This is not guaranteed to remove all potential XML problems -- but it should remove the one you have.

Just prefix the numeric attribute name with '_'.
Example: replace "0=" with "_0=".
I hope that will fix the problem, thanks.
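A plain string replace of "0=" would also hit names such as x10=, so a slightly narrower variant is sketched below. PrefixNumericAttributeNames is a hypothetical helper, and the pattern is an assumption: it only touches a digit-led name that follows whitespace and is followed by an equals sign and a double quote. It can still misfire on similar-looking text inside an attribute value.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper: turns 0="" into _0="" so the parser accepts the name.
static string PrefixNumericAttributeNames(string text) =>
    Regex.Replace(text, @"(?<=\s)(\d[\w.\-]*\s*=\s*"")", "_$1");

string feed = @"<chart_value
  color=""ff4400""
  alpha=""100""
  0=""""
/>";

var doc = new XmlDocument();
doc.LoadXml(PrefixNumericAttributeNames(feed));
Console.WriteLine(doc.DocumentElement.HasAttribute("_0")); // True
```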

It might claim to be an XML document, but the claim is clearly false, so you should reject the document.
The only good way to deal with bad XML is to find out what bit of software is producing it, and either fix it or throw it away. All the benefits of XML go out of the window if people start tolerating stuff that's nearly XML but not quite.

The 0="" obviously uses an invalid attribute name 0. You'd probably have to do a find/replace to try and fix the XML if you cannot fix it at the source that created it. You might be able to use RegEx to try to do more efficient manipulation of the XML string.

Parsing XML-ish data

Yes, I really am going to ask about parsing XML with regexes... here goes.
I have some XML-ish data, and I need to parse it. I can't do it completely with an XMLDocument or similar because it's not proper XML, and I'm not sure I can (or want to) change the format. The main problem is tags which have special meaning, and look like this:
<$ something_here $>
C#'s XmlDocument falls over parsing that, and I assume other methods will too. I could, with a lot of work, change the above to something like
<some_special_tag><![CDATA[ something_here ]]></some_special_tag>
But that's ugly, and I don't really want to. The reason it would be time consuming to change is that I have hundreds, maybe thousands of XML documents which would need to be changed.
At the moment, I'm parsing the document with regexes. I only need to pick out a couple of specific tags (not the ones above), and it seems to be working, but I'm uncomfortable with it. I'm doing something like this at the moment:
...
MatchCollection mc = Regex.Matches(Template, "<tagname.*?/tagname>"); // or similar
foreach (Match m in mc) {
try {
XmlDocument xd = new XmlDocument();
xd.LoadXml(m.Value);
...
This at least means I'm not using regexes exclusively :)
Can anyone think of a better way? Is there some way of getting XmlDocument to politely ignore the $ character that causes it to fall over? It doesn't seem likely, but I thought I should at least get some opinions.

No, there is no way to get XmlDocument to parse a document which isn't xml, no matter how close to xml it might look!
If it's possible to do, then I would definitely recommend that you convert your documents to be actual xml (or at least some recognised document format). Trying to create and maintain a reliable working parser for any format is quite a lot of work, let alone a format that doesn't appear to be rigorously defined.
Using a some_special_tag element to identify special sections seems like a good idea to me. If necessary you can use a different namespace to ensure no clashes with other elements in your document - this is in fact exactly the way that xslt works ("special" tags are used to mean special things, like templates or nodes that should be replaced) and exactly what xml was designed to support.
Also I don't understand why you would need to place the something_here bit in CDATA sections. All characters that "break" xml can be escaped fairly easily (for example by writing < as &lt;). CDATA sections are generally only used when the contents of a node need so much escaping that it's easier and less messy just to use CDATA sections instead.
Update: Regarding migration to a new format, can you not use both methods? Attempt to parse the document as an XML document (or if there are performance concerns then perform some other test to quickly determine if the document is in the "old" or "new" format such as checking for a version attribute in the root element) - if it doesn't work then fall back to the old method.
This way, as long as everything is working fine (which it will be as long as nothing changes), users don't need to modify their documents; however, if they run into problems or want to use any new features, then explain to them that they must update their document to the new format.
Depending on how well your current "parser" works, you may even be able to provide an upgrade utility that automatically performs the conversion (as best it can).

Can't you replace <$ something_here $> with that big CDATA section at run-time and then load the XML document as usual?
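A sketch of that run-time replacement, assuming the special tags never nest and never contain the sequence ]]> (which would end the CDATA section early). WrapSpecialTags is a hypothetical helper; the some_special_tag name is taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper: rewrites <$ ... $> into an element holding a CDATA section.
static string WrapSpecialTags(string template) =>
    Regex.Replace(
        template,
        @"<\$(.*?)\$>",
        m => "<some_special_tag><![CDATA[" + m.Groups[1].Value + "]]></some_special_tag>",
        RegexOptions.Singleline);

var doc = new XmlDocument();
doc.LoadXml(WrapSpecialTags("<root><$ something_here $></root>"));
Console.WriteLine(doc.DocumentElement.FirstChild.Name); // some_special_tag
```

The files on disk stay in their current format; only the in-memory copy is converted before parsing.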

How to tell if a string is xml?

We have a string field which can contain XML or plain text. The XML contains no <?xml header, and no root element, i.e. is not well formed.
We need to be able to redact XML data, emptying element and attribute values, leaving just their names, so I need to test if this string is XML before it's redacted.
Currently I'm using this approach:
string redact(string eventDetail)
{
string detail = eventDetail.Trim();
if (!detail.StartsWith("<") && !detail.EndsWith(">")) return eventDetail;
...
Is there a better way?
Are there any edge cases this approach could miss?
I appreciate I could use XmlDocument.LoadXml and catch XmlException, but this feels like an expensive option, since I already know that a lot of the data will not be in XML.
Here's an example of the XML data, apart from missing a root element (which is omitted to save space, since there will be a lot of data), we can assume it is well formed:
<TableName FirstField="Foo" SecondField="Bar" />
<TableName FirstField="Foo" SecondField="Bar" />
...
Currently we are only using attribute based values, but we may use elements in the future if the data becomes more complex.
SOLUTION
Based on multiple comments (thanks guys!)
string redact(string eventDetail)
{
if (string.IsNullOrEmpty(eventDetail)) return eventDetail; //+1 for unit tests :)
string detail = eventDetail.Trim();
if (!detail.StartsWith("<") && !detail.EndsWith(">")) return eventDetail;
XmlDocument xml = new XmlDocument();
try
{
xml.LoadXml(string.Format("<Root>{0}</Root>", detail));
}
catch (XmlException e)
{
log.WarnFormat("Data NOT redacted. Caught {0} loading eventDetail {1}", e.Message, eventDetail);
return eventDetail;
}
... // redact
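The elided redaction step is not shown in the question, so the following is only one guess at how it might look. It blanks every attribute value and text node via XPath and returns the content without the temporary Root wrapper. Note that this sketch rejects input with || where the code above uses &&, so anything missing either the leading < or the trailing > is returned without a parse attempt; logging is left out.

```csharp
using System;
using System.Linq;
using System.Xml;

// Hypothetical completion of the redact method from the question.
static string Redact(string eventDetail)
{
    if (string.IsNullOrEmpty(eventDetail)) return eventDetail;
    string detail = eventDetail.Trim();
    if (!detail.StartsWith("<") || !detail.EndsWith(">")) return eventDetail;

    var xml = new XmlDocument();
    try
    {
        xml.LoadXml("<Root>" + detail + "</Root>");
    }
    catch (XmlException)
    {
        return eventDetail; // not XML after all, leave it untouched
    }

    // Copy the matches first so the document is not modified while being queried.
    var nodes = xml.SelectNodes("//@* | //text()").Cast<XmlNode>().ToList();
    foreach (var node in nodes)
    {
        node.Value = "";
    }
    return xml.DocumentElement.InnerXml;
}

Console.WriteLine(Redact("<TableName FirstField=\"Foo\" SecondField=\"Bar\" />"));
```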

If you're going to accept not well formed XML in the first place, I think catching the exception is the best way to handle it.

One possibility is to mix both solutions. You can use your redact method and try to load it (inside the if). This way, you'll only try to load what is likely to be a well-formed xml, and discard most of the non-xml entries.

If your goal is reliability then the best option is to use XmlDocument.LoadXml to determine if it's valid XML or not. A full parse of the data may be expensive but it's the only way to reliably tell if it's valid XML or not. Otherwise any character you don't examine in the buffer could cause the data to be illegal XML.

Depends on how accurate a test you want. Considering that you already don't have the official <?xml declaration, you're already trying to detect something that isn't XML. Ideally you'd parse the text with a full XML parser (LoadXml, as you suggest); anything it rejects isn't XML. The question is, do you care if you accept a non-XML string? For instance,
are you OK with accepting
<the quick brown fox jumped over the lazy dog's back>
as XML and stripping it? If so, your technique is fine. If not, you have to decide how tight a test you want and code a recognizer with that degree of tightness.

How is the data coming to you? What is the other type of data surrounding it? Perhaps there is a better way; perhaps you can tokenise the data you control, and then infer that anything that is not within those tokens is XML, but we'd need to know more.
Failing a cute solution like that, I think what you have is fine (for validating that it starts and ends with those characters).
We need to know more about the data format really.

If the XML contains no root element (i.e. it's an XML fragment, not a full document), then the following would be perfectly valid sample, as well - but wouldn't match your detector:
foo<bar/>baz
In fact, any text string would be valid XML fragment (consider if the original XML document was just the root element wrapping some text, and you take the root element tags away)!
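This can be checked with an XmlReader set to fragment conformance, which accepts multiple top-level nodes and top-level text without a wrapper element. The IsWellFormedFragment helper is an assumption for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper: true if the text parses as a well-formed XML fragment.
static bool IsWellFormedFragment(string text)
{
    var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
    try
    {
        using (var reader = XmlReader.Create(new StringReader(text), settings))
        {
            while (reader.Read()) { } // read to the end; malformed input throws
        }
        return true;
    }
    catch (XmlException)
    {
        return false;
    }
}

Console.WriteLine(IsWellFormedFragment("foo<bar/>baz")); // True
Console.WriteLine(IsWellFormedFragment("just text"));    // True
Console.WriteLine(IsWellFormedFragment("<a><b></a>"));   // False
```

The second result illustrates the point above: plain text passes as a fragment, so a well-formedness check alone cannot separate "XML" from "text" here.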

try
{
XmlDocument myDoc = new XmlDocument();
myDoc.LoadXml(myString);
}
catch(XmlException ex)
{
//take care of the exception
}
