.NET's XmlTextWriter creates invalid xml files.
In XML, some control characters are allowed, like 'horizontal tab' ( ), but others are not, like 'vertical tab' (). (See spec.)
I have a string which contains a UTF-8 control character that is not allowed in XML.
Although XmlTextWriter escapes the character, the resulting XML is ofcourse still invalid.
How can I make sure that XmlTextWriter never produces an illegal XML file?
Or, if it's not possible to do this with XmlTextWriter, how can I strip the specific control characters that aren't allowed in XML from a string?
Example code:
using (XmlTextWriter writer =
new XmlTextWriter("test.xml", Encoding.UTF8))
{
writer.WriteStartDocument();
writer.WriteStartElement("Test");
writer.WriteValue("hello \xb world");
writer.WriteEndElement();
writer.WriteEndDocument();
}
Output:
<?xml version="1.0" encoding="utf-8"?><Test>hello world</Test>
This documentation of a behaviour is hidden in the documentation of the WriteString method but it sounds like it applies to the whole class.
The default behavior of an XmlWriter created using Create is to throw
an ArgumentException when attempting to write character values in the
range 0x-0x1F (excluding white space characters 0x9, 0xA, and 0xD).
These invalid XML characters can be written by creating the XmlWriter
with the CheckCharacters property set to false. Doing so will result
in the characters being replaced with numeric character entities (
through �x1F). Additionally, an XmlTextWriter created with the new
operator will replace the invalid characters with numeric character
entities by default.
So it seems that you end up writing invalid characters because you are using the XmlTextWriter class. A better solution for you would be to use the XmlWriter Class instead.
Just found this question when I was struggling with the same issue and I ended up solving it with an regex:
return Regex.Replace(s, #"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");
Hope it helps someone as an alternative solution.
Built in .NET escapers such as SecurityElement.Escape don't properly escape/strip it either.
You could set CheckCharacters to false on both the writer and the reader if your application is the only one interacting with the file. The resulting XML file would still be technically invalid though.
See:
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
xmlWriterSettings.Encoding = new UTF8Encoding(false);
xmlWriterSettings.CheckCharacters = false;
var sb = new StringBuilder();
var w = XmlWriter.Create(sb, xmlWriterSettings);
w.WriteStartDocument();
w.WriteStartElement("Test");
w.WriteString("hello \xb world");
w.WriteEndElement();
w.WriteEndDocument();
w.Close();
var xml = sb.ToString();
If setting CheckCharacters to true(which it is by default) is a bit too strict since it will simply throw an exception an alternative approach that's more lenient to invalid XML characters would be to just strip them:
Googling a bit yielded the whitelist XmlTextEncoder however it'll also remove DEL and others in the range U+007F–U+0084, U+0086–U+009F that according to Valid XML Characters on wikipedia are only valid in certain contexts and which the RFC mentions as discouraged but still valid characters.
public static class XmlTextExtentions
{
private static readonly Dictionary<char, string> textEntities = new Dictionary<char, string> {
{ '&', "&"}, { '<', "<" }, { '>', ">" },
{ '"', """ }, { '\'', "'" }
};
public static string ToValidXmlString(this string str)
{
var stripped = str
.Select((c,i) => new
{
c1 = c,
c2 = i + 1 < str.Length ? str[i+1]: default(char),
v = XmlConvert.IsXmlChar(c),
p = i + 1 < str.Length ? XmlConvert.IsXmlSurrogatePair(str[i + 1], c) : false,
pp = i > 0 ? XmlConvert.IsXmlSurrogatePair(c, str[i - 1]) : false
})
.Aggregate("", (s, c) => {
if (c.pp)
return s;
if (textEntities.ContainsKey(c.c1))
s += textEntities[c.c1];
else if (c.v)
s += c.c1.ToString();
else if (c.p)
s += c.c1.ToString() + c.c2.ToString();
return s;
});
return stripped;
}
}
This passes all the XmlTextEncoder tests except for the one that expects it to strip DEL which XmlConvert.IsXmlChar, Wikipedia, and the spec marks as a valid (although discouraged) character.
Related
I have some xml files where some control sequences are included in the text: EOT,ETX(anotherchar)
The other char following EOT comma ETX is not always present and not always the same.
Actual example:
<FatturaElettronicaHeader xmlns="">
</F<EOT>‚<ETX>èatturaElettronicaHeader>
Where <EOT> is the 04 char and <ETX> is 03. As I have to parse the xml this is actually a big issue.
Is this some kind of encoding I never heard about?
I have tried to remove all the control characters from my string but it will leave the comma that is still unwanted.
If I use Encoding.ASCII.GetString(file); the unwanted characters will be replaced with a '?' that is easy to remove but it will still leave some unwanted characters causing parse issues:
<BIC></WBIC> something like this.
string xml = Encoding.ASCII.GetString(file);
xml = new string(xml.Where(cc => !char.IsControl(cc)).ToArray());
I hence need to remove all this kind of control character sequences to be able to parse this kind of files and I'm unsure about how to programmatically check if a character is part of a control sequence or not.
I have find out that there are 2 wrong patterns in my files: the first is the one in the title and the second is EOT<.
In order to make it work I looked at this thread: Remove substring that starts with SOT and ends EOT, from string
and modified the code a little
private static string RemoveInvalidCharacters(string input)
{
while (true)
{
var start = input.IndexOf('\u0004');
if (start == -1) break;
if (input[start + 1] == '<')
{
input = input.Remove(start, 2);
continue;
}
if (input[start + 2] == '\u0003')
{
input = input.Remove(start, 4);
}
}
return input;
}
A further cleanup with this code:
static string StripExtended(string arg)
{
StringBuilder buffer = new StringBuilder(arg.Length); //Max length
foreach (char ch in arg)
{
UInt16 num = Convert.ToUInt16(ch);//In .NET, chars are UTF-16
//The basic characters have the same code points as ASCII, and the extended characters are bigger
if ((num >= 32u) && (num <= 126u)) buffer.Append(ch);
}
return buffer.ToString();
}
And now everything looks fine to parse.
sorry for the delay in responding,
but in my opinion the root of the problem might be an incorrect decoding of a p7m file.
I think originally the xml file you are trying to sanitize was a .xml.p7m file.
I believe the correct way to sanitize the file is by using a library such as Buoncycastle in java or dotnet and the class CmsSignedData.
CmsSignedData cmsObj = new CmsSignedData(content);
if (cmsObj.SignedContent != null)
{
using (var stream = new MemoryStream())
{
cmsObj.SignedContent.Write(stream);
content = stream.ToArray();
}
}
I'm using the XmlSerializer to output a class to a .xml file. For the most part, this is working as expected and intended. However, as a requirement, certain characters need to be removed from the values of the data and replaced with their proper escape characters.
In the elements I need to replace values in, I'm using the Replace() method and returning the updated string. The code below shows this string replacement; the lines commented out are because the XmlSerializer already escapes those particular characters.
I have a requirement from a third-party to escape &, <, >, ', and " characters when they appear within the values of the XML elements. Currently the characters &, <, and > are being escaped appropriately through the XmlSerializer.
The error received when these characters are present is:
Our system has detected a potential threat in the request message attachment.
However, when I serialize the XML Document after performing the string replace, the XmlSerializer sees the & character in ' and makes it '. I think this is a correct functionality of the XmlSerializer object. However, I would like the serializer to either a.) ignore the escape characters; or b.) serialize the other characters which are necessary to escape.
Can anyone shed some light on, specifically, how to accomplish either of these?
String Replacement Method
public static string CheckValueOfProperty(string str)
{
string trimmedString = str.Trim();
if (string.IsNullOrEmpty(trimmedString))
return null;
else
{
// Commented out because the Serializer already transforms a '&' character into the appropriate escape character.
//trimmedString = trimmedString .Replace("&", "&");
//trimmedString = trimmedString.Replace("<", "<");
//trimmedString = trimmedString.Replace(">", ">");
trimmedString = trimmedString.Replace("'", "'");
trimmedString = trimmedString.Replace("\"", """);
return trimmedString;
}
}
XmlSerializer Code
public static void SerializeAndOutput(object obj, string outputFilePath, XmlSerializerNamespaces ns = null)
{
XmlSerializer x = new XmlSerializer(obj.GetType());
// If the Output File already exists, delete it.
if (File.Exists(outputFilePath))
{
File.Delete(outputFilePath);
}
// Then, Create the Output File and Serialize the parameterized object as Xml to the Output File
using (TextWriter tw = File.CreateText(outputFilePath))
{
if (ns == null)
{
x.Serialize(tw, obj);
}
else { x.Serialize(tw, obj, ns); }
}
// =====================================================================
// The code below here is no longer needed, was used to force "utf-8" to
// UTF-8" to ensure the result was what was being expected.
// =====================================================================
// Create a new XmlDocument object, and load the contents of the OutputFile into the XmlDocument
// XmlDocument xdoc = new XmlDocument() { PreserveWhitespace = true };
// xdoc.Load(outputFilePath);
// Set the Encoding property of each XmlDeclaration in the document to "UTF-8";
// xdoc.ChildNodes.OfType<XmlDeclaration>().ToList().ForEach(d => d.Encoding = "UTF-8");
// Save the XmlDocument to the Output File Path.
// xdoc.Save(outputFilePath);
}
The single and double quote characters do not need to be escaped when used inside the node content in XML. The single quote or double quote characters only need to be escaped when used in a value of a node attribute. That's why the XMLSerializer does not escape them. And you also do not need to escape them.
See this question and answer for reference.
BTW: The way you set the Encoding to UTF-8 afterwards, is awkward as well. You can specify the encoding with the StreamWriter and then the XMLSerializer will automatically use that encoding and also specify it in the XML declaration.
Here's the solution I came up with. I have only tested it with a sample XML file and not the actual XML file I'm creating, so performance may take a hit; however, this seems to be working.
I'm reading the XML file line-by-line as a string, and replacing any of the defined "special" characters found in the string with their appropriate escape characters. It should process in the order of the specialCharacterList Dictionary<string, string> variable, which means the & character should process first. When processing <, > and " characters, it will only look at the value of the XML element.
using System;
using System.Collections.Generic;
using System.IO;
namespace testSerializer
{
class Program
{
private static string filePath = AppDomain.CurrentDomain.BaseDirectory + "testFile.xml";
private static string tempFile = AppDomain.CurrentDomain.BaseDirectory + "tempFile.xml";
private static Dictionary<string, string> specialCharacterList = new Dictionary<string, string>()
{
{"&","&"}, {"<","<"}, {">",">"}, {"'","'"}, {"\"","""}
};
static void Main(string[] args)
{
ReplaceSpecialCharacters();
}
private static void ReplaceSpecialCharacters()
{
string[] allLines = File.ReadAllLines(filePath);
using (TextWriter tw = File.CreateText(tempFile))
{
foreach (string strLine in allLines)
{
string newLineString = "";
string originalString = strLine;
foreach (var item in specialCharacterList)
{
// Since these characters are all valid characters to be present in the XML,
// We need to look specifically within the VALUE of the XML Element.
if (item.Key == "\"" || item.Key == "<" || item.Key == ">")
{
// Find the ending character of the beginning XML tag.
int firstIndexOfCloseBracket = originalString.IndexOf('>');
// Find the beginning character of the ending XML tag.
int lastIndexOfOpenBracket = originalString.LastIndexOf('<');
if (lastIndexOfOpenBracket > firstIndexOfCloseBracket)
{
// Determine the length of the string between the XML tags.
int lengthOfStringBetweenBrackets = lastIndexOfOpenBracket - firstIndexOfCloseBracket;
// Retrieve the string that is between the element tags.
string valueOfElement = originalString.Substring(firstIndexOfCloseBracket + 1, lengthOfStringBetweenBrackets - 1);
newLineString = originalString.Substring(0, firstIndexOfCloseBracket + 1) + valueOfElement.Replace(item.Key, item.Value) + originalString.Substring(lastIndexOfOpenBracket);
}
}
// For the ampersand (&) and apostrophe (') characters, simply replace any found with the escape.
else
{
newLineString = originalString.Replace(item.Key, item.Value);
}
// Set the "original" string to the new version.
originalString = newLineString;
}
tw.WriteLine(newLineString);
}
}
}
}
}
Is it possible to have ';' not delimiting comments only when it is on the right hand side of an equal side?
I have the following INI file:
; some comment that should be ignored
[section]
key=value1;value2;value2
With the following code, Nini is removing value2;value3 as it is regarded as a comment:
using (TextReader tr = new StreamReader(iniFile))
{
IniDocument doc = new IniDocument(tr);
foreach (DictionaryEntry entry in doc.Sections)
{
string key = (string)entry.Key;
IniSection section = (IniSection)entry.Value;
if (section.Contains("key"))
{
// ... do stuff
}
}
}
Of course, I can do something like
IniReader ir = new IniReader(tr);
ir.SetCommentDelimiters(new char[] { '!' });
IniDocument doc = new IniDocument(ir);
but then also the initial comment will be treated as a config file and result into an error ("expecting =").
A quick scan of the code shows some useful properties you might be able to use:
result.AcceptCommentAfterKey = false;
result.SetCommentDelimiters (new char[] { ';', '#' });
Maybe setting that AcceptCommentAfterKey to false will help you? Otherwise, you could override the comment delimiter and replace the symbol you use for delimiting comments with whatever you want.
I have a method that serializes an object to xml and returns the string:
public static string SerializeType<T>(T item)
{
var serializer = new XmlSerializer(typeof(T));
var builder = new StringBuilder();
var settings = new XmlWriterSettings();
settings.OmitXmlDeclaration = true;
using (var stringWriter = XmlWriter.Create(builder, settings))
{
serializer.Serialize(stringWriter, item);
return builder.ToString();
}
}
However, it is not removing all the reserved characters from strings in objects I pass in. Microsoft lists the Reserved Characters as <>&% but when I input an item with a "abc&cd%d" string field, it spits out "a < ;ab> ;bc& ;cd%d" without out the spaces preceding the semicolons. % is not being escaped. How can I add the correct escape sequence for percent? The % causes an error when I send it to a client's app. The escaping listed on that page fixes the problem.
% isn't really a reserved character in XML. The documentation you've referred to is for SQL server, and there's a small note under the table:
The Notification Services XML vocabulary reserves the percent sign (%) for denoting parameters.
But you shouldn't expect XmlSerializer (or any other general-purpose XML library) to escape % for you. Unless you're using "Notification Services XML" I wouldn't expect this to be a problem.
Given this code (C#, .NET 3.5 SP1):
var doc = new XmlDocument();
doc.LoadXml("<?xml version=\"1.0\"?><root>"
+ "<value xml:space=\"preserve\">"
+ "<item>content</item>"
+ "<item>content</item>"
+ "</value></root>");
var text = new StringWriter();
var settings = new XmlWriterSettings() { Indent = true, CloseOutput = true };
using (var writer = XmlWriter.Create(text, settings))
{
doc.DocumentElement.WriteTo(writer);
}
var xml = text.GetStringBuilder().ToString();
Assert.AreEqual("<?xml version=\"1.0\" encoding=\"utf-16\"?>\r\n<root>\r\n"
+ " <value xml:space=\"preserve\"><item>content</item>"
+ "<item>content</item></value>\r\n</root>", xml);
The assertion fails because the XmlWriter is inserting a newline and indent around the <item> elements, which would seem to contradict the xml:space="preserve" attribute.
I am trying to take input with no whitespace (or only significant whitespace, and already loaded into an XmlDocument) and pretty-print it without adding any whitespace inside elements marked to preserve whitespace (for obvious reasons).
Is this a bug or am I doing something wrong? Is there a better way to achieve what I'm trying to do?
Edit: I should probably add that I do have to use an XmlWriter with Indent=true on the output side. In the "real" code, this is being passed in from outside of my code.
Ok, I've found a workaround.
It turns out that XmlWriter does the correct thing if there actually is any whitespace within the xml:space="preserve" block -- it's only when there isn't any that it screws up and adds some. And conveniently, this also works if there are some whitespace nodes, even if they're empty. So the trick that I've come up with is to decorate the document with extra 0-length whitespace in the appropriate places before trying to write it out. The result is exactly what I want: pretty printing everywhere except where whitespace is significant.
The workaround is to change the inner block to:
PreserveWhitespace(doc.DocumentElement);
doc.DocumentElement.WriteTo(writer);
...
private static void PreserveWhitespace(XmlElement root)
{
var nsmgr = new XmlNamespaceManager(root.OwnerDocument.NameTable);
foreach (var element in root.SelectNodes("//*[#xml:space='preserve']", nsmgr)
.OfType<XmlElement>())
{
if (element.HasChildNodes && !(element.FirstChild is XmlSignificantWhitespace))
{
var whitespace = element.OwnerDocument.CreateSignificantWhitespace("");
element.InsertBefore(whitespace, element.FirstChild);
}
}
}
I'm still thinking that this behaviour of XmlWriter is a bug, though.