How can I add specific escape characters to xmlserializer? - c#

I have a method that serializes an object to xml and returns the string:
public static string SerializeType<T>(T item)
{
var serializer = new XmlSerializer(typeof(T));
var builder = new StringBuilder();
var settings = new XmlWriterSettings();
settings.OmitXmlDeclaration = true;
using (var stringWriter = XmlWriter.Create(builder, settings))
{
serializer.Serialize(stringWriter, item);
return builder.ToString();
}
}
However, it is not escaping all the reserved characters in the strings of objects I pass in. Microsoft lists the reserved characters as <>&% but when I input an item with an "a<ab>bc&cd%d" string field, it spits out "a&lt;ab&gt;bc&amp;cd%d". The % is not being escaped. How can I add the correct escape sequence for percent? The % causes an error when I send it to a client's app. The escaping listed on that page fixes the problem.

% isn't really a reserved character in XML. The documentation you've referred to is for SQL Server, and there's a small note under the table:
The Notification Services XML vocabulary reserves the percent sign (%) for denoting parameters.
But you shouldn't expect XmlSerializer (or any other general-purpose XML library) to escape % for you. Unless you're using "Notification Services XML" I wouldn't expect this to be a problem.
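If the client's app really does insist on an escaped percent sign, one option is to post-process the serialized string. The following is only a sketch: it assumes the client accepts the numeric character reference &#37; (which any conforming XML parser reads back as %), and that the output contains no DTD, comments, CDATA sections or processing instructions, where a blind replacement would change the meaning. The names PercentEscaper and EscapePercent are made up for illustration.

```csharp
using System;

public static class PercentEscaper
{
    // Hypothetical helper: rewrites every literal '%' in already-serialized XML
    // as the numeric character reference &#37;.
    // Assumption: the XML has no DTD, comments, CDATA or processing instructions.
    public static string EscapePercent(string xml)
    {
        if (xml == null) throw new ArgumentNullException(nameof(xml));
        return xml.Replace("%", "&#37;");
    }
}
```

It would be called on the result of the method from the question, e.g. PercentEscaper.EscapePercent(SerializeType(item)).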

Related

XmlSerializer escapes an added escape character

I'm using the XmlSerializer to output a class to a .xml file. For the most part, this is working as expected and intended. However, as a requirement, certain characters need to be removed from the values of the data and replaced with their proper escape characters.
In the elements I need to replace values in, I'm using the Replace() method and returning the updated string. The code below shows this string replacement; the lines commented out are because the XmlSerializer already escapes those particular characters.
I have a requirement from a third-party to escape &, <, >, ', and " characters when they appear within the values of the XML elements. Currently the characters &, <, and > are being escaped appropriately through the XmlSerializer.
The error received when these characters are present is:
Our system has detected a potential threat in the request message attachment.
However, when I serialize the XML document after performing the string replace, the XmlSerializer sees the & character in &apos; and turns it into &amp;apos;. I think this is correct behavior for the XmlSerializer. However, I would like the serializer to either a) ignore the escape characters, or b) escape the other characters that need escaping.
Can anyone shed some light on, specifically, how to accomplish either of these?
String Replacement Method
public static string CheckValueOfProperty(string str)
{
    string trimmedString = str.Trim();
    if (string.IsNullOrEmpty(trimmedString))
        return null;
    else
    {
        // Commented out because the serializer already transforms these characters into the appropriate escape sequences.
        //trimmedString = trimmedString.Replace("&", "&amp;");
        //trimmedString = trimmedString.Replace("<", "&lt;");
        //trimmedString = trimmedString.Replace(">", "&gt;");
        trimmedString = trimmedString.Replace("'", "&apos;");
        trimmedString = trimmedString.Replace("\"", "&quot;");
        return trimmedString;
    }
}
XmlSerializer Code
public static void SerializeAndOutput(object obj, string outputFilePath, XmlSerializerNamespaces ns = null)
{
XmlSerializer x = new XmlSerializer(obj.GetType());
// If the Output File already exists, delete it.
if (File.Exists(outputFilePath))
{
File.Delete(outputFilePath);
}
// Then, Create the Output File and Serialize the parameterized object as Xml to the Output File
using (TextWriter tw = File.CreateText(outputFilePath))
{
if (ns == null)
{
x.Serialize(tw, obj);
}
else { x.Serialize(tw, obj, ns); }
}
// =====================================================================
// The code below here is no longer needed; it was used to force "utf-8" to
// "UTF-8" to ensure the result was what was being expected.
// =====================================================================
// Create a new XmlDocument object, and load the contents of the OutputFile into the XmlDocument
// XmlDocument xdoc = new XmlDocument() { PreserveWhitespace = true };
// xdoc.Load(outputFilePath);
// Set the Encoding property of each XmlDeclaration in the document to "UTF-8";
// xdoc.ChildNodes.OfType<XmlDeclaration>().ToList().ForEach(d => d.Encoding = "UTF-8");
// Save the XmlDocument to the Output File Path.
// xdoc.Save(outputFilePath);
}
The single and double quote characters do not need to be escaped when used inside node content in XML. They only need to be escaped when used in the value of an attribute. That's why the XmlSerializer does not escape them, and you do not need to escape them either.
See this question and answer for reference.
BTW: The way you set the encoding to UTF-8 afterwards is awkward as well. You can specify the encoding on the StreamWriter, and the XmlSerializer will then automatically use that encoding and also specify it in the XML declaration.
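To illustrate the StreamWriter suggestion, here is a minimal sketch. The class and method names (Utf8XmlOutput, SerializeAsUtf8) are made up, and it leaves out the optional namespaces parameter from the question.

```csharp
using System.IO;
using System.Text;
using System.Xml.Serialization;

public static class Utf8XmlOutput
{
    // Hypothetical helper: serializes obj to outputFilePath as UTF-8.
    public static void SerializeAsUtf8(object obj, string outputFilePath)
    {
        var serializer = new XmlSerializer(obj.GetType());
        // append: false overwrites an existing file, so no File.Delete is needed.
        // UTF8Encoding(false) writes no byte order mark.
        using (var writer = new StreamWriter(outputFilePath, false, new UTF8Encoding(false)))
        {
            // The serializer takes the encoding name for the XML declaration
            // from the writer it is given.
            serializer.Serialize(writer, obj);
        }
    }
}
```

The declaration will carry whatever name the encoding reports for itself, which may be lowercase; XML treats encoding names case-insensitively, so that should not matter to a conforming reader.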
Here's the solution I came up with. I have only tested it with a sample XML file and not the actual XML file I'm creating, so performance may take a hit; however, this seems to be working.
I'm reading the XML file line-by-line as a string, and replacing any of the defined "special" characters found in the string with their appropriate escape characters. It should process in the order of the specialCharacterList Dictionary<string, string> variable, which means the & character should process first. When processing <, > and " characters, it will only look at the value of the XML element.
using System;
using System.Collections.Generic;
using System.IO;
namespace testSerializer
{
class Program
{
private static string filePath = AppDomain.CurrentDomain.BaseDirectory + "testFile.xml";
private static string tempFile = AppDomain.CurrentDomain.BaseDirectory + "tempFile.xml";
private static Dictionary<string, string> specialCharacterList = new Dictionary<string, string>()
{
{"&","&amp;"}, {"<","&lt;"}, {">","&gt;"}, {"'","&apos;"}, {"\"","&quot;"}
};
static void Main(string[] args)
{
ReplaceSpecialCharacters();
}
private static void ReplaceSpecialCharacters()
{
string[] allLines = File.ReadAllLines(filePath);
using (TextWriter tw = File.CreateText(tempFile))
{
foreach (string strLine in allLines)
{
string newLineString = "";
string originalString = strLine;
foreach (var item in specialCharacterList)
{
// Since these characters are all valid characters to be present in the XML,
// We need to look specifically within the VALUE of the XML Element.
if (item.Key == "\"" || item.Key == "<" || item.Key == ">")
{
// Find the ending character of the beginning XML tag.
int firstIndexOfCloseBracket = originalString.IndexOf('>');
// Find the beginning character of the ending XML tag.
int lastIndexOfOpenBracket = originalString.LastIndexOf('<');
if (lastIndexOfOpenBracket > firstIndexOfCloseBracket)
{
// Determine the length of the string between the XML tags.
int lengthOfStringBetweenBrackets = lastIndexOfOpenBracket - firstIndexOfCloseBracket;
// Retrieve the string that is between the element tags.
string valueOfElement = originalString.Substring(firstIndexOfCloseBracket + 1, lengthOfStringBetweenBrackets - 1);
newLineString = originalString.Substring(0, firstIndexOfCloseBracket + 1) + valueOfElement.Replace(item.Key, item.Value) + originalString.Substring(lastIndexOfOpenBracket);
}
}
// For the ampersand (&) and apostrophe (') characters, simply replace any found with the escape.
else
{
newLineString = originalString.Replace(item.Key, item.Value);
}
// Set the "original" string to the new version.
originalString = newLineString;
}
tw.WriteLine(newLineString);
}
}
}
}
}

C# format JSON with backslash '\' in value

I have some JSON from a third party system that contains backslashes in the value. For example:
string extract = @"{""key"": ""\/Date(2015-02-02)\/""}";
which without the c# string escaping corresponds to the string:
{"key": "\/Date(2015-02-02)\/"}
I'd like to be able to format (e.g. indent) this JSON.
Typically for formatting, I might use something like JsonConvert like so:
JsonConvert.SerializeObject(JsonConvert.DeserializeObject(extract), Formatting.Indented)
This doesn't quite work, as it sees the value as a date, but as it's not in the standard MS format of \/Date(ticks)\/, it goes to a date of 1 Jan 1970:
{
"key": "1970-01-01T00:00:02.015+00:00"
}
Next approach is to use the serializer settings to not convert dates (I'm not bothered whether it recognises the field as a date, although it would probably be handy later on):
JsonSerializerSettings settings = new JsonSerializerSettings
{
DateParseHandling = DateParseHandling.None,
};
JsonConvert.SerializeObject(JsonConvert.DeserializeObject(extract, settings), Formatting.Indented);
This appears to have treated the backslash as an escape character during the deserialization, so it is "lost" once I see the final result:
{
"key": "/Date(2015-02-02)/"
}
Is there a way that I can format the JSON in C# (with or without JsonConvert), that will preserve the backslash in the value?
Note that the real JSON I am dealing with is (a) reasonably large, but not too large for some regex/find-replace solution, if really necessary (b) not under my control, so I can't change the format. I'm sure the answer is already on StackOverflow, but I'm finding it difficult to find the right search terms...
Have you tried:
extract = extract.Replace("\\","\\\\");
before parsing the string?
The basic problem is that, in a JSON string literal, the escaped solidus "\/" means exactly the same as the unescaped solidus "/", and Json.NET parses and interprets this escaping at a very low level, namely JsonTextReader.ReadStringIntoBuffer(). Thus there's no way for higher level code to detect and remember whether a string literal was formatted as "\/Date(2015-02-02)\/" or "/Date(2015-02-02)/" and later write back one or the other as appropriate.
If you are OK with always adding the extra escaping to strings that start with /Date( and end with )/, you can use a custom subclass of JsonTextWriter to do this:
public class DateLiteralJsonTextWriter : JsonTextWriter
{
public DateLiteralJsonTextWriter(TextWriter writer) : base(writer) { }
public override void WriteValue(string value)
{
const string startToken = @"/Date(";
const string replacementStartToken = @"\/Date(";
const string endToken = @")/";
const string replacementEndToken = @")\/";
if (value != null && value.StartsWith(startToken) && value.EndsWith(endToken))
{
var sb = new StringBuilder();
// Add the initial quote.
sb.Append(QuoteChar);
// Add the new start token.
sb.Append(replacementStartToken);
// Add any necessary escaping to the innards of the "/Date(.*)/" string.
using (var writer = new StringWriter(sb))
using (var jsonWriter = new JsonTextWriter(writer) { StringEscapeHandling = this.StringEscapeHandling, Culture = this.Culture, QuoteChar = '\"' })
{
var content = value.Substring(startToken.Length, value.Length - startToken.Length - endToken.Length);
jsonWriter.WriteValue(content);
}
// Strip the embedded quotes from the above.
sb.Remove(replacementStartToken.Length + 1, 1);
sb.Remove(sb.Length - 1, 1);
// Add the replacement end token and final quote.
sb.Append(replacementEndToken);
sb.Append(QuoteChar);
// Write without any further escaping.
WriteRawValue(sb.ToString());
}
else
{
base.WriteValue(value);
}
}
}
Then parse with DateParseHandling = DateParseHandling.None as you are currently doing:
var settings = new JsonSerializerSettings { DateParseHandling = DateParseHandling.None };
var sb = new StringBuilder();
using (var writer = new StringWriter(sb))
using (var jsonWriter = new DateLiteralJsonTextWriter(writer) { Formatting = Formatting.Indented})
{
JsonSerializer.CreateDefault(settings).Serialize(jsonWriter, JsonConvert.DeserializeObject(extract, settings));
}
Console.WriteLine(sb);
This prints:
{
"key": "\/Date(2015-02-02)\/"
}

XmlTextWriter incorrectly writing control characters

.NET's XmlTextWriter creates invalid xml files.
In XML, some control characters are allowed, like 'horizontal tab' (&#9;), but others are not, like 'vertical tab' (&#11;). (See spec.)
I have a string which contains a UTF-8 control character that is not allowed in XML.
Although XmlTextWriter escapes the character, the resulting XML is of course still invalid.
How can I make sure that XmlTextWriter never produces an illegal XML file?
Or, if it's not possible to do this with XmlTextWriter, how can I strip the specific control characters that aren't allowed in XML from a string?
Example code:
using (XmlTextWriter writer =
new XmlTextWriter("test.xml", Encoding.UTF8))
{
writer.WriteStartDocument();
writer.WriteStartElement("Test");
writer.WriteValue("hello \xb world");
writer.WriteEndElement();
writer.WriteEndDocument();
}
Output:
<?xml version="1.0" encoding="utf-8"?><Test>hello &#xB; world</Test>
This documentation of a behaviour is hidden in the documentation of the WriteString method but it sounds like it applies to the whole class.
The default behavior of an XmlWriter created using Create is to throw
an ArgumentException when attempting to write character values in the
range 0x-0x1F (excluding white space characters 0x9, 0xA, and 0xD).
These invalid XML characters can be written by creating the XmlWriter
with the CheckCharacters property set to false. Doing so will result
in the characters being replaced with numeric character entities (&#0;
through &#0x1F). Additionally, an XmlTextWriter created with the new
operator will replace the invalid characters with numeric character
entities by default.
So it seems that you end up writing invalid characters because you are using the XmlTextWriter class. A better solution for you would be to use the XmlWriter Class instead.
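As a rough sketch of that behaviour: a writer obtained from XmlWriter.Create with default settings should refuse the vertical tab instead of writing an entity for it. The class and method names below are made up for illustration.

```csharp
using System;
using System.Text;
using System.Xml;

public static class CheckedXmlWriterDemo
{
    // Hypothetical helper: returns true if a default XmlWriter rejects the
    // text as containing a character that is not allowed in XML.
    public static bool IsRejected(string text)
    {
        var sb = new StringBuilder();
        // Default settings leave CheckCharacters set to true.
        using (var writer = XmlWriter.Create(sb))
        {
            writer.WriteStartElement("Test");
            try
            {
                writer.WriteString(text);
                return false;
            }
            catch (ArgumentException)
            {
                // Thrown for characters such as the vertical tab (0x0B).
                return true;
            }
        }
    }
}
```

So "hello \v world" is rejected, while a string containing only a tab, line feed or carriage return goes through.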
Just found this question when I was struggling with the same issue, and I ended up solving it with a regex:
return Regex.Replace(s, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");
Hope it helps someone as an alternative solution.
Built in .NET escapers such as SecurityElement.Escape don't properly escape/strip it either.
You could set CheckCharacters to false on both the writer and the reader if your application is the only one interacting with the file. The resulting XML file would still be technically invalid though.
See:
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
xmlWriterSettings.Encoding = new UTF8Encoding(false);
xmlWriterSettings.CheckCharacters = false;
var sb = new StringBuilder();
var w = XmlWriter.Create(sb, xmlWriterSettings);
w.WriteStartDocument();
w.WriteStartElement("Test");
w.WriteString("hello \xb world");
w.WriteEndElement();
w.WriteEndDocument();
w.Close();
var xml = sb.ToString();
If leaving CheckCharacters set to true (the default) is too strict, since it simply throws an exception, a more lenient alternative is to strip the invalid characters:
Googling a bit yielded the whitelist-based XmlTextEncoder; however, it also removes DEL and other characters in the ranges U+007F–U+0084 and U+0086–U+009F which, according to Valid XML Characters on Wikipedia, are only valid in certain contexts, and which the spec mentions as discouraged but still valid characters.
public static class XmlTextExtentions
{
private static readonly Dictionary<char, string> textEntities = new Dictionary<char, string> {
{ '&', "&amp;"}, { '<', "&lt;" }, { '>', "&gt;" },
{ '"', "&quot;" }, { '\'', "&apos;" }
};
public static string ToValidXmlString(this string str)
{
var stripped = str
.Select((c,i) => new
{
c1 = c,
c2 = i + 1 < str.Length ? str[i+1]: default(char),
v = XmlConvert.IsXmlChar(c),
p = i + 1 < str.Length ? XmlConvert.IsXmlSurrogatePair(str[i + 1], c) : false,
pp = i > 0 ? XmlConvert.IsXmlSurrogatePair(c, str[i - 1]) : false
})
.Aggregate("", (s, c) => {
if (c.pp)
return s;
if (textEntities.ContainsKey(c.c1))
s += textEntities[c.c1];
else if (c.v)
s += c.c1.ToString();
else if (c.p)
s += c.c1.ToString() + c.c2.ToString();
return s;
});
return stripped;
}
}
This passes all the XmlTextEncoder tests except for the one that expects it to strip DEL, which XmlConvert.IsXmlChar, Wikipedia, and the spec all mark as a valid (although discouraged) character.
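If only the stripping is needed, and the escaping of &, <, > and quotes is left to the XML writer, the same idea can be sketched with a plain loop and a StringBuilder instead of Select/Aggregate. This is a different design from the extension method above, not a drop-in replacement; the names XmlCharFilter and StripInvalidXmlChars are made up for illustration.

```csharp
using System.Text;
using System.Xml;

public static class XmlCharFilter
{
    // Hypothetical helper: removes characters that XML 1.0 does not allow,
    // keeping well-formed surrogate pairs intact. It does not escape anything.
    public static string StripInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            else if (i + 1 < text.Length && XmlConvert.IsXmlSurrogatePair(text[i + 1], c))
            {
                // c is a high surrogate followed by its low surrogate: keep both.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            // Anything else (control characters, lone surrogates) is dropped.
        }
        return sb.ToString();
    }
}
```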

XmlWriter inserting spaces when xml:space=preserve

Given this code (C#, .NET 3.5 SP1):
var doc = new XmlDocument();
doc.LoadXml("<?xml version=\"1.0\"?><root>"
+ "<value xml:space=\"preserve\">"
+ "<item>content</item>"
+ "<item>content</item>"
+ "</value></root>");
var text = new StringWriter();
var settings = new XmlWriterSettings() { Indent = true, CloseOutput = true };
using (var writer = XmlWriter.Create(text, settings))
{
doc.DocumentElement.WriteTo(writer);
}
var xml = text.GetStringBuilder().ToString();
Assert.AreEqual("<?xml version=\"1.0\" encoding=\"utf-16\"?>\r\n<root>\r\n"
+ " <value xml:space=\"preserve\"><item>content</item>"
+ "<item>content</item></value>\r\n</root>", xml);
The assertion fails because the XmlWriter is inserting a newline and indent around the <item> elements, which would seem to contradict the xml:space="preserve" attribute.
I am trying to take input with no whitespace (or only significant whitespace, and already loaded into an XmlDocument) and pretty-print it without adding any whitespace inside elements marked to preserve whitespace (for obvious reasons).
Is this a bug or am I doing something wrong? Is there a better way to achieve what I'm trying to do?
Edit: I should probably add that I do have to use an XmlWriter with Indent=true on the output side. In the "real" code, this is being passed in from outside of my code.
Ok, I've found a workaround.
It turns out that XmlWriter does the correct thing if there actually is any whitespace within the xml:space="preserve" block -- it's only when there isn't any that it screws up and adds some. And conveniently, this also works if there are some whitespace nodes, even if they're empty. So the trick that I've come up with is to decorate the document with extra 0-length whitespace in the appropriate places before trying to write it out. The result is exactly what I want: pretty printing everywhere except where whitespace is significant.
The workaround is to change the inner block to:
PreserveWhitespace(doc.DocumentElement);
doc.DocumentElement.WriteTo(writer);
...
private static void PreserveWhitespace(XmlElement root)
{
var nsmgr = new XmlNamespaceManager(root.OwnerDocument.NameTable);
foreach (var element in root.SelectNodes("//*[@xml:space='preserve']", nsmgr)
.OfType<XmlElement>())
{
if (element.HasChildNodes && !(element.FirstChild is XmlSignificantWhitespace))
{
var whitespace = element.OwnerDocument.CreateSignificantWhitespace("");
element.InsertBefore(whitespace, element.FirstChild);
}
}
}
I'm still thinking that this behaviour of XmlWriter is a bug, though.

Convert character entities to their unicode equivalents

I have HTML-encoded strings in a database, but many of the character entities are not just the standard &amp; and &lt;. There are entities like &ldquo; and &mdash;. Unfortunately we need to feed this data into a Flash-based RSS reader, and Flash doesn't read these entities, but it does read the Unicode equivalent (e.g. &#8220;).
Using .Net 4.0, is there any utility method that will convert the html encoded string to use unicode encoded character entities?
Here is a better example of what I need. The db has HTML strings like: <p>John &amp; Sarah went to see &ldquo;Scream 4&rdquo;.</p> and what I need to output in the RSS/XML document within the <description> tag is: <p>John &#38; Sarah went to see &#8220;Scream 4&#8221;.</p>
I'm using an XmlTextWriter to create the xml document from the database records similar to this example code http://www.dotnettutorials.com/tutorials/advanced/rss-feed-asp-net-csharp.aspx
So I need to replace all of the character entities within the HTML string from the db with their Unicode equivalent, because the Flash-based RSS reader doesn't recognize any entities beyond the most common ones, like &amp;.
My first thought is, can your RSS reader accept the actual characters? If so, you can use HtmlDecode and feed it directly in.
If you do need to convert it to the numeric representations, you could parse out each entity, HtmlDecode it, and then cast it to an int to get the base-10 unicode value. Then re-insert it into the string.
EDIT:
Here's some code to demonstrate what I mean (it is untested, but gets the idea across):
string input = "Something with &mdash; or other character entities.";
StringBuilder output = new StringBuilder(input.Length);
for (int i = 0; i < input.Length; i++)
{
    if (input[i] == '&')
    {
        int startOfEntity = i; // just for easier reading
        int endOfEntity = input.IndexOf(';', startOfEntity);
        // Include the terminating ';' so that HtmlDecode recognizes the entity.
        string entity = input.Substring(startOfEntity, endOfEntity - startOfEntity + 1);
        int unicodeNumber = (int)(HttpUtility.HtmlDecode(entity)[0]);
        output.Append("&#" + unicodeNumber + ";");
        i = endOfEntity; // continue parsing after the end of the entity
    }
    else
        output.Append(input[i]);
}
I may have an off-by-one error somewhere in there, but it should be close.
Would HttpUtility.HtmlDecode work for you?
I realize it doesn't convert to Unicode-equivalent entities, but instead converts to the Unicode characters themselves. Is there a specific reason you want the Unicode-equivalent entities?
Updated edit:
string test = "<p>John &amp; Sarah went to see &ldquo;Scream 4&rdquo;.</p>";
string decode = HttpUtility.HtmlDecode(test);
string encode = HttpUtility.HtmlEncode(decode);
StringBuilder builder = new StringBuilder();
foreach (char c in encode)
{
if ((int)c > 127)
{
builder.Append("&#");
builder.Append((int)c);
builder.Append(";");
}
else
{
builder.Append(c);
}
}
string result = builder.ToString();
You can download a local copy of the appropriate HTML and/or XHTML DTDs from the W3C. Then set up an XmlResolver and use it to expand any entities found in the document.
You could use a regular expression to find/expand the entities, but that won't know anything about context (e.g., anything in a CDATA section shouldn't be expanded).
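A minimal sketch of the regular-expression route, with exactly that limitation: it rewrites every named entity it finds as a decimal character reference and knows nothing about CDATA sections or other context. It assumes WebUtility.HtmlDecode (System.Net, available from .NET 4.0) recognizes the entities in your data; the names EntityConverter and ToNumericEntities are made up for illustration.

```csharp
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class EntityConverter
{
    // Matches named entities such as &amp; or &ldquo; (not numeric ones).
    private static readonly Regex NamedEntity =
        new Regex(@"&[A-Za-z][A-Za-z0-9]*;", RegexOptions.Compiled);

    // Hypothetical helper: replaces each named entity with its decimal
    // character reference, e.g. &ldquo; becomes &#8220;.
    public static string ToNumericEntities(string html)
    {
        return NamedEntity.Replace(html, match =>
        {
            string decoded = WebUtility.HtmlDecode(match.Value);
            // Entities the decoder does not know are left untouched.
            if (decoded == match.Value) return match.Value;

            var sb = new StringBuilder();
            // Step by code point so characters outside the BMP stay whole.
            for (int i = 0; i < decoded.Length; i += char.IsSurrogatePair(decoded, i) ? 2 : 1)
            {
                sb.Append("&#").Append(char.ConvertToUtf32(decoded, i)).Append(';');
            }
            return sb.ToString();
        });
    }
}
```

Tags and plain text pass through unchanged, because only the entity matches are rewritten.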
This might help you. Put the input path in the text box:
try
{
FileInfo n = new FileInfo(textBox1.Text);
string initContent = File.ReadAllText(textBox1.Text);
int contentLength = initContent.Length;
Match m;
while ((m = Regex.Match(initContent, "[^a-zA-Z0-9<>/\\s(&#\\d+;)-]")).Value != String.Empty)
initContent = initContent.Remove(m.Index, 1).Insert(m.Index, string.Format("&#{0};", (int)m.Value[0]));
File.WriteAllText("outputpath", initContent);
}
catch (System.Exception excep)
{
MessageBox.Show(excep.Message);
}
