XmlSerializer escapes an added escape character - c#

I'm using the XmlSerializer to output a class to a .xml file. For the most part, this is working as expected and intended. However, as a requirement, certain characters need to be removed from the values of the data and replaced with their proper escape characters.
In the elements I need to replace values in, I'm using the Replace() method and returning the updated string. The code below shows this string replacement; the lines commented out are because the XmlSerializer already escapes those particular characters.
I have a requirement from a third-party to escape &, <, >, ', and " characters when they appear within the values of the XML elements. Currently the characters &, <, and > are being escaped appropriately through the XmlSerializer.
The error received when these characters are present is:
Our system has detected a potential threat in the request message attachment.
However, when I serialize the XML Document after performing the string replace, the XmlSerializer sees the & character in &apos; and makes it &apos;. I think this is a correct functionality of the XmlSerializer object. However, I would like the serializer to either a.) ignore the escape characters; or b.) serialize the other characters which are necessary to escape.
Can anyone shed some light on, specifically, how to accomplish either of these?
String Replacement Method
public static string CheckValueOfProperty(string str)
{
string trimmedString = str.Trim();
if (string.IsNullOrEmpty(trimmedString))
return null;
else
{
// Commented out because the Serializer already transforms a '&' character into the appropriate escape character.
//trimmedString = trimmedString .Replace("&", "&");
//trimmedString = trimmedString.Replace("<", "<");
//trimmedString = trimmedString.Replace(">", ">");
trimmedString = trimmedString.Replace("'", "&apos;");
trimmedString = trimmedString.Replace("\"", """);
return trimmedString;
}
}
XmlSerializer Code
public static void SerializeAndOutput(object obj, string outputFilePath, XmlSerializerNamespaces ns = null)
{
XmlSerializer x = new XmlSerializer(obj.GetType());
// If the Output File already exists, delete it.
if (File.Exists(outputFilePath))
{
File.Delete(outputFilePath);
}
// Then, Create the Output File and Serialize the parameterized object as Xml to the Output File
using (TextWriter tw = File.CreateText(outputFilePath))
{
if (ns == null)
{
x.Serialize(tw, obj);
}
else { x.Serialize(tw, obj, ns); }
}
// =====================================================================
// The code below here is no longer needed, was used to force "utf-8" to
// UTF-8" to ensure the result was what was being expected.
// =====================================================================
// Create a new XmlDocument object, and load the contents of the OutputFile into the XmlDocument
// XmlDocument xdoc = new XmlDocument() { PreserveWhitespace = true };
// xdoc.Load(outputFilePath);
// Set the Encoding property of each XmlDeclaration in the document to "UTF-8";
// xdoc.ChildNodes.OfType<XmlDeclaration>().ToList().ForEach(d => d.Encoding = "UTF-8");
// Save the XmlDocument to the Output File Path.
// xdoc.Save(outputFilePath);
}

The single and double quote characters do not need to be escaped when used inside the node content in XML. The single quote or double quote characters only need to be escaped when used in a value of a node attribute. That's why the XMLSerializer does not escape them. And you also do not need to escape them.
See this question and answer for reference.
BTW: The way you set the Encoding to UTF-8 afterwards, is awkward as well. You can specify the encoding with the StreamWriter and then the XMLSerializer will automatically use that encoding and also specify it in the XML declaration.

Here's the solution I came up with. I have only tested it with a sample XML file and not the actual XML file I'm creating, so performance may take a hit; however, this seems to be working.
I'm reading the XML file line-by-line as a string, and replacing any of the defined "special" characters found in the string with their appropriate escape characters. It should process in the order of the specialCharacterList Dictionary<string, string> variable, which means the & character should process first. When processing <, > and " characters, it will only look at the value of the XML element.
using System;
using System.Collections.Generic;
using System.IO;
namespace testSerializer
{
class Program
{
private static string filePath = AppDomain.CurrentDomain.BaseDirectory + "testFile.xml";
private static string tempFile = AppDomain.CurrentDomain.BaseDirectory + "tempFile.xml";
private static Dictionary<string, string> specialCharacterList = new Dictionary<string, string>()
{
{"&","&"}, {"<","<"}, {">",">"}, {"'","&apos;"}, {"\"","""}
};
static void Main(string[] args)
{
ReplaceSpecialCharacters();
}
private static void ReplaceSpecialCharacters()
{
string[] allLines = File.ReadAllLines(filePath);
using (TextWriter tw = File.CreateText(tempFile))
{
foreach (string strLine in allLines)
{
string newLineString = "";
string originalString = strLine;
foreach (var item in specialCharacterList)
{
// Since these characters are all valid characters to be present in the XML,
// We need to look specifically within the VALUE of the XML Element.
if (item.Key == "\"" || item.Key == "<" || item.Key == ">")
{
// Find the ending character of the beginning XML tag.
int firstIndexOfCloseBracket = originalString.IndexOf('>');
// Find the beginning character of the ending XML tag.
int lastIndexOfOpenBracket = originalString.LastIndexOf('<');
if (lastIndexOfOpenBracket > firstIndexOfCloseBracket)
{
// Determine the length of the string between the XML tags.
int lengthOfStringBetweenBrackets = lastIndexOfOpenBracket - firstIndexOfCloseBracket;
// Retrieve the string that is between the element tags.
string valueOfElement = originalString.Substring(firstIndexOfCloseBracket + 1, lengthOfStringBetweenBrackets - 1);
newLineString = originalString.Substring(0, firstIndexOfCloseBracket + 1) + valueOfElement.Replace(item.Key, item.Value) + originalString.Substring(lastIndexOfOpenBracket);
}
}
// For the ampersand (&) and apostrophe (') characters, simply replace any found with the escape.
else
{
newLineString = originalString.Replace(item.Key, item.Value);
}
// Set the "original" string to the new version.
originalString = newLineString;
}
tw.WriteLine(newLineString);
}
}
}
}
}

Related

C# format JSON with backslash '\' in value

I have some JSON from a third party system that contains backslashes in the value. For example:
string extract = #"{""key"": ""\/Date(2015-02-02)\/""}";
which without the c# string escaping corresponds to the string:
{"key": "\/Date(2015-02-02)\/"}
I'd like to be able to format (e.g. indent) this JSON.
Typically for formatting, I might use something like JsonConvert like so:
JsonConvert.SerializeObject(JsonConvert.DeserializeObject(extract), Formatting.Indented)
This doesn't quite work, as it sees the value as a date, but as it's not in the standard MS format of \/Date(ticks)\/, it goes to a date of 1 Jan 1970:
{
"key": "1970-01-01T00:00:02.015+00:00"
}
Next approach is to use the serializer settings to not convert dates (I'm not bothered whether it recognises the field as a date, although it would probably be handy later on):
JsonSerializerSettings settings = new JsonSerializerSettings
{
DateParseHandling = DateParseHandling.None,
};
JsonConvert.SerializeObject(JsonConvert.DeserializeObject(extract, settings), Formatting.Indented);
This appears to have treated the backslash as an escape character during the deserialization, so it is "lost" once I see the final result:
{
"key": "/Date(2015-02-02)/"
}
Is there a way that I can format the JSON in C# (with or without JsonConvert), that will preserve the backslash in the value?
Note that the real JSON I am dealing with is (a) reasonably large, but not too large for some regex/find-replace solution, if really necessary (b) not under my control, so I can't change the format. I'm sure the answer is already on StackOverflow, but I'm finding it difficult to find the right search terms...
Have you tried:
extract = extract.Replace("\\","\\\\");
before parsing the string?
The basic problem is that, in a JSON string literal, the escaped solidus "\/" means exactly the same as the unescaped solidus "/", and Json.NET parses and interprets this escaping at a very low level, namely JsonTextReader.ReadStringIntoBuffer(). Thus there's no way for higher level code to detect and remember whether a string literal was formatted as "\/Date(2015-02-02)\/" or "/Date(2015-02-02)/" and later write back one or the other as appropriate.
If you are OK with always adding the extra escaping to strings that start with /Date( and end with )/, you can use a custom subclass of JsonTextWriter to do this:
public class DateLiteralJsonTextWriter : JsonTextWriter
{
public DateLiteralJsonTextWriter(TextWriter writer) : base(writer) { }
public override void WriteValue(string value)
{
const string startToken = #"/Date(";
const string replacementStartToken = #"\/Date(";
const string endToken = #")/";
const string replacementEndToken = #")\/";
if (value != null && value.StartsWith(startToken) && value.EndsWith(endToken))
{
var sb = new StringBuilder();
// Add the initial quote.
sb.Append(QuoteChar);
// Add the new start token.
sb.Append(replacementStartToken);
// Add any necessary escaping to the innards of the "/Date(.*)/" string.
using (var writer = new StringWriter(sb))
using (var jsonWriter = new JsonTextWriter(writer) { StringEscapeHandling = this.StringEscapeHandling, Culture = this.Culture, QuoteChar = '\"' })
{
var content = value.Substring(startToken.Length, value.Length - startToken.Length - endToken.Length);
jsonWriter.WriteValue(content);
}
// Strip the embedded quotes from the above.
sb.Remove(replacementStartToken.Length + 1, 1);
sb.Remove(sb.Length - 1, 1);
// Add the replacement end token and final quote.
sb.Append(replacementEndToken);
sb.Append(QuoteChar);
// Write without any further escaping.
WriteRawValue(sb.ToString());
}
else
{
base.WriteValue(value);
}
}
}
Then parse with DateParseHandling = DateParseHandling.None as you are currently doing:
var settings = new JsonSerializerSettings { DateParseHandling = DateParseHandling.None };
var sb = new StringBuilder();
using (var writer = new StringWriter(sb))
using (var jsonWriter = new DateLiteralJsonTextWriter(writer) { Formatting = Formatting.Indented})
{
JsonSerializer.CreateDefault(settings).Serialize(jsonWriter, JsonConvert.DeserializeObject(extract, settings));
}
Console.WriteLine(sb);
This prints:
{
"key": "\/Date(2015-02-02)\/"
}

How can I add specific escape characters to xmlserializer?

I have a method that serializes an object to xml and returns the string:
public static string SerializeType<T>(T item)
{
var serializer = new XmlSerializer(typeof(T));
var builder = new StringBuilder();
var settings = new XmlWriterSettings();
settings.OmitXmlDeclaration = true;
using (var stringWriter = XmlWriter.Create(builder, settings))
{
serializer.Serialize(stringWriter, item);
return builder.ToString();
}
}
However, it is not removing all the reserved characters from strings in objects I pass in. Microsoft lists the Reserved Characters as <>&% but when I input an item with a "abc&cd%d" string field, it spits out "a &lt ;ab&gt ;bc&amp ;cd%d" without out the spaces preceding the semicolons. % is not being escaped. How can I add the correct escape sequence for percent? The % causes an error when I send it to a client's app. The escaping listed on that page fixes the problem.
% isn't really a reserved character in XML. The documentation you've referred to is for SQL server, and there's a small note under the table:
The Notification Services XML vocabulary reserves the percent sign (%) for denoting parameters.
But you shouldn't expect XmlSerializer (or any other general-purpose XML library) to escape % for you. Unless you're using "Notification Services XML" I wouldn't expect this to be a problem.

Read special characters back from XmlDocument in c#

Say I have the an xml with a escaped ampersand (&). How do I then read it back such that the result gives me the 'un-escaped' text.
Running the following gives me "&amp" as the result. How do I get back '&'
void Main()
{
var xml = #"
<a>
&
</a>
";
var doc = new XmlDocument();
doc.LoadXml(xml);
var ele = (XmlElement)doc.FirstChild;
Console.WriteLine (ele.InnerXml);
}
Use ele.InnerText instead of ele.InnerXml
you can use CDATA in order to get your data
Characters like "<" and "&" are illegal in XML elements."<" the parser interprets it as the start of a new element. "&" the parser interprets it as the start of an character entity.
use
HttpServerUtility.HtmlDecode (ele.InnerXml);
Console.WriteLine (HttpUtility.UrlDecode(String.Format("{0}",ele.InnerXml)));
Try using this static method to decode escaped characters:
HttpServerUtility.HtmlDecode Method (String)
See example here:
http://msdn.microsoft.com/ru-ru/library/hwzhtkke.aspx
The following characters are illegal in XML elements:
Illegal EscapedUsed
------------------
" "
' &apos;
< <
> >
& &
To get the unescaped value you can use:
public string UnescapeXMLValue(string xmlValue)
{
if (string.IsNullOrEmpty(s)) return s;
string temp = s;
temp = temp.Replace("&apos;", "'").Replace(""", "\"").Replace(">", ">").Replace("<", "<").Replace("&", "&");
return temp ;
}
To get the escaped value you can use:
public string EscapeXMLValue(string value)
{
if (string.IsNullOrEmpty(s)) return s;
string temp = s;
temp = temp.Replace("'","&apos;").Replace( "\"", """).Replace(">",">").Replace( "<","<").Replace( "&","&");
return temp ;
}

XmlTextWriter incorrectly writing control characters

.NET's XmlTextWriter creates invalid xml files.
In XML, some control characters are allowed, like 'horizontal tab' ( ), but others are not, like 'vertical tab' (). (See spec.)
I have a string which contains a UTF-8 control character that is not allowed in XML.
Although XmlTextWriter escapes the character, the resulting XML is ofcourse still invalid.
How can I make sure that XmlTextWriter never produces an illegal XML file?
Or, if it's not possible to do this with XmlTextWriter, how can I strip the specific control characters that aren't allowed in XML from a string?
Example code:
using (XmlTextWriter writer =
new XmlTextWriter("test.xml", Encoding.UTF8))
{
writer.WriteStartDocument();
writer.WriteStartElement("Test");
writer.WriteValue("hello \xb world");
writer.WriteEndElement();
writer.WriteEndDocument();
}
Output:
<?xml version="1.0" encoding="utf-8"?><Test>hello  world</Test>
This documentation of a behaviour is hidden in the documentation of the WriteString method but it sounds like it applies to the whole class.
The default behavior of an XmlWriter created using Create is to throw
an ArgumentException when attempting to write character values in the
range 0x-0x1F (excluding white space characters 0x9, 0xA, and 0xD).
These invalid XML characters can be written by creating the XmlWriter
with the CheckCharacters property set to false. Doing so will result
in the characters being replaced with numeric character entities (
through &#0x1F). Additionally, an XmlTextWriter created with the new
operator will replace the invalid characters with numeric character
entities by default.
So it seems that you end up writing invalid characters because you are using the XmlTextWriter class. A better solution for you would be to use the XmlWriter Class instead.
Just found this question when I was struggling with the same issue and I ended up solving it with an regex:
return Regex.Replace(s, #"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");
Hope it helps someone as an alternative solution.
Built in .NET escapers such as SecurityElement.Escape don't properly escape/strip it either.
You could set CheckCharacters to false on both the writer and the reader if your application is the only one interacting with the file. The resulting XML file would still be technically invalid though.
See:
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
xmlWriterSettings.Encoding = new UTF8Encoding(false);
xmlWriterSettings.CheckCharacters = false;
var sb = new StringBuilder();
var w = XmlWriter.Create(sb, xmlWriterSettings);
w.WriteStartDocument();
w.WriteStartElement("Test");
w.WriteString("hello \xb world");
w.WriteEndElement();
w.WriteEndDocument();
w.Close();
var xml = sb.ToString();
If setting CheckCharacters to true(which it is by default) is a bit too strict since it will simply throw an exception an alternative approach that's more lenient to invalid XML characters would be to just strip them:
Googling a bit yielded the whitelist XmlTextEncoder however it'll also remove DEL and others in the range U+007F–U+0084, U+0086–U+009F that according to Valid XML Characters on wikipedia are only valid in certain contexts and which the RFC mentions as discouraged but still valid characters.
public static class XmlTextExtentions
{
private static readonly Dictionary<char, string> textEntities = new Dictionary<char, string> {
{ '&', "&"}, { '<', "<" }, { '>', ">" },
{ '"', """ }, { '\'', "&apos;" }
};
public static string ToValidXmlString(this string str)
{
var stripped = str
.Select((c,i) => new
{
c1 = c,
c2 = i + 1 < str.Length ? str[i+1]: default(char),
v = XmlConvert.IsXmlChar(c),
p = i + 1 < str.Length ? XmlConvert.IsXmlSurrogatePair(str[i + 1], c) : false,
pp = i > 0 ? XmlConvert.IsXmlSurrogatePair(c, str[i - 1]) : false
})
.Aggregate("", (s, c) => {
if (c.pp)
return s;
if (textEntities.ContainsKey(c.c1))
s += textEntities[c.c1];
else if (c.v)
s += c.c1.ToString();
else if (c.p)
s += c.c1.ToString() + c.c2.ToString();
return s;
});
return stripped;
}
}
This passes all the XmlTextEncoder tests except for the one that expects it to strip DEL which XmlConvert.IsXmlChar, Wikipedia, and the spec marks as a valid (although discouraged) character.

Get parameters out of text file

I have a C# asp.net page that has to get username/password info from a text file.
Could someone please tell me how.
The text file looks as follows: (it is actually a lot larger, I just got a few lines)
DATASOURCEFILE=D:\folder\folder
var1= etc
var2= more
var3 = misc
var4 = stuff
USERID = user1
PASSWORD = pwd1
all I need is the UserID and password out of that file.
Thank you for your help,
Steve
This would work:
var dic = File.ReadAllLines("test.txt")
.Select(l => l.Split(new[] { '=' }))
.ToDictionary( s => s[0].Trim(), s => s[1].Trim());
dic is a dictionary, so you easily extract your values, i.e.:
string myUser = dic["USERID"];
string myPassword = dic["PASSWORD"];
Open the file, split on the newline, split again on the = for each item and then add it to a dictionary.
string contents = String.Empty;
using (FileStream fs = File.Open("path", FileMode.OpenRead))
using (StreamReader reader = new StreamReader(fs))
{
contents = reader.ReadToEnd();
}
if (contents.Length > 0)
{
string[] lines = contents.Split(new char[] { '\n' });
Dictionary<string, string> mysettings = new Dictionary<string, string>();
foreach (string line in lines)
{
string[] keyAndValue = line.Split(new char[] { '=' });
mysettings.Add(keyAndValue[0].Trim(), keyAndValue[1].Trim());
}
string test = mysettings["USERID"]; // example of getting userid
}
You can use Regular expressions to extract each variable. You can read one line at a time, or the entire file into one string. If the latter, you just look for a newline in the expression.
Regards,
Morten
Dictionary is not needed.
Old-fashioned parsing can do more, with less executable code, the same amount of compiled data, and less processing:
public string MyPath1;
public string MyPath2;
...
public void ReadConfig(string sConfigFile)
{
MyPath1 = MyPath2 = ""; // Clear the external values (in case the file does not set every parameter).
using (StreamReader sr = new StreamReader(sConfigFile)) // Open the file for reading (and auto-close).
{
while (!sr.EndOfStream)
{
string sLine = sr.ReadLine().Trim(); // Read the next line. Trim leading and trailing whitespace.
// Treat lines with NO "=" as comments (ignore; no syntax checking).
// Treat lines with "=" as the first character as comments too.
// Treat lines with "=" as the 2nd character or after as parameter lines.
// Side-benefit: Values containing "=" are processed correctly.
int i = sLine.IndexOf("="); // Find the first "=" in the line.
if (i <= 0) // IF the first "=" in the line is the first character (or not present),
continue; // the line is not a parameter line. Ignore it. (Iterate the while.)
string sParameter = sLine.Remove(i).TrimEnd(); // All before the "=" is the parameter name. Trim whitespace.
string sValue = sLine.Substring(i + 1).TrimStart(); // All after the "=" is the value. Trim whitespace.
// Extra characters before a parameter name are usually intended to comment it out. Here, we keep them (with or without whitespace between). That makes an unrecognized parameter name, which is ignored (acts as a comment, as intended).
// Extra characters after a value are usually intended as comments. Here, we trim them only if whitespace separates. (Parsing contiguous comments is too complex: need delimiter(s) and then a way to escape delimiters (when needed) within values.) Side-drawback: Values cannot contain " ".
i = sValue.IndexOfAny(new char[] {' ', '\t'}); // Find the first " " or tab in the value.
if (i > 1) // IF the first " " or tab is the second character or after,
sValue = sValue.Remove(i); // All before the " " or tab is the parameter. (Discard the rest.)
// IF a desired parameter is specified, collect it:
// (Could detect here if any parameter is set more than once.)
if (sParameter == "MyPathOne")
MyPath1 = sValue;
else if (sParameter == "MyPathTwo")
MyPath2 = sValue;
// (Could detect here if an invalid parameter name is specified.)
// (Could exit the loop here if every parameter has been set.)
} // end while
// (Could detect here if the config file set neither parameter or only one parameter.)
} // end using
}

Categories