I have some JSON from a third party system that contains backslashes in the value. For example:
string extract = @"{""key"": ""\/Date(2015-02-02)\/""}";
which without the C# string escaping corresponds to the string:
{"key": "\/Date(2015-02-02)\/"}
I'd like to be able to format (e.g. indent) this JSON.
Typically for formatting, I might use something like JsonConvert like so:
JsonConvert.SerializeObject(JsonConvert.DeserializeObject(extract), Formatting.Indented)
This doesn't quite work, as it sees the value as a date, but as it's not in the standard MS format of \/Date(ticks)\/, it goes to a date of 1 Jan 1970:
{
"key": "1970-01-01T00:00:02.015+00:00"
}
Next approach is to use the serializer settings to not convert dates (I'm not bothered whether it recognises the field as a date, although it would probably be handy later on):
JsonSerializerSettings settings = new JsonSerializerSettings
{
DateParseHandling = DateParseHandling.None,
};
JsonConvert.SerializeObject(JsonConvert.DeserializeObject(extract, settings), Formatting.Indented);
This appears to have treated the backslash as an escape character during the deserialization, so it is "lost" once I see the final result:
{
"key": "/Date(2015-02-02)/"
}
Is there a way that I can format the JSON in C# (with or without JsonConvert), that will preserve the backslash in the value?
Note that the real JSON I am dealing with is (a) reasonably large, but not too large for some regex/find-replace solution, if really necessary (b) not under my control, so I can't change the format. I'm sure the answer is already on StackOverflow, but I'm finding it difficult to find the right search terms...
Have you tried:
extract = extract.Replace("\\","\\\\");
before parsing the string?
The basic problem is that, in a JSON string literal, the escaped solidus "\/" means exactly the same as the unescaped solidus "/", and Json.NET parses and interprets this escaping at a very low level, namely JsonTextReader.ReadStringIntoBuffer(). Thus there's no way for higher level code to detect and remember whether a string literal was formatted as "\/Date(2015-02-02)\/" or "/Date(2015-02-02)/" and later write back one or the other as appropriate.
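To see the equivalence concretely, here is a minimal sketch. It uses System.Text.Json rather than Json.NET purely so that it runs without extra packages; that substitution and the names SolidusDemo and ReadKey are assumptions of this example, not part of the question. Json.NET unescapes the solidus the same way, as the output in the question already shows.

```csharp
using System;
using System.Text.Json;

public static class SolidusDemo
{
    // Parses the JSON text and returns the decoded value of "key".
    public static string ReadKey(string json)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            return doc.RootElement.GetProperty("key").GetString();
        }
    }

    public static void Main()
    {
        string escaped = @"{""key"": ""\/Date(2015-02-02)\/""}";
        string unescaped = @"{""key"": ""/Date(2015-02-02)/""}";

        // Both documents decode to the same string, so once parsing is done
        // nothing records which spelling the source text used.
        Console.WriteLine(ReadKey(escaped) == ReadKey(unescaped));
    }
}
```

Because the distinction is gone after parsing, any fix has to happen on the writing side, which is what the writer subclass below does.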
If you are OK with always adding the extra escaping to strings that start with /Date( and end with )/, you can use a custom subclass of JsonTextWriter to do this:
public class DateLiteralJsonTextWriter : JsonTextWriter
{
public DateLiteralJsonTextWriter(TextWriter writer) : base(writer) { }
public override void WriteValue(string value)
{
const string startToken = @"/Date(";
const string replacementStartToken = @"\/Date(";
const string endToken = @")/";
const string replacementEndToken = @")\/";
if (value != null && value.StartsWith(startToken) && value.EndsWith(endToken))
{
var sb = new StringBuilder();
// Add the initial quote.
sb.Append(QuoteChar);
// Add the new start token.
sb.Append(replacementStartToken);
// Add any necessary escaping to the innards of the "/Date(.*)/" string.
using (var writer = new StringWriter(sb))
using (var jsonWriter = new JsonTextWriter(writer) { StringEscapeHandling = this.StringEscapeHandling, Culture = this.Culture, QuoteChar = '\"' })
{
var content = value.Substring(startToken.Length, value.Length - startToken.Length - endToken.Length);
jsonWriter.WriteValue(content);
}
// Strip the embedded quotes from the above.
sb.Remove(replacementStartToken.Length + 1, 1);
sb.Remove(sb.Length - 1, 1);
// Add the replacement end token and final quote.
sb.Append(replacementEndToken);
sb.Append(QuoteChar);
// Write without any further escaping.
WriteRawValue(sb.ToString());
}
else
{
base.WriteValue(value);
}
}
}
Then parse with DateParseHandling = DateParseHandling.None as you are currently doing:
var settings = new JsonSerializerSettings { DateParseHandling = DateParseHandling.None };
var sb = new StringBuilder();
using (var writer = new StringWriter(sb))
using (var jsonWriter = new DateLiteralJsonTextWriter(writer) { Formatting = Formatting.Indented})
{
JsonSerializer.CreateDefault(settings).Serialize(jsonWriter, JsonConvert.DeserializeObject(extract, settings));
}
Console.WriteLine(sb);
This prints:
{
"key": "\/Date(2015-02-02)\/"
}
I'm using the XmlSerializer to output a class to a .xml file. For the most part, this is working as expected and intended. However, as a requirement, certain characters need to be removed from the values of the data and replaced with their proper escape characters.
In the elements I need to replace values in, I'm using the Replace() method and returning the updated string. The code below shows this string replacement; the lines commented out are because the XmlSerializer already escapes those particular characters.
I have a requirement from a third-party to escape &, <, >, ', and " characters when they appear within the values of the XML elements. Currently the characters &, <, and > are being escaped appropriately through the XmlSerializer.
The error received when these characters are present is:
Our system has detected a potential threat in the request message attachment.
However, when I serialize the XML document after performing the string replace, the XmlSerializer sees the & character in &apos; and turns it into &amp;apos;. I think this is correct behaviour for the XmlSerializer. However, I would like the serializer to either (a) ignore the escape sequences or (b) escape the other characters that need escaping.
Can anyone shed some light on, specifically, how to accomplish either of these?
String Replacement Method
public static string CheckValueOfProperty(string str)
{
string trimmedString = str.Trim();
if (string.IsNullOrEmpty(trimmedString))
return null;
else
{
// Commented out because the Serializer already transforms a '&' character into the appropriate escape character.
//trimmedString = trimmedString.Replace("&", "&amp;");
//trimmedString = trimmedString.Replace("<", "&lt;");
//trimmedString = trimmedString.Replace(">", "&gt;");
trimmedString = trimmedString.Replace("'", "&apos;");
trimmedString = trimmedString.Replace("\"", "&quot;");
return trimmedString;
}
}
XmlSerializer Code
public static void SerializeAndOutput(object obj, string outputFilePath, XmlSerializerNamespaces ns = null)
{
XmlSerializer x = new XmlSerializer(obj.GetType());
// If the Output File already exists, delete it.
if (File.Exists(outputFilePath))
{
File.Delete(outputFilePath);
}
// Then, Create the Output File and Serialize the parameterized object as Xml to the Output File
using (TextWriter tw = File.CreateText(outputFilePath))
{
if (ns == null)
{
x.Serialize(tw, obj);
}
else { x.Serialize(tw, obj, ns); }
}
// =====================================================================
// The code below here is no longer needed; it was used to force "utf-8" to
// "UTF-8" to ensure the result was what was being expected.
// =====================================================================
// Create a new XmlDocument object, and load the contents of the OutputFile into the XmlDocument
// XmlDocument xdoc = new XmlDocument() { PreserveWhitespace = true };
// xdoc.Load(outputFilePath);
// Set the Encoding property of each XmlDeclaration in the document to "UTF-8";
// xdoc.ChildNodes.OfType<XmlDeclaration>().ToList().ForEach(d => d.Encoding = "UTF-8");
// Save the XmlDocument to the Output File Path.
// xdoc.Save(outputFilePath);
}
Single and double quote characters do not need to be escaped inside node content in XML. They only need to be escaped inside an attribute value. That's why the XmlSerializer does not escape them, and you do not need to escape them either.
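A quick way to check this is to parse element content that contains raw quotes. The sketch below uses XElement from System.Xml.Linq (my choice for brevity; the QuoteContentDemo and ReadContent names are hypothetical) and shows that the raw and escaped forms yield the same text.

```csharp
using System;
using System.Xml.Linq;

public static class QuoteContentDemo
{
    // Parses a one-element document and returns its decoded text content.
    public static string ReadContent(string xml)
    {
        return XElement.Parse(xml).Value;
    }

    public static void Main()
    {
        // Raw ' and " inside element content: well-formed, read back as-is.
        Console.WriteLine(ReadContent("<note>it's \"fine\"</note>"));

        // The escaped form decodes to exactly the same text.
        Console.WriteLine(ReadContent("<note>it&apos;s &quot;fine&quot;</note>"));
    }
}
```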
See this question and answer for reference.
BTW: The way you set the encoding to UTF-8 afterwards is awkward as well. You can specify the encoding on the StreamWriter; the XmlSerializer will then automatically use that encoding and also specify it in the XML declaration.
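A minimal sketch of that suggestion, assuming a made-up Sample type and writing to a MemoryStream instead of a file so it is easy to inspect (Utf8SerializeDemo and SerializeToUtf8 are hypothetical names):

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Serialization;

// Hypothetical type used only for this illustration.
public class Sample
{
    public string Name { get; set; }
}

public static class Utf8SerializeDemo
{
    // Serializes obj through a StreamWriter whose encoding is UTF-8 (no BOM).
    // The serializer takes the encoding name from the writer it is given.
    public static string SerializeToUtf8(object obj)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, new UTF8Encoding(false)))
            {
                new XmlSerializer(obj.GetType()).Serialize(writer, obj);
            }
            // ToArray is still valid after the stream has been closed.
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }

    public static void Main()
    {
        Console.WriteLine(SerializeToUtf8(new Sample { Name = "test" }));
    }
}
```

For a file, the same idea applies with new StreamWriter(outputFilePath, false, new UTF8Encoding(false)) in place of the MemoryStream.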
Here's the solution I came up with. I have only tested it with a sample XML file and not the actual XML file I'm creating, so performance may take a hit; however, this seems to be working.
I'm reading the XML file line-by-line as a string, and replacing any of the defined "special" characters found in the string with their appropriate escape characters. It should process in the order of the specialCharacterList Dictionary<string, string> variable, which means the & character should process first. When processing <, > and " characters, it will only look at the value of the XML element.
using System;
using System.Collections.Generic;
using System.IO;
namespace testSerializer
{
class Program
{
private static string filePath = AppDomain.CurrentDomain.BaseDirectory + "testFile.xml";
private static string tempFile = AppDomain.CurrentDomain.BaseDirectory + "tempFile.xml";
private static Dictionary<string, string> specialCharacterList = new Dictionary<string, string>()
{
{"&","&amp;"}, {"<","&lt;"}, {">","&gt;"}, {"'","&apos;"}, {"\"","&quot;"}
};
static void Main(string[] args)
{
ReplaceSpecialCharacters();
}
private static void ReplaceSpecialCharacters()
{
string[] allLines = File.ReadAllLines(filePath);
using (TextWriter tw = File.CreateText(tempFile))
{
foreach (string strLine in allLines)
{
string newLineString = "";
string originalString = strLine;
foreach (var item in specialCharacterList)
{
// Since these characters are all valid characters to be present in the XML,
// We need to look specifically within the VALUE of the XML Element.
if (item.Key == "\"" || item.Key == "<" || item.Key == ">")
{
// Find the ending character of the beginning XML tag.
int firstIndexOfCloseBracket = originalString.IndexOf('>');
// Find the beginning character of the ending XML tag.
int lastIndexOfOpenBracket = originalString.LastIndexOf('<');
if (lastIndexOfOpenBracket > firstIndexOfCloseBracket)
{
// Determine the length of the string between the XML tags.
int lengthOfStringBetweenBrackets = lastIndexOfOpenBracket - firstIndexOfCloseBracket;
// Retrieve the string that is between the element tags.
string valueOfElement = originalString.Substring(firstIndexOfCloseBracket + 1, lengthOfStringBetweenBrackets - 1);
newLineString = originalString.Substring(0, firstIndexOfCloseBracket + 1) + valueOfElement.Replace(item.Key, item.Value) + originalString.Substring(lastIndexOfOpenBracket);
}
}
// For the ampersand (&) and apostrophe (') characters, simply replace any found with the escape.
else
{
newLineString = originalString.Replace(item.Key, item.Value);
}
// Set the "original" string to the new version.
originalString = newLineString;
}
tw.WriteLine(newLineString);
}
}
}
}
}
I have to convert a few hundred test cases written in Java to code in C#. At the moment all I could think of is define a set of regular expressions, try to match it on a line and do an action based on which regex matched.
Any better ideas (this still stinks).
An example of from and to:
Java:
Request request = new Request(testRunner)
request.setUsername("userName")
request.setPassword("password")
log.info(request.getRequest())
C#
var request = new LoginRequest(LoginParams);
request.Username = "userName";
request.Password = "password";
var LoginResponse = Account.ExecuteCall(request, pathToApi);
The source I'm trying to convert is from SoapUI and the bits of script involved are within TestSteps of a humongous XML file. Also, most of them are simply forming some sort of request and checking for a specific response so there shouldn't be too many types to implement.
What I ended up doing was define a base class (Map) that has a Pattern property, a Success indicator, and the lines of code that result from a successful match. In some cases a line can simply be replaced by another one, but in other cases (setUsername) I need to extract content from the original script to put in the C# code. In other cases, a single line might be replaced with more than one. The transformation is all defined in the Match function.
public class SetUserName : Map
{
internal override string Pattern { get { return @"request.setUsername\(""(.*)""\)"; } }
public override void Match(string line)
{
Match match = Regex.Match(line, Pattern);
if (match.Success)
{
Success = true;
CodeLines = new Code<CodeLine>
{new CodeLine("request.Username = \"" + match.Groups[1].Value + "\"")};
}
}
}
Then I put the maps in a list ordered by occurrence and loop through each line of script:
foreach (string scriptLine in scriptLines)
{
string line = Strip(scriptLine);
if (string.IsNullOrEmpty(line) || Regex.Match(line, @"^\s+$").Success)
{
continue;
}
Map[] RegExes =
{
new Request(),
new SetUserName(),
new SetPassword(),
new RunRequest()
};
foreach (Map map in RegExes)
{
map.Match(line);
if (map.Success)
{
codeList.AddRange(map.CodeLines);
break;
}
}
}
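The Map base class, CodeLine, and Code<T> are not shown above. As a rough, self-contained sketch of the same idea, the version below uses plain strings for the generated lines; LineMap, SetUserNameMap, TryMatch, and Emit are hypothetical names of mine, not the classes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the Map base class described above.
public abstract class LineMap
{
    protected abstract string Pattern { get; }

    public bool Success { get; private set; }

    public List<string> CodeLines { get; } = new List<string>();

    // Produces the C# line(s) for a successful match.
    protected abstract IEnumerable<string> Emit(Match match);

    // Tries the pattern against one script line and records the result.
    public void TryMatch(string line)
    {
        Match match = Regex.Match(line, Pattern);
        Success = match.Success;
        CodeLines.Clear();
        if (match.Success)
        {
            CodeLines.AddRange(Emit(match));
        }
    }
}

public sealed class SetUserNameMap : LineMap
{
    protected override string Pattern
    {
        get { return @"request\.setUsername\(""(.*)""\)"; }
    }

    protected override IEnumerable<string> Emit(Match match)
    {
        yield return "request.Username = \"" + match.Groups[1].Value + "\";";
    }
}

public static class LineMapDemo
{
    public static void Main()
    {
        var map = new SetUserNameMap();
        map.TryMatch("request.setUsername(\"userName\")");
        Console.WriteLine(map.Success ? map.CodeLines[0] : "no match");
    }
}
```

Resetting Success and CodeLines on every call lets one instance be reused across lines, so the array of maps could be created once outside the loop.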
I have a method that serializes an object to xml and returns the string:
public static string SerializeType<T>(T item)
{
var serializer = new XmlSerializer(typeof(T));
var builder = new StringBuilder();
var settings = new XmlWriterSettings();
settings.OmitXmlDeclaration = true;
using (var stringWriter = XmlWriter.Create(builder, settings))
{
serializer.Serialize(stringWriter, item);
return builder.ToString();
}
}
However, it is not escaping all the reserved characters in the strings of the objects I pass in. Microsoft lists the reserved characters as <>&%, but when I pass in an item with an "a<ab>bc&cd%d" string field, it outputs "a&lt;ab&gt;bc&amp;cd%d". The % is not being escaped. How can I add the correct escape sequence for percent? The % causes an error when I send it to a client's app. The escaping listed on that page fixes the problem.
% isn't really a reserved character in XML. The documentation you've referred to is for SQL server, and there's a small note under the table:
The Notification Services XML vocabulary reserves the percent sign (%) for denoting parameters.
But you shouldn't expect XmlSerializer (or any other general-purpose XML library) to escape % for you. Unless you're using "Notification Services XML" I wouldn't expect this to be a problem.
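As an illustration, the sketch below runs the string from the question through a general-purpose XML API. It uses XElement rather than XmlSerializer only to keep the example short (an assumption of this sketch; PercentDemo and ToElementText are hypothetical names). The markup characters are escaped and the percent sign passes through untouched.

```csharp
using System;
using System.Xml.Linq;

public static class PercentDemo
{
    // Wraps the value in a <v> element and returns the serialized markup.
    public static string ToElementText(string value)
    {
        return new XElement("v", value).ToString();
    }

    public static void Main()
    {
        // <, > and & come out as entities; % is ordinary character data.
        Console.WriteLine(ToElementText("a<ab>bc&cd%d"));
    }
}
```

If the receiving application really does need % encoded, that is a rule of that application, and the replacement has to be applied to the field values according to whatever its documentation specifies.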
.NET's XmlTextWriter creates invalid xml files.
In XML, some control characters are allowed, like 'horizontal tab' (&#x9;), but others are not, like 'vertical tab' (&#xB;). (See spec.)
I have a string which contains a UTF-8 control character that is not allowed in XML.
Although XmlTextWriter escapes the character, the resulting XML is of course still invalid.
How can I make sure that XmlTextWriter never produces an illegal XML file?
Or, if it's not possible to do this with XmlTextWriter, how can I strip the specific control characters that aren't allowed in XML from a string?
Example code:
using (XmlTextWriter writer =
new XmlTextWriter("test.xml", Encoding.UTF8))
{
writer.WriteStartDocument();
writer.WriteStartElement("Test");
writer.WriteValue("hello \xb world");
writer.WriteEndElement();
writer.WriteEndDocument();
}
Output:
<?xml version="1.0" encoding="utf-8"?><Test>hello &#xB; world</Test>
This documentation of a behaviour is hidden in the documentation of the WriteString method but it sounds like it applies to the whole class.
The default behavior of an XmlWriter created using Create is to throw
an ArgumentException when attempting to write character values in the
range 0x-0x1F (excluding white space characters 0x9, 0xA, and 0xD).
These invalid XML characters can be written by creating the XmlWriter
with the CheckCharacters property set to false. Doing so will result
in the characters being replaced with numeric character entities (&#0;
through &#0x1F). Additionally, an XmlTextWriter created with the new
operator will replace the invalid characters with numeric character
entities by default.
So it seems that you end up writing invalid characters because you are using the XmlTextWriter class. A better solution for you would be to use the XmlWriter Class instead.
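A small sketch of what that looks like in practice, assuming the default settings of XmlWriter.Create (InvalidCharDemo and ThrowsOnInvalidChar are hypothetical names). Instead of silently producing an invalid file, the writer rejects the vertical tab:

```csharp
using System;
using System.Text;
using System.Xml;

public static class InvalidCharDemo
{
    // Returns true if a writer made by XmlWriter.Create refuses the text.
    public static bool ThrowsOnInvalidChar(string text)
    {
        var sb = new StringBuilder();
        try
        {
            using (XmlWriter writer = XmlWriter.Create(sb))
            {
                writer.WriteStartElement("Test");
                writer.WriteString(text);
                writer.WriteEndElement();
            }
            return false;
        }
        catch (ArgumentException)
        {
            // CheckCharacters defaults to true, so illegal characters are rejected.
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(ThrowsOnInvalidChar("hello \u000B world"));
        Console.WriteLine(ThrowsOnInvalidChar("hello \t world"));
    }
}
```

The exception tells you that the input needs cleaning, so stripping or replacing the offending characters remains a separate step.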
I just found this question when I was struggling with the same issue, and I ended up solving it with a regex:
return Regex.Replace(s, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");
Hope it helps someone as an alternative solution.
Built in .NET escapers such as SecurityElement.Escape don't properly escape/strip it either.
You could set CheckCharacters to false on both the writer and the reader if your application is the only one interacting with the file. The resulting XML file would still be technically invalid though.
See:
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
xmlWriterSettings.Encoding = new UTF8Encoding(false);
xmlWriterSettings.CheckCharacters = false;
var sb = new StringBuilder();
var w = XmlWriter.Create(sb, xmlWriterSettings);
w.WriteStartDocument();
w.WriteStartElement("Test");
w.WriteString("hello \xb world");
w.WriteEndElement();
w.WriteEndDocument();
w.Close();
var xml = sb.ToString();
If leaving CheckCharacters set to true (the default) is too strict because it simply throws an exception, a more lenient approach to invalid XML characters is to strip them:
Googling a bit yielded the whitelist-based XmlTextEncoder; however, it also removes DEL and other characters in the ranges U+007F–U+0084 and U+0086–U+009F which, according to Valid XML Characters on Wikipedia, are only valid in certain contexts, and which the RFC mentions as discouraged but still valid characters.
public static class XmlTextExtentions
{
private static readonly Dictionary<char, string> textEntities = new Dictionary<char, string> {
{ '&', "&amp;"}, { '<', "&lt;" }, { '>', "&gt;" },
{ '"', "&quot;" }, { '\'', "&apos;" }
};
public static string ToValidXmlString(this string str)
{
var stripped = str
.Select((c,i) => new
{
c1 = c,
c2 = i + 1 < str.Length ? str[i+1]: default(char),
v = XmlConvert.IsXmlChar(c),
p = i + 1 < str.Length ? XmlConvert.IsXmlSurrogatePair(str[i + 1], c) : false,
pp = i > 0 ? XmlConvert.IsXmlSurrogatePair(c, str[i - 1]) : false
})
.Aggregate("", (s, c) => {
if (c.pp)
return s;
if (textEntities.ContainsKey(c.c1))
s += textEntities[c.c1];
else if (c.v)
s += c.c1.ToString();
else if (c.p)
s += c.c1.ToString() + c.c2.ToString();
return s;
});
return stripped;
}
}
This passes all the XmlTextEncoder tests except for the one that expects it to strip DEL, which XmlConvert.IsXmlChar, Wikipedia, and the spec all mark as a valid (although discouraged) character.
I am parsing a webpage for http links by first parsing out all the anchored tags, then parsing out the href tags, then running a regex to remove all tags that aren't independent links (like href="/img/link.php"). The following code works correctly, but also appends lots of blank lines in between the parsed links.
while (parse.ParseNext("a", out tag))
{
string value;
//A REGEX value, this one finds proper http addresses
Regex regexObj = new Regex(@"^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$");
if (tag.Attributes.TryGetValue("href", out value))
{
string value2;
//Start finding matches...
Match matchResults = regexObj.Match(value);
value2 = matchResults.Value;
lstPages.AppendText(value2 + "\r\n");
}
}
To fix this, I added the following code and it works to clean up the list:
if (value2 != "")
{
lstPages.AppendText(value2 + "\r\n");
}
However, I:
1. Don't believe this is the most efficient way to go about this, and
2. Still don't understand where the "" lines come from.
My actual question is on both of these but more for issue #2, as I would like to learn why I receive these results, but also if there is a more efficient method for this.
The reason you are getting an empty string in value2 is that matchResults.Value == "" if the regular expression fails to match. Instead of checking if value2 != "", you could directly check matchResults.Success to see if the regular expression matched. You're basically doing that, anyway, since your particular regular expression would never match an empty string, but checking matchResults.Success would be more straightforward.
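A minimal sketch of that behaviour, reusing the pattern from the question (FailedMatchDemo and the sample URLs are assumptions of this example):

```csharp
using System;
using System.Text.RegularExpressions;

public static class FailedMatchDemo
{
    public static readonly Regex HttpLink =
        new Regex(@"^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$");

    public static void Main()
    {
        // A relative href does not match: Success is false and Value is "".
        Match relative = HttpLink.Match("/img/link.php");
        Console.WriteLine(relative.Success);
        Console.WriteLine(relative.Value == "");

        // An absolute http link matches, and Value holds the whole link.
        Match absolute = HttpLink.Match("http://example.com/page");
        Console.WriteLine(absolute.Success);
        Console.WriteLine(absolute.Value);
    }
}
```

Each non-matching href therefore appended "" plus "\r\n" to the list, which is where the blank lines came from.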
Another thing to consider is that it's not necessary to create the Regex object every iteration of your loop. Here are the modifications I suggest:
//A REGEX value, this one finds proper http addresses
Regex regexObj = new Regex(@"^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$");
while (parse.ParseNext("a", out tag))
{
string value;
if (tag.Attributes.TryGetValue("href", out value))
{
string value2;
//Start finding matches...
Match matchResult = regexObj.Match(value);
if (matchResult.Success)
{
value2 = matchResult.Value;
lstPages.AppendText(value2 + "\r\n");
}
}
}
Use Html Agility Pack instead
static void Main(string[] args)
{
var html = new HtmlDocument();
var request = WebRequest.Create("http://stackoverflow.com/questions/6256982/parsing-links-and-recieving-extra-blanks/6257328#6257328") as HttpWebRequest;
using (var response = request.GetResponse())
using (var responseStream = response.GetResponseStream())
{
html.Load(responseStream);
}
foreach (var absoluteHref in html.DocumentNode.SelectNodes("//a[starts-with(@href, 'http')]"))
{
Console.WriteLine(absoluteHref.Attributes["href"].Value);
}
}
TryGetValue is a generic method (of type T). If it doesn't have any value to return, it returns the default value of the type, which is String.Empty or "" for String.