Reliably fix broken escape sequences in JSON - c#

I'm getting some JSON from an outside source that can't be changed, and apparently they don't understand the rules for escaping characters in JSON string values. A string value might contain raw tabs, for example, that should have been escaped, as well as invalid escape sequences like \$. I'm trying to parse this with JSON.Net but it keeps falling over on these sequences.
For example, the source might look something like this:
{
"someRegularProp": 10,
"aNormalString": "foo bar etc",
"anInvalidString": "foo <tab \$100"
}
and it's parsed with
var obj = JObject.Parse(json);
So I can fix this specific case with something like:
json = json.Replace("\t", "").Replace("\\$", "$"); // note: in this case I'm fine with just stripping the tabs out
But is there a general way to remove invalid escape sequences before parsing? I don't know what other invalid sequences they might put in there.

I don't see a general way. Evidently they are using a buggy library, or no library at all, to generate this output. Unless you can learn more about how it is produced, all you can do is test as much of their output as possible to find all the problems.
Perhaps write a script that collects as much output as possible and validates all of it; then you can be at least a bit more confident.
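For the two problems named in the question (raw control characters and backslashes in front of characters that JSON does not allow to be escaped), a pre-pass over the text can be sketched as below. This is only a sketch under assumptions: `RepairJsonEscapes` and `IsValidEscape` are hypothetical helpers, not part of JSON.Net, and the code assumes that the quotes delimiting strings are themselves correct. It will not catch other kinds of breakage.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of JSON.Net): repairs escape problems inside string literals.
static string RepairJsonEscapes(string json)
{
    var sb = new StringBuilder(json.Length);
    bool inString = false;

    for (int i = 0; i < json.Length; i++)
    {
        char c = json[i];

        if (!inString)
        {
            // Outside a string, copy everything through and watch for an opening quote.
            if (c == '"') inString = true;
            sb.Append(c);
        }
        else if (c == '"')
        {
            // Unescaped quote: the string ends here.
            inString = false;
            sb.Append(c);
        }
        else if (c == '\\')
        {
            if (IsValidEscape(json, i + 1))
            {
                // Legal escape: keep the backslash and the character it escapes.
                sb.Append(c).Append(json[i + 1]);
                i++;
            }
            // Otherwise drop the stray backslash; the next character is handled normally.
        }
        else if (c < ' ')
        {
            // Raw control characters (tab, newline, ...) are not allowed inside JSON strings.
            sb.Append("\\u").Append(((int)c).ToString("x4"));
        }
        else
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}

// True if s[pos] starts a legal JSON escape: one of " \ / b f n r t, or u plus four hex digits.
static bool IsValidEscape(string s, int pos)
{
    if (pos >= s.Length) return false;
    if ("\"\\/bfnrt".IndexOf(s[pos]) >= 0) return true;
    if (s[pos] != 'u' || pos + 4 >= s.Length) return false;
    for (int k = 1; k <= 4; k++)
        if (!Uri.IsHexDigit(s[pos + k])) return false;
    return true;
}

// The string below contains a real tab character and the invalid sequence \$.
string broken = "{ \"anInvalidString\": \"foo\tbar \\$100\" }";
Console.WriteLine(RepairJsonEscapes(broken));
```

The repaired text can then be handed to `JObject.Parse`. Unlike the `Replace` approach in the question, this version keeps tabs (as `\u0009`) rather than stripping them; dropping them instead would be a one-line change.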

Related

Preparing a String to be used in Json

I have a string that I need to use as the body of a JSON object. I know the data could have quotes in it, so I go through it and add an escape character to each quote, like so:
string NewComment = comment.Replace("\"", "\\\"");
However, in some edge cases a quote still makes it through. I don't know if this is something to do with UTF or some other issue, but I am trying to find a function that would safely create a JSON-compatible string. I figured there has to be something like this out there, or a regex way of doing so.
TL;DR: how do I create a JSON-syntax-safe string from a C# string?
The simple answer is: don't do it this way. What if you have escaped quotes in your string? "Hello \"World\"" would become invalid with such a simple approach: "Hello \\"World\\"". A library such as JSON.Net (Newtonsoft.Json) is going to save you so many headaches in the long run.
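As a minimal sketch of letting a serializer do the escaping, the example below uses `System.Text.Json`, which is an assumption on my part (it ships with .NET Core 3.0 and later); with JSON.Net, `JsonConvert.SerializeObject` plays the same role. The sample text is made up.

```csharp
using System;
using System.Text.Json;

// Made-up comment text with quotes, a backslash, and a tab.
string comment = "She said \"hi\" and left a \\ and a\ttab";

// Serializing a plain string yields a complete JSON string literal, surrounding quotes included.
string jsonLiteral = JsonSerializer.Serialize(comment);
Console.WriteLine(jsonLiteral);

// Round-tripping shows that nothing was lost or double-escaped.
var roundTripped = JsonSerializer.Deserialize<string>(jsonLiteral);
Console.WriteLine(roundTripped == comment);
```

The exact escape spelling depends on the serializer (some emit `\"`, others a `\u` form), but any of them is valid JSON and decodes back to the original text.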

Weird behavior C#

Somehow I'm getting a weird result from a GetString(). So, in my project I got this code:
byte[] arrayBytes = System.Convert.FromBase64String(n["spo_fdat"].InnerText);
string str = System.Text.Encoding.UTF8.GetString(arrayBytes);
The InnerText Value and the code is in: https://dotnetfiddle.net/mMUlti
So, my problem is that somehow I'm getting this result on my Visual Studio:
While in the online compiler that I post above the output is as expected.
This output is meant for a printer, and these \0 characters are destroying the format.
Does anyone have a clue about what is going on and what I should do or try?
It looks like for some reason every other byte in your input is null. If you strip those out you get something that looks much more plausible as printer commands (though I am no expert). Hopefully you can verify things...
To do this, all I did was add this line:
arrayBytes = arrayBytes.Where((x,i)=>i%2==0).ToArray();
The Where call takes the value (x) and the index (i); if the index modulo 2 is 0 (i.e. it's even), the Where clause keeps the byte, and if it's odd the byte is thrown away.
The output I get from this starts:
CT~~CD,~CC^~CT~
^XA~TA000~JSN^LT0^MNW^MTT^PON^PMN^LH0,0^JMA^PR2,2~SD15^JUS^LRN^CI0^XZ
^XA
^MMT
^PW607
^LL0406
There are some non-printing characters in there too that look like possible printing commands (e.g. the first character is 16, which is the "data link escape" character).
Edited afterthought:
The problem you have here is evidently a problem with the specification. It seems that your input is wrong. You need to talk to whoever generated it, find out the specification they are using to generate it, make sure their code matches that spec, and then write your code to accept that spec. With a solid specification you should both be writing compatible code.
Try inspecting the bytes instead. You'll see that what you have encoded in the base-64 string is much closer to what Visual Studio shows to you in comparison to the output from dotnetfiddle. Consoles usually don't escape non-printables (such as \0 - the null character) whereas Visual Studio string inspector does so in attempt to provide as much value to its user as possible.
Looking at your base-64 encoded data, it looks way more like UTF-16 than UTF-8. If you decode it as UTF-16 (Encoding.Unicode in .NET), you'll perhaps get rid of the null characters in the Visual Studio inspector as well.
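To illustrate the difference on a small scale, here is a sketch with a hypothetical six-byte sample (the text "^XA" stored as UTF-16LE) rather than the real data from the question. It dumps the bytes as hex and decodes them both ways.

```csharp
using System;
using System.Text;

// Hypothetical sample: the text "^XA" stored as UTF-16LE, so every other byte is zero.
byte[] sample = { 0x5E, 0x00, 0x58, 0x00, 0x41, 0x00 };

// A hex dump of the raw bytes makes the alternating-zero pattern visible.
Console.WriteLine(BitConverter.ToString(sample));   // 5E-00-58-00-41-00

// Decoded as UTF-8, each zero byte becomes a \0 character in the string.
string asUtf8 = Encoding.UTF8.GetString(sample);
Console.WriteLine(asUtf8.Length);                   // 6

// Decoded as UTF-16 (little-endian), the zero bytes are absorbed into the code units.
string asUtf16 = Encoding.Unicode.GetString(sample);
Console.WriteLine(asUtf16);                         // ^XA
```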
Regardless of that, the base-64 data don't make much sense. More semantical context is required to figure out what the issue is.
According to inspection by Chris, it looks like the data is UTF-8 encoded in UTF-16.
You should be able to get proper results with the following:
var xml = //your base-64 input...
var arrayBytes = Convert.FromBase64String(xml);
var utf16 = Encoding.Unicode.GetString(arrayBytes);
var utf8Bytes = utf16.Select(c => (byte)c).ToArray();
var utf8 = Encoding.UTF8.GetString(utf8Bytes);
Console.WriteLine(utf8);
The opposite is probably how your input was created. However, you could also go for Chris' solution of ignoring every odd byte as it is basically the same with less weird encoding things going on (although this may be more explicit to what really goes on: UTF-8 inside UTF-16).

How to escape special characters while doing XML Serialization

One of the elements in my XML has a value like
<item name="abc_def>" />
The actual value pulled from the data source is "abc_def!!>". I have no control over this data source and this cannot be changed.
I want to know how to escape these characters when XML serialization takes place. I have tried a couple of things, but they didn't work.
I tried all methods explained here
What is the correct way to escape these characters ? The end output is an api which our clients hit using their browsers and because of this issue, the xml parsing in browser is breaking.
If all of your strings look like that one, you can do something like this:
string input = "abc_def>";
input = input.Replace("", "!!");
string output = HttpUtility.HtmlDecode(input);
You need to use:
System.Net.WebUtility.HtmlEncode(stringToEncode);
Of course when you later decode that you use:
System.Net.WebUtility.HtmlDecode(stringToDecode);
This is for UWP, namespace may vary depending on what framework you use.
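As a small sketch of what the encode and decode calls above do, the example below runs a made-up value (not the one from the question) through both. One caution, stated as my own note: if the value is later written through `XmlSerializer` or `XmlWriter`, those escape text again, so manual encoding is mainly useful when the markup is assembled by hand.

```csharp
using System;
using System.Net;

// Made-up value containing characters that are special in XML and HTML.
string raw = "abc_def<!!> & \"more\"";

// HtmlEncode replaces <, >, & and " with entity references.
string encoded = WebUtility.HtmlEncode(raw);
Console.WriteLine(encoded);

// HtmlDecode reverses the encoding and restores the original text.
string decoded = WebUtility.HtmlDecode(encoded);
Console.WriteLine(decoded == raw);
```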

String.Format not taking 4th object

Here is my problem: I want String.Format() to take 4 objects and format the string, but it throws an "Input string was not in a correct format" error.
Here is my code,
string jsonData = string.Format("{{\"sectionTitle\":\"{0}\",\"strPushMsg\":\"{1}\",\"Language\":\"{2}\",}\",\"articleid\":\"{3}\"}}", urlsectiontitle, formatHeadline, Language, articleid);
\"{2}\",}\"
Looks like you need to escape that closing brace by doubling it:
string.Format("{{\"sectionTitle\":\"{0}\",\"strPushMsg\":\"{1}\",\"Language\":\"{2}\",}}\",\"articleid\":\"{3}\"}}", urlsectiontitle, formatHeadline, Language, articleid);
It appears you are creating JSON. Lenient parsers such as JSON.Net accept single quotes (which would avoid all the escaping), although strict JSON requires double quotes; even better, use a tool like JSON.Net that is designed to create JSON. Your (partial) structure here is quite small (the unmatched } shows this is only partial), but as the JSON gets bigger it is much easier to use a tool to get it right.
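A minimal sketch of the "use a tool" route follows. It assumes `System.Text.Json` (built into .NET Core 3.0 and later) instead of JSON.Net, and the variable values are placeholders I made up; only the property and variable names come from the question.

```csharp
using System;
using System.Text.Json;

// Placeholder values standing in for the question's variables.
string urlsectiontitle = "News";
string formatHeadline = "He said \"hello\" {today}";
string Language = "en";
string articleid = "42";

// The anonymous object's property names become the JSON property names.
var payload = new
{
    sectionTitle = urlsectiontitle,
    strPushMsg = formatHeadline,
    Language,
    articleid
};

// The serializer handles quotes and braces in the values; no format string is involved.
string jsonData = JsonSerializer.Serialize(payload);
Console.WriteLine(jsonData);
```

Because no format string is involved, braces and quotes inside the values cannot trigger a FormatException or produce unbalanced JSON.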

How automatic escape quotes in json (C#)

I get JSON from a server. The JSON is very big; here is a little piece of it:
{
"id": "9429531978965160",
"name": "Morning in "Paris"", // json.net cannot deserialize this line because the inner quotes are not escaped.
"alias": "ThisAlias"
}
The problem is the server side that generates invalid JSON.
You could try writing a regex that fixes this (searches for any quotes in between the third and last). Just note that there might be many other issues with the JSON, like newlines that are not escaped etc.
It's not just that the output you are receiving is non-standard json, it's broken in such a way that it's not a well-defined language and doesn't parse unambiguously even in the simple cases. How should you parse {"a": "A", "b": "B"}? One way is as legal json. Another valid parse is a single property a with the value "A\", \"b\": \"B".
As others have said, the best resolution is to fix the server so that it no longer outputs invalid garbage. If that's not an option, you'll have to write your own parser. A normal parser would declare a syntax error at the 'P' in "Paris". Your parser could back up to the last quote token and try to treat it as if it were escaped. The next syntax error is at the second of the consecutive quotes, and again it could back up and treat the quote token as if it were escaped. If there are any other ways in which the input deviates from legal JSON, you'll need to handle those as well.
If you're not familiar with parsers, this will take a while. And when you're done you'll have a parser that recognizes a poorly-specified and almost totally useless language, which is to say that it will largely be a waste of time. Do what you can to fix it on the server side.
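If the server really cannot be fixed, a much cruder alternative to a backtracking parser is a lookahead heuristic: treat a quote inside a string as the closing quote only when the next non-whitespace character is one that can legally follow a string (`,`, `}`, `]`, `:`) or the end of input. The sketch below is that heuristic, not the parser described above; `EscapeInnerQuotes` and `LooksLikeClosingQuote` are hypothetical names. It inherits the ambiguity already mentioned: a value such as `"He said "yes", then left"` will still be split in the wrong place.

```csharp
using System;
using System.Text;

// Hypothetical helper: escapes quotes that appear to sit inside a string value.
static string EscapeInnerQuotes(string json)
{
    var sb = new StringBuilder(json.Length + 16);
    bool inString = false;

    for (int i = 0; i < json.Length; i++)
    {
        char c = json[i];

        if (inString && c == '\\' && i + 1 < json.Length)
        {
            // Copy an existing escape sequence through untouched.
            sb.Append(c).Append(json[++i]);
            continue;
        }

        if (c == '"')
        {
            if (!inString)
                inString = true;                  // opening quote
            else if (LooksLikeClosingQuote(json, i + 1))
                inString = false;                 // plausible closing quote
            else
                sb.Append('\\');                  // stray inner quote: escape it
        }
        sb.Append(c);
    }
    return sb.ToString();
}

// A quote is assumed to close the string if what follows could legally follow a string.
static bool LooksLikeClosingQuote(string s, int pos)
{
    while (pos < s.Length && char.IsWhiteSpace(s[pos])) pos++;
    return pos == s.Length || ",}]:".IndexOf(s[pos]) >= 0;
}

string broken = "{\"id\": \"1\", \"name\": \"Morning in \"Paris\"\", \"alias\": \"ThisAlias\"}";
Console.WriteLine(EscapeInnerQuotes(broken));
```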
