cleaning JSON for XSS before deserializing

I am using Newtonsoft JSON deserializer. How can one clean JSON for XSS (cross site scripting)? Either cleaning the JSON string before de-serializing or writing some kind of custom converter/sanitizer? If so - I am not 100% sure about the best way to approach this.
Below is an example of JSON that has a dangerous script injected and needs "cleaning." I want a way to manage this before I de-serialize it. But we need to assume all kinds of XSS scenarios, including BASE64 encoded script etc., so the problem is more complex than a simple REGEX string replace.
{ "MyVar" : "hello<script>bad script code</script>world" }
Here is a snapshot of my deserializer ( JSON -> Object ):
public T Deserialize<T>(string json)
{
T obj;
var JSON = cleanJSON(json); //OPTION 1 sanitize here
var customConverter = new JSONSanitizer();// OPTION 2 create a custom converter
obj = JsonConvert.DeserializeObject<T>(json, customConverter);
return obj;
}
JSON is posted from a 3rd party UI interface, so it's fairly exposed, hence the server-side validation. From there, it gets serialized into all kinds of objects and is usually stored in a DB, later to be retrieved and outputted directly in HTML based UI so script injection must be mitigated.

Ok, I am going to try to keep this rather short, because this is a lot of work to write up the whole thing. But, essentially, you need to focus on the context of the data you need to sanitize. From comments on the original post, it sounds like some values in the JSON will be used as HTML that will be rendered, and this HTML comes from an un-trusted source.
The first step is to extract whichever JSON values need to be sanitized as HTML, and for each of those objects you need to run them through an HTML parser and strip away everything that is not in a whitelist. Don't forget that you will also need a whitelist for attributes.
HTML Agility Pack is a good starting place for parsing HTML in C#. How to do this part is a separate question in my opinion - and probably a duplicate of the linked question.
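As a rough sketch of the first step, the walk over the JSON values could look like the following. It assumes Newtonsoft.Json is referenced. JsonStringSanitizer and the sanitize delegate are hypothetical names; the delegate stands in for whatever whitelist-based HTML cleaner you build (for example on top of HTML Agility Pack). The sketch treats every string value as HTML, which is an assumption; in practice you would likely limit it to the properties that end up rendered as HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class JsonStringSanitizer
{
    // Parses the JSON, passes every string value through `sanitize`,
    // and returns the re-serialized JSON. Property names, numbers and
    // booleans are left untouched.
    public static string SanitizeStrings(string json, Func<string, string> sanitize)
    {
        JToken root = JToken.Parse(json);

        // A bare JSON string has no descendants, so handle it separately.
        IEnumerable<JToken> tokens = root is JContainer container
            ? container.DescendantsAndSelf()
            : new[] { root };

        // Materialize the list first so the tree is not enumerated while values change.
        List<JValue> strings = tokens.OfType<JValue>()
            .Where(v => v.Type == JTokenType.String)
            .ToList();

        foreach (JValue value in strings)
        {
            value.Value = sanitize((string)value.Value);
        }

        return root.ToString(Formatting.None);
    }
}
```

The result can then be handed to JsonConvert.DeserializeObject<T> as usual, so the sanitizing step stays separate from the object model.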
Your worry about base64 strings seems a little over-emphasized in my opinion. It's not like you can simply put aW5zZXJ0IGg0eCBoZXJl into an HTML document and the browser will render it. It can be abused through javascript (which your whitelist will prevent) and, to some extent, through data: urls (but this isn't THAT bad, as javascript will run in the context of the data page. Not good, but you aren't automatically gobbling up cookies with this). If you have to allow <a> tags, part of the process needs to be validating that the URL is http(s) (or whatever schemes you want to allow).
Ideally, you would avoid this uncomfortable situation, and instead use something like markdown - then you could simply escape the HTML string, but this is not always something we can control. You'd still have to do some URL validation though.
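For the simpler case where the value is plain text rather than HTML, escaping plus URL validation might be sketched like this. It uses only the .NET base library; OutputEscaping and its method names are hypothetical, and allowing only http and https is an assumption you should adjust to the schemes you actually want.

```csharp
using System;
using System.Net;

public static class OutputEscaping
{
    // Escapes untrusted text so a browser renders it as text, not markup.
    public static string EscapeForHtml(string untrusted)
    {
        return WebUtility.HtmlEncode(untrusted ?? string.Empty);
    }

    // Accepts only absolute http/https URLs. Anything else (javascript:,
    // data:, relative paths, null) is rejected.
    public static bool IsAllowedUrl(string candidate)
    {
        Uri uri;
        return Uri.TryCreate(candidate, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

Escaping is applied when the value is written into the page, so the stored data stays unmodified and can be escaped differently for other output contexts.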

Interesting! Thanks for asking. We normally use HttpUtility.UrlEncode in Web Forms. I have an enterprise Web API running that has validations like this, for which we created a custom regex. Please have a look at this MSDN link.
This is the sample model created to parse the request named KeyValue (say)
public class KeyValue
{
public string Key { get; set; }
}
Step 1: Trying with a custom regex
var json = @"[{ 'MyVar' : 'hello<script>bad script code</script>world' }]";
JArray readArray = JArray.Parse(json);
IList<KeyValue> blogPost = readArray.Select(p => new KeyValue { Key = (string)p["MyVar"] }).ToList();
// Test each extracted value; calling ToString() on the list would only yield its type name.
if (blogPost.Any(kv => !Regex.IsMatch(kv.Key ?? "",
@"^[\p{L}\p{Zs}\p{Lu}\p{Ll}\']{1,40}$")))
Console.WriteLine("InValid");
// ^ anchors the match at the start of the string.
// \p{..} matches any character in the Unicode category named by {..}.
// {L} matches any letter.
// {Lu} matches uppercase letters.
// {Ll} matches lowercase letters.
// {Zs} matches space separators.
// \' matches an apostrophe.
// {1,40} specifies the number of characters: no less than 1 and no more than 40.
// $ anchors the match at the end of the string.
Step 2: Using HttpUtility.UrlEncode - this newtonsoft website link suggests the below implementation.
string json = @"[{ 'MyVar' : 'hello<script>bad script code</script>world' }]";
JArray readArray = JArray.Parse(json);
IList<KeyValue> blogPost = readArray.Select(p => new KeyValue {Key =HttpUtility.UrlEncode((string)p["MyVar"])}).ToList();

Related

JSON Data format to remove escaped characters

Having some trouble with parsing some JSON data, and removing the escaped characters so that I can then assign the values to a List. I've read lots of pages on SO about this very thing, and where people are having success, I am just not. I was wondering if anyone could run their eyes over my method to see what I am doing wrong?
The API I have fetching the JSON data from is from IPStack. It allows me to capture location based data from website visitors.
Here is how I am building up the API path. The two query strings I've added to the URI are the access key that IPStack gives you to use, as well as fields=main, which gives you the main location based data (they have a few other blocks of data you can also get).
string api_URI = "http://api.ipstack.com/";
string api_IP = "100.121.126.33";
string api_KEY = "8378273uy12938";
string api_PATH = string.Format("{0}{1}?access_key={2}&fields=main", api_URI, api_IP, api_KEY);
The rest of the code in my method to pull the JSON data in is as follows.
System.Net.WebClient wc = new System.Net.WebClient();
Uri myUri = new Uri(api_PATH, UriKind.Absolute);
var jsonResponse = wc.DownloadString(myUri);
dynamic Data = Json.Decode(jsonResponse);
This gives me a JSON string that looks like this. (I have entered on each key/value to show you the format better). The IP and KEY I have obfuscated from my own details, but it won't matter in this summary anyway.
"{
\"ip\":\"100.121.126.33\",
\"type\":\"ipv4\",
\"continent_code\":\"OC\",
\"continent_name\":\"Oceania\",
\"country_code\":\"AU\",
\"country_name\":\"Australia\"
}"
This is where I believe the issue lies, in that I cannot remove the escaped characters. I have tried to use Regex.Escape(jsonResponse.ToString()); and whilst this does not throw any errors, it actually doesn't remove the \ characters either. It leaves me with the exact same string that went into it.
The rest of my method is to create a List which has one public string (country_name) just for limiting the scope during the test.
List<IPLookup> List = new List<IPLookup>();
foreach (var x in Data)
{
List.Add(new IPLookup()
{
country_name = x.country_name
});
}
The actual error in Visual Studio is thrown when it tries to add country_name to the List, as it complains that it does not contain country_name, and I'm presuming that is because it still has its backslash attached to it?
Any help or pointers on where I can look to fix this one up?

Resolved just from the questions posed by Jon and Luke which got me looking at the problem from another angle.
Rather than finishing my method with a foreach statement and trying to assign via x.something, I simply replaced that block of code with the following.
List<IPLookup> List = new List<IPLookup>();
List.Add(new IPLookup()
{
country_name = Data.country_name,
});
I can now access the key/value pairs from this JSON data without having to try remove the escaped characters that my debugger was showing me to have...
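To illustrate why no unescaping was needed: the backslashes are only how the debugger displays a string that contains quotes, and the response is a single JSON object rather than a list. A minimal sketch, using System.Text.Json instead of the Json.Decode helper from the question (a swapped-in parser, purely for illustration; IpLookupDemo is a hypothetical name):

```csharp
using System;
using System.Text.Json;

public static class IpLookupDemo
{
    // The response is one JSON object, so the property is read directly
    // from the root element; there is nothing to loop over.
    public static string ReadCountryName(string jsonResponse)
    {
        using (JsonDocument doc = JsonDocument.Parse(jsonResponse))
        {
            return doc.RootElement.GetProperty("country_name").GetString();
        }
    }

    // True when the raw text really contains a backslash character.
    public static bool ContainsBackslash(string text)
    {
        return text.IndexOf('\\') >= 0;
    }
}
```

Calling ContainsBackslash on a string written in C# source as "{\"country_name\":\"Australia\"}" returns false, because the \" sequences are source-level escapes and the string itself holds plain quotes.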

C# Parse HTML Post Data

I have MemoryStream data (HTML POST data) that I need to parse.
Converting it to a string gives a result like the one below:
key1=value+1&key2=val++2
The problem is that every + here was a space in the HTML form. I am not sure why the spaces are being converted to +.
This is how i am converting MemoryStream to string
Encoding.UTF8.GetString(request.PostData.ToArray())

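If the request uses a Content-Type of application/x-www-form-urlencoded, the body is URL encoded, and that encoding writes a space as +.
To get the original values back, decode the string with System.Web.HttpUtility.UrlDecode():
using System.Text;
using System.Web;
var data = HttpUtility.UrlDecode(Encoding.UTF8.GetString(request.PostData.ToArray()));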
See more in MSDN.
You can also use JSON format for POST.

I suppose that the data you are retrieving are encoded with URL rules.
You can discover why data are encoded to this format reading this simple article from W3c school.
To encode/decode your post string you may use this couple of methods:
System.Web.HttpUtility.UrlEncode(yourString); // Encode
System.Web.HttpUtility.UrlDecode(yourString); // Decode
You can find more information about URL manipulation functions here.
Note: If you need to encode/decode an array of strings, enumerate the collection with a for or foreach statement. Remember that a foreach loop does not let you assign to the loop variable during the enumeration (so you will probably need a temporary storage variable).
Lastly, to parse strings efficiently, I suggest you use the System.Text.RegularExpressions.Regex class and learn the regex "language".
You can find some example on how to use Regex here; Regex101 site has also a C# code generator that shows you how to translate your regex into code.
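For form data specifically, splitting and decoding can be done in one step. The following is a minimal sketch that assumes the body is UTF-8 and arrives as a MemoryStream, as in the question; FormBodyParser is a hypothetical name, while HttpUtility.ParseQueryString is the framework method doing the work.

```csharp
using System;
using System.Collections.Specialized;
using System.IO;
using System.Text;
using System.Web;

public static class FormBodyParser
{
    // Turns an application/x-www-form-urlencoded body into key/value pairs.
    // ParseQueryString splits on & and =, then URL-decodes each part,
    // which is where + is turned back into a space.
    public static NameValueCollection Parse(MemoryStream postData)
    {
        string body = Encoding.UTF8.GetString(postData.ToArray());
        return HttpUtility.ParseQueryString(body);
    }
}
```

With the body from the question, the returned collection maps key1 to "value 1" and key2 to "val  2" (two spaces). Decoding each value after splitting also keeps an encoded & inside a value from being mistaken for a separator.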

Deserializing ServiceBus content in Azure Logic App

I'm trying to read the content body of a message in an Azure Logic App, but I'm not having much success. I have seen a lot of suggestions which say that the body is base64 encoded, and suggest using the following to decode:
@{json(base64ToString(triggerBody()?['ContentData']))}
The base64ToString(...) part is decoding the content into a string correctly, but the string appears to contain a prefix with some extra serialization information at the start:
@string3http://schemas.microsoft.com/2003/10/Serialization/�3{"Foo":"Bar"}
There are also some extra characters in that string that are not being displayed in my browser. So the json(...) function doesn't accept the input, and gives an error instead.
InvalidTemplate. Unable to process template language expressions in
action 'HTTP' inputs at line '1' and column '2451': 'The template
language function 'json' parameter is not valid. The provided value
@string3http://schemas.microsoft.com/2003/10/Serialization/�3{"Foo":"bar" }
cannot be parsed: Unexpected character encountered while parsing value: @. Path '', line 0, position 0.. Please see https://aka.ms/logicexpressions#json for usage details.'.
For reference, the messages are added to the topic using the .NET service bus client (the client shouldn't matter, but this looks rather C#-ish):
await TopicClient.SendAsync(new BrokeredMessage(JsonConvert.SerializeObject(item)));
How can I read this correctly as a JSON object in my Logic App?

This is caused by how the message is placed on the ServiceBus, specifically in the C# code. I was using the following code to add a new message:
var json = JsonConvert.SerializeObject(item);
var message = new BrokeredMessage(json);
await TopicClient.SendAsync(message);
This code looks fine, and works between different C# services no problem. The problem is caused by the way the BrokeredMessage(Object) constructor serializes the payload given to it:
Initializes a new instance of the BrokeredMessage class from a given object by using DataContractSerializer with a binary XmlDictionaryWriter.
That means the content is serialized as binary XML, which explains the prefix and the unrecognizable characters. This is hidden by the C# implementation when deserializing, and it returns the object you were expecting, but it becomes apparent when using a different library (such as the one used by Azure Logic Apps).
There are two alternatives to handle this problem:
Make sure the receiver can handle messages in binary XML format
Make sure the sender actually uses the format we want, e.g. JSON.
Paco de la Cruz's answer handles the first case, using substring, indexOf and lastIndexOf:
@json(substring(base64ToString(triggerBody()?['ContentData']), indexof(base64ToString(triggerBody()?['ContentData']), '{'), add(1, sub(lastindexof(base64ToString(triggerBody()?['ContentData']), '}'), indexof(base64ToString(triggerBody()?['ContentData']), '}')))))
As for the second case, fixing the problem at the source simply involves using the BrokeredMessage(Stream) constructor instead. That way, we have direct control over the content:
var json = JsonConvert.SerializeObject(item);
var bytes = Encoding.UTF8.GetBytes(json);
var stream = new MemoryStream(bytes);
var message = new BrokeredMessage(stream, true);
await TopicClient.SendAsync(message);

You can use the substring function together with indexOf and lastIndexOf to get only the JSON substring.
Unfortunately, it's rather complex, but it should look something like this:
@json(substring(base64ToString(triggerBody()?['ContentData']), indexof(base64ToString(triggerBody()?['ContentData']), '{'), add(1, sub(lastindexof(base64ToString(triggerBody()?['ContentData']), '}'), indexof(base64ToString(triggerBody()?['ContentData']), '}')))))
More info on how to use these functions here.
HTH

Paco de la Cruz's solution worked for me, though I had to swap out the last '}' in the expression for a '{', otherwise it finds the wrong end of the data segment.
I also split it into two steps to make it a little more manageable.
First I get the decoded string out of the message into a variable (that I've called MC) using:
@{base64ToString(triggerBody()?['ContentData'])}
then in another logic app action do the substring extraction:
@{substring(variables('MC'),indexof(variables('MC'),'{'),add(1,sub(lastindexof(variables('MC'),'}'),indexof(variables('MC'),'{'))))}
Note that the last string literal '{' is reversed from Paco's solution.
This is working for my test cases, but I'm not sure how robust this is.
Also, I've left it as a String, I do the conversion to JSON later in my logic app.
UPDATE
We have found that just occasionally (2 in several hundred runs) the text that we want to discard can contain the '{' character.
I have modified our expression to explicitly locate the start of the data segment, which for me is:
'{"IntegrationRequest"'
so the substitution becomes:
@{substring(variables('MC'),indexof(variables('MC'),'{"IntegrationRequest"'),add(1,sub(lastindexof(variables('MC'),'}'),indexof(variables('MC'),'{"IntegrationRequest"'))))}
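The index arithmetic in these expressions is easier to check when written out in C#. This is only an illustrative restatement of the same logic, not something that runs inside a Logic App; PayloadExtractor and startMarker are hypothetical names. The length is (index of the last '}') minus (index of the start marker) plus 1, which is why both indexof calls have to look for the opening marker.

```csharp
using System;

public static class PayloadExtractor
{
    // Returns the text from the first occurrence of startMarker up to and
    // including the last '}' in the decoded message body.
    public static string ExtractJson(string decoded, string startMarker = "{")
    {
        int start = decoded.IndexOf(startMarker, StringComparison.Ordinal);
        int end = decoded.LastIndexOf('}');
        if (start < 0 || end < start)
        {
            throw new FormatException("No JSON segment found in the message body.");
        }

        // Same as add(1, sub(lastindexof(..., '}'), indexof(..., startMarker))).
        int length = end - start + 1;
        return decoded.Substring(start, length);
    }
}
```

Passing a longer marker such as {"IntegrationRequest" mirrors the updated expression and avoids being fooled by a stray '{' in the serialization prefix.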

String.Format not taking 4th object

Here is my problem: I want String.Format() to take 4 objects and format a string, but it throws an "Input string was not in a correct format" error.
Here is my code,
string jsonData = string.Format("{{\"sectionTitle\":\"{0}\",\"strPushMsg\":\"{1}\",\"Language\":\"{2}\",}\",\"articleid\":\"{3}\"}}", urlsectiontitle, formatHeadline, Language, articleid);

\"{2}\",}\"
Looks like you need to escape that closing brace by doubling it:
string.Format("{{\"sectionTitle\":\"{0}\",\"strPushMsg\":\"{1}\",\"Language\":\"{2}\",}}\",\"articleid\":\"{3}\"}}", urlsectiontitle, formatHeadline, Language, articleid);
It appears you are creating JSON. Some parsers, Json.NET among them, accept single-quoted strings (which would avoid all the escaping), although standard JSON requires double quotes. Better still, use a tool like JSON.Net that is designed to create JSON. Your (partial) structure here is quite small (the unmatched } shows it is only partial), but as the JSON gets bigger it is much easier to use a tool to get it right.
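Both approaches can be sketched side by side. This assumes Newtonsoft.Json is referenced; PushPayload and its method names are hypothetical, and the property names are taken from the question. Since the intended final structure is not shown in the question, the serializer version simply emits the four fields as one flat object.

```csharp
using System;
using Newtonsoft.Json;

public static class PushPayload
{
    // string.Format: {{ and }} produce literal braces, {0} is a placeholder.
    // Values are inserted as-is, so a quote inside title would break the JSON.
    public static string WithFormat(string title)
    {
        return string.Format("{{\"sectionTitle\":\"{0}\"}}", title);
    }

    // The serializer writes the braces and escapes the values itself.
    public static string WithSerializer(string sectionTitle, string strPushMsg,
                                        string language, string articleid)
    {
        return JsonConvert.SerializeObject(new
        {
            sectionTitle,
            strPushMsg,
            Language = language,
            articleid
        });
    }
}
```

The serializer version has no format string to get wrong, and a headline containing a double quote is escaped instead of producing invalid JSON.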

How to deserialize in C# an unknown JSON string to some Object

I need to parse in C# (key ,value wise) a string that is built in a JSON format (to be exact I need to parse the binding parameter of Knockout data-bind).
I go over the html file and I extract the bindings. I want to modify each and every binding (string-wise), but It's really hard for me to parse the string, since I can't really know where each binding stops and the other starts.
for example:
data-bind="text:'ggggg',event:{mouseover:x=function(){alert(1);return 'd,y'}}"
will result in the following string:
"text:'ggggg',event:{mouseover:x=function(){alert(1);return 'd,y'}}"
I want to modify the string in the following way:
newString= "text('gggg'),event(mouseover(x=function(){alert(1);return 'd,y'}))"
I figured out that the best way to do it is to deserialize the string by JSON and then it will be easier for me to get access to each and every binding element.
I write in C#, but since I go over the html file and each data-bind is different and can contain a different number and type of attributes, I would like to have a general object that I can deserialize to.
I checked out DataContractJsonSerializer but I don't see how it solves my problem.
Can you please suggest me what's best for my case?
Mary

You can do it with something like this:
var obj = ko.bindingProvider.instance.getBindings(yourDomElement,
ko.contextFor(yourDomElement));
alert(JSON.stringify(obj));
And then do whatever you want with obj.
Fiddle
But... well... don't!
