JsonPath with JsonTextReader: Token at a Time - c#

I am having an issue with JsonPath behaving differently when I load one token at a time (.Load) with JsonTextReader versus loading the entire JSON with ReadFrom. Here is an example:
JSON: Path="[*].person" Method=SelectTokens(path)
[
    {
        "person": {
            "personid": 123456
        }
    },
    {
        "person": {
            "personid": 798
        }
    }
]
When using .ReadFrom, it'll return the proper 2 elements. If I use .Load though, it'll return 0 elements. However, if I change the path to "person", .ReadFrom returns 0 elements while .Load returns 2 elements.
As a fix, I could change the path so that it'll remove up to the first "." i.e. path = substring(path.index(".")+1); however, this feels more of a hack than a proper fix. I would, of course, also need to ensure that the JSON is an array, but in most of my cases, it would be.
So finally, I am trying to learn how to use JSON Path with arrays when loading a token at a time. Any recommendations?
Full Code
Full JSON

The code you linked to reads tokens until it encounters an object. It then loads a JToken from that object, which reads ahead to the end of the object.
So you end up with one JToken per item in the root array. For each JToken you can then call:
token.SelectTokens("person").OfType<JObject>()
because you know the property contains an object.
That is equivalent to running the "[*].person" JsonPath against the whole parsed JSON.
I hope I have understood your question correctly. If not, please let me know =)
Update:
Based on your comments I understand what you are after. What you could do is create a method like this:
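using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public IEnumerable<JToken> GetTokensByPath(TextReader tr, string path)
{
    // Do our best to convert the path to a regex: "[*]" becomes a pattern for an array index.
    var regex = new Regex(path.Replace("[*]", @"\[[0-9]*\]"));
    using (var reader = new JsonTextReader(tr))
    {
        while (reader.Read())
        {
            // reader.Path is the path of the current token, e.g. "[0].person".
            if (regex.IsMatch(reader.Path))
                yield return JToken.Load(reader);
        }
    }
}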
I am matching reader.Path against the JSON path input, but a full solution would need to handle all of the JSON path grammar; at the moment I only support [*].
This approach is useful when you have a massive file and a deep JSON path selector. You will keep the stream open longer if you enumerate slowly, but peak memory usage will be much lower.
I hope this helps.
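One caveat with the conversion above: the "." in the path is left unescaped and the regex is not anchored, so it can match more than intended. Here is a minimal sketch of a stricter conversion, assuming "[*]" is still the only wildcard supported. JsonPathPattern and ToReaderPathRegex are hypothetical names used for illustration, not part of Json.NET.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class JsonPathPattern
{
    // Hypothetical helper (not a Json.NET API): builds an anchored regex that can be
    // tested against JsonTextReader.Path values such as "[0].person".
    public static Regex ToReaderPathRegex(string path)
    {
        // Split on the only wildcard handled here, escape the literal pieces,
        // then join them with a pattern for a concrete array index like [0] or [17].
        string[] pieces = path.Split(new[] { "[*]" }, StringSplitOptions.None);
        string body = string.Join(@"\[[0-9]+\]", pieces.Select(p => Regex.Escape(p)));

        // Anchor both ends so that child paths such as "[0].person.personid" do not match.
        return new Regex("^" + body + "$");
    }
}
```

With this, JsonPathPattern.ToReaderPathRegex("[*].person") matches "[0].person" but not "[0].person.personid". It could be swapped in for the new Regex(...) line in GetTokensByPath.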

Related

JSON Data format to remove escaped characters

I am having some trouble parsing some JSON data and removing the escaped characters so that I can then assign the values to a List. I've read lots of pages on SO about this very thing, and where other people are having success, I am just not. I was wondering if anyone could run their eyes over my method to see what I am doing wrong?
The API I am fetching the JSON data from is IPStack. It allows me to capture location-based data from website visitors.
Here is how I am building up the API path. The two query string parameters I've added to the URI are the access key that IPStack gives you, and fields=main, which gives you the main location-based data (they have a few other blocks of data you can also get).
string api_URI = "http://api.ipstack.com/";
string api_IP = "100.121.126.33";
string api_KEY = "8378273uy12938";
string api_PATH = string.Format("{0}{1}?access_key={2}&fields=main", api_URI, api_IP, api_KEY);
The rest of the code in my method to pull the JSON data in is as follows.
System.Net.WebClient wc = new System.Net.WebClient();
Uri myUri = new Uri(api_PATH, UriKind.Absolute);
var jsonResponse = wc.DownloadString(myUri);
dynamic Data = Json.Decode(jsonResponse);
This gives me a JSON string that looks like this. (I have put each key/value pair on its own line to show the format better.) I have obfuscated the IP and key, but it won't matter in this summary anyway.
"{
\"ip\":\"100.121.126.33\",
\"type\":\"ipv4\",
\"continent_code\":\"OC\",
\"continent_name\":\"Oceania\",
\"country_code\":\"AU\",
\"country_name\":\"Australia\"
}"
This is where I believe the issue lies, in that I cannot remove the escaped characters. I have tried to use Regex.Escape(jsonResponse.ToString()); and whilst this does not throw any errors, it actually doesn't remove the \ characters either. It leaves me with the exact same string that went into it.
The rest of my method is to create a List which has one public string (country_name) just for limiting the scope during the test.
List<IPLookup> List = new List<IPLookup>();
foreach (var x in Data)
{
List.Add(new IPLookup()
{
country_name = x.country_name
});
}
The actual error in Visual Studio is thrown when it tries to add country_name to the List: it complains that x does not contain country_name, and I'm presuming that is because it still has its backslash attached to it?
Any help or pointers on where I can look to fix this one up?
Resolved, just from the questions posed by Jon and Luke, which got me looking at the problem from another angle.
Rather than finishing my method with a foreach statement and trying to assign via x.something, I simply replaced that block of code with the following.
List<IPLookup> List = new List<IPLookup>();
List.Add(new IPLookup()
{
country_name = Data.country_name,
});
I can now access the key/value pairs from this JSON data without having to try to remove the escaped characters that my debugger was showing me.
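For anyone hitting the same confusion: the backslashes are most likely just how the debugger displays the quotes inside a C# string, not characters in the string itself. Also, Regex.Escape adds escaping for regex metacharacters rather than removing anything. A small sketch to check this, using a made-up sample string (EscapeDemo is a hypothetical name):

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // In source code and in the debugger this appears with \" sequences,
    // but the string itself holds plain double quotes and no backslashes.
    public static readonly string Json = "{\"country_name\":\"Australia\"}";

    // True only if the string really contains a backslash character.
    public static bool ContainsBackslash(string s)
    {
        return s.IndexOf('\\') >= 0;
    }

    // Regex.Escape prefixes regex metacharacters such as '{' with a backslash,
    // so its result has more backslashes than the input, not fewer.
    public static string EscapedForRegex(string s)
    {
        return Regex.Escape(s);
    }
}
```

EscapeDemo.ContainsBackslash(EscapeDemo.Json) returns false, while the same check on EscapeDemo.EscapedForRegex(EscapeDemo.Json) returns true. So there was nothing to strip from the response before decoding it.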

JSON.NET: Getting nested value when key contains dots?

I want to access a nested value with JSON.NET. I know I can use the .SelectToken() method to access a nested value (see for example this question or this question). My issue is that the JSON I'm trying to access has keys with dots in them:
var json = @"
{
    ""data.dot"": {
        ""value"": 5,
    }
}";
var jo = JObject.Parse(json);
Console.WriteLine(jo.SelectToken("data.dot.value")); // <-- doesn't work
I found the answer while writing this question, so I might as well share my findings.
It turns out that the .SelectToken method is very powerful, and:
allows you to query a JSON with escaped properties by surrounding your key with ['{key}']
allows you to use regex
allows you to filter by path value
allows you to query by complex path
So in my case, I could write:
jo.SelectToken("['data.dot'].value"); // escaped property
jo.SelectToken("$..value"); // complex JSON path
I could also use the JToken indexer, but contrary to the .SelectToken method, it would throw an exception if the JSON doesn't contain the data.dot key:
jo["data.dot"]["value"]
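If the key might be missing, both styles can be made null-safe. This is a minimal sketch assuming the value is an integer; DotKeyLookup and its method names are hypothetical, while SelectToken, the indexer, and the explicit int? conversion are Json.NET members.

```csharp
using Newtonsoft.Json.Linq;

public static class DotKeyLookup
{
    // Returns the nested value, or null when any part of the path is missing.
    public static int? ReadValue(string json)
    {
        JObject jo = JObject.Parse(json);
        // SelectToken returns null for a missing path; the explicit int?
        // conversion on JToken passes that null through instead of throwing.
        return (int?)jo.SelectToken("['data.dot'].value");
    }

    // Same lookup with the indexer; ?[ ] stops at the first missing key.
    public static int? ReadValueWithIndexer(string json)
    {
        JObject jo = JObject.Parse(json);
        return (int?)jo["data.dot"]?["value"];
    }
}
```

Both methods return 5 for {"data.dot": {"value": 5}} and null for {}.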

How to create json schema from json object string C#

I am evaluating Json.Net.Schema from Newtonsoft and NJsonSchema from GitHub, and I cannot figure out how to create a JSON schema from a JSON object. I want it to work exactly like this site does: http://jsonschema.net/#/
What I am looking for
string json = @"{""Name"": ""Bill"",""Age"": 51,""IsTall"": true}";
var jsonSchemaRepresentation = GetSchemaFromJsonObject(json);
I would expect a valid JSON schema in the jsonSchemaRepresentation variable. Does anyone know how I can accomplish this?
Thanks in advance!
The current version of NJsonSchema supports this feature:
The SampleJsonSchemaGenerator generates a JSON Schema from sample JSON data.
var schema = JsonSchema4.FromSampleJson("...");
var schemaJson = schema.ToJson();
... or create a SampleJsonSchemaGenerator instance and call the Generate("...") method.
Actually, neither of the libraries you mentioned supports this functionality.
If you're willing to implement it yourself, you will have to parse your JSON, iterate over it recursively, and add a new schema node depending on the type of each value you visit.
There are also some other tools (in other languages, like Python) which could serve as inspiration; this might get you started.
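To make the recursive idea concrete, here is a rough sketch using Json.NET's JToken types. It is a naive inference under stated assumptions (the first array item stands in for all items, no "required" lists, no formats), not a replacement for a real schema generator. NaiveSchemaInference and Infer are hypothetical names.

```csharp
using Newtonsoft.Json.Linq;

public static class NaiveSchemaInference
{
    // Walks the token tree and emits a minimal JSON Schema fragment per node.
    public static JObject Infer(JToken token)
    {
        switch (token.Type)
        {
            case JTokenType.Object:
                // Recurse into each property and collect the results under "properties".
                var properties = new JObject();
                foreach (JProperty p in ((JObject)token).Properties())
                    properties[p.Name] = Infer(p.Value);
                return new JObject { ["type"] = "object", ["properties"] = properties };
            case JTokenType.Array:
                // Assumption: the first item is representative of every item.
                var arraySchema = new JObject { ["type"] = "array" };
                JToken first = token.First;
                if (first != null)
                    arraySchema["items"] = Infer(first);
                return arraySchema;
            case JTokenType.Integer:
                return new JObject { ["type"] = "integer" };
            case JTokenType.Float:
                return new JObject { ["type"] = "number" };
            case JTokenType.Boolean:
                return new JObject { ["type"] = "boolean" };
            case JTokenType.Null:
                return new JObject { ["type"] = "null" };
            default:
                // Strings, and anything Json.NET parsed specially (e.g. dates), fall back to "string".
                return new JObject { ["type"] = "string" };
        }
    }
}
```

Calling NaiveSchemaInference.Infer(JToken.Parse(json)).ToString() on the Name/Age/IsTall sample from the question yields an object schema whose properties are typed string, integer and boolean.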
The string you are submitting to the function is not in the correct format. Try this (add '{' to the start of the string, '}' to the end):
string json = @"{
    ""Name"": ""Bill"",
    ""Age"": 51,
    ""IsTall"": true
}";
var jsonSchemaRepresentation = GetSchemaFromJsonObject(json);

cleaning JSON for XSS before deserializing

I am using Newtonsoft JSON deserializer. How can one clean JSON for XSS (cross site scripting)? Either cleaning the JSON string before de-serializing or writing some kind of custom converter/sanitizer? If so - I am not 100% sure about the best way to approach this.
Below is an example of JSON that has a dangerous script injected and needs "cleaning." I want a way to manage this before I de-serialize it. But we need to assume all kinds of XSS scenarios, including BASE64 encoded script etc., so the problem is more complex than a simple REGEX string replace.
{ "MyVar" : "hello<script>bad script code</script>world" }
Here is a snapshot of my deserializer ( JSON -> Object ):
public T Deserialize<T>(string json)
{
T obj;
var JSON = cleanJSON(json); //OPTION 1 sanitize here
var customConverter = new JSONSanitizer();// OPTION 2 create a custom converter
obj = JsonConvert.DeserializeObject<T>(json, customConverter);
return obj;
}
JSON is posted from a 3rd party UI interface, so it's fairly exposed, hence the server-side validation. From there, it gets serialized into all kinds of objects and is usually stored in a DB, later to be retrieved and outputted directly in HTML based UI so script injection must be mitigated.
Ok, I am going to try to keep this rather short, because this is a lot of work to write up the whole thing. But, essentially, you need to focus on the context of the data you need to sanitize. From comments on the original post, it sounds like some values in the JSON will be used as HTML that will be rendered, and this HTML comes from an un-trusted source.
The first step is to extract whichever JSON values need to be sanitized as HTML, and for each of those objects you need to run them through an HTML parser and strip away everything that is not in a whitelist. Don't forget that you will also need a whitelist for attributes.
HTML Agility Pack is a good starting place for parsing HTML in C#. How to do this part is a separate question in my opinion - and probably a duplicate of the linked question.
Your worry about base64 strings seems a little over-emphasized in my opinion. It's not like you can simply put aW5zZXJ0IGg0eCBoZXJl into an HTML document and the browser will render it. It can be abused through javascript (which your whitelist will prevent) and, to some extent, through data: urls (but this isn't THAT bad, as javascript will run in the context of the data page. Not good, but you aren't automatically gobbling up cookies with this). If you have to allow a tags, part of the process needs to be validating that the URL is http(s) (or whatever schemes you want to allow).
Ideally, you would avoid this uncomfortable situation, and instead use something like markdown - then you could simply escape the HTML string, but this is not always something we can control. You'd still have to do some URL validation though.
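For the two simpler pieces mentioned above (escaping a value that should be shown as text, and checking that a link uses an allowed scheme), here is a minimal sketch using only the .NET base class library. OutputEncoding and its method names are hypothetical, and this does not replace whitelist-based sanitizing for values that must stay HTML.

```csharp
using System;
using System.Net;

public static class OutputEncoding
{
    // Escapes a value so the browser renders it as text instead of markup.
    public static string EncodeForHtml(string untrusted)
    {
        return WebUtility.HtmlEncode(untrusted);
    }

    // Accepts only absolute http(s) URLs; javascript:, data: and relative URLs are rejected.
    public static bool IsAllowedLink(string url)
    {
        Uri uri;
        return Uri.TryCreate(url, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

For the sample value from the question, EncodeForHtml returns hello&lt;script&gt;bad script code&lt;/script&gt;world. Encoding at output time, in the context where the value is rendered, is generally safer than trying to clean the JSON before deserializing.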
Interesting! Thanks for asking. We normally use html.urlencode for web forms. I have an enterprise web API running that has validations like this. We have created a custom regex to validate. Please have a look at this MSDN link.
This is the sample model created to parse the request named KeyValue (say)
public class KeyValue
{
public string Key { get; set; }
}
Step 1: Trying with a custom regex
var json = @"[{ 'MyVar' : 'hello<script>bad script code</script>world' }]";
JArray readArray = JArray.Parse(json);
IList<KeyValue> blogPost = readArray.Select(p => new KeyValue { Key = (string)p["MyVar"] }).ToList();
// Validate each value, not the list itself (calling ToString() on the list only returns its type name).
if (blogPost.Any(p => !Regex.IsMatch(p.Key,
    @"^[\p{L}\p{Zs}\p{Lu}\p{Ll}\']{1,40}$")))
    Console.WriteLine("InValid");
// ^ anchors the match at the start of the string.
// \p{..} matches any character in the Unicode category named by {..}.
// {L} matches any letter.
// {Lu} matches uppercase letters.
// {Ll} matches lowercase letters.
// {Zs} matches space separators.
// ' matches an apostrophe.
// {1,40} specifies the number of characters: no less than 1 and no more than 40.
// $ anchors the match at the end of the string.
Step 2: Using HttpUtility.UrlEncode - this newtonsoft website link suggests the below implementation.
string json = @"[{ 'MyVar' : 'hello<script>bad script code</script>world' }]";
JArray readArray = JArray.Parse(json);
IList<KeyValue> blogPost = readArray.Select(p => new KeyValue { Key = HttpUtility.UrlEncode((string)p["MyVar"]) }).ToList();

Using JSON with corrupt data in C#

I recently had to parse JSON data like
[
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter", "lastName":"Jones"}
]
like this:
var reqData = JsonConvert.DeserializeObject<Dictionary<string, object>>("{" + fileData + "}");
which I used in another project where the data was well formatted. Here, however, the data was somewhat corrupt. For instance "firstName" might appear as ".\"firstName" and so forth. Using JSON like above results in an exception being thrown.
I tried various schemes to "purify" the data, but as I cannot predict the state of other data, I stopped using JSON and just parsed it myself (with heavy use of substrings and counting to isolate the keys and values). That method works OK, but of course using JSON would be much simpler.
Is there a way around this with JSON?
The main problem is defining what counts as corrupt data. If you know that the substring .\" never occurs legitimately, you can replace it with an empty string and parse afterwards. That is no problem, but it gets difficult when the corruption is more complex.
It is often no problem for a human to read corrupt data without a valid format, but it is almost impossible for simple algorithms.
By the way, the formatting ".\"firstName" is a valid JSON element because the " is escaped by \. See this question too.
