Compare 2 JSONs - The property count - c#

Assume a recurring job that generates a huge JSON document from some nested structure. Suppose bad data arrives one day and the new JSON is 30% smaller than the last good JSON.
My goal is to count all the main properties while deserializing the JSON, compare the counts against past statistics, and notify the JSON producer that the new JSON is bad.
Is there a simple way to do this? I am using C# and Newtonsoft.Json for deserialization.
For example, suppose the JSON encapsulates an n-ary tree-like structure. Each node contains a source field as a string (which can be null). In the first run 100 nodes were present in the JSON and 70% of them contained a valid source string. In the second run there were 105 nodes with 80% valid source strings (good data). In the third run there were 40 nodes with only 20% valid source strings. I consider this bad data and want to fail here without ingesting it.
Each time the recurring job runs, I want to compare against the last run's count statistics and fail if the data is bad as described in the example.

What you are describing is business logic and nothing the serializer should care about.
I think the best way to do this is to add a validation step between deserialization and processing of the deserialized data.
The validator iterates over the deserialized POCO classes and counts how many nodes have a valid source string. If the count falls below a certain threshold, execution halts and the JSON producer is notified.
If you want the threshold to be dynamic (i.e. it should depend on the previous count) then you have to store the counted value in a file or database to retrieve it when the next JSON arrives.
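A minimal sketch of such a validation step, assuming a hypothetical Node POCO with a Source string and a Children list, plus a small stats file for the previous run (the names and the 30% tolerance are illustrative, not from the question):

using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public class Node
{
    public string Source { get; set; }
    public List<Node> Children { get; set; } = new List<Node>();
}

public class RunStats
{
    public int NodeCount { get; set; }
    public double ValidSourceRatio { get; set; }
}

public static class JsonValidator
{
    // Walk the deserialized tree and count nodes / valid source strings.
    public static RunStats Measure(Node root)
    {
        int total = 0, valid = 0;
        var stack = new Stack<Node>();
        stack.Push(root);
        while (stack.Count > 0)
        {
            var node = stack.Pop();
            total++;
            if (!string.IsNullOrWhiteSpace(node.Source)) valid++;
            foreach (var child in node.Children) stack.Push(child);
        }
        return new RunStats { NodeCount = total, ValidSourceRatio = (double)valid / total };
    }

    // Compare against the previous run (stored as JSON on disk here) and fail hard on bad data.
    public static void ValidateAgainstLastRun(RunStats current, string statsFile, double allowedDrop = 0.3)
    {
        if (File.Exists(statsFile))
        {
            var last = JsonConvert.DeserializeObject<RunStats>(File.ReadAllText(statsFile));
            if (current.NodeCount < last.NodeCount * (1 - allowedDrop) ||
                current.ValidSourceRatio < last.ValidSourceRatio * (1 - allowedDrop))
            {
                throw new InvalidDataException("New JSON looks bad compared to the last run; notify the producer.");
            }
        }
        File.WriteAllText(statsFile, JsonConvert.SerializeObject(current));
    }
}

The call site would be something like JsonValidator.ValidateAgainstLastRun(JsonValidator.Measure(root), "stats.json"), right after JsonConvert.DeserializeObject<Node>(...) and before any ingestion.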

Related

How to split the string more efficiently?

I have a JSON string which looks like:
{"Detail": [
{"PrimaryKey":111,"Date":"2016-09-01","Version":"7","Count":2,"Name":"Windows","LastAccessTime":"2016-05-25T21:49:52.36Z"},
{"PrimaryKey":222,"Date":"2016-09-02","Version":"8","Count":2,"Name":"Windows","LastAccessTime":"2016-07-25T21:49:52.36Z"},
{"PrimaryKey":333,"Date":"2016-09-03","Version":"9","Count":3,"Name":"iOS","LastAccessTime":"2016-08-22T21:49:52.36Z"},
.....( *many values )
]}
The array Detail has lots of PrimaryKeys; sometimes around 500K of them. The system we use can only process JSON strings up to a certain length, i.e. 128KB, so I have to split this JSON string into segments (each one 128KB or fewer characters in length).
Regex reg = new Regex(@"\{"".{0," + (128*1024).ToString() + @"}""\}");
MatchCollection mc = reg.Matches(myListString);
Currently, I use regular expression to do this. It works fine. However, it uses too much memory. Is there a better way to do this (unnecessary to be regular expression)?
*** Added more info.
The 'system' I mentioned above is Azure DocumentDB. By default, a document can only be 512KB (as of now). Although we could ask MS to increase this, the JSON files we get are always much, much larger than 512KB, so we need to figure out a way to do this.
If possible, we want to keep using DocumentDB, but we are open to other suggestions.
*** Some info to make things clear: 1) the values in the array are different. Not duplicated. 2) Yes, I use StringBuilder whenever I can. 3) Yes, I tried IndexOf & Substring, but based on tests, the performance is not better than regular expression in this case (although it could be the way I implement it).
*** The JSON object is complex, but all I care about is the "Detail" array. We can assume the string is just like the example and only has "Detail". We need to split this JSON array string into segments smaller than 512KB. Basically we could treat it as a plain string rather than JSON, but since it is JSON, maybe some libraries can do this better.
Take a look at Json.NET (available via NuGet).
It has a JsonReader class, which allows you to build the required object by reading the JSON token by token (see the example of JSON reading with JsonReader in its documentation). Note that if you pass an invalid JSON string (e.g. one without an "end array" or "end object" character) to JsonReader, it will only throw an exception when it reaches the invalid item, so you can pass different substrings to it.
Also, I guess that your system has something similar to JsonReader, so you can use it.
Reading a string with StringReader should not require too much application memory, and it should be faster than iterating through regular expression matches.
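As a rough sketch of that token-by-token approach with Json.NET (the chunk size, the {"Detail":[...]} wrapper and the helper name are my assumptions, and it assumes the only object-valued content of the root object is the Detail array):

using System.Collections.Generic;
using System.IO;
using System.Text;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

// Reads the Detail objects one by one and groups them into JSON chunks of at most maxChars characters.
public static IEnumerable<string> SplitDetails(TextReader input, int maxChars = 128 * 1024)
{
    using (var reader = new JsonTextReader(input))
    {
        var sb = new StringBuilder();
        while (reader.Read())
        {
            // Skip everything until we hit an object inside the "Detail" array.
            if (reader.TokenType != JsonToken.StartObject || reader.Depth == 0)
                continue;

            var item = JObject.Load(reader).ToString(Formatting.None);
            if (sb.Length > 0 && sb.Length + item.Length + 20 > maxChars)
            {
                yield return "{\"Detail\":[" + sb + "]}";
                sb.Clear();
            }
            if (sb.Length > 0) sb.Append(',');
            sb.Append(item);
        }
        if (sb.Length > 0)
            yield return "{\"Detail\":[" + sb + "]}";
    }
}

Each yielded chunk is a complete, valid JSON document, so it can be sent to DocumentDB (or any other size-limited consumer) independently.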
Here is a hacky solution assuming data contains your JSON data:
var details = data
    .Split('[')[1]
    .Split(']')[0]
    .Split(new[] { "}," }, StringSplitOptions.None)
    .Select(d => d.Trim())
    .Select(d => d.EndsWith("}") ? d : d + "}");
foreach (var detail in details)
{
    // Now process "detail" with your JSON library.
}
Working example: https://dotnetfiddle.net/sBQjyi
Obviously you should only do this if you really can't use a normal JSON library. See Mikhail Neofitov's answer for library suggestions.
If you are reading the JSON data from a file or the network, you should implement more stream-like processing: read one detail line, deserialize it with your JSON library and yield it to the caller. When the caller requests the next detail object, read the next line, deserialize it and so on (see the sketch below). This way you can minimize the memory footprint of your deserializer.
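A rough sketch of that line-by-line streaming idea, assuming the input file has one detail object per line as in the sample above:

using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json.Linq;

// Only one detail object is materialized at a time; the caller pulls the next one on demand.
public static IEnumerable<JObject> StreamDetails(string path)
{
    foreach (var line in File.ReadLines(path))
    {
        var trimmed = line.Trim().TrimEnd(',');
        if (trimmed.StartsWith("{") && trimmed.EndsWith("}"))
            yield return JObject.Parse(trimmed);
    }
}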
You might want to consider storing each detail in a separate document. It means two round trips to get both the header and all of the detail documents, but it means you are never dealing with a really large JSON document. Also, if Detail is added to incrementally, it'll be much more efficient for writes because there is no way to just add another row. You have to rewrite the entire document. Your read/write ratio will determine the break even point in overall efficiency.
Another argument for this is that the complexity of regex parsing, feeding it through your JSON parser, then reassembling it goes away. You never know if your regex parser will deal with all cases (commas inside of quotes, international characters, etc.). I've seen many folks think they have a good regex only to find odd cases in production.
If your Detail array can grow unbounded (or even with a large bound), then you should definitely make this change regardless of your JSON parser limitations or read/write ratio because eventually, you'll exceed the limit.

Serializing objects bigger than 2MiB to Json in Asp.net

We are currently doing Performance tests to determine, if Kendo UI is fast enough for our needs. For that we need to perform tests with a large database (~150 columns and ~100,000 rows).
Table rows should be read by a Kendo UI Grid using ajax calls, which return the data as a json string. With our test data (random strings of 3-10 chars) this works for up to ~700 result rows per request. More, and we hit maxJsonLength, which is already set to Int32.Max-3.
We are not planning on displaying that many rows per page, but there might be binary data attached to the rows. That data could, even with 20 rows, easily go above the 2 MiB restriction implied by having to use an Int32 to set the max size.
So is there any way to serialize objects with a length bigger than 2M?
JSON isn't really designed to transfer large binary data. If you want your UI to be fast and snappy you should try splitting larger objects into smaller ones and also removing binary content from the json.
For example, you can refactor the content of json to only carry a link to the binary resource. If that binary resource is actually needed on screen you can perform a separate request. In fact you can perform requests in parallel: e.g. load json and display the content. Load first N entries with binary data and display it. Don't load the rest as it will slow down your page render time.
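A small sketch of that refactoring, with made-up property names: the row DTO carries a URL to the binary resource instead of the bytes themselves, and the client fetches the attachment in a separate request only when it is actually needed.

public class RowDto
{
    public int Id { get; set; }
    public string Name { get; set; }
    // Instead of: public byte[] Attachment { get; set; }
    public string AttachmentUrl { get; set; }   // e.g. "/attachments/42", loaded on demand
}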
We are now using the Json lib from http://www.newtonsoft.com to serialize the objects. It is not bound by the web.config settings and can handle Json requests of unlimited length afaik.
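A minimal sketch of how that can look in an ASP.NET MVC action (controller name and data source are placeholders): returning the Json.NET output as raw content bypasses JavaScriptSerializer and its maxJsonLength setting entirely.

using System.Web.Mvc;
using Newtonsoft.Json;

public class GridController : Controller
{
    public ContentResult Rows()
    {
        var rows = GetRows();   // placeholder for the real data source
        return Content(JsonConvert.SerializeObject(rows), "application/json");
    }

    private object GetRows()
    {
        return new[] { new { Id = 1, Name = "example" } };
    }
}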

Writing disparate objects to a single file

This is an offshoot of my question found at Pulling objects from binary file and putting in List<T> and the question posed by David Torrey at Serializing and Deserializing Multiple Objects.
I don't know if this is even possible. Say that I have an application that uses five different classes to contain the necessary information to do whatever it is that the application does. The classes do not refer to each other and have varying numbers of internal variables and the like. I want to save the collection of objects spawned from these classes into a single save file and then be able to read them back out. Since the objects are created from different classes they can't go into a single typed list to be written to disk. I originally thought of using something like sizeof(this) to record each object's size in a table saved at the beginning of the file, together with a common GetObjectType() that returns an actionable value identifying the kind of object, but apparently sizeof doesn't work the way I thought it would, so now I'm back at square one.
Several options. First option: wrap all of your objects in a larger object and serialize that. The problem with that is that you can't deserialize just one object; you have to load all of them. If you have 10 objects of several megs each, that isn't a good idea. You also want random access to any of the objects, but you don't know the offsets in the file. The "wrapper" object can be something as simple as a List<object>, but I'd use my own object for better type safety (see the sketch below).
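A minimal sketch of that wrapper, using Json.NET purely as an example serializer; ClassA and ClassB are placeholders for your own unrelated classes:

using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public class ClassA { public string Name { get; set; } }
public class ClassB { public int Value { get; set; } }

// Typed lists give the type safety mentioned above.
public class SaveFile
{
    public List<ClassA> As { get; set; } = new List<ClassA>();
    public List<ClassB> Bs { get; set; } = new List<ClassB>();
}

// Write:  File.WriteAllText("save.json", JsonConvert.SerializeObject(saveFile));
// Read:   var loaded = JsonConvert.DeserializeObject<SaveFile>(File.ReadAllText("save.json"));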
Second option: use a fixed size file header/footer. Serialize each object to a MemoryStream, and then dump the memory streams from the individual objects into the file, remembering the number of bytes in each. Finally add a fixed size block at the start or end of the file to record the offsets in the file where the individual objects begin. In the example below the header of the file has first the number of objects in the file (as an integer), then one integer for each object giving the size of each object.
// Serialize two objects into the same file. "Serialize()" stands in for whatever
// serializer you use and is assumed to return a byte[].
// First serialize to memory:
byte[] bytes1 = object1.Serialize();
byte[] bytes2 = object2.Serialize();
using (var file = new BinaryWriter(File.Open("objects.bin", FileMode.Create)))
{
    // Write header:
    file.Write(2);              // Number of objects
    file.Write(bytes1.Length);  // Size of first object
    file.Write(bytes2.Length);  // Size of second object
    // Write data:
    file.Write(bytes1);
    file.Write(bytes2);
}
The offset of object N is the size of the header plus the sum of all object sizes before N. So in this example, to read the file, you read the header like so (with file now being a BinaryReader over the same file):
int numObjs = file.ReadInt32();
var sizes = new int[numObjs];
for (int i = 0; i < numObjs; i++)
    sizes[i] = file.ReadInt32();
After which you can compute and seek to the correct location for object N.
Third option: Use a format that does option #2 for you, for example zip. In .NET you can use System.IO.Packaging to create a structured zip (OPC format), or you can use a third party zip library if you want to roll your own zip format.
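A rough sketch of option #3 with System.IO.Packaging (requires a reference to WindowsBase; part names and the content type are arbitrary). Each object becomes its own part in the package, so any one of them can be read back without touching the others.

using System;
using System.IO;
using System.IO.Packaging;

static void WritePart(Package package, string partName, byte[] bytes)
{
    var uri = PackUriHelper.CreatePartUri(new Uri(partName, UriKind.Relative));
    var part = package.CreatePart(uri, "application/octet-stream");
    using (var stream = part.GetStream())
        stream.Write(bytes, 0, bytes.Length);
}

// Usage, with bytes1/bytes2 as in the example above:
// using (var package = Package.Open("objects.pack", FileMode.Create))
// {
//     WritePart(package, "/object1.bin", bytes1);
//     WritePart(package, "/object2.bin", bytes2);
// }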

Representing a giant matrix/table

I need to perform calculations and manipulation on an extremely large table or matrix, which will have roughly 7500 rows and 30000 columns.
The matrix data will look like this:
Document ID | word1 | word2 | word3 | ... | word30000 | Document Class
0032        | 1     | 0     | 0     | ... | 1         | P
In other words, the vast majority of the cells will contain boolean values(0's and 1's).
The calculations that need to be done involve word stemming or feature selection (reducing the number of words using reduction techniques), as well as calculations per class, per word, etc.
What I have in mind is designing an OOP model for representing the matrix and then serializing the objects to disk so I can reuse them later on. For instance, I could have an object for each row or each column, or perhaps an object for each intersection that is contained within another class.
I have thought about representing it in XML, but file sizes may prove problematic.
I may be missing the mark with my approach here.
Am I on the right path, or are there better-performing approaches to manipulating such large data collections?
Key issues will be performance (reaction time etc.), as well as redundancy and integrity of the data, and obviously I would need to save the data to disk.
You haven't explained the nature of the calculations you're needing to do on the table/matrix, so I'm having to make assumptions there, but if I read your question correctly, this may be a poster-child case for the use of a relational database -- even if you don't have any actual relations in your database. If you can't use a full server, use SQL Server Compact Edition as an embedded database, which would allow you to control the .SDF file programmatically if you chose.
Edit:
After a second consideration, I withdraw my suggestion for a database. This is entirely because of the number of columns in the table: any relational database you use will have hard limits on this, and I don't see a way around it that isn't amazingly complicated.
Based on your edit, I would say that there are three things you are interested in:
A way to analyze the presence of words in documents. This is the bulk of your sample data file, primarily being boolean values indicating the presence or lack of a word in a document.
The words themselves. This is primarily contained in the first row of your sample data file.
A means of identifying documents and their classification. This is the first and last column of your data file.
After thinking about it for a little bit, this is how I would model your data:
With the case of word presence, I feel it's best to avoid a complex object model. You're wanting to do pure calculation in both directions (by column and by row), and the most flexible and potentially performant structure for that in my opinion is a simple two-dimensional array of bool fields, like so:
var wordMatrix = new bool[numDocuments,numWords];
The words themselves should be in an array or list of strings that is index-linked to the second dimension of the word matrix, the one defined by numWords in the example above. If you ever need to quickly search for a particular word, you could use a Dictionary<string, int>, with the word as key and its index as value, to quickly find the index of a particular word.
The document identification would similarly be an array or list of ints index-linked to the first dimension; I'm assuming the document ids are integer values. The classification would be a similar array or list, although I'd use a list of enums representing each possible value of the classification. As with the word search, if you need to look up documents by id, you could have a Dictionary<int, int> act as your search index.
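Pulling those pieces together, a sketch of the model might look like this (the names and the DocumentClass values are assumptions):

using System.Collections.Generic;

public enum DocumentClass { P, N }   // whatever classifications actually exist

public class WordMatrixModel
{
    public bool[,] WordMatrix;                 // [document index, word index]
    public string[] Words;                     // index-linked to the word dimension
    public int[] DocumentIds;                  // index-linked to the document dimension
    public DocumentClass[] Classifications;    // one per document

    public Dictionary<string, int> WordIndex;  // word -> word index, for fast lookups
    public Dictionary<int, int> DocumentIndex; // document id -> document index
}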
I've made several assumptions with this model, particularly that you want to do pure calculation on the word presence in all directions rather than "per document". If I'm wrong, a simpler approach might be to drop the two-dimensional array and model by document, i.e. a single C# Document class with a DocumentId and a DocumentClassification field as well as a simple array of booleans that is index-linked to the word list. You could then work with a list of these Document objects along with a separate list of words.
Once you have a data model you like, saving it to disk is the easiest part. Just use C# serialization. You can save it via XML or binary, your choice. Binary would give you the smallest file size, naturally (I figure a little more than 200MB plus the size of a list of 30000 words). If you include the Dictionary lookup indexes, perhaps an additional 120kB.

Parse a smaller XML multiple times or a larger XML once using LINQ to XML

I am writing a process for an ASP.NET website which will look up a certain value from an XML file and perform a redirect. I am using LINQ (C#) to parse this XML.
I have hit a decision point where I have to look up alternate values within the same request. I have two solutions:
Look up each value separately. This will mean parsing the XML twice, but the XML will be much smaller.
Store multiple values in the XML and parse it once. This will make the XML larger.
So which approach will have less of a performance overhead considering this for a website with some concurrency?
Simply put should I parse 200 elements once OR 100 elements twice?
Neither. Cache the parsed result until it is changed (if ever).
Why not parse it into memory once and then do lookups in-memory? E.g. read it into a Dictionary<> on your Application object. Put a FileSystemWatcher on the file and re-parse if it changes.
My $0.02:
Caching the XElement (and querying it repeatedly) is good enough until your profiler tells you otherwise.
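A sketch of that caching approach (the file path and class name are illustrative; re-parsing inside the Changed handler is deliberately naive and ignores file-lock races):

using System.IO;
using System.Xml.Linq;

public static class RedirectMap
{
    private static readonly string XmlPath = @"C:\site\App_Data\mappings.xml";
    private static volatile XElement _cached = XElement.Load(XmlPath);
    private static readonly FileSystemWatcher Watcher = CreateWatcher();

    private static FileSystemWatcher CreateWatcher()
    {
        var w = new FileSystemWatcher(Path.GetDirectoryName(XmlPath), Path.GetFileName(XmlPath));
        w.Changed += (s, e) => _cached = XElement.Load(XmlPath);   // re-parse only when the file changes
        w.EnableRaisingEvents = true;
        return w;
    }

    public static XElement Current
    {
        get { return _cached; }
    }
}

Every request then queries RedirectMap.Current instead of re-loading and re-parsing the file.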
Ok, I fetched the results for both scenarios via my Linq query and then looped through the two XElements to determine which one I need to use. I just needed to think a bit more. So, now, I have a much smaller XML and a single look-up.
IEnumerable<XElement> mping = (from mpings in mpingXML.Elements("mping")
where mpings.Element("sptrn").Value.Equals(sourceURL, StringComparison.InvariantCultureIgnoreCase)
&& (mpings.Attribute("lcl").Value.Equals(locale, StringComparison.InvariantCultureIgnoreCase) || mpings.Attribute("lcl").Value.Equals("ALL", StringComparison.InvariantCultureIgnoreCase))
select mpings);
