We are currently doing performance tests to determine whether Kendo UI is fast enough for our needs. For that we need to test against a large database (~150 columns and ~100,000 rows).
Table rows should be read by a Kendo UI Grid using ajax calls, which return the data as a JSON string. With our test data (random strings of 3-10 chars) this works for up to ~700 result rows per request. Anything more and we hit maxJsonLength, which is already set to Int32.Max - 3.
We are not planning on displaying that many rows per page, but there might be binary data attached to the rows. That data could, even with 20 rows, easily go above the 2 MiB restriction implied by having to use an Int32 to set the max size.
So is there any way to serialize objects with a length bigger than 2M?
JSON isn't really designed to transfer large binary data. If you want your UI to be fast and snappy, you should try splitting larger objects into smaller ones and removing the binary content from the JSON.
For example, you can refactor the content of the JSON to carry only a link to the binary resource. If that binary resource is actually needed on screen, you can fetch it with a separate request. In fact you can perform requests in parallel: e.g. load the JSON and display the content, load the first N entries with binary data and display them, and skip the rest, as it will only slow down your page render time.
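As a rough illustration of that idea (the type and property names below are placeholders, not anything from Kendo UI), each grid row could carry a URL to the attachment instead of the attachment itself:
// Row DTO sent to the grid: no binary payload, only a link to it.
public class GridRowDto
{
    public int Id { get; set; }
    public string Name { get; set; }
    // Points at a separate endpoint (e.g. "/attachments/{id}") that streams the
    // binary content; the client requests it only for rows actually shown on screen.
    public string AttachmentUrl { get; set; }
}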
We are now using the Json.NET library from http://www.newtonsoft.com to serialize the objects. It is not bound by the web.config settings and, as far as we can tell, can handle JSON responses of unlimited length.
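For reference, a minimal sketch of how that can look in an ASP.NET MVC action (assuming MVC here; the LoadRows helper is ours). Returning the serialized string as Content bypasses JavaScriptSerializer and its maxJsonLength limit entirely:
using Newtonsoft.Json;

public ActionResult GridData()
{
    var rows = LoadRows(); // however the rows are actually fetched
    // Serialize with Json.NET and return the raw string instead of using Json(),
    // so web.config's maxJsonLength never comes into play.
    return Content(JsonConvert.SerializeObject(rows), "application/json");
}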
Assume a recurring job generates huge JSON based on some nested structure. Suppose bad data arrives one day, and the new JSON is 30% smaller in size than the last good JSON.
My goal is to count all the main properties while deserializing the JSON, compare the counts against the statistics from past runs, and notify the JSON producer that the new JSON is bad.
Is there a simple way to do this? I am using C# and Newtonsoft.Json for deserialization.
For example: suppose the JSON encapsulates an n-ary tree-like structure. Each node contains a source field as a string (which can be null). In the first run, 100 nodes were present in the JSON and 70% of them contained a valid source string. In the second run there were 105 nodes with 80% valid source strings (good data). In the third run there were 40 nodes with only 20% valid source strings. I consider this bad data and want to fail here without ingesting it.
Each time the recurring code runs, I want to compare it against the last run's count statistics and fail the run if the data is bad, as described in the example.
What you are describing is business logic and nothing the serializer should care about.
I think the best way to do this is to add a validation step between deserialization and processing of the deserialized data.
This means that the validator iterates over the deserialized POCO classes and counts how many nodes have a valid source string. If the count falls below a certain threshold, execution should halt and the JSON producer should be notified.
If you want the threshold to be dynamic (i.e. it should depend on the previous count), then you have to store the counted value in a file or database and retrieve it when the next JSON arrives.
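A hedged sketch of such a validator (the Node shape, the Source property and the counting approach are assumptions about your actual POCO model):
using System.Collections.Generic;

public class Node
{
    public string Source { get; set; }
    public List<Node> Children { get; set; } = new List<Node>();
}

public static class JsonStatsValidator
{
    // Walk the deserialized tree and count total nodes and nodes with a valid source.
    public static void Count(Node root, out int total, out int withSource)
    {
        total = 0; withSource = 0;
        var stack = new Stack<Node>();
        stack.Push(root);
        while (stack.Count > 0)
        {
            var node = stack.Pop();
            total++;
            if (!string.IsNullOrEmpty(node.Source)) withSource++;
            foreach (var child in node.Children ?? new List<Node>()) stack.Push(child);
        }
    }
}
After deserializing, compare these counts with the stored statistics from the previous run and stop (and notify the producer) if they dropped below your threshold.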
This is an offshoot of my question found at Pulling objects from binary file and putting in List<T> and the question posed by David Torrey at Serializing and Deserializing Multiple Objects.
I don't know if this is even possible or not. Say that I have an application that uses five different classes to contain the necessary information to do whatever it is that the application does. The classes do not refer to each other and have various numbers of internal variables and the like. I want to save the collection of objects spawned from these classes into a single save file and then be able to read them back out. Since the objects are created from different classes, they can't go into a single list to be sent to disk. I originally thought I could use something like sizeof(this) to record the size of each object in a table saved at the beginning of the file, and then have a common GetObjectType() that returns an actionable value for the kind of object it is, but apparently sizeof doesn't work the way I thought it would, so now I'm back at square one.
Several options: wrap all of your objects in a larger object and serialize that. The problem with that is that you can't deserialize just one object; you have to load all of them. If you have 10 objects, each of which is several megs, that isn't a good idea. You want random access to any of the objects, but you don't know the offsets in the file. The "wrapper" object can be something as simple as a List<object>, but I'd use my own object for better type safety.
Second option: use a fixed size file header/footer. Serialize each object to a MemoryStream, and then dump the memory streams from the individual objects into the file, remembering the number of bytes in each. Finally add a fixed size block at the start or end of the file to record the offsets in the file where the individual objects begin. In the example below the header of the file has first the number of objects in the file (as an integer), then one integer for each object giving the size of each object.
using System.IO;

// Serialize two objects into the same file.
// (Assumes each object exposes a Serialize() method returning its raw bytes.)

// First serialize to memory:
byte[] bytes1 = object1.Serialize();
byte[] bytes2 = object2.Serialize();

using (var file = new BinaryWriter(File.Open("objects.bin", FileMode.Create)))
{
    // Write header:
    file.Write(2);             // number of objects
    file.Write(bytes1.Length); // size of first object
    file.Write(bytes2.Length); // size of second object

    // Write data:
    file.Write(bytes1);
    file.Write(bytes2);
}
The offset of object N is the size of the header PLUS the sum of all sizes up to (but not including) N. So in this example, to read the file back, you first read the header like so:
var file = new BinaryReader(File.OpenRead("objects.bin"));
int numObjs = file.ReadInt32();
var sizes = new int[numObjs];
for (int i = 0; i < numObjs; i++)
    sizes[i] = file.ReadInt32();
After which you can compute and seek to the correct location for object N.
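For example, reading back object n (zero-based; file, sizes and numObjs come from the snippet above, and n is whatever index you need):
// Header = one int for the count plus one int per object size.
long offset = sizeof(int) * (1L + numObjs);
for (int i = 0; i < n; i++)
    offset += sizes[i];

file.BaseStream.Seek(offset, SeekOrigin.Begin);
byte[] bytesN = file.ReadBytes(sizes[n]);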
Third option: use a format that does option #2 for you, for example zip. In .NET you can use System.IO.Packaging to create a structured zip (OPC format), or you can use a third-party zip library if you'd rather roll your own zip-based format.
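To make that concrete, here is a rough sketch using System.IO.Compression instead (available since .NET 4.5; the entry names are arbitrary), reusing the bytes1/bytes2 arrays from above:
using System.IO.Compression;

// Each object becomes a named entry, so you can later open just the one you need.
using (var archive = ZipFile.Open("objects.zip", ZipArchiveMode.Create))
{
    using (var entry = archive.CreateEntry("object1").Open())
        entry.Write(bytes1, 0, bytes1.Length);
    using (var entry = archive.CreateEntry("object2").Open())
        entry.Write(bytes2, 0, bytes2.Length);
}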
I need to perform calculations and manipulation on an extremely large table or matrix, which will have roughly 7500 rows and 30000 columns.
The matrix data will look like this:
Document ID | word1 | word2 | word3 | ... | word30000 | Document Class
0032        |   1   |   0   |   0   | ... |     1     | P
In other words, the vast majority of the cells will contain boolean values (0s and 1s).
The calculations that need to be done would involve word stemming or feature selection (reducing the number of words by using reduction techniques), as well as per-class or per-word calculations, etc.
What I have in mind is designing an OOP model for representing the matrix and then serializing the objects to disk so I can reuse them later on. For instance, I would have an object for each row or each column, or perhaps an object for each intersection contained within another class.
I have thought about representing it in XML, but file sizes may prove problematic.
I may be completely off the mark with my approach here -
am I on the right path, or would there be better-performing approaches to manipulating such large data collections?
Key issues here will be performance (reaction time etc.), as well as redundancy and integrity of the data, and obviously I will need to save the data to disk.
You haven't explained the nature of the calculations you need to do on the table/matrix, so I'm making assumptions here, but if I read your question correctly, this may be a poster-child case for a relational database -- even if you don't have any actual relations in it. If you can't use a full server, SQL Server Compact Edition works as an embedded database, which would allow you to control the .SDF file programmatically if you chose.
Edit:
On second consideration, I withdraw my suggestion of a database. This is entirely because of the number of columns in the table: any relational database you use will have hard limits on column count, and I don't see a way around that which isn't amazingly complicated.
Based on your edit, I would say that there are three things you are interested in:
A way to analyze the presence of words in documents. This is the bulk of your sample data file, primarily being boolean values indicating the presence or lack of a word in a document.
The words themselves. This is primarily contained in the first row of your sample data file.
A means of identifying documents and their classification. These are the first and last columns of your data file.
After thinking about it for a little bit, this is how I would model your data:
With the case of word presence, I feel it's best to avoid a complex object model. You're wanting to do pure calculation in both directions (by column and by row), and the most flexible and potentially performant structure for that in my opinion is a simple two-dimensional array of bool fields, like so:
var wordMatrix = new bool[numDocuments,numWords];
The words themselves should be in an array or list of strings that are index-linked to the second dimension of the word matrix -- the one sized by numWords in the example above. If you ever needed to quickly search for a particular word, you could use a Dictionary<string, int>, with the key as the word and the value as the index, to quickly find the index of a particular word.
The document identification would similarly be in an array or list of ints index-linked to the first dimension. I'm assuming the document ids are integer values there. The classification would be a similar array or list, although I'd use a list of enums representing each possible value of the classification. As with the word search, if you needed to search for documents by id, you could have a Dictionary<int, int> act as your search index.
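Putting those pieces together, a sketch of the model might look like this (the DocumentClass values and variable names are placeholders):
using System.Collections.Generic;

enum DocumentClass { P, N }                              // whatever the real classes are

bool[,] wordMatrix = new bool[numDocuments, numWords];   // word presence
string[] words = new string[numWords];                   // index-linked to columns
int[] documentIds = new int[numDocuments];               // index-linked to rows
DocumentClass[] classes = new DocumentClass[numDocuments];

// Optional search indexes:
var wordIndex = new Dictionary<string, int>();           // word -> column index
var documentIndex = new Dictionary<int, int>();          // document id -> row index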
I've made several assumptions with this model, particularly that you want to do pure calculation on the word presence in all directions rather than "per document". If I'm wrong, a simpler approach might be to drop the two-dimensional array and model by document, i.e. a single C# Document class with a DocumentId and a DocumentClassification field, as well as a simple array of booleans that is index-linked to the word list. You could then work with a list of these Document objects along with a separate list of words.
Once you have a data model you like, saving it to disk is the easiest part. Just use C# serialization. You can save it via XML or binary, your choice. Binary would give you the smallest file size, naturally (I figure a little more than 200MB plus the size of a list of 30000 words). If you include the Dictionary lookup indexes, perhaps an additional 120kB.
In my website's advanced search screen there are about 15 fields that need an autocomplete field.
Their contents all depend on each other's values (so if one is filled in, the others' contents change depending on it).
Most of the fields have a huge amount of possibilities (1000's of entries at least).
Currently I make an ajax call if the user stops typing for half a second. This ajax call makes a quick query against my Lucene index and returns a bunch of JSON objects. The method itself is really fast; it's the connection and transferring of the data that is too slow.
If I look at other sites (say Facebook), their autocomplete is instant. I figure they put the possible values in their HTML so they don't have to do a round trip. But I fear that with the amount of data I'm handling, this is not an option.
Any ideas?
Return only the top X results.
Get some trends about what users are picking, and order based on that, preferably automatically.
Cache results for every URL & keystroke combination, so that you don't have to round-trip if you've already fetched the result before. Share this cache with all autocompletes that use the same URL & keystroke combination.
Of course, enable gzip compression for the JSON, and ensure you're setting your cache headers to cache for some time. The time depends on your rate of change of autocomplete responses.
Optimize the JSON to send down the bare minimum. Don't send down anything you don't need.
Are you returning ALL results for the possibilities, or just the top 10 as JSON objects?
I notice a lot of people send large numbers of results back to the screen, but then only show the first few. By sending back small numbers of results, you can reduce the data transfer.
Return the top "X" results rather than the whole list, to cut back on the number of options. You might also want to put in some trending to track what users pick from the list, so you can make the top "X" the most used/most relevant. You could always return your most relevant list first, then return the full list if they are still struggling.
In addition to limiting the set of results to a top X set consider enabling caching on the responses of the AJAX requests (which means using GET and keeping the URL simple).
It's amazing how often users will backspace and then end up retyping exactly the same content. Also, by allowing public and server-side caching you could speed up the overall round-trip time.
Cache the results in System.Web.Cache (see the sketch after this list)
Use a Lucene cache
Use GET not POST as IE caches this
Only grab a subset of results (10 as people suggest)
Try a decent 3rd party autocomplete widget like the YUI one
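A hedged sketch of that first point, assuming an ASP.NET back end and a GetSuggestionsFromLucene helper of your own:
using System;
using System.Collections.Generic;
using System.Web;
using System.Web.Caching;

public IList<string> GetSuggestions(string prefix)
{
    string key = "autocomplete:" + prefix;
    var cached = HttpRuntime.Cache[key] as IList<string>;
    if (cached != null)
        return cached;                                             // served from cache, no Lucene hit

    IList<string> results = GetSuggestionsFromLucene(prefix, 10);  // top 10 only
    HttpRuntime.Cache.Insert(key, results, null,
        DateTime.UtcNow.AddMinutes(10), Cache.NoSlidingExpiration);
    return results;
}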
Returning the top-N entries is a good approach. But if you want/have to return all the data, I would try and limit the data being sent and the JSON object itself.
For instance:
"This Here Company With a Long Name" becomes "This Here Company..." (you put the dots in the name client side--again; transfer a minimum of data).
And as far as the JSON object goes:
{n: "This Here Company", v: "1"}
... Where "n" would be the name and "v" would be the value.
I need to hold a representation of a document in memory, and am looking for the most efficient way to do this.
Assumptions
The documents can be pretty large, up to 100MB.
More often than not the document will remain unchanged (i.e. I don't want to do unnecessary up-front processing).
Changes will typically be quite close to each other in the document (i.e. as the user types).
It should be possible to apply changes fast (without copying the whole document).
Changes will be applied in terms of offsets and new/deleted text (not as line/col).
It has to work in C#.
Current considerations
Storing the data as a string. Easy to code, fast to set, very slow to update.
An array of lines. Moderately easy to code, slower to set (as we have to parse the string into lines), faster to update (as we can insert and remove lines easily, but finding offsets requires summing line lengths).
There must be a load of standard algorithms for this kind of thing (it's not a million miles away from disk allocation and fragmentation).
Thanks for your thoughts.
I would suggest breaking the file into blocks. All blocks have the same length when you load them, but the length of each block may change if the user edits it. This avoids moving 100 megabytes of data when the user inserts one byte at the front.
To manage the blocks, just put them - together with the offset of each block - into a list. If the user modifies a block's length, you only have to update the offsets of the blocks after it. To find the block containing a given offset, you can use binary search.
File size: 100 MiB
Block size: 16 KiB
Blocks: 6,400
Finding an offset using binary search (worst case): 13 steps
Modifying a block (worst case): copy 16,384 bytes of data and update 6,400 block offsets
Modifying a block (average case): copy 8,192 bytes of data and update 3,200 block offsets
A 16 KiB block size is just a random example - you can balance the costs of the operations by choosing the block size, perhaps based on the file size and the probability of each operation. Doing some simple math will yield the optimal block size.
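A rough sketch of the block list and offset lookup described above (type and member names are mine):
using System.Collections.Generic;
using System.Text;

public class Block
{
    public int Offset;          // offset of the block's first character in the document
    public StringBuilder Text;  // block content; its length changes as the user edits
}

// Binary search for the block that contains a given document offset.
public static int FindBlock(List<Block> blocks, int offset)
{
    int lo = 0, hi = blocks.Count - 1;
    while (lo < hi)
    {
        int mid = (lo + hi + 1) / 2;
        if (blocks[mid].Offset <= offset) lo = mid;
        else hi = mid - 1;
    }
    return lo;   // index of the last block starting at or before the offset
}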
Loading will be quite fast, because you load fixed-sized blocks, and saving should perform well too, because you will have to write a few thousand blocks and not millions of single lines. You can optimize loading by loading blocks only on demand, and you can optimize saving by only saving the blocks that changed (content or offset).
Finally, the implementation won't be too hard either. You could just use the StringBuilder class to represent a block. But this solution will not work well for very long lines with lengths comparable to the block size or larger, because you would have to load many blocks yet display only a small part of each, with the rest lying to the left or right of the window. I assume you would have to use a two-dimensional partitioning model in that case.
Good Math, Bad Math wrote an excellent article about ropes and gap buffers a while ago that details the standard methods for representing text files in a text editor, and even compares them for simplicity of implementation and performance. In a nutshell: a gap buffer - a large character array with an empty section immediately after the current position of the cursor - is your simplest and best bet.
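To make the idea concrete, here is a minimal gap-buffer sketch (an illustration of the technique, not production code): all text lives in one char array with an unused gap at the cursor; inserting fills the gap, and moving the cursor shifts characters across it.
using System;

public class GapBuffer
{
    private char[] buffer = new char[256];
    private int gapStart = 0;     // cursor position
    private int gapEnd = 256;     // one past the last unused slot

    public int Length => buffer.Length - (gapEnd - gapStart);

    public void MoveCursor(int position)
    {
        if (position < gapStart)          // shift the affected text to after the gap
        {
            int count = gapStart - position;
            Array.Copy(buffer, position, buffer, gapEnd - count, count);
            gapStart = position;
            gapEnd -= count;
        }
        else if (position > gapStart)     // shift the affected text to before the gap
        {
            int count = position - gapStart;
            Array.Copy(buffer, gapEnd, buffer, gapStart, count);
            gapStart = position;
            gapEnd += count;
        }
    }

    public void Insert(char c)
    {
        if (gapStart == gapEnd) Grow();   // gap exhausted, reallocate
        buffer[gapStart++] = c;
    }

    public void Delete()                  // delete the character before the cursor
    {
        if (gapStart > 0) gapStart--;
    }

    private void Grow()
    {
        var bigger = new char[buffer.Length * 2];
        Array.Copy(buffer, 0, bigger, 0, gapStart);
        int tail = buffer.Length - gapEnd;
        Array.Copy(buffer, gapEnd, bigger, bigger.Length - tail, tail);
        gapEnd = bigger.Length - tail;
        buffer = bigger;
    }
}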
You might find this paper useful --- Data Structures for Text Sequences which describes and experimentally analyses a few standard algorithms, and compares [among other things] gap buffers and piece tables.
FWIW, it concludes piece tables are slightly better overall; though net.wisdom seems to prefer gap buffers.
I would suggest you take a look at Memory Mapped Files (MMF).
Some pointers:
Memory Mapped Files .NET
http://msdn.microsoft.com/en-us/library/ms810613.aspx
I'd use a b-tree or skip list of lines, or larger blocks if you aren't going to edit much.
You don't incur much extra cost determining line ends on load, since you have to visit each character while loading anyway.
You can move lines within a node without much effort.
The total length of the text in each node is stored in the node, and changes propagated up to parent nodes.
Each line is represented by a data array, a start index, a length and a capacity. Line breaks/carriage returns aren't put in the data array. Common operations such as breaking lines only require changes to the references into the array; editing lines requires a copy if capacity is exceeded. A similar structure might be used per line temporarily when editing that line, so you don't perform a copy on each key press.
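A rough sketch of that per-line representation (the field names are my own):
// Lines reference a character buffer; line breaks are not stored in Data.
public struct Line
{
    public char[] Data;     // the character buffer holding the line's text
    public int Start;       // index of the line's first character in Data
    public int Length;      // number of characters currently in the line
    public int Capacity;    // room reserved before an edit forces a copy
}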
Off the top of my head, I would have thought an indexed linked list would be fairly efficient for this sort of thing unless you have some very long lines.
The linked list would give you an efficient way to store the data and add or remove lines as the user edits. The indexing allows you to quickly jump to a particular point in your file. This sort of idea lends itself well to undo/redo type operations too as it should be reasonably easy to sort edits into small atomic operations.
I'd agree with crisb's point though: it's probably better to get something simple working first and then see if it really is slow.
From your description it sounds a lot like your document is unformatted text only - so a StringBuilder would do fine.
If it's a formatted document, I would be inclined to use the MS Word APIs or similar and just offload your document processing to them - it will save you an awful lot of time, as document parsing can often be a pain in the a** :-)
I wouldn't get too worried about the performance yet - it sounds a lot like you haven't implemented one yet, so you also don't know what performance characteristics the rest of your app has - it may be that you can't actually afford to hold multiple documents in memory at all when you actually get round to profiling it.