I have an array of Object Ids which I need to retrieve from Parse. The size of the array varies greatly, and sometimes there are duplicates. Up until now, I've been prototyping, so I would use
string[] objectIds = new [] { "xT6...
...WhereContainedIn("objectId", objectIds);
And this would work okay. In real life, though, the size of the objectId array above can reach in the hundreds, and the query returns "operation was slow and timed out". I really have two questions here:
1) There has to be a better way to retrieve an array of objects, if you know the object Ids, but I couldn't find it. Is WhereContainedIn() the only solution here?
2) Are there any guidelines for how/when queries will simply fail? The documentation only mentions a limit of 1000 items to be retrieved, and nothing about the query going in. If it turns out that this query has to be batched, that would be okay, but there are no guidelines for batching, either.
So I have never used (or even heard of parse but reading throught the documentation I found this text about the limit maybe it would help.
"You can limit the number of results by calling Limit. By default, results are limited to 100, but anything from 1 to 1000 is a valid limit:"
https://www.parse.com/docs/dotnet_guide#queries-constraints
Related
I have a JSON string which looks like:
{"Detail": [
{"PrimaryKey":111,"Date":"2016-09-01","Version":"7","Count":2,"Name":"Windows","LastAccessTime":"2016-05-25T21:49:52.36Z"},
{"PrimaryKey":222,"Date":"2016-09-02","Version":"8","Count":2,"Name":"Windows","LastAccessTime":"2016-07-25T21:49:52.36Z"},
{"PrimaryKey":333,"Date":"2016-09-03","Version":"9","Count":3,"Name":"iOS","LastAccessTime":"2016-08-22T21:49:52.36Z"},
.....( *many values )
]}
The array Detail has lots of PrimaryKeys. Sometimes, it is about 500K PrimaryKeys. The system we use can only process JSON strings with certain length, i.e. 128KB. So I have to split this JSON string into segments (each one is 128KB or fewer chars in length).
Regex reg = new Regex(#"\{"".{0," + (128*1024).ToString() + #"}""\}");
MatchCollection mc = reg.Matches(myListString);
Currently, I use regular expression to do this. It works fine. However, it uses too much memory. Is there a better way to do this (unnecessary to be regular expression)?
*** Added more info.
The 'system' I mentioned above is Azure DocumentDB. By default, the document can only be 512KB (as now). Although we can request MS increase this, but the json file we got always much much more than 512KB. That's why we need to figure out a way to do this.
If possible, we want to keep using documentDB, but we are open to other suggestions.
*** Some info to make things clear: 1) the values in the array are different. Not duplicated. 2) Yes, I use StringBuilder whenever I can. 3) Yes, I tried IndexOf & Substring, but based on tests, the performance is not better than regular expression in this case (although it could be the way I implement it).
* **the json object is complex, but all I care is this "Detail" which is an array. We can assume the string is just like the example, only has "Detail". We need to split this json array string into size smaller than 512KB. Basically, we can think this as a simple string, not json. but, it is a json format, so maybe some libraries can do this better.
Take a look at Json.NET (available via NuGet).
It has a JsonReader class, which allows you to create a required object by reading json by token, example of json reading with JsonReader. Not that if you pass invalid json string (e.g. without "end array" character or without "end object" character) to JsonReader - it will throw an exception only when it reaches invalid item, so you can pass different substrings to it.
Also, I guess that your system has something similar to JsonReader, so you can use it.
Reading a string with StringReader should not require too much application memory and it should be faster then iterating through regular expression matches.
Here is a hacky solution assuming data contains your JSON data:
var details = data
.Split('[')[1]
.Split(']')[0]
.Split(new[] { "}," }, StringSplitOptions.None)
.Select(d => d.Trim())
.Select(d => d.EndsWith("}") ? d : d + "}");;
foreach (var detail in details)
{
// Now process "detail" with your JSON library.
}
Working example: https://dotnetfiddle.net/sBQjyi
Obviously you should only do this if you really can't use a normal JSON library. See Mikhail Neofitov's answer for library suggestions.
If you are reading the JSON data from file or network you should implement a more stream-like processing where you read one details line, deserialize it with your JSON library and yield it to the caller. When the caller requests the next detail object, read the next line, deserialize it and so on. This way you can minimize the memory footprint of your deserializer.
You might want to consider storing each detail in a separate document. It means two round trips to get both the header and all of the detail documents, but it means you are never dealing with a really large JSON document. Also, if Detail is added to incrementally, it'll be much more efficient for writes because there is no way to just add another row. You have to rewrite the entire document. Your read/write ratio will determine the break even point in overall efficiency.
Another argument for this is that the complexity of regex parsing, feeding it through your JSON parser, then reassembling it goes away. You never know if your regex parser will deal with all cases (commas inside of quotes, international characters, etc.). I've seen many folks think they have a good regex only to find odd cases in production.
If your Detail array can grow unbounded (or even with a large bound), then you should definitely make this change regardless of your JSON parser limitations or read/write ratio because eventually, you'll exceed the limit.
I have a variable that has a current value but when I change the value it first needs to store the past value in some data structure that will show me the past X many values.
This is to do all kinds of calculations on past values like an average of the most recent values and such.
My only idea was to use a queue for this and since I only need the past X values then I implemented a FixedSizedQueue that would automatically dequeue older values.
Since then I've found out I can't really access a random value in it at least in default implementations of queues it seems. But additionally that if one would make that work they would be slow and need to iterate over all values.
So I'm left wondering is there any way at all to do this efficiently? The only other way I can think off would be to have an array and simply implement some pushing feature that would move all elements by one index position. But that seems overly wasteful. If these are the only two options, which one would be better if I need to access each value in the data structure 20 times each time I change it, and the size would be 50 values stored?
This is a place where performance will matter a great deal since each variable being "recorded" will change at least a million times when iterating over the data I have so don't worry about me doing premature optimization. Thank you, I appreciate it!
You are looking for a ring buffer / circular buffer.
You can find a c# implementation here.
This question already has answers here:
How to find the index of an element in an array in Java?
(15 answers)
Closed 6 years ago.
I was asked this question in an interview. Although the interview was for dot net position, he asked me this question in context to java, because I had mentioned java also in my resume.
How to find the index of an element having value X in an array ?
I said iterating from the first element till last and checking whether the value is X would give the result. He asked about a method involving less number of iterations, I said using binary search but that is only possible for sorted array. I tried saying using IndexOf function in the Array class. But nothing from my side answered that question.
Is there any fast way of getting the index of an element having value X in an array ?
As long as there is no knowledge about the array (is it sorted? ascending or descending? etc etc), there is no way of finding an element without inspecting each one.
Also, that is exactly what indexOf does (when using lists).
How to find the index of an element having value X in an array ?
This would be fast:
int getXIndex(int x){
myArray[0] = x;
return 0;
}
A practical way of finding it faster is by parallel processing.
Just divide the array in N parts and assign every part to a thread that iterates through the elements of its part until value is found. N should preferably be the processor's number of cores.
If a binary search isn't possible (beacuse the array isn't sorted) and you don't have some kind of advanced search index, the only way I could think of that isn't O(n) is if the item's position in the array is a function of the item itself (like, if the array is [10, 20, 30, 40], the position of an element n is (n / 10) - 1).
Maybe he wants to test your knowledge about Java.
There is Utility Class called Arrays, this class contains various methods for manipulating arrays (such as sorting and searching)
http://download.oracle.com/javase/6/docs/api/java/util/Arrays.html
In 2 lines you can have a O(n * log n) result:
Arrays.sort(list); //O(n * log n)
Arrays.binarySearch(list, 88)); //O(log n)
Puneet - in .net its:
string[] testArray = {"fred", "bill"};
var indexOffset = Array.IndexOf(testArray, "fred");
[edit] - having read the question properly now, :) an alternative in linq would be:
string[] testArray = { "cat", "dog", "banana", "orange" };
int firstItem = testArray.Select((item, index) => new
{
ItemName = item,
Position = index
}).Where(i => i.ItemName == "banana")
.First()
.Position;
this of course would find the FIRST occurence of the string. subsequent duplicates would require additional logic. but then so would a looped approach.
jim
It's a question about data structures and algorithms (altough a very simple data structure). It goes beyond the language you are using.
If the array is ordered you can get O(log n) using binary search and a modified version of it for border cases (not using always (a+b)/2 as the pivot point, but it's a pretty sophisticated quirk).
If the array is not ordered then... good luck.
He can be asking you about what methods you have in order to find an item in Java. But anyway they're not faster. They can be olny simpler to use (than a for-each - compare - return).
There's another solution that's creating an auxiliary structure to do a faster search (like a hashmap) but, OF COURSE, it's more expensive to create it and use it once than to do a simple linear search.
Take a perfectly unsorted array, just a list of numbers in memory. All the machine can do is look at individual numbers in memory, and check if they are the right number. This is the "password cracker problem". There is no faster way than to search from the beginning until the correct value is hit.
Are you sure about the question? I have got a questions somewhat similar to your question.
Given a sorted array, there is one element "x" whose value is same as its index find the index of that element.
For example:
//0,1,2,3,4,5,6,7,8,9, 10
int a[10]={1,3,5,5,6,6,6,8,9,10,11};
at index 6 that value and index are same.
for this array a, answer should be 6.
This is not an answer, in case there was something missed in the original question this would clarify that.
If the only information you have is the fact that it's an unsorted array, with no reletionship between the index and value, and with no auxiliary data structures, then you have to potentially examine every element to see if it holds the information you want.
However, interviews are meant to separate the wheat from the chaff so it's important to realise that they want to see how you approach problems. Hence the idea is to ask questions to see if any more information is (or could be made) available, information that can make your search more efficient.
Questions like:
1/ Does the data change very often?
If not, then you can use an extra data structure.
For example, maintain a dirty flag which is initially true. When you want to find an item and it's true, build that extra structure (sorted array, tree, hash or whatever) which will greatly speed up searches, then set the dirty flag to false, then use that structure to find the item.
If you want to find an item and the dirty flag is false, just use the structure, no need to rebuild it.
Of course, any changes to the data should set the dirty flag to true so that the next search rebuilds the structure.
This will greatly speed up (through amortisation) queries for data that's read far more often than written.
In other words, the first search after a change will be relatively slow but subsequent searches can be much faster.
You'll probably want to wrap the array inside a class so that you can control the dirty flag correctly.
2/ Are we allowed to use a different data structure than a raw array?
This will be similar to the first point given above. If we modify the data structure from an array into an arbitrary class containing the array, you can still get all the advantages such as quick random access to each element.
But we gain the ability to update extra information within the data structure whenever the data changes.
So, rather than using a dirty flag and doing a large update on the next search, we can make small changes to the extra information whenever the array is changed.
This gets rid of the slow response of the first search after a change by amortising the cost across all changes (each change having a small cost).
3. How many items will typically be in the list?
This is actually more important than most people realise.
All talk of optimisation tends to be useless unless your data sets are relatively large and performance is actually important.
For example, if you have a 100-item array, it's quite acceptable to use even the brain-dead bubble sort since the difference in timings between that and the fastest sort you can find tend to be irrelevant (unless you need to do it thousands of times per second of course).
For this case, finding the first index for a given value, it's probably perfectly acceptable to do a sequential search as long as your array stays under a certain size.
The bottom line is that you're there to prove your worth, and the interviewer is (usually) there to guide you. Unless they're sadistic, they're quite happy for you to ask them questions to try an narrow down the scope of the problem.
Ask the questions (as you have for the possibility the data may be sorted. They should be impressed with your approach even if you can't come up with a solution.
In fact (and I've done this in the past), they may reject all your possibile approaches (no, it's not sorted, no, no other data structures are allowed, and so on) just to see how far you get.
And maybe, just maybe, like the Kobayashi Maru, it may not be about winning, it may be how you deal with failure :-)
In some library code, I have a List that can contain 50,000 items or more.
Callers of the library can invoke methods that result in strings being added to the list. How do I efficiently check for uniqueness of the strings being added?
Currently, just before adding a string, I scan the entire list and compare each string to the to-be-added string. This starts showing scale problems above 10,000 items.
I will benchmark this, but interested in insight.
if I replace the List<> with a Dictionary<> , will ContainsKey() be appreciably faster as the list grows to 10,000 items and beyond?
if I defer the uniqueness check until after all items have been added, will it be faster? At that point I would need to check every element against every other element, still an n^^2 operation.
EDIT
Some basic benchmark results. I created an abstract class that exposes 2 methods: Fill and Scan. Fill just fills the collection with n items (I used 50,000). Scan scans the list m times (I used 5000) to see if a given value is present. Then I built an implementation of that class for List, and another for HashSet.
The strings used were uniformly 11 characters in length, and randomly generated via a method in the abstract class.
A very basic micro-benchmark.
Hello from Cheeso.Tests.ListTester
filling 50000 items...
scanning 5000 items...
Time to fill: 00:00:00.4428266
Time to scan: 00:00:13.0291180
Hello from Cheeso.Tests.HashSetTester
filling 50000 items...
scanning 5000 items...
Time to fill: 00:00:00.3797751
Time to scan: 00:00:00.4364431
So, for strings of that length, HashSet is roughly 25x faster than List , when scanning for uniqueness. Also, for this size of collection, HashSet has zero penalty over List when adding items to the collection.
The results are interesting and not valid. To get valid results, I'd need to do warmup intervals, multiple trials, with random selection of the implementation. But I feel confident that that would move the bar only slightly.
Thanks everyone.
EDIT2
After adding randomization and multple trials, HashSet consistently outperforms List in this case, by about 20x.
These results don't necessarily hold for strings of variable length, more complex objects, or different collection sizes.
You should use the HashSet<T> class, which is specifically designed for what you're doing.
Use HashSet<string> instead of List<string>, then it should scale very well.
From my tests, HashSet<string> takes no time compared to List<string> :)
Possibly off-topic, but if you want to scale very large unique sets of strings (millions+) in a language-independent way, you might check out Bloom Filters.
Does the Contains(T) function not work for you?
I have read that dictionary<> is implemented as an associative array. In some languages (not necessarily anything related to .NET), string indexes are stored as a tree structure that forks at each node based upon the character in the node. Please see http://en.wikipedia.org/wiki/Associative_arrays.
A similar data structure was devised by Aho and Corasick in 1973 (I think). If you store 50,000 strings in such a structure, then it matters not how many strings you are storing. It matters more the length of the strings. If they are are about the same length, then you will likely never see a slow-down in lookups because the search algorithm is linear in run-time with respect to the length of the string you are searching for. Even for a red-black tree or AVL tree, the search run-time depends more upon the length of the string you are searching for rather than the number of elements in the index. However, if you choose to implement your index keys with a hash function, you now incurr the cost of hashing the string (going to be O(m), m = string length) and also the lookup of the string in the index, which will likely be on the order of O(log(n)), n = number of elements in the index.
edit: I'm not a .NET guru. Other more experienced people suggest another structure. I would take their word over mine.
edit2: your analysis is a little off for comparing uniqueness. If you use a hashing structure or dictionary, then it will not be an O(n^2) operation because of the reasoning I posted above. If you continue to use a list, then you are correct that it is O(n^2) * (max length of a string in your set) because you must examine each element in the list each time.
In my website's advanced search screen there are about 15 fields that need an autocomplete field.
Their content is all depending on each other's value (so if one is filled in, the other's content will change depending on the first's value).
Most of the fields have a huge amount of possibilities (1000's of entries at least).
Currently make an ajax call if the user stops typing for half a second. This ajax call makes a quick call to my Lucene index and returns a bunch of JSon objects. The method itself is really fast, but it's the connection and transferring of data that is too slow.
If I look at other sites (say facebook), their autocomplete is instant. I figure they put the possible values in their HTML, so they don't have to do a round trip. But I fear with the amounts of data I'm handling, this is not an option.
Any ideas?
Return only top x results.
Get some trends about what users are picking,
and order based on that, preferably
automatically.
Cache results for every URL & keystroke combination,
so that you don't have to round-trip
if you've already fetched the result
before.
Share this cache with all
autocompletes that use the same URL
& keystroke combination.
Of course,
enable gzip compression for the
JSON, and ensure you're setting your
cache headers to cache for some
time. The time depends on your rate
of change of autocomplete response.
Optimize the JSON to send down the
bare minimum. Don't send down
anything you don't need.
Are you returning ALL results for the possibilities or just the top 10 as json objects.
I notice a lot of people send large numbers of results back to the screen, but then only show the first few. By sending back small numbers of results, you can reduce the data transfer.
Return the top "X" results, rather than the whole list, to cut back on the number of options? You might also want to try and put in some trending to track what users pick from the list so you can try and make the top "X" the most used/most relvant. You could always return your most relevant list first, then return the full list if they are still struggling.
In addition to limiting the set of results to a top X set consider enabling caching on the responses of the AJAX requests (which means using GET and keeping the URL simple).
Its amazing how often users will backspace then end up retyping exactly the same content. Also by allowing public and server-side caching your could speed up the overall round-trup time.
Cache the results in System.Web.Cache
Use a Lucene cache
Use GET not POST as IE caches this
Only grab a subset of results (10 as people suggest)
Try a decent 3rd party autocomplete widget like the YUI one
Returning the top-N entries is a good approach. But if you want/have to return all the data, I would try and limit the data being sent and the JSON object itself.
For instance:
"This Here Company With a Long Name" becomes "This Here Company..." (you put the dots in the name client side--again; transfer a minimum of data).
And as far as the JSON object goes:
{n: "This Here Company", v: "1"}
... Where "n" would be the name and "v" would be the value.