Optimizing array iteration with nested Where (LINQ) clauses - C#

I am creating a (C#) tool that has a search functionality. The search is similar to a "go to anywhere" search (like the one in ReSharper or VS2013).
The search context is a string array that contains all items up front:
private string[] context; // contains thousands of elements
Searching is incremental and occurs with every new input (character) the user provides.
I have implemented the search using the LINQ Where extension method:
// User searched for "c"
var input = "c";
var results = context.Where(s => s.Contains(input));
When the user searches for "ca", I attempted to use the previous results as the search context; however, this causes (I think?) a nested Where iteration and does not perform very well. Think of something like this code:
// Cache these results.
var results = context.Where(s => s.Contains(input));
// Next search uses the previous search results
var newResults = results.Where(s => s.Contains(newInput)); // newInput == "ca"
Is there any way to optimize this scenario?
Converting the IEnumerable into an array with every search causes high memory allocations and runs poorly.

Presenting the user with thousands of search results is pretty useless. You should add a "top" (Take in LINQ) clause to your query before presenting the results to the user.
var results = context.Where(s => s.Contains(input)).Take(100);
And if you want to present the next 100 results to the user:
var results = context.Where(s => s.Contains(input)).Skip(100).Take(100);
Also, just use the original array for all the searches; nesting Where has no benefit unless you materialize the intermediate query.
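If you do want to reuse previous results, "materialize" here means snapshotting the filtered set, e.g. into a List, so the next keystroke only scans the survivors. A rough sketch of that idea (not from the answer above):
// Snapshot the first filter once; this allocates, but only one list per accepted search.
List<string> cached = context.Where(s => s.Contains("c")).ToList();
// The next, narrower search only scans the cached survivors.
var refined = cached.Where(s => s.Contains("ca")).Take(100);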

I have a couple of useful points to add, too many for a comment.
First off, I agree with the other comments that you should start with .Take(100) to reduce the load time. Even better, add one result at a time:
var results = context.Where(s => s.Contains(input));
var resultEnumerator = results.GetEnumerator();
Loop over the resultEnumerator to display results one at the time, stop when the screen is full or a new search is initiated.
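For illustration, a minimal sketch of that consumption loop; ScreenIsFull, NewSearchRequested and DisplayResult are placeholder names, not part of the original answer:
// Pull results one at a time until the screen is full or a new search begins.
while (!ScreenIsFull() && !NewSearchRequested() && resultEnumerator.MoveNext())
{
    DisplayResult(resultEnumerator.Current); // placeholder for your UI code
}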
Second, throttle your input. If the user writes Hello, you do not want to shoot off 5 searches for H, He, Hel, Hell and Hello; you want to search for just Hello. When the user later adds world, it could be worthwhile to take your old results and add Hello world to the Where clause.
results = results.Where(s => s.Contains(input));
resultEnumerator = results.GetEnumerator();
And of course, cancel the in-progress search when the user adds new text.
Using Rx, the throttle part is easy; you would get something like this:
var result = context.AsEnumerable();
var oldStr = "";
var resultEnumerator = result.GetEnumerator();
Observable.FromEventPattern(h => txt.TextChanged += h, h => txt.TextChanged -= h)
.Select(s => txt.Text)
.DistinctUntilChanged().Throttle(TimeSpan.FromMilliseconds(300))
.Subscribe(s =>
{
if (s.Contains(oldStr))
result = result.Where(t => t.Contains(s));
else
result = context.Where(t => t.Contains(s));
resultEnumerator = result.GetEnumerator();
oldStr = s;
// and probably start iterating resultEnumerator again,
// but perhaps not on this thread.
});
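The cancellation mentioned above is not shown in the snippet; one possible way to wire it up (a sketch, assuming a CancellationTokenSource per search) is:
CancellationTokenSource cts = new CancellationTokenSource();
void StartDisplaying(IEnumerator<string> enumerator)
{
    cts.Cancel();                          // stop the previous in-progress display
    cts = new CancellationTokenSource();
    var token = cts.Token;
    Task.Run(() =>
    {
        // Iterate off the UI thread; marshal each item back to the UI thread to display it.
        while (!token.IsCancellationRequested && enumerator.MoveNext())
        {
            var item = enumerator.Current;
            // e.g. txt.BeginInvoke((Action)(() => AddResult(item))); -- AddResult is a placeholder
        }
    }, token);
}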

If allocs are your concern and you don't want to write a trie implementation or use third party code, you should get away with partitioning your context array successively to clump matching entries together in the front. Not very LINQ-ish, but fast and has zero memory cost.
The partitioning extension method, based on C++'s std::partition:
/// <summary>
/// All elements for which predicate is true are moved to the front of the array.
/// </summary>
/// <param name="start">Index to start with</param>
/// <param name="end">Index to end with</param>
/// <param name="predicate"></param>
/// <returns>Index of the first element for which predicate returns false</returns>
static int Partition<T>(this T[] array, int start, int end, Predicate<T> predicate)
{
while (start != end)
{
// move start to the first not-matching element
while ( predicate(array[start]) )
{
if ( ++start == end )
{
return start;
}
}
// move end to the last matching element
do
{
if (--end == start)
{
return start;
}
}
while (!predicate(array[end]));
// swap the two
var temp = array[start];
array[start] = array[end];
array[end] = temp;
++start;
}
return start;
}
So now you need to store the last partition index, which should be initialised with context length:
private int resultsCount; // initialise to context.Length once the context array is populated (e.g. in the constructor)
Then for each change in input that's incremental you can run:
resultsCount = context.Partition(0, resultsCount, s => s.Contains(input));
Each time this will only do the checks for elements that haven't been filtered out previously, which is exactly what you are after.
For each non-incremental change you'll need to reset resultsCount to the original value.
You can expose results in a convenient, debugger and LINQ friendly way:
public IEnumerable<string> Matches
{
get { return context.Take(resultsCount); }
}
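To tie the pieces together, a rough sketch of a search entry point built on Partition; the StartsWith check used to detect an incremental change is my assumption, not part of the answer:
private string previousInput = "";
public void Search(string input)
{
    if (!input.StartsWith(previousInput))
    {
        // Non-incremental change (e.g. a deleted character): reset to the full context.
        resultsCount = context.Length;
    }
    resultsCount = context.Partition(0, resultsCount, s => s.Contains(input));
    previousInput = input;
    // Matches now yields only the entries that survived every filter so far.
}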

Related

Reactive - how to combine / join / look up items with two sequences

I am connecting to a web service that gives me all prices for a day (without time info). Each of those price results has the id for a corresponding "batch run".
The "batch run" has a date+time stamp, but I have to make a separate call to get all the batch info for the day.
Hence, to get the actual time of each result, I need to combine the two API calls.
I'm using Reactive for this, but I can't reliably combine the two sets of data. I thought that CombineLatest would do it, but it doesn't seem to work as I thought (based on http://reactivex.io/documentation/operators/combinelatest.html, http://introtorx.com/Content/v1.0.10621.0/12_CombiningSequences.html#CombineLatest).
[TestMethod]
public async Task EvenMoreBasicCombineLatestTest()
{
int batchStart = 100, batchCount = 10;
//create 10 results with batch ids [100, 109]
//the test uses lists just to make debugging easier
var resultsWithBatchIdList = Enumerable.Range(batchStart, batchCount)
.Select(id => new { BatchRunId = id, ResultValue = id * 10 })
.ToList();
var resultsWithBatchId = Observable.ToObservable(resultsWithBatchIdList);
Assert.AreEqual(batchCount, await resultsWithBatchId.Count());
//create 10 batches with ids [100, 109]
var batchesList = Enumerable.Range(batchStart, batchCount)
.Select(id => new
{
ThisId = id,
BatchName = String.Concat("abcd", id)
})
.ToList();
var batchesObservable = Observable.ToObservable(batchesList);
Assert.AreEqual(batchCount, await batchesObservable.Count());
//turn the batch set into a dictionary so we can look up each batch by its id
var batchRunsByIdObservable = batchesObservable.ToDictionary(batch => batch.ThisId);
//for each result, look up the corresponding batch id in the dictionary to join them together
var resultsWithCorrespondingBatch =
batchRunsByIdObservable
.CombineLatest(resultsWithBatchId, (batchRunsById, result) =>
{
Assert.AreEqual(batchCount, batchRunsById.Count);
var correspondingBatch = batchRunsById[result.BatchRunId];
var priceResultAndSourceBatch = new
{
Result = result,
SourceBatchRun = correspondingBatch
};
return priceResultAndSourceBatch;
});
Assert.AreEqual(batchCount, await resultsWithCorrespondingBatch.Count());
}
I would expect as each element of the 'results' observable comes through, it would get combined with each element of the batch-id dictionary observable (which only ever has one element). But instead, it looks like only the last element of the result list gets joined.
I have a more complex problem deriving from this but while trying to create a minimum repro, even this is giving me unexpected results. This happens with versions 3.1.1, 4.0.0, 4.2.0, etc.
(Note that the sequences don't generally match up as in this artificial example, so I can't just Zip them.)
So how can I do this join? A stream of results that I want to look up more info via a Dictionary (which also is coming from an Observable)?
Also note that the goal is to return the IObservable (resultsWithCorrespondingBatch), so I can't just await the batchRunsByIdObservable.
Ok I think I figured it out. I wish either of the two marble diagrams in the documentation had been just slightly different -- it would have made a subtlety of CombineLatest much more obvious:
N------1---2---3---
L--z--a------bc----
R------1---2-223---
       a   a bcc
It's combine latest -- so depending on when items get emitted, it's possible to miss some tuples. What I should have done is SelectMany:
NO: .CombineLatest(resultsWithBatchId, (batchRunsById, result) =>
YES: .SelectMany(batchRunsById => resultsWithBatchId.Select(result =>
Note that the "join" order is important: A.SelectMany(B) vs B.SelectMany(A) -- if A has 1 item and B has 100 items, the latter would result in 100 calls to subscribe to A.
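Applied to the test above, the corrected join would look something like this (a sketch reusing the test's variable names; the anonymous type mirrors the original projection):
var resultsWithCorrespondingBatch =
    batchRunsByIdObservable
        .SelectMany(batchRunsById =>
            resultsWithBatchId.Select(result => new
            {
                Result = result,
                SourceBatchRun = batchRunsById[result.BatchRunId]
            }));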

Query method should return Cursor or List? (MongoDB C#)

I have a task in which I need to query a large amount of data. I created a method for the queries:
public List<T> Query(FilterDefinition<T> filter, SortDefinition<T> sort, int limit)
{
var query = Collection.Find(filter).Sort(sort).Limit(limit);
var result = query.ToList();
return result;
}
In the main method:
List<Cell> cells = MyDatabaseService.Query(filter, sort, 100000);
This List will contain 100,000 values, which is quite large.
On the other hand I can also use:
public async Task<IAsyncCursor<T>> QueryAsync(FilterDefinition<T> filter, SortDefinition<T> sort, int limit)
{
FindOptions<T> options = new FindOptions<T> { Sort = sort, Limit = limit };
var queryCursor = await Collection.FindAsync(filter, options);
return queryCursor;
}
In the main method, I then use a while loop to iterate the cursor.
IAsyncCursor<Cell> cursor = await MyDatabaseService.QueryAsync(filter, sort, 100000);
while (await cursor.MoveNextAsync())
{
var batch = cursor.Current;
foreach (var document in batch)
{
}
}
So, considering I have a lot of data to query, is it a good idea to use the 2nd implementation? Thanks for any reply.
It really depends what you are planning to do with the documents once you've retrieved them from the server.
If you need to perform an operation that requires all 100,000 documents to be in the program's memory then the two methods will essentially do the same thing.
On the other hand, if you are using the returned documents one by one, the second method is better: the first will essentially process every document twice (once to retrieve it along with all other documents and once to act on it); the second will process it once (retrieve and act immediately).
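If you go the cursor route, the driver also offers the ForEachAsync extension, which acts on each document as it is streamed in; a short sketch (ProcessCell is a placeholder for your per-document work):
var cursor = await MyDatabaseService.QueryAsync(filter, sort, 100000);
// Act on each document as it arrives instead of buffering all 100,000 in memory.
await cursor.ForEachAsync(document =>
{
    ProcessCell(document);
});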

Performance issue with search of a large in-memory collection

I wrote a query to find nodes in the node data from the transition data, but it takes quite a long time to come out of that loop since it has 4 million records.
What we have:
1. Transition data (collection), which has a from and a to node.
2. Node data (collection), which has a key equal to the from or to node from the transition data (collection).
What is required out of these collections:
1. A collection which should have the transition data (from, to) and the corresponding nodes from the node data (from key and to key).
The code I wrote works fine, but it takes a lot of time to execute. Below is the code.
foreach (var trans in transitions)
{
string transFrom = trans.From;
string transTo = trans.To;
var fromNodeData = nodeEntitydata.Where(x => x.Key == transFrom).FirstOrDefault();
var toNodeData = nodeEntitydata.Where(x => x.Key == transTo).FirstOrDefault();
if (fromNodeData != null && toNodeData != null)
{
//string fromSwimlane = fromNodeData.Group;
//string toSwimlane = toNodeData.Group;
string dicKey = fromNodeData.sokey + toNodeData.sokey;
if (!dicTrans.ContainsKey(dicKey))
{
soTransition.Add(new TransitionDataJsonObject
{
From = fromNodeData.sokey,
To = toNodeData.sokey,
FromPort = fromPortIds[0],
ToPort = toPortIds[0],
Description = "SOTransition",
IsManual = true
});
dicTrans.Add(dicKey, soTransition);
}
}
}
That is the loop which takes time to execute. I know the problem is in those two Where clauses, because transitions has 400k records and nodeEntitydata has 400k records. Can someone help me with this?
Use direct access to the dictionary entry:
var fromNodeData = nodeEntitydata[transFrom];
var toNodeData = nodeEntitydata[transTo];
It looks like nodeEntitydata is just a normal collection. The problem you're facing is that performing a Where on an in memory collection has linear performance, and you've got a lot of records to process.
What you need is a Dictionary. This is much more efficient for searching large collections, because it uses a hash table for lookups rather than a linear search.
If nodeEntitydata isn't already a Dictionary, you can create a Dictionary from it like this:
var nodeEntitydictionary = nodeEntitydata.ToDictionary(n => n.Key);
You can then consume the dictionary like this:
var fromNodeData = nodeEntitydictionary[transFrom];
var toNodeData = nodeEntitydictionary[transTo];
Creating the Dictionary will be fairly slow, so make sure you only do it once at the point where you populate nodeEntitydata. If you have to keep re-instantiating the Dictionary too frequently then you won't see much of a performance benefit, so make sure you reuse it as much as possible.
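Putting it together, the original loop might look something like this once the dictionary is in place (a sketch reusing the names from the question; TryGetValue replaces the two Where(...).FirstOrDefault() calls and the null checks):
var nodeEntitydictionary = nodeEntitydata.ToDictionary(n => n.Key);
foreach (var trans in transitions)
{
    // Two O(1) hash lookups instead of scanning the whole collection twice per transition.
    if (nodeEntitydictionary.TryGetValue(trans.From, out var fromNodeData) &&
        nodeEntitydictionary.TryGetValue(trans.To, out var toNodeData))
    {
        string dicKey = fromNodeData.sokey + toNodeData.sokey;
        if (!dicTrans.ContainsKey(dicKey))
        {
            soTransition.Add(new TransitionDataJsonObject
            {
                From = fromNodeData.sokey,
                To = toNodeData.sokey,
                FromPort = fromPortIds[0],
                ToPort = toPortIds[0],
                Description = "SOTransition",
                IsManual = true
            });
            dicTrans.Add(dicKey, soTransition);
        }
    }
}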

Cache only parts of an object

I'm trying to achieve a super-fast search, and decided to rely heavily on caching to achieve this. The order of events is as follows:
1) Cache what can be cached (from entire database, around 3000 items)
2) When a search is performed, pull the entire result set out of the cache
3) Filter that result set based on the search criteria. Give each search result a "relevance" score.
4) Send the filtered results down to the database via xml to get the bits that can't be cached (e.g. prices)
5) Display the final results
This is all working and going at lightning speed, but in order to achieve (3) I've given each result a "relevance" score. This is just a member integer on each search result object. I iterate through the entire result set and update this score accordingly, then order-by it at the end.
The problem I am having is that the "relevance" member is retaining this value from search to search. I assume this is because what I am updating is a reference to the search results in the cache, rather than a new object, so updating it also updates the cached version. What I'm looking for is a tidy solution to get around this. What I've come up with so far is either;
a) Clone the cache when I get it.
b) Create a separate dictionary to store relevances in and match them up at the end.
Am I missing a really obvious and clean solution or should I go down one of these routes? I'm using C# and .NET.
Hopefully it is obvious from the description what I'm getting at; here's some code anyway. This first one is the iteration through the cached results in order to do the filtering:
private List<QuickSearchResult> performFiltering(string keywords, string regions, List<QuickSearchResult> cachedSearchResults)
{
List<QuickSearchResult> filteredItems = new List<QuickSearchResult>();
string upperedKeywords = keywords.ToUpper();
string[] keywordsArray = upperedKeywords.Split(' ');
string[] regionsArray = regions.Split(',');
foreach (var item in cachedSearchResults)
{
//Check for keywords
if (keywordsArray != null)
{
if (!item.ContainsKeyword(upperedKeywords, keywordsArray))
continue;
}
//Check for regions
if (regionsArray != null)
{
if (!item.IsInRegion(regionsArray))
continue;
}
filteredItems.Add(item);
}
return filteredItems.OrderBy(t=> t.Relevance).Take(_maxSearchResults).ToList<QuickSearchResult>();
}
and here is an example of the "IsInRegion" method of the QuickSearchResult object:
public bool IsInRegion(string[] regions)
{
int relevanceScore = 0;
foreach (var region in regions)
{
int parsedRegion = 0;
if (int.TryParse(region, out parsedRegion))
{
foreach (var thisItemsRegion in this.Regions)
{
if (thisItemsRegion.ID == parsedRegion)
relevanceScore += 10;
}
}
}
Relevance += relevanceScore;
return relevanceScore > 0;
}
And basically, if I search for "london" I get a score of 10 the first time, 20 the second time...
If you use the NetDataContractSerializer to serialize your objects in the cache, you could use a [DataMember] attribute to control what gets serialized and what doesn't. For instance, you could store your temporary calculated relevance value in a field that is not serialized.
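As a hedged illustration of that idea (the members shown are assumptions beyond what the question includes; only the attribute usage is the point):
using System.Collections.Generic;
using System.Runtime.Serialization;
[DataContract]
public class QuickSearchResult
{
    [DataMember]
    public string Title { get; set; }          // assumed member, persisted in the cache
    [DataMember]
    public List<Region> Regions { get; set; }  // Region is assumed to be your existing type
    // No [DataMember]: the serializer skips this, so each deserialized copy starts
    // with a default Relevance instead of the value accumulated by earlier searches.
    public int Relevance { get; set; }
}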

Is there a more efficient way to randomise a set of LINQ results?

I've produced a function to get back a random set of submissions depending on the amount passed to it, but I worry that even though it works now with a small amount of data, when a large amount is passed through it will become inefficient and cause problems.
Is there a more efficient way of doing the following?
public List<Submission> GetRandomWinners(int id)
{
List<Submission> submissions = new List<Submission>();
int amount = (DbContext().Competitions
.Where(s => s.CompetitionId == id).FirstOrDefault()).NumberWinners;
for (int i = 1 ; i <= amount; i++)
{
bool added = false;
while (!added)
{
bool found = false;
var randSubmissions = DbContext().Submissions
.Where(s => s.CompetitionId == id && s.CorrectAnswer).ToList();
int count = randSubmissions.Count();
int index = new Random().Next(count);
foreach (var sub in submissions)
{
if (sub == randSubmissions.Skip(index).FirstOrDefault())
found = true;
}
if (!found)
{
submissions.Add(randSubmissions.Skip(index).FirstOrDefault());
added = true;
}
}
}
return submissions;
}
As I say, I have this fully working and bringing back the wanted result. It is just that I don't like the foreach and while checks in there, and my head has turned to mush trying to come up with a better solution.
(Please read all the way through, as there are different aspects of efficiency to consider.)
There are definitely simpler ways of doing this - and in particular, you really don't need to perform the query for correct answers repeatedly. Why are you fetching randSubmissions inside the loop? You should also look at ElementAt to avoid the Skip and FirstOrDefault - and bear in mind that as randSubmissions is a list, you can use normal list operations, like the Count property and the indexer!
The option which comes to mind first is to perform a partial shuffle. There are loads of examples on Stack Overflow of a modified Fisher-Yates shuffle. You can modify that code very easily to avoid shuffling the whole list - just shuffle it until you've got as many random elements as you need. In fact, these days I'd probably implement that shuffle slightly differently so you could just call:
return correctSubmissions.Shuffle(random).Take(amount).ToList();
For example:
public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source, Random rng)
{
T[] elements = source.ToArray();
for (int i = 0; i < elements.Length; i++)
{
// Find an item we haven't returned yet
int swapIndex = i + rng.Next(elements.Length - i);
T tmp = elements[i];
yield return elements[swapIndex];
elements[swapIndex] = tmp;
// Note that we don't need to copy the value into elements[i],
// as we'll never use that value again.
}
}
Given the above method, your GetRandomWinners method would look like this:
public List<Submission> GetRandomWinners(int competitionId, Random rng)
{
List<Submission> submissions = new List<Submission>();
int winnerCount = DbContext().Competitions
.Single(s => s.CompetitionId == competitionId)
.NumberWinners;
var correctEntries = DbContext().Submissions
.Where(s => s.CompetitionId == competitionId &&
s.CorrectAnswer)
.ToList();
return correctEntries.Shuffle(rng).Take(winnerCount).ToList();
}
I would advise against creating a new instance of Random in your method. I have an article on preferred ways of using Random which you may find useful.
One alternative you may want to consider is working out the count of the correct entries without fetching them all, then work out winning entries by computing a random selection of "row IDs" and then using ElementAt repeatedly (with a consistent order). Alternatively, instead of pulling the complete submissions, pull just their IDs. Shuffle the IDs to pick n random ones (which you put into a List<T>), then use something like:
return DbContext().Submissions
.Where(s => winningIds.Contains(s.Id))
.ToList();
I believe this will use an "IN" clause in the SQL, although there are limits as to how many entries can be retrieved like this.
That way even if you have 100,000 correct entries and 3 winners, you'll only fetch 100,000 IDs, but 3 complete records. Hope that makes sense!
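A sketch of that ID-based variant, reusing the Shuffle method above and assuming Submission exposes an Id property (the property name is a guess):
public List<Submission> GetRandomWinnersByIds(int competitionId, Random rng)
{
    int winnerCount = DbContext().Competitions
        .Single(c => c.CompetitionId == competitionId)
        .NumberWinners;
    // Pull only the IDs of correct entries, not the full records.
    var correctIds = DbContext().Submissions
        .Where(s => s.CompetitionId == competitionId && s.CorrectAnswer)
        .Select(s => s.Id)
        .ToList();
    // Shuffle the IDs and keep just the winners'.
    var winningIds = correctIds.Shuffle(rng).Take(winnerCount).ToList();
    // Fetch only the winning records; Contains typically translates to an IN clause.
    return DbContext().Submissions
        .Where(s => winningIds.Contains(s.Id))
        .ToList();
}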
