Grouping on Lucene.Net - C#

public static async Task<List<string>> SearchGroup(string filedName, Query bq, Filter fil, IndexSearcher searcher)
{
    // maximum number of groups to return
    int groupNum = 100;
    return await Task.Run(() =>
    {
        GroupingSearch groupingSearch = new GroupingSearch(filedName);
        groupingSearch.SetCachingInMB(8192, cacheScores: true);
        // no group sort - sorting just wastes performance here
        //groupingSearch.SetGroupSort(new Sort(new SortField("name", SortFieldType.STRING)));
        groupingSearch.SetGroupDocsLimit(groupNum);
        ITopGroups<BytesRef> topGroups = groupingSearch.SearchByField(searcher, fil, bq, groupOffset: 0, groupNum);
        List<string> groups = new List<string>();
        foreach (var groupDocs in topGroups.Groups.Take(groupNum))
        {
            if (groupDocs.GroupValue != null)
            {
                groups.Add(groupDocs.GroupValue.Utf8ToString());
            }
        }
        return groups;
    });
}
Here is my current code for grouping, but it has performance problems: each call takes as long as a full query, so grouping on several fields at once is very time-consuming. Is there any way to improve the speed?
There will also be multiple filter criteria, which makes it even slower.
I'm hoping for fast grouping results, or a way to group several fields at the same time.

You have to add a cache - group values usually do not change very frequently. So it is better to pre-compute and cache the data before executing the search request, and then reuse that information for every subsequent query over the same data. To implement this it looks like you have to customize the Lucene engine (the searcher).
In my case I am using faceted filters instead of groups. The difference is that groups show sub-content as a tree view (under each group), while filters show all data as one set that can be narrowed down easily and quickly.
Filters are not the answer to every problem, and there is no information about the task you are trying to solve with grouping, so I can't answer more precisely.
So cache as much as possible - that is the answer for 99% of such issues. Changing strategy to facet filters is another way to improve the situation.
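To make the caching idea concrete, here is a minimal sketch (my own illustration, not the original poster's code) that memoizes SearchGroup results per field/query pair with MemoryCache; the cache key scheme and the 10-minute expiry are assumptions:

using System.Runtime.Caching;

// Hypothetical wrapper: serve group values from an in-process cache and only
// fall back to the expensive Lucene grouping call on a cache miss.
public static async Task<List<string>> SearchGroupCached(
    string fieldName, Query bq, Filter fil, IndexSearcher searcher)
{
    // assumption: Query.ToString() identifies the query well enough for a key
    string cacheKey = $"groups:{fieldName}:{bq}";
    if (MemoryCache.Default.Get(cacheKey) is List<string> cached)
    {
        return cached;
    }

    List<string> groups = await SearchGroup(fieldName, bq, fil, searcher);
    MemoryCache.Default.Set(cacheKey, groups, DateTimeOffset.Now.AddMinutes(10));
    return groups;
}

With the cache in place, grouping on several fields can also be fanned out with Task.WhenAll, so a cold multi-field call costs roughly one query per field and a warm one costs almost nothing.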

Related

Reactive - how to combine / join / look up items with two sequences

I am connecting to a web service that gives me all prices for a day (without time info). Each of those price results has the id for a corresponding "batch run".
The "batch run" has a date+time stamp, but I have to make a separate call to get all the batch info for the day.
Hence, to get the actual time of each result, I need to combine the two API calls.
I'm using Reactive for this, but I can't reliably combine the two sets of data. I thought that CombineLatest would do it, but it doesn't seem to work as I thought (based on http://reactivex.io/documentation/operators/combinelatest.html, http://introtorx.com/Content/v1.0.10621.0/12_CombiningSequences.html#CombineLatest).
[TestMethod]
public async Task EvenMoreBasicCombineLatestTest()
{
    int batchStart = 100, batchCount = 10;

    //create 10 results with batch ids [100, 109]
    //the test uses lists just to make debugging easier
    var resultsWithBatchIdList = Enumerable.Range(batchStart, batchCount)
        .Select(id => new { BatchRunId = id, ResultValue = id * 10 })
        .ToList();
    var resultsWithBatchId = Observable.ToObservable(resultsWithBatchIdList);
    Assert.AreEqual(batchCount, await resultsWithBatchId.Count());

    //create 10 batches with ids [100, 109]
    var batchesList = Enumerable.Range(batchStart, batchCount)
        .Select(id => new
        {
            ThisId = id,
            BatchName = String.Concat("abcd", id)
        })
        .ToList();
    var batchesObservable = Observable.ToObservable(batchesList);
    Assert.AreEqual(batchCount, await batchesObservable.Count());

    //turn the batch set into a dictionary so we can look up each batch by its id
    var batchRunsByIdObservable = batchesObservable.ToDictionary(batch => batch.ThisId);

    //for each result, look up the corresponding batch id in the dictionary to join them together
    var resultsWithCorrespondingBatch =
        batchRunsByIdObservable
            .CombineLatest(resultsWithBatchId, (batchRunsById, result) =>
            {
                Assert.AreEqual(batchCount, batchRunsById.Count);
                var correspondingBatch = batchRunsById[result.BatchRunId];
                var priceResultAndSourceBatch = new
                {
                    Result = result,
                    SourceBatchRun = correspondingBatch
                };
                return priceResultAndSourceBatch;
            });

    Assert.AreEqual(batchCount, await resultsWithCorrespondingBatch.Count());
}
I would expect as each element of the 'results' observable comes through, it would get combined with each element of the batch-id dictionary observable (which only ever has one element). But instead, it looks like only the last element of the result list gets joined.
I have a more complex problem deriving from this but while trying to create a minimum repro, even this is giving me unexpected results. This happens with version 3.1.1, 4.0.0, 4.2.0, etc.
(Note that the sequences don't generally match up as in this artificial example, so I can't just Zip them.)
So how can I do this join? A stream of results that I want to look up more info via a Dictionary (which also is coming from an Observable)?
Also note that the goal is to return the IObservable (resultsWithCorrespondingBatch), so I can't just await the batchRunsByIdObservable.
Ok I think I figured it out. I wish either of the two marble diagrams in the documentation had been just slightly different -- it would have made a subtlety of CombineLatest much more obvious:
N------1---2---3---
L--z--a------bc----
R------1---2-223---
       a   a bcc
It's combine latest -- so depending on when items get emitted, it's possible to miss some tuples. What I should have done is SelectMany:
NO: .CombineLatest(resultsWithBatchId, (batchRunsById, result) =>
YES: .SelectMany(batchRunsById => resultsWithBatchId.Select(result =>
Note that the "join" order is important: A.SelectMany(B) vs B.SelectMany(A) -- if A has 1 item and B has 100 items, the latter would result in 100 calls to subscribe to A.
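Put together, the corrected join from the test above would look something like this (a sketch using the asker's variable names):

// Wait for the (single) dictionary element, then pair every result with its
// batch via key lookup instead of racing the two streams through CombineLatest.
var resultsWithCorrespondingBatch =
    batchRunsByIdObservable
        .SelectMany(batchRunsById =>
            resultsWithBatchId.Select(result => new
            {
                Result = result,
                SourceBatchRun = batchRunsById[result.BatchRunId]
            }));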

Find TOP (1) for each ID in array

I have a large (60m+) document collection, whereby each ID has many records in time series. Each record has an IMEI identifier, and I'm looking to select the most recent record for each IMEI in a given List<Imei>.
The brute-force method is what currently happens: I loop over each IMEI, yield out the topmost record, and return the complete collection after the loop finishes. As such:
List<BsonDocument> documents = new List<BsonDocument>();
foreach (var config in imeiConfigs)
{
    var filter = GetImeiFilter(config.IMEI);
    var sort = GetImeiSort();
    var data = _historyCollection.Find(filter).Sort(sort).Limit(1).FirstOrDefault();
    documents.Add(data);
}
The end result is a List<BsonDocument> which contains the most recent BsonDocument for each IMEI, but it's not massively performant. If imeiConfigs is too large, the query takes a long time to run and return as the documents are rather large.
Is there a way to select the TOP 1 for each IMEI in a single query, as opposed to brute forcing like I am above?
Have you tried using the LINQ Take function?
List<BsonDocument> documents = new List<BsonDocument>();
foreach (var config in imeiConfigs)
{
    var filter = GetImeiFilter(config.IMEI);
    var sort = GetImeiSort();
    var data = _historyCollection.Find(filter).Sort(sort).Take(1).FirstOrDefault();
    documents.Add(data);
}
https://learn.microsoft.com/es-es/dotnet/api/system.linq.enumerable.take?view=netframework-4.8
I think the bad performance comes from Sort(sort), because the sorting forces it to go through the whole collection.
But perhaps you can improve the time with parallelism:
List<BsonDocument> documents;
documents = imeiConfigs.AsParallel().Select(config =>
{
    var filter = GetImeiFilter(config.IMEI);
    var sort = GetImeiSort();
    var data = _historyCollection.Find(filter).Sort(sort).Limit(1).FirstOrDefault();
    return data;
}).ToList();
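For what it's worth, the single-query shape the question asks for is usually an aggregation pipeline: $match the IMEIs, $sort newest-first, then $group taking the $first document per IMEI. A minimal sketch with the C# driver, where the field names "IMEI" and "Timestamp" are assumptions about the document schema:

var imeis = imeiConfigs.Select(c => c.IMEI).ToList();

var grouped = _historyCollection.Aggregate()
    .Match(Builders<BsonDocument>.Filter.In("IMEI", imeis))      // only the requested IMEIs
    .Sort(Builders<BsonDocument>.Sort.Descending("Timestamp"))   // newest first
    .Group(new BsonDocument
    {
        { "_id", "$IMEI" },
        { "latest", new BsonDocument("$first", "$$ROOT") }       // first doc per IMEI = most recent
    })
    .ToList();

List<BsonDocument> documents = grouped
    .Select(g => g["latest"].AsBsonDocument)
    .ToList();

A compound index on { IMEI: 1, Timestamp: -1 } would let the server satisfy the sort without scanning the whole collection.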

Does Azure Search result guarantee order for * query?

I'd like to manage AzSearch documents (indexed items) via the AzSearch C# SDK.
What I am trying to do is list documents from query results (mostly * results) continuously and edit their values.
Listing the query results looks like this:
public async Task<IEnumerable<MyIndexModel>> GetListAsync(string query, bool isNext = false)
{
    if (string.IsNullOrEmpty(query)) query = "*";

    DocumentSearchResult list;
    if (!isNext)
    {
        list = await _indexClient.Documents.SearchAsync(query);
    }
    else
    {
        list = await _indexClient.Documents.ContinueSearchAsync(ContinuationToken);
    }
    ContinuationToken = list.ContinuationToken;
    return list.Results.Select(o => o.Document.ToIndexModel());
}
One requirement is to be able to jump to the n-th page of items. Since AzSearch does not seem to provide paging, I'd like to know whether it returns an ordered list or not.
If we do not change the document count (no further indexing), does AzSearch return an unchanged/ordered list, so that I could reach the 80th page by running ContinueSearchAsync() 80 times and always get the same documents?
Or do I have to maintain a separate lookup table for this requirement?
* is a wildcard query. Documents matching a wildcard query are given the same constant score in ranking, because there is no way to measure how close a document is to *. Furthermore, the order between documents with the same score is not guaranteed: a document matching * can be ranked 1st in one response and 7th in another, even when the same query was issued.
In order to get consistent ordering for wildcard queries, I suggest passing an $orderby clause, for example search=*&$orderby=id asc. Azure Search does support paging, via $skip and $top. This document provides the guidance.
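To illustrate, with the older Microsoft.Azure.Search SDK that the question's code appears to use, stable paging might look like the following sketch; the sortable key field "id" and the pageIndex/pageSize variables are assumptions:

// Deterministic paging: an explicit sort order plus Skip/Top lets you jump
// straight to the n-th page instead of replaying continuation tokens.
var parameters = new SearchParameters
{
    OrderBy = new List<string> { "id asc" }, // "id" must be a sortable field
    Skip = pageIndex * pageSize,
    Top = pageSize
};

DocumentSearchResult page = await _indexClient.Documents.SearchAsync("*", parameters);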

Query method should return Cursor or List? (MongoDB C#)

I have a task in which I need to query a large amount of data. I created a method for the queries:
public List<T> Query(FilterDefinition<T> filter, SortDefinition<T> sort, int limit)
{
    var query = Collection.Find(filter).Sort(sort).Limit(limit);
    var result = query.ToList();
    return result;
}
In the main method:
List<Cell> cells = MyDatabaseService.Query(filter, sort, 100000);
This List will contain 100,000 values, which is quite large.
On the other hand I can also use:
public async Task<IAsyncCursor<T>> QueryAsync(FilterDefinition<T> filter, SortDefinition<T> sort, int limit)
{
    FindOptions<T> options = new FindOptions<T> { Sort = sort, Limit = limit };
    var queryCursor = await Collection.FindAsync(filter, options);
    return queryCursor;
}
In the main method, I then use a while loop to iterate the cursor:
IAsyncCursor<Cell> cursor = await MyDatabaseService.QueryAsync(filter, sort, 100000);
while (await cursor.MoveNextAsync())
{
    var batch = cursor.Current;
    foreach (var document in batch)
    {
        // process document
    }
}
So considering I have a lot of data to query, is it a good idea to use the second implementation? Thanks for any reply.
It really depends what you are planning to do with the documents once you've retrieved them from the server.
If you need to perform an operation that requires all 100,000 documents to be in the program's memory then the two methods will essentially do the same thing.
On the other hand, if you are using the returned documents one by one, the second method is better: the first essentially processes every document twice (once to retrieve it along with all the other documents and once to act on it), while the second processes each document once (retrieve and act immediately).
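For the one-by-one style, the driver also has a convenience wrapper over the manual MoveNextAsync loop; a small sketch, where ProcessDocument stands in for whatever per-document work you do:

// Stream documents through a callback batch by batch, without ever
// materializing the full 100,000-element list in memory.
IAsyncCursor<Cell> cursor = await MyDatabaseService.QueryAsync(filter, sort, 100000);
await cursor.ForEachAsync(document => ProcessDocument(document));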

How to retrieve more than 4000 records from RavenDB in a single session [duplicate]

I know variants of this question have been asked before (even by me), but I still don't understand a thing or two about this...
It was my understanding that one could retrieve more documents than the 128 default setting by doing this:
session.Advanced.MaxNumberOfRequestsPerSession = int.MaxValue;
And I've learned that a WHERE clause should be an ExpressionTree instead of a Func, so that it's treated as Queryable instead of Enumerable. So I thought this should work:
public static List<T> GetObjectList<T>(Expression<Func<T, bool>> whereClause)
{
    using (IDocumentSession session = GetRavenSession())
    {
        return session.Query<T>().Where(whereClause).ToList();
    }
}
However, that only returns 128 documents. Why?
Note, here is the code that calls the above method:
RavenDataAccessComponent.GetObjectList<Ccm>(x => x.TimeStamp > lastReadTime);
If I add Take(n), then I can get as many documents as I like. For example, this returns 200 documents:
return session.Query<T>().Where(whereClause).Take(200).ToList();
Based on all of this, it would seem that the appropriate way to retrieve thousands of documents is to set MaxNumberOfRequestsPerSession and use Take() in the query. Is that right? If not, how should it be done?
For my app, I need to retrieve thousands of documents (that have very little data in them). We keep these documents in memory and use them as the data source for charts.
** EDIT **
I tried using int.MaxValue in my Take():
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
And that returns 1024. Argh. How do I get more than 1024?
** EDIT 2 - Sample document showing data **
{
    "Header_ID": 3525880,
    "Sub_ID": "120403261139",
    "TimeStamp": "2012-04-05T15:14:13.9870000",
    "Equipment_ID": "PBG11A-CCM",
    "AverageAbsorber1": "284.451",
    "AverageAbsorber2": "108.442",
    "AverageAbsorber3": "886.523",
    "AverageAbsorber4": "176.773"
}
It is worth noting that since version 2.5, RavenDB has an "unbounded results API" to allow streaming. The example from the docs shows how to use this:
var query = session.Query<User>("Users/ByActive").Where(x => x.Active);
using (var enumerator = session.Advanced.Stream(query))
{
    while (enumerator.MoveNext())
    {
        User activeUser = enumerator.Current.Document;
    }
}
There is support for standard RavenDB queries and Lucene queries, and there is also async support.
The documentation can be found here. Ayende's introductory blog article can be found here.
The Take(n) function will only give you up to 1024 by default. However, you can change this default in Raven.Server.exe.config:
<add key="Raven/MaxPageSize" value="5000"/>
For more info, see: http://ravendb.net/docs/intro/safe-by-default
The Take(n) function will only give you up to 1024 results by default. However, you can use it together with Skip(n) to get everything:
RavenQueryStatistics stats;
var points = new List<T>();
var nextGroupOfPoints = new List<T>();
const int ElementTakeCount = 1024;
int i = 0;
int skipResults = 0;
do
{
    nextGroupOfPoints = session.Query<T>().Statistics(out stats)
        .Where(whereClause)
        .Skip(i * ElementTakeCount + skipResults)
        .Take(ElementTakeCount)
        .ToList();
    i++;
    skipResults += stats.SkippedResults;
    points = points.Concat(nextGroupOfPoints).ToList();
}
while (nextGroupOfPoints.Count == ElementTakeCount);
return points;
RavenDB Paging
The number of requests per session is a separate concept from the number of documents retrieved per call. Sessions are short-lived and are expected to have only a few calls issued over them.
If you are getting more than 10 of anything from the store (let alone the default 128) for human consumption, then something is wrong, or your problem requires different thinking than a truckload of documents coming from the data store.
RavenDB indexing is quite sophisticated. Good article about indexing here and facets here.
If you need to perform data aggregation, create a map/reduce index which produces the aggregated data, e.g.:
Index:
// map
from post in docs.Posts
select new { post.Author, Count = 1 }

// reduce
from result in results
group result by result.Author into g
select new
{
    Author = g.Key,
    Count = g.Sum(x => x.Count)
}
Query:
session.Query<AuthorPostStats>("Posts/ByUser/Count")
    .Where(x => x.Author == "some-author")
    .ToList();
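For completeness, the same map/reduce index can be declared from client code; a hedged sketch using AbstractIndexCreationTask, where the Post and AuthorPostStats shapes are assumed from the snippets above:

// Map/reduce index: counts posts per author on the server, so a query returns
// a handful of aggregated AuthorPostStats rows instead of truckloads of posts.
public class Posts_ByUser_Count : AbstractIndexCreationTask<Post, AuthorPostStats>
{
    public Posts_ByUser_Count()
    {
        Map = posts => from post in posts
                       select new { Author = post.Author, Count = 1 };

        Reduce = results => from result in results
                            group result by result.Author into g
                            select new
                            {
                                Author = g.Key,
                                Count = g.Sum(x => x.Count)
                            };
    }
}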
You can also use a predefined index with the Stream method. You may use a Where clause on indexed fields.
var query = session.Query<User, MyUserIndex>();
// or, with a Where clause on an indexed field:
var query = session.Query<User, MyUserIndex>().Where(x => !x.IsDeleted);

using (var enumerator = session.Advanced.Stream<User>(query))
{
    while (enumerator.MoveNext())
    {
        var user = enumerator.Current.Document;
        // do something
    }
}
Example index:
public class MyUserIndex : AbstractIndexCreationTask<User>
{
    public MyUserIndex()
    {
        this.Map = users =>
            from u in users
            select new
            {
                u.IsDeleted,
                u.Username,
            };
    }
}
Documentation: What are indexes?
Session : Querying : How to stream query results?
Important note: the Stream method will NOT track objects. If you change objects obtained from this method, SaveChanges() will not be aware of any change.
Other note: you may get the following exception if you do not specify the index to use.
InvalidOperationException: StreamQuery does not support querying dynamic indexes. It is designed to be used with large data-sets and is unlikely to return all data-set after 15 sec of indexing, like Query() does.
