I am running the following search query to bring back results from Dynamics CRM. The search works fine, but it returns results ordered by relevance. We want to order them in descending order of the 'createdon' field. Since we display only 10 results per page, I can't sort the results returned by this query after the fact.
Is there any way to order based on a field?
public IEnumerable<SearchResult> Search(string term, int? pageNumber, int pageSize, out int totalHits, IEnumerable<string> logicalNames)
{
var searchProvider = SearchManager.Provider;
var query = new CrmEntityQuery(term, pageNumber.GetValueOrDefault(1), pageSize, logicalNames);
return GetSearchResults(out totalHits, searchProvider, query);
}
private IEnumerable<SearchResult> GetSearchResults(out int totalHits,
SearchProvider searchProvider, CrmEntityQuery query)
{
using (ICrmEntityIndexSearcher searcher = searchProvider.GetIndexSearcher())
{
Portal.StoreRequestItem("SearchDeduplicateListForAuthorisation", new List<Guid>());
var results = searcher.Search(query);
totalHits = results.ApproximateTotalHits;
return from x in results
select new SearchResult(x);
}
}
I've not used Lucene myself, so I can't comment on that.
However, if you were doing this in basic CRM, you would use a QueryExpression with an OrderExpression. Then, when you page the results, they are paged in order.
Here is an example of a QueryExpression, with an OrderExpression, and paging.
Page large result sets with QueryExpression
Presumably at some point the data is being pulled out of CRM, either within Lucene, or your own code, maybe in CrmEntityQuery? Then you can add the sort there.
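For reference, a minimal sketch of that approach against the CRM SDK - the entity and attribute names are illustrative, and service is assumed to be an IOrganizationService:
// Sketch only: order by createdon descending, then page 10 records at a time.
var query = new QueryExpression("contact")                   // illustrative entity
{
    ColumnSet = new ColumnSet("fullname", "createdon"),
    PageInfo = new PagingInfo { PageNumber = 1, Count = 10 }  // 10 results per page
};
query.AddOrder("createdon", OrderType.Descending);
EntityCollection firstPage = service.RetrieveMultiple(query);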
I have a large (60m+) document collection in which each ID has many records in a time series. Each record has an IMEI identifier, and I'm looking to select the most recent record for each IMEI in a given List<Imei>.
The brute-force method is what is currently happening: I loop over each IMEI, take the top-most record, and return the complete collection after the loop finishes. As such:
List<BsonDocument> documents = new List<BsonDocument>();
foreach(var config in imeiConfigs)
{
var filter = GetImeiFilter(config.IMEI);
var sort = GetImeiSort();
var data = _historyCollection.Find(filter).Sort(sort).Limit(1).FirstOrDefault();
documents.Add(data);
}
The end result is a List<BsonDocument> which contains the most recent BsonDocument for each IMEI, but it's not massively performant. If imeiConfigs is too large, the query takes a long time to run and return as the documents are rather large.
Is there a way to select the TOP 1 for each IMEI in a single query, as opposed to brute forcing like I am above?
Have you tried using the LINQ Take function?
List<BsonDocument> documents = new List<BsonDocument>();
foreach(var config in imeiConfigs)
{
var filter = GetImeiFilter(config.IMEI);
var sort = GetImeiSort();
var data = _historyCollection.Find(filter).Sort(sort).Take(1).FirstOrDefault();
documents.Add(data);
}
https://learn.microsoft.com/es-es/dotnet/api/system.linq.enumerable.take?view=netframework-4.8
I think the bad performance comes from "Sort(sort)", because the sorting forces it to go through the whole collection.
But perhaps you can improve the time performance by running the lookups in parallel:
List<BsonDocument> documents;
documents = imeiConfigs.AsParallel().Select((config) =>
{
var filter = GetImeiFilter(config.IMEI);
var sort = GetImeiSort();
var data = _historyCollection.Find(filter).Sort(sort).Limit(1).FirstOrDefault();
return data;
}).ToList();
I'd like to manage Azure Search documents (indexed items) with the Azure Search C# SDK.
What I'm trying to do is list documents from a query result (mostly the * result) continuously and edit their values.
Listing the query results looks like this:
public async Task<IEnumerable<MyIndexModel>> GetListAsync(string query, bool isNext = false)
{
if (string.IsNullOrEmpty(query)) query = "*";
DocumentSearchResult list;
if (!isNext)
{
list = await _indexClient.Documents.SearchAsync(query);
}
else
{
list = await _indexClient.Documents.ContinueSearchAsync(ContinuationToken);
}
ContinuationToken = list.ContinuationToken;
return list.Results.Select(o => o.Document.ToIndexModel());
}
One requirement is to jump to the n-th page of items. Since Azure Search does not provide paging, I'd like to know whether it returns an ordered list or not.
If we do not change the document count (i.e., do not index anything further), does Azure Search return an unchanged, ordered list, so that I could reach the same documents on the 80th page by running the ContinueSearchAsync() method 80 times?
Do I have to maintain another lookup table for my requirement?
* is a wildcard query. Documents matching a wildcard query are all given the same constant score in ranking, because there's no way to measure how close a document is to '*'. Furthermore, the order between documents with the same score is not guaranteed: a document matching '*' can be ranked 1st in one response and in 7th position in another, even when the same query is issued.
In order to get consistent ordering for wildcard queries, I suggest passing in an $orderby clause, for example search=*&$orderby=id asc. Azure Search does support paging, via $skip and $top. This document provides the guidance.
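As a hedged sketch against the same client the question uses (SearchParameters, OrderBy, Skip and Top are members of the Microsoft.Azure.Search SDK; the field name "id" and the page size of 50 are placeholders, and ToIndexModel() is the mapper from the question):
// Sketch only: server-side ordering and paging instead of walking continuation tokens.
var parameters = new SearchParameters
{
    OrderBy = new List<string> { "id asc" },  // a sortable field gives a stable order
    Skip = 79 * 50,                           // jump straight to the 80th page
    Top = 50                                  // page size
};
DocumentSearchResult page = await _indexClient.Documents.SearchAsync("*", parameters);
var models = page.Results.Select(r => r.Document.ToIndexModel());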
I know variants of this question have been asked before (even by me), but I still don't understand a thing or two about this...
It was my understanding that one could retrieve more documents than the 128 default setting by doing this:
session.Advanced.MaxNumberOfRequestsPerSession = int.MaxValue;
And I've learned that a WHERE clause should be an ExpressionTree instead of a Func, so that it's treated as Queryable instead of Enumerable. So I thought this should work:
public static List<T> GetObjectList<T>(Expression<Func<T, bool>> whereClause)
{
using (IDocumentSession session = GetRavenSession())
{
return session.Query<T>().Where(whereClause).ToList();
}
}
However, that only returns 128 documents. Why?
Note, here is the code that calls the above method:
RavenDataAccessComponent.GetObjectList<Ccm>(x => x.TimeStamp > lastReadTime);
If I add Take(n), then I can get as many documents as I like. For example, this returns 200 documents:
return session.Query<T>().Where(whereClause).Take(200).ToList();
Based on all of this, it would seem that the appropriate way to retrieve thousands of documents is to set MaxNumberOfRequestsPerSession and use Take() in the query. Is that right? If not, how should it be done?
For my app, I need to retrieve thousands of documents (that have very little data in them). We keep these documents in memory and use them as the data source for charts.
** EDIT **
I tried using int.MaxValue in my Take():
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
And that returns 1024. Argh. How do I get more than 1024?
** EDIT 2 - Sample document showing data **
{
"Header_ID": 3525880,
"Sub_ID": "120403261139",
"TimeStamp": "2012-04-05T15:14:13.9870000",
"Equipment_ID": "PBG11A-CCM",
"AverageAbsorber1": "284.451",
"AverageAbsorber2": "108.442",
"AverageAbsorber3": "886.523",
"AverageAbsorber4": "176.773"
}
It is worth noting that since version 2.5, RavenDB has an "unbounded results API" to allow streaming. The example from the docs shows how to use this:
var query = session.Query<User>("Users/ByActive").Where(x => x.Active);
using (var enumerator = session.Advanced.Stream(query))
{
while (enumerator.MoveNext())
{
User activeUser = enumerator.Current.Document;
}
}
There is support for standard RavenDB queries and Lucene queries, and there is also async support.
The documentation can be found here. Ayende's introductory blog article can be found here.
The Take(n) function will only give you up to 1024 by default. However, you can change this default in Raven.Server.exe.config:
<add key="Raven/MaxPageSize" value="5000"/>
For more info, see: http://ravendb.net/docs/intro/safe-by-default
The Take(n) function will only give you up to 1024 by default. However, you can use it together with Skip(n) to get everything:
var points = new List<T>();
var nextGroupOfPoints = new List<T>();
const int ElementTakeCount = 1024;
int i = 0;
int skipResults = 0;
RavenQueryStatistics stats; // populated by .Statistics(out stats) in the query below
do
{
nextGroupOfPoints = session.Query<T>().Statistics(out stats).Where(whereClause).Skip(i * ElementTakeCount + skipResults).Take(ElementTakeCount).ToList();
i++;
skipResults += stats.SkippedResults;
points = points.Concat(nextGroupOfPoints).ToList();
}
while (nextGroupOfPoints.Count == ElementTakeCount);
return points;
RavenDB Paging
The number of requests per session is a separate concept from the number of documents retrieved per call. Sessions are short-lived and are expected to have few calls issued over them.
If you are getting more than 10 of anything from the store (even fewer than the default 128) for human consumption, then something is wrong, or your problem requires different thinking than hauling a truckload of documents out of the data store.
RavenDB indexing is quite sophisticated. Good article about indexing here and facets here.
If you need to perform data aggregation, create a map/reduce index that produces the aggregated data, e.g.:
Index:
// Map
from post in docs.Posts
select new { post.Author, Count = 1 }
// Reduce
from result in results
group result by result.Author into g
select new
{
    Author = g.Key,
    Count = g.Sum(x => x.Count)
}
Query:
session.Query<AuthorPostStats>("Posts/ByUser/Count").Where(x => x.Author == ...);
You can also use a predefined index with the Stream method. You may use a Where clause on indexed fields.
var query = session.Query<User, MyUserIndex>();
// or, with a Where clause on an indexed field:
var query = session.Query<User, MyUserIndex>().Where(x => !x.IsDeleted);
using (var enumerator = session.Advanced.Stream<User>(query))
{
while (enumerator.MoveNext())
{
var user = enumerator.Current.Document;
// do something
}
}
Example index:
public class MyUserIndex: AbstractIndexCreationTask<User>
{
public MyUserIndex()
{
this.Map = users =>
from u in users
select new
{
u.IsDeleted,
u.Username,
};
}
}
Documentation: What are indexes?
Session : Querying : How to stream query results?
Important note: the Stream method will NOT track objects. If you change objects obtained from this method, SaveChanges() will not be aware of any change.
Other note: you may get the following exception if you do not specify the index to use.
InvalidOperationException: StreamQuery does not support querying dynamic indexes. It is designed to be used with large data-sets and is unlikely to return all data-set after 15 sec of indexing, like Query() does.
I have a complex Entity Framework query. My performance bottleneck is not actually querying the database, but translating the IQueryable into query text.
My code is something like this:
var query = context.Hands.Where(...);
if (x)
    query = query.Where(...);
....
var result = query.OrderBy(...);
var page = result.Skip(500 * pageNumber).Take(500).ToList(); // loong time here, even before calling the DB
do
{
    foreach (var h in page) { ... }
    pageNumber += 1;
    page = result.Skip(500 * pageNumber).Take(500).ToList(); // same here
}
while (y);
What can I do? I am using DbContext (with SQLite), so I can't use precompiled queries (and even then, it would be cumbersome with a query-building algorithm like this).
What I basically need is to cache a "page" query and only change the "skip" and "take" parameters, without recompiling it from the ground up each time.
Your premise is incorrect. Because you have a ToList call at the end of your query, you are querying the database at the point you've indicated in order to construct the list; you're not deferring execution any longer. That's why it takes so long: you aren't spending a long time constructing the query, you're spending a long time going to the database and actually executing it.
If it helps you can use the following method to do the pagination for you. It will defer fetching each page until you ask for the next one:
public static IEnumerable<IEnumerable<T>> Paginate<T>(
this IQueryable<T> query, int pagesize)
{
int pageNumber = 0;
var page = query.Take(pagesize).ToList();
while (page.Any())
{
yield return page;
pageNumber++;
page = query.Skip(pageNumber * pagesize)
.Take(pagesize)
.ToList();
}
}
So if you had this code:
var result = query.OrderBy(...);
var pages = result.Paginate(500); // still haven't hit the database
//each iteration of this loop will go query the DB once to get that page
foreach(var page in pages)
{
//use page
}
If you want to get an IEnumerable<IQueryable<T>> in which you have all of the pages as queries (meaning you could add further filters to them before sending them to the database) then the major problem you have is that you don't know how many pages there will be. You need to actually execute a given query to know if it's the last page or not. You either need to fetch each page as you go, as this code does, or you need to query the count of the un-paged query at the start (which means one more DB query than you would otherwise need). Doing that would look like:
public static IEnumerable<IQueryable<T>> Paginate<T>(
this IQueryable<T> query, int pagesize)
{
//note that this is hitting the DB
int numPages = (int)Math.Ceiling(query.Count() / (double)pagesize);
for (int i = 0; i < numPages; i++)
{
var page = query.Skip(i * pagesize)
.Take(pagesize);
yield return page;
}
}
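A hedged usage sketch of this IQueryable<T> variant, showing that each page can still be refined before it executes; the Hand type is from the question, while the PlayerCount filter is purely illustrative:
// 'result' is the ordered IQueryable from the question (query.OrderBy(...)).
// Count() runs once up front; each page then hits the DB only when materialized.
foreach (var page in result.Paginate(500))
{
    // 'page' is still an IQueryable<Hand>, so this extra Where is composed
    // into the SQL for that page before ToList() executes it.
    var filtered = page.Where(h => h.PlayerCount > 2).ToList();
}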
I am working on an application in which I would like to implement paging. I have the following class that extends DetachedCriteria:
public class PagedData : DetachedCriteria
{
public PagedData(int pageIndex, int pageSize) : base(typeof(mytype))
{
AddOrder(Order.Asc("myId"));
var subquery = DetachedCriteria.For(typeof(mytype2))
.SetProjection(Projections.Property("mytype.myId"));
Add(Subqueries.PropertyIn("myId", subquery));
SetFirstResult((pageIndex - 1) * pageSize);
SetMaxResults(pageSize);
}
}
This works fine - it returns exactly the data that I am trying to retrieve. The problem I am running into is getting the total row count for my page navigation. Since I am using SetFirstResult and SetMaxResults in my detached criteria, the row count is always limited to the pageSize variable that is coming in.
My question is this: how can I get the total row count? Should I just create another DetachedCriteria to calculate the row count? If so, will that add round trips to the DB? Would I be better off not using DetachedCriteria and using a straight criteria query, in which I could then utilize futures? Or can I somehow use futures with what I am currently doing?
Please let me know if any further information is needed.
Thanks
I do it like this, inside my class which is used for paged criteria access:
// In order to be able to determine the NumberOfItems in an efficient manner,
// we'll clone the Criteria that has been given, and use a Projection so that
// NHibernate will issue a SELECT COUNT(*) against the ICriteria.
ICriteria countQuery =
CriteriaTransformer.TransformToRowCount (_criteria);
NumberOfItems = countQuery.UniqueResult<int> ();
Where NumberOfItems is a property (with a private setter) inside my 'PagedCriteriaResults' class.
The PagedCriteriaResults class takes an ICriteria instance in its constructor.
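For context, a rough sketch of what such a class might look like, assuming the behaviour described above; the constructor parameters, the Items property and the generic parameter are guesses, since the original class isn't shown:
public class PagedCriteriaResults<T>
{
    public int NumberOfItems { get; private set; }
    public IList<T> Items { get; private set; }

    public PagedCriteriaResults(ICriteria criteria, int pageIndex, int pageSize)
    {
        // Clone the criteria into a SELECT COUNT(*) so the total row count
        // can be read without fetching every row.
        ICriteria countQuery = CriteriaTransformer.TransformToRowCount(criteria);
        NumberOfItems = countQuery.UniqueResult<int>();

        // Then fetch just the requested page.
        Items = criteria.SetFirstResult(pageIndex * pageSize)
                        .SetMaxResults(pageSize)
                        .List<T>();
    }
}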
You can create a second DetachedCriteria to get the row count with the built-in CriteriaTransformer:
DetachedCriteria countSubquery = NHibernate.CriteriaTransformer.TransformToRowCount(subquery);
This will, of course, result in a second call to the DB.
Discussed here:
How can you do paging with NHibernate?
Drawing on the two answers above, I created this method for paged searching using detached criteria.
Basically I just take an ordinary detached criteria and, after I've created the real ICriteria from the session, I transform it into a row-count criteria and then use Future on both of them. Works great!
public PagedResult<T> SearchPaged<T>(PagedQuery query)
{
try
{
//the PagedQuery object is just a holder for a detached criteria and the paging variables
ICriteria crit = query.Query.GetExecutableCriteria(_session);
crit.SetMaxResults(query.PageSize);
crit.SetFirstResult(query.PageSize * (query.Page - 1));
var data = crit.Future<T>();
ICriteria countQuery = CriteriaTransformer.TransformToRowCount(crit);
var rowcount = countQuery.FutureValue<Int32>();
IList<T> list = new List<T>();
foreach (T t in data)
{
list.Add(t);
}
PagedResult<T> res = new PagedResult<T>();
res.Page = query.Page;
res.PageSize = query.PageSize;
res.TotalRowCount = rowcount.Value;
res.Result = list;
return res;
}
catch (Exception ex)
{
_log.Error("error", ex);
throw; // rethrow, preserving the original stack trace
}
}