Minimum scores in Lucene.net/Lucene? - c#

Is it possible to set a minimum score for which to return results in Lucene?
I have this function:
public Tuple<int,ICollection<Guid>> Search(string searchQuery,int maxResults)
{
var booleanQuery = new BooleanQuery();
var s1 = new TermQuery(new Term("companyName", searchQuery));
booleanQuery.Add(s1, Occur.SHOULD);
using (var searcher = new IndexSearcher(this.Directory))
{
TopDocs hits = searcher.Search(booleanQuery, maxResults);
var ids = new List<Guid>();
for (int i = 0; i < hits.ScoreDocs.Count(); i++)
{
var idString = searcher.Doc(hits.ScoreDocs[i].Doc).Get("id");
ids.Add(new Guid(idString));
}
return new Tuple<int, ICollection<Guid>>(hits.TotalHits, ids);
}
}
The function searches my index and returns the IDs of the companies that match the searchQuery, along with the total number of companies that matched the search - so I can write 'Showing 1-20 of 245 matching companies'.
My problem is that the threshold for a match is very low. If the user enters "accountant" the search returns meaningful results, but if they enter "adasdfsdf" it returns results that are not relevant. I would rather display a message like "Sorry, no companies match your query" if the results are not relevant enough.
Is it possible to set a minimum score for the matches? Will the TopDocs.TotalHits property respect this score?

In short, no. You can't really create a minimum score cutoff point in Lucene. Here is one discussion of why not. Note that the cases discussed there are a bit different from what you're asking for, but the difficulties are much the same (in fact, finding a reasonable cut-off point that works across different, independent queries introduces greater, though closely related, difficulties).
The better way to address this is to design your queries such that you don't get irrelevant results. In your example, I don't really see why you would see a lot of irrelevant results coming up, so I'll assume there are other terms being added to the query. In that case, if you only want to get those documents for which new Term("companyName", searchQuery) is a match, you should add it with the Occur.MUST booleanClause, like:
var booleanQuery = new BooleanQuery();
var s1 = new TermQuery(new Term("companyName", searchQuery));
booleanQuery.Add(s1, Occur.MUST);
To explain further, the Occur.MUST and Occur.SHOULD are your problem there. If you have a query like:
category:type1 companyName:asdfdas
And have no results on companyName, then you would just see the results for the query category:type1. If you did have a match on companyName, those results would be judged to have much higher relevance, and would be displayed first, but it would still bring up everything that matched the category as well, just lower on the list. Both terms, in that example, are added with the BooleanClause.Occur.SHOULD, and so both are optional (although at least one matching term must still be found in any result).
If you wish to display only those documents that match both the category and the companyName, you should make both of them required terms in your query by using BooleanClause.Occur.MUST. Using the query syntax, this would look like:
+category:type1 +companyName:asdfdas
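Or, building the BooleanQuery in code:
var booleanQuery = new BooleanQuery();
// Both clauses are required, so a document must match companyName AND category.
var s1 = new TermQuery(new Term("companyName", "asdfdas"));
booleanQuery.Add(s1, Occur.MUST);
var s2 = new TermQuery(new Term("category", "type1"));
booleanQuery.Add(s2, Occur.MUST);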

Related

Grouping on Lucene.net

public static async Task<List<string>> SearchGroup(string filedName, Query bq, Filter fil, IndexSearcher searcher)
{
// Maximum number of groups
int groupNum = 100;
return await Task.Run(() =>
{
GroupingSearch groupingSearch = new GroupingSearch(filedName);
groupingSearch.SetCachingInMB(8192, cacheScores: true);
// Skip sorting; it only costs performance
//groupingSearch.SetGroupSort(new Sort(new SortField("name", SortFieldType.STRING)));
groupingSearch.SetGroupDocsLimit(groupNum);
ITopGroups<BytesRef> topGroups = groupingSearch.SearchByField(searcher, fil, bq, groupOffset: 0, groupNum);
List<string> groups = new List<string>();
foreach (var groupDocs in topGroups.Groups.Take(groupNum))
{
if (groupDocs.GroupValue != null)
{
groups.Add(groupDocs.GroupValue.Utf8ToString());
}
}
return groups;
});
}
Here is my current code for grouping, but it has performance issues. Each grouping call takes as long as a full query, so grouping on several fields at once is very slow. Is there any way to improve the speed?
There are several filter fields to group on, and running them one after another takes too long.
I am hoping for fast grouping results, or a way to group on all the fields at the same time.
You need to add caching: group values usually don't change very often.
So it is better to pre-compute and cache the group data before executing the search request,
and then reuse that cached data for every subsequent query that needs the same information.
To build this, it looks like you will have to customize the Lucene engine (the searcher).
In my case I use faceted filters instead of groups. The difference is that groups show their sub-content as a tree view under each group, whereas with filters all the data is shown as one set that can be filtered easily and quickly.
Filters are not the answer to every problem. You haven't said what task you are trying to solve with grouping, so I can't answer more precisely.
So caching as much as possible is the answer for 99% of such issues. Switching strategy to facet filters is probably another way to improve the situation.
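A minimal sketch of the caching idea, assuming group values can safely be reused for a fixed period. `GroupValueCache`, `loadGroups` and `timeToLive` are hypothetical names, not part of Lucene.NET; `loadGroups` stands in for whatever performs the real grouping call (for example, a wrapper around the SearchGroup method from the question).

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical helper: remembers the group values per field name for a while.
public sealed class GroupValueCache
{
    private readonly ConcurrentDictionary<string, (DateTime LoadedAt, IReadOnlyList<string> Values)> _entries =
        new ConcurrentDictionary<string, (DateTime LoadedAt, IReadOnlyList<string> Values)>();
    private readonly Func<string, IReadOnlyList<string>> _loadGroups;
    private readonly TimeSpan _timeToLive;

    public GroupValueCache(Func<string, IReadOnlyList<string>> loadGroups, TimeSpan timeToLive)
    {
        _loadGroups = loadGroups;
        _timeToLive = timeToLive;
    }

    public IReadOnlyList<string> Get(string fieldName)
    {
        var now = DateTime.UtcNow;

        // Serve the cached values while they are still fresh.
        if (_entries.TryGetValue(fieldName, out var entry) && now - entry.LoadedAt < _timeToLive)
        {
            return entry.Values;
        }

        // Otherwise run the expensive grouping call once and remember the result.
        var values = _loadGroups(fieldName);
        _entries[fieldName] = (now, values);
        return values;
    }
}
```

This only helps when the same field is grouped repeatedly under the same query and filter. If the groups depend on the user's query, the cache key would have to include it as well.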

Reactive - how to combine / join / look up items with two sequences

I am connecting to a web service that gives me all prices for a day (without time info). Each of those price results has the id for a corresponding "batch run".
The "batch run" has a date+time stamp, but I have to make a separate call to get all the batch info for the day.
Hence, to get the actual time of each result, I need to combine the two API calls.
I'm using Reactive for this, but I can't reliably combine the two sets of data. I thought that CombineLatest would do it, but it doesn't seem to work as I thought (based on http://reactivex.io/documentation/operators/combinelatest.html, http://introtorx.com/Content/v1.0.10621.0/12_CombiningSequences.html#CombineLatest).
[TestMethod]
public async Task EvenMoreBasicCombineLatestTest()
{
int batchStart = 100, batchCount = 10;
//create 10 results with batch ids [100, 109]
//the test uses lists just to make debugging easier
var resultsWithBatchIdList = Enumerable.Range(batchStart, batchCount)
.Select(id => new { BatchRunId = id, ResultValue = id * 10 })
.ToList();
var resultsWithBatchId = Observable.ToObservable(resultsWithBatchIdList);
Assert.AreEqual(batchCount, await resultsWithBatchId.Count());
//create 10 batches with ids [100, 109]
var batchesList = Enumerable.Range(batchStart, batchCount)
.Select(id => new
{
ThisId = id,
BatchName = String.Concat("abcd", id)
})
.ToList();
var batchesObservable = Observable.ToObservable(batchesList);
Assert.AreEqual(batchCount, await batchesObservable.Count());
//turn the batch set into a dictionary so we can look up each batch by its id
var batchRunsByIdObservable = batchesObservable.ToDictionary(batch => batch.ThisId);
//for each result, look up the corresponding batch id in the dictionary to join them together
var resultsWithCorrespondingBatch =
batchRunsByIdObservable
.CombineLatest(resultsWithBatchId, (batchRunsById, result) =>
{
Assert.AreEqual(batchCount, batchRunsById.Count);
var correspondingBatch = batchRunsById[result.BatchRunId];
var priceResultAndSourceBatch = new
{
Result = result,
SourceBatchRun = correspondingBatch
};
return priceResultAndSourceBatch;
});
Assert.AreEqual(batchCount, await resultsWithCorrespondingBatch.Count());
}
I would expect as each element of the 'results' observable comes through, it would get combined with each element of the batch-id dictionary observable (which only ever has one element). But instead, it looks like only the last element of the result list gets joined.
I have a more complex problem deriving from this but while trying to create a minimum repro, even this is giving me unexpected results. This happens with version 3.1.1, 4.0.0, 4.2.0, etc.
(Note that the sequences don't generally match up as in this artificial example, so I can't just Zip them.)
So how can I do this join? A stream of results that I want to look up more info via a Dictionary (which also is coming from an Observable)?
Also note that the goal is to return the IObservable (resultsWithCorrespondingBatch), so I can't just await the batchRunsByIdObservable.
Ok I think I figured it out. I wish either of the two marble diagrams in the documentation had been just slightly different -- it would have made a subtlety of CombineLatest much more obvious:
N------1---2---3---
L--z--a------bc----
R------1---2-223---
       a   a bcc
It's combine latest -- so depending on when items get emitted, it's possible to miss some tuples. What I should have done is SelectMany:
NO: .CombineLatest(resultsWithBatchId, (batchRunsById, result) =>
YES: .SelectMany(batchRunsById => resultsWithBatchId.Select(result =>
Note that the "join" order is important: A.SelectMany(B) vs B.SelectMany(A) -- if A has 1 item and B has 100 items, the latter would result in 100 calls to subscribe to A.
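For completeness, here is a small sketch of the SelectMany version as a standalone method. It assumes the System.Reactive package; `BatchJoinSketch` and `JoinResultsToBatches` are names made up for this example, and value tuples replace the anonymous types from the test so the method can have a signature.

```csharp
using System;
using System.Linq;
using System.Reactive.Linq;

public static class BatchJoinSketch
{
    // Joins each result to its batch by id. ToDictionary emits a single dictionary
    // once `batches` completes; SelectMany then subscribes to `results` and looks
    // every result up in that dictionary.
    public static IObservable<(int ResultValue, string BatchName)> JoinResultsToBatches(
        IObservable<(int BatchRunId, int ResultValue)> results,
        IObservable<(int ThisId, string BatchName)> batches)
    {
        return batches
            .ToDictionary(batch => batch.ThisId)
            .SelectMany(batchRunsById => results.Select(result =>
                (result.ResultValue, batchRunsById[result.BatchRunId].BatchName)));
    }
}
```

Because `results` is only subscribed to after the dictionary arrives, this relies on `results` being a cold observable (as it is for a web service call wrapped in an observable). A hot source could emit items before the subscription happens, and those would be missed.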

Linq with Regex

I have the matches of a regex pattern and I'm having some difficulties designing the Linq around it to produce the desired output.
The data is fixed lengths: 1231234512341234567
Lengths in this case are: 3, 5, 4, 7
The regex pattern used is: (.{3})(.{5})(.{4})(.{7})
This all works perfectly fine and the matched results of the pattern are as expected, however, the desired output is proving to be somewhat difficult. In fact, I'm not even certain what it would be called in SQL terms - except maybe a pivot query. The desired output is to take all the values from each of the groups at a given position and concatenate them so for example:
field1:value1;value2;value3;valueN;field2:value2;value3;valueN;
Using the below Linq expression, I was able to get field1-value1, field2-value2, etc...
var matches = Regex.Matches(data, re).Cast<Match>();
var xmlResults = from m in matches
from e in elements
select string.Format("<{0}>{1}</{0}>", e.Name, m.Groups[e.Ordinal].Value);
but I can't seem to figure out how to get all the values at position 1 from "Groups" using the element's Ordinal, then all the values at position 2 and so on.
The "elements" in this example is a collection of field names and ordinal positions (starting at 1). So, it would look like this:
public class Element
{
public string Name { get; set; }
public int Ordinal { get; set; }
}
var elements = new List<Element>{
new Element { Name="Field1", Ordinal=1 },
new Element { Name="Field2", Ordinal=2 }
};
I've reviewed a bunch of various Linq expressions and dug into some pivot type Linq expressions, but none of them get me close - they all use the join operator which I don't think is possible.
Does anyone have any idea how to make this Linq?
You should be able to do this by changing the query to select from elements only, and bring in the matches through string.Join, like this:
// Use ToList to avoid iterating matches multiple times
var matches = Regex.Matches(data, re).Cast<Match>().ToList();
// For each element, join all matches, and pull in the value for e.Ordinal
var xmlResults = elements.Select(e =>
string.Format(
"<{0}>{1}</{0}>"
, e.Name
, string.Join(";", matches.Select(m => m.Groups[e.Ordinal].Value))
));
Note: this is not the best way of formatting XML. You would be better off using one of .NET's libraries for making XML, such as LINQ2XML.
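As a rough illustration of that note, the same pivot can build `XElement` objects from System.Xml.Linq instead of formatting strings. `RegexPivotSketch` and `Pivot` are names invented for this sketch, and the `Element` class is repeated from the question only so the block compiles on its own.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Same shape as the Element class in the question.
public class Element
{
    public string Name { get; set; }
    public int Ordinal { get; set; }
}

public static class RegexPivotSketch
{
    // Returns one XML element per field, containing that field's value
    // from every match, joined with ';'.
    public static List<XElement> Pivot(string data, string pattern, IEnumerable<Element> elements)
    {
        // Materialize the matches once so they are not re-enumerated per element.
        var matches = Regex.Matches(data, pattern).Cast<Match>().ToList();

        return elements
            .Select(e => new XElement(
                e.Name,
                string.Join(";", matches.Select(m => m.Groups[e.Ordinal].Value))))
            .ToList();
    }
}
```

XElement escapes special characters in the values, which the string.Format version does not.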

ActiveDirectory with Range not changing results using DirectorySearcher

So I'm basically trying to enumerate results from AD, and for some reason I'm unable to pull down new results, meaning it keeps continuously pulling the first 1500 results even though I tell it I want an additional range.
Can someone point out where I'm making the mistake? The code never breaks out of the loop but more importantly it pulls users 1-1500 even when I say I want users 1500-3000.
uint rangeStep = 1500;
uint rangeLow = 0;
uint rangeHigh = rangeLow + (rangeStep - 1);
bool lastQuery = false;
bool quitLoop = false;
do
{
string attributeWithRange;
if (!lastQuery)
{
attributeWithRange = String.Format("member;Range={0}-{1}", rangeLow, rangeHigh);
}
else
{
attributeWithRange = String.Format("member;Range={0}-*", rangeLow);
}
DirectoryEntry dEntryhighlevel = new DirectoryEntry("LDAP://OU=C,OU=x,DC=h,DC=nt");
DirectorySearcher dSeacher = new DirectorySearcher(dEntryhighlevel,"(&(objectClass=user)(memberof=CN=Users,OU=t,OU=s,OU=x,DC=h,DC=nt))",new string[] {attributeWithRange});
dSeacher.PropertiesToLoad.Add("givenname");
dSeacher.PropertiesToLoad.Add("sn");
dSeacher.PropertiesToLoad.Add("samAccountName");
dSeacher.PropertiesToLoad.Add("mail");
dSeacher.PageSize = 1500;
SearchResultCollection resultCollection = resultCollection = dSeacher.FindAll();
dSeacher.Dispose();
foreach (SearchResult userResults in resultCollection)
{
string Last_Name = userResults.Properties["sn"][0].ToString();
string First_Name = userResults.Properties["givenname"][0].ToString();
string userName = userResults.Properties["samAccountName"][0].ToString();
string Email_Address = userResults.Properties["mail"][0].ToString();
OriginalList.Add(Last_Name + "|" + First_Name + "|" + userName + "|" + Email_Address);
}
if(resultCollection.Count == 1500)
{
lastQuery = true;
rangeLow = rangeHigh + 1;
rangeHigh = rangeLow + (rangeStep - 1);
}
else
{
quitLoop = true;
}
}
while (!quitLoop);
You're mixing up two concepts which is what is causing you trouble. This is a FAQ on the SO forums so I probably should blog on this to try and clear things up.
Let me first just explain the concepts, then correct the code once the concepts are out there.
Concept one is fetching large collections of objects. When you fetch a lot of objects, you need to ask for them in batches. This is typically called "paging" through the results. When you do this you'll get back a paging cookie and can pass back the paged control in subsequent searches to keep getting a "page worth" of results with each pass.
The second concept is fetching large numbers of values from a single attribute. The simple example of this is reading the member attribute from a group (ex: doing a base search for that group). This is called "ranged retrieval." In this search mode you are doing a base search against that object for the large attribute (like member) and asking for "ranges" of values with each passing search.
The code above confuses these concepts. You are doing member range logic like you are doing range retrieval but you are in fact doing a search that is constructed to return a large # of objects like a paged search. This is why you are getting the same results over and over.
To fix this you need to first pick an approach. :) I recommend range retrieval against the group object and asking for the large member attribute in ranges. This will get you all of the members in the group.
If you go down this path, you'll notice you can't ask for other attributes along with these values. The only value you get back is the list of members, and you can then search for each of them. If you would rather keep a memberOf search like the one you have above, you need paged searches instead.
If you opt to stick with paged searches, then you'll need to:
- Get rid of the Range logic, and all mentions of 1500
- Set a page size of something like 1000
- Instead of ranging, look up how to do paged searches (using the page search control) using your API
If you pick ranging, you'll switch from a memberOf search like this to a search of the form:
a) scope: base
b) filter: (objectclass=*)
c) base DN: OU=C,OU=x,DC=h,DC=nt
d) Attributes: member;Range=0-*
...then you will increment the 0 up as you fetch ranges of values (i.e. do this search over and over again for each subsequent range of values, changing only the 0 to subsequent integers)
Other nits you'll notice in my logic:
- I don't set page size...you're not doing a paged search, so it doesn't matter.
- I don't ever hard code the value 1500 here. It doesn't matter. There is no value in knowing or even computing this. The point is that you asked for 0-* (i.e. all), you got back 1500, so then you say 1500-*, then 3000-*, and so on. You don't need to know the range size, only what you have been given so far.
I hope this fully answers it...
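The ranging loop described above can be sketched roughly as follows. To keep the block self-contained, the directory call is abstracted behind a hypothetical `fetchRange` delegate. In real code it would run the base-scope search for the requested attribute name and return the attribute name the server answered with plus its values. `RangedRetrievalSketch` and `ReadAllValues` are made-up names, and the sketch assumes the server marks the final range by ending the returned attribute name with "-*".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RangedRetrievalSketch
{
    // Reads every value of a large multi-valued attribute, one range at a time.
    // fetchRange takes a request such as "member;range=0-*" and returns the
    // attribute name the server used (e.g. "member;range=0-1499") with its values,
    // or null when nothing came back.
    public static List<string> ReadAllValues(
        string attribute,
        Func<string, (string ReturnedName, IReadOnlyList<string> Values)?> fetchRange)
    {
        var all = new List<string>();
        int low = 0;

        while (true)
        {
            // Always ask for "everything from low onwards"; the server decides the range size.
            var page = fetchRange($"{attribute};range={low}-*");
            if (page == null || page.Value.Values.Count == 0)
            {
                break;
            }

            all.AddRange(page.Value.Values);

            // A returned name ending in "-*" means this was the final range.
            if (page.Value.ReturnedName.EndsWith("-*", StringComparison.Ordinal))
            {
                break;
            }

            // Advance by what was actually received, never by a hard-coded 1500.
            low += page.Value.Values.Count;
        }

        return all;
    }
}
```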
Here is a code snip of doing a paged search, per my comment below (this is what you would need to do using the System.DirectoryServices.Protocols namespace classes, going down the logical path you started above (paged searches, not ranged retrieval)):
string searchFilter = "(&(objectClass=user)(memberof=CN=Users,OU=t,OU=s,OU=x,DC=h,DC=nt))";
string baseDN = "OU=C,OU=x,DC=h,DC=nt";
var scope = SearchScope.Subtree;
var attributeList = new string[] { "givenname", "sn", "samAccountName", "mail" };
PageResultRequestControl pageSearchControl = new PageResultRequestControl(1000);
do
{
SearchRequest sr = new SearchRequest(baseDN, searchFilter, scope, attributeList);
sr.Controls.Add(pageSearchControl);
var directoryResponse = ldapConnection.SendRequest(sr);
if (directoryResponse.ResultCode != ResultCode.Success)
{
// Handle error
}
var searchResponse = (SearchResponse)directoryResponse;
pageSearchControl = null; // Reset!
foreach (var control in searchResponse.Controls)
{
if (control is PageResultResponseControl)
{
var prrc = (PageResultResponseControl)control;
if (prrc.Cookie.Length > 0)
{
pageSearchControl = new PageResultRequestControl(prrc.Cookie);
}
}
}
foreach (var entry in searchResponse.Entries)
{
// Handle the search result entry
}
} while (pageSearchControl != null);
Your problem is caused by creating a new DirectorySearcher object inside the loop. Each iteration gets a fresh object that fetches the first 1500 records again. Create the searcher instance outside the loop and use the same instance for all queries.

How can I check if a string in sql server contains at least one of the strings in a local list using linq-to-sql?

In my database table I have a Positions field, which contains a space-separated list of position codes. I need to add criteria to my query that check whether any of the locally specified position codes match at least one of the position codes in the field.
For example, I have a local list that contains "RB" and "LB". I want a record that has a Positions value of OL LB to be found, as well as records with a position value of RB OT but not records with a position value of OT OL.
With AND clauses I can do this easily via
foreach (var str in localPositionList)
query = query.Where(x => x.Positions.Contains(str));
However, I need this to be chained together as or clauses. If I wasn't dealing with Linq-to-sql (all normal collections) I could do this with
query = query.Where(x => x.Positions.Split(' ').Any(y => localPositionList.Contains(y)));
However, this does not work with Linq-to-sql: an exception occurs because it cannot translate Split into SQL.
Is there any way to accomplish this?
I am trying to resist splitting this data out of this table and into other tables, as the sole purpose of this table is to give an optimized "cache" of data that requires the minimum amount of tables in order to get search results (eventually we will be moving this part to Solr, but that's not feasible at the moment due to the schedule).
I was able to get a test version working by using separate queries and running a Union on the result. My code is rough, since I was just hacking, but here it is...
List<string> db = new List<string>() {
"RB OL",
"OT LB",
"OT OL"
};
List<string> tests = new List<string> {
"RB", "LB", "OT"
};
IEnumerable<string> result = db.Where(d => d.Contains(tests[0]));
for (int i = 1; i < tests.Count(); i++) {
string val = tests[i];
result = result.Union(db.Where(d => d.Contains(val)));
}
result.ToList().ForEach(r => Console.WriteLine(r));
Console.ReadLine();
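Another option, which is not what the test above does, is to build a single OR predicate as an expression tree and pass it to one Where call. The sketch below is only an illustration: `OrContainsSketch` and `ContainsAny` are made-up names. It produces the equivalent of x => x.Positions.Contains("RB") || x.Positions.Contains("LB"). It relies on the query provider translating string.Contains, as the AND version in the question already does, and I have only checked it against in-memory data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class OrContainsSketch
{
    // Builds: x => selector(x).Contains(v1) || selector(x).Contains(v2) || ...
    public static Expression<Func<T, bool>> ContainsAny<T>(
        Expression<Func<T, string>> selector,
        IEnumerable<string> values)
    {
        var contains = typeof(string).GetMethod(nameof(string.Contains), new[] { typeof(string) });

        Expression body = null;
        foreach (var value in values)
        {
            // Reuse the selector's body so the lambda keeps a single parameter.
            Expression call = Expression.Call(selector.Body, contains, Expression.Constant(value));
            body = body == null ? call : Expression.OrElse(body, call);
        }

        // An empty list matches nothing.
        return Expression.Lambda<Func<T, bool>>(
            body ?? Expression.Constant(false),
            selector.Parameters);
    }
}
```

Usage would look something like query.Where(OrContainsSketch.ContainsAny<Player>(x => x.Positions, localPositionList)), where Player is a placeholder for the entity type. Like the other Contains-based versions, this is a substring match, so a code that is contained in a longer code would also match.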
