Lucene .net Boost not working when using * wildcard - c#

I have two documents, indexed with StandardAnalyzer. I investigated them with Luke and confirmed that my code shows the same behavior.
Document one with boost 1
stored/uncompressed,indexed,tokenized<Description:Nummer ett>
stored/uncompressed,indexed,tokenized<Id:2>
stored/uncompressed,indexed,tokenized<Name:Apa>
Document two with boost 2
stored/uncompressed,indexed,tokenized<Description:Nummer två>
stored/uncompressed,indexed,tokenized<Id:1>
stored/uncompressed,indexed,tokenized<Name:Apa>
Search apa in field Name
Returns with boost used and in the correct order.
Document 2 has Score 1.1891
Document 1 has Score 0.5945
Search ap*
Returns in no order and same score
Document 1 Score 1.0000
Document 2 Score 1.0000
Search apa*
Returns in no order and same score
Document 1 Score 1.0000
Document 2 Score 1.0000
Why is this? I would like documents with a higher boost value to rank higher even when I have to use wildcards. Is this possible?
Cheers all cool coders out there!
This is what I want to accomplish.
I have a search string and want matches, using a wildcard.
Search "Lu" +"*"
Document
Name
City
I would like the document whose Name is Lund to be rated higher than the document whose Name is Lunt or whose City is Lund, for example. This is because I will know which documents are the most popular. I want to get the documents with City Stockholm and with Names Stockholm and Stockholmen, but ordered as I choose.

WildcardQuery is a subclass of MultiTermQuery, which by default is rewritten to a constant-score query, so every matching document gets the same score of 1 regardless of its boost.
If you check the definition of t.getBoost():
t.getBoost() is a search time boost of term t in the query q as
specified in the query text (see query syntax), or as set by
application calls to setBoost(). Notice that there is really no direct
API for accessing a boost of one term in a multi term query, but
rather multi terms are represented in a query as multi TermQuery
objects, and so the boost of a term in the query is accessible by
calling the sub-query getBoost()
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_1/api/core/org/apache/lucene/search/Similarity.html#formula_termBoost
One possible hack is to set the rewrite method on the query parser, so that the wildcard is expanded into scored term queries:
myCustomQueryParser.SetMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE)
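To make the difference between the two rewrite modes concrete, here is a toy sketch in Python. It is not Lucene.Net and it does not reproduce Lucene's real scoring formula: the INDEX dictionary and both function names are hypothetical, and the "score" is simply the document boost. It only shows why a constant-score rewrite flattens the ranking while a scoring rewrite lets per-document factors through.

```python
from fnmatch import fnmatchcase

# Hypothetical index: doc id -> (tokens of the Name field, index-time boost).
# Toy model only; this is NOT how Lucene stores or scores documents.
INDEX = {
    1: (["apa"], 1.0),
    2: (["apa"], 2.0),
}

def constant_score(pattern):
    """Constant-score rewrite: every matching document gets the same score."""
    return {
        doc_id: 1.0
        for doc_id, (tokens, _boost) in INDEX.items()
        if any(fnmatchcase(token, pattern) for token in tokens)
    }

def scoring_rewrite(pattern):
    """Scoring rewrite: expand the wildcard into its matching terms and
    score each hit (here, crudely, by the document boost)."""
    scores = {}
    for doc_id, (tokens, boost) in INDEX.items():
        hits = [token for token in tokens if fnmatchcase(token, pattern)]
        if hits:
            scores[doc_id] = boost * len(hits)
    return scores

print(constant_score("ap*"))   # both documents tie
print(scoring_rewrite("ap*"))  # document 2 ranks above document 1
```

In real Lucene the scored variant also applies tf, idf and norms per expanded term, so the numbers will differ, but the ordering effect is the same idea.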

Related

I want to show numbers in "From - To" Format

I have multiple random numbers in table column ID like
8v12027
8v12025
8v12024
8v12029
8v12023
8v12030
8v12020
Expected output: 8v12020, From 8v12023 To 8v12025, 8v12027, From 8v12029 To 8v12030
I assume you're looking for an SQL solution, so:
You have to use the LEAD or LAG function and concatenate the result.
SELECT CONCAT('From ', p.Id, ' To ', LEAD(p.Id) OVER (ORDER BY p.Id)) FROM YourTable p
There is a really good explanation of those functions on the SQLAuthority website.
https://blog.sqlauthority.com/2013/09/22/sql-server-how-to-access-the-previous-row-and-next-row-value-in-select-statement/
But if you were looking for a pure C# solution, you can retrieve the data set into an array, order it by Id, and use a for loop to concatenate the current value with the previous (or next) one.
Or with LINQ, pair each value with its successor using Zip:
yourArray.Zip(yourArray.Skip(1), (a, b) => String.Concat("From ", a, " To ", b))
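Pairing each row with its neighbour does not by itself collapse runs of consecutive IDs into a single "From - To" entry. A minimal sketch of that grouping step in Python follows. It assumes every ID is a fixed prefix followed by a number (as in 8v12027); split_id and format_ranges are hypothetical helper names, not part of any library.

```python
import re

def split_id(value):
    """Split an ID such as '8v12027' into ('8v', 12027)."""
    match = re.fullmatch(r"(.*?)(\d+)", value)
    if match is None:
        raise ValueError(f"ID has no numeric suffix: {value!r}")
    return match.group(1), int(match.group(2))

def format_ranges(ids):
    """Sort the IDs and collapse runs of consecutive numbers into ranges."""
    parsed = sorted((split_id(i) + (i,) for i in ids), key=lambda p: p[:2])
    runs = []        # each run is a list of original IDs with consecutive numbers
    previous = None  # (prefix, number) of the last ID seen
    for prefix, number, original in parsed:
        if previous == (prefix, number - 1):
            runs[-1].append(original)
        else:
            runs.append([original])
        previous = (prefix, number)
    return ", ".join(
        run[0] if len(run) == 1 else f"From {run[0]} To {run[-1]}"
        for run in runs
    )

ids = ["8v12027", "8v12025", "8v12024", "8v12029", "8v12023", "8v12030", "8v12020"]
print(format_ranges(ids))
# 8v12020, From 8v12023 To 8v12025, 8v12027, From 8v12029 To 8v12030
```

The same idea carries over to SQL as a "gaps and islands" query, or to a plain C# loop over the sorted array.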

Merging queries of same table N times

I have a table of words, a lookup table of where those words are found in documents, and the number of times each word appears in each document. So there might be a record that says Alpha exists 5 times in document X, while Beta exists 3 times in document X, and another for Beta existing twice in document Y.
The user can enter a number of words to search, so "quick brown fox" is three queries, but "quick brown fox jumped" is four queries. While I can get a scored result set of each word in turn, what I actually want is to add the number of occurrences together for each word, such that the top result is the highest occurrence count for all words.
A document might have hundreds of "quick" and "brown" occurrences but no "fox" occurrences. The results should still be included as it could score higher than a document with only one each of "quick", "brown", and "fox".
The problem I can't work out is how to amalgamate the 1 to N queries with the occurrences summed. I think I need to use GROUP BY and SUM() but I'm not certain. LINQ is preferred but SQL would be OK. MS SQL 2016.
I want to pass the results on to a page indexer so a for-each over the results wouldn't work, plus we're talking 80,000 word records, 3 million document-word records, and 100,000 document records.
// TextIndexDocument:
// Id | WordId | Occurences | DocumentId | (more)
//
// TextIndexWord:
// Id | Word
foreach (string word in words)
{
    string lword = word.ToLowerInvariant();
    var results = from docTable in db.TextIndexDocuments
                  join wordTable in db.TextIndexWords on docTable.WordId equals wordTable.Id
                  where wordTable.Word == lword
                  orderby docTable.Occurences descending
                  select docTable;
    // (incomplete)
}
More information
I understand that full text searching is recommended. The problem then is how to rank the results from a half dozen unrelated tables (searching in forum posts, articles, products...) into one unified result set - let's say record Id, record type (article/product/forum), and score. The top result might be a forum post while the next best hits are a couple of articles, then a product, then another forum post and so on. The TextIndexDocument table already has this information across all the relevant tables.
Let's assume that you can create a navigation property TextIndexDocuments in Document:
    public virtual ICollection<TextIndexDocument> TextIndexDocuments { get; set; }
and a navigation property in TextIndexDocument:
    public virtual TextIndexWord TextIndexWord { get; set; }
(highly recommended)
Then you can use the properties to get the desired results:
    // lwords: the lower-cased search words (assumed to be built from words beforehand)
    var results =
    (
        from doc in db.Documents
        select new
        {
            doc,
            TotalOccurrences =
                doc.TextIndexDocuments
                    .Where(tid => lwords.Contains(tid.TextIndexWord.Word))
                    .Sum(tid => tid.Occurences)
        }
    ).OrderByDescending(x => x.TotalOccurrences);
As far as I know this cannot be accomplished in LINQ, or at least not easily, and especially not in any kind of performant way.
What you really should consider, assuming your DBA will allow it, is Full-Text indexing of your documents stored in SQL Server. From my understanding the RANK operator is exactly what you are looking for which has been highly optimized for Full-Text.
In response to your comment: (sorry for not noticing that)
You'll need to use either a series of subqueries or Common Table Expressions. CTEs are a bit hard to get used to writing at first, but once you get used to them they are far more elegant than the corresponding query written with subqueries. Either way the query execution plan will be the same, so there is no performance gain from going the CTE route.
You want to add up the occurrences of the words per document. So group by document ID, use SUM, and order by the total descending:
select documentid, sum(occurences)
from doctable
where wordid in (select id from wordtable where word in ('quick', 'brown', 'fox'))
group by documentid
order by sum(occurences) desc;
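To check the GROUP BY / SUM approach end to end, here is a small sketch that runs the same kind of query against an in-memory SQLite database from Python. The table and column names follow the question (including its Occurences spelling); the sample rows and the rank_documents helper are made up for illustration, and SQLite stands in for SQL Server here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TextIndexWord (Id INTEGER PRIMARY KEY, Word TEXT);
    CREATE TABLE TextIndexDocument (
        Id INTEGER PRIMARY KEY, WordId INTEGER, Occurences INTEGER, DocumentId INTEGER);
    INSERT INTO TextIndexWord VALUES (1, 'quick'), (2, 'brown'), (3, 'fox');
    INSERT INTO TextIndexDocument (WordId, Occurences, DocumentId) VALUES
        (1, 100, 10), (2, 50, 10),           -- document 10: many hits, no 'fox'
        (1, 1, 20), (2, 1, 20), (3, 1, 20);  -- document 20: one hit per word
""")

def rank_documents(words):
    """Return (DocumentId, total occurrences) pairs, highest total first."""
    placeholders = ", ".join("?" for _ in words)  # one bound parameter per word
    sql = f"""
        SELECT d.DocumentId, SUM(d.Occurences) AS Total
        FROM TextIndexDocument AS d
        JOIN TextIndexWord AS w ON w.Id = d.WordId
        WHERE w.Word IN ({placeholders})
        GROUP BY d.DocumentId
        ORDER BY Total DESC
    """
    return conn.execute(sql, [w.lower() for w in words]).fetchall()

print(rank_documents(["quick", "brown", "fox"]))  # [(10, 150), (20, 3)]
```

Document 10 ranks first even though it never mentions "fox", which is the behavior the question asks for. The words are passed as bound parameters, so the number of search words can vary without building SQL from user input.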

lucene.net filtering on multiple fields

Following is my schema
Product_Name (Analyzed),Category (Analyzed)
Scenario:
I want to search those products whose category is exactly "Cellphones & Accessories" and Product_Name is "sam*"
Equivalent SQL Query is
select * from products
where Product_Name like '%sam%' and Category='Cellphones & Accessories'
I am using lucene.net.
I need equivalent lucene.net statement.
As this is a few months old I'll be brief (I can expand if you're still interested)...
If you want to have an exact match to Category then do not analyze. Analyzers will chop the string up into bits which are then searchable. Matching case can be problematic so maybe just the lowercase analyzer would work for that field.
It might be useful to have several fields analyzed in different ways so that different queries can be used.
NOTE: "sam*" is not equivalent to "%sam%"
Do you want "sam" to be a prefix ie "sample" or a word "the sam product"?
If it's a word then a no stopword analyzer should be fine.
A nice trick is to create many fields (with the same name) holding variations of the name, probably with just a lowercase analyzer:
name: "some sample product"
name: "sample product"
name: "product"
Then have a look at "prefix queries". A query of (name:sam*) would then match.
Also have a look at the PerFieldAnalyzerWrapper in order to use a different analyzer for each field.
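The variations trick can be sketched without Lucene at all. The Python below is only an illustration of the idea: name_variants and prefix_match are hypothetical helpers, and a plain startswith stands in for a prefix query against an untokenized, lowercased field.

```python
def name_variants(name):
    """Lowercase the name and return every word-aligned suffix of it."""
    words = name.lower().split()
    return [" ".join(words[i:]) for i in range(len(words))]

def prefix_match(variants, prefix):
    """True when the prefix starts at least one stored variant."""
    prefix = prefix.lower()
    return any(variant.startswith(prefix) for variant in variants)

variants = name_variants("Some Sample Product")
print(variants)                       # ['some sample product', 'sample product', 'product']
print(prefix_match(variants, "sam"))  # True: "sam" starts the word "sample"
print(prefix_match(variants, "amp"))  # False, although LIKE '%amp%' would match
```

This is also why "sam*" and '%sam%' are not the same thing: the prefix form only matches at the start of a word, never in the middle of one.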

MongoDB use index in regular expression query

I am using the official C# MongoDB driver.
If I have an index on three elements {"firstname":1,"surname":1,"companyname":1} can I search the collection by using a regular expression that directly matches against the index value?
So, if someone enters "sun bat" as a search term, I would create a regex as follows
(?=.*\bsun)(?=.*\bbat).* and this should match any index entries where firstname or surname or companyname starts with 'sun' AND where firstname or surname or companyname starts with 'bat'.
If I can't do it this way, how can I do it? The user just types their search terms, so I won't know which element (firstname, surname, companyname) each search term (sun or bat) refers to.
Update: for MongoDB 2.4 and above you should not use this method but use MongoDB's text index instead.
Below is the original and still relevant answer for MongoDB < 2.4.
Great question. Keep this in mind:
MongoDB can only use one index per query.
Queries that use regular expressions only use an index when the regex is rooted and case sensitive.
The best way to do a search across multiple fields is to create an array of search terms (lower case) for each document and index that field. This takes advantage of the multi-keys feature of MongoDB.
So the document might look like:
{
"firstname": "Tyler",
"surname": "Brock",
"companyname": "Awesome, Inc.",
"search_terms": [ "tyler", "brock", "awesome inc"]
}
You would create an index: db.users.ensureIndex({ "search_terms": 1 })
Then when someone searches for "Tyler", you smash the case and search the collection using a case sensitive regex that matches the beginning of the string:
db.users.find({ "search_terms": /^tyler/ })
What MongoDB does when executing this query is try to match your term against every element of the array (the index is set up that way too -- so it's speedy). Hopefully that will get you where you need to be, good luck.
Note: These examples are in the shell. I have never written a single line of C# but the concepts will translate even though the syntax may differ.
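For the multi-term case from the question ("sun bat"), the same idea extends to one rooted regex per typed term. Here is a rough sketch in plain Python, without any MongoDB driver: build_search_terms and matches are hypothetical helpers that only mimic what the search_terms array and the rooted, case-sensitive regexes would do. On the server side this would presumably become one such regex per term, combined so that all of them must match.

```python
import re

def build_search_terms(doc, fields=("firstname", "surname", "companyname")):
    """Lowercase each field and drop punctuation, as in the document above."""
    return [re.sub(r"[^\w\s]", "", doc[field].lower()) for field in fields]

def matches(doc, user_input):
    """True when every typed term is a prefix of at least one search term."""
    terms = user_input.lower().split()
    return all(
        any(re.match("^" + re.escape(term), entry) for entry in doc["search_terms"])
        for term in terms
    )

doc = {"firstname": "Tyler", "surname": "Brock", "companyname": "Awesome, Inc."}
doc["search_terms"] = build_search_terms(doc)

print(doc["search_terms"])     # ['tyler', 'brock', 'awesome inc']
print(matches(doc, "ty bro"))  # True
print(matches(doc, "ty sun"))  # False
```

Note that a term such as "inc" does not match here, because each array element is only matched from its beginning. Storing every word as its own element would change that.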

SQL user defined aggregate order of values preserved?

I'm using the code from this MSDN page to create a user-defined aggregate that concatenates strings with GROUP BY in SQL Server. One of my requirements is that the concatenated values appear in the same order as in the query. For example:
Value Group
1 1
2 1
3 2
4 2
Using query
SELECT
dbo.Concat(tbl.Value) As Concat,
tbl.Group
FROM
(SELECT TOP 1000
tblTest.*
FROM
tblTest
ORDER BY
tblTest.Value) As tbl
GROUP BY
tbl.Group
Would result in:
Concat Group
"1,2" 1
"3,4" 2
The result always seems to come out correct and as expected, but then I came across this page, which states that the order is not guaranteed and that the attribute SqlUserDefinedAggregateAttribute.IsInvariantToOrder is only reserved for future use.
So my question is: Is it correct to assume that the concatenated values in the string can end up in any order? If that is the case then why does the example code on the MSDN page use the IsInvariantToOrder attribute?
I suspect a big problem here is your statement "the same as in the query": your query never defines (and cannot define) an order for the things being aggregated (you can of course order the groups, by having an ORDER BY after the GROUP BY). Beyond that, I can only say that the aggregate operates purely on a set (rather than an ordered sequence), and that technically the order is indeed undefined.
While the accepted answer is correct, I wanted to share a workaround that others may find useful. Warning: it involves not using a user-defined aggregate at all :)
The link below describes an elegant way to build a concatenated, delimited list using only a SELECT statement and a varchar variable. The upside (for this thread) is that you can specify the order in which the rows are processed. The downside is that you can't easily concatenate across many different subsets of rows without painful iteration.
Not perfect, but for my use case was a good workaround.
http://blog.sqlauthority.com/2008/06/04/sql-server-create-a-comma-delimited-list-using-select-clause-from-table-column/
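The underlying point, that the order has to be stated explicitly at the moment of concatenation rather than inherited from an earlier step, can be shown outside SQL Server. A small Python sketch, with made-up rows and a hypothetical concat_by_group helper:

```python
from itertools import groupby

# (Value, Group) pairs, deliberately out of order.
rows = [(2, 1), (4, 2), (1, 1), (3, 2)]

def concat_by_group(rows):
    """Sort explicitly by group and then value before joining each group."""
    ordered = sorted(rows, key=lambda row: (row[1], row[0]))
    return {
        group: ",".join(str(value) for value, _ in members)
        for group, members in groupby(ordered, key=lambda row: row[1])
    }

print(concat_by_group(rows))  # {1: '1,2', 2: '3,4'}
```

Because the sort happens inside the function, the result does not depend on the order in which the rows arrive, which is exactly the guarantee the user-defined aggregate cannot give.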
