I am trying to implement Lucene search on 3 fields in my data. It should work as follows:
when the field text is "My big white cat" and I search for "big cat", it should match.
Based on the tutorials I added the AddToLuceneIndex method:
private static void AddToLuceneIndex(MyObject myObject, IndexWriter writer)
{
    var searchQuery = new TermQuery(new Term("Id", myObject.Id));
    writer.DeleteDocuments(searchQuery);
    var doc = new Document();
    doc.Add(new Field("Field1", myObject.Field1, Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("Field2", myObject.Field2, Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("Field3", myObject.Field3, Field.Store.YES, Field.Index.ANALYZED));
    (...)
    writer.AddDocument(doc);
}
In my Search method I tried to use PhraseQuery:
public static IEnumerable<MyObject> Search(string phrase)
{
    var luceneDir = FSDirectory.Open(new DirectoryInfo(LuceneDir));
    var indexReader = IndexReader.Open(luceneDir, true);
    var searcher = new IndexSearcher(indexReader);
    var phraseQuery = new PhraseQuery();
    phraseQuery.Add(new Term("Field1", phrase));
    const int maxHits = 1000;
    var collector = TopScoreDocCollector.Create(maxHits, false);
    searcher.Search(phraseQuery, collector);
    var hits = collector.TopDocs().ScoreDocs;
    return MapLuceneToDataList(hits, searcher).ToList();
}
There are always 0 hits (although there are matching objects).
When I use BooleanQuery like this:
public static IEnumerable<MyObject> Search(string phrase)
{
    var luceneDir = FSDirectory.Open(new DirectoryInfo(LuceneDir));
    var indexReader = IndexReader.Open(luceneDir, true);
    var searcher = new IndexSearcher(indexReader);
    var terms = phrase.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
    var analyzer = new StandardAnalyzer(Version.LUCENE_30);
    var queryParser = new MultiFieldQueryParser
        (Version.LUCENE_30,
         new[] { "Field1", "Field2", "Field3" },
         analyzer) { FuzzyMinSim = 0.8f };
    var booleanQuery = new BooleanQuery();
    foreach (var term in terms)
    {
        booleanQuery.Add(queryParser.Parse(term.Replace("~", "") + "~"), Occur.MUST);
    }
    const int maxHits = 1000;
    var collector = TopScoreDocCollector.Create(maxHits, false);
    searcher.Search(booleanQuery, collector);
    var hits = collector.TopDocs().ScoreDocs;
    return MapLuceneToDataList(hits, searcher).ToList();
}
it works well, but I don't need "big OR cat"; I need the behavior I described earlier. What am I doing wrong with PhraseQuery?
There are two problems with your PhraseQuery.
As stated by @groverboy, you must add the terms to the PhraseQuery separately. While Query.ToString() may show the same thing, they are not the same thing: ToString does not show term breaks in a PhraseQuery. It tries to represent the query in something as close to standard QueryParser syntax as possible, which isn't capable of expressing every query that can be constructed manually with the Query API. The PhraseQuery you've created won't be run through the analyzer, so it will never be tokenized; it will be looking for the single token "big cat" rather than the two adjacent tokens "big" and "cat".
The IndexSearcher.Explain method provides much more complete information than ToString, so you may find it a useful tool.
Also, you don't really want strictly adjacent tokens either; you need to incorporate some slop into the query. You want "big cat" to match "big white cat", so you will need to allow an adequate amount of slop.
So, something like this:
var phraseQuery = new PhraseQuery();
phraseQuery.Add(new Term("Field1", "big"));
phraseQuery.Add(new Term("Field1", "cat"));
phraseQuery.Slop = 1;
You could also just run the query through the query parser, if you prefer, using the analyzer you've created in your third code block. You can set the default phrase slop on the query parser to handle the slop issue discussed above. Something like:
queryParser.PhraseSlop = 1;
queryParser.Parse("\"" + phrase + "\"");
// Or maybe just: queryParser.Parse(phrase);
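For completeness, here is a minimal sketch (my own, not part of the original answer) of how the whole search method could look with the query-parser approach, reusing the LuceneDir constant and MapLuceneToDataList helper from your code blocks; the SearchPhrase name and the QueryParser.Escape call are just illustrative choices:
public static IEnumerable<MyObject> SearchPhrase(string phrase)
{
    var luceneDir = FSDirectory.Open(new DirectoryInfo(LuceneDir));
    var indexReader = IndexReader.Open(luceneDir, true);
    var searcher = new IndexSearcher(indexReader);

    var analyzer = new StandardAnalyzer(Version.LUCENE_30);
    var queryParser = new MultiFieldQueryParser(
        Version.LUCENE_30,
        new[] { "Field1", "Field2", "Field3" },
        analyzer) { PhraseSlop = 1 };

    // Quoting the input makes the parser build a (sloppy) PhraseQuery
    // instead of OR-ing the individual terms together.
    var query = queryParser.Parse("\"" + QueryParser.Escape(phrase) + "\"");

    const int maxHits = 1000;
    var collector = TopScoreDocCollector.Create(maxHits, false);
    searcher.Search(query, collector);
    var hits = collector.TopDocs().ScoreDocs;
    return MapLuceneToDataList(hits, searcher).ToList();
}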
You need to add a Term to the PhraseQuery for each word in the phrase, like this:
var phraseQuery = new PhraseQuery();
var words = phrase.Split(new Char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
foreach (var word in words)
{
    phraseQuery.Add(new Term("Field1", word));
}
Related
I am new to Lucene, and I am facing serious issues with Lucene search. When I search records using a string, or a string with numbers, it works fine. But it does not bring back any results when I search the records using a string with special characters.
ex: example - brings results
'examples' - no results
%example% - no results
example2 - brings results
#example - no results
Code:
Indexing:
_document.Add(new Field(dc.ColumnName, dr[dc.ColumnName].ToString(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
Search query:
Lucene.Net.Store.Directory _dir = Lucene.Net.Store.FSDirectory.Open(Config.Get(directoryPath));
Lucene.Net.Analysis.Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
Query querySearch = queryParser.Parse("*" + searchParams.SearchForText + "*");
booleanQuery.Add(querySearch, Occur.MUST);
Can anyone help me fix this?
It appears there's work to be done. I urge you to get a good starter book on Lucene, such as Lucene in Action, Second Edition (since you're using version 3). Although it targets Java, the examples are easily adapted to C#, and really it's the concepts that matter most.
First, this:
"*" + searchParams.SearchForText + "*"
Don't do that. Leading wildcard searches are wildly inefficient and will consume an enormous amount of resources on a sizable index, doubly so for a combined leading and trailing wildcard search - what would happen if the query text were *e*?
There also seems to be more going on than is shown in the posted code, as there is no reason not to be getting hits for those inputs. The snippet below will produce the following in the console:
Index terms:
example
example2
raw text %example% as query text:example got 1 hits
raw text 'example' as query text:example got 1 hits
raw text example as query text:example got 1 hits
raw text #example as query text:example got 1 hits
raw text example2 as query text:example2 got 1 hits
Wildcard raw text example* as query text:example* got 2 hit(s)
See the index terms listing? No 'special characters' land in the index, because StandardAnalyzer removes them at index time (assuming StandardAnalyzer was also used to index the field).
I recommend running the snippet below in the debugger and observing what happens.
public static void Example()
{
    var field_name = "text";
    var field_value = "%example% 'example' example #example example";
    var field_value2 = "example2";
    var luceneVer = Lucene.Net.Util.Version.LUCENE_30;

    using (var writer = new IndexWriter(new RAMDirectory(),
        new StandardAnalyzer(luceneVer), IndexWriter.MaxFieldLength.UNLIMITED)
    )
    {
        var doc = new Document();
        var field = new Field(
            field_name,
            field_value,
            Field.Store.YES,
            Field.Index.ANALYZED,
            Field.TermVector.YES
        );
        doc.Add(field);
        writer.AddDocument(doc);

        doc = new Document();
        field = new Field(
            field_name,
            field_value2,
            Field.Store.YES,
            Field.Index.ANALYZED,
            Field.TermVector.YES
        );
        doc.Add(field);
        writer.AddDocument(doc);
        writer.Commit();
        Console.WriteLine();

        // Show ALL terms in the index.
        using (var reader = writer.GetReader())
        {
            TermEnum terms = reader.Terms();
            Console.WriteLine("Index terms:");
            while (terms.Next())
            {
                Console.WriteLine("\t{0}", terms.Term.Text);
            }
        }

        // Search for each word in the original content #field_value
        using (var searcher = new IndexSearcher(writer.GetReader()))
        {
            string query_text;
            QueryParser parser;
            Query query;
            TopDocs topDocs;

            List<string> field_queries = new List<string>(field_value.Split(' '));
            field_queries.Add(field_value2);

            var analyzer = new StandardAnalyzer(luceneVer);
            while (field_queries.Count > 0)
            {
                query_text = field_queries[0];
                parser = new QueryParser(luceneVer, field_name, analyzer);
                query = parser.Parse(query_text);
                topDocs = searcher.Search(query, null, 100);
                Console.WriteLine();
                Console.WriteLine("raw text {0} as query {1} got {2} hit(s)",
                    query_text,
                    query,
                    topDocs.TotalHits
                );
                field_queries.RemoveAt(0);
            }

            // Now do a wildcard query "example*"
            query_text = "example*";
            parser = new QueryParser(luceneVer, field_name, analyzer);
            query = parser.Parse(query_text);
            topDocs = searcher.Search(query, null, 100);
            Console.WriteLine();
            Console.WriteLine("Wildcard raw text {0} as query {1} got {2} hit(s)",
                query_text,
                query,
                topDocs.TotalHits
            );
        }
    }
}
If you need to perform exact matching, and index certain characters like %, then you'll need to use something other than StandardAnalyzer, perhaps a custom analyzer.
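For example, here is a rough sketch (not from the original post) using WhitespaceAnalyzer, which only splits on whitespace and therefore keeps characters like % and # in the indexed terms; the trade-off is that it does not lower-case or strip punctuation, so the query term has to match exactly:
var analyzer = new WhitespaceAnalyzer();
var dir = new RAMDirectory();
var writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

var doc = new Document();
doc.Add(new Field("text", "%example% #example example", Field.Store.YES, Field.Index.ANALYZED));
writer.AddDocument(doc);
writer.Commit();

var searcher = new IndexSearcher(writer.GetReader());
// Exact match on the token including the % characters.
var hits = searcher.Search(new TermQuery(new Term("text", "%example%")), 10);
Console.WriteLine(hits.TotalHits); // expect 1 here, whereas StandardAnalyzer would have indexed only "example"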
I am using Lucene.Net 3.0.3.
I have several classes with fields like this:
public class Test1
{
public string Name {get;set;}
}
How I create the index:
var doc = new Document();
doc.Add(new Field(KEYWORDS_FIELD_NAME, Convert.ToString(someUid, CultureInfo.InvariantCulture), Field.Store.YES, Field.Index.ANALYZED));
How I create the query:
var analyzer = new RussianAnalyzer(Version.LUCENE_30);

private Query ParseQuery(string queryString, Analyzer analyzer)
{
    var classQuery = new BooleanQuery();
    var hs = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase);
    foreach (var par in Parameters)
    {
        classQuery.Add(new TermQuery(new Term(KEYWORDS_FIELD_NAME, par.ClassName.ToLower())), Occur.SHOULD);
        hs.Add(par.PropertyName);
    }

    var parser = new MultiFieldQueryParser(Version.LUCENE_30, hs.ToArray(), analyzer);
    var multiQuery = parser.Parse(queryString.Trim());

    var result = new BooleanQuery
    {
        { classQuery, Occur.MUST },
        new BooleanClause(multiQuery, Occur.MUST)
    };
    return result;
}
And the search request:
var query = ParseQuery(queryString, analyzer);
using (var searcher = new IndexSearcher(luceneDirectory))
{
    var hits = searcher.Search(query, null, 10000);
}
In the search index there is a "Name" property of class Test1.
Some of the values of the properties are:
40002
40001
4001
4009
and other similar values.
When I enter "4001", the search produces one result. That suits me.
However, when I enter "400", the search does not find any value.
I understand that this exact value is not in the index, but I expect the search in this case to find "similar" values: 4001, 40002, and others.
Can this be done? What am I doing wrong?
Thank you.
P.S. It works with "400*" and MultiFieldQueryParser without RegexQuery, but it is about 30 percent slower.
When I use RegexQuery, it is 70-80 percent slower.
Try searching for 400*. You would need to use a WildcardQuery, not a BooleanQuery.
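For illustration, a minimal sketch of what that could look like (assuming the field is called "Name" as in the Test1 class and reusing the searcher from the question; the query parser turns 400* into a PrefixQuery, so either form below works):
// "400*" as a prefix search: matches 4001, 40002, 4009, ...
var prefixQuery = new PrefixQuery(new Term("Name", "400"));

// Equivalent wildcard form; avoid leading wildcards such as "*400*", which are far slower.
var wildcardQuery = new WildcardQuery(new Term("Name", "400*"));

var hits = searcher.Search(prefixQuery, null, 10000);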
I need help figuring out which query types to use in given situations.
I think I'm right in saying that if I stored the word "FORD" in a Lucene field and wanted to find an exact match, I would use a TermQuery?
But which query type should I use if I was looking for the word "FORD" where the contents of the field were stored as:
"FORD|HONDA|SUZUKI"
What if I wanted to search the contents of an entire page, looking for a phrase such as "please help me"?
If you want to search for FORD in FORD|HONDA|SUZUKI, either index the field with Field.Index.ANALYZED, or store the values as below and use a TermQuery:
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var fs = FSDirectory.Open("test.index");

// Index a test document
IndexWriter wr = new IndexWriter(fs, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
var doc = new Document();
doc.Add(new Field("Model", "FORD", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("Model", "HONDA", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("Model", "SUZUKI", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("Text", @"What if i was to search the contents of an entire page, looking for a phrase? such as ""please help me""?",
    Field.Store.YES, Field.Index.ANALYZED));
wr.AddDocument(doc);
wr.Commit();

var reader = wr.GetReader();
var searcher = new IndexSearcher(reader);

// Use TermQuery for "NOT_ANALYZED" fields
var result = searcher.Search(new TermQuery(new Term("Model", "FORD")), 100);
foreach (var item in result.ScoreDocs)
{
    Console.WriteLine("1)" + reader.Document(item.Doc).GetField("Text").StringValue);
}

// Use QueryParser for "ANALYZED" fields
var qp = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "Text", analyzer);
result = searcher.Search(qp.Parse(@"""HELP ME"""), 100);
foreach (var item in result.ScoreDocs)
{
    Console.WriteLine("2)" + reader.Document(item.Doc).GetField("Text").StringValue);
}
TermQuery means you want to search for the term exactly as it is stored in the index, which depends on how you indexed that field (NOT_ANALYZED, or ANALYZED plus whichever analyzer you used). Its most common use is with NOT_ANALYZED fields.
You can use TermQuery with ANALYZED fields too, but then you should know how the analyzer tokenizes your input string. Below is a sample that shows how different analyzers tokenize your input:
var text = @"What if i was to search the contents of an entire page, looking for a phrase? such as ""please help me""?";

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
//var analyzer = new WhitespaceAnalyzer();
//var analyzer = new KeywordAnalyzer();
//var analyzer = new SimpleAnalyzer();

var ts = analyzer.TokenStream("", new StringReader(text));
var termAttr = ts.GetAttribute<ITermAttribute>();
while (ts.IncrementToken())
{
    Console.Write("[" + termAttr.Term + "] ");
}
I would turn the problem sideways and put the multiple values for each field into the index separately -- this should make searching simpler. Looking at Field Having Multiple Values might be helpful.
Environment:
Lucene.Net 3.0.3
Visual Studio 2010
I've been stuck on this problem for hours at this point and I can't figure it out.
I build an index with a field named "Stores", formatted like below:
a075,a073,a021....
Each string represents the id of a shop, and they are separated by ",".
I would like to search for "a073" and have it return the matching data if "Stores" includes "a073".
Thanks in advance.
static RAMDirectory dir = new RAMDirectory();

public void BuildIndex()
{
    IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.Add(new Field("PROD_ID", "", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
    doc.Add(new Field("Stores", "", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
    for (int i = 1; i <= 10; i++)
    {
        doc.GetField("PROD_ID").SetValue(Guid.NewGuid().ToString());
        doc.GetField("Stores").SetValue("a075,a073,no2" + i.ToString());
        iw.AddDocument(doc);
    }
    iw.Optimize();
    iw.Commit();
    iw.Close();
}

private void Search(string KeyWord)
{
    IndexSearcher search = new IndexSearcher(dir, true);
    QueryParser qp = new QueryParser(Version.LUCENE_30, "Stores", new StandardAnalyzer(Version.LUCENE_30));
    Query query = qp.Parse(KeyWord);
    var hits = search.Search(query, null, search.MaxDoc).ScoreDocs;
    foreach (var res in hits)
    {
        Response.Write(string.Format("PROD_ID:{0} / Stores{1}"
            , search.Doc(res.Doc).Get("PROD_ID").ToString()
            , search.Doc(res.Doc).Get("Stores").ToString() + "<BR>"));
    }
}
Try to use Lucene.Net.Search.WildcardQuery and include wildcards.
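For instance, a small sketch against the Search method above (reusing the search IndexSearcher from the question; note that the leading wildcard is expensive on a large index, as pointed out in an earlier answer):
// Matches any "Stores" term containing "a073", whether the comma-separated
// value was kept as a single token or split by the analyzer.
var query = new WildcardQuery(new Term("Stores", "*a073*"));
var hits = search.Search(query, null, search.MaxDoc).ScoreDocs;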
Google for Lucene regular expression search to find some code to use regular expressions in your query... there is a contrib implementation called Contrib.Regex.RegexTermEnum.
An alternative would be a multivalued field: instead of one comma-separated string, you add each value to the field separately. Each value is then indexed by Lucene and you can query the field in the same manner as a normal field. In addition you can query it multiple times, e.g. multiField:ValueA AND multiField:ValueB ...
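A minimal sketch of that multivalued approach for the "Stores" field above (using NOT_ANALYZED values so each store id is stored as one exact term; iw and search are the writer and searcher from the question's code):
var doc = new Document();
doc.Add(new Field("PROD_ID", Guid.NewGuid().ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));

// One Field instance per store id instead of a single comma-separated string.
foreach (var storeId in new[] { "a075", "a073", "no21" })
{
    doc.Add(new Field("Stores", storeId, Field.Store.YES, Field.Index.NOT_ANALYZED));
}
iw.AddDocument(doc);

// An exact TermQuery now matches documents whose Stores values include "a073".
var hits = search.Search(new TermQuery(new Term("Stores", "a073")), null, 100).ScoreDocs;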
I'm currently trying to implement a Lucene.NET based search on a large database and I've hit a snag trying to do a search on what is essentially relational data.
At a high level, the data I'm trying to search is grouped; each item belongs to 1 to 3 groups. I then need to be able to do a search for all items that are in a combination of groups (EG: each item belongs to both group A and group B).
Each of these groupings has an ID and a description in the data I'm searching, but the descriptions may be substrings of one another (EG: one group named "Stuff" and another "Other stuff"), and I don't want to match categories whose description is a substring of the one I'm looking for.
I've considered pulling the data back without this filtering and then filtering on the IDs, but I was intending to paginate the data returned from Lucene for performance reasons. I've also considered putting the IDs in a space-separated field and doing a text search on it, but that seems like a total hack...
Does anyone have any idea how to best handle this kind of search in Lucene.NET? (Just to clarify before someone says I'm using the wrong tool, this is only a subset of a larger set of filters which includes full-text searching. If you still think I'm using the wrong tool though I'd love to hear what the right one is)
I've had my share of problems with storing relational data in Lucene, but the one you have should be easy to fix.
I guess you tokenize the group fields, and that makes it possible to match substrings within the field value. Just add the field untokenized and it should work as expected.
Please check the following small piece of code:
internal class Program {
    private static void Main(string[] args) {
        var directory = new RAMDirectory();
        var writer = new IndexWriter(directory, new StandardAnalyzer());
        AddDocument(writer, "group", "stuff", Field.Index.UN_TOKENIZED);
        AddDocument(writer, "group", "other stuff", Field.Index.UN_TOKENIZED);
        writer.Close(true);

        var searcher = new IndexSearcher(directory);
        Hits hits = searcher.Search(new TermQuery(new Term("group", "stuff")));
        for (int i = 0; i < hits.Length(); i++) {
            Console.WriteLine(hits.Doc(i).GetField("group").StringValue());
        }
    }

    private static void AddDocument(IndexWriter writer, string name, string value, Field.Index index) {
        var document = new Document();
        document.Add(new Field(name, value, Field.Store.YES, index));
        writer.AddDocument(document);
    }
}
The sample adds two documents with untokenized fields to the index, does a search for "stuff", and gets one hit. If you change the code to add them tokenized, you will get two hits, as you are seeing now.
The issue with using Lucene for relational data is the expectation that wildcard and range searches will always work. That is not really the case if the index is big, due to the way Lucene resolves those queries.
Another sample to illustrate the behavior:
private static void Main(string[] args) {
    var directory = new RAMDirectory();
    var writer = new IndexWriter(directory, new StandardAnalyzer());

    var documentA = new Document();
    documentA.Add(new Field("name", "A", Field.Store.YES, Field.Index.UN_TOKENIZED));
    documentA.Add(new Field("group", "stuff", Field.Store.YES, Field.Index.UN_TOKENIZED));
    documentA.Add(new Field("group", "other stuff", Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.AddDocument(documentA);

    var documentB = new Document();
    documentB.Add(new Field("name", "B", Field.Store.YES, Field.Index.UN_TOKENIZED));
    documentB.Add(new Field("group", "stuff", Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.AddDocument(documentB);

    var documentC = new Document();
    documentC.Add(new Field("name", "C", Field.Store.YES, Field.Index.UN_TOKENIZED));
    documentC.Add(new Field("group", "other stuff", Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.AddDocument(documentC);
    writer.Close(true);

    var query1 = new TermQuery(new Term("group", "stuff"));
    SearchAndDisplay("First sample", directory, query1);

    var query2 = new TermQuery(new Term("group", "other stuff"));
    SearchAndDisplay("Second sample", directory, query2);

    var query3 = new BooleanQuery();
    query3.Add(new TermQuery(new Term("group", "stuff")), BooleanClause.Occur.MUST);
    query3.Add(new TermQuery(new Term("group", "other stuff")), BooleanClause.Occur.MUST);
    SearchAndDisplay("Third sample", directory, query3);
}

private static void SearchAndDisplay(string title, Directory directory, Query query3) {
    var searcher = new IndexSearcher(directory);
    Hits hits = searcher.Search(query3);
    Console.WriteLine(title);
    for (int i = 0; i < hits.Length(); i++) {
        Console.WriteLine(hits.Doc(i).GetField("name").StringValue());
    }
}
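With these three documents, the first sample should print A and B (both have a "stuff" group value), the second sample A and C (both have an "other stuff" value), and the third sample only A, since it is the only document whose group field contains both exact values.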