Getting terms matched in a document when searching using a wildcard search - c#

I am looking for a way to find which terms matched in a document when using a wildcard search in Lucene. I tried using the explainer to find the terms, but this failed. A portion of the relevant code is below.
ScoreDoc[] myHits = myTopDocs.scoreDocs;
int hitsCount = myHits.Length;
for (int myCounter = 0; myCounter < hitsCount; myCounter++)
{
    Document doc = searcher.Doc(myHits[myCounter].doc);
    Explanation explanation = searcher.Explain(myQuery, myCounter);
    string myExplanation = explanation.ToString();
    ...
When I search for, say, micro*, documents are found and the loop is entered, but myExplanation contains NON-MATCH and no other information.
How do I get the term that was found in this document?
Any help would be most appreciated.
Regards

class TVM : TermVectorMapper
{
    public List<string> FoundTerms = new List<string>();
    HashSet<string> _termTexts = new HashSet<string>();
    public TVM(Query q, IndexReader r) : base()
    {
        List<Term> allTerms = new List<Term>();
        q.Rewrite(r).ExtractTerms(allTerms);
        foreach (Term t in allTerms) _termTexts.Add(t.Text());
    }
    public override void SetExpectations(string field, int numTerms, bool storeOffsets, bool storePositions)
    {
    }
    public override void Map(string term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions)
    {
        if (_termTexts.Contains(term)) FoundTerms.Add(term);
    }
}
void TermVectorMapperTest()
{
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new Lucene.Net.Analysis.Standard.StandardAnalyzer(), true);
    Document d = null;
    d = new Document();
    d.Add(new Field("text", "microscope aaa", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
    writer.AddDocument(d);
    d = new Document();
    d.Add(new Field("text", "microsoft bbb", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
    writer.AddDocument(d);
    writer.Close();
    IndexReader reader = IndexReader.Open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    QueryParser queryParser = new QueryParser("text", new Lucene.Net.Analysis.Standard.StandardAnalyzer());
    queryParser.SetMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
    Query query = queryParser.Parse("micro*");
    TopDocs results = searcher.Search(query, 5);
    System.Diagnostics.Debug.Assert(results.TotalHits == 2);
    TVM tvm = new TVM(query, reader);
    for (int i = 0; i < results.ScoreDocs.Length; i++)
    {
        Console.Write("DOCID:" + results.ScoreDocs[i].Doc + " > ");
        reader.GetTermFreqVector(results.ScoreDocs[i].Doc, "text", tvm);
        foreach (string term in tvm.FoundTerms) Console.Write(term + " ");
        tvm.FoundTerms.Clear();
        Console.WriteLine();
    }
}

One way is to use the Highlighter. Another way is to mimic what the Highlighter does: rewrite your query by calling myQuery.rewrite() with an appropriate rewriter, which is probably closer in spirit to what you were trying. This rewrites the query to a BooleanQuery containing all the matching Terms, and you can get the words out of those pretty easily. Is that enough to get you going?
Here's the idea I had in mind; sorry about the confusion regarding rewriting queries, which is not really relevant here.
// reader, docId, field and analyzer are your own IndexReader, document id, field name and Analyzer
TokenStream tokens = TokenSources.getAnyTokenStream(reader, docId, field, analyzer);
CharTermAttribute termAtt = tokens.addAttribute(CharTermAttribute.class);
while (tokens.incrementToken()) {
    // do something with termAtt, which holds the current term of the document
}
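To make the rewrite idea from the answers above concrete without tying it to a particular Lucene version, here is a minimal sketch in plain C#. It calls no Lucene API at all: the class and method names (WildcardTermSketch, ExpandPrefix, MatchedTerms) are hypothetical, the index vocabulary and the per-document terms are passed in as ordinary collections, and only a trailing * is handled. It only illustrates the two steps: expand the wildcard into the concrete terms present in the index, then keep the ones that occur in a given document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WildcardTermSketch
{
    // Step 1: expand a trailing-wildcard pattern such as "micro*" into every
    // term of the (assumed) index vocabulary that starts with the prefix.
    public static List<string> ExpandPrefix(string pattern, IEnumerable<string> indexTerms)
    {
        string prefix = pattern.TrimEnd('*');
        return indexTerms
            .Where(t => t.StartsWith(prefix, StringComparison.Ordinal))
            .ToList();
    }

    // Step 2: keep only the expanded terms that occur in one document's terms
    // (in Lucene these would come from the document's term vector).
    public static List<string> MatchedTerms(IEnumerable<string> expandedTerms, IEnumerable<string> documentTerms)
    {
        var inDocument = new HashSet<string>(documentTerms);
        return expandedTerms.Where(t => inDocument.Contains(t)).ToList();
    }
}
```

With the vocabulary of the two sample documents above (microscope, aaa, microsoft, bbb), ExpandPrefix("micro*", ...) yields microscope and microsoft, and MatchedTerms against the terms of the first document leaves only microscope.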

Related

how to process hits on lucene 3.03

List<SearchResults> Searchresults = new List<SearchResults>();
// Specify the location where the index files are stored
string indexFileLocation = @"D:\Lucene.Net\Data\Persons";
Lucene.Net.Store.Directory dir = FSDirectory.Open(indexFileLocation);
// specify the search fields, lucene search in multiple fields
string[] searchfields = new string[] { "FirstName", "LastName", "DesigName", "CatagoryName" };
IndexSearcher indexSearcher = new IndexSearcher(dir);
// Making a boolean query for searching and get the searched hits
Query som = QueryMaker(searchString, searchfields);
int n = 1000;
TopDocs hits = indexSearcher.Search(som,null,n);
for (int i = 0; i <hits.TotalHits; i++)
{
SearchResults result = new SearchResults();
result.FirstName = hits.ScoreDocs.GetValue(i).ToString();
result.FirstName = hits.Doc.GetField("FirstName").StringValue();
result.LastName = hits.Doc(i).GetField("LastName").StringValue();
result.DesigName = hits.Doc(i).GetField("DesigName").StringValue();
result.Addres = hits.Doc(i).GetField("Addres").StringValue();
result.CatagoryName = hits.Doc(i).GetField("CatagoryName").StringValue();
Searchresults.Add(result);
}
I have table fields such as first name, last name, and so on. How can I process the hits to get the values from the search result?
I get an error that says TopDocs does not contain a definition for Doc.
Lean on the compiler: there is no property or method called Doc in the TopDocs class. The ScoreDocs property of TopDocs holds the list of hits, each with a document number and a score. Use that document number with the Doc method of IndexSearcher to fetch the actual document, and then read the stored field data from that document.
You can process the results like this:
foreach (var scoreDoc in hits.ScoreDocs)
{
    var result = new SearchResults();
    var doc = indexSearcher.Doc(scoreDoc.Doc);
    result.FirstName = doc.GetField("FirstName").StringValue;
    result.LastName = doc.GetField("LastName").StringValue;
    result.DesigName = doc.GetField("DesigName").StringValue;
    result.Addres = doc.GetField("Addres").StringValue;
    // field and property spelled as in the question ("CatagoryName")
    result.CatagoryName = doc.GetField("CatagoryName").StringValue;
    Searchresults.Add(result);
}
Or, in a more LINQ-style way:
var searchResults =
    indexSearcher
        .Search(som, null, n)
        .ScoreDocs
        // Doc expects the document number, not the ScoreDoc itself
        .Select(scoreDoc => indexSearcher.Doc(scoreDoc.Doc))
        .Select(doc =>
        {
            var result = new SearchResults();
            result.FirstName = doc.GetField("FirstName").StringValue;
            result.LastName = doc.GetField("LastName").StringValue;
            result.DesigName = doc.GetField("DesigName").StringValue;
            result.Addres = doc.GetField("Addres").StringValue;
            result.CatagoryName = doc.GetField("CatagoryName").StringValue;
            return result;
        })
        .ToList();
Separating the hit processing into its own method keeps the search code clean, and if you later want to highlight the matched documents, you can easily plug the Lucene.Net highlighter into the getMatchedHits method.
List<SearchResults> Searchresults = new List<SearchResults>();
// Specify the location where the index files are stored
string indexFileLocation = @"D:\Lucene.Net\Data\Persons";
Lucene.Net.Store.Directory dir = FSDirectory.Open(indexFileLocation);
// specify the search fields, lucene search in multiple fields
string[] searchfields = new string[] { "FirstName", "LastName", "DesigName", "CatagoryName" };
IndexSearcher indexSearcher = new IndexSearcher(dir);
// Making a boolean query for searching and get the searched hits
Query som = QueryMaker(searchString, searchfields);
int n = 1000;
var hits = indexSearcher.Search(som,null,n).ScoreDocs;
Searchresults = getMatchedHits(hits,indexSearcher);
getMatchedHits method code:
public static List<SearchResults> getMatchedHits(ScoreDoc[] hits, IndexSearcher searcher)
{
    List<SearchResults> list = new List<SearchResults>();
    SearchResults obj;
    try
    {
        for (int i = 0; i < hits.Length; i++)
        {
            // get the document from index
            Document doc = searcher.Doc(hits[i].Doc);
            string strFirstName = doc.Get("FirstName");
            string strLastName = doc.Get("LastName");
            string strDesigName = doc.Get("DesigName");
            string strAddres = doc.Get("Addres");
            // field name spelled as in the question ("CatagoryName")
            string strCatagoryName = doc.Get("CatagoryName");
            obj = new SearchResults();
            obj.FirstName = strFirstName;
            obj.LastName = strLastName;
            obj.DesigName = strDesigName;
            obj.Addres = strAddres;
            obj.CatagoryName = strCatagoryName;
            list.Add(obj);
        }
        return list;
    }
    catch (Exception)
    {
        return null; // or throw exception
    }
}
Hope it Helps!

Field Boosting Doesn't Work/Effect Lucene.net

I'm trying to set boosts on document fields to make the search results more accurate, but as far as I can see it has no effect.
Here is my code.
Indexing:
private static void _addToLuceneIndex(Datafile Datafile, IndexWriter writer)
{
// remove older index entry
var searchQuery = new TermQuery(new Term("Id", Datafile.article.Id.ToString()));
writer.DeleteDocuments(searchQuery);
// add new index entry
var doc = new Document();
var id = new Field("Id", Datafile.article.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED);
var content = new Field("Content", Datafile.article.Content, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS);
content.Boost = 4;
var title = new Field("Title", Datafile.article.Title, Field.Store.YES, Field.Index.ANALYZED);
title.Boost = 6;
doc.Add(id);
doc.Add(content);
doc.Add(title);
foreach (var item in Datafile.article.Article_Tag)
{
var tmpta = new Field("Atid", item.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED);
var tagname = new Field("Tagname", item.Tag.name, Field.Store.YES, Field.Index.ANALYZED);
tagname.Boost = 8;
doc.Add(tmpta);
doc.Add(tagname);
}
// add lucene fields mapped to db fields
// add entry to index
writer.AddDocument(doc);
}
I used Luke.Net to check whether the fields were boosted, but they were not: the boost still equals 1.0.
I ran and tested it anyway, but the results were disappointing.
Here is my search code.
Searching:
private static IEnumerable<Datafile> _search(string searchQuery, string searchField = "")
{
// validation
if (string.IsNullOrEmpty(searchQuery.Replace("*", "").Replace("?", "")))
return new List<Datafile>();
var indexReader = IndexReader.Open(Directory, false);
// set up lucene searcher
using (var searcher = new IndexSearcher(indexReader))
{
var hits_limit = 1000;
// search by single field
var enanalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
var aranalyzer = new SnowballAnalyzer(Version.LUCENE_30, "Arabic");
string[] fields = new string[] { "Title", "Content", "Tagname" };
// Dictionary<string, float> boosts = new Dictionary<string, float>();
// boosts.Add("Title", 5);
// boosts.Add("Content", 3);
// boosts.Add("Tagname", 7);
var enparser = new MultiFieldQueryParser(Version.LUCENE_30, fields, enanalyzer);
var arparser = new MultiFieldQueryParser(Version.LUCENE_30, fields, aranalyzer);
var query = QueryModel(searchQuery, new QueryParser[] { enparser, arparser });
searcher.SetDefaultFieldSortScoring(true, false);
TopFieldCollector collector = TopFieldCollector.Create(new Sort(new SortField(null, SortField.SCORE, false), new SortField("Title", SortField.STRING, true), new SortField("Tagname", SortField.STRING, true), new SortField("Content", SortField.STRING, true)),
hits_limit,
false, // fillFields - not needed, we want score and doc only
true, // trackDocScores - need doc and score fields
true, // trackMaxScore - related to trackDocScores
false); // should docs be in docId order?
searcher.Search(query, collector);
var hits = collector.TopDocs().ScoreDocs;
var results = new List<Datafile>();
foreach (var hit in hits)
{
var doc = searcher.Doc(hit.Doc);
var df = _mapLuceneDocumentToData(doc);
df.score = hit.Score;
results.Add(df);
}
searcher.Dispose();
return results;
// search by multiple fields (ordered by RELEVANCE)
}
}
QueryModel Method:
private static Query QueryModel(string searchQuery, QueryParser[] parsers)
{
BooleanQuery query = new BooleanQuery();
searchQuery = "*" + searchQuery + "*";
foreach (var parser in parsers)
{
parser.AllowLeadingWildcard = true;
var thequery = parser.Parse(searchQuery);
query.Add(new BooleanClause(thequery, Occur.SHOULD));
}
return query;
}
I'm new to Lucene.Net and I love it, but I can't get my head around this problem.
PS:
I would also like fuzzy matching, so that when the user enters:
city in russua
they get the same results as if they had entered:
city in russia
I tried the FuzzyQuery class, but it didn't work. Is it necessary to use the FuzzyQuery class to get that result, or not?
Since no one answered my question and I have found a solution for this issue, here it is. I used search-time query boosting, and here is my code:
var QParser = new QueryParser(Version.LUCENE_30, "Content", analyzer);
QParser.AllowLeadingWildcard = true;
var Query = QParser.Parse(searchQuery);
Query.Boost = 7.0f;
return Query;
You can use a BooleanQuery if you want to do an OR or AND search.

How to remove plurals in Lucene.NET?

I'm trying to extract some keywords from a text. It works quite well, but I need to remove plurals.
As I'm already using Lucene for search purposes, I'm trying to use it to extract keywords from the indexed terms.
First, I index the document in a RAMDirectory index:
RAMDirectory idx = new RAMDirectory();
using (IndexWriter writer =
new IndexWriter(
idx,
new CustomStandardAnalyzer(StopWords.Get(this.Language),
Lucene.Net.Util.Version.LUCENE_30, this.Language),
IndexWriter.MaxFieldLength.LIMITED))
{
writer.AddDocument(createDocument(this._text));
writer.Optimize();
}
Then, I extract the keywords:
var list = new List<KeyValuePair<int, string>>();
using (var reader = IndexReader.Open(directory, true))
{
var tv = reader.GetTermFreqVector(0, "text");
if (tv != null)
{
string[] terms = tv.GetTerms();
int[] freq = tv.GetTermFrequencies();
for (int i = 0; i < terms.Length; i++)
list.Add(new KeyValuePair<int, string>(freq[i], terms[i]));
}
}
In the list of terms I can have terms like "president" and "presidents".
How could I remove the plural forms?
My CustomStandardAnalyzer uses this:
public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
{
//create the tokenizer
TokenStream result = new StandardTokenizer(this.version, reader);
//add in filters
result = new Lucene.Net.Analysis.Snowball.SnowballFilter(result, this.getStemmer());
result = new LowerCaseFilter(result);
result = new ASCIIFoldingFilter(result);
result = new StopFilter(true, result, this.stopWords ?? StopWords.English);
return result;
}
So I already use the SnowballFilter (with the correct language specific stemmer).
How could I remove plurals?
My output from the following program is:
text:and
text:presid
text:some
text:text
text:with
class Program
{
private class CustomStandardAnalyzer : Analyzer
{
public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
{
//create the tokenizer
TokenStream result = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_30, reader);
//add in filters
result = new Lucene.Net.Analysis.Snowball.SnowballFilter(result, new EnglishStemmer());
result = new LowerCaseFilter(result);
result = new ASCIIFoldingFilter(result);
result = new StopFilter(true, result, new HashSet<string>());
return result;
}
}
private static Document createDocument(string text)
{
Document d = new Document();
Field f = new Field("text", "", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
f.SetValue(text);
d.Add(f);
return d;
}
static void Main(string[] args)
{
RAMDirectory idx = new RAMDirectory();
using (IndexWriter writer =
new IndexWriter(
idx,
new CustomStandardAnalyzer(),
IndexWriter.MaxFieldLength.LIMITED))
{
writer.AddDocument(createDocument("some text with president and presidents"));
writer.Commit();
}
using (var reader = IndexReader.Open(idx, true))
{
var terms = reader.Terms(new Term("text", ""));
if (terms.Term != null)
do
Console.WriteLine(terms.Term);
while (terms.Next());
}
Console.ReadLine();
}
}

C# Lucene get all the index

I am working on a Windows application using Lucene. I want to get all the indexed keywords and use them as the source for an auto-suggest on the search field. How can I retrieve all the indexed keywords in Lucene? I am fairly new to C#. Code itself is appreciated. Thanks.
Are you looking to extract all terms from the index?
// Assumes Lucene.Net 3.0.x; member names differ in other versions.
private List<string> GetIndexTerms(string indexFolder)
{
    var termList = new List<string>();
    using (IndexReader reader = IndexReader.Open(FSDirectory.Open(indexFolder), true))
    {
        TermEnum terms = reader.Terms();
        while (terms.Next())
        {
            Term term = terms.Term;
            string termText = term.Text;
            int frequency = reader.DocFreq(term); // number of documents containing the term
            termList.Add(termText);
        }
    }
    return termList;
}
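Once the terms are in a list, the auto-suggest itself can be a prefix lookup. The following is a minimal sketch in plain C#, not part of Lucene: SuggestSketch and Suggest are hypothetical names, and the method assumes the list has already been sorted with StringComparer.Ordinal. It binary-searches for the first term that is not smaller than the typed prefix and then walks forward while the terms still start with that prefix.

```csharp
using System;
using System.Collections.Generic;

public static class SuggestSketch
{
    // sortedTerms must be sorted with StringComparer.Ordinal (assumption of this sketch).
    public static List<string> Suggest(List<string> sortedTerms, string prefix, int max)
    {
        // Binary search for the first term that is >= prefix in ordinal order.
        int lo = 0, hi = sortedTerms.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(sortedTerms[mid], prefix) < 0) lo = mid + 1;
            else hi = mid;
        }

        // Terms sharing the prefix are contiguous from that position onwards.
        var result = new List<string>();
        for (int i = lo; i < sortedTerms.Count && result.Count < max; i++)
        {
            if (!sortedTerms[i].StartsWith(prefix, StringComparison.Ordinal)) break;
            result.Add(sortedTerms[i]);
        }
        return result;
    }
}
```

A typical call would sort the list once after extraction, for example termList.Sort(StringComparer.Ordinal), and then call SuggestSketch.Suggest(termList, typedText, 10) on each keystroke.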
For inspiration with Apache Lucene.Net version 4.8, you can look at the GitHub repository msigut/LuceneNet48Demo. Use the classes SearcherManager, *QueryParser, and IndexWriter to build the index.
// your favorite query parser (MultiFieldQueryParser, for example)
_queryParser = new MultiFieldQueryParser(...
// Execute the search with a fresh indexSearcher
_searchManager.MaybeRefreshBlocking();
var searcher = _searchManager.Acquire();
try
{
var q = _queryParser.Parse(query);
var topDocs = searcher.Search(q, 10);
foreach (var scoreDoc in topDocs.ScoreDocs)
{
var document = searcher.Doc(scoreDoc.Doc);
var hit = new QueryHit
{
Title = document.GetField("title")?.GetStringValue(),
// ... your logic to read data from the index ...
};
}
}
finally
{
_searchManager.Release(searcher);
searcher = null;
}

Lucene returns same exact search results no matter the search term

Here is my code
term = Server.UrlDecode(term);
string indexFileLocation = "C:\\lucene\\Index\\post";
Lucene.Net.Store.Directory dir =
Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, false);
//create an index searcher that will perform the search
Lucene.Net.Search.IndexSearcher searcher = new
Lucene.Net.Search.IndexSearcher(dir);
//build a query object
Lucene.Net.Index.Term searchTerm =
new Lucene.Net.Index.Term("post_title", term);
Lucene.Net.Analysis.Standard.StandardAnalyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();
Lucene.Net.QueryParsers.QueryParser queryParser = new
Lucene.Net.QueryParsers.QueryParser("post_title", analyzer);
Lucene.Net.Search.Query query = queryParser.Parse(term);
//execute the query
Lucene.Net.Search.Hits hits = searcher.Search(query);
List<string> s = new List<string>();
for (int i = 0; i < hits.Length(); i++)
{
Lucene.Net.Documents.Document doc = hits.Doc(i);
s.Add(doc.Get("post_title_raw"));
}
ViewData["s"] = s;
Here is my indexing code:
//create post lucene index
LuceneType lt = new LuceneType();
lt.Analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();
lt.Writer = new Lucene.Net.Index.IndexWriter("C:/lucene/Index/post", lt.Analyzer, true);
using (var context = new MvcApplication1.Entity.test2Entities())
{
var posts = from p in context.post
where object.Equals(p.post_parentid, null) && p.post_isdeleted == false
let Answers = from a in context.post
where a.post_parentid == p.post_id
select new
{
a.post_description
}
let Comments = from c in context.comment
where c.post.post_id == p.post_id
select new
{
c.comment_text
}
select new
{
p,
Answers,
Comments
};
foreach (var post in posts)
{
//lets concate all the answers and comments
StringBuilder answersSB = new StringBuilder();
StringBuilder CommentsSB = new StringBuilder();
foreach (var answer in post.Answers)
answersSB.Append(answer.post_description);
foreach (var comment in post.Comments)
CommentsSB.Append(comment.comment_text);
//add rows
lt.Doc.Add(new Lucene.Net.Documents.Field(
"post_id",
post.p.post_id.ToString(),
Lucene.Net.Documents.Field.Store.YES,
Lucene.Net.Documents.Field.Index.UN_TOKENIZED
));
lt.Doc.Add(new Lucene.Net.Documents.Field(
"post_title",
new System.IO.StringReader(post.p.post_title)));
lt.Doc.Add(new Lucene.Net.Documents.Field(
"post_title_raw",
post.p.post_title,
Lucene.Net.Documents.Field.Store.YES,
Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
lt.Doc.Add(new Lucene.Net.Documents.Field(
"post_titleslug",
post.p.post_titleslug,
Lucene.Net.Documents.Field.Store.YES,
Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
lt.Doc.Add(new Lucene.Net.Documents.Field(
"post_tagtext",
new System.IO.StringReader(post.p.post_tagtext)));
lt.Doc.Add(new Lucene.Net.Documents.Field(
"post_tagtext",
post.p.post_tagtext,
Lucene.Net.Documents.Field.Store.YES,
Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
lt.Doc.Add(new Lucene.Net.Documents.Field(
"post_description",
new System.IO.StringReader(post.p.post_description)));
lt.Doc.Add(new Lucene.Net.Documents.Field(
"post_description_raw",
post.p.post_description,
Lucene.Net.Documents.Field.Store.YES,
Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
lt.Doc.Add(new Lucene.Net.Documents.Field(
"post_Answers",
new System.IO.StringReader(answersSB.ToString())));
lt.Doc.Add(new Lucene.Net.Documents.Field(
"post_Comments",
new System.IO.StringReader(CommentsSB.ToString())));
}
lt.Writer.AddDocument(lt.Doc);
lt.Writer.Optimize();
lt.Writer.Close();
Why does this return the same results for any search term?
Lucene.Net.Search.Query query = queryParser.Parse(term);
In the line above, you used term instead of searchterm.
Your code should look like this:
Lucene.Net.Search.Query query = queryParser.Parse(searchterm);
Alternatively, you can make a small alteration like this:
//build a query object
Lucene.Net.Index.Term searchTerm =
new Lucene.Net.Index.Term("post_title", term);
TermQuery tq = new TermQuery(searchTerm);
......
......
Lucene.Net.Search.Query query = tq;
Now there is no need for the parser.
If you still need the parser, you can change the line above to:
Lucene.Net.Search.Query query = queryParser.Parse(tq.ToString());
Hope this helps.
Not a direct answer, but get Luke (it works with .NET indexes too) and open your index. Try running the query in its search tool using the right type of analyzer. If that works, you know the problem is in your querying code. If it doesn't, the problem could be in the indexing, the querying, or both, but at least this ought to get you on the right track.
