C# Lucene: get all the indexed terms

I am working on a Windows application using Lucene. I want to get all the indexed keywords and use them as the source for an auto-suggest on the search field. How can I retrieve all the indexed keywords from Lucene? I am fairly new to C#, so code itself is appreciated. Thanks.

Are you looking to extract all terms from the index? Something like this (Lucene.Net 3.x API):
private List<string> GetIndexTerms(string indexFolder)
{
    var termList = new List<string>();
    using (IndexReader reader = IndexReader.Open(FSDirectory.Open(indexFolder), readOnly: true))
    using (TermEnum terms = reader.Terms())
    {
        while (terms.Next())
        {
            Term term = terms.Term;
            string termText = term.Text;
            int frequency = reader.DocFreq(term); // number of documents containing the term
            termList.Add(termText);
        }
    }
    return termList;
}
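To use these terms as an auto-suggest source, have the method return its term list and filter it by the user's prefix. A minimal sketch; `searchBox` and the index path are hypothetical names for your UI control and index location:

```csharp
// Hypothetical wiring: filter the collected terms by the typed prefix.
List<string> allTerms = GetIndexTerms(@"C:\MyIndex");
string prefix = searchBox.Text.ToLowerInvariant();
List<string> suggestions = allTerms
    .Where(t => t.StartsWith(prefix))
    .Take(10)
    .ToList();
```

For large indexes you would cache `allTerms` once rather than re-reading the index on every keystroke.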

For inspiration with Apache Lucene.Net version 4.8, have a look at GitHub msigut/LuceneNet48Demo. It uses SearcherManager, a *QueryParser, and IndexWriter to build and search the index.
// your favorite query parser (MultiFieldQueryParser, for example)
_queryParser = new MultiFieldQueryParser(...
// Execute the search with a fresh IndexSearcher
_searchManager.MaybeRefreshBlocking();
var searcher = _searchManager.Acquire();
try
{
    var q = _queryParser.Parse(query);
    var topDocs = searcher.Search(q, 10);
    foreach (var scoreDoc in topDocs.ScoreDocs)
    {
        var document = searcher.Doc(scoreDoc.Doc);
        var hit = new QueryHit
        {
            Title = document.GetField("title")?.GetStringValue(),
            // ... your logic to read data from the index ...
        };
    }
}
finally
{
    _searchManager.Release(searcher);
    searcher = null;
}

Related

Lucene.Net 4.8.0-beta00016 is throwing an exception when I try to read FastTaxonomyFacetCounts from a file-system Directory

I'm using Lucene.Net version 4.8.0-beta00016 on .NET 6.0.
When I write to a RAMDirectory I'm able to fetch FastTaxonomyFacetCounts, but if I try to fetch from a Directory (file system) it throws "Index Corrupted. Missing parent data for category 0."
Below is the code where I'm facing the issue. IndexDirectory and TaxoDirectory are the physical file paths where the Lucene indexes are generated.
using (DirectoryReader indexReader = DirectoryReader.Open(IndexDirectory))
using (DirectoryTaxonomyReader taxoReader = new DirectoryTaxonomyReader(TaxoDirectory))
{
    IndexSearcher searcher = new IndexSearcher(indexReader);
    FacetsCollector fc = new FacetsCollector();
    Query q = new WildcardQuery(new Term("Brand", "*Ji*"));
    TopScoreDocCollector tdc = TopScoreDocCollector.Create(10, true);
    var topDocs = FacetsCollector.Search(searcher, q, 10, fc);
    var topHits = topDocs.ScoreDocs;
    var hits = searcher.Search(q, 10, Sort.INDEXORDER).ScoreDocs;
    if (hits != null)
    {
        foreach (var hit in hits)
        {
            var document = searcher.Doc(hit.Doc);
        }
    }
    Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
    var result = facets.GetAllDims(1000);
}
It should return the facets. If anyone has a solution, please guide me.
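One thing worth checking (an assumption; the question doesn't show the indexing side): "Missing parent data for category 0" can occur when the on-disk taxonomy index was never committed, so the taxonomy directory holds no parent data at all. A sketch of the writing side, where `analyzer` and the single "Brand" document are placeholders:

```csharp
// Sketch: both the main index and the taxonomy index must be committed
// before DirectoryTaxonomyReader can open the taxonomy from disk.
var config = new FacetsConfig();
using (var writer = new IndexWriter(IndexDirectory,
    new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)))
using (var taxoWriter = new DirectoryTaxonomyWriter(TaxoDirectory))
{
    var doc = new Document();
    doc.Add(new TextField("Brand", "Jim's", Field.Store.YES));
    doc.Add(new FacetField("Brand", "Jim's"));
    writer.AddDocument(config.Build(taxoWriter, doc)); // Build wires the facet into the taxonomy
    writer.Commit();
    taxoWriter.Commit(); // without this commit the taxonomy on disk stays empty
}
```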

Lucene query in C# not finding results with punctuation

I have a search bar that executes a Lucene query on the "description" field, but it doesn't return results when the query contains apostrophes. For example, I have a product whose description is Herter's® EZ-Load 200lb Feeder - 99018. When I search for "Herter", I get results, but I get no results if I search for "Herter's" or "Herters". This is my search code:
var query = Request.QueryString["q"];
var search = HttpContext.Current.Server.UrlDecode(query);
var rewardProductLookup = new RewardCatalogDataHelper();
RewardProductSearchCriteria criteria = new RewardProductSearchCriteria()
{
    keywords = search,
    pageSize = 1000,
    sortDirection = "desc"
};
IEnumerable<SkinnyItem> foundProducts = rewardProductLookup.FindByKeywordQuery(criteria);

public IEnumerable<SkinnyItem> FindByKeywordQuery(RewardProductSearchCriteria query)
{
    var luceneIndexDataContext = new LuceneDataContext("rewardproducts", _dbName);
    string fieldToQuery = "rpdescription";
    bool sortDirection = query.sortDirection.ToLower().Equals("desc");
    MultiPhraseQuery multiPhraseQuery = new MultiPhraseQuery();
    var keywords = query.keywords.ToLower().Split(',');
    foreach (var keyword in keywords)
    {
        if (!String.IsNullOrEmpty(keyword))
        {
            var term = new Term(fieldToQuery, keyword);
            multiPhraseQuery.Add(term);
        }
    }
    var booleanQuery = new BooleanQuery();
    booleanQuery.Add(multiPhraseQuery, BooleanClause.Occur.MUST);
    return
        luceneIndexDataContext.BooleanQuerySearch(booleanQuery, fieldToQuery, sortDirection)
            .Where(i => i.Fields["eligibleforpurchase"] == "1");
}
The problem here is analysis. You haven't specified the analyzer being used in this case, so I'll assume it's StandardAnalyzer.
When analyzed, the term "Herter's" is reduced to "herter". However, no analyzer is applied in your FindByKeywordQuery method, so searching for "herter" works, but "herter's" doesn't.
One solution would be to use the QueryParser instead of manually constructing a MultiPhraseQuery. The QueryParser handles tokenizing, lowercasing, and so on. Something like:
QueryParser parser = new QueryParser(VERSION, "rpdescription", new StandardAnalyzer(VERSION));
Query parsed = parser.Parse("\"" + query.keywords + "\"");
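Filled out with a concrete version constant, that might look like the sketch below; the LUCENE_30 constant and the analyzer choice are assumptions about your setup:

```csharp
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "rpdescription", analyzer);
// The parser analyzes the input with the same analyzer used at index
// time, so "Herter's" is reduced to the indexed token "herter" and matches.
Query parsed = parser.Parse(QueryParser.Escape(query.keywords));
```

QueryParser.Escape guards against user input that contains query-syntax characters such as * or ".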
The single quote is the delimiter for text fields in a query.
Select * FROM Product where Description = 'foo'
You will need to escape or double any single quote in your query. Try this in the loop:
foreach (var keyword in keywords)
{
    if (!String.IsNullOrEmpty(keyword))
    {
        var term = new Term(fieldToQuery, keyword.Replace("'", "''"));
        multiPhraseQuery.Add(term);
    }
}
You could also create an extension method:
[DebuggerStepThrough]
public static string SanitizeSQL(this string value)
{
    return value.Replace("'", "''").Replace("\\", "\\\\");
}
in which case you could do this in the loop:
foreach (var keyword in keywords)
{
    if (!String.IsNullOrEmpty(keyword))
    {
        var term = new Term(fieldToQuery, keyword.SanitizeSQL());
        multiPhraseQuery.Add(term);
    }
}
Hope this helps.

C# MongoDB Driver OutOfMemoryException

I am trying to read data from a remote MongoDB instance from a C# console application, but I keep getting an OutOfMemoryException. The collection I am trying to read from has about 500,000 records. Does anyone see any issue with the code below?
var mongoCred = MongoCredential.CreateMongoCRCredential("xdb", "x", "x");
var mongoClientSettings = new MongoClientSettings
{
    Credentials = new[] { mongoCred },
    Server = new MongoServerAddress("x-x.mongolab.com", 12345),
};
var mongoClient = new MongoClient(mongoClientSettings);
var mongoDb = mongoClient.GetDatabase("xdb");
var mongoCol = mongoDb.GetCollection<BsonDocument>("Persons");
var list = await mongoCol.Find(new BsonDocument()).ToListAsync();
This is a simple workaround: you can page your results using .Limit(int?) and .Skip(int?). First, store the number of documents in your collection in totNum using
coll.Count(new BsonDocument()); // use the same filter you will apply in the next Find()
and then
for (int _i = 0; _i < totNum / 1000 + 1; _i++)
{
    var result = coll.Find(new BsonDocument()).Limit(1000).Skip(_i * 1000).ToList();
    foreach (var item in result)
    {
        // write your document to the CSV file
    }
}
I hope this can help.
P.S. I used 1000 in .Skip() and .Limit(), but obviously you can use whatever batch size you want :-)
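Alternatively, instead of paging with Skip/Limit, the driver can stream the collection through a cursor so the full result set is never materialized in memory. A sketch using the same `mongoCol` handle from the question:

```csharp
// Stream documents batch by batch instead of calling ToListAsync(),
// which loads all ~500,000 documents into a single list.
using (var cursor = await mongoCol.Find(new BsonDocument()).ToCursorAsync())
{
    while (await cursor.MoveNextAsync())
    {
        foreach (var doc in cursor.Current) // one server batch at a time
        {
            // process doc
        }
    }
}
```

Skip-based paging also gets slower as the skip offset grows, so cursor iteration tends to be both faster and lighter on memory for full-collection scans.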

How to convert TermCollection to Tree object via CSOM

I'm querying a SharePoint 2013 Term Store via the SharePoint Client Object Model in order to get a TermCollection.
I'd like to bind the results to a WPF TreeView control. Any idea how I can turn the TermCollection into something that the TreeView will understand?
public static TermCollection GetTaxonomyTerms(string webUrl, string libraryTitle, string fieldTitle)
{
    var context = new ClientContext(webUrl);
    var web = context.Web;
    var list = web.Lists.GetByTitle(libraryTitle);
    var fields = list.Fields;
    var field = context.CastTo<TaxonomyField>(fields.GetByInternalNameOrTitle(fieldTitle));
    context.Load(field);
    var termStores = TaxonomySession.GetTaxonomySession(context).TermStores;
    context.Load(termStores);
    context.ExecuteQuery(); // TODO: Can this ExecuteQuery be avoided by using a LoadQuery statement?
    var termStore = termStores.Where(t => t.Id == field.SspId).FirstOrDefault();
    var termSet = termStore.GetTermSet(field.TermSetId);
    var terms = termSet.GetAllTerms(); // TODO: Do we need a version that returns a paged set of terms, or queries the server again when a node is expanded?
    context.Load(terms);
    context.ExecuteQuery();
    return terms;
}
I ended up writing my own code (please let me know if there's an easier way to do this).
My 'Term' object below is just a simple POCO with Name and Terms.
var terms = SharePointHelper.GetTaxonomyTerms(webUrl, libraryTitle, fieldTitle);
var term = terms.AsRootTreeViewTerm();
....
}
public static Term AsRootTreeViewTerm(this SP.TermCollection spTerms)
{
    var root = new Term();
    foreach (SP.Term spTerm in spTerms)
    {
        List<string> names = spTerm.PathOfTerm.Split(';').ToList();
        var term = BuildTerm(root.Terms, names);
        if (!root.Terms.Contains(term))
            root.Terms.Add(term);
    }
    return root;
}

static Term BuildTerm(IList<Term> terms, List<string> names)
{
    Term term = terms.Where(x => x.Name == names.First())
        .DefaultIfEmpty(new Term() { Name = names.First() })
        .First();
    names.Remove(names.First());
    if (names.Count > 0)
    {
        Term child = BuildTerm(term.Terms, names);
        if (!term.Terms.Contains(child))
            term.Terms.Add(child);
    }
    return term;
}
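For reference, the 'Term' POCO mentioned above is assumed to look roughly like this:

```csharp
public class Term
{
    public string Name { get; set; }
    public IList<Term> Terms { get; } = new List<Term>();
}
```

Note that BuildTerm relies on reference equality in Contains, which works because existing child nodes are reused rather than recreated.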

Getting terms matched in a document when searching using a wildcard search

I am looking for a way to find the terms that matched in a document when using a wildcard search in Lucene. I tried using Explain to find the terms, but this failed. A portion of the relevant code is below.
ScoreDoc[] myHits = myTopDocs.scoreDocs;
int hitsCount = myHits.Length;
for (int myCounter = 0; myCounter < hitsCount; myCounter++)
{
    Document doc = searcher.Doc(myHits[myCounter].doc);
    Explanation explanation = searcher.Explain(myQuery, myCounter);
    string myExplanation = explanation.ToString();
    ...
When I do a search on, say, micro*, documents are found and the loop is entered, but myExplanation contains NON-MATCH and no other information.
How do I get the term that was found in each document?
Any help would be most appreciated.
Regards
class TVM : TermVectorMapper
{
    public List<string> FoundTerms = new List<string>();
    HashSet<string> _termTexts = new HashSet<string>();

    public TVM(Query q, IndexReader r) : base()
    {
        List<Term> allTerms = new List<Term>();
        q.Rewrite(r).ExtractTerms(allTerms);
        foreach (Term t in allTerms) _termTexts.Add(t.Text());
    }

    public override void SetExpectations(string field, int numTerms, bool storeOffsets, bool storePositions)
    {
    }

    public override void Map(string term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions)
    {
        if (_termTexts.Contains(term)) FoundTerms.Add(term);
    }
}
void TermVectorMapperTest()
{
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new Lucene.Net.Analysis.Standard.StandardAnalyzer(), true);
    Document d;
    d = new Document();
    d.Add(new Field("text", "microscope aaa", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
    writer.AddDocument(d);
    d = new Document();
    d.Add(new Field("text", "microsoft bbb", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
    writer.AddDocument(d);
    writer.Close();

    IndexReader reader = IndexReader.Open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    QueryParser queryParser = new QueryParser("text", new Lucene.Net.Analysis.Standard.StandardAnalyzer());
    queryParser.SetMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
    Query query = queryParser.Parse("micro*");
    TopDocs results = searcher.Search(query, 5);
    System.Diagnostics.Debug.Assert(results.TotalHits == 2);

    TVM tvm = new TVM(query, reader);
    for (int i = 0; i < results.ScoreDocs.Length; i++)
    {
        Console.Write("DOCID:" + results.ScoreDocs[i].Doc + " > ");
        reader.GetTermFreqVector(results.ScoreDocs[i].Doc, "text", tvm);
        foreach (string term in tvm.FoundTerms) Console.Write(term + " ");
        tvm.FoundTerms.Clear();
        Console.WriteLine();
    }
}
One way is to use the Highlighter; another would be to mimic what the Highlighter does by rewriting your query via myQuery.Rewrite() with an appropriate rewrite method; this is probably closer in spirit to what you were trying. Rewriting turns the wildcard query into a BooleanQuery containing all the matching terms, and you can extract the words from those fairly easily. Is that enough to get you going?
Here's the idea I had in mind; sorry about the confusion regarding rewriting queries; it's not really relevant here.
TokenStream tokens = TokenSources.GetAnyTokenStream(reader, docId, "text", analyzer);
var termAtt = tokens.AddAttribute<ITermAttribute>();
while (tokens.IncrementToken())
{
    // do something with termAtt.Term, which holds the matched term
}
