How to remove plurals in Lucene.NET? - c#

I'm trying to extract some keywords from a text. It works quite well, but I need to remove plurals.
As I'm already using Lucene for searching purposes, I'm trying to use it to extract keywords from the indexed terms.
First, I index the document in a RAMDirectory index:
RAMDirectory idx = new RAMDirectory();
using (IndexWriter writer =
    new IndexWriter(
        idx,
        new CustomStandardAnalyzer(StopWords.Get(this.Language),
            Lucene.Net.Util.Version.LUCENE_30, this.Language),
        IndexWriter.MaxFieldLength.LIMITED))
{
    writer.AddDocument(createDocument(this._text));
    writer.Optimize();
}
Then, I extract the keywords:
var list = new List<KeyValuePair<int, string>>();
using (var reader = IndexReader.Open(idx, true))
{
    var tv = reader.GetTermFreqVector(0, "text");
    if (tv != null)
    {
        string[] terms = tv.GetTerms();
        int[] freq = tv.GetTermFrequencies();
        for (int i = 0; i < terms.Length; i++)
            list.Add(new KeyValuePair<int, string>(freq[i], terms[i]));
    }
}
In the resulting list of terms I can get both "president" and "presidents".
How can I remove the plural form?
My CustomStandardAnalyzer uses this:
public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
{
    // create the tokenizer
    TokenStream result = new StandardTokenizer(this.version, reader);
    // add in filters
    result = new Lucene.Net.Analysis.Snowball.SnowballFilter(result, this.getStemmer());
    result = new LowerCaseFilter(result);
    result = new ASCIIFoldingFilter(result);
    result = new StopFilter(true, result, this.stopWords ?? StopWords.English);
    return result;
}
So I already use the SnowballFilter (with the correct language-specific stemmer).
How can I remove plurals?

My output from the following program is:
text:and
text:presid
text:some
text:text
text:with
class Program
{
    private class CustomStandardAnalyzer : Analyzer
    {
        public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
        {
            // create the tokenizer
            TokenStream result = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_30, reader);
            // add in filters
            result = new Lucene.Net.Analysis.Snowball.SnowballFilter(result, new EnglishStemmer());
            result = new LowerCaseFilter(result);
            result = new ASCIIFoldingFilter(result);
            result = new StopFilter(true, result, new HashSet<string>());
            return result;
        }
    }

    private static Document createDocument(string text)
    {
        Document d = new Document();
        Field f = new Field("text", "", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
        f.SetValue(text);
        d.Add(f);
        return d;
    }

    static void Main(string[] args)
    {
        RAMDirectory idx = new RAMDirectory();
        using (IndexWriter writer =
            new IndexWriter(
                idx,
                new CustomStandardAnalyzer(),
                IndexWriter.MaxFieldLength.LIMITED))
        {
            writer.AddDocument(createDocument("some text with president and presidents"));
            writer.Commit();
        }
        using (var reader = IndexReader.Open(idx, true))
        {
            var terms = reader.Terms(new Term("text", ""));
            if (terms.Term != null)
            {
                do
                {
                    Console.WriteLine(terms.Term);
                } while (terms.Next());
            }
        }
        Console.ReadLine();
    }
}
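If plurals still survive in your own index, one thing worth checking is the filter order: the Snowball stemmers expect lowercased input, so running SnowballFilter before LowerCaseFilter can leave capitalized words such as "Presidents" unstemmed. A sketch of the reordered chain, using the same fields and helpers as your analyzer (untested against your setup):
public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
{
    TokenStream result = new StandardTokenizer(this.version, reader);
    // lowercase first so the case-sensitive stemmer sees normalized input
    result = new LowerCaseFilter(result);
    result = new ASCIIFoldingFilter(result);
    result = new StopFilter(true, result, this.stopWords ?? StopWords.English);
    // stem last, after stop words have been removed
    result = new Lucene.Net.Analysis.Snowball.SnowballFilter(result, this.getStemmer());
    return result;
}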

Related

Extracting a dictionary from sparse csv file

I have a sparsely populated Excel/CSV file and I want to extract two of its columns into a dictionary in C#. I have tried the following, but it fails when it reads blank lines. Is there a cleaner way to achieve the same thing? I don't care about any other values here; a mapping of AR ID to AR Type would do.
public class Table
{
    private Dictionary<string, string> _ARID_ARTypeValues = new Dictionary<string, string>();
    private string _arId;

    public Table(string arId)
    {
        _arId = arId;
    }

    public void AddValue(string key, string value)
    {
        _ARID_ARTypeValues.Add(key, value);
    }
}

public static IDictionary ParseCsvFile(StreamReader reader)
{
    Dictionary<string, Table> tables = new Dictionary<string, Table>();
    // First line contains column names.
    var columnNames = reader.ReadLine().Split(',');
    for (int i = 1; i < columnNames.Length; ++i)
    {
        var columnName = columnNames[i];
        var ntable = new Table(columnName);
        if ((columnName == "AR ID") || (columnName == "AR Type"))
        {
            tables.Add(columnName, ntable);
        }
    }
    var line = reader.ReadLine();
    while (line != null)
    {
        var columns = line.Split(',');
        for (int j = 1; j < columns.Length; ++j)
        {
            var table = tables[columnNames[j]];
            table.AddValue(columns[0], columns[j]);
        }
        line = reader.ReadLine();
    }
    return tables;
}
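If you'd rather keep your own parser, a minimal tweak (just a sketch, keeping your loop structure) is to skip blank or whitespace-only lines before splitting them:
var line = reader.ReadLine();
while (line != null)
{
    // ignore empty lines instead of trying to split and index into them
    if (string.IsNullOrWhiteSpace(line))
    {
        line = reader.ReadLine();
        continue;
    }
    var columns = line.Split(',');
    // ... process columns as before ...
    line = reader.ReadLine();
}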
I would just use a CSV library like CsvHelper and read the CSV file with that.
Dictionary<string, string> arIdToArTypeMapping = new Dictionary<string, string>();
using (var sr = File.OpenText("test.csv"))
{
    var csvConfiguration = new CsvConfiguration
    {
        SkipEmptyRecords = true
    };
    using (var csvReader = new CsvReader(sr, csvConfiguration))
    {
        while (csvReader.Read())
        {
            string arId = csvReader.GetField("AR ID");
            string arType = csvReader.GetField("AR Type");
            if (!string.IsNullOrEmpty(arId) && !string.IsNullOrEmpty(arType))
            {
                arIdToArTypeMapping.Add(arId, arType);
            }
        }
    }
}
You can use Cinchoo ETL, an open-source library, to read the CSV and convert it to a dictionary with just a few lines of code, as shown below.
using (var parser = new ChoCSVReader("Dict1.csv")
    .WithField("AR_ID", 7)
    .WithField("AR_TYPE", 8)
    .WithFirstLineHeader(true)
    .Configure(c => c.IgnoreEmptyLine = true)
    )
{
    var dict = parser.ToDictionary(item => item.AR_ID, item => item.AR_TYPE);
    foreach (var kvp in dict)
        Console.WriteLine(kvp.Key + " " + kvp.Value);
}
Hope this helps.
Disclaimer: I'm the author of this library.

IList<T> return as a generic

I'm a beginner at coding and I'm trying to create a search engine, but there's a part I don't know how to solve: returning an IList<T> as a generic.
public IList<T> Search<T>(string textSearch)
{
    IList<T> list = new List<T>();
    var result = new DataTable();
    using (Analyzer analyzer = new PanGuAnalyzer())
    {
        var queryParser = new QueryParser(Version.LUCENE_30, "FullText", analyzer);
        queryParser.AllowLeadingWildcard = true;
        var query = queryParser.Parse(textSearch);
        var collector = TopScoreDocCollector.Create(1000, true);
        Searcher.Search(query, collector);
        var matches = collector.TopDocs().ScoreDocs;
        result.Columns.Add("Title");
        result.Columns.Add("Starring");
        result.Columns.Add("ID");
        foreach (var item in matches)
        {
            var id = item.Doc;
            var doc = Searcher.Doc(id);
            var row = result.NewRow();
            row["Title"] = doc.GetField("Title").StringValue;
            row["Starring"] = doc.GetField("Starring").StringValue;
            row["ID"] = doc.GetField("ID").StringValue;
            result.Rows.Add(row);
        }
    }
    return result;
}
But in this code I can't return result; the compiler says "Cannot implicitly convert type 'Data.DataTable' to 'Generic.IList'. An explicit conversion exists." How can I solve this?
I guess you don't really want to support generics here, since it doesn't make sense in this case and isn't possible anyway. You have a class, for example Film, so return a List<Film>; you don't need the DataTable:
public IList<Film> SearchFilms(string textSearch)
{
    IList<Film> list = new List<Film>();
    using (Analyzer analyzer = new PanGuAnalyzer())
    {
        var queryParser = new QueryParser(Version.LUCENE_30, "FullText", analyzer);
        queryParser.AllowLeadingWildcard = true;
        var query = queryParser.Parse(textSearch);
        var collector = TopScoreDocCollector.Create(1000, true);
        Searcher.Search(query, collector);
        var matches = collector.TopDocs().ScoreDocs;
        foreach (var item in matches)
        {
            var film = new Film();
            var id = item.Doc;
            var doc = Searcher.Doc(id);
            film.Title = doc.GetField("Title").StringValue;
            film.Starring = doc.GetField("Starring").StringValue;
            film.ID = doc.GetField("ID").StringValue;
            list.Add(film);
        }
    }
    return list;
}
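The Film class isn't shown in the question; a minimal placeholder matching the fields used above (your real class may differ) would be:
public class Film
{
    public string Title { get; set; }
    public string Starring { get; set; }
    public string ID { get; set; }
}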
Your return statement should be
result.AsEnumerable().ToList();
Don't forget to add the namespace:
using System.Linq;
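Note that AsEnumerable() yields DataRow objects, so this only compiles if the method is declared to return the row type rather than a generic T; a sketch of that variant:
public IList<DataRow> Search(string textSearch)
{
    var result = new DataTable();
    // ... build the DataTable exactly as in the question ...
    return result.AsEnumerable().ToList();
}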

Field Boosting Doesn't Work / Has No Effect in Lucene.net

I'm trying to set boosts on document fields to make the search results more accurate, but as far as I can see it doesn't work.
Here is my code.
Indexing:
private static void _addToLuceneIndex(Datafile Datafile, IndexWriter writer)
{
    // remove older index entry
    var searchQuery = new TermQuery(new Term("Id", Datafile.article.Id.ToString()));
    writer.DeleteDocuments(searchQuery);

    // add new index entry
    var doc = new Document();
    var id = new Field("Id", Datafile.article.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED);
    var content = new Field("Content", Datafile.article.Content, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS);
    content.Boost = 4;
    var title = new Field("Title", Datafile.article.Title, Field.Store.YES, Field.Index.ANALYZED);
    title.Boost = 6;
    doc.Add(id);
    doc.Add(content);
    doc.Add(title);
    foreach (var item in Datafile.article.Article_Tag)
    {
        var tmpta = new Field("Atid", item.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED);
        var tagname = new Field("Tagname", item.Tag.name, Field.Store.YES, Field.Index.ANALYZED);
        tagname.Boost = 8;
        doc.Add(tmpta);
        doc.Add(tagname);
    }
    // add lucene fields mapped to db fields
    // add entry to index
    writer.AddDocument(doc);
}
I've used Luke.Net to check whether the fields were boosted, but the boost still shows as 1.0. So I ran and tested the search anyway, and the results disappointed me. Here is my search code.
Searching:
private static IEnumerable<Datafile> _search(string searchQuery, string searchField = "")
{
    // validation
    if (string.IsNullOrEmpty(searchQuery.Replace("*", "").Replace("?", "")))
        return new List<Datafile>();

    var indexReader = IndexReader.Open(Directory, false);

    // set up lucene searcher
    using (var searcher = new IndexSearcher(indexReader))
    {
        var hits_limit = 1000;

        // search by single field
        var enanalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
        var aranalyzer = new SnowballAnalyzer(Version.LUCENE_30, "Arabic");
        string[] fields = new string[] { "Title", "Content", "Tagname" };
        // Dictionary<string, float> boosts = new Dictionary<string, float>();
        // boosts.Add("Title", 5);
        // boosts.Add("Content", 3);
        // boosts.Add("Tagname", 7);
        var enparser = new MultiFieldQueryParser(Version.LUCENE_30, fields, enanalyzer);
        var arparser = new MultiFieldQueryParser(Version.LUCENE_30, fields, aranalyzer);
        var query = QueryModel(searchQuery, new QueryParser[] { enparser, arparser });

        searcher.SetDefaultFieldSortScoring(true, false);
        TopFieldCollector collector = TopFieldCollector.Create(
            new Sort(
                new SortField(null, SortField.SCORE, false),
                new SortField("Title", SortField.STRING, true),
                new SortField("Tagname", SortField.STRING, true),
                new SortField("Content", SortField.STRING, true)),
            hits_limit,
            false,  // fillFields - not needed, we want score and doc only
            true,   // trackDocScores - need doc and score fields
            true,   // trackMaxScore - related to trackDocScores
            false); // should docs be in docId order?
        searcher.Search(query, collector);
        var hits = collector.TopDocs().ScoreDocs;
        var results = new List<Datafile>();
        foreach (var hit in hits)
        {
            var doc = searcher.Doc(hit.Doc);
            var df = _mapLuceneDocumentToData(doc);
            df.score = hit.Score;
            results.Add(df);
        }
        searcher.Dispose();
        return results;
        // search by multiple fields (ordered by RELEVANCE)
    }
}
QueryModel Method:
private static Query QueryModel(string searchQuery, QueryParser[] parsers)
{
    BooleanQuery query = new BooleanQuery();
    searchQuery = "*" + searchQuery + "*";
    foreach (var parser in parsers)
    {
        parser.AllowLeadingWildcard = true;
        var thequery = parser.Parse(searchQuery);
        query.Add(new BooleanClause(thequery, Occur.SHOULD));
    }
    return query;
}
I'm new to Lucene.NET; I love it, but I can't get my head around this problem.
PS:
I also want to do a fuzzy query, so that when the user enters "city in russua" they get the same results as if they had entered "city in russia".
I tried the FuzzyQuery class but it doesn't work either. Is it necessary to use the FuzzyQuery class to get that result?
Since no one answered my question and I found a solution to this issue myself: I used search-time query boosting, and here is my code:
var QParser = new QueryParser(Version.LUCENE_30, "Content", analyzer);
QParser.AllowLeadingWildcard = true;
var Query = QParser.Parse(searchQuery);
Query.Boost = 7.0f;
return Query;
You can use a BooleanQuery if you want to do an OR/AND search.
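As an alternative, the per-field boosts that are commented out in the question can be passed straight to MultiFieldQueryParser, which then applies a different boost per field at query time. A rough sketch, reusing the analyzer and searchQuery from the search code above (untested against your index):
var boosts = new Dictionary<string, float>
{
    { "Title", 5f },
    { "Content", 3f },
    { "Tagname", 7f }
};
string[] fields = new string[] { "Title", "Content", "Tagname" };
// overload of MultiFieldQueryParser that takes a per-field boost map
var parser = new MultiFieldQueryParser(Version.LUCENE_30, fields, analyzer, boosts);
parser.AllowLeadingWildcard = true;
var boostedQuery = parser.Parse(searchQuery);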

Getting terms matched in a document when searching using a wildcard search

I am looking for a way to find the terms that matched in a document when using a wildcard search in Lucene. I used Explain to try to find the terms, but this failed. A portion of the relevant code is below.
ScoreDoc[] myHits = myTopDocs.scoreDocs;
int hitsCount = myHits.Length;
for (int myCounter = 0; myCounter < hitsCount; myCounter++)
{
    Document doc = searcher.Doc(myHits[myCounter].doc);
    Explanation explanation = searcher.Explain(myQuery, myCounter);
    string myExplanation = explanation.ToString();
    ...
When I do a search on, say, micro*, documents are found and the code enters the loop, but myExplanation contains NON-MATCH and no other information.
How do I get the term that was found in this document?
Any help would be much appreciated.
Regards
class TVM : TermVectorMapper
{
    public List<string> FoundTerms = new List<string>();
    HashSet<string> _termTexts = new HashSet<string>();

    public TVM(Query q, IndexReader r) : base()
    {
        List<Term> allTerms = new List<Term>();
        q.Rewrite(r).ExtractTerms(allTerms);
        foreach (Term t in allTerms) _termTexts.Add(t.Text());
    }

    public override void SetExpectations(string field, int numTerms, bool storeOffsets, bool storePositions)
    {
    }

    public override void Map(string term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions)
    {
        if (_termTexts.Contains(term)) FoundTerms.Add(term);
    }
}

void TermVectorMapperTest()
{
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new Lucene.Net.Analysis.Standard.StandardAnalyzer(), true);
    Document d = null;

    d = new Document();
    d.Add(new Field("text", "microscope aaa", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
    writer.AddDocument(d);

    d = new Document();
    d.Add(new Field("text", "microsoft bbb", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
    writer.AddDocument(d);
    writer.Close();

    IndexReader reader = IndexReader.Open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    QueryParser queryParser = new QueryParser("text", new Lucene.Net.Analysis.Standard.StandardAnalyzer());
    queryParser.SetMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
    Query query = queryParser.Parse("micro*");
    TopDocs results = searcher.Search(query, 5);
    System.Diagnostics.Debug.Assert(results.TotalHits == 2);

    TVM tvm = new TVM(query, reader);
    for (int i = 0; i < results.ScoreDocs.Length; i++)
    {
        Console.Write("DOCID:" + results.ScoreDocs[i].Doc + " > ");
        reader.GetTermFreqVector(results.ScoreDocs[i].Doc, "text", tvm);
        foreach (string term in tvm.FoundTerms) Console.Write(term + " ");
        tvm.FoundTerms.Clear();
        Console.WriteLine();
    }
}
One way is to use the Highlighter; another way would be to mimic what the Highlighter does by rewriting your query by calling myQuery.rewrite() with an appropriate rewriter; this is probably closer in spirit to what you were trying. This will rewrite the query to a BooleanQuery containing all the matching Terms; you can get the words out of those pretty easily. Is that enough to get you going?
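A short sketch of that rewrite approach (it mirrors what the TVM constructor above already does; reader is your open IndexReader, and the wildcard's rewrite method must be one that expands to a BooleanQuery, as the SCORING_BOOLEAN_QUERY_REWRITE line in the test above ensures):
List<Term> matchedTerms = new List<Term>();
myQuery.Rewrite(reader).ExtractTerms(matchedTerms);
foreach (Term t in matchedTerms)
    Console.WriteLine(t.Text());   // the concrete terms the wildcard expanded to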
Here's the idea I had in mind; sorry about the confusion re: rewriting queries; it's not really relevant here.
// reader, docId, field and analyzer come from your search context;
// GetAnyTokenStream and ITermAttribute are the Lucene.Net 3.x names (assumed)
TokenStream tokens = TokenSources.GetAnyTokenStream(reader, docId, field, analyzer);
ITermAttribute termAtt = tokens.AddAttribute<ITermAttribute>();
while (tokens.IncrementToken())
{
    // do something with termAtt.Term, which holds the matched term
}

MongoDB + C#: Query inside a document

It seems I don't understand how to get a value from a collection inside a document. I am using MongoDB in C#.
Here is my code:
var jimi = new Document();
jimi["Firstname"] = "Jimi";
jimi["Lastname"] = "James";
jimi["Pets"] = new[]
{
new Document().Append("Type", "Cat").Append("Name", "Fluffy"),
new Document().Append("Type", "Dog").Append("Name", "Barky"),
new Document().Append("Type", "Gorilla").Append("Name", "Bananas"),
};
test.Insert(jimi);
var query = new Document().Append("Pets.Type","Cat");
So my query will look for the pet cat. But I am not sure how I can get the name of my cat. I tried a few things but I mostly get the whole document back.
Thanks in advance,
Pickels
This isn't as elegant as I'd like, as I'm still learning MongoDB myself, but it does show you one way to get the property you wanted.
[TestFixture]
public class When_working_with_nested_documents
{
    [Test]
    public void Should_be_able_to_fetch_properties_of_nested_objects()
    {
        var mongo = new Mongo();
        mongo.Connect();
        var db = mongo.getDB("tests");
        var people = db.GetCollection("people");

        var jimi = new Document();
        jimi["Firstname"] = "Jimi";
        jimi["Lastname"] = "James";
        jimi["Pets"] = new[]
        {
            new Document().Append("Type", "Cat").Append("Name", "Fluffy"),
            new Document().Append("Type", "Dog").Append("Name", "Barky"),
            new Document().Append("Type", "Gorilla").Append("Name", "Bananas"),
        };
        people.Insert(jimi);

        var query = new Document();
        query["Pets.Type"] = "Cat";
        var personResult = people.FindOne(query);
        Assert.IsNotNull(personResult);

        var petsResult = (Document[])personResult["Pets"];
        var pet = petsResult.FindOne("Type", "Cat");
        Assert.IsNotNull(pet);
        Assert.AreEqual("Fluffy", pet["Name"]);
    }
}

public static class DocumentExtensions
{
    public static Document FindOne(this Document[] documents, string key, string value)
    {
        foreach (var document in documents)
        {
            var v = document[key];
            if (v != null && v.Equals(value))
            {
                return document;
            }
        }
        return null;
    }
}
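If you prefer not to add the extension method, plain LINQ over the Pets array does the same client-side lookup (a sketch, assuming the same personResult as above):
// requires: using System.Linq;
var petsResult = (Document[])personResult["Pets"];
var cat = petsResult.FirstOrDefault(p => "Cat".Equals(p["Type"]));
var catName = cat != null ? cat["Name"] : null;  // "Fluffy"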
