Lucene.Net:search does not find all the values - c#

I use Lucene.Net 3.0.3.
I have several classes with fields like this:
public class Test1
{
public string Name {get;set;}
}
How I create the index:
var doc = new Document();
doc.Add(new Field(KEYWORDS_FIELD_NAME, Convert.ToString(someUid, CultureInfo.InvariantCulture), Field.Store.YES, Field.Index.ANALYZED));
How I create the query:
var analyzer=new RussianAnalyzer(Version.LUCENE_30);
private Query ParseQuery(string queryString, Analyzer analyzer)
{
    var classQuery = new BooleanQuery();
    var hs = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase);
    foreach (var par in Parameters)
    {
        classQuery.Add(new TermQuery(new Term(KEYWORDS_FIELD_NAME, par.ClassName.ToLower())), Occur.SHOULD);
        hs.Add(par.PropertyName);
    }
    var parser = new MultiFieldQueryParser(Version.LUCENE_30, hs.ToArray(), analyzer);
    var multiQuery = parser.Parse(queryString.Trim());
    var result = new BooleanQuery
    {
        { classQuery, Occur.MUST },
        new BooleanClause(multiQuery, Occur.MUST)
    };
    return result;
}
And the search request:
var query = ParseQuery(queryString, analyzer);
using (var searcher = new IndexSearcher(luceneDirectory))
{
var hits = searcher.Search(query, null, 10000);
}
The search index contains the "Name" property of class Test1.
Some of the values of the properties are:
   40002
   40001
   4001
   4009
and other similar values.
When I enter "4001", the search produces one result. That suits me.
However, when I enter "400", the search does not find any value.
I understand that this exact value is not in the index, but I expect the search in this case to find "similar" values: 4001, 40002 and others.
Can this be done? What am I doing wrong?
Thank you.
P.S. It works with "400*" and MultiFieldQuery without RegexQuery, but it is about 30 percent slower.
When I use RegexQuery, it is 70-80 percent slower.

Try searching for 400*. You would also need to use a WildcardQuery, not a BooleanQuery.
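For a pure "starts with" search, a PrefixQuery can also be built directly instead of going through the query parser. The following is only a minimal sketch against Lucene.Net 3.0.3: the class and method names are hypothetical, the field name "Name" and the sample values come from the question, and StandardAnalyzer is swapped in for the question's RussianAnalyzer to keep the example small.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class PrefixSearchSketch
{
    // Indexes the sample values from the question in memory and returns
    // how many documents have a term starting with the given prefix.
    public static int CountPrefixHits(string prefix)
    {
        const string fieldName = "Name"; // field name taken from the question
        var luceneVer = Lucene.Net.Util.Version.LUCENE_30;
        var directory = new RAMDirectory();

        using (var writer = new IndexWriter(directory,
            new StandardAnalyzer(luceneVer),
            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (var value in new[] { "40002", "40001", "4001", "4009" })
            {
                var doc = new Document();
                doc.Add(new Field(fieldName, value, Field.Store.YES, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }
            writer.Commit();
        }

        using (var searcher = new IndexSearcher(directory, true))
        {
            // PrefixQuery matches every indexed term that begins with the prefix.
            // The prefix is NOT analyzed, so it must already be in indexed form.
            var query = new PrefixQuery(new Term(fieldName, prefix));
            return searcher.Search(query, null, 100).TotalHits;
        }
    }
}
```

Calling PrefixSearchSketch.CountPrefixHits("400") should match all four sample values, because each indexed term begins with "400".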

Related

Searching records using string with special characters not working in Lucene.Net

I am new to Lucene, and I am facing serious issues with Lucene search. Searching records using a plain string or a string with numbers works fine, but searching with a string that contains special characters does not return any results.
ex: example - Brings results
'examples' - no result
%example% - no result
example2 - Brings results
#example - no results
code:
Indexing:
_document.Add(new Field(dc.ColumnName, dr[dc.ColumnName].ToString(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
Search Query :
Lucene.Net.Store.Directory _dir = Lucene.Net.Store.FSDirectory.Open(Config.Get(directoryPath));
Lucene.Net.Analysis.Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
Query querySearch = queryParser.Parse("*" + searchParams.SearchForText + "*");
booleanQuery.Add(querySearch, Occur.MUST);
Can anyone help me to fix this?
It appears there's work to be done. I recommend getting a good starter book on Lucene such as Lucene in Action, Second Edition (as you're using version 3). Although it targets Java, the examples are easily adapted to C#, and really the concepts are what matter most.
First, this:
"*" + searchParams.SearchForText + "*"
Don't do that. Leading wildcard searches are wildly inefficient and will consume an enormous amount of resources on a sizable index, doubly so for a combined leading and trailing wildcard search. What would happen if the query text was *e*?
There also seems to be more going on than the posted code shows, as there is no reason not to be getting hits based on the inputs. The snippet below will produce the following in the console:
Index terms:
example
example2
raw text %example% as query text:example got 1 hit(s)
raw text 'example' as query text:example got 1 hit(s)
raw text example as query text:example got 1 hit(s)
raw text #example as query text:example got 1 hit(s)
raw text example2 as query text:example2 got 1 hit(s)
Wildcard raw text example* as query text:example* got 2 hit(s)
See the Index Terms listing? NO 'special characters' land in the index, because StandardAnalyzer removes them at index time (assuming StandardAnalyzer is what you use to index the field).
I recommend running the snippet below in the debugger and observing what is happening.
public static void Example()
{
var field_name = "text";
var field_value = "%example% 'example' example #example example";
var field_value2 = "example2";
var luceneVer = Lucene.Net.Util.Version.LUCENE_30;
using (var writer = new IndexWriter(new RAMDirectory(),
new StandardAnalyzer(luceneVer), IndexWriter.MaxFieldLength.UNLIMITED)
)
{
var doc = new Document();
var field = new Field(
field_name,
field_value,
Field.Store.YES,
Field.Index.ANALYZED,
Field.TermVector.YES
);
doc.Add(field);
writer.AddDocument(doc);
doc = new Document();
field = new Field(
field_name,
field_value2,
Field.Store.YES,
Field.Index.ANALYZED,
Field.TermVector.YES
);
doc.Add(field);
writer.AddDocument(doc);
writer.Commit();
Console.WriteLine();
// Show ALL terms in the index.
using (var reader = writer.GetReader())
{
TermEnum terms = reader.Terms();
Console.WriteLine("Index terms:");
while (terms.Next())
{
Console.WriteLine("\t{0}", terms.Term.Text);
}
}
// Search for each word in the original content #field_value
using (var searcher = new IndexSearcher(writer.GetReader()))
{
string query_text;
QueryParser parser;
Query query;
TopDocs topDocs;
List<string> field_queries = new List<string>(field_value.Split(' '));
field_queries.Add(field_value2);
var analyzer = new StandardAnalyzer(luceneVer);
while (field_queries.Count > 0)
{
query_text = field_queries[0];
parser = new QueryParser(luceneVer, field_name, analyzer);
query = parser.Parse(query_text);
topDocs = searcher.Search(query, null, 100);
Console.WriteLine();
Console.WriteLine("raw text {0} as query {1} got {2} hit(s)",
query_text,
query,
topDocs.TotalHits
);
field_queries.RemoveAt(0);
}
// Now do a wildcard query "example*"
query_text = "example*";
parser = new QueryParser(luceneVer, field_name, analyzer);
query = parser.Parse(query_text);
topDocs = searcher.Search(query, null, 100);
Console.WriteLine();
Console.WriteLine("Wildcard raw text {0} as query {1} got {2} hit(s)",
query_text,
query,
topDocs.TotalHits
);
}
}
}
If you need to perform exact matching, and index certain characters like %, then you'll need to use something other than StandardAnalyzer, perhaps a custom analyzer.
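As one possible starting point, WhitespaceAnalyzer splits on whitespace only and keeps punctuation inside the tokens. The sketch below reuses the calls from the snippet above to list the resulting index terms; the class and method names are hypothetical, and it assumes Lucene.Net 3.0.3. Note that WhitespaceAnalyzer does not lowercase, and that QueryParser still gives some characters a special meaning, so such terms may need escaping or a hand-built TermQuery at search time.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class WhitespaceIndexSketch
{
    // Returns the terms that land in the index when the text is analyzed
    // with WhitespaceAnalyzer instead of StandardAnalyzer.
    public static List<string> IndexedTerms(string text)
    {
        var terms = new List<string>();
        using (var writer = new IndexWriter(new RAMDirectory(),
            new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();
            doc.Add(new Field("text", text, Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);
            writer.Commit();

            // Enumerate every term in the index, as in the snippet above.
            using (var reader = writer.GetReader())
            {
                TermEnum termEnum = reader.Terms();
                while (termEnum.Next())
                {
                    terms.Add(termEnum.Term.Text);
                }
            }
        }
        return terms;
    }
}
```

With this analyzer, IndexedTerms("%example% 'example' #example") should contain the tokens with their surrounding characters intact, for example %example%.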

C# to Sort by Last Created (oldest record) & Limit results to 20 records from DynamoDB Table

How can I sort by creation date (oldest record first) and limit the results to 20 records from a DynamoDB table using the BatchGetItemAsync method? Thanks in advance.
var table = Table.LoadTable(client, TableName);
var request = new BatchGetItemRequest
{
RequestItems = new Dictionary<string, KeysAndAttributes>()
{
{ TableName,
new KeysAndAttributes
{
AttributesToGet = new List<string> { "ID", "Status", "Date" },
Keys = new List<Dictionary<string, AttributeValue>>()
{
new Dictionary<string, AttributeValue>()
{
{ "Status", new AttributeValue { S = "Accepted" } }
}
}
}
}
}
};
var response = await client.BatchGetItemAsync(request);
var results = response.Responses;
var result = results[TableName];
There isn't a way to do what you're asking for with BatchGetItemAsync. That call is to get specific records, when you know the specific keys you are looking for. You'll need to use a query to do this, and you'll want to get your data in a structure that supports this access pattern. There was a really great session on DynamoDB access patterns at re:Invent 2018. I suggest watching it: https://www.youtube.com/watch?v=HaEPXoXVf2k
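A query along those lines might look like the sketch below. It rests on assumptions that are not in the question: a global secondary index, here given the hypothetical name "Status-Date-index", with Status as its partition key and Date as its sort key, and Date values stored in a format that sorts chronologically (for example ISO 8601 strings). The class and method names are also hypothetical.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public static class OldestAcceptedSketch
{
    // Builds a Query for the 20 oldest items whose Status is "Accepted".
    public static QueryRequest BuildRequest(string tableName)
    {
        return new QueryRequest
        {
            TableName = tableName,
            IndexName = "Status-Date-index", // hypothetical GSI, see assumptions above
            KeyConditionExpression = "#s = :status",
            // Status and Date are DynamoDB reserved words, so they are aliased.
            ExpressionAttributeNames = new Dictionary<string, string>
            {
                { "#id", "ID" },
                { "#s", "Status" },
                { "#d", "Date" }
            },
            ExpressionAttributeValues = new Dictionary<string, AttributeValue>
            {
                { ":status", new AttributeValue { S = "Accepted" } }
            },
            ProjectionExpression = "#id, #s, #d",
            ScanIndexForward = true, // ascending sort key order: oldest first
            Limit = 20
        };
    }

    // Runs the query and returns the matching items.
    public static async Task<List<Dictionary<string, AttributeValue>>> RunAsync(
        IAmazonDynamoDB client, string tableName)
    {
        var response = await client.QueryAsync(BuildRequest(tableName));
        return response.Items;
    }
}
```

The sort order comes from the index's sort key, which is why the table or index has to be designed around this access pattern first.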

Simple PhraseQuery can't find any results

I am trying to implement Lucene search on 3 fields in my data. It should work as following:
when the field text is "My big white cat" and I search for "big cat", it should match.
Based on the tutorials I added the AddToLuceneIndex method:
private static void AddToLuceneIndex(MyObject myObject, IndexWriter writer)
{
var searchQuery = new TermQuery(new Term("Id", myObject.Id));
writer.DeleteDocuments(searchQuery);
var doc = new Document();
doc.Add(new Field("Field1", myObject.Field1, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("Field2", myObject.Field2, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("Field3", myObject.Field3, Field.Store.YES, Field.Index.ANALYZED));
(...)
writer.AddDocument(doc);
}
In my Search method I tried to use PhraseQuery:
public static IEnumerable<MyObject> Search(string phrase)
{
var luceneDir = FSDirectory.Open(new DirectoryInfo(LuceneDir));
var indexReader = IndexReader.Open(luceneDir, true);
var searcher = new IndexSearcher(indexReader);
var phraseQuery = new PhraseQuery();
phraseQuery.Add(new Term("Field1", phrase));
const int maxHits = 1000;
var collector = TopScoreDocCollector.Create(maxHits, false);
searcher.Search(phraseQuery, collector);
var hits = collector.TopDocs().ScoreDocs;
return MapLuceneToDataList(hits, searcher).ToList();
}
There are always 0 hits (although there are matching objects).
When I use BooleanQuery like this:
public static IEnumerable<MyObject> Search(string phrase)
{
var luceneDir = FSDirectory.Open(new DirectoryInfo(LuceneDir));
var indexReader = IndexReader.Open(luceneDir, true);
var searcher = new IndexSearcher(indexReader);
var terms = phrase.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
var analyzer = new StandardAnalyzer(Version.LUCENE_30);
var queryParser = new MultiFieldQueryParser
(Version.LUCENE_30,
new[] { "Field1", "Field2", "Field3"},
analyzer) { FuzzyMinSim = 0.8f };
var booleanQuery = new BooleanQuery();
foreach (var term in terms)
{
booleanQuery.Add(queryParser.Parse(term.Replace("~", "") + "~"), Occur.MUST);
}
const int maxHits = 1000;
var collector = TopScoreDocCollector.Create(maxHits, false);
searcher.Search(booleanQuery, collector);
var hits = collector.TopDocs().ScoreDocs;
return MapLuceneToDataList(hits, searcher).ToList();
}
it works well, but I don't need "big OR cat", I need what I described earlier. What am I doing wrong with PhraseQuery?
There are two problems with your PhraseQuery.
As stated by @groverboy, you must add each term separately to the PhraseQuery. While Query.ToString() may show the same thing, they are not the same thing. The ToString method does not show term breaks in a PhraseQuery. It tries to represent the query as closely as it can in standard QueryParser syntax, which isn't capable of expressing every query that can be constructed manually with the Query API. The PhraseQuery you've created won't be run through the analyzer, and so won't ever be tokenized. It will only look for the single token "big cat", rather than two adjacent tokens "big" and "cat".
The Explain method provides much more complete information than ToString, so you may find that to be a useful tool.
Also, you don't appear to want adjacent tokens, either, but rather you need to incorporate some slop into the query. You want "big cat" to match "big white cat", so you will need to set an adequate level of allowed slop.
So, something like this:
var phraseQuery = new PhraseQuery();
phraseQuery.Add(new Term("Field1", "big"));
phraseQuery.Add(new Term("Field1", "cat"));
phraseQuery.Slop = 1;
You could also just run the query through the query parser, if you prefer, using the analyzer you've created in your third code block. You can set the default phrase slop on the query parser to handle the slop issue discussed. Something like:
queryParser.PhraseSlop = 1;
queryParser.Parse("\"" + phrase + "\"");
// Or maybe just: queryParser.Parse(phrase);
You need to add a Term to the PhraseQuery for each word in the phrase, like this:
var phraseQuery = new PhraseQuery();
var words = phrase.Split(new Char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
foreach (var word in words)
{
phraseQuery.Add(new Term("Field1", word));
}

Why this difference between foreach vs Parallel.ForEach?

Can anyone explain to me in simple language why I get a file of about 65 KB when using foreach and more than 3 GB when using Parallel.ForEach?
The code for the foreach:
// start node xml document
var logItems = new XElement("log", new XAttribute("start", DateTime.Now.ToString("yyyy-MM-ddTHH:mm:ss")));
var products = new ProductLogic().SelectProducts();
var productGroupLogic = new ProductGroupLogic();
var productOptionLogic = new ProductOptionLogic();
// loop through all products
foreach (var product in products)
{
// is in a specific group
var id = Convert.ToInt32(product["ProductID"]);
var isInGroup = productGroupLogic.GetProductGroups(new int[] { id }.ToList(), groupId).Count > 0;
// get product stock per option
var productSizes = productOptionLogic.GetProductStockByProductId(id).ToList();
// any stock available
var stock = productSizes.Sum(ps => ps.Stock);
var hasStock = stock > 0;
// get webpage for this product
var productUrl = string.Format(url, id);
var htmlPage = Html.Page.GetWebPage(productUrl);
// check if there is anything to log
var addToLog = false;
XElement sizeElements = null;
// if has no stock or in group
if (!hasStock || isInGroup)
{
// page shows => not ok => LOG!
if (!htmlPage.NotFound) addToLog = true;
}
// if page is ok
if (htmlPage.IsOk)
{
sizeElements = GetSizeElements(htmlPage.Html, productSizes);
addToLog = sizeElements != null;
}
if (addToLog) logItems.Add(CreateElement(productUrl, htmlPage, stock, isInGroup, sizeElements));
}
// save
var xDocument = new XDocument(new XDeclaration("1.0", "utf-8", "yes"), new XElement("log", logItems));
xDocument.Save(fileName);
Switching to the parallel code is a minor change; I just replaced the foreach with Parallel.ForEach:
// loop through all products
Parallel.ForEach(products, product =>
{
... code ...
});
The methods GetSizeElements and CreateElements are both static.
update1
I made the methods GetSizeElements and CreateElements thread-safe with a lock, but that doesn't help either.
update2
I got an answer that solves the problem, which is nice. But I would like more insight into why this code creates a file that is so much bigger than the foreach solution. I am trying to get a better sense of how the code works when using threads, so that I can learn to avoid the pitfalls.
One thing stands out:
if (addToLog)
logItems.Add(CreateElement(productUrl, htmlPage, stock, isInGroup, sizeElements));
logItems is not thread-safe. That could be your core problem, but there are lots of other possibilities.
You have the output files, so look for the differences.
Try to define the following variables inside the foreach loop.
var productGroupLogic = new ProductGroupLogic();
var productOptionLogic = new ProductOptionLogic();
I think these two objects are shared by all of your threads inside the Parallel.ForEach loop, so the results are multiplied unnecessarily.
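Both suggestions can be sketched together: keep per-iteration objects inside the lambda and collect results in a thread-safe collection, so that the shared XElement is only modified by one thread. This is a minimal illustration and not the asker's code; the class name, the "item" element and the id attribute are hypothetical stand-ins for CreateElement's output.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Xml.Linq;

public static class ParallelLogSketch
{
    // Builds one <item> element per id in parallel, then assembles the
    // <log> element on a single thread.
    public static XElement BuildLog(IEnumerable<int> productIds)
    {
        var items = new ConcurrentBag<XElement>();

        Parallel.ForEach(productIds, id =>
        {
            // Everything declared inside the lambda is private to this iteration.
            var element = new XElement("item", new XAttribute("id", id));
            items.Add(element); // ConcurrentBag.Add is safe to call from many threads
        });

        // XElement is not thread-safe, so the shared element is only touched here,
        // after the parallel loop has finished.
        var log = new XElement("log");
        foreach (var element in items.OrderBy(e => (int)e.Attribute("id")))
        {
            log.Add(element);
        }
        return log;
    }
}
```

The sort before the final loop restores a stable order, since Parallel.ForEach does not process items in sequence.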

SolrNet facetted search

I'm using Solr and Solrnet for the first time.
I have managed to create a schema, which includes the following fields:
created_date DateTime
parent_id int
categories string multi-valued
I've been able to get a basic search working against the parent_id field, as so:
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<SolrNode>>();
ICollection<SolrNode> results = solr.Query(new SolrQueryByField("parent_id", currentNode.Id.ToString()));
I've been able to figure out (kind of) how to return facets for all of my results, as so:
var solrFacet = ServiceLocator.Current.GetInstance<ISolrOperations<SolrNode>>();
var r = solrFacet.Query(SolrQuery.All, new QueryOptions
{
Rows = 0,
Facet = new FacetParameters
{
Queries = new[] {
//new SolrFacetDateQuery("created_date", DateTime.Now /* range start */, DateTime.Now.AddMonths(-6) /* range end */, "+1DAY" /* gap */) {
// HardEnd = true,
// Other = new[] {FacetDateOther.After, FacetDateOther.Before}
//}
new SolrFacetDateQuery("created_date", new DateTime(2011, 1, 1).AddDays(-1) /* range start */, new DateTime(2014, 1, 1).AddMonths(1) /* range end */, "+1MONTH" /* gap */) {
HardEnd = true,
Other = new[] {FacetDateOther.After, FacetDateOther.Before}
}
//,
//new SolrFacetFieldQuery("categories")
},
}
});
//foreach (var facet in r.FacetFields["categories"])
//{
// this.DebugLiteral.Text += string.Format("{0}: {1}{2}", facet.Key, facet.Value, "<br />");
//}
DateFacetingResult dateFacetResult = r.FacetDates["created_date"];
foreach (KeyValuePair<DateTime, int> dr in dateFacetResult.DateResults)
{
this.DebugLiteral.Text += string.Format("{0}: {1}{2}", dr.Key, dr.Value, "<br />");
}
But what I'm not able to figure out is how to plumb it all together. My requirements are as follows:
Page loads - show all search results where parent_id matches N. Query facets of search results and show tick boxes for the facets like so:
Categories
Category 1
Category 2
Within
Last week
Last month
Last 3 months
Last 6 months
All time
User clicks on relevant tick boxes and then code executes another solr query, passing in both the parent_id criteria, along with the facets the user has selected.
I realise in my description I have simplified the process, and perhaps it is quite a big question to ask on StackOverflow, so I'm of course not expecting a working example (although if you're bored pls feel free ;-)) but could anyone provide any pointers, or examples? SolrNet does have an MVC sample app, but I'm using WebForms and not particularly comfortable with MVC just yet.
Any help would be greatly appreciated.
Thanks in advance
Al
You can combine the query and add the facets through the QueryOptions:
ISolrQueryResults<TestDocument> r = solr.Query("product_id:XXXXX", new QueryOptions {
FacetQueries = new ISolrFacetQuery[] {
new SolrFacetFieldQuery("category")
}
});
I create a 'fq' string of the facets selected. For instance, if the user selects these facets:
united states
california
almond growers
and the facet field is 'content_type', I generate this query string:
(content_type:"united states" AND content_type:california AND content_type:"almond growers")
Note the quotes and the opening and closing parentheses: they are important! I store this in the variable named finalFacet. I then submit it to Solr like this, where sQuery is the text the user is searching on and finalFacet is the facet query string as shown above:
articles = solr.Query(sQuery, new QueryOptions
{
ExtraParams = new Dictionary<string, string> {
{"fq", finalFacet},
},
Facet = new FacetParameters
{
Queries = new[] { new SolrFacetFieldQuery("content_type")}
},
Rows = 10,
Start = 1
});
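Building the filter string from the selected facet values could be sketched as below. The class and method names are hypothetical. Unlike the example string above, this version quotes every value, including single-word ones, which keeps the code simple, and it escapes backslashes and double quotes inside values.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class FacetFilterSketch
{
    // Builds a string such as (field:"a" AND field:"b") from the selected values.
    // Returns an empty string when nothing is selected.
    public static string BuildFilter(string field, IEnumerable<string> selectedValues)
    {
        var clauses = selectedValues
            .Select(v => v.Replace("\\", "\\\\").Replace("\"", "\\\"")) // escape \ and "
            .Select(v => field + ":\"" + v + "\"")
            .ToList();

        if (clauses.Count == 0)
        {
            return string.Empty;
        }
        return "(" + string.Join(" AND ", clauses) + ")";
    }
}
```

For the three selections above, BuildFilter("content_type", ...) yields (content_type:"united states" AND content_type:"california" AND content_type:"almond growers"). When the result is empty, the "fq" entry should be left out of ExtraParams.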
