I am indexing and searching with Lucene.Net. The only problem I am having is that my code does not find any hits when searching for "mvc2" (it seems to work with every other word I search). I have tried a different analyzer (see the comments next to the analyzer) and older Lucene code. Here are my index and search methods; I would really appreciate it if someone could show me where I am going wrong. Thanks.
////Indexing code
public void DoIndexing(string CvContent)
{
    //state the file location of the index
    const string indexFileLocation = @"C:\RecruitmentIndexer\IndexedCVs";

    //if the directory does not exist, create it and create a new index in it;
    //if it does exist, reuse the existing index (add documents to it)
    bool createNewDirectory; //to pass into Lucene's GetDirectory
    bool createNewIndex;     //to pass into Lucene's IndexWriter
    if (!Directory.Exists(indexFileLocation))
    {
        createNewDirectory = true;
        createNewIndex = true;
    }
    else
    {
        createNewDirectory = false;
        createNewIndex = false;
    }

    Lucene.Net.Store.Directory dir =
        Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, createNewDirectory); //creates if true

    //create an analyzer to process the text
    Lucene.Net.Analysis.Analyzer analyzer =
        new Lucene.Net.Analysis.SimpleAnalyzer(); //this analyzer gets all hits except mvc2
        //new Lucene.Net.Analysis.Standard.StandardAnalyzer(); //this leaves out sql once and mvc2 once

    //create the index writer with the directory and analyzer defined
    Lucene.Net.Index.IndexWriter indexWriter = new Lucene.Net.Index.IndexWriter(
        dir, analyzer, /*true to create a new index*/ createNewIndex);

    //create a document and add a single field
    Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
    Lucene.Net.Documents.Field fldContent = new Lucene.Net.Documents.Field(
        "content",
        CvContent, //"This is some text to search by indexing"
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.ANALYZED,
        Lucene.Net.Documents.Field.TermVector.YES);
    doc.Add(fldContent);

    //write the document to the index
    indexWriter.AddDocument(doc);

    //optimize and close the writer
    indexWriter.Optimize();
    indexWriter.Close();
}
////Search code
private void button2_Click(object sender, EventArgs e)
{
    string searchString = textBox1.Text;

    //after creating an index, search it
    //state the file location of the index
    const string indexFileLocation = @"C:\RecruitmentIndexer\IndexedCVs";
    Lucene.Net.Store.Directory dir =
        Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, false);

    //create an index searcher that will perform the search
    Lucene.Net.Search.IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(dir);

    searchString = searchString.Trim();
    searchString = QueryParser.Escape(searchString);

    //build a query object
    Lucene.Net.Index.Term searchTerm = new Lucene.Net.Index.Term("content", searchString);
    Lucene.Net.Search.Query query = new Lucene.Net.Search.TermQuery(searchTerm);

    //execute the query
    Lucene.Net.Search.Hits hits = searcher.Search(query);
    label1.Text = hits.Length().ToString();

    //iterate over the results
    for (int i = 0; i < hits.Length(); i++)
    {
        Lucene.Net.Documents.Document docMatch = hits.Doc(i);
        MessageBox.Show(docMatch.Get("content"));
    }
}
It's actually SimpleAnalyzer that strips the "2" from "mvc2": its letter-based tokenizer splits on anything that isn't a letter, so only "mvc" lands in the index. StandardAnalyzer generally keeps alphanumeric tokens like "mvc2" intact, though it has quirks of its own. You could try WhitespaceAnalyzer, which doesn't strip out numbers.
You should also process your search input the same way you process your input at indexing time. A TermQuery is an exact match, which means that if you try to search for "mvc2" when the indexed term is actually "mvc", you won't get a hit.
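One way to see what each analyzer does to "mvc2" is simply to print the tokens it emits, using the same TokenStream/TermAttribute pattern as the snippet further down. This is only a rough sketch against the 2.9-era Lucene.Net API used in the question; the field name passed to TokenStream doesn't matter here:

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

static class AnalyzerDemo
{
    static void ShowTokens(Analyzer analyzer, string text)
    {
        using (var reader = new StringReader(text))
        {
            TokenStream stream = analyzer.TokenStream("content", reader);
            var term = stream.GetAttribute(typeof(TermAttribute)) as TermAttribute;
            Console.Write("{0}:", analyzer.GetType().Name);
            while (stream.IncrementToken())
                Console.Write(" [{0}]", term.Term());
            Console.WriteLine();
        }
    }

    static void Main()
    {
        // SimpleAnalyzer tokenizes on letters only, so the digit is dropped
        // and "mvc2" is indexed as "mvc"
        ShowTokens(new SimpleAnalyzer(), "asp.net mvc2 sql");
        // WhitespaceAnalyzer keeps "mvc2" intact (but also keeps case
        // and punctuation exactly as typed)
        ShowTokens(new WhitespaceAnalyzer(), "asp.net mvc2 sql");
    }
}
```

Whichever analyzer you pick, use the same one at index time and at query time.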
I haven't found a way to actually make use of an analyzer unless I use the QueryParser, and even then I always had odd results.
You could try this in order to "tokenize" your search string in the same way as you index your document, and make a boolean AND search on all terms:
// We use a boolean query to combine all prefix queries
var analyzer = new SimpleAnalyzer();
var query = new BooleanQuery();
using (var reader = new StringReader(queryTerms))
{
    // This is what we need to do in order to get the terms one by one;
    // kind of messy, but it seemed to be the only way
    var tokenStream = analyzer.TokenStream("why_do_I_need_this", reader);
    var termAttribute = tokenStream.GetAttribute(typeof(TermAttribute)) as TermAttribute;

    // IncrementToken returns false when all tokens have been processed
    while (tokenStream.IncrementToken())
    {
        var token = termAttribute.Term();
        query.Add(new PrefixQuery(new Term(KEYWORDS_FIELD_NAME, token)), BooleanClause.Occur.MUST);
    }

    // I don't know if this is necessary, but it can't hurt
    tokenStream.Close();
}
You can replace the PrefixQuery with a TermQuery if you only want full matches (a PrefixQuery matches anything starting with the term, i.e. "search*").
Related
I am new to Lucene, and I am facing serious issues with Lucene search. Searching records with a plain string, or a string with numbers, works fine, but searching with a string that contains special characters brings no results.
Examples:
example - brings results
'examples' - no results
%example% - no results
example2 - brings results
#example - no results
Code:
Indexing:
_document.Add(new Field(dc.ColumnName, dr[dc.ColumnName].ToString(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
Search query:
Lucene.Net.Store.Directory _dir = Lucene.Net.Store.FSDirectory.Open(Config.Get(directoryPath));
Lucene.Net.Analysis.Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
Query querySearch = queryParser.Parse("*" + searchParams.SearchForText + "*");
booleanQuery.Add(querySearch, Occur.MUST);
Can anyone help me fix this?
It appears there's work to be done. I urge you to get a good starter book on Lucene, such as Lucene in Action, Second Edition (since you're using version 3). Although it targets Java, the examples are easily adapted to C#, and really the concepts are what matter most.
First, this:
"*" + searchParams.SearchForText + "*"
Don't do that. Leading-wildcard searches are wildly inefficient and will consume an enormous amount of resources on a sizable index - doubly so for a combined leading and trailing wildcard. What would happen if the query text were *e*?
There also seems to be more going on than is shown in the posted code, since there is no reason not to get hits for those inputs. The snippet below produces the following console output:
Index terms:
example
example2
raw text %example% as query text:example got 1 hits
raw text 'example' as query text:example got 1 hits
raw text example as query text:example got 1 hits
raw text #example as query text:example got 1 hits
raw text example2 as query text:example2 got 1 hits
Wildcard raw text example* as query text:example* got 2 hit(s)
See the Index Terms listing? NO 'special characters' land in the index because StandardAnalyzer removes them at index time - assuming StandardAnalyzer is used to index the field?
I recommend running the snippet below in the debugger and observe what is happening.
public static void Example()
{
    var field_name = "text";
    var field_value = "%example% 'example' example #example example";
    var field_value2 = "example2";
    var luceneVer = Lucene.Net.Util.Version.LUCENE_30;

    using (var writer = new IndexWriter(new RAMDirectory(),
        new StandardAnalyzer(luceneVer), IndexWriter.MaxFieldLength.UNLIMITED))
    {
        var doc = new Document();
        var field = new Field(
            field_name,
            field_value,
            Field.Store.YES,
            Field.Index.ANALYZED,
            Field.TermVector.YES
        );
        doc.Add(field);
        writer.AddDocument(doc);

        doc = new Document();
        field = new Field(
            field_name,
            field_value2,
            Field.Store.YES,
            Field.Index.ANALYZED,
            Field.TermVector.YES
        );
        doc.Add(field);
        writer.AddDocument(doc);
        writer.Commit();
        Console.WriteLine();

        // Show ALL terms in the index.
        using (var reader = writer.GetReader())
        {
            TermEnum terms = reader.Terms();
            Console.WriteLine("Index terms:");
            while (terms.Next())
            {
                Console.WriteLine("\t{0}", terms.Term.Text);
            }
        }

        // Search for each word in the original content (field_value)
        using (var searcher = new IndexSearcher(writer.GetReader()))
        {
            string query_text;
            QueryParser parser;
            Query query;
            TopDocs topDocs;
            List<string> field_queries = new List<string>(field_value.Split(' '));
            field_queries.Add(field_value2);
            var analyzer = new StandardAnalyzer(luceneVer);
            while (field_queries.Count > 0)
            {
                query_text = field_queries[0];
                parser = new QueryParser(luceneVer, field_name, analyzer);
                query = parser.Parse(query_text);
                topDocs = searcher.Search(query, null, 100);
                Console.WriteLine();
                Console.WriteLine("raw text {0} as query {1} got {2} hit(s)",
                    query_text,
                    query,
                    topDocs.TotalHits
                );
                field_queries.RemoveAt(0);
            }

            // Now do a wildcard query "example*"
            query_text = "example*";
            parser = new QueryParser(luceneVer, field_name, analyzer);
            query = parser.Parse(query_text);
            topDocs = searcher.Search(query, null, 100);
            Console.WriteLine();
            Console.WriteLine("Wildcard raw text {0} as query {1} got {2} hit(s)",
                query_text,
                query,
                topDocs.TotalHits
            );
        }
    }
}
If you need to perform exact matching, and index certain characters like %, then you'll need to use something other than StandardAnalyzer, perhaps a custom analyzer.
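For illustration only, a minimal custom analyzer along those lines might pair a whitespace tokenizer with a lowercase filter, so characters like % and # survive into the index. The class name is made up, this targets the Lucene.Net 3.0-style Analyzer API, and it must be used at both index time and query time:

```csharp
using System.IO;
using Lucene.Net.Analysis;

// Sketch: keeps "%example%" and "#example" as whole terms,
// normalizing only the case.
public class KeepSymbolsAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // WhitespaceTokenizer splits only on whitespace, so special
        // characters stay inside the token; LowerCaseFilter normalizes case.
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}
```

Note that with this analyzer a search for `example` will no longer match `%example%` - the terms must be identical - which is exactly the trade-off of exact matching.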
I got a piece of code that adds a filter with Lucene.Net, but there was no good explanation to help me understand it, so I am pasting the code here for an explanation.
List<SearchResults> Searchresults = new List<SearchResults>();
string indexFileLocation = @"C:\o";
Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation);
string[] searchfields = new string[] { "fname", "lname", "dob", "id" };
IndexSearcher indexSearcher = new IndexSearcher(dir);
Filter fil = new QueryWrapperFilter(new TermQuery(new Term(field, "5/12/1998")));
var hits = indexSearcher.Search(QueryMaker(searchString, searchfields), fil);
for (int i = 0; i < hits.Length(); i++)
{
    SearchResults result = new SearchResults();
    result.fname = hits.Doc(i).GetField("fname").StringValue();
    result.lname = hits.Doc(i).GetField("lname").StringValue();
    result.dob = hits.Doc(i).GetField("dob").StringValue();
    result.id = hits.Doc(i).GetField("id").StringValue();
    Searchresults.Add(result);
}
I need an explanation of the two lines below:
Filter fil= new QueryWrapperFilter(new TermQuery( new Term(field, "5/12/1998")));
var hits = indexSearcher.Search(QueryMaker(searchString, searchfields), fil);
I would just like to know: does Lucene first search and pull all the data and then apply the filter, or does it pull data based on the filter from the beginning? Please guide me. Thanks.
Lucene.Net will run your search query AND your filter query, and afterwards it will "merge" the results. I believe the reason for doing it this way is so the filter can be cached, since the filter is more likely to get a hit on the next search than the search query is.
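To benefit from that caching explicitly, the filter can be wrapped in a CachingWrapperFilter, which keeps the filter's computed doc-id set per IndexReader. This is a hedged sketch reusing the QueryMaker helper and variables from the question (`anotherSearchString` is made up, and the API shown is the 2.x-era Hits-based one used above):

```csharp
// Build the date filter once and wrap it so its bitset is cached per reader.
Filter dobFilter = new QueryWrapperFilter(
    new TermQuery(new Term("dob", "5/12/1998")));
Filter cachedFilter = new CachingWrapperFilter(dobFilter);

// The first search computes and caches the filter's doc-id set;
// later searches on the same IndexReader reuse it and only
// re-evaluate the (changing) search query.
var firstHits = indexSearcher.Search(QueryMaker(searchString, searchfields), cachedFilter);
var laterHits = indexSearcher.Search(QueryMaker(anotherSearchString, searchfields), cachedFilter);
```

This is why filters pay off when the same restriction (here, the date of birth) is applied across many different searches.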
This is my code to search the Lucene index,
string docPath = @"c:\Test1.txt";
if (File.Exists(docPath))
{
    StreamReader fileReader = new StreamReader(docPath);
    StringBuilder content = new StringBuilder();
    content.Append(fileReader.ReadToEnd());
    if (content.ToString().Trim() != "")
    {
        FSDirectory direc = FSDirectory.Open(new DirectoryInfo(IndexDir));
        IndexReader indexReader = IndexReader.Open(direc, true);
        IndexSearcher searcher = new IndexSearcher(indexReader);
        QueryParser parser = new QueryParser(
            Lucene.Net.Util.Version.LUCENE_30,
            "Content",
            new StandardAnalyzer(
                Lucene.Net.Util.Version.LUCENE_29,
                new FileInfo(Application.StartupPath + Path.DirectorySeparatorChar + "noise.dat")));
        BooleanQuery.MaxClauseCount = Convert.ToInt32(content.ToString().Length);
        Query query = parser.Parse(QueryParser.Escape(content.ToString().ToLower()));
        TopDocs docs = searcher.Search(query, indexReader.MaxDoc);
    }
}
In this code I am opening a 15 MB text file and giving its contents to the index searcher as the query. The search takes a very long time and eventually throws an OutOfMemoryException; even parsing the query takes time. The index size is around 16K docs.
I suggest you change your approach: with each document, store an additional field that contains a hash of the file's content - an MD5 hash, for example.
Use your input to compute its hash, issue a query for that hash, and compare the matching documents with your input for equality.
It will be a lot more robust, and will probably be more performant too.
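A sketch of that approach might look like the following (the field name "ContentHash" and the helper names are made up; Field.Index.NOT_ANALYZED keeps the hash as a single exact term):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

static class HashLookup
{
    // Hex-encode an MD5 of the content so it can be stored as one term.
    static string Md5Hex(string content)
    {
        using (var md5 = MD5.Create())
        {
            byte[] bytes = md5.ComputeHash(Encoding.UTF8.GetBytes(content));
            return BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();
        }
    }

    // At index time: NOT_ANALYZED stores the hash as a single exact term.
    static void AddHash(Document doc, string content)
    {
        doc.Add(new Field("ContentHash", Md5Hex(content),
                          Field.Store.NO, Field.Index.NOT_ANALYZED));
    }

    // At search time: one cheap exact-term lookup instead of
    // parsing 15 MB of text into a query.
    static TopDocs FindByContent(IndexSearcher searcher, string content)
    {
        var query = new TermQuery(new Term("ContentHash", Md5Hex(content)));
        return searcher.Search(query, 1);
    }
}
```

Since hash collisions are theoretically possible, the final string comparison against the matched document's content is what makes the check exact.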
I'm trying to get a link and another element from an HTML page, but I don't really know what to do. This is what I have right now:
var client = new HtmlWeb(); // HtmlAgilityPack's web client.
var url = "http://p.thedgtl.net/index.php?tag=-1&title={0}&author=&o=u&od=d&page=-1&"; // The site/page we are indexing.
var doc = client.Load(string.Format(url, textBox1.Text)); // Load the whole listing.
var nodes = doc.DocumentNode.SelectNodes("//a[@href]"); // Get every link.
string authorName = "";
string fileName = "";
string fileNameWithExt;
foreach (HtmlNode link in nodes)
{
    string completeUrl = link.Attributes["href"].Value; // The complete plugin download url.
    #region Get all jars
    if (completeUrl.Contains(".jar")) // Check if the url contains .jar
    {
        fileNameWithExt = completeUrl.Substring(completeUrl.LastIndexOf('/') + 1); // Get the filename with extension.
        fileName = fileNameWithExt.Remove(fileNameWithExt.LastIndexOf('.')); // Get the filename without extension.
        Console.WriteLine(fileName);
    }
    #endregion
    #region Get all authors
    if (completeUrl.Contains("?author=")) // Check if the url contains ?author=
    {
        authorName = completeUrl.Substring(completeUrl.LastIndexOf('=') + 1); // Get the author name.
        Console.WriteLine(authorName);
    }
    #endregion
}
I am trying to get the filenames and authors printed next to each other, but right now everything comes out as if randomly placed. Why?
Can someone help me with this? Thanks!
If you look at the HTML, it is unfortunately not well-formed. There are a lot of unclosed tags, and the way HAP structures it is not like a browser: it interprets the majority of the document as deeply nested. So you can't simply iterate through the rows of the table like you would in a browser; it gets a lot more complicated than that.
When dealing with such documents, you have to change your queries quite a bit. Rather than searching through child elements, you have to search through descendants, adjusting for the change.
var title = System.Web.HttpUtility.UrlEncode(textBox1.Text);
var url = String.Format("http://p.thedgtl.net/index.php?title={0}", title);
var web = new HtmlWeb();
var doc = web.Load(url);

// select the rows in the table
var xpath = "//div[@class='content']/div[@class='pluginList']/table[2]";
var table = doc.DocumentNode.SelectSingleNode(xpath);

// unfortunately the `tr` tags are not closed, so HAP interprets
// this table as having a single row with multiple descendant `tr`s
var rows = table.Descendants("tr")
    .Skip(1); // skip header row
var query =
    from row in rows
    // there may be a row with an embedded ad
    where row.SelectSingleNode("td/script") == null
    // each row has 6 columns, so we need to grab the next 6 descendants
    let columns = row.Descendants("td").Take(6).ToList()
    let titleText = columns[1].Elements("a").Select(a => a.InnerText).FirstOrDefault()
    let authorText = columns[2].Elements("a").Select(a => a.InnerText).FirstOrDefault()
    let downloadLink = columns[5].Elements("a").Select(a => a.GetAttributeValue("href", null)).FirstOrDefault()
    select new
    {
        Title = titleText ?? "",
        Author = authorText ?? "",
        FileName = Path.GetFileName(downloadLink ?? ""),
    };
So now you can just iterate through the query and write out what you want for each of the rows.
foreach (var item in query)
{
Console.WriteLine("{0} ({1})", item.FileName, item.Author);
}
I want to search a Lucene.net index for a stored url field. My code is given below:
Field urlField = new Field("Url", url.ToLower(), Field.Store.YES, Field.Index.TOKENIZED);
document.Add(urlField);
indexWriter.AddDocument(document);
I am using the above code for writing into the index.
And the below code to search the Url in the index.
Lucene.Net.Store.Directory _directory = FSDirectory.GetDirectory(Host, false);
IndexReader reader = IndexReader.Open(_directory);
KeywordAnalyzer _analyzer = new KeywordAnalyzer();
IndexSearcher indexSearcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser("Url", _analyzer);
Query query = parser.Parse("\"" + downloadDoc.Uri.ToString() + "\"");
TopDocs hits = indexSearcher.Search(query, null, 10);
if (hits.totalHits > 0)
{
//statements....
}
But whenever I search for a url, for example http://www.xyz.com/, I do not get any hits.
Somehow I figured out an alternative, but it only works when there is a single document in the index; with more documents the code below does not yield correct results. Any ideas? Please help.
While writing the index, use KeywordAnalyzer()
KeywordAnalyzer _analyzer = new KeywordAnalyzer();
indexWriter = new IndexWriter(_directory, _analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
Then while searching also, use KeywordAnalyzer()
IndexReader reader = IndexReader.Open(_directory);
KeywordAnalyzer _analyzer = new KeywordAnalyzer();
IndexSearcher indexSearcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser("Url", _analyzer);
Query query = parser.Parse("\"" + url.ToString() + "\"");
TopDocs hits = indexSearcher.Search(query, null, 1);
This is because the KeywordAnalyzer "tokenizes" the entire input stream as a single token.
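Because the whole URL is a single term in the index, the QueryParser and the quoting can also be skipped entirely in favor of a direct TermQuery. A hedged sketch, reusing `url` and `indexSearcher` from the snippet above:

```csharp
// KeywordAnalyzer indexed the entire URL as one exact term, so an
// exact-match TermQuery (with identical casing) will find it without
// any query parsing or escaping.
var exact = new TermQuery(new Term("Url", url.ToString()));
TopDocs hits = indexSearcher.Search(exact, null, 10);
```

A TermQuery bypasses analysis altogether, so the string must match the indexed term byte-for-byte.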
Cheers,
Sunil
This worked for me:
IndexReader reader = IndexReader.Open(_directory);
IndexSearcher indexSearcher = new IndexSearcher(reader);
TermQuery tq = new TermQuery(new Term("Url", downloadDoc.Uri.ToString().ToLower()));
BooleanQuery bq = new BooleanQuery();
bq.Add(tq, BooleanClause.Occur.SHOULD);
TopScoreDocCollector collector = TopScoreDocCollector.Create(10, true);
indexSearcher.Search(bq, collector); //run the query through the collector
Use StandardAnalyzer while writing into the index.
This answer helped me: Lucene search by URL
Try putting quotes around the query, e.g. like this:
"http://www.google.com/"
Using the whitespace or keyword analyzer should work.
Would anyone actually search for "http://www.Google.com"? It seems more likely that a user would search for "Google" instead.
You can always return the entire URL if there is a partial match. I think the standard analyzer would be more appropriate for searching and retrieving a URL.