I am new to Lucene, and I am facing a serious issue with Lucene search. Searching with a plain string, or a string containing numbers, works fine, but searching with a string containing special characters brings no results.
For example:
example - brings results
'examples' - no results
%example% - no results
example2 - brings results
#example - no results
Code:
Indexing:
_document.Add(new Field(dc.ColumnName, dr[dc.ColumnName].ToString(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
Search query:
Lucene.Net.Store.Directory _dir = Lucene.Net.Store.FSDirectory.Open(Config.Get(directoryPath));
Lucene.Net.Analysis.Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
Query querySearch = queryParser.Parse("*" + searchParams.SearchForText + "*");
booleanQuery.Add(querySearch, Occur.MUST);
Can anyone help me fix this?
It appears there's work to be done. I urge getting a good starter book on Lucene, such as Lucene in Action, Second Edition (as you're using version 3). Although it targets Java, the examples are easily adapted to C#, and really the concepts are what matter most.
First, this:
"*" + searchParams.SearchForText + "*"
Don't do that. Leading wildcard searches are wildly inefficient and will consume an enormous amount of resources on a sizable index, doubly so for a combined leading and trailing wildcard search - what would happen if the query text was *e*?
There also seems to be more going on than shown in the posted code, as there is no reason not to be getting hits with those inputs. The snippet below produces the following console output:
Index terms:
example
example2
raw text %example% as query text:example got 1 hits
raw text 'example' as query text:example got 1 hits
raw text example as query text:example got 1 hits
raw text #example as query text:example got 1 hits
raw text example2 as query text:example2 got 1 hits
Wildcard raw text example* as query text:example* got 2 hit(s)
See the Index Terms listing? NO 'special characters' land in the index, because StandardAnalyzer removes them at index time - assuming StandardAnalyzer was also used to index the field.
I recommend running the snippet below in the debugger to observe what is happening.
public static void Example()
{
    var field_name = "text";
    var field_value = "%example% 'example' example #example example";
    var field_value2 = "example2";
    var luceneVer = Lucene.Net.Util.Version.LUCENE_30;

    using (var writer = new IndexWriter(new RAMDirectory(),
        new StandardAnalyzer(luceneVer), IndexWriter.MaxFieldLength.UNLIMITED))
    {
        var doc = new Document();
        var field = new Field(
            field_name,
            field_value,
            Field.Store.YES,
            Field.Index.ANALYZED,
            Field.TermVector.YES
        );
        doc.Add(field);
        writer.AddDocument(doc);

        doc = new Document();
        field = new Field(
            field_name,
            field_value2,
            Field.Store.YES,
            Field.Index.ANALYZED,
            Field.TermVector.YES
        );
        doc.Add(field);
        writer.AddDocument(doc);

        writer.Commit();
        Console.WriteLine();

        // Show ALL terms in the index.
        using (var reader = writer.GetReader())
        {
            TermEnum terms = reader.Terms();
            Console.WriteLine("Index terms:");
            while (terms.Next())
            {
                Console.WriteLine("\t{0}", terms.Term.Text);
            }
        }

        // Search for each word in the original content field_value
        using (var searcher = new IndexSearcher(writer.GetReader()))
        {
            string query_text;
            QueryParser parser;
            Query query;
            TopDocs topDocs;
            List<string> field_queries = new List<string>(field_value.Split(' '));
            field_queries.Add(field_value2);
            var analyzer = new StandardAnalyzer(luceneVer);

            while (field_queries.Count > 0)
            {
                query_text = field_queries[0];
                parser = new QueryParser(luceneVer, field_name, analyzer);
                query = parser.Parse(query_text);
                topDocs = searcher.Search(query, null, 100);
                Console.WriteLine();
                Console.WriteLine("raw text {0} as query {1} got {2} hit(s)",
                    query_text,
                    query,
                    topDocs.TotalHits
                );
                field_queries.RemoveAt(0);
            }

            // Now do a wildcard query "example*"
            query_text = "example*";
            parser = new QueryParser(luceneVer, field_name, analyzer);
            query = parser.Parse(query_text);
            topDocs = searcher.Search(query, null, 100);
            Console.WriteLine();
            Console.WriteLine("Wildcard raw text {0} as query {1} got {2} hit(s)",
                query_text,
                query,
                topDocs.TotalHits
            );
        }
    }
}
If you need to perform exact matching, and index certain characters like %, then you'll need to use something other than StandardAnalyzer, perhaps a custom analyzer.
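If it helps, here is a minimal sketch of that direction (my addition, not the original answer's code; the class name and choice of filters are illustrative assumptions). It tokenizes on whitespace and lowercases, so terms like %example% survive into the index verbatim, assuming the Lucene.Net 3.0 Analyzer API:

// Sketch only: a whitespace-based analyzer that keeps characters like % and #.
// KeepSpecialCharsAnalyzer is a made-up name for illustration.
public class KeepSpecialCharsAnalyzer : Lucene.Net.Analysis.Analyzer
{
    public override Lucene.Net.Analysis.TokenStream TokenStream(
        string fieldName, System.IO.TextReader reader)
    {
        return new Lucene.Net.Analysis.LowerCaseFilter(
            new Lucene.Net.Analysis.WhitespaceTokenizer(reader));
    }
}

Whatever you pick, use the same analyzer at both index time and query time, or the terms won't line up.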
Hello everyone.
I want to parse a 300+ MB text file with 2,000,000+ lines in it and perform some operations (split every line, make comparisons, save the data in a dictionary) on the stored data.
It takes the program ~50+ minutes to get the expected result (for files with 80,000 lines it takes about 15-20 seconds).
Is there any way to make it work faster?
Code sample below:
using (FileStream cut_file = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(cut_file))
using (StreamReader s_reader = new StreamReader(bs)) {
    string line;
    while ((line = s_reader.ReadLine()) != null) {
        string[] every_item = line.Split('|'); //line sample: jdsga237 | 3332, 3223, 121 |
        string car = every_item[0];
        string[] cameras = every_item[1].Split(',');
        if (!cars.Contains(car)) { //cars is List<string> defined at the beginning of the program
            for (int camera = 0; camera < cameras.Count(); camera++) {
                if (cams_input.Contains(cameras[camera])) { //cams_input is List<string> defined at the beginning of the program
                    cars.Add(car);
                    result[myfile]++; //result is Dictionary<string, int>. Used dict. for parsing several files.
                }
            }
        }
    }
}
Well, it is quite possible you have a problem linked to memory use.
However, you also have some blatant inefficiencies from needless Linq usage:
when you call Contains() on a List, you basically do a foreach over the List.
So, an improvement over your code is to use a HashSet instead of a List in order to speed up the Contains().
Same for calling Count() on the array in the for loop: it's an array, so just use Array.Length.
Anyway, you should profile the code on your machine (I use the JetBrains profiler and find it invaluable for this kind of performance work).
My take on this:
string myfile = "";
var cars = new HashSet<string>();
var cams_input = new HashSet<string>();
var result = new Dictionary<string, int>();

foreach (var line in System.IO.File.ReadLines(myfile, System.Text.Encoding.UTF8))
{
    var everyItem = line.Split('|'); //line sample: jdsga237 | 3332, 3223, 121 |
    var car = everyItem[0];
    if (cars.Contains(car)) continue;

    var cameras = everyItem[1].Split(',');
    for (int camera = 0; camera < cameras.Length; camera++)
    {
        if (cams_input.Contains(cameras[camera]))
        {
            cars.Add(car);
            // I really don't get who is inserting value zero.
            result[myfile]++;
        }
    }
}
Edit: as per your comment, the performance seemed to be related to the use of lists. You should read a guide to the collections available in the .NET Framework, like this one: http://www.codethinked.com/an-overview-of-system_collections_generic
Every type is best suited to a kind of task: the HashSet, for example, is meant to store a set (doh!) of unique values, and the really shiny feature it gives you is O(1) Contains operations.
What you pay for that is the storage and computation of the hashes.
You also lose sorting, etc.
A Dictionary is basically a HashSet with a value attached to each key.
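For instance, here is a sketch (my addition; the key name is illustrative) of the usual counting pattern with a Dictionary, which also sidesteps the "who is inserting value zero" question in the code above:

// Sketch: increment a per-file counter even if the key has never been seen.
// TryGetValue leaves 'count' at 0 when the key is missing.
var result = new Dictionary<string, int>();
string myfile = "file1.txt"; // illustrative key

int count;
result.TryGetValue(myfile, out count);
result[myfile] = count + 1;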
Good study!
Ps: if the problem is solved, please close the question.
I want to search a Lucene.net index for a stored url field. My code is given below:
Field urlField = new Field("Url", url.ToLower(), Field.Store.YES, Field.Index.TOKENIZED);
document.Add(urlField);
indexWriter.AddDocument(document);
I am using the above code for writing into the index.
And the below code to search the Url in the index.
Lucene.Net.Store.Directory _directory = FSDirectory.GetDirectory(Host, false);
IndexReader reader = IndexReader.Open(_directory);
KeywordAnalyzer _analyzer = new KeywordAnalyzer();
IndexSearcher indexSearcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser("Url", _analyzer);
Query query = parser.Parse("\"" + downloadDoc.Uri.ToString() + "\"");
TopDocs hits = indexSearcher.Search(query, null, 10);
if (hits.totalHits > 0)
{
//statements....
}
But whenever I search for a url, for example http://www.xyz.com/, I am not getting any hits.
Somehow, I figured out an alternative, but it only works when there is a single document in the index. If there are more documents, the code below will not yield correct results. Any ideas? Please help.
While writing the index, use KeywordAnalyzer()
KeywordAnalyzer _analyzer = new KeywordAnalyzer();
indexWriter = new IndexWriter(_directory, _analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
Then, while searching, also use KeywordAnalyzer():
IndexReader reader = IndexReader.Open(_directory);
KeywordAnalyzer _analyzer = new KeywordAnalyzer();
IndexSearcher indexSearcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser("Url", _analyzer);
Query query = parser.Parse("\"" + url.ToString() + "\"");
TopDocs hits = indexSearcher.Search(query, null, 1);
This is because the KeywordAnalyzer "tokenizes" the entire stream as a single token.
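For completeness, here is a sketch of the other common fix (my addition, not part of the original answers): index the URL as a single untokenized term, which behaves like KeywordAnalyzer for that field, and query it with a TermQuery so no analyzer or QueryParser is involved:

// Sketch, assuming the same Lucene.Net API as above.
// NOT_ANALYZED (UN_TOKENIZED in older versions) stores the whole URL as one term.
Field urlField = new Field("Url", url.ToLower(), Field.Store.YES,
    Field.Index.NOT_ANALYZED);
document.Add(urlField);

// At search time, match the exact term directly.
Query query = new TermQuery(new Term("Url", url.ToLower()));
TopDocs hits = indexSearcher.Search(query, null, 10);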
This worked for me:
IndexReader reader = IndexReader.Open(_directory);
IndexSearcher indexSearcher = new IndexSearcher(reader);
TermQuery tq = new TermQuery(new Term("Url", downloadDoc.Uri.ToString().ToLower()));
BooleanQuery bq = new BooleanQuery();
bq.Add(tq, BooleanClause.Occur.SHOULD);
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
indexSearcher.Search(bq, collector); // run the query; the collector gathers the top 10 hits
Use StandardAnalyzer while writing into the index.
This answer helped me: Lucene search by URL
Try putting quotes around the query, e.g. like this:
"http://www.google.com/"
Using the whitespace or keyword analyzer should work.
Would anyone actually search for "http://www.Google.com"? It seems more likely that a user would search for "Google" instead.
You can always return the entire URL if there is a partial match. I think the standard analyzer is more appropriate for searching and retrieving a URL.
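If both behaviors are wanted, one option (a sketch of my own; the field names are illustrative) is to index the URL into two parallel fields, one untokenized for exact lookups and one analyzed for partial matches:

// Sketch: "UrlExact" supports exact TermQuery lookups;
// "Url" is analyzed so a search for "google" can still hit the URL.
doc.Add(new Field("UrlExact", url.ToLower(), Field.Store.YES,
    Field.Index.NOT_ANALYZED));
doc.Add(new Field("Url", url, Field.Store.YES,
    Field.Index.ANALYZED));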
As of now, I am using this code to open a file, read it into a list, and parse that list into a string[]:
string CP4DataBase =
    "C:\\Program\\Line Balancer\\FUJI DB\\KTS\\KTS - CP4 - Part Data Base.txt";
CP4DataBaseRTB.LoadFile(CP4DataBase, RichTextBoxStreamType.PlainText);
string[] splitCP4DataBaseLines = CP4DataBaseRTB.Text.Split('\n');
List<string> tempCP4List = new List<string>();
string[] line1CP4Components;

foreach (var line in splitCP4DataBaseLines)
    tempCP4List.Add(line + Environment.NewLine);

string concattedUnitPart = "";
foreach (var line in tempCP4List)
{
    concattedUnitPart = concattedUnitPart + line;
    line1CP4PartLines++;
}

line1CP4Components = new Regex("\"UNIT\",\"PARTS\"", RegexOptions.Multiline)
    .Split(concattedUnitPart)
    .Where(c => !string.IsNullOrEmpty(c)).ToArray();
I am wondering if there is a quicker way to do this. This is just one of the files I am opening, so this is repeated a minimum of 5 times to open and properly load the lists.
The minimum file size being imported right now is 257 KB. The largest file is 1,803 KB. These files will only get larger as time goes on as they are being used to simulate a database and the user will continually add to them.
So my question is, is there a quicker way to do all of the above code?
EDIT:
***CP4***
"UNIT","PARTS"
"BLOCK","HEADER-"
"NAME","106536"
"REVISION","0000"
"DATE","11/09/03"
"TIME","11:10:11"
"PMABAR",""
"COMMENT",""
"PTPNAME","R160805"
"CMPNAME","R160805"
"BLOCK","PRTIDDT-"
"PMAPP",1
"PMADC",0
"ComponentQty",180
"BLOCK","PRTFORM-"
"PTPSZBX",1.60
"PTPSZBY",0.80
"PTPMNH",0.25
"NeedGlue",0
"BLOCK","TOLEINF-"
"PTPTLBX",0.50
"PTPTLBY",0.40
"PTPTLCL",10
"PTPTLPX",0.30
"PTPTLPY",0.30
"PTPTLPQ",30
"BLOCK","ELDT+" "PGDELSN","PGDELX","PGDELY","PGDELPP","PGDELQ","PGDELP","PGDELW","PGDELL","PGDELWT","PGDELLT","PGDELCT","PGDELR"
0,0.000,0.000,0,0,0.000,0.000,0.000,0.000,0.000,0.000,0
"BLOCK","VISION-"
"PTPVIPL",0
"PTPVILCA",0
"PTPVILB",0
"PTPVICVT",10
"PENVILIT",0
"BLOCK","ENVDT"
"ELEMENT","CP43ENVDT-"
"PENNMI",1.0
"PENNMA",1.0
"PENNZN",""
"PENNZT",1.0
"PENBLM",12
"PENCRTS",0
"PENSPD1",100
"PTPCRDCT",0
"PENVICT",1
"PCCCRFT",1
"BLOCK","CARRING-"
"PTPCRAPO",0
"PTPCRPCK",0
"PTPCRPUX",0.00
"PTPCRPUY",0.00
"PTPCRRCV",0
"BLOCK","PACKCLS-"
"FDRTYPE","Emboss"
"TAPEWIDTH","8mm"
"FEEDPITCH",4
"REELDIAMETER",0
"TAPEDEPTH",0.0
"DOADVVACUUM",0
"CHKBEFOREFEED",0
"TAPEARMLENGTH",0
"PPCFDPP",0
"PPCFDEC",4
"PPCMNPT",30
"UNIT","PARTS"
"BLOCK","HEADER-"
"NAME","106653"
"REVISION","0000"
"DATE","11/09/03"
"TIME","11:10:42"
"PMABAR",""
"COMMENT",""
"PTPNAME","0603R"
"CMPNAME","0603R"
"BLOCK","PRTIDDT-"
"PMAPP",1
"PMADC",0
"ComponentQty",18
"BLOCK","PRTFORM-"
"PTPSZBX",1.60
"PTPSZBY",0.80
"PTPMNH",0.23
"NeedGlue",0
"BLOCK","TOLEINF-"
"PTPTLBX",0.50
"PTPTLBY",0.34
"PTPTLCL",0
"PTPTLPX",0.60
"PTPTLPY",0.40
"PTPTLPQ",30
"BLOCK","ELDT+" "PGDELSN","PGDELX","PGDELY","PGDELPP","PGDELQ","PGDELP","PGDELW","PGDELL","PGDELWT","PGDELLT","PGDELCT","PGDELR"
0,0.000,0.000,0,0,0.000,0.000,0.000,0.000,0.000,0.000,0
"BLOCK","VISION-"
"PTPVIPL",0
"PTPVILCA",0
"PTPVILB",0
"PTPVICVT",10
"PENVILIT",0
"BLOCK","ENVDT"
"ELEMENT","CP43ENVDT-"
"PENNMI",1.0
"PENNMA",1.0
"PENNZN",""
"PENNZT",1.0
"PENBLM",12
"PENCRTS",0
"PENSPD1",80
"PTPCRDCT",0
"PENVICT",1
"PCCCRFT",1
"BLOCK","CARRING-"
"PTPCRAPO",0
"PTPCRPCK",0
"PTPCRPUX",0.00
"PTPCRPUY",0.00
"PTPCRRCV",0
"BLOCK","PACKCLS-"
"FDRTYPE","Emboss"
"TAPEWIDTH","8mm"
"FEEDPITCH",4
"REELDIAMETER",0
"TAPEDEPTH",0.0
"DOADVVACUUM",0
"CHKBEFOREFEED",0
"TAPEARMLENGTH",0
"PPCFDPP",0
"PPCFDEC",4
"PPCMNPT",30
... the file goes on and on and on.. and will only get larger.
The regex is placing each "UNIT","PARTS" marker and the code that follows it, up until the NEXT "UNIT","PARTS", into a string[].
After this, I check each string[] to see if its "NAME" section exists in a different list. If it does, I output that "UNIT","PARTS" block at the end of a text file.
This bit is a potential performance killer:
string concattedUnitPart = "";
foreach (var line in tempCP4List)
{
    concattedUnitPart = concattedUnitPart + line;
    line1CP4PartLines++;
}
(See this article for why.) Use a StringBuilder for repeated concatenation:
// No need to use tempCP4List at all
StringBuilder builder = new StringBuilder();
foreach (var line in splitCP4DataBaseLines)
{
    builder.AppendLine(line);
    line1CP4PartLines++;
}
Or even just:
string concattedUnitPart = string.Join(Environment.NewLine,
splitCP4DataBaseLines);
Now the regex part may well also be slow - I'm not sure. It's not obvious what you're trying to achieve, whether you need regular expressions at all, or whether you really need to do the whole thing in one go. Can you definitely not just process it line by line?
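To illustrate the line-by-line idea, here is a sketch (mine, not the poster's code; it assumes each unit starts with a "UNIT","PARTS" line, as in the sample data above):

// Sketch: stream the file and cut it into "UNIT","PARTS" sections as we go,
// so we never build one giant concatenated string.
var sections = new List<string>();
StringBuilder current = null;
foreach (string line in File.ReadLines(CP4DataBase))
{
    if (line.StartsWith("\"UNIT\",\"PARTS\""))
    {
        if (current != null)
            sections.Add(current.ToString()); // close out the previous section
        current = new StringBuilder();
    }
    if (current != null)
        current.AppendLine(line);
}
if (current != null)
    sections.Add(current.ToString()); // don't forget the last section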
You could achieve the same output list 'line1CP4Components' using the following:
Regex StripEmptyLines = new Regex(@"^\s*$", RegexOptions.Multiline);
Regex UnitPartsMatch = new Regex(@"(?<=\n)""UNIT"",""PARTS"".*?(?=(?:\n""UNIT"",""PARTS"")|$)", RegexOptions.Singleline);

string CP4DataBase =
    "C:\\Program\\Line Balancer\\FUJI DB\\KTS\\KTS - CP4 - Part Data Base.txt";
CP4DataBaseRTB.LoadFile(CP4DataBase, RichTextBoxStreamType.PlainText);

List<string> line1CP4Components = new List<string>(
    UnitPartsMatch.Matches(StripEmptyLines.Replace(CP4DataBaseRTB.Text, ""))
        .OfType<Match>()
        .Select(m => m.Value)
);
return line1CP4Components.ToArray();
You may be able to ignore the use of StripEmptyLines, but your original code is doing this via the Where(c => !string.IsNullOrEmpty(c)). Also, your original code causes the '\r' part of each "\r\n" newline pair to be duplicated. I assumed this was an accident and not intentional?
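(As an aside, a one-line sketch of a fix for that duplication: split on the full newline pair rather than on '\n' alone.)

// Splitting on both "\r\n" and "\n" avoids leaving a stray '\r'
// at the end of every line.
string[] splitCP4DataBaseLines = CP4DataBaseRTB.Text.Split(
    new[] { "\r\n", "\n" }, StringSplitOptions.None);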
Also, you don't seem to be using the value in 'line1CP4PartLines', so I omitted its creation. It was seemingly inconsistent with the later omission of empty lines, so I guess you're not depending on it. If you do need this value, a simple regex can tell you how many lines are in the string:
int linecount = new Regex("^", RegexOptions.Multiline).Matches(CP4DataBaseRTB.Text).Count;
// example of what your code will look like
string CP4DataBase = "C:\\Program\\Line Balancer\\FUJI DB\\KTS\\KTS - CP4 - Part Data Base.txt";
List<string> Cp4DataList = new List<string>(File.ReadAllLines(CP4DataBase));

//or create a Dictionary<int,string[]> object
string strData = string.Empty; //holds the line-item data, which is read in line by line
string[] strStockListRecord = null; //string array that holds information from the TFE_Stock.txt file
Dictionary<int, string[]> dctStockListRecords = null; //dictionary object that will hold the KeyValuePairs of text file contents
List<string> lstStockListRecord = null; //generic list that will store all the lines from the .prn file being processed

if (File.Exists(strExtraLoadFileLoc + strFileName))
{
    try
    {
        lstStockListRecord = new List<string>();
        List<string> lstStrLinesStockRecord = new List<string>(File.ReadAllLines(strExtraLoadFileLoc + strFileName));
        dctStockListRecords = new Dictionary<int, string[]>(lstStrLinesStockRecord.Count);
        int intLineCount = 0;
        foreach (string strLineSplit in lstStrLinesStockRecord)
        {
            lstStockListRecord.Add(strLineSplit);
            dctStockListRecords.Add(intLineCount, lstStockListRecord.ToArray());
            lstStockListRecord.Clear();
            intLineCount++;
        } //foreach (string strLineSplit in lstStrLinesStockRecord)
        lstStrLinesStockRecord.Clear();
        lstStrLinesStockRecord = null;
        lstStockListRecord.Clear();
        lstStockListRecord = null;
    }
    catch (Exception)
    {
        //handle or log the exception as needed
    }
}
//Alter the code to fit what you are doing..
I am indexing and searching with Lucene.net. The only problem I am having is that my code does not find any hits when searching for "mvc2" (it seems to work with all the other words I search). I have tried a different analyzer (see the comments by the analyzer) and older Lucene code. Here is my indexing and search code; I would really appreciate it if someone could show me where I am going wrong with this. Thanks.
////Indexing code
public void DoIndexing(string CvContent)
{
    //state the file location of the index
    const string indexFileLocation = @"C:\RecruitmentIndexer\IndexedCVs";

    //if the directory does not exist, create it and create a new index for it;
    //if it does exist, do not create the directory and do not create a new
    //index (add fields to the previous index).
    bool createNewDirectory; //to pass into Lucene's GetDirectory
    bool createNewIndex;     //to pass into Lucene's IndexWriter
    if (!Directory.Exists(indexFileLocation))
    {
        createNewDirectory = true;
        createNewIndex = true;
    }
    else
    {
        createNewDirectory = false;
        createNewIndex = false;
    }

    Lucene.Net.Store.Directory dir =
        Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, createNewDirectory); //creates if true

    //create an analyzer to process the text
    Lucene.Net.Analysis.Analyzer analyzer = new
        Lucene.Net.Analysis.SimpleAnalyzer(); //this analyzer gets all hits except mvc2
        //Lucene.Net.Analysis.Standard.StandardAnalyzer(); //this leaves out sql once and mvc2 once

    //create the index writer with the directory and analyzer defined.
    Lucene.Net.Index.IndexWriter indexWriter = new
        Lucene.Net.Index.IndexWriter(dir, analyzer,
        /*true to create a new index*/ createNewIndex);

    //create a document, add in a single field
    Lucene.Net.Documents.Document doc = new
        Lucene.Net.Documents.Document();
    Lucene.Net.Documents.Field fldContent =
        new Lucene.Net.Documents.Field("content",
            CvContent, //"This is some text to search by indexing",
            Lucene.Net.Documents.Field.Store.YES,
            Lucene.Net.Documents.Field.Index.ANALYZED,
            Lucene.Net.Documents.Field.TermVector.YES);
    doc.Add(fldContent);

    //write the document to the index
    indexWriter.AddDocument(doc);

    //optimize and close the writer
    indexWriter.Optimize();
    indexWriter.Close();
}
////search code
private void button2_Click(object sender, EventArgs e)
{
    string SearchString = textBox1.Text;

    ///after creating an index, search
    //state the file location of the index
    const string indexFileLocation = @"C:\RecruitmentIndexer\IndexedCVs";
    Lucene.Net.Store.Directory dir =
        Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, false);

    //create an index searcher that will perform the search
    Lucene.Net.Search.IndexSearcher searcher = new
        Lucene.Net.Search.IndexSearcher(dir);

    SearchString = SearchString.Trim();
    SearchString = QueryParser.Escape(SearchString);

    //build a query object
    Lucene.Net.Index.Term searchTerm =
        new Lucene.Net.Index.Term("content", SearchString);
    Lucene.Net.Search.Query query = new Lucene.Net.Search.TermQuery(searchTerm);

    //execute the query
    Lucene.Net.Search.Hits hits = searcher.Search(query);
    label1.Text = hits.Length().ToString();

    //iterate over the results.
    for (int i = 0; i < hits.Length(); i++)
    {
        Lucene.Net.Documents.Document docMatch = hits.Doc(i);
        MessageBox.Show(docMatch.Get("content"));
    }
}
I believe it is actually SimpleAnalyzer that strips out the "2" from "mvc2" (it tokenizes on letters only), leaving the indexed word as just "mvc" - which matches your comment that SimpleAnalyzer gets all hits except mvc2. You could try WhitespaceAnalyzer, which doesn't strip out numbers.
You should also process your search input the same way that you process indexing. A TermQuery is an "identical" match, which means that if you search for "mvc2" when the actual term in your index is "mvc", you won't get a match.
I haven't found a way to actually make use of an analyzer unless I use the QueryParser, and even then I've always had odd results.
You could try this in order to "tokenize" your search string in the same way as you index your document, and make a boolean AND search on all terms:
// We use a boolean query to combine all prefix queries
var analyzer = new SimpleAnalyzer();
var query = new BooleanQuery();
using ( var reader = new StringReader( queryTerms ) )
{
    // This is what we need to do in order to get the terms one by one; kind of messy, but it seemed to be the only way
    var tokenStream = analyzer.TokenStream( "why_do_I_need_this", reader );
    var termAttribute = tokenStream.GetAttribute( typeof( TermAttribute ) ) as TermAttribute;

    // This will return false when all tokens have been processed.
    while ( tokenStream.IncrementToken() )
    {
        var token = termAttribute.Term();
        query.Add( new PrefixQuery( new Term( KEYWORDS_FIELD_NAME, token ) ), BooleanClause.Occur.MUST );
    }

    // I don't know if this is necessary, but it can't hurt
    tokenStream.Close();
}
You can replace the PrefixQuery with a TermQuery if you only want full matches (a PrefixQuery matches anything starting with the term, e.g. "search*").
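For example, the full-match variant swaps just the one line inside the loop (a sketch):

// Exact term match instead of prefix match.
query.Add( new TermQuery( new Term( KEYWORDS_FIELD_NAME, token ) ), BooleanClause.Occur.MUST );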