I am using the Lucene.Net 3 library to index data and search over it, but sometimes it throws a weird error internally and I cannot understand why. Below is the code I use to search the index:
using (var analyzer = new StandardAnalyzer(Version.LUCENE_30))
{
    using (var reader = IndexReader.Open(directory, true))
    {
        using (var searcher = new IndexSearcher(reader))
        {
            term = QueryParser.Escape(term);
            var parsedQuery =
                new MultiFieldQueryParser(Version.LUCENE_30, includedFields, analyzer)
                    .Parse(term);
            var result = searcher.Search(parsedQuery, reader.MaxDoc); //Here happens the error
            totalCount = result.TotalHits;
            var matches = result.ScoreDocs.OrderByDescending(x => x.Score).ToPaginated(page, size);
            foreach (var match in matches)
            {
                int docId = match.Doc;
                var document = searcher.Doc(docId);
                var item = //...
                list.Add(item);
            }
        }
    }
}
The error does not always happen; it only occurs for certain search terms, so sometimes the search works and sometimes it crashes.
I am escaping the term, but still no luck. Does anybody have any idea what I might be doing wrong here?
Here is the error message:
at Lucene.Net.Search.TermScorer.Score() in d:\Lucene.Net\FullRepo\trunk\src\core\Search\TermScorer.cs:line 136
at Lucene.Net.Search.DisjunctionSumScorer.AdvanceAfterCurrent() in d:\Lucene.Net\FullRepo\trunk\src\core\Search\DisjunctionSumScorer.cs:line 187
at Lucene.Net.Search.DisjunctionSumScorer.NextDoc() in d:\Lucene.Net\FullRepo\trunk\src\core\Search\DisjunctionSumScorer.cs:line 155
at Lucene.Net.Search.BooleanScorer2.NextDoc() in d:\Lucene.Net\FullRepo\trunk\src\core\Search\BooleanScorer2.cs:line 397
at Lucene.Net.Search.BooleanScorer.NextDoc() in d:\Lucene.Net\FullRepo\trunk\src\core\Search\BooleanScorer.cs:line 369
at Lucene.Net.Search.BooleanScorer.Score(Collector collector) in d:\Lucene.Net\FullRepo\trunk\src\core\Search\BooleanScorer.cs:line 389
at Lucene.Net.Search.IndexSearcher.Search(Weight weight, Filter filter, Collector collector) in d:\Lucene.Net\FullRepo\trunk\src\core\Search\IndexSearcher.cs:line 221
at Lucene.Net.Search.IndexSearcher.Search(Weight weight, Filter filter, Int32 nDocs) in d:\Lucene.Net\FullRepo\trunk\src\core\Search\IndexSearcher.cs:line 188
at Blog.Services.SearchService.<>c__DisplayClass3_0.b__0() in E:\Blog\Blog\Blog.Services\SearchService.cs:line 98
at
Here is one string that makes it crash: "212 i think webapp using ef db first i am trying to use lazy". The string has no particular meaning, but it is still strange to me why it should make Lucene crash...
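For reference, a more defensive variant of the same search call is sketched below. It is not a diagnosis of the crash; it only escapes the term, catches ParseException explicitly, and guards the nDocs argument against an empty index. The term, includedFields and directory variables are the ones from the code above; the fallback TermQuery is an assumption, not part of the original code.

using (var analyzer = new StandardAnalyzer(Version.LUCENE_30))
using (var reader = IndexReader.Open(directory, true))
using (var searcher = new IndexSearcher(reader))
{
    Query parsedQuery;
    try
    {
        var escaped = QueryParser.Escape(term);
        parsedQuery = new MultiFieldQueryParser(Version.LUCENE_30, includedFields, analyzer)
            .Parse(escaped);
    }
    catch (Lucene.Net.QueryParsers.ParseException)
    {
        // Fall back to a plain term query on the first field (assumption: an
        // acceptable fallback for unparseable input).
        parsedQuery = new TermQuery(new Term(includedFields[0], term.ToLowerInvariant()));
    }

    // Avoid calling Search with a non-positive document count on an empty index.
    int maxDocs = Math.Max(1, reader.MaxDoc);
    var result = searcher.Search(parsedQuery, maxDocs);
    // ... same paging / mapping of result.ScoreDocs as in the original code ...
}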
I'm trying to replicate results from Gensim in C# to compare them and see whether we need to bother getting Python to work within our broader C# context. I have been programming in C# for about a week; I'm usually a Python coder. I managed to get LDA working and assigning topics in C#, but there is no Catalyst model (that I could find) that does Doc2Vec explicitly; instead I apparently need to do something with FastText, as in their sample code:
// Training a new FastText word2vec embedding model is as simple as this:
var nlp = await Pipeline.ForAsync(Language.English);
var ft = new FastText(Language.English, 0, "wiki-word2vec");
ft.Data.Type = FastText.ModelType.CBow;
ft.Data.Loss = FastText.LossType.NegativeSampling;
ft.Train(nlp.Process(GetDocs()));
await ft.StoreAsync();
The claim is that it is simple, and fair enough... but what do I do with this? I am using my own data, a list of IDocuments, each with a label attached:
using (var csv = CsvDataReader.Create("Jira_Export_Combined.csv", new CsvDataReaderOptions
{
    BufferSize = 0x20000
}))
{
    while (await csv.ReadAsync())
    {
        var a = csv.GetString(1);  // issue key
        var b = csv.GetString(14); // the actual bug
        // if (jira_base.Keys.ToList().Contains(a) == false)
        if (!jira.ContainsKey(a))
        { // not already in our dictionary... too many repeats
            if (b.Contains("{panel"))
            {
                // get just the details/desc/etc
                b = b.Substring(b.IndexOf("}") + 1, b.Length - b.IndexOf("}") - 1);
                try { b = b.Substring(0, b.IndexOf("{panel}")); }
                catch { }
            }
            b = b.Replace("\r\n", "");
            jira.Add(a, nlp.ProcessSingle(new Document(b, Language.English)));
        } // end if
    } // end while loop
}
That builds the dictionary of IDocuments from a set of Jira tasks; then I add labels:
foreach (KeyValuePair<string, IDocument> item in jira) { jira[item.Key].Labels.Add(item.Key); }
Then I build a training list based on a breakdown from a topic model, where I assign every document at or above a threshold for a topic to that topic (jira_topics[n], where n is the topic number), as such:
var training_lst = new List<IDocument>();
foreach (var doc in jira_topics[topic_num]) { training_lst.Add(jira[doc]); }
When I run the following code:
// FastText....
var ft = new FastText(Language.English, 0, $"vector-model-topic_{topic_num}");
ft.Data.Type = FastText.ModelType.Skipgram;
ft.Data.Loss = FastText.LossType.NegativeSampling;
ft.Train(training_lst);
var wtf = ft.PredictMax(training_lst[0]);
wtf is (null,NaN). [hence the name]
What am I missing? What else do I need to do to get Catalyst to vectorize my data? I want to grab the cosine similarities between the jira tasks and some other data I have, but I can't even get the Jira data into anything resembling a vectorization I can apply to something. Help!
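(Side note on the eventual similarity step: whatever model ends up producing the vectors, cosine similarity itself is plain arithmetic. A minimal sketch over raw float[] vectors, with no Catalyst-specific API assumed:)

// Cosine similarity between two dense vectors of equal length.
static double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}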
Update:
So, Predict methods apparently only work for supervised learning in FastText (see comments below). And the following:
var wtf = ft.CompareDocuments(training_lst[0], training_lst[0]);
This throws an implementation error (and it only fails with PVDM). How do I use PVDM and PVDCbow in Catalyst?
Hello everyone.
I want to parse a 300+ MB text file with 2,000,000+ lines and perform some operations on the stored data (split every line, make comparisons, save data in a dictionary).
It takes the program ~50+ minutes to get the expected result (for files with 80,000 lines it takes about 15-20 seconds).
Is there any way to make it to work faster?
Code sample below:
using (FileStream cut_file = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(cut_file))
using (StreamReader s_reader = new StreamReader(bs)) {
    string line;
    while ((line = s_reader.ReadLine()) != null) {
        string[] every_item = line.Split('|'); // line sample: jdsga237 | 3332, 3223, 121 |
        string car = every_item[0];
        string[] cameras = every_item[1].Split(',');
        if (!cars.Contains(car)) { // cars is a List<string> defined at the beginning of the program
            for (int camera = 0; camera < cameras.Count(); camera++) {
                if (cams_input.Contains(cameras[camera])) { // cams_input is a List<string> defined at the beginning of the program
                    cars.Add(car);
                    result[myfile]++; // result is a Dictionary<string, int>; used for parsing several files.
                }
            }
        }
    }
}
Well, it is quite possible you have a problem linked to memory use.
However, there are some blatant inefficiencies from unnecessary LINQ usage:
When you call Contains() on a List, you basically do a foreach over the whole List.
So one improvement over your code is to use a HashSet instead of a List in order to speed up the Contains() calls.
The same goes for calling Count() on the array in the for loop: it's an array, so just use its Length.
Anyway, you should profile the code in your machine (I use the JetBrains Profiler and find it invaluable to do this kind of performance profiling).
My take on this:
string myfile = "";
var cars = new HashSet<string>();
var cams_input = new HashSet<string>();
var result = new Dictionary<string, int>();
foreach (var line in System.IO.File.ReadLines(myfile, System.Text.Encoding.UTF8))
{
    var everyItem = line.Split('|'); // line sample: jdsga237 | 3332, 3223, 121 |
    var car = everyItem[0];
    if (cars.Contains(car)) continue;
    var cameras = everyItem[1].Split(',');
    for (int camera = 0; camera < cameras.Length; camera++)
    {
        if (cams_input.Contains(cameras[camera]))
        {
            cars.Add(car);
            // I really don't get who is inserting value zero.
            result[myfile]++;
        }
    }
}
Edit: As per your comment, the performance seemed to be related to the use of lists. You should read a guide about the collections available in the .NET Framework, like this one: http://www.codethinked.com/an-overview-of-system_collections_generic
Every type is best suited to a kind of task: the HashSet, for example, is meant to store a set (doh!) of unique values, and the really shiny feature it gives you is O(1) Contains operations.
What you pay for is the storage of the hashes and the cost of computing them.
You also lose ordering, etc.
A Dictionary is basically a HashSet with a value attached to each key.
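As a side note on the comment left in the code above: result[myfile]++ throws a KeyNotFoundException unless the key has been seeded with a value first. A minimal sketch of a self-initializing increment (myfile is the key name from the question):

// Increment the per-file counter, creating the entry on first use.
int count;
if (result.TryGetValue(myfile, out count))
    result[myfile] = count + 1;
else
    result[myfile] = 1;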
Good study!
Ps: if the problem is solved, please close the question.
I am encountering an error when inserting bulk data with the upsert function and cannot figure out how to fix it. Does anyone know what is wrong here? The program essentially grabs data from a SQL Server database and loads it into our Couchbase bucket on an Amazon instance. It does begin loading, but after about 10 or so upserts it crashes.
My error is as follows:
Collection was modified; enumeration operation may not execute.
Here are screenshots of the error (sorry, the error is only reproduced on my other Amazon server instance and not locally):
http://imgur.com/a/ZJB0c
Here is the function that calls the upsert method. It is called multiple times, since I retrieve only part of the data at a time because the SQL table is very large.
private void receiptItemInsert(double i, int k) {
    const int BATCH_SIZE = 10000;
    APSSEntities entity = new APSSEntities();
    var data = entity.ReceiptItems.OrderBy(x => x.ID).Skip((int)i * BATCH_SIZE).Take(BATCH_SIZE);
    var joinedData = from d in data
                     join s in entity.Stocks
                         on new { stkId = (Guid)d.StockID } equals new { stkId = s.ID } into ps
                     from s in ps.DefaultIfEmpty()
                     select new { d, s };
    var stuff = joinedData.ToList();
    var dict = new Dictionary<string, dynamic>();
    foreach (var ri in stuff)
    {
        Stock stock = new Stock();
        var ritem = new CouchModel.ReceiptItem(ri.d, k, ri.s);
        string key = "receipt_item:" + k.ToString() + ":" + ri.d.ID.ToString();
        dict.Add(key, ritem);
    }
    entity.Dispose();
    using (var cluster = new Cluster(config))
    {
        // open buckets here
        using (var bucket = cluster.OpenBucket("myhoney"))
        {
            bucket.Upsert(dict); // crashes here
        }
    }
}
As discussed in the Couchbase Forums, this is probably a bug in the SDK.
When initializing the internal map of the Couchbase cluster, the SDK constructs a List of endpoints. If two or more threads (as is the case during a bulk upsert) trigger this code at the same time, one may see an instance of the List still being populated by the other (because the lock is entered just after a call to List.Any(), which may crash if the list is being modified).
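Until a fixed SDK is available, one hedged workaround is to avoid hitting that initialization from several threads at once, for example by upserting the documents one at a time instead of passing the whole dictionary to the bulk overload. A sketch using the dict, config and bucket name from the question (single-key Upsert is the standard SDK call and returns an operation result):

// Workaround sketch: sequential upserts, so the first call initializes the
// cluster map before any concurrent access can happen.
using (var cluster = new Cluster(config))
using (var bucket = cluster.OpenBucket("myhoney"))
{
    foreach (var pair in dict)
    {
        var opResult = bucket.Upsert(pair.Key, pair.Value);
        if (!opResult.Success)
        {
            // Log / retry as appropriate; omitted in this sketch.
        }
    }
}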
I am trying to figure out how I can give more weight to a description that contains the same word multiple times, so that it appears first in Lucene.Net (C#) search results.
Example:
Pre-condition:
Let's say I have a list of items like this:
1. Restore Exchange
2. Backup exchange
3. exchange is a really great tool, exchange can have many mailboxes
Scenario:
I search for exchange.
The list would be returned in this order:
1. Restore Exchange (it has the same weight as #2 and it was added to the index first)
2. Backup exchange (it has the same weight as #1 and it was added to the index second)
3. exchange is a really great tool, exchange can have many mailboxes (it has a reference to exchange in it, but its length is greater than #1 and #2)
So I am trying to get #3 to show up first, since it has exchange in the description more than once.
Here is some code showing that I set the Similarity:
// set up lucene searcher
using (var searcher = new IndexSearcher(directory, false))
{
    var hits_limit = 1000;
    var analyzer = new StandardAnalyzer(Version.LUCENE_29);
    searcher.Similarity = new test();

    // search by single field
    if (!string.IsNullOrEmpty(searchField))
    {
        var parser = new QueryParser(Version.LUCENE_29, searchField, analyzer);
        var query = parseQuery(searchQuery, parser);
        var hits = searcher.Search(query, hits_limit).ScoreDocs;
        var results = mapLuceneToDataList(hits, searcher);
        analyzer.Close();
        searcher.Dispose();
        return results;
    }
    // search by multiple fields (ordered by RELEVANCE)
    else
    {
        var parser = new MultiFieldQueryParser
            (Version.LUCENE_29, new[] { "Id", "Name", "Description" }, analyzer);
        var query = parseQuery(searchQuery, parser);
        var hits = searcher.Search
            (query, null, hits_limit, Sort.RELEVANCE).ScoreDocs;
        var results = mapLuceneToDataList(hits, searcher);
        analyzer.Close();
        searcher.Dispose();
        return results;
    }
}
Disclaimer: I can only speak about Lucene (and not Lucene.NET) but I believe they are built using the same principles.
The reason documents #1 and #2 come up first is that their field weights (1/2 for #1 and 1/2 for #2) are higher than the 2/11 for #3 (assuming you are not using stop words). The point is that the "exchange" term carries far more weight in the first two documents than in the third, where it is more diluted. This is how the default similarity algorithm works. In practice it is a bit more complex, as you can observe in the given link.
So what you are asking for is an alternative similarity algorithm. There's a similar discussion here where MySim, I believe, attempts to achieve something close to what you want. Just don't forget to set this similarity instance on both the index writer and the searcher.
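To make that concrete, below is a minimal sketch of such an alternative similarity for Lucene.Net 2.9/3.0, assuming the stock DefaultSimilarity API. It drops the length normalization (by default 1/sqrt(numTerms), which is exactly what favors the two short documents) so raw term frequency dominates; treat it as a starting point to tune, not a definitive fix.

// Sketch: a Similarity that ignores field length, so a description containing
// "exchange" twice can outscore a short one containing it once.
public class TermFrequencyFirstSimilarity : Lucene.Net.Search.DefaultSimilarity
{
    // Default is 1 / sqrt(numTerms); returning 1 removes the short-field advantage.
    public override float LengthNorm(string fieldName, int numTerms)
    {
        return 1.0f;
    }

    // Default is sqrt(freq); returning freq makes repeated terms count more strongly.
    public override float Tf(float freq)
    {
        return freq;
    }
}

It would replace the test() instance in the code above (searcher.Similarity = new TermFrequencyFirstSimilarity();) and should also be set on the IndexWriter before indexing; since the length norm is baked into the index at write time, documents need to be re-indexed after changing it.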
I have this query where I am expecting a lot of results.
private void addContentInCmbPhy() {
    DbClassesDataContext myDb = new DbClassesDataContext(dbPath);
    var match = from phy in myDb.Physicians
                select phy.Phy_FName;
    for (IQueryable<string> phy in match) {
        cmbPhysicians.Items.Add(phy);
    }
}
The query above returns several results, and I want those names inserted as items in my combo box. How would I add them? It gives me the following errors:
Error 7 Only assignment, call, increment, decrement, and new object expressions can be used as a statement C:\Users\John\documents\visual studio 2010\Projects\PatientAdministration\PatientAdministration\Pat_Demog.cs 415 43 PatientAdministration
Error 8 ; expected C:\Users\John\documents\visual studio 2010\Projects\PatientAdministration\PatientAdministration\Pat_Demog.cs 415 40 PatientAdministration
Error 9 ; expected C:\Users\John\documents\visual studio 2010\Projects\PatientAdministration\PatientAdministration\Pat_Demog.cs 415 43 PatientAdministration
Aren't you using the wrong looping statement? It should be foreach instead of for. If you are using a for loop then you need to have an incrementer.
It is good to have your DataContext inside a using statement, because then you hold the connection only as long as you need it and dispose of it right after. I would probably do it something like this:
private void addContentInCmbPhy()
{
    List<string> match;
    using (var myDb = new DbClassesDataContext(dbPath))
    {
        match = (from phy in myDb.Physicians
                 select phy.Phy_FName).ToList();
    }
    foreach (var phy in match)
    {
        cmbPhysicians.Items.Add(phy);
    }
}
Or, adding the items directly with AddRange:

private void addContentInCmbPhy()
{
    using (var myDb = new DbClassesDataContext(dbPath))
    {
        cmbPhysicians.Items.AddRange((from phy in myDb.Physicians
                                      select phy.Phy_FName).ToArray());
    }
    // foreach (var phy in match) {
    //     cmbPhysicians.Items.Add(phy);
    // }
}