Lucene.Net search fails if the term is too short - C#

I am new to Lucene, so maybe this is a technical limit I don't understand.
I have indexed a few texts and then try to fetch the content.
If I query the text open-source reciprocal productivity with the query source I get a match.
If I use the query sour I also get a match. But if I use the query sou then I don't get any match.
I am using Lucene.Net version 4.8.
Here is the code I am using to create the index:
using (var dir = FSDirectory.Open(targetDirectory))
{
    Analyzer analyzer = metadata.GetAnalyzer(); // returns new StandardAnalyzer(LuceneVersion.LUCENE_48)
    var indexConfig = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);
    using (IndexWriter writer = new IndexWriter(dir, indexConfig))
    {
        long entryNumber = csvRecords.Count();
        long index = 0;
        long lastPercentage = 0;
        foreach (dynamic csvEntry in csvRecords)
        {
            Document doc = new Document();
            IDictionary<string, object> dynamicCsvEntry = (IDictionary<string, object>)csvEntry;
            var indexedMetadataFiled = metadata.IdexedFields;
            foreach (string headField in header)
            {
                if (indexedMetadataFiled.ContainsKey(headField) == false || (indexedMetadataFiled[headField].NeedToBeIndexed == false && indexedMetadataFiled[headField].NeedToBeStored == false))
                    continue;
                var field = new Field(headField,
                    ((string)dynamicCsvEntry[headField] ?? string.Empty).ToLower(),
                    indexedMetadataFiled[headField].NeedToBeStored ? Field.Store.YES : Field.Store.NO, //YES
                    indexedMetadataFiled[headField].NeedToBeIndexed ? Field.Index.ANALYZED : Field.Index.NO //YES
                );
                doc.Add(field);
            }
            long percentage = (long)(((decimal)index / (decimal)entryNumber) * 100m);
            if (percentage > lastPercentage && percentage % 10 == 0)
            {
                _consoleLogger.Information($"..indexing {percentage}%..");
                lastPercentage = percentage;
            }
            writer.AddDocument(doc);
            index++;
        }
        writer.Commit();
    }
}
And here is the code I use to query the index:
var tokens = Regex.Split(query.Trim(), @"\W+");
BooleanQuery composedQuery = new BooleanQuery();
foreach (var field in luceneHint.FieldsToSearch)
{
    foreach (string word in tokens)
    {
        if (string.IsNullOrWhiteSpace(word))
            continue;
        var termQuery = new FuzzyQuery(new Term(field.FieldName, word.ToLower()));
        termQuery.Boost = (float)field.Weight;
        composedQuery.Add(termQuery, Occur.SHOULD);
    }
}
var indexManager = IndexManager.Instance;
ReferenceManager<IndexSearcher> index = indexManager.Read(boundle);
int resultLimit = luceneHint?.Top ?? RESULT_LIMIT;
var results = new List<JObject>();
var searcher = index.Acquire();
try
{
    Dictionary<string, FieldDescriptor> filedToRead = (luceneHint?.FieldsToRead?.Any() ?? false) ?
        luceneHint.FieldsToRead.ToDictionary(item => item.FieldName, item => item) :
        new Dictionary<string, FieldDescriptor>();
    bool fetchEveryField = filedToRead.Count == 0;
    TopScoreDocCollector collector = TopScoreDocCollector.Create(resultLimit, true);
    int startPageIndex = pageIndex * itemsPerPage;
    searcher.Search(composedQuery, collector);
    //TopDocs topDocs = searcher.Search(composedQuery, luceneHint?.Top ?? 100);
    TopDocs topDocs = collector.GetTopDocs(startPageIndex, itemsPerPage);
    foreach (var scoreDoc in topDocs.ScoreDocs)
    {
        Document doc = searcher.Doc(scoreDoc.Doc);
        dynamic result = new JObject();
        foreach (var field in doc.Fields)
            if (fetchEveryField || filedToRead.ContainsKey(field.Name))
                result[field.Name] = field.GetStringValue();
        results.Add(result);
    }
}
finally
{
    if (searcher != null)
        index.Release(searcher);
}
return results;
I am confused: is the fact that I can't get a result for the sou query related to the StandardAnalyzer used to build the index, e.g. some stop-word handling that prevents my query term from being found? (Does the index stop at source and sour because those are both English words?)
PS: here is the Explain output, even though I don't know how to read it:
searcher.Explain(composedQuery,6) {0 = (NON-MATCH) sum of: }
Description: "sum of:"
IsMatch: false
Match: false
Value: 0

The documentation for FuzzyQuery points out that it uses the default minimumSimilarity value of 0.5: https://lucenenet.apache.org/docs/3.0.3/d0/db9/class_lucene_1_1_net_1_1_search_1_1_fuzzy_query.html
minimumSimilarity - a value between 0 and 1 to set the required similarity between the query term and the matching terms. For example, for a minimumSimilarity of 0.5 a term of the same length as the query term is considered similar to the query term if the edit distance between both terms is less than length(term) * 0.5
So, it matches "source" when the query is "sour": removing "ce" takes two edits, so the edit distance is 2, which is within length("sour") * 0.5. However, matching "source" to "sou" would need 3 edits, so it is not a match.
You should be able to see the same document matching even if you search for something like "bounce" or "sauce", since those are also within two edits from "source".
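If the goal is for a short query such as sou to match source by prefix, a PrefixQuery is usually a better fit than fuzzy matching. A minimal sketch (not the original code), reusing the field/word loop variables from the question:
// Hypothetical alternative to the FuzzyQuery in the question: match terms that
// merely start with the typed text, e.g. "sou" -> "source".
var prefixQuery = new PrefixQuery(new Term(field.FieldName, word.ToLower()));
prefixQuery.Boost = (float)field.Weight;
composedQuery.Add(prefixQuery, Occur.SHOULD);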

Related

C# LINQ - SkipWhile() in reverse, without calling Reverse()?

In this code:
for (e = 0; e <= collection.Count - 2; e++)
{
    var itm = collection.Read();
    var itm_price = itm.Price;
    var forwards_satisfied_row = collection
        .Skip(e + 1)
        .SkipWhile(x => x.Price < ex_price)
        .FirstOrDefault();
    var backwards_satisfied_row = collection
        .Reverse()
        .Skip(collection.Count - e)
        .SkipWhile(x => x.Price < ex_price)
        .FirstOrDefault();
}
Suppose the collection contains millions of items and a Reverse() is too expensive, what would be the best way to achieve the same outcome as 'backwards_satisfied_row' ?
Edit:
For each item in the collection, it should find the first preceding item that matches the SkipWhile predicate.
For context, I'm finding the distance of each price extremum (minimum or maximum) from a horizontal clash with the price. This gives a 'strength' value for each minimum and maximum, which determines its importance and helps marry it up with extrema of a similar strength.
Edit 2
This chart shows the data in the repro code below; note the dip in the middle at item #22, which has a distance of 18.
Bear in mind this operation will be iterated millions of times.
So I'm trying not to read into memory, and to only evaluate the items needed.
When I run this on a large dataset r_ex takes 5 ms per row, whereas l_ex takes up to a second.
It might be tempting to iterate backwards and check that way, but there could be millions of previous records being read from a binary file.
Many kinds of search, like binary search, wouldn't be practical here, since the values aren't ordered.
static void Main(string[] args)
{
var dict_dists = new Dictionary<Int32, Int32>();
var dict = new Dictionary<Int32, decimal> {
{1, 410},{2, 474},{3, 431},
{4, 503},{5, 461},{6, 535},
{7, 488},{8, 562},{9, 508},
{10, 582},{11, 522},{12, 593},
{13, 529},{14, 597},{15, 529},
{16, 593},{17, 522},{18, 582},
{19, 510},{20, 565},{21, 492},
{22, 544},{23, 483},{24, 557},
{25, 506},{26, 580},{27, 524},
{28, 598},{29, 537},{30, 609},
{31, 543},{32, 612},{33, 542},
{34, 607},{35, 534},{36, 594},
{37, 518},{38, 572},{39, 496},
{40, 544},{41, 469},{42, 511},
{43, 437},{44, 474},{45, 404},
{46, 462},{47, 427},{48, 485},
{49, 441},{50, 507}};
var i = 0;
for (i = 0; i <= dict.Count - 2; i++)
{
    var ele = dict.ElementAt(i);
    var current_time = ele.Key;
    var current_price = ele.Value;
    var is_maxima = current_price > dict.ElementAt(i + 1).Value;
    //' If ele.Key = 23 Then here = True
    var shortest_dist = Int32.MaxValue;
    var l_ex = new KeyValuePair<int, decimal>();
    var r_ex = new KeyValuePair<int, decimal>();
    if (is_maxima)
    {
        l_ex = dict.Reverse().Skip(dict.Count - 1 - i + 1).SkipWhile(x => x.Value < current_price).FirstOrDefault();
        r_ex = dict.Skip(i + 1).SkipWhile(x => x.Value < current_price).FirstOrDefault();
    }
    else
    {
        // Is Minima
        l_ex = dict.Reverse().Skip(dict.Count - 1 - i + 1).SkipWhile(x => x.Value > current_price).FirstOrDefault();
        r_ex = dict.Skip(i + 1).SkipWhile(x => x.Value > current_price).FirstOrDefault();
    }
    if (l_ex.Key > 0)
    {
        var l_dist = (current_time - l_ex.Key);
        if (l_dist < shortest_dist)
        {
            shortest_dist = l_dist;
        }
    }
    if (r_ex.Key > 0)
    {
        var r_dist = (r_ex.Key - current_time);
        if (r_dist < shortest_dist)
        {
            shortest_dist = r_dist;
        }
    }
    dict_dists.Add(current_time, shortest_dist);
}
var dist = dict_dists[23];
Edit: As a workaround I'm writing a reversed temp file for the left-seekers.
for (i = file.count - 1; i >= 0; i--)
{
    file.SetPointerToItem(i);
    temp_file.Write(file.Read());
}
You could make it more efficient by selecting the precedent of each item in one pass. Let's make an extension method for enumerables that selects a precedent for each element:
public static IEnumerable<T> SelectPrecedent<T>(this IEnumerable<T> source,
    Func<T, bool> selector)
{
    T selectedPrecedent = default;
    foreach (var item in source)
    {
        if (selector(item)) selectedPrecedent = item;
        yield return selectedPrecedent;
    }
}
You could then use this method, and select the precedent and the subsequent of each element by doing only two Reverse operations in total:
var precedentArray = collection.SelectPrecedent(x => x.Price < ex_price).ToArray();
var subsequentArray = collection.Reverse()
    .SelectPrecedent(x => x.Price < ex_price).Reverse().ToArray();
for (int i = 0; i < collection.Count; i++)
{
    var current = collection[i];
    var precedent = precedentArray[i];
    var subsequent = subsequentArray[i];
    // Do something with the current, precedent and subsequent
}
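To make the behaviour of SelectPrecedent concrete, here is a tiny self-contained example with made-up numbers (not from the question); note that an element which satisfies the predicate becomes its own precedent:
var numbers = new[] { 3, 8, 5, 7, 12, 1 };
// For each position, the most recent element seen so far that is even
// (default(int), i.e. 0, until the first match).
var precedents = numbers.SelectPrecedent(n => n % 2 == 0).ToArray();
// precedents: 0, 8, 8, 8, 12, 12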
No need to do .Reverse() and then FirstOrDefault(), just use LastOrDefault(). Instead of Skip(collection.Count - e) use .Take(e) elements
var backwards_satisfied_row = collection
.SkipWhile(x => x.Price < ex_price) //Skip till x.Price < ex_price
.Skip(e+1) //Skip first e+1 elements
.LastOrDefault(); //Get Last or default value
You can make your code more efficient by storing the filtered collection and then just getting FirstOrDefault() and LastOrDefault() for forwards_satisfied_row and backwards_satisfied_row respectively, like:
for (e = 0; e <= collection.Count - 2; e++)
{
    var itm = collection.Read();
    var itm_price = itm.Price;
    var satisfied_rows = collection
        .SkipWhile(x => x.Price < ex_price)
        .Skip(e + 1)
        .ToList();
    var forwards_satisfied_row = satisfied_rows.FirstOrDefault();
    var backwards_satisfied_row = satisfied_rows.LastOrDefault();
}

Find values which sum to 0 in Excel with many items

I have to find every subset of a fairly big list (500-1000 items, positive and negative decimal values) that sums to 0. I'm not an expert, so I read many articles and solutions and then wrote my code. The data comes from an Excel worksheet and I would like to mark the found sums there.
The code works this way:
Initially I find all pairs that sum to 0 (a sketch of this step is shown after this list)
Then I put the remaining sums into a list and take combinations of up to 20 items, because I know no larger combination can sum to 0
Within these combinations I check whether one sums to 0 and save it in the result list; otherwise I save the sum in a dictionary as a key and later check whether the dictionary contains the opposite of the following sums (so I effectively check pairs of these subsets)
I keep track of the indices so I can reach and modify the cells
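For the first step, a minimal sketch of what the pair-matching could look like (hypothetical names; it assumes the amounts are already extracted into a List<decimal> called amounts):
// Hypothetical sketch of step 1: match each amount with an opposite amount summing to 0.
var byValue = new Dictionary<decimal, Queue<int>>();   // amount -> indices not yet paired
var pairs = new List<(int, int)>();
for (int i = 0; i < amounts.Count; i++)
{
    if (byValue.TryGetValue(-amounts[i], out var candidates) && candidates.Count > 0)
    {
        pairs.Add((candidates.Dequeue(), i));           // found a pair summing to 0
    }
    else
    {
        if (!byValue.TryGetValue(amounts[i], out var queue))
            byValue[amounts[i]] = queue = new Queue<int>();
        queue.Enqueue(i);                               // remember this index for later matching
    }
}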
Finding the solutions is fast enough, but when I want to mark the results in Excel it becomes really slow. I don't care about finding all solutions; I want to find as many as possible in a short time.
What do you think about this approach? How can I improve the speed? How can I easily skip sums that are already taken? And how can I mark the cells quickly in my worksheet, because that is now the bottleneck of the program?
I hope this is clear enough :) Thanks to everybody for any help.
Here is the code for the combinations part:
List<decimal> listDecimal = new List<decimal>();
List<string> listRange = new List<string>();
List<decimal> resDecimal = new List<decimal>();
List<IEnumerable<decimal>> resDecimal2 = new List<IEnumerable<decimal>>();
List<IEnumerable<string>> resIndex = new List<IEnumerable<string>>();
Dictionary<decimal, int> dicSumma = new Dictionary<decimal, int>();
foreach (TarkistaSummat.CellsRemain el in list)
{
    decimal sumDec = Convert.ToDecimal(el.Summa.Value);
    listDecimal.Add(sumDec);
    string row = el.Summa.Cells.Row.ToString();
    string col = el.Summa.Cells.Column.ToString();
    string range = el.Summa.Cells.Row.ToString() + ":" + el.Summa.Cells.Column.ToString();
    listRange.Add(range);
}
var subsets = new List<IEnumerable<decimal>> { new List<decimal>() };
var subsetsIndex = new List<IEnumerable<string>> { new List<string>() };
for (int i = 0; i < list.Count; i++)
{
    if (i > 20)
    {
        List<IEnumerable<decimal>> parSubsets = subsets.GetRange(i, i + 20);
        List<IEnumerable<string>> parSubsetsIndex = subsetsIndex.GetRange(i, i + 20);
        var Z = parSubsets.Select(x => x.Concat(new[] { listDecimal[i] }));
        //var Zfound = Z.Select(x => x).Where(w => w.Sum() == 0);
        subsets.AddRange(Z.ToList());
        var Zr = parSubsetsIndex.Select(x => x.Concat(new[] { listRange[i] }));
        subsetsIndex.AddRange(Zr.ToList());
    }
    else
    {
        var T = subsets.Select(y => y.Concat(new[] { listDecimal[i] }));
        //var Tfound = T.Select(x => x).Where(w => w.Sum() == 0);
        //resDecimal2.AddRange(Tfound);
        //var TnotFound = T.Except(Tfound);
        subsets.AddRange(T.ToList());
        var Tr = subsetsIndex.Select(y => y.Concat(new[] { listRange[i] }));
        subsetsIndex.AddRange(Tr.ToList());
    }
    for (int j = 0; j < subsets.Count; j++)
    {
        decimal sumDec = subsets[j].Sum();
        if (sumDec == 0m)
        {
            resDecimal2.Add(subsets[j]);
            resIndex.Add(subsetsIndex[j]);
            continue;
        }
        else
        {
            if (dicSumma.ContainsKey(sumDec * -1))
            {
                dicSumma.TryGetValue(sumDec * -1, out int index);
                IEnumerable<decimal> addComb = subsets[j].Union(subsets[index]);
                resDecimal2.Add(addComb);
                var indexComb = subsetsIndex[j].Union(subsetsIndex[index]);
                resIndex.Add(indexComb);
            }
            else
            {
                if (!dicSumma.ContainsKey(sumDec))
                {
                    dicSumma.Add(sumDec, j);
                }
            }
        }
    }
    for (int k = 0; k < resIndex.Count; k++)
    {
        //List<Range> ranges = new List<Range>();
        foreach (string el in resIndex[k])
        {
            string[] split = el.Split(':');
            Range cell = actSheet.Cells[Convert.ToInt32(split[0]), Convert.ToInt32(split[1])];
            cell.Interior.ColorIndex = 6;
        }
    }
}
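Since the marking loop at the end colours one cell at a time through COM interop, that is typically the slow part. A rough, hypothetical sketch of batching the highlighting (assuming app is the Excel Application object and actSheet the worksheet, as in the code above); setting app.ScreenUpdating = false while marking also tends to help:
// Hypothetical sketch: accumulate the cells of a subset and colour them in one shot,
// instead of one interop call per cell.
foreach (var subset in resIndex)
{
    Range combined = null;
    foreach (string el in subset)
    {
        string[] split = el.Split(':');
        Range cell = actSheet.Cells[Convert.ToInt32(split[0]), Convert.ToInt32(split[1])];
        combined = combined == null ? cell : app.Union(combined, cell);
    }
    if (combined != null)
        combined.Interior.ColorIndex = 6;   // one COM call per subset, not per cell
}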

iterating through IEnumerable<string> causing serious performance issue

I am clueless about what happens to the performance of the for loop when I try to iterate through an IEnumerable.
The following code causes a serious performance issue:
foreach (IEdge ed in edcol)
{
    IEnumerable<string> row =
        from r in dtRow.AsEnumerable()
        where (((r.Field<string>("F1") == ed.Vertex1.Name) &&
                (r.Field<string>("F2") == ed.Vertex2.Name))
            || ((r.Field<string>("F1") == ed.Vertex2.Name) &&
                (r.Field<string>("F2") == ed.Vertex1.Name)))
        select r.Field<string>("EdgeId");
    int co = row.Count();
    //foreach (string s in row)
    //{
    //}
    x++;
}
The outer foreach (IEdge ed in edcol) runs about 11000 iterations.
It completes in a fraction of a second if I remove the line
int co = row.Count();
from the code.
row.Count() is at most 10 in every iteration.
If I uncomment the
//foreach (string s in row)
//{
//}
the code takes about 10 minutes to finish.
Does IEnumerable really have such a serious performance issue?
This answer is for the implicit question of "how do I make this much faster"? Apologies if that's not actually what you were after, but...
You can go through the rows once, grouping by the names. (I haven't done the ordering like Marc has - I'm just looking up twice when querying :)
var lookup = dtRow.AsEnumerable()
.ToLookup(r => new { F1 = r.Field<string>("F1"),
F2 = r.Field<string>("F2") });
Then:
foreach (IEdge ed in edcol)
{
    // Need to check both ways round...
    var first = new { F1 = ed.Vertex1.Name, F2 = ed.Vertex2.Name };
    var second = new { F1 = ed.Vertex2.Name, F2 = ed.Vertex1.Name };
    var firstResult = lookup[first];
    var secondResult = lookup[second];
    // Due to the way Lookup works, this is quick - much quicker than
    // calling query.Count()
    var count = firstResult.Count() + secondResult.Count();
    var query = firstResult.Concat(secondResult);
    foreach (var row in query)
    {
        ...
    }
}
At the moment you have O(N*M) performance, which could be problematic if both N and M are large. I would be inclined to pre-compute some of the DataTable info. For example, we could try:
var lookup = dtRows.AsEnumerable().ToLookup(
row => string.Compare(row.Field<string>("F1"),row.Field<string>("F2"))<0
? Tuple.Create(row.Field<string>("F1"), row.Field<string>("F2"))
: Tuple.Create(row.Field<string>("F2"), row.Field<string>("F1")),
row => row.Field<string>("EdgeId"));
then we can iterate that:
foreach (IEdge ed in edCol)
{
    var name1 = string.Compare(ed.Vertex1.Name, ed.Vertex2.Name) < 0
        ? ed.Vertex1.Name : ed.Vertex2.Name;
    var name2 = string.Compare(ed.Vertex1.Name, ed.Vertex2.Name) < 0
        ? ed.Vertex2.Name : ed.Vertex1.Name;
    var matches = lookup[Tuple.Create(name1, name2)];
    // ...
}
(note I enforced ascending alphabetical pairs in there, for convenience)

System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary

I receive the above error message when running a unit test on a method. I know where the problem is, I just don't know why the key is not present in the dictionary.
Here is the dictionary:
var nmDict = xelem.Descendants(plantNS + "Month").ToDictionary(
k => new Tuple<int, int, string>(int.Parse(k.Ancestors(plantNS + "Year").First().Attribute("Year").Value), Int32.Parse(k.Attribute("Month1").Value), k.Ancestors(plantNS + "Report").First().Attribute("Location").Value.ToString()),
v => {
var detail = v.Descendants(plantNS + "Details").First();
return new HoursContainer
{
BaseHours = detail.Attribute("BaseHours").Value,
OvertimeHours = detail.Attribute("OvertimeHours").Value,
TotalHours = float.Parse(detail.Attribute("BaseHours").Value) + float.Parse(detail.Attribute("OvertimeHours").Value)
};
});
var mergedDict = new Dictionary<Tuple<int, int, string>, HoursContainer>();
foreach (var item in nmDict)
{
mergedDict.Add(Tuple.Create(item.Key.Item1, item.Key.Item2, "NM"), item.Value);
}
var thDict = xelem.Descendants(plantNS + "Month").ToDictionary(
k => new Tuple<int, int, string>(int.Parse(k.Ancestors(plantNS + "Year").First().Attribute("Year").Value), Int32.Parse(k.Attribute("Month1").Value), k.Ancestors(plantNS + "Report").First().Attribute("Location").Value.ToString()),
v => {
var detail = v.Descendants(plantNS + "Details").First();
return new HoursContainer
{
BaseHours = detail.Attribute("BaseHours").Value,
OvertimeHours = detail.Attribute("OvertimeHours").Value,
TotalHours = float.Parse(detail.Attribute("BaseHours").Value) + float.Parse(detail.Attribute("OvertimeHours").Value)
};
});
foreach (var item in thDict)
{
mergedDict.Add(Tuple.Create(item.Key.Item1, item.Key.Item2, "TH"), item.Value);
}
return mergedDict;
and here is the method that is being tested:
protected IList<DataResults> QueryData(HarvestTargetTimeRangeUTC ranges,
    IDictionary<Tuple<int, int, string>, HoursContainer> mergedDict)
{
    var startDate = new DateTime(ranges.StartTimeUTC.Year, ranges.StartTimeUTC.Month, 1);
    var endDate = new DateTime(ranges.EndTimeUTC.Year, ranges.EndTimeUTC.Month, 1);
    const string IndicatorName = "{6B5B57F6-A9FC-48AB-BA4C-9AB5A16F3745}";
    DataResults endItem = new DataResults();
    List<DataResults> ListOfResults = new List<DataResults>();
    var allData =
        (from vi in context.vDimIncidents
         where vi.IncidentDate >= startDate.AddYears(-3) && vi.IncidentDate <= endDate
         select new
         {
             vi.IncidentDate,
             LocationName = vi.LocationCode,
             GroupingName = vi.Location,
             vi.ThisIncidentIs, vi.Location
         });
    var finalResults =
        (from a in allData
         group a by new { a.IncidentDate.Year, a.IncidentDate.Month, a.LocationName, a.GroupingName, a.ThisIncidentIs, a.Location }
         into groupItem
         select new
         {
             Year = String.Format("{0}", groupItem.Key.Year),
             Month = String.Format("{0:00}", groupItem.Key.Month),
             groupItem.Key.LocationName,
             GroupingName = groupItem.Key.GroupingName,
             Numerator = groupItem.Count(),
             Denominator = mergedDict[Tuple.Create(groupItem.Key.Year, groupItem.Key.Month, groupItem.Key.LocationName)].TotalHours,
             IndicatorName = IndicatorName,
         }).ToList();
    for (int counter = 0; counter < finalResults.Count; counter++)
    {
        var item = finalResults[counter];
        endItem = new DataResults();
        ListOfResults.Add(endItem);
        endItem.IndicatorName = item.IndicatorName;
        endItem.LocationName = item.LocationName;
        endItem.Year = item.Year;
        endItem.Month = item.Month;
        endItem.GroupingName = item.GroupingName;
        endItem.Numerator = item.Numerator;
        endItem.Denominator = item.Denominator;
    }
    foreach (var item in mergedDict)
    {
        if (!ListOfResults.Exists(l => l.Year == item.Key.Item1.ToString() && l.Month == item.Key.Item2.ToString()
            && l.LocationName == item.Key.Item3))
        {
            for (int counter = 0; counter < finalResults.Count; counter++)
            {
                var data = finalResults[counter];
                endItem = new DataResults();
                ListOfResults.Add(endItem);
                endItem.IndicatorName = data.IndicatorName;
                endItem.LocationName = item.Key.Item3;
                endItem.Year = item.Key.Item1.ToString();
                endItem.Month = item.Key.Item2.ToString();
                endItem.GroupingName = data.GroupingName;
                endItem.Numerator = 0;
                endItem.Denominator = item.Value.TotalHours;
            }
        }
    }
    return ListOfResults;
}
The error occurs here:
Denominator = mergedDict[Tuple.Create(groupItem.Key.Year, groupItem.Key.Month, groupItem.Key.LocationName)].TotalHours,
I do not understand why it is not present as a key. The key consists of an int, int, string (year, month, location) and that is what I have assigned.
I've looked at all of the other threads concerning this error message but I didn't see anything that applied to my situation.
I was unsure of which tags to put on this, but from my understanding the dictionary was created with LINQ to XML, the query is LINQ to SQL, and it's all part of C#, so I used all the tags. If this was incorrect then I apologize in advance.
The problem is with comparisons between the keys you are storing in the Dictionary and the keys you are trying to look up.
When you add something to a Dictionary or access the indexer of a Dictionary it uses the GetHashCode() method to get a hash value of the key. The hashcode for a Tuple is unique to that instance of the Tuple. This means that unless you are passing in the exact same instance of the Tuple class into the indexer, it will not find the previously stored value. Your usage of mergedDict[Tuple.Create(... creates a brand new Tuple with a different hash code than is stored in the Dictionary.
I would recommend creating your own class to use as the key and implementing GetHashCode() and the Equality methods on that class. That way the Dictionary will be able to find what you previously stored there.
More:
The reason this is confusing to a lot of people is that for something like String or Int32, String.GetHashCode() will return the same hash code for two different instances that have the same value. A more specialized class such as Tuple doesn't always work the same. The implementor of Tuple could have gotten the hash code of each input to the Tuple and added them together (or something), but running Tuple through a decompiler you can see that this is not the case.
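For illustration, a minimal sketch of such a key class (hypothetical name, mirroring the Tuple<int, int, string> of year, month, location used in the question), with value-based equality and hash code:
public sealed class HoursKey : IEquatable<HoursKey>
{
    public int Year { get; }
    public int Month { get; }
    public string Location { get; }

    public HoursKey(int year, int month, string location)
    {
        Year = year;
        Month = month;
        Location = location;
    }

    public bool Equals(HoursKey other) =>
        other != null && Year == other.Year && Month == other.Month && Location == other.Location;

    public override bool Equals(object obj) => Equals(obj as HoursKey);

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + Year;
            hash = hash * 31 + Month;
            hash = hash * 31 + (Location?.GetHashCode() ?? 0);
            return hash;
        }
    }
}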

Simple rating algorithm to sorting results according to user query

I'm developing a very basic web search engine that has several parts. After retrieving results for a user query, I want to calculate a rating for each result and then sort the results by that rating. Here is my query:
var tmpQuery = (from urls in _context.Urls
join documents in _context.Documents
on urls.UrlId equals documents.DocumentId
let words = (from words in _context.Words
join hits in _context.Hits
on words.WordId equals hits.WordId
where hits.DocumentId == documents.DocumentId
select words.Text)
select new { urls, documents, words });
var results = (from r in tmpQuery.AsEnumerable()
where r.urls.ResolvedPath.Contains(breakedQuery, KeywordParts.Url, part) ||
r.documents.Title.Contains(breakedQuery, KeywordParts.Title, part) ||
r.documents.Keywords.Contains(breakedQuery, KeywordParts.Keywords, part) ||
r.documents.Description.Contains(breakedQuery, Description, part) ||
r.words.Contains(breakedQuery, KeywordParts.Content, part)
select new SearchResult()
{
UrlId = r.urls.UrlId,
Url = r.urls.ResolvedPath,
IndexedOn = r.documents.IndexedOn,
Title = r.documents.Title,
Description = r.documents.Description,
Host = new Uri(r.urls.ResolvedPath).Host,
Length = r.documents.Length,
Rate = CalculateRating(breakedQuery, r.urls.ResolvedPath, r.documents.Title, r.documents.Keywords, r.documents.Description, r.words)
}).AsEnumerable()
.OrderByDescending(result => result.Rate)
.Distinct(new SearchResultEqualityComparer());
and the rating is calculated by this method:
private int CalculateRating(IEnumerable<string> breakedQuery, string resolvedPath, string title, string keywords, string description, IEnumerable<string> words)
{
    var baseRate = 0;
    foreach (var query in breakedQuery)
    {
        /* First I break the user's raw query (e.g. "Microsoft -Apple") into a list of
           terms (Microsoft, -Apple); if a term starts with "-" it means the results
           should not contain it. */
        var none = (query.StartsWith("-"));
        string term = query.Replace("-", "");
        var pathCount = Calculate(resolvedPath, term);
        var titleCount = Calculate(title, term);
        var keywordsCount = Calculate(keywords, term);
        var descriptionCount = Calculate(description, term);
        var wordsCount = Calculate(words, term);
        var result = (pathCount * 100) + (titleCount * 50) + (keywordsCount * 25) + (descriptionCount * 10) + (wordsCount);
        if (none)
            baseRate -= result;
        else
            baseRate += result;
    }
    return baseRate;
}

private int Calculate(string source, string query)
{
    if (!string.IsNullOrWhiteSpace(source))
        return Calculate(source.Split(' ').AsEnumerable<string>(), query);
    return 0;
}

private int Calculate(IEnumerable<string> sources, string query)
{
    var count = 0;
    if (sources != null && sources.Count() > 0)
    {
        // Compare the two strings:
        // first case sensitive
        var elements = sources.Where(source => source == query);
        count += elements.Count();
        // then case insensitive (worth half a point)
        count += sources.Except(elements).Where(source => source.ToLowerInvariant() == query.ToLowerInvariant()).Count() / 2;
    }
    return count;
}
Please guide me on improving the performance (my search engine is currently very slow).
I expect this is down to your from urls in _context.Urls - with no Where on this you're getting a lot of data to then throw away when building up your results. How many items are in tmpQuery / results?
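If it does turn out that everything is being pulled back, a hedged sketch of pushing a coarse filter into the database first (table and column names reused from the question; the custom KeywordParts-aware Contains extension can't be translated to SQL, so only plain string Contains is used here):
// Hypothetical sketch: filter on the server before materializing rows in memory.
// Assumes breakedQuery holds the individual keywords; simplest case uses one positive term.
var keyword = breakedQuery.First();
var candidates =
    (from urls in _context.Urls
     join documents in _context.Documents
         on urls.UrlId equals documents.DocumentId
     where urls.ResolvedPath.Contains(keyword)
         || documents.Title.Contains(keyword)
         || documents.Keywords.Contains(keyword)
         || documents.Description.Contains(keyword)
     select new { urls, documents }).ToList();
// The richer KeywordParts matching and CalculateRating can then run in memory
// over this much smaller candidate set.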
