I have a table of words, a lookup table of where those words are found in documents, and the number of times each word appears in that document. So there might be a record that says Alpha exists 5 times in document X, while Beta exists 3 times in document X, and another for Beta existing twice in document Y.
The user can enter a number of words to search, so "quick brown fox" is three queries, but "quick brown fox jumped" is four queries. While I can get a scored result set of each word in turn, what I actually want is to add the number of occurrences together for each word, such that the top result is the highest occurrence count for all words.
A document might have hundreds of "quick" and "brown" occurrences but no "fox" occurrences. That document should still be included, as it could score higher than a document with only one occurrence each of "quick", "brown", and "fox".
The problem I can't work out is how to amalgamate the 1 to N queries with the occurrences summed. I think I need to use GROUP BY and SUM(), but I'm not certain. LINQ is preferred, but SQL would be OK. MS SQL Server 2016.
I want to pass the results on to a page indexer so a for-each over the results wouldn't work, plus we're talking 80,000 word records, 3 million document-word records, and 100,000 document records.
// TextIndexDocument:
// Id | WordId | Occurences | DocumentId | (more)
//
// TextIndexWord:
// Id | Word
foreach (string word in words)
{
    string lword = word.ToLowerInvariant();
    var results = from docTable in db.TextIndexDocuments
                  join wordTable in db.TextIndexWords on docTable.WordId equals wordTable.Id
                  where wordTable.Word == lword
                  orderby docTable.Occurences descending
                  select docTable;
    // (incomplete)
}
More information
I understand that full text searching is recommended. The problem then is how to rank the results from a half dozen unrelated tables (searching in forum posts, articles, products...) into one unified result set - let's say record Id, record type (article/product/forum), and score. The top result might be a forum post while the next best hits are a couple of articles, then a product, then another forum post and so on. The TextIndexDocument table already has this information across all the relevant tables.
Let's assume that you can create a navigation property TextIndexDocuments in Document:
public virtual ICollection<TextIndexDocument> TextIndexDocuments { get; set; }
and a navigation property in TextIndexDocument:
public virtual TextIndexWord TextIndexWord { get; set; }
(highly recommended)
Then you can use the properties to get the desired results:
// lwords holds the lower-cased search words, e.g. { "quick", "brown", "fox" }
var lwords = words.Select(w => w.ToLowerInvariant()).ToList();

var results =
(
    from doc in db.Documents
    select new
    {
        doc,
        TotalOccurrences =
            doc.TextIndexDocuments
               .Where(tid => lwords.Contains(tid.TextIndexWord.Word))
               .Sum(tid => tid.Occurences) // tid, not doc: reusing the outer range variable would not compile
    }
).OrderByDescending(x => x.TotalOccurrences);
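Materializing this with a plain ToList() would pull back every document, so take only the page you need; a minimal usage sketch (the page size of 50 is an arbitrary choice):

// Grouping, summing, and ordering run on the server; only 50 rows come back
var topDocs = results.Take(50).ToList();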
As far as I know this cannot be accomplished in LINQ, or at least not easily, and especially not in any kind of performant way.
What you really should consider, assuming your DBA will allow it, is Full-Text indexing of the documents stored in SQL Server. From my understanding, the RANK column returned by CONTAINSTABLE/FREETEXTTABLE is exactly what you are looking for, and it has been highly optimized for full-text search.
In response to your comment (sorry for not noticing it sooner):
You'll need to do either a series of subqueries or a Common Table Expression (CTE). CTEs are a bit hard to get used to writing at first, but once you get used to them they are far more elegant than the corresponding query written with subqueries. Either way the query execution plan will be exactly the same, so there is no performance gain from going the CTE route.
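As a sketch of the CTE route driven from C#, assuming EF6's Database.SqlQuery and a hypothetical DocumentScore DTO with DocumentId and Total properties:

// Hypothetical result type:
// public class DocumentScore { public int DocumentId { get; set; } public int Total { get; set; } }
var scores = db.Database.SqlQuery<DocumentScore>(@"
    WITH WordHits AS (
        SELECT d.DocumentId, d.Occurences
        FROM TextIndexDocument d
        JOIN TextIndexWord w ON w.Id = d.WordId
        WHERE w.Word IN ('quick', 'brown', 'fox')
    )
    SELECT DocumentId, SUM(Occurences) AS Total
    FROM WordHits
    GROUP BY DocumentId
    ORDER BY Total DESC").ToList();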
You want to add up occurrences for the words per document. So group by document ID, use SUM(), and order by the total descending:
select documentid, sum(occurences)
from doctable
where wordid in (select id from wordtable where word in ('quick', 'brown', 'fox'))
group by documentid
order by sum(occurences) desc;
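For completeness, a LINQ sketch of the same grouping, assuming the TextIndexWord navigation property from the answer above and lwords holding the lower-cased search words:

// Group the word hits per document and sum their occurrences on the server
var results = db.TextIndexDocuments
    .Where(tid => lwords.Contains(tid.TextIndexWord.Word))
    .GroupBy(tid => tid.DocumentId)
    .Select(g => new { DocumentId = g.Key, Total = g.Sum(t => t.Occurences) })
    .OrderByDescending(x => x.Total);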
Related
I have multiple random numbers in a table column ID, like:
8v12027
8v12025
8v12024
8v12029
8v12023
8v12030
8v12020
Expected output: 8v12020, From 8v12023 To 8v12025, 8v12027, From 8v12029 To 8v12030
I assume you're expecting an SQL solution, so:
You can use the LEAD or LAG window function and concatenate the values:
SELECT CONCAT('From ', p.Id, ' To ', LEAD(p.Id) OVER (ORDER BY p.Id)) FROM YourTable p
There is a really good explanation of those keywords on the SQLAuthority website:
https://blog.sqlauthority.com/2013/09/22/sql-server-how-to-access-the-previous-row-and-next-row-value-in-select-statement/
But if you were expecting a pure C# solution, you can retrieve the data set into an array, order it by Id, and with a for loop concatenate the current value with the previous (or next) one.
Or with LINQ, use Zip to pair each value with the next:
yourArray.Zip(yourArray.Skip(1), (a, b) => $"From {a} To {b}")
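Neither one-liner produces the grouped ranges from the expected output on its own; here is a fuller C# sketch of the consecutive-range grouping, assuming every Id is the fixed prefix "8v" followed by a number:

// Groups sorted IDs of the form "8v<number>" into runs of consecutive values.
// Single values are emitted alone; runs become "From ... To ...".
static IEnumerable<string> ToRanges(IEnumerable<string> ids)
{
    var sorted = ids.Select(id => int.Parse(id.Substring(2)))
                    .OrderBy(n => n)
                    .ToList();
    int i = 0;
    while (i < sorted.Count)
    {
        int j = i;
        while (j + 1 < sorted.Count && sorted[j + 1] == sorted[j] + 1) j++;
        yield return i == j ? $"8v{sorted[i]}" : $"From 8v{sorted[i]} To 8v{sorted[j]}";
        i = j + 1;
    }
}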
I'm working on an ASP.NET Core MVC application and I need some help optimizing my LINQ queries. Currently my application is running very slowly, and I presume it is due to the way I've written my queries. My SQL db contains about 1.9 million rows of data, with 4 properties for each item.
Here is a sample of what I have in the controller...
model.MostPopular = await _givenNameRepo.GetByAlphaSinceYear(model.StartsWith, model.Gender, model.SortCount, model.SinceYear).ToAsyncEnumerable().ToList();
And here is what I have in the repo...
public IEnumerable<string> GetByAlphaSinceYear(string alpha, string gender, int sortCount, int sinceYear)
{
    // "GivenNames" is an assumed DbSet name; AsNoTracking belongs on the set, not the context
    var names = (from n in _context.GivenNames.AsNoTracking()
                 where n.Name.StartsWith(alpha) && n.Gender == gender && n.Year >= sinceYear
                 group n by new { n.Name } into nn
                 select new
                 {
                     nn.Key.Name,
                     Frequency = nn.Sum(s => s.Frequency)
                 }).OrderByDescending(i => i.Frequency)
                   .Select(j => j.Name)
                   .Take(sortCount);
    return names;
}
I was playing around with trying to make the repo flexible, allowing for synchronous or asynchronous operation. But I'm now thinking this may be causing problems, given what I have to do in the controller.
One question, I guess, would be: is my query running multiple times given the way I'm calling it? I've read through the LINQ documentation, but when the deferred operations actually execute still isn't sinking in. As I understand it, the first run would be on the .OrderByDescending; on that result .Select would run, and on that result .Take would run. But then in the controller, because I'm trying to run asynchronously, I think I need to call .ToAsyncEnumerable and then .ToList. My gut tells me there is some extra work going on here, but I'm not sure.
EDIT: There is just one table in the db and it looks like this...
Id | Year | Name | Gender | Frequency
Each year has its own names with their own genders and frequencies. The sample query looks for the most popular names by summing up the frequencies of the names, over some number of years (last year, last 5 years, last 10 years, etc.), and based on gender (names can potentially be male and female), and then sorts them in descending order based on those summed frequencies. Then you take from the top whatever number you want to display to show the most popular names.
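For comparison, a minimal sketch of an async variant, assuming EF Core; the method name GetByAlphaSinceYearAsync and the GivenNames DbSet are assumptions:

// The query stays IQueryable until ToListAsync, so SQL Server performs
// the filtering, grouping, summing, ordering, and TOP(n) in one round trip.
public async Task<List<string>> GetByAlphaSinceYearAsync(string alpha, string gender, int sortCount, int sinceYear)
{
    return await _context.GivenNames.AsNoTracking()
        .Where(n => n.Name.StartsWith(alpha) && n.Gender == gender && n.Year >= sinceYear)
        .GroupBy(n => n.Name)
        .Select(g => new { Name = g.Key, Frequency = g.Sum(s => s.Frequency) })
        .OrderByDescending(x => x.Frequency)
        .Select(x => x.Name)
        .Take(sortCount)
        .ToListAsync();
}

The controller could then await it directly, without ToAsyncEnumerable:

model.MostPopular = await _givenNameRepo.GetByAlphaSinceYearAsync(model.StartsWith, model.Gender, model.SortCount, model.SinceYear);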
I have a List of objects like this:
List<Product> _products;
Then I get a productId as input and search this list like this:
var target = _products.Where(o => o.productid == input).FirstOrDefault();
My question is:
If this list has 100 products (productId from 1 to 100) and I get an input of productId = 100, does that mean this method must loop 100 times (if I ORDER BY productId ASC in the query)?
How does that compare with querying the database using a WHERE clause like WHERE productId = @param?
Thank you.
No. If there is an index with key productId, it finds the correct row in O(log n) operations.
Just implement both methods and measure the time (hint: use the Stopwatch class).
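A minimal sketch of such a measurement, reusing the property casing from the question:

var sw = System.Diagnostics.Stopwatch.StartNew();
var target = _products.FirstOrDefault(o => o.productid == input);
sw.Stop();
Console.WriteLine($"In-memory lookup took {sw.ElapsedMilliseconds} ms");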
Edit
To get the full performance, you should not create an intermediate (unsorted) List<T>, but should put all your logic into a LINQ query that operates on the SQL Server.
This might be helpful to get your answer:
https://www.linqpad.net/WhyLINQBeatsSQL.aspx
If you execute that Where on a List<Product>, then:
you get all 100 rows from the database,
and then loop through all the products in memory until you find the one that matches, or until you've gone through the entire list and found nothing.
If, on the other hand, you used an IQueryable<Product> that was connected to the database table, then:
You wouldn't have read anything from the database yet
When you apply the Where, you still wouldn't read anything
When you apply the FirstOrDefault, a SQL query is constructed to find just the one row you need. Given correct indexes on the table, this will be quite fast.
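A sketch of the contrast, assuming an EF context with a Products DbSet:

// In-memory: every row is materialized first, then scanned linearly
var all = context.Products.ToList();
var fromList = all.FirstOrDefault(o => o.productid == input);

// IQueryable: FirstOrDefault sends SELECT TOP(1) ... WHERE productid = @input
// to the server, so only the matching row comes back
var fromDb = context.Products.FirstOrDefault(o => o.productid == input);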
I am using C# and Entity Framework. I have a Products table (Id, Name) that contains my products, a Tags table (Id, Name) that contains all available tags, and a ProductTags table (Id, ProductId, TagId) that is used to assign tags to products.
The user can specify that he wants to see products that have several tags (int[] SelectedTagIds).
The question is: how do I get all the products where each product has all the tags specified by the user?
Now I am using this query:
var result = Context.Products
    .Where(x => SelectedTagIds.All(y =>
        x.ProductTags.Select(z => z.TagId).Contains(y)));
I wonder if this is the correct way, or whether there is a better/faster way?
You could use Intersect and then check that the intersection has the same count as the search tags, which means that all search tags exist in the ProductTags collection.
var result =
products.Where(
x => x.ProductTags.Intersect(searchTags).Count() == searchTags.Length);
But you would need to run a performance analysis to see which works better for you. And as Oren commented, are you actually having performance issues? Both of these collections are most likely so small that they would not cause a bottleneck.
EDIT: Just checked the performance of Intersect; it is slower than using .All with .Contains.
Performance on my machine, creating 1000 result queries and enumerating them using ToList(), on a set of 1000 products with 2 search tags, is the following:
searchTags.All(search => x.ProductTags.Contains(search)) = 202ms
!searchTags.Except(x.ProductTags).Any() = 339ms
x.ProductTags.Intersect(searchTags).Count() == searchTags.Length = 389ms
If you really need to improve performance, you could use a HashSet for your ProductTags and SelectedTagIds (see the sketch after the comparison below).
EDIT 2: Comparison using HashSets
I ran the comparison using hash sets; creating 1000 query objects and executing them into a list using ToList() gave the following results:
Using List<Tag>
Creating Events took 6ms
Number of Tags = 3000
Number of Products = 1003
Average Number of Tags per Products = 100
Number of Search Tags = 10
.All.Contains = 5379ms - Found 51
.Intersect = 5823ms - Found 51
.Except = 6720ms - Found 51
Using HashSet<Tag>
Creating Events took 26ms
Number of Tags = 3000
Number of Products = 1003
Average Number of Tags per Products = 100
Number of Search Tags = 10
.All.Contains = 259ms - Found 51
.Intersect = 6417ms - Found 51
.Except = 7347ms - Found 51
As you can see, using HashSet is considerably faster, even if you factor in the extra 20ms to create the HashSets. Although the hash was a simple hash of the Id field; if you were to use a more complicated hash, the results would be different.
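A minimal in-memory sketch of the HashSet variant, assuming integer tag IDs (note this applies to LINQ to Objects; it would not translate to SQL through Entity Framework):

// Build the lookup set once; IsSubsetOf is true when every selected
// tag appears among the product's tags
var selected = new HashSet<int>(SelectedTagIds);
var result = products.Where(p => selected.IsSubsetOf(p.ProductTags.Select(t => t.TagId)));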
I have two documents and, using Luke to investigate, I have confirmed in code that the behavior is the same, using StandardAnalyzer.
Document one with boost 1
stored/uncompressed,indexed,tokenized<Description:Nummer ett>
stored/uncompressed,indexed,tokenized<Id:2>
stored/uncompressed,indexed,tokenized<Name:Apa>
Document two with boost 2
stored/uncompressed,indexed,tokenized<Description:Nummer två>
stored/uncompressed,indexed,tokenized<Id:1>
stored/uncompressed,indexed,tokenized<Name:Apa>
Search apa in field Name
Returns with the boost applied and in the correct order:
Document 2 has Score 1.1891
Document 1 has Score 0.5945
Search ap*
Returns in no particular order and with the same score:
Document 1 Score 1.0000
Document 2 Score 1.0000
Search apa*
Returns in no particular order and with the same score:
Document 1 Score 1.0000
Document 2 Score 1.0000
Why is this? I would like documents with a higher boost value to rank higher even if I have to use wildcards. Is this possible?
Cheers all cool coders out there!
This is what I want to accomplish:
I have a search string and want matches, using wildcards.
Search "Lu" + "*"
A Document has the fields Name and City.
I would like the document whose Name is Lund to get a higher rating than the document whose Name is Lunt or whose City is Lund, for example. This is because I will know which documents are most popular. I want to get the documents with City Stockholm and Names Stockholm and Stockholmen, but ordered as I choose.
Since WildcardQuery is a subclass of MultiTermQuery, you are getting a constant score of 1.
If you check the definition of t.getBoost():
"t.getBoost() is a search time boost of term t in the query q as specified in the query text (see query syntax), or as set by application calls to setBoost(). Notice that there is really no direct API for accessing a boost of one term in a multi term query, but rather multi terms are represented in a query as multi TermQuery objects, and so the boost of a term in the query is accessible by calling the sub-query getBoost()."
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_1/api/core/org/apache/lucene/search/Similarity.html#formula_termBoost
One possible hack could be to set the rewrite method of the query parser:
myCustomQueryParser.SetMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE)
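A sketch of how that could fit together, assuming Lucene.NET 3.x (the exact member for setting the rewrite method varies between versions):

// With the scoring rewrite, "apa*" is expanded into a BooleanQuery whose
// clauses are scored normally, so boosts influence the ranking again.
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "Name", analyzer);
parser.SetMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
var query = parser.Parse("apa*");

Keep in mind that the scoring rewrite expands the wildcard into one clause per matching term, so on a large index it can exceed the BooleanQuery maximum clause count.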