LINQ with multiple Contains - C#

In our UI, users can enter a free-text search that is applied to a number of fields.
q = q.Where(p => p.Account.Contains(query)
|| p.AccountName.Contains(query)
|| p.AccountAKA.Contains(query)
|| p.AccountRef.Contains(query));
This translates into SQL. Is there a more efficient way of querying, as this is slow?
There are about 20,000 rows. Database disk size doesn't matter, memory usage does.

Since all of these are text fields, each of the conditions translates to Account LIKE '%' + query + '%'. Any query with wildcards on both sides will be slow; unfortunately, there is not much that can be done about that.
Maybe it is possible to use StartsWith() instead of Contains()? That would translate to LIKE query + '%', which is generally much faster.
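A sketch of that variant against the original query (prefix-only matching does change the search semantics, of course):
// Translates to LIKE @query + '%', which can seek on an index on each column,
// unlike the double-wildcard pattern produced by Contains.
q = q.Where(p => p.Account.StartsWith(query)
|| p.AccountName.StartsWith(query)
|| p.AccountAKA.StartsWith(query)
|| p.AccountRef.StartsWith(query));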

A better solution would be to change your data model and add a Description column that contains ALL the account name info, so you can run the search against a single column. Any update to the record then also updates this Description column.
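With such a column in place, the search collapses to a single condition (a sketch; Description here is the maintained column described above):
// One LIKE against one column instead of four.
q = q.Where(p => p.Description.Contains(query));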

Related

Do I need to index on a id in EF Core if I'm searching for an id in 2 different columns?

If I do a query like the one below, where I'm searching for the same ID in two different columns, should I have an index like this? Or should I create 2 separate indexes, one for each column?
modelBuilder.Entity<Transfer>()
.HasIndex(p => new { p.SenderId, p.ReceiverId });
Query:
var transfersCount = await _dbContext.Transfers
.Where(p => p.ReceiverId == user.Id || p.SenderId == user.Id)
.CountAsync();
What if I have a query like the one below - would I need a multi-column index on all 4 columns?
var transfersCount = await _dbContext.Transfers
.Where(p => (p.SenderId == user.Id || p.ReceiverId == user.Id) &&
(!transferParams.Status.HasValue || p.TransferStatus == (TransferStatus)transferParams.Status) &&
(!transferParams.Type.HasValue || p.TransferType == (TransferType)transferParams.Type))
.CountAsync();
I recommend two single-column indices.
The two single-column indices will perform better in this query because both columns would be in a fully ordered index. By contrast, in a multi-column index, only the first column is fully ordered in the index.
If you were using an AND condition for the sender and receiver, then you would benefit from a multi-column index. The multi-column index is ideal for situations where multiple columns have conditional statements that must all be evaluated to build the result set (e.g., WHERE receiver = 1 AND sender = 2). In an OR condition, a multi-column index would be leveraged as though it were a single-column index only for the first column; the second column would be unindexed.
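In EF Core fluent configuration, the two single-column indices would look roughly like this:
modelBuilder.Entity<Transfer>()
.HasIndex(p => p.SenderId);
modelBuilder.Entity<Transfer>()
.HasIndex(p => p.ReceiverId);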
The full intricacies of index design would take far more than an SO answer to explain; there are whole books about it, and it makes up a reasonable proportion of a database administrator's job.
Indexes have a cost to maintain so you generally strive to have the fewest possible that offer you the most flexibility with what you want to do. Generally an index will have some columns that define its key and a reference to rows in the table that have those keys. When using an index the database engine can quickly look up the key, and discover which rows it needs to read from. It then looks up those rows as a secondary operation.
Indexes can also store table data that isn't part of the lookup key, so you might find yourself creating indexes that also carry other columns from the row; that way, by the time the database has found the key it's looking for in the index, it already has access to the row data the query wants and doesn't need to launch a second lookup operation to find the row. If a query wants too many rows from a table, the database might decide to skip the index entirely; there's some threshold beyond which it's faster to read all the rows directly from the table and search them than to suffer the indirection of using the index to find which rows need to be read.
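As an aside, recent EF Core versions with the SQL Server provider can declare such carried-along columns via IncludeProperties; a sketch against the Transfer entity (provider- and version-specific, so treat it as an assumption rather than a recipe):
// The index key is SenderId; status and type ride along in the index
// so the lookup doesn't need a second trip to the table row for them.
modelBuilder.Entity<Transfer>()
.HasIndex(p => p.SenderId)
.IncludeProperties(p => new { p.TransferStatus, p.TransferType });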
The columns that an index covers can serve more than one query, and order is important. If you always query a person by name and sometimes also query by age, but never query by age alone, it is better to index (name, age) than (age, name). An index on (name, age) can serve a query for just WHERE name = ..., and also WHERE name = ... AND age = .... If you use an OR keyword in a where clause, you can consider that a separate query entirely that would need its own index; indeed, the database might decide to run "name or age" as two parallel queries and combine the results to remove duplicates. If your app's needs later change so that instead of querying a mix of (name) and (name and age) it is now frequently querying (name), (name and age), (name or age), (age), and (age and height), then it might make sense to have two indexes: (name, age) plus (age, height). The database can use part or all of both of these to serve the common queries. Remember that using part of an index only works from left to right; an index on (name, age) wouldn't typically serve a query for age alone.
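The (name, age) example above, expressed as fluent configuration for a hypothetical Person entity:
// Serves WHERE Name = ... and WHERE Name = ... AND Age = ...,
// but not a query that filters on Age alone.
modelBuilder.Entity<Person>()
.HasIndex(p => new { p.Name, p.Age });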
If you're using SQL Server and SSMS, you might find that showing the query plan also reveals a missing-index recommendation, and it's worth considering carefully whether an index needs to be added. Apps deployed to Microsoft Azure are also automatically monitored for common queries whose performance suffers from the lack of an index, and that can be the impetus to look at the query being run and see how existing or new indexes might be extended or rearranged to cover it. As first noted, it's not really something a single SO answer of a few lines can prep you for with an "always do this and it will be fine"; companies operating at large scale hire people whose sole mission is to make sure the database runs well. They usually grumble a lot about the devs, and more so about things like Entity Framework, because an EF LINQ query is a layer disconnected from the actual SQL being run and may not be the most optimal approach to getting the data. All of these things you have to contend with.
In this particular case it seems like an index on SenderId+TransferStatus+TransferType and another on ReceiverId+TransferStatus+TransferType could help the two queries shown, but I wouldn't go as far as to say "definitely do that" without taking a holistic view of everything this table contains, how many different values there are in those columns, and what it's used for in the context of the app. If Sender/Receiver are unique, there may be no point in adding more columns to the index as keys. If TransferStatus and TransferType change such that some combination of them helps uniquely identify a particular row out of hundreds, then it may make sense; but then, if this query only runs once a day compared to another that runs 10 times a second... There are too many variables and unknowns to provide a concrete answer to the question as presented. Don't optimize prematurely - indexing columns just because they're used in some where clause somewhere would be premature.
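If you did go that way, the fluent configuration might look like the following - a sketch rather than a recommendation, for the reasons above:
modelBuilder.Entity<Transfer>()
.HasIndex(p => new { p.SenderId, p.TransferStatus, p.TransferType });
modelBuilder.Entity<Transfer>()
.HasIndex(p => new { p.ReceiverId, p.TransferStatus, p.TransferType });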

How to order by concatenated values in Entity Framework / Linq?

I have a situation where the records in a parent table (Record) can have one of two related records in child tables (PhysicalPerson and Company). One of the columns will always be empty.
When displaying the records in a UI grid, the user should see only one of the two names in an OwnerName column and should be able to sort by OwnerName without knowing whether the value for any given record comes from Company or from PhysicalPerson.
To avoid denormalizing the data and copying and maintaining a Name column, I attempted to do it all in a LINQ query.
Basically, the desired SQL order by expression that works should look like this:
ORDER BY CONCAT(Record.PhysicalPerson.Name,
Record.PhysicalPerson.Surname,
Record.Company.Name)
This automatically ignores NULL values, and the results look acceptable.
So I tried to implement it in LINQ:
query = query.OrderBy(x => x.PhysicalPerson.Name +
x.PhysicalPerson.Surname +
x.Company.Name);
but the resulting query generated by Entity Framework looks like this:
[Extent4].[Name] + [Extent6].[Surname] + [Extent8].[Name] AS [C1]
...
ORDER BY [Project1].[C1] ASC
...
Obviously, + does not work as a substitute for CONCAT in SQL.
Is there any way to make EntityFramework generate CONCAT instead of + for string concatenation in OrderBy?
If not, then I guess I'll have to create a separate SQL view with a calculated column for this specific UI grid (which might be the more correct solution anyway).
Try this:
query = query.OrderBy(x => x.PhysicalPerson.Name).ThenBy(x => x.PhysicalPerson.Surname).ThenBy(x => x.Company.Name);
Unfortunately the LINQ-to-SQL/EF translators don't use CONCAT for string concatenation, which is inconsistent with the way other translations expect SQL to handle null automatically (or maybe the issue lies in SQL defining + differently from CONCAT). In any case, you can make the null handling explicit like so:
query = query.OrderBy(x => String.Concat(x.PhysicalPerson.Name ?? "", x.PhysicalPerson.Surname ?? "", x.Company.Name ?? ""));
You could also use + instead of String.Concat, but I think the intent is easier to see with String.Concat.

How can I speed up these LINQ queries?

Can I combine the following two LINQ queries into a single one to speed things up?
The first one searches and performs the paging:
Products.Data = db.Products.Where(x => x.ProductCode.Contains(search) ||
x.Name.Contains(search) ||
x.Description.Contains(search) ||
x.DescriptionExtra.Contains(search) ||
SqlFunctions.StringConvert(x.Price).Contains(search) ||
SqlFunctions.StringConvert(x.PriceOffer).Contains(search) ||
SqlFunctions.StringConvert(x.FinalPrice).Contains(search) ||
SqlFunctions.StringConvert(x.FinalPriceOffer).Contains(search))
.OrderBy(p => p.ProductID)
.Skip(PageSize * (page - 1))
.Take(PageSize).ToList();
while the second one counts the total filtered results.
int count = db.Products.Where(x => x.ProductCode.Contains(search) ||
x.Name.Contains(search) ||
x.Description.Contains(search) ||
x.DescriptionExtra.Contains(search) ||
SqlFunctions.StringConvert(x.Price).Contains(search) ||
SqlFunctions.StringConvert(x.PriceOffer).Contains(search) ||
SqlFunctions.StringConvert(x.FinalPrice).Contains(search) ||
SqlFunctions.StringConvert(x.FinalPriceOffer).Contains(search))
.Count();
Get rid of your ridiculously inefficient conversions.
SqlFunctions.StringConvert(x.Price).Contains(search) ||
No index use possible, full table scan, plus a conversion - that is as bad as it gets.
And make sure you have all indices.
Nothing else you can do.
I do not think you can combine them directly. That is one problem of paging: you need to know the total count of results anyway. A further problem with dynamic paging is that one page can be inconsistent with another, because it comes from a different point in time; you can easily miss an item completely because of this. If that can be a problem, I would avoid dynamic paging. You can fill the ids of the whole result into some temporary table on the server and do the paging from there, or you can return all ids from the full-text search and query the rest of the data on demand.
There are some more optimizations: you can start returning results only when the search string is at least 3 characters long, or you can build a special table with count estimates for this purpose. You can also decide to return only the first ten pages and save server storage for the ids (or client bandwidth for the ids).
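One way to sketch the "ids first, details on demand" idea with the question's variables (the StringConvert conditions are omitted for brevity; this trades an extra round trip for moving only the keys of the full result set):
// Run the expensive text filter once, but bring back only the keys.
var matchingIds = db.Products
.Where(x => x.ProductCode.Contains(search) ||
x.Name.Contains(search) ||
x.Description.Contains(search) ||
x.DescriptionExtra.Contains(search))
.Select(x => x.ProductID)
.ToList();
int count = matchingIds.Count; // total for the pager
var pageIds = matchingIds
.OrderBy(id => id)
.Skip(PageSize * (page - 1))
.Take(PageSize)
.ToList();
// Fetch full rows only for the current page (translates to an IN clause).
Products.Data = db.Products
.Where(x => pageIds.Contains(x.ProductID))
.OrderBy(x => x.ProductID)
.ToList();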
I am sad to see "stop using Contains" answers without an alternative. Searching in the middle of words is often a must. The fact is, SQL Server is terribly slow at text processing, and searching is no exception. AFAIK even full-text indexes would not help you much with in-the-middle substring searching.
For the presented query on 10k records I would expect about 40ms per query to get the counts or all results (on my desktop). You can make a computed, persisted column on this table with all the texts concatenated and all the numbers converted, and query only that column. It would speed things up significantly (under 10ms per query on my desktop).
[computedCol] AS (((((((((([text1]+' ')+[text2])+' ')+[text3])+' ')+CONVERT([nvarchar](max),[d1]))+' ')+CONVERT([nvarchar](max),[d2]))+' ')+CONVERT([nvarchar](max),[d3])) PERSISTED
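If that computed column is then mapped onto the entity (ComputedCol here is an illustrative property name, not from the original), both the page and the count can be driven off a single filter:
var filtered = db.Products.Where(x => x.ComputedCol.Contains(search));
int count = filtered.Count();
Products.Data = filtered
.OrderBy(p => p.ProductID)
.Skip(PageSize * (page - 1))
.Take(PageSize)
.ToList();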
Stop using the Contains function, because it is very slow (if you can).
Make sure that your queries can make use of indexes in the DB.
If you MUST have Contains, then take a look at the full-text search capabilities of SQL Server, but you might need to drop to pure SQL or customise how your LINQ is translated to SQL to make use of full-text indexes.

Querying with many (~100) search terms with Entity Framework

I need to do a query on my database that might be something like this, where there could realistically be 100 or more search terms.
public IQueryable<Address> GetAddressesWithTown(string[] towns)
{
IQueryable<Address> addressQuery = DbContext.Addresses;
addressQuery = addressQuery.Where(x => towns.Any(y => x.Town == y));
return addressQuery;
}
However, when it contains more than about 15 terms it throws an exception on execution because the generated SQL is too long.
Can this kind of query be done through Entity Framework?
What other options are there available to complete a query like this?
Sorry, are we talking about THIS EXACT SQL?
In that case it is a very simple "open your eyes" thing.
There is a way (Contains) to map that array into an IN clause, which results in ONE sql condition (town in ('','','')).
Let me see whether I get this right:
addressQuery = addressQuery.Where(x => towns.Any(y => x.Town == y));
should be
addressQuery = addressQuery.Where(x => towns.Contains(x.Town));
The resulting SQL will be a LOT smaller. 100 items is still taxing it; I would dare say you may have a db or app design issue here that requires a business-side analysis - I have not met this requirement in the 20 years I have worked with databases.
This looks like a scenario where you'd want to use the PredicateBuilder as this will help you create an Or based predicate and construct your dynamic lambda expression.
This is part of a library called LinqKit by Joseph Albahari who created LinqPad.
public IQueryable<Address> GetAddressesWithTown(string[] towns)
{
var predicate = PredicateBuilder.False<Address>();
foreach (string town in towns)
{
string temp = town;
predicate = predicate.Or (p => p.Town.Equals(temp));
}
return DbContext.Addresses.Where (predicate);
}
You've broadly got two options:
You can replace .Any with a .Contains alternative.
You can use plain SQL with table-valued-parameters.
Using .Contains is easier to implement and will help performance because it translates to an inline SQL IN clause, so 100 towns shouldn't be a problem. However, it also means that the exact SQL depends on the exact number of towns: you're forcing SQL Server to recompile the query for each number of towns. These recompilations can be expensive when the query is complex, and they can evict other query plans from the cache as well.
Using table-valued-parameters is the more general solution, but it's more work to implement, particularly because it means you'll need to write the SQL query yourself and cannot rely on the entity framework. (Using ObjectContext.Translate you can still unpack the query results into strongly-typed objects, despite writing sql). Unfortunately, you cannot use the entity framework yet to pass a lot of data to sql server efficiently. The entity framework doesn't support table-valued-parameters, nor temporary tables (it's a commonly requested feature, however).
A bit of TVP sql would look like this: select ... from ... join @townTableArg townArg on townArg.town = address.town or select ... from ... where address.town in (select town from @townTableArg).
You probably can work around the EF restriction, but it's not going to be fast and will probably be tricky. A workaround would be to insert your values into some intermediate table, then join with that - that's still 100 inserts, but those are separate statements. If a future version of EF supports batch CUD statements, this might actually work reasonably.
Almost equivalent to table-valued parameters would be to bulk-insert into a temporary table and join with that in your query. Mostly that just means your table name will start with '#' rather than '@' :-). The temp table has a little more overhead, but you can put indexes on it, and in some cases that means the subsequent query will be much faster (for really huge data quantities).
Unfortunately, using either temporary tables or bulk insert from C# is a hassle. The simplest solution here is to make a DataTable; this can be passed to either. However, DataTables are relatively slow; the overhead might be relevant once you start adding millions of rows. The fastest (general) solution is to implement a custom IDataReader; almost as fast is an IEnumerable<SqlDataRecord>.
By the way, to use a table-valued-parameter, the shape ("type") of the table parameter needs to be declared on the server; if you use a temporary table you'll need to create it too.
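A rough sketch of the DataTable-plus-TVP route (the type name dbo.TownTableType, the SQL text, and the table/column names are assumptions; the table type must be created on the server beforehand, and System.Data / System.Data.SqlClient are needed):
// On the server, once: CREATE TYPE dbo.TownTableType AS TABLE (Town nvarchar(100));
var townTable = new DataTable();
townTable.Columns.Add("Town", typeof(string));
foreach (var town in towns)
townTable.Rows.Add(town);
var townParam = new SqlParameter("@towns", SqlDbType.Structured)
{
TypeName = "dbo.TownTableType",
Value = townTable
};
// EF6-style raw SQL query; the structured parameter is passed straight through to SqlCommand.
var addresses = DbContext.Database.SqlQuery<Address>(
"SELECT a.* FROM dbo.Addresses a JOIN @towns t ON t.Town = a.Town",
townParam).ToList();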
Some pointers to get you started:
http://lennilobel.wordpress.com/2009/07/29/sql-server-2008-table-valued-parameters-and-c-custom-iterators-a-match-made-in-heaven/
SqlBulkCopy from a List<>

Selecting first 100 records using Linq

How can I return first 100 records using Linq?
I have a table with 40million records.
This code works, but it's slow, because it will return all values before filtering:
var values = (from e in dataContext.table_sample
where e.x == 1
select e)
.Take(100);
Is there a way to return only the filtered rows, like the T-SQL TOP clause?
No, that doesn't return all the values before filtering. The Take(100) will end up being part of the SQL sent up - quite possibly using TOP.
Of course, it makes more sense to do that when you've specified an orderby clause.
LINQ doesn't execute the query when it reaches the end of your query expression. It only sends up any SQL when either you call an aggregation operator (e.g. Count or Any) or you start iterating through the results. Even calling Take doesn't actually execute the query - you might want to put more filtering on it afterwards, for instance, which could end up being part of the query.
When you start iterating over the results (typically with foreach) - that's when the SQL will actually be sent to the database.
(I think your where clause is a bit broken, by the way. If you've got problems with your real code it would help to see code as close to reality as possible.)
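A small illustration of that composability, using the question's context (the extra e.y filter is purely hypothetical):
// Nothing is sent to the database yet; these calls only build up the query.
var query = dataContext.table_sample.Where(e => e.x == 1).Take(100);
// Still nothing sent; further composition simply reshapes the eventual SQL.
query = query.Where(e => e.y > 10); // hypothetical additional filter
// The single SELECT TOP (100) ... statement is issued here.
var results = query.ToList();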
I don't think you are right about it returning all records before taking the top 100. LINQ decides what the SQL string is going to be at the time the query is executed (deferred execution, sometimes loosely called lazy loading), and your database server will optimize it.
Have you compared a standard SQL query with your LINQ query? Which one is faster, and how significant is the difference?
I do agree with the comments above that your LINQ query is generally correct, but...
in your 'where' clause it should probably be x == 1, not x = 1 (comparison instead of assignment)
'select e' will return all columns when you probably need only some of them - be more precise in the select clause (list only the required columns); 'select *' is a waste of resources
make sure your database is well indexed and try to make use of the indexed data
Anyway, a 40-million-record database is quite huge - do you need all that data all the time? Maybe some kind of partitioning could reduce it to the most commonly used records.
I agree with Jon Skeet, but just wanted to add:
The generated SQL will use TOP to implement Take().
If you're able to run SQL Profiler and step through your code in debug mode, you will be able to see exactly what SQL is generated and when it gets executed. If you find the time to do this, you will learn a lot about what happens underneath.
There is also a DataContext.Log property to which you can assign a TextWriter to view the generated SQL, for example:
dbContext.Log = Console.Out;
Another option is to experiment with LINQPad. LINQPad allows you to connect to your data source and easily try different LINQ expressions. In the results panel, you can switch to see the SQL generated by the LINQ expression.
I'm going to go out on a limb and guess that you don't have an index on the column used in your where clause. If that's the case then it's undoubtedly doing a table scan when the query is materialized and that's why it's taking so long.
