How smart is SQLite.Net when using Linq? - c#

When using SQLite.Net, LINQ queries are translated into (hopefully performant) SQL queries.
Is there, however, a chance that some query is too complex and leads to a full, slow table scan? Say I want to get one item from a table with 50k rows - I don't want it to loop over all rows and compare the values against my requested value in .NET code; it should generate a proper WHERE clause.
I know that one can inspect the queries SQLite.Net generates, but since they are dynamic there is no guarantee that some unfortunate combination of clauses won't lead to slow performance.
Or am I thinking too far and this cannot happen?
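For reference, the pattern in question looks roughly like this - a sketch using sqlite-net's `Table<T>()` API, where the `Item` class and the database path are made up for illustration:

```csharp
using SQLite;

public class Item
{
    [PrimaryKey]
    public int Id { get; set; }
    public string Name { get; set; }
}

var db = new SQLiteConnection("app.db");

// The lambda below is translated into SQL along the lines of
// "SELECT * FROM Item WHERE Id = ?", so SQLite does the filtering --
// the 50k rows are not pulled into .NET and compared one by one.
var item = db.Table<Item>().Where(x => x.Id == 1234).FirstOrDefault();
```

In my understanding, sqlite-net's expression translator is deliberately small: expressions it cannot translate tend to throw at query time rather than silently falling back to an in-memory scan, but checking the generated SQL for your specific clauses is still the safest way to be sure.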

Related

Entity Framework performance of include

It's more a technical (behind-the-scenes of EF) kind of question, for my own better understanding of Include.
Does including another table make the query faster when there is a Select statement at the end?
ctx.tableOne
   .Include("tableTwo")
   .Where(t1 => t1.Value1 == "SomeValueFor")
   .Select(res => new {
       res.Value1,
       res.tableTwo.Value1,
       res.tableTwo.Value2,
       res.tableTwo.Value3,
       res.tableTwo.Value4
   });
Might it depend on the number of values included from the other table?
In the example above, 4 out of 5 selected values come from the included table. I wonder whether that has any performance impact, good or bad.
So, my question is: what is EF doing behind the scenes and is there any preferred way to use Include when knowing all the values I will select before?
In your case it doesn't matter whether you use Include(<relation-property-name>) or not, because you don't materialize the values before the Select(<mapping-expression>). If you use the SQL Server Profiler (or another profiler) you can see that EF generates exactly the same query in both cases.
The reason is that the data is not materialized in memory before the Select - you are working on an IQueryable, which means EF generates a single SQL query at the end (when you call First(), Single(), FirstOrDefault(), SingleOrDefault(), ToList(), or use the collection in a foreach statement). If you call ToList() before the Select(), it materializes the entities from the database into memory; that is where Include() comes in handy, avoiding N+1 queries when you access navigation properties to other tables.
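A sketch of the two situations described above (the context and entity names follow the question; the anonymous-type member names are made up):

```csharp
// Stays IQueryable: EF composes ONE SQL statement that joins tableTwo
// and selects only the projected columns. Include() is redundant here.
var projected = ctx.tableOne
    .Where(t1 => t1.Value1 == "SomeValueFor")
    .Select(res => new { res.Value1, TwoValue1 = res.tableTwo.Value1 });

// Materializes first: ToList() pulls full tableOne entities into memory.
// Now Include() matters -- without it, touching res.tableTwo afterwards
// would trigger a separate (lazy) query per row, i.e. N+1 queries.
var materialized = ctx.tableOne
    .Include("tableTwo")
    .Where(t1 => t1.Value1 == "SomeValueFor")
    .ToList()
    .Select(res => new { res.Value1, TwoValue1 = res.tableTwo.Value1 });
```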
It is about how you want EF to load your data. If you want a table's data to be pre-populated, use Include. It is handier when the included table is going to be used frequently, though it is a little slower because EF has to load all the relevant data beforehand. Read up on the difference between lazy and eager loading: with Include you get eager loading, where the data is pre-populated, while otherwise EF sends a call to the secondary table when the projection takes place, i.e. lazy loading.
I agree with #Karamfilov's general discussion, but in your example the query may not be the most performant. Performance can be affected by many factors, such as the indexes present on the table, but you must always help EF with the SQL it generates. The Include method can produce SQL that includes all columns of the table; you should always check the generated SQL and verify whether you can obtain a better statement using a Join.
This article explains the techniques that can be used and what impact they have on performance: https://msdn.microsoft.com/it-it/library/bb896272(v=vs.110).aspx

Should I do the Where clause in SQL or in LINQ?

I have a method that can pass in a where clause to a query. This method then returns a DataSet.
What is faster and more efficient? To pass the where clause through so the SQL Server sends back less data (but has to do more work) or let the web server handle it through LINQ?
Edit:
This is assuming that the SQL Server is more powerful than the web server (as probably should be the case).
Are you using straight-up ADO.NET to perform your data access? If so, then yes - use a WHERE clause in your SQL and limit the amount of data sent back to your application.
SQL Server is efficient at this: you can design indexes to help it access the data, and you transmit less data back to your client application.
Imagine you have 20,000 rows in a table but are only interested in 100 of them. It is obviously much more efficient to grab just the 100 rows at the source and send them back, rather than the whole lot, which you then filter in your web application.
You have tagged linq-to-sql, if that's the case then using a WHERE clause in your LINQ statement will generate the WHERE clause on the SQL Server.
But as an overall rule of thumb, just fetch the data you are interested in: it's less data over the wire, the query will generally run faster (as long as it's optimised via indexes etc.), and your client application has less work to do because it already has only the data it's interested in.
SQL Server is good at filtering data; in fact, that's what it's built for, so always make use of that. If you filter in C#, you won't be able to make use of any indexes you have on your tables. It's going to be far less efficient.
Selecting all rows only to discard many/most of them is wasteful and it will definitely show in the performance.
If you are not using any sort of ORM, then use the WHERE condition at the database level, as filtering should happen at the database level.
But if you're using an ORM like Entity Framework or LINQ to SQL, then from a performance point of view it is the same: your LINQ Where clause will ultimately be translated into a SQL WHERE clause, as long as you apply it to an IQueryable.
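The distinction above can be sketched like this, assuming a LINQ to SQL DataContext with an `Orders` table (the names are illustrative):

```csharp
// Filter composed on IQueryable<T>: translated into
// "SELECT ... FROM Orders WHERE Total > @p0" -- SQL Server filters.
var big = db.Orders.Where(o => o.Total > 1000m).ToList();

// Filter applied after AsEnumerable(): the WHERE disappears from the
// generated SQL; every row crosses the wire and C# filters in memory.
var slow = db.Orders.AsEnumerable().Where(o => o.Total > 1000m).ToList();
```

The two calls return the same rows; only the first lets the server and its indexes do the work.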
From the point of view of efficiency, it is SQL Server that should do the job. As long as it does not require multiple database calls, using SQL Server is always the better solution. But if you already have a DataSet from the database, you can filter it with LINQ.

How to maximize performance?

I have a problem which I cannot seem to get around no matter how hard I try.
The company works in market analysis and has pretty large tables (300K - 1M rows) with MANY columns (think 250-300) on which we do some calculations.
I'll try to get straight to the problem:
The problem is filtering the data. All databases I've tried so far are way too slow to select the data and return it.
At the moment I am storing the entire table in memory and filtering using dynamic LINQ.
However, while this is quite fast (about 100 ms to filter 250,000 rows), I need better results than this...
Is there any way I can change something in my code (not the data model) which could speed the filtering up?
I have tried using:
DataTable.Select, which is slow.
Dynamic LINQ, which is better, but still too slow.
Normal LINQ (just for testing purposes), which is almost good enough.
Fetching from MySQL and doing the processing later on, which is badass slow.
At the beginning of this project we thought that some high-performance database would be able to handle this, but I tried:
H2 (IKVM)
HSQLDB (compiled ODBC-driver)
CubeSQL
MySQL
SQL
SQLite
...
And they are all very slow to interface with from .NET and to get results from.
I have also tried splitting the data into chunks and combining them later in runtime to make the total amount of data which needs filtering smaller.
Is there any way in this universe I can make this faster?
Thanks in advance!
UPDATE
I just want to add that I have not created this database in question.
To add some figures: if I do a simple select of two fields in the database query window (SQLyog), like this (visit_munic_name is indexed):
SELECT key1, key2 FROM table1 WHERE filter1 = filterValue1
It takes 125 milliseconds on 225639 rows.
Why is it so slow? I have tested on 2 different boxes.
Surely something must be changed, obviously?
You do not explain what exactly you want to do, or why filtering a lot of rows is important. Why should it matter how fast you can filter 1M rows to get an aggregate if your database can precalculate that aggregate for you? In any case it seems you are using the wrong tools for the job.
On one hand, 1M rows is a small number of rows for most databases. As long as you have the proper indexes, querying shouldn't be a big problem. I suspect that either you do not have indexes on your query columns or you want to perform ad-hoc queries on non-indexed columns.
Furthermore, it doesn't matter which database you use if your data schema is wrong for the job. Analytical applications typically use star schemas to allow much faster queries for a lot more data than you describe.
All databases used for analysis purposes use special data structures which require that you transform your data to a form they like.
For typical relational databases you have to create star schemas that are combined with cubes to precalculate aggregates.
Column databases store data in a columnar format usually combined with compression to achieve fast analytical queries, but they require that you learn to query them in their own language, which may be very different than the SQL language most people are accustomed to.
On the other hand, the way you query (LINQ or DataTable.Select or whatever) has minimal effect on performance. Picking the proper data structure is much more important.
For instance, using a Dictionary<> is much faster than any of the techniques you mentioned. A dictionary essentially checks for single values in memory. Executing DataTable.Select without indexes, or using LINQ to DataSets or to Objects, is essentially the same as scanning all entries of an array or a List<> for a specific value, because that is what all these methods do - scan an entire list sequentially.
The various LINQ providers do not do the job of a database. They do not optimize your queries. They just execute what you tell them to execute. Even doing a binary search on a sorted list is faster than using the generic LINQ providers.
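A minimal illustration of the dictionary point (the row shape and key values are made up):

```csharp
using System;
using System.Linq;

var rows = Enumerable.Range(0, 250_000)
    .Select(i => new { Key = i, Value = "row " + i })
    .ToList();

// O(n): LINQ to Objects walks the whole list for every lookup.
var hit1 = rows.FirstOrDefault(r => r.Key == 123_456);

// Build a hash index once; each subsequent lookup is O(1) on average.
var byKey = rows.ToDictionary(r => r.Key);
var hit2 = byKey[123_456];
```

The one-time cost of building the dictionary pays for itself as soon as you perform more than a handful of lookups.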
There are various things you can try, depending on what you need to do:
If you are looking for a quick way to slice and dice your data, use an existing product like PowerPivot functionality of Excel 2010. PowerPivot loads and compresses MANY millions of rows in an in-memory columnar format and allows you to query your data just as you would with a Pivot table, and even define joins with other in memory sources.
If you want a more repeatable process you can either create the appropriate star schemas in a relational database or use a columnar database. In either case you will have to write the scripts to load your data in the proper structures.
If you are creating your own application, you really need to investigate the various algorithms and structures used by other similar tools for in-memory analysis.

LINQ to SQL - how to make it work with the database faster

I have a problem. My LINQ to SQL queries are pushing data to the database at ~1000 rows per second, but this is much too slow for me. The objects are not complicated, CPU usage is <10%, and bandwidth is not the bottleneck either.
The 10% is on the client; the server is at 0%, or at most 1% - generally not working at all, not traversing indexes, etc.
Why is 1000/s slow? I need something around 20,000/s - 200,000/s, otherwise I will receive more data than I can process.
I don't use an explicit transaction, but LINQ does: when I post, for example, a million new objects to the DataContext and run SubmitChanges(), LINQ performs the inserts in an internal transaction.
I don't use parallel LINQ and I don't have many selects; in this scenario I'm mostly inserting objects, and I want to use all the resources I have, not only 5% of the CPU and 10 KB/s of network!
when I post, for example, a million objects
Forget it. Linq2sql is not intended for such large batch updates/inserts.
The problem is that Linq2sql will execute a separate insert (or update) statement for each insert (update). This kind of behaviour is not suitable with such large numbers.
For inserts you should look into SqlBulkCopy, because it is a lot faster (really, orders of magnitude faster).
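A minimal SqlBulkCopy sketch; the connection string, table name, and batch size are placeholders, and the rows are assumed to already be in a DataTable whose columns match the destination table:

```csharp
using System.Data;
using System.Data.SqlClient;

static void BulkInsert(string connectionString, DataTable rows)
{
    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.Measurements"; // hypothetical table
        bulk.BatchSize = 10_000;
        bulk.WriteToServer(rows); // streams rows instead of one INSERT each
    }
}
```

Instead of one round-trip per row, the rows are streamed to the server using the bulk-load path, which is where the order-of-magnitude difference comes from.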
Some performance optimization can be achieved with LINQ to SQL, first off by using precompiled queries. A large part of the cost is compiling the actual query.
http://www.albahari.com/nutshell/speedinguplinqtosql.aspx
http://msdn.microsoft.com/en-us/library/bb399335.aspx
You can also disable object tracking, which may give you milliseconds of improvement. This is done on the DataContext right after you instantiate it.
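Both tweaks together look roughly like this; MyDataContext, Customer, and the query itself are made-up names for the sketch:

```csharp
using System;
using System.Data.Linq;
using System.Linq;

// Compiled once, reused on every call -- skips re-translating the
// expression tree into SQL each time the query runs.
static readonly Func<MyDataContext, int, Customer> CustomerById =
    CompiledQuery.Compile((MyDataContext db, int id) =>
        db.Customers.FirstOrDefault(c => c.Id == id));

using (var db = new MyDataContext())
{
    db.ObjectTrackingEnabled = false; // read-only: skip change tracking
    var customer = CustomerById(db, 42);
}
```

Note that with object tracking disabled the context is read-only: SubmitChanges() is no longer an option for entities loaded this way.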
I also encountered this problem before. The solution I used is Entity Framework. There is a tutorial here. One traditional way is to use LINQ to Entities, which has similar syntax and seamless integration with C# objects. This gave me roughly 10x acceleration in my experience. But a more efficient way (by orders of magnitude) is to write the SQL statement yourself and then use the ExecuteStoreQuery function to fetch the results. It requires you to write SQL rather than LINQ statements, but the returned results can still be read by C# easily.

Why write a custom LINQ provider?

What is the benefit of writing a custom LINQ provider over writing a simple class which implements IEnumerable?
For example, this question shows Linq2Excel:
var book = new ExcelQueryFactory(@"C:\Users.xls");
var administrators = from x in book.Worksheet<User>()
                     where x.Role == "Administrator"
                     select x;
But what is the benefit over the "naive" implementation as IEnumerable?
A Linq provider's purpose is to basically "translate" Linq expression trees (which are built behind the scenes of a query) into the native query language of the data source. In cases where the data is already in memory, you don't need a Linq provider; Linq 2 Objects is fine. However, if you're using Linq to talk to an external data store like a DBMS or a cloud, it's absolutely essential.
The basic premise of any querying structure is that the data source's engine should do as much of the work as possible, and return only the data that is needed by the client. This is because the data source is assumed to know best how to manage the data it stores, and because network transport of data is relatively expensive time-wise, and so should be minimized. Now, in reality, that second part is "return only the data asked for by the client"; the server can't read your program's mind and know what it really needs; it can only give what it's asked for. Here's where an intelligent Linq provider absolutely blows away a "naive" implementation. Using the IQueryable side of Linq, which generates expression trees, a Linq provider can translate the expression tree into, say, a SQL statement that the DBMS will use to return the records the client is asking for in the Linq statement. A naive implementation would require retrieving ALL the records using some broad SQL statement, in order to provide a list of in-memory objects to the client, and then all the work of filtering, grouping, sorting, etc is done by the client.
For example, let's say you were using Linq to get a record from a table in the DB by its primary key. A Linq provider could translate dataSource.Query<MyObject>().Where(x=>x.Id == 1234).FirstOrDefault() into "SELECT TOP 1 * from MyObjectTable WHERE Id = 1234". That returns zero or one records. A "naive" implementation would probably send the server the query "SELECT * FROM MyObjectTable", then use the IEnumerable side of Linq (which works on in-memory classes) to do the filtering. In a statement you expect to produce 0-1 results out of a table with 10 million records, which of these do you think would do the job faster (or even work at all, without running out of memory)?
You don't need to write a LINQ provider if you only want to use the LINQ-to-Objects (i.e. foreach-like) functionality for your purpose, which mostly works against in-memory lists.
You do need to write a LINQ provider if you want to analyse the expression tree of a query in order to translate it to something else, like SQL. The ExcelQueryFactory you mentioned seems to work with an OLEDB-Connection for example. This possibly means that it doesn't need to load the whole excel file into memory when querying its data.
In general: performance. If you have some kind of index, you can answer a query much faster than is possible on a plain IEnumerable<T>.
LINQ to SQL is a good example of that. Here you transform the LINQ statement into another form understood by SQL Server, so the server does the filtering, ordering, ... using its indexes, and doesn't need to send the whole table to the client, which would then do the work with LINQ to Objects.
But there are simpler cases where it can be useful too:
If you have a tree index over the property Time, then a range query like .Where(x => (x.Time >= now) && (x.Time <= tomorrow)) can be optimized a lot and doesn't need to iterate over every item in the enumerable.
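The simplest stand-in for such an index is a sorted list plus binary search; a sketch, assuming `times` is a `List<DateTime>` kept sorted ascending and `now`/`tomorrow` are the range bounds:

```csharp
// O(log n) to locate each bound, versus the O(n) scan performed by
// Where(x => x >= now && x <= tomorrow) on a plain enumerable.
int lo = times.BinarySearch(now);
if (lo < 0) lo = ~lo;                  // first element >= now
int hi = times.BinarySearch(tomorrow);
if (hi < 0) hi = ~hi - 1;              // last element <= tomorrow

var range = hi >= lo
    ? times.GetRange(lo, hi - lo + 1)  // the matching window, in order
    : new List<DateTime>();            // empty range
```

A query provider backed by a real tree index does essentially the same thing: it recognizes the range predicate in the expression tree and seeks to the bounds instead of enumerating.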
LINQ provides deferred execution as much as possible to improve performance.
IEnumerable<> and IQueryable<> lead to quite different implementations: IQueryable builds an expression tree dynamically, from which the provider generates a native query, and this indeed gives better performance than IEnumerable.
http://msdn.microsoft.com/en-us/vcsharp/ff963710.aspx
If we are not sure of the type, we can use the var keyword and the compiler will infer the most suitable type at compile time.
