Why write a custom LINQ provider? - c#

What is the benefit of writing a custom LINQ provider over writing a simple class which implements IEnumerable?
For example, this question shows Linq2Excel:
var book = new ExcelQueryFactory(@"C:\Users.xls");
var administrators = from x in book.Worksheet<User>()
where x.Role == "Administrator"
select x;
But what is the benefit over the "naive" implementation as IEnumerable?

A Linq provider's purpose is to basically "translate" Linq expression trees (which are built behind the scenes of a query) into the native query language of the data source. In cases where the data is already in memory, you don't need a Linq provider; Linq 2 Objects is fine. However, if you're using Linq to talk to an external data store like a DBMS or a cloud, it's absolutely essential.
The basic premise of any querying structure is that the data source's engine should do as much of the work as possible, and return only the data that is needed by the client. This is because the data source is assumed to know best how to manage the data it stores, and because network transport of data is relatively expensive time-wise, and so should be minimized. Now, in reality, that second part is "return only the data asked for by the client"; the server can't read your program's mind and know what it really needs; it can only give what it's asked for.
Here's where an intelligent Linq provider absolutely blows away a "naive" implementation. Using the IQueryable side of Linq, which generates expression trees, a Linq provider can translate the expression tree into, say, a SQL statement that the DBMS will use to return the records the client is asking for in the Linq statement. A naive implementation would require retrieving ALL the records using some broad SQL statement, in order to provide a list of in-memory objects to the client, and then all the work of filtering, grouping, sorting, etc. is done by the client.
For example, let's say you were using Linq to get a record from a table in the DB by its primary key. A Linq provider could translate dataSource.Query<MyObject>().Where(x=>x.Id == 1234).FirstOrDefault() into "SELECT TOP 1 * from MyObjectTable WHERE Id = 1234". That returns zero or one records. A "naive" implementation would probably send the server the query "SELECT * FROM MyObjectTable", then use the IEnumerable side of Linq (which works on in-memory classes) to do the filtering. In a statement you expect to produce 0-1 results out of a table with 10 million records, which of these do you think would do the job faster (or even work at all, without running out of memory)?
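To make that concrete, here is a small self-contained sketch (using LINQ-to-Objects' AsQueryable as a stand-in for a real provider) showing that the IQueryable side hands the provider an inspectable expression tree, while the IEnumerable side only ever sees an opaque compiled delegate:

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

class ProviderDemo
{
    static void Main()
    {
        var data = new[] { 1, 2, 3, 1234 };

        // IEnumerable<T>.Where takes a compiled delegate: the filter is
        // opaque and can only be executed in memory, element by element.
        Func<int, bool> del = x => x == 1234;

        // IQueryable<T>.Where takes an expression tree: a provider can
        // inspect it and translate it (e.g. into a SQL WHERE clause).
        Expression<Func<int, bool>> expr = x => x == 1234;
        IQueryable<int> query = data.AsQueryable().Where(expr);

        // The tree is data, not code: a provider walks nodes like this
        // BinaryExpression to emit something like "WHERE Id = 1234".
        var body = (BinaryExpression)expr.Body;
        Console.WriteLine(body.NodeType);  // Equal
        Console.WriteLine(body.Right);     // 1234
        Console.WriteLine(query.Single()); // 1234
    }
}
```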

You don't need to write a LINQ provider if you only want to use the LINQ-to-Objects (i.e. foreach-like) functionality, which mostly works against in-memory collections.
You do need to write a LINQ provider if you want to analyse the expression tree of a query in order to translate it into something else, like SQL. The ExcelQueryFactory you mentioned seems to work with an OLEDB connection, for example. This probably means that it doesn't need to load the whole Excel file into memory when querying its data.

In general: performance. If you have some kind of index, you can do a query much faster than is possible on a simple IEnumerable<T>.
Linq-to-Sql is a good example of that. Here you transform the LINQ statement into another form understood by the SQL Server. So the server does the filtering, ordering, ... using its indexes, and doesn't need to send the whole table to the client, which would then do the work with Linq-to-Objects.
But there are simpler cases where it can be useful too:
If you have a tree index over the property Time, then a range query like .Where(x => (x.Time >= now) && (x.Time <= tomorrow)) can be optimized a lot, and doesn't need to iterate over every item in the enumerable.
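A minimal in-memory sketch of that idea, using a sorted array plus Array.BinarySearch as a stand-in for a tree index:

```csharp
using System;
using System.Linq;

class RangeDemo
{
    static void Main()
    {
        // Pretend these are timestamps, already sorted (our "index").
        int[] times = Enumerable.Range(0, 1_000_000).ToArray();
        int lo = 500_000, hi = 500_010;

        // Linq-to-Objects: O(n) - tests the predicate on every element.
        var scan = times.Where(t => t >= lo && t <= hi).ToArray();

        // With an index, a provider can instead do two O(log n) lookups
        // to find the boundaries and return only the slice in between.
        int start = Array.BinarySearch(times, lo);
        if (start < 0) start = ~start;
        int end = Array.BinarySearch(times, hi);
        if (end < 0) end = ~end - 1;
        var indexed = times[start..(end + 1)];

        Console.WriteLine(scan.SequenceEqual(indexed)); // True
    }
}
```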

LINQ uses deferred execution wherever possible to improve performance.
IEnumerable<T> and IQueryable<T> lead to quite different implementations: IQueryable produces a native query by building an expression tree dynamically, which generally performs better than filtering through IEnumerable.
http://msdn.microsoft.com/en-us/vcsharp/ff963710.aspx
If we are not sure of the result type, we can use the var keyword and the compiler will infer the most suitable type.
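A small demonstration of the deferred execution mentioned above - the query runs only when enumerated, so it sees data added after it was built:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredDemo
{
    static void Main()
    {
        var source = new List<int> { 1, 2 };

        // Building the query runs nothing yet: Where is lazy.
        IEnumerable<int> evens = source.Where(x => x % 2 == 0);

        // An item added before enumeration is still seen by the query.
        source.Add(4);

        Console.WriteLine(string.Join(",", evens)); // 2,4
    }
}
```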

Related

Should I do the Where clause in SQL or in LINQ?

I have a method that can pass in a where clause to a query. This method then returns a DataSet.
What is faster and more efficient? To pass the where clause through so the SQL Server sends back less data (but has to do more work) or let the web server handle it through LINQ?
Edit:
This is assuming that the SQL Server is more powerful than the web server (as probably should be the case).
Are you using straight-up ADO.NET to perform your data access? If so, then yes - use a WHERE clause in your SQL and limit the amount of data sent back to your application.
SQL Server is efficient at this: you can design indexes to help it access data, and you are transmitting less data back to your client application.
Imagine you have 20,000 rows in a table, but you are only interested in 100 of them. Of course it is much more efficient to grab only the 100 rows from the source and send them back, rather than the whole lot which you then filter in your web application.
You have tagged linq-to-sql, if that's the case then using a WHERE clause in your LINQ statement will generate the WHERE clause on the SQL Server.
But overall rule of thumb, just get the data you are interested in. It's less data over the wire, the query will generally run faster (as long as it's optimised via indexes etc) and your client application will have less work to do, it's already got only the data it's interested in.
SQL Server is good at filtering data; in fact, that's what it's built for, so always make use of that. If you filter in C#, you won't be able to make use of any indexes you have on your tables. It's going to be far less efficient.
Selecting all rows only to discard many/most of them is wasteful and it will definitely show in the performance.
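A sketch of the boundary involved, using an in-memory IQueryable as a stand-in for a real Linq-to-Sql table: keeping the Where on the IQueryable side leaves it in the expression tree where a provider can translate it into a SQL WHERE, whereas AsEnumerable() forces client-side filtering after all rows have crossed the wire:

```csharp
using System;
using System.Linq;

class WhereBoundaryDemo
{
    static void Main()
    {
        var rows = Enumerable.Range(1, 20_000).AsQueryable();

        // Kept as IQueryable: the Where is part of the query's expression
        // tree, so a real provider would ship it to the server.
        var serverSide = rows.Where(r => r <= 100);
        Console.WriteLine(serverSide.Expression.NodeType); // Call

        // After AsEnumerable() the filter is a plain delegate: with a
        // real provider, all 20,000 rows would be fetched first.
        var clientSide = rows.AsEnumerable().Where(r => r <= 100);

        Console.WriteLine(serverSide.Count()); // 100
        Console.WriteLine(clientSide.Count()); // 100
    }
}
```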
If you are not using any sort of ORM, then use the WHERE condition at the database level, as I believe filtering should happen at the database level.
But if you're using an ORM like Entity Framework or Linq to SQL, then from a performance point of view it is the same, as your LINQ Where clause will ultimately be translated into a SQL WHERE clause, as long as you apply it to an IQueryable.
From the point of view of efficiency, it is SQL Server that should do the job. As long as it doesn't require multiple database calls, it is always better to let SQL Server do the work. But if you already have a DataSet from the database, you can filter it with LINQ.

Why do I need LINQ if I use NHibernate-like ORMs?

I was reading this SO question but still I am not clear about one specific thing.
If I use NHibernate, why do I need LINQ?
The question in my mind became even more pressing when I learned that NHibernate also includes LINQ support.
LINQ to NHibernate?
WTF!
LINQ is a query language. It allows you to express queries in a way that is not tied in to your persistence layer.
You may be thinking about the LINQ 2 SQL ORM.
Using "LINQ" in the names of both causes unfortunate confusion like yours.
NHibernate, EF and LINQ2XML are all LINQ providers - they all allow you to query a data source using the LINQ syntax.
Well, you don't need Linq, you can always do without it, but you might want it.
Linq provides a way to express operations that behave on sets of data that can be queried and where we can then perform other operations based on the state of that data. It's deliberately written so as to be as agnostic as possible whether that data is in-memory collections, XML, database, etc. Ultimately it's always operating on some sort of in-memory object, with some means of converting between in-memory and the ultimate source, though some bindings go further than others in pushing some of the operations down to a different layer. E.g. calling .Count() can end up looking at a Count property, spinning through a collection and keeping a tally, sending a Count(*) query to a database or maybe something else.
ORMs provide a way to have in-memory objects and database rows reflect each other, with changes to one being reflected by changes to the other.
That fits nicely into the "some means of converting" bit above. Hence Linq2SQL, EF and Linq2NHibernate all fulfil both the ORM role and the Linq provider role.
Considering that Linq can work on collections you'd have to be pretty perverse to create an ORM that couldn't support Linq at all (you'd have to design your collections to not implement IEnumerable<T> and hence not work with foreach). More directly supporting it though means you can offer better support. At the very least it should make for more efficient queries. For example if an ORM gave us a means to get a Users object that reflected all rows in a users table, then we would always be able to do:
int uID = (from u in Users where u.Username == "Alice" select u.ID).FirstOrDefault();
Without direct support for Linq by making Users implement IQueryable<User>, then this would become:
SELECT * FROM Users
Followed by:
while (dataReader.Read())
    yield return ConstructUser(dataReader);
Followed by:
foreach (var user in Users)
    if (user.Username == "Alice")
        return user.ID;
return 0;
Actually, it'd be just slightly worse than that. With direct support the SQL query produced would be:
SELECT TOP 1 id FROM Users WHERE username = 'Alice'
Then the C# becomes equivalent to
return dataReader.Read() ? dataReader.GetInt32(0) : 0;
It should be pretty clear how the greater built-in Linq support of a Linq provider should lead to better operation.
Linq is an in-language feature of C# and VB.NET, and can also be used from any .NET language, though not always with the same in-language syntax. As such, every .NET developer should know it, and every C# and VB.NET developer should particularly know it (or they don't know C# or VB.NET) - and that's the group NHibernate is designed to be used by, so its authors can depend on not needing to explain a whole bunch of operations, by just implementing them the Linq way. Not supporting it in a .NET library that represents queryable data should be considered a lack of completeness at best; the whole point of an ORM is to make manipulating a database as close as possible to non-DB-related operations in the programming language in use. In .NET that means Linq support.
First of all LINQ alone is not an ORM. It is a DSL to query the objects irrespective of the source it came from.
So it makes perfect sense that you can use LINQ with NHibernate too.
I believe you confused LINQ to SQL with plain LINQ.
Common sense?
There is a difference between an ORM like NHibernate and a compiler-integrated way to express queries, which is useful in many more scenarios.
Or: Usage of LINQ (not LINQ to SQL etc. - the language, which is what you are talking about, though I am not sure you meant what you said) means you don't have to deal with NHibernate's special query syntax.
Or: Anyone NOT using LINQ - regardless of NHibernate or not - disqualifies themselves without a good explanation.
You don't need it, but you might find it useful. Bear in mind that Linq, as others have said, is not the same thing as Linq to SQL. Where I work, we write our own SQL queries to retrieve data, but it's quite common for us to use Linq to manipulate that data in order to serve a specific need. For instance, you might have a data access method that allows you to retrieve all dogs owned by Dave:
new DogOwnerDal().GetForOwner(id);
If you're only interested in Dave's dachshunds for one specific need, and performance isn't that much of an issue, you can use Linq to filter the response for all of Dave's dogs down to the specific data that you need:
new DogOwnerDal().GetForOwner(id).Where(d => d.Breed == DogBreeds.Dachshund);
If performance was crucial, you might want to write a specific data access method to retrieve dogs by owner and breed, but in many cases the effort expended to create the new data access method doesn't increase efficiency enough to be worth doing.
In your example, you might want to use NHibernate to retrieve a lump of data, and then use Linq to break that data into lots of individual subsets for some form of processing. It might well be cheaper to get the data once and use Linq to split it up, instead of repeatedly interrogating the database for different mixtures of the same data.
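A minimal sketch of that get-once-then-split approach, with a hypothetical Dog type standing in for an NHibernate-mapped entity - one fetch, then Linq's GroupBy breaks the data into subsets in memory:

```csharp
using System;
using System.Linq;

class SplitDemo
{
    // Hypothetical record standing in for an ORM-mapped entity.
    record Dog(string Owner, string Breed);

    static void Main()
    {
        // One round trip's worth of data, then split in memory with Linq
        // instead of querying the database once per breed.
        var dogs = new[]
        {
            new Dog("Dave", "Dachshund"),
            new Dog("Dave", "Beagle"),
            new Dog("Dave", "Dachshund"),
        };

        var byBreed = dogs.GroupBy(d => d.Breed)
                          .ToDictionary(g => g.Key, g => g.Count());

        Console.WriteLine(byBreed["Dachshund"]); // 2
        Console.WriteLine(byBreed["Beagle"]);    // 1
    }
}
```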

mixing database and object query in linq and provide paged results

I need to build a query that provides paged results. Part of filtering occurs in the database and part of it occurs in objects that are in memory.
Below is a simplified sample that shows what I could do i.e. run a linq query against the database and then further filter it using the custom code and then use skip/take for paging but this would be very inefficient as it needs to load all items that match the first part of my query.
Things.Where(e => e.field1 == 1 && e.field2 > 1).ToList()
      .Where(e => Helper.MyFilter(e, param1, param2)).Skip(m * pageSize).Take(pageSize);
MyFilter function uses additional data that is not located in the database and it is run with additional parameters (paramX in the above example)
Is there a preferred way to handle this situation without loading the initial result fully into memory?
Yes: query and page at the database level. Whatever logic is in Helper.MyFilter needs to be in the SQL query.
The other option, which is more intrusive to your code base, is to save the view model as well as the domain entity when the entity changes. Part of the view model would contain the result of Helper.MyFilter(e), so you can quickly and efficiently query for it.
To support Jason's answer above - Entity Framework supports .Skip().Take(). So send it all down to the DB level and convert your Where into something EF can consume.
If your where helper is complicated use Albahari's predicate builder:
http://www.albahari.com/nutshell/predicatebuilder.aspx
or the slightly easier to use Universal Predicate Builder:
http://petemontgomery.wordpress.com/2011/02/10/a-universal-predicatebuilder/ based on the above.
.ToList()
You are converting your query into an in-memory object, i.e. a list, which causes the query to execute; then you do the paging on in-memory data.
You can put it all in one Where clause:
Things.Where(e => e.field1 == 1 && e.field2 > 1
             && Helper.MyFilter(e)).Skip(m * pageSize).Take(pageSize);
and then .ToList().
That way you give Linq to Sql a chance to generate a query and fetch only the data that you want.
Or is there a particular reason why you want to do just that - convert to an in-memory object and then filter? I don't see the point. You should be able to filter out the results that you don't want in the Linq to Sql query before you actually execute it against the database.
EDIT
As I can see from the discussion you have several options.
If you have a lot of data and do more reads than writes, it might be wise to save the results of Helper.MyFilter into the database on insert, if possible. That way you can increase SELECT performance: you will not pull all the data from the database, and the data will already be filtered by the SELECT itself.
Or you can take another approach. You can put Helper class in a separate assembly and reference that assembly from SQL Server. This will enable you to put the paging logic in your database and use your code as well.
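A further option, sketched below with in-memory stand-ins for the question's Things source and Helper.MyFilter, is to stream the database-side results in fixed-size batches, apply the in-memory filter to each batch, and stop as soon as the requested page is full. This avoids materializing the whole first-stage result set at once, at the cost of extra round trips:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class PagedDemo
{
    // Stand-ins for the question's Things table and Helper.MyFilter.
    static readonly IQueryable<int> Things = Enumerable.Range(1, 1000).AsQueryable();
    static bool MyFilter(int e) => e % 7 == 0;

    static List<int> GetPage(int pageIndex, int pageSize, int batchSize = 100)
    {
        var result = new List<int>();
        int toSkip = pageIndex * pageSize;
        for (int offset = 0; ; offset += batchSize)
        {
            // With a real provider this OrderBy/Skip/Take runs as SQL
            // paging, so only batchSize rows cross the wire per trip.
            var batch = Things.Where(e => e > 1)
                              .OrderBy(e => e)
                              .Skip(offset).Take(batchSize).ToList();
            if (batch.Count == 0) break;
            foreach (var e in batch.Where(MyFilter))
            {
                if (toSkip > 0) { toSkip--; continue; }
                result.Add(e);
                if (result.Count == pageSize) return result;
            }
        }
        return result;
    }

    static void Main()
    {
        // Second page (index 1) of 5 multiples of 7 greater than 1.
        Console.WriteLine(string.Join(",", GetPage(1, 5))); // 42,49,56,63,70
    }
}
```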

What is the difference between "LINQ to Entities", "LINQ to SQL" and "LINQ to Dataset"

I've been working for quite a while now with LINQ. However, it remains a bit of a mystery what the real differences are between the mentioned flavours of LINQ.
The successful answer will contain a short differentiation between them. What is the main goal of each flavor, what is the benefit, and is there a performance impact...
P.S.
I know that there are a lot of information sources out there, but I'm looking for a kind of a "cheat sheet" which instructs a newbie where to head for a specific goal.
All of them are LINQ - Language Integrated Query - so they all share a lot of commonality. All these "dialects" basically allow you to do a query-style select of data from various sources.
Linq-to-SQL is Microsoft's first attempt at an ORM - Object-Relational Mapper. It supports SQL Server only. It's a mapping technology to map SQL Server database tables to .NET objects.
Linq-to-Entities is the same idea, but using Entity Framework in the background as the ORM - again from Microsoft, but supporting multiple database backends.
Linq-to-DataSets is LINQ used against the "old-style" ADO.NET 2.0 DataSets - in the times before ORMs from Microsoft, all you could do with ADO.NET was return DataSets, DataTables etc., and Linq-to-DataSets queries those data stores. So in this case, you'd return a DataTable or DataSet (System.Data namespace) from a database backend, and then query it using the LINQ syntax.
LINQ is a broad set of technologies, based around (for example) a query comprehension syntax, for example:
var qry = from x in source.Foo
          where x.SomeProp == "abc"
          select x.Bar;
which is mapped by the compiler into code:
var qry = source.Foo.Where(x => x.SomeProp == "abc").Select(x => x.Bar);
and here the real magic starts. Note that we haven't said what Foo is here - and the compiler doesn't care! As long as it can resolve some suitable method called Where that can take a lambda, and the result of that has some Select method that can accept the lambda, it is happy.
Now consider that the lambda can be compiled either into an anonymous method (delegate, for LINQ-to-Objects, which includes LINQ-to-DataSet), or to an expression-tree (a runtime model that represents the lambda in an object model).
For in-memory data (typically IEnumerable<T>), it just executes the delegate - fine and fast. But for IQueryable<T>, the object keeps a representation of the expression (an Expression<Func<...>>) that a provider can pull apart and apply to any "LINQ-to-Something" backend.
For databases (LINQ-to-SQL, LINQ-to-Entities) this might mean writing TSQL, for example:
SELECT x.Bar
FROM [SomeTable] x
WHERE x.SomeProp = @p1
But it could (for ADO.NET Data Services, for example) mean writing an HTTP query.
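The delegate-versus-expression-tree split described above can be seen directly in code: the same lambda syntax compiles to two very different things depending on the declared type:

```csharp
using System;
using System.Linq.Expressions;

class LambdaDemo
{
    static void Main()
    {
        // 1. A delegate: compiled IL, directly runnable, but opaque.
        Func<int, bool> del = x => x > 5;
        Console.WriteLine(del(7)); // True

        // 2. An expression tree: an object model a provider can inspect
        //    and translate (into TSQL, an HTTP query, ...).
        Expression<Func<int, bool>> tree = x => x > 5;
        Console.WriteLine(tree.Body.NodeType); // GreaterThan

        // The tree can still be compiled into a delegate on demand.
        Console.WriteLine(tree.Compile()(7));  // True
    }
}
```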
Executing a well-written TSQL query that returns a small amount of data is faster than loading an entire database over the network and then filtering at the client. Both have ideal scenarios and plain-wrong scenarios, though.
The goal and benefit here is to allow you to use a single, static-checked syntax to query a wide range of data-sources, and to make the code more expressive ("traditional" code to group data, for example, isn't very clear in terms of what it is trying to do - it is lost in the mass of code).
LINQ stands for language integrated query. It allows you to use "SQL style" query language directly within C# to extract information from data sources.
That data source could be a SQL Server database - this is Linq to SQL.
That data source could be a data context of Entity Framework objects - Linq to Entities.
That data source could be ADO.NET DataSets - Linq to Dataset.
That data source could also be an XML file - Linq to XML.
Or even just a Collection class of plain objects - Linq to Objects.
LINQ describes the querying technology, the rest of the name describes the source of the data being queried.
For a bit of extra background:
DataSets are ADO.NET objects where data is loaded from a database into a .NET DataSet, and Linq can be used to query that data after it's loaded.
With Linq to SQL you define .NET classes that map to the database, and Linq-to-SQL takes care of loading the data from the SQL Server database.
And finally, the Entity Framework is a system where you define a database-to-object mapping in XML, and can then use Linq to query the data that is loaded via this mapping.

A Better DataTable

I have an application that uses DataTables to perform grouping, filtering and aggregation of data. I want to replace the DataTables with my own data structures, so we don't have the unnecessary overhead that comes with DataTables. So my question is: can LINQ perform the grouping, filtering and aggregation of my data, and if it can, is the performance comparable to DataTables - or should I just hunker down and write my own algorithms?
Thanks
Dan R.
Unless you go for simple classes (POCO etc), your own implementation is likely to have nearly as much overhead as DataTable. Personally, I'd look more at using tools like LINQ-to-SQL, Entity Framework, etc. Then you can use either LINQ-to-Objects against local data, or the provider-specific implementation for complex database queries without pulling all the data to the client.
LINQ-to-Objects can do all the things you mention, but it involves having all the data in memory. If you have non-trivial data, a database is recommended. SQL Server Express Edition would be a good starting point if you look at LINQ-to-SQL or Entity Framework.
Edited re comment:
Regular TSQL commands are fine and dandy, but you asked about the difference... the biggest being that LINQ-to-SQL will provide the entire DAL for you, which is a huge time-saver, as well as making a lot more compile-time safety possible. But it also allows you to use the same approach to look at your local objects and your database - for example, the following is valid C# 3.0 (except for [someDataSource], see below):
var qry = from row in [someDataSource]
          group row by row.Category into grp
          select new { Category = grp.Key, Count = grp.Count(),
                       TotalValue = grp.Sum(x => x.Value) };
foreach (var x in qry) {
    Console.WriteLine("{0}, {1}, {2}", x.Category, x.Count, x.TotalValue);
}
If [someDataSource] is local data, such as a List<T>, this will execute locally; but if this is from your LINQ-to-SQL data-context, it can build the appropriate TSQL at the database server. This makes it possible to use a single query mechanism in your code (within the bounds of LOLA, of course).
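For the local-data case, the same shape of query runs as-is; here is a self-contained variant, with a hypothetical Sale type standing in for [someDataSource], covering the filtering, grouping and aggregation jobs DataTables were doing:

```csharp
using System;
using System.Linq;

class GroupDemo
{
    // Hypothetical POCO replacing a DataRow.
    record Sale(string Category, decimal Value);

    static void Main()
    {
        var sales = new[]
        {
            new Sale("Books", 10m),
            new Sale("Books", 15m),
            new Sale("Tools", 40m),
        };

        // Filtering, grouping and aggregation in one
        // Linq-to-Objects query over plain objects.
        var summary = sales.Where(s => s.Value > 5m)
                           .GroupBy(s => s.Category)
                           .Select(g => new { Category = g.Key,
                                              Count = g.Count(),
                                              Total = g.Sum(s => s.Value) });

        foreach (var row in summary)
            Console.WriteLine($"{row.Category}: {row.Count}, {row.Total}");
        // Books: 2, 25
        // Tools: 1, 40
    }
}
```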
You'd be better off letting your database handle grouping, filtering and aggregation. DataTables are actually relatively good at this sort of thing (their bad reputation seems to come primarily from inappropriate usage), but not as good as an actual database. Moreover, without a lot of work on your part, I would put my money on the DataTable's having better performance than your homegrown data structure.
Why not use a local database like SQL Server CE or embedded Firebird? (Or even MS Access!) Store the data in the local database, do the processing using simple SQL queries, and pull the data back. Much simpler and likely less overhead, plus you don't have to write all the logic for grouping/aggregates etc., as database systems already have that logic built in, debugged and working.
Yes, you can use LINQ to do all those things using your custom objects.
And I've noticed a lot of people suggest that you do this type of stuff in the database... but you never indicated where the database was coming from.
If the data is coming from the database then at the very least the filtering should probably happen there, unless you are doing something specialized (e.g. working from a cached set of data). And even then, if you are working with a significant amount of cached data, you might do well to put that data into an embedded database like SQLite, as someone else has already mentioned.
