Compare local list to DataBase - c#

I have a local List with entities, some hundreds, and I have a SQL Server table where I store the ID of the successful processed entities, some millions. I would like to know, which entities form my local set are not yet processed i.e. are not in the SQL Table.
The first approach is to iterate through the local list with the following Linq statement:
Entity entity = db.Entities.FirstOrDefault(m => m.ID == ID);
if (entity == null) { NewList.Add(ID) }
the NewList would then contain all the new entities. However this is very slow.
In LINQ, how would you send the entire local list to the SQL Server with one call and then return the ones not in the SQL table?
Do you really have to create a temporary table with my local list, then left-join on the already processed table and return the ones with a null?

Use .Contains method to retrieve already processed ids
and Except to create list of not yet processed ids.
var localList = new List<int> { 1, 2, 3 };
var processed = db.Entities
.Where(entity => localList.Contains(entity.Id))
.Select(entity => entity.Id)
.ToList();
var notProcessed = localList.Except(processed).ToList();
It will depend on provider, but .Contains should generate sql like:
SELECT Id FROM Entity WHERE Id IN (1, 2, 3)

suggestion:
create a temp table and insert your IDs
select your result on the SQL side
EDIT:
"Can you do that in LINQ?"
TL;DR:
yes* but that's an ugly piece of work, write the SQL yourself
*)depends on what you mean with "in" LINQ, because that is not in the scope of LINQ. In other words: a LINQ expression is one layer too abstract, but if you happen to have an LINQ accessible implementation for this, you can use this in your LINQ statements
on the LINQ expression side you have something like:
List<int> lst = new List<int>() { 1,2,3 };
List<int> result = someQueryable.Where(x=>lst.Contains(x.ID)).Select(x=>x.ID).ToList();
the question now is: what happens on the SQL side (assuming the queryable leads us to a SQL database)?
the queryable provider (e.g. Entity Framework) somehow has to translate that into SQL, execute it and come back with the result
here would be the place to modify the translation...
for example examine the expression tree with regard to the object that is the target for the Contains(...) call and if it is more than just a few elements, go for the temp table approach...
the very same LINQ expression can be translated into different SQL commands. The provider decides how the translation has to be done.
if your provider lacks support for large Contains(...) cases, you will probably experience poor performance... good thing is usually nobody forces you to use it this way ... you can skip linq for performance optimized queries, or you could write a provider extension yourself but then you are not on the "doing something with LINQ"-side but extending the functionality of your LINQ provider
if you are not developing a large scalable product that will be deployed to work with different DB-Backends, it is usually not worth the effort... the easier way to go is to write the sql yourself and just use the raw sql option of your db connection

Related

SQL generated from LINQ not consistent

I am using Telerik Open/Data Access ORM against an ORACLE.
Why do these two statements result in different SQL commands?
Statement #1
IQueryable<WITransmits> query = from wiTransmits in uow.DbContext.StatusMessages
select wiTransmits;
query = query.Where(e=>e.MessageID == id);
Results in the following SQL
SELECT
a."MESSAGE_ID" COL1,
-- additional fields
FROM "XFE_REP"."WI_TRANSMITS" a
WHERE
a."MESSAGE_ID" = :p0
Statement #2
IQueryable<WITransmits> query = from wiTransmits in uow.DbContext.StatusMessages
select new WITransmits
{
MessageID = wiTranmits.MessageID,
Name = wiTransmits.Name
};
query = query.Where(e=>e.MessageID == id);
Results in the following SQL
SELECT
a."MESSAGE_ID" COL1,
-- additional fields
FROM "XFE_REP"."WI_TRANSMITS" a
The query generated with the second statement #2 returns, obviously EVERY record in the table when I only want the one. Millions of records make this prohibitive.
Telerik Data Access will try to split each query into database-side and client-side (or in-memory LINQ if you prefer it).
Having projection with select new is sure trigger that will make everything in your LINQ expression tree after the projection to go to the client side.
Meaning in your second case you have inefficient LINQ query as any filtering is applied in-memory and you have already transported a lot of unnecessary data.
If you want compose LINQ expressions in the way done in case 2, you can append the Select clause last or explicitly convert the result to IEnumerable<T> to make it obvious that any further processing will be done in-memory.
The first query returns the full object defined, so any additional limitations (like Where) can be appended to it before it is actually being run. Therefore the query can be combined as you showed.
The second one returns a new object, which can be whatever type and contain whatever information. Therefore the query is sent to the database as "return everything" and after the objects have been created all but the ones that match the Where clause are discarded.
Even though the type were the same in both of them, think of this situation:
var query = from wiTransmits in uow.DbContext.StatusMessages
select new WITransmits
{
MessageID = wiTranmits.MessageID * 4 - 2,
Name = wiTransmits.Name
};
How would you combine the Where query now? Sure, you could go through the code inside the new object creation and try to move it outside, but since there can be anything it is not feasible. What if the checkup is some lookup function? What if it's not deterministic?
Therefore if you create new objects based on the database objects there will be a border where the objects will be retrieved and then further queries will be done in memory.

At what point do I need to off-load the work of a query to the DB?

I have a web app, and I'm connecting to a DB with entity framework.
To select all Employee records of a particular department, for example, I can quite easily write:
....Employees.Where(o => o.Department == "HR").ToList();
Works fine. But is it most optimal?
Should this Where clause be incorporated into a stored procedure or view? Or does my entity framework code do the job of converting it to SQL anyway?
We've had performance problems in our team in the past from when people pull records into memory and then do the filtering in .net instead of at a database level. I'm trying to avoid this happening again so want to be crystal clear on what I must avoid.
If Employees is provided by Entity Framework then the Where() will be translated into SQL and sent to the database. It is only the point that you materialise the objects does it take the filters you have applied before and turn them into SQL. Anything after that point is just plain LINQ to objects.
Methods that cause materialisation to happen include things like .ToList() and .ToArray() (there are more, but these two are probably the most common).
If you want to see what is happening on SQL Server, you should open up the SQL Profiler and have a look at the queries that are being sent.
We've had performance problems in our team in the past from when people pull records into memory and then do the filtering in .net instead of at a database level.
Just as an addendum to Colin's answer, and to target the quote above, the way to avoid this is to make sure your database queries are fully constructed with IQueryable<T> first, before enumerating the results with a call such as .ToList(), or .ToArray().
As an example, consider the following:
IEnumerable<Employee> employees = context.Employees;
// other code, before executing the following
var hrEmployees = employees.Where(o => o.Department == "HR").ToList();
The .ToList() will enumerate the results grabbing all of the employees from the context first, and then performing the filtering on the client. This isn't going to perform very well if you've got a lot of employees to contend with, and it's certainly not going to scale very well.
Compare that with this:
IQueryable<Employee> employees = context.Employees;
// other code, before executing the following
var hrEmployees = employees.Where(o => o.Department == "HR").ToList();
IQueryable<T> derives from IEnumerable<T>. The difference between them is that IQueryable<T> has a query provider built in, and the query you construct is represented as an expression tree. That means it's not evaluated until a call that enumerates the results of the query, such as .ToList().
In the second example above, the query provider will execute SQL to fetch only those employees that belong to the HR department, performing the filtering on the database itself.

Entity Framework SQL Query Execution

Using the Entity Framework, when one executes a query on lets say 2000 records requiring a groupby and some other calculations, does the query get executed on the server and only the results sent over to the client or is it all sent over to the client and then executed?
This using SQL Server.
I'm looking into this, as I'm going to be starting a project where there will be loads of queries required on a huge database and want to know if this will produce a significant load on the network, if using the Entity Framework.
I would think all database querying is done on the server side (where the database is!) and the results are passed over. However, in Linq you have what's known as Delayed Execution (lazily loaded) so your information isn't actually retrieved until you try to access it e.g. calling ToList() or accessing a property (related table).
You have the option to use the LoadWith to do eager loading if you require it.
So in terms of performance if you only really want to make 1 trip to the Database for your query (which has related tables) I would advise using the LoadWith options. However, it does really depend on the particular situation.
It's always executed on SQL Server. This also means sometimes you have to change this:
from q in ctx.Bar
where q.Id == new Guid(someString)
select q
to
Guid g = new Guid(someString);
from q in ctx.Bar
where q.Id == g
select q
This is because the constructor call cannot be translated to SQL.
Sql's groupby and linq's groupby return differently shaped results.
Sql's groupby returns keys and aggregates (no group members)
Linq's groupby returns keys and group members.
If you use those group members, they must be (re-)fetched by the grouping key. This can result in +1 database roundtrip per group.
well, i had the same question some time ago.
basically: your linq-statement is converted to a sql-statement. however: some groups will get translated, others not - depending on how you write your statement.
so yes - both is possible
example:
var a = (from entity in myTable where entity.Property == 1 select entity).ToList();
versus
var a = (from entity in myTable.ToList() where entity.Property == 1 select entity).ToList();

LINQPad, using multiple datacontexts

I am often comparing data in tables in different databases. These databases do not have the same schema. In TSQL, I can reference them with the DB>user>table structure (DB1.dbo.Stores, DB2.dbo.OtherPlaces) to pull the data for comparison. I like the idea of LINQPad quite a bit, but I just can't seem to easily pull data from two different data contexts within the same set of statements.
I've seen people suggest simply changing the connection string to pull the data from the other source into the current schema but, as I mentioned, this will not do. Did I just skip a page in the FAQ? This seems a fairly routine procedure to be unavailable to me.
In the "easy" world, I'd love to be able to simply reference the typed datacontext that LINQPad creates. Then I could simply:
DB1DataContext db1 = new DB1DataContext();
DB2DataContext db2 = new DB2DataContext();
And work from there.
Update: it's now possible to do cross-database SQL Server queries in LINQPad (from LINQPad v4.31, with a LINQPad Premium license). To use this feature, hold down the Control key while dragging databases from the Schema Explorer to the query window.
It's also possible to query linked servers (that you've linked by calling sp_add_linkedserver). To do this:
Add a new LINQ to SQL connection.
Choose Specify New or Existing Database and choose the primary database you want to query.
Click the Include Additional Databases checkbox and pick the linked server(s) from the list.
Keep in mind that you can always create another context on your own.
public FooEntities GetFooContext()
{
var entityBuilder = new EntityConnectionStringBuilder
{
Provider = "Devart.Data.Oracle",
ProviderConnectionString = "User Id=foo;Password=foo;Data Source=Foo.World;Connect Mode=Default;Direct=false",
Metadata = #"D:\FooModel.csdl|D:\FooModel.ssdl|D:\FooModel.msl"
};
return new FooEntities(entityBuilder.ToString());
}
You can instantiate as many contexts as you like to disparate SQL instances and execute pseudo cross database joins, copy data, etc. Note, joins across contexts are performed locally so you must call ToList(), ToArray(), etc to execute the queries using their respective data sources individually before joining. In other words if you "inner" join 10 rows from DB1.TABLE1 with 20 rows from DB2.TABLE2, both sets (all 30 rows) must be pulled into memory on your local machine before Linq performs the join and returns the related/intersecting set (20 rows max per example).
//EF6 context not selected in Linqpad Connection dropdown
var remoteContext = new YourContext();
remoteContext.Database.Connection.ConnectionString = "Server=[SERVER];Database="
+ "[DATABASE];Trusted_Connection=false;User ID=[SQLAUTHUSERID];Password="
+ "[SQLAUTHPASSWORD];Encrypt=True;";
remoteContext.Database.Connection.Open();
var DB1 = new Repository(remoteContext);
//EF6 connection to remote database
var remote = DB1.GetAll<Table1>()
.Where(x=>x.Id==123)
//note...depending on the default Linqpad connection you may get
//"EntityWrapperWithoutRelationships" results for
//results that include a complex type. you can use a Select() projection
//to specify only simple type columns
.Select(x=>new { x.Col1, x.Col1, etc... })
.Take(1)
.ToList().Dump(); // you must execute query by calling ToList(), ToArray(),
// etc before joining
//Linq-to-SQL default connection selected in Linqpad Connection dropdown
Table2.Where(x=>x.Id = 123)
.ToList() // you must execute query by calling ToList(), ToArray(),
// etc before joining
.Join(remote, a=> a.d, b=> (short?)b.Id, (a,b)=>new{b.Col1, b.Col2, a.Col1})
.Dump();
remoteContext.Database.Connection.Close();
remoteContext = null;
I do not think you are able to do this. See this LinqPad request.
However, you could build multiple dbml files in a separate dll and reference them in LinqPad.
Drag-and-drop approach: hold down the Ctrl key while dragging additional databases
from the Schema Explorer to the query editor.
Use case:
//Access Northwind
var ID = new Guid("107cc232-0319-4cbe-b137-184c82ac6e12");
LotsOfData.Where(d => d.Id == ID).Dump();
//Access Northwind_v2
this.NORTHWIND_V2.LotsOfData.Where(d => d.Id == ID).Dump();
Multiple databases are as far as I know only available in the "paid" version of LinqPad (what I wrote applies to LinqPad 6 Premium).
For more details, see this answer in StackOverflow (section "Multiple database support").

Select clause containing non-EF method calls

I'm having trouble building an Entity Framework LINQ query whose select clause contains method calls to non-EF objects.
The code below is part of an app used to transform data from one DBMS into a different schema on another DBMS. In the code below, Role is my custom class unrelated to the DBMS, and the other classes are all generated by Entity Framework from my DB schema:
// set up ObjectContext's for Old and new DB schemas
var New = new NewModel.NewEntities();
var Old = new OldModel.OldEntities();
// cache all Role names and IDs in the new-schema roles table into a dictionary
var newRoles = New.roles.ToDictionary(row => row.rolename, row => row.roleid);
// create a list or Role objects where Name is name in the old DB, while
// ID is the ID corresponding to that name in the new DB
var roles = from rl in Old.userrolelinks
join r in Old.roles on rl.RoleID equals r.RoleID
where rl.UserID == userId
select new Role { Name = r.RoleName, ID = newRoles[r.RoleName] };
var list = roles.ToList();
But calling ToList gives me this NotSupportedException:
LINQ to Entities does not recognize
the method 'Int32
get_Item(System.String)' method, and
this method cannot be translated into
a store expression
Sounds like LINQ-to-Entities is barfing on my call to pull the value out of the dictionary given the name as a key. I admittedly don't understand enough about EF to know why this is a problem.
I'm using devart's dotConnect for PostgreSQL entity framework provider, although I assume at this point that this is not a DBMS-specific issue.
I know I can make it work by splitting up my query into two queries, like this:
var roles = from rl in Old.userrolelinks
join r in Old.roles on rl.RoleID equals r.RoleID
where rl.UserID == userId
select r;
var roles2 = from r in roles.AsEnumerable()
select new Role { Name = r.RoleName, ID = newRoles[r.RoleName] };
var list = roles2.ToList();
But I was wondering if there was a more elegant and/or more efficient way to solve this problem, ideally without splitting it in two queries.
Anyway, my question is two parts:
First, can I transform this LINQ query into something that Entity Framework will accept, ideally without splitting into two pieces?
Second, I'd also love to understand a little about EF so I can understand why EF can't layer my custom .NET code on top of the DB access. My DBMS has no idea how to call a method on a Dictionary class, but why can't EF simply make those Dictionary method calls after it's already pulled data from the DB? Sure, if I wanted to compose multiple EF queries together and put custom .NET code in the middle, I'd expect that to fail, but in this case the .NET code is only at the end, so why is this a problem for EF? I assume the answer is something like "that feature didn't make it into EF 1.0" but I am looking for a bit more explanation about why this is hard enough to justify leaving it out of EF 1.0.
The problem is that in using Linq's delayed execution, you really have to decide where you want the processing and what data you want to traverse the pipe to your client application. In the first instance, Linq resolves the expression and pulls all of the role data as a precursor to
New.roles.ToDictionary(row => row.rolename, row => row.roleid);
At that point, the data moves from the DB into the client and is transformed into your dictionary. So far, so good.
The problem is that your second Linq expression is asking Linq to do the transform on the second DB using the dictionary on the DB to do so. In other words, it is trying to figure out a way to pass the entire dictionary structure to the DB so that it can select the correct ID value as part of the delayed execution of the query. I suspect that it would resolve just fine if you altered the second half to
var roles = from rl in Old.userrolelinks
join r in Old.roles on rl.RoleID equals r.RoleID
where rl.UserID == userId
select r.RoleName;
var list = roles.ToDictionary(roleName => roleName, newRoles[roleName]);
That way, it resolves your select on the DB (selecting just the rolename) as a precursor to processing the ToDictionary call (which it should do on the client as you'd expect). This is essentially exactly what you are doing in your second example because AsEnumerable is pulling the data to the client before using it in the ToList call. You could as easily change it to something like
var roles = from rl in Old.userrolelinks
join r in Old.roles on rl.RoleID equals r.RoleID
where rl.UserID == userId
select r;
var list = roles.AsEnumerable().Select(r => new Role { Name = r.RoleName, ID = newRoles[r.RoleName] });
and it'd work out the same. The call to AsEnumerable() resolves the query, pulling the data to the client for use in the Select that follows it.
Note that I haven't tested this, but as far as I understand Entity Framework, that's my best explanation for what's going on under the hood.
Jacob is totally right.
You can not transform the desired query without splitting it in two parts, because Entity Framework is unable to translate the get_Item call into the SQL query.
The only way is to write the LINQ to Entities query and then write a LINQ to Objects query to its result, just as Jacob advised.
The problem is Entity-Framework-specific one, it does not arise from our implementation of the Entity Framework support.

Categories