C# - Concatenate an in-memory IList and IQueryable?

Suppose I have a List containing one string value, and an IQueryable that contains several strings from a database. I want to concatenate these two containers into one sequence and then call methods such as .Skip or .Take on it. I want to do this in such a way that combining the two containers does not load all of the DB data into memory (that should happen only after I call .Skip and .Take). Basically, I want to do something like this (pseudocode):
IQueryable<string> someQuery = myEntities.GetDBQuery(); // Gets "test2", "test3"
IList<string> inMemoryList = new List<string>();
inMemoryList.Add("test");
// Can I do something like this without loading DB data into memory?
// finalList should contain all 3 strings.
IEnumerable<string> finalList = inMemoryList.Union(someQuery);
// At this point it is fine to load the filtered query into memory.
foreach (string myString in finalList.Skip(100).Take(200))
{
    // Do work...
}
How can I achieve this?

If I didn't misunderstand, you are trying to query data, part of which comes from memory and part from the database, like this:
// the following code will not compile; it's just for illustration
var dbQuery = BuildDbQuery();
var list = BuildListInMemory();
var myQuery = (dbQuery + list).OrderBy(aa).Skip(bb).Take(cc).Select(dd);
// and you don't want dbQuery to load all records into memory,
// because you only need some of them
The short answer is NO, you can't. Consider the .OrderBy method: all the data has to be in the same "place", otherwise the code can't sort it. So the code loads all the records selected by dbQuery from the database into memory (now they are in the same place) and then sorts all of them, including those in list. That probably causes a memory issue when dbQuery returns thousands of rows.
HOW TO RESOLVE
Pass the data in list to the database (as parameters of dbQuery) so that the whole query happens in the database. This is easy if your list has only a few items, as the sketch below shows.
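For example, a minimal sketch (the entity and property names are hypothetical) that pushes the in-memory values into the SQL as parameters via Contains:
// The in-memory values become parameters of the database query:
// LINQ providers translate Contains on a local collection into
// WHERE Name IN (@p0, ...), so filtering and paging all run in SQL.
var namesInMemory = new List<string> { "test" };
var page = dbContext.Items
    .Where(i => namesInMemory.Contains(i.Name))
    .OrderBy(i => i.Name)   // Skip/Take need a stable order
    .Skip(100)
    .Take(200)
    .ToList();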
If list also has lots of records, that would make dbQuery too complex; in that case you can query twice, once for dbQuery and once for list. For example, say you have 10,000 users in the database and 1,000 users in your in-memory list, and you want to get the 10 youngest users. You don't need to load 10,000 users into memory and then find the youngest 10. Instead, you find the 10 youngest (ResultA) with dbQuery and load them into memory, find the 10 youngest (ResultB) in the in-memory list, and then compare ResultA with ResultB, as sketched below.
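A minimal sketch of that two-query idea (User, Age and the variable names are hypothetical):
// ResultA: 10 youngest from the database; runs as SQL (TOP 10).
var resultA = dbContext.Users.OrderBy(u => u.Age).Take(10).ToList();

// ResultB: 10 youngest from the in-memory list; runs in memory.
var resultB = memoryUsers.OrderBy(u => u.Age).Take(10).ToList();

// The global top 10 must be among these 20 candidates.
var youngest10 = resultA.Concat(resultB)
                        .OrderBy(u => u.Age)
                        .Take(10)
                        .ToList();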

I entirely agree with Danny's answer when he says you need to somehow find a way to include the in-memory user list in the DB so that you can achieve what you want. As for the example you asked for in your comment, without knowing the data structure of your User object it seems difficult; however, assuming you can connect the dots, here is my suggested approach:
Create a temporary table with a structure identical to that of your regular user table in your DB and insert all your in-memory users into it.
Write a query that UNIONs the temporary and regular tables; since both are identical in structure, that should be easy.
Return the result to your application and use it with standard LINQ operations.
If you want exact code that you can use as-is, then you will have to provide your User object structure (field types etc. in the DB) to enable me to write it; in the meantime, a rough sketch of the idea follows.
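This is a hypothetical sketch only, using EF6 raw SQL; the table and column names are assumptions, and note that a #temp table lives only as long as its connection, so the connection is opened explicitly:
using (var context = new MyContext())
{
    // Keep one connection open so the #temp table survives across commands.
    context.Database.Connection.Open();

    // 1. Temp table mirroring the regular Users table (SQL Server syntax).
    context.Database.ExecuteSqlCommand(
        "CREATE TABLE #TempUsers (Id INT, Name NVARCHAR(100))");

    // 2. Stage the in-memory users.
    foreach (var u in inMemoryUsers)
    {
        context.Database.ExecuteSqlCommand(
            "INSERT INTO #TempUsers (Id, Name) VALUES (@p0, @p1)", u.Id, u.Name);
    }

    // 3. UNION both tables and page on the server (SQL Server 2012+ paging).
    var page = context.Database.SqlQuery<User>(
        "SELECT Id, Name FROM Users " +
        "UNION SELECT Id, Name FROM #TempUsers " +
        "ORDER BY Name OFFSET @p0 ROWS FETCH NEXT @p1 ROWS ONLY",
        100, 200).ToList();
}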

You specify that your query and your list are both sequences of strings; someQuery can be performed completely on the database side (not in-memory).
Let's give your sequences specific element types:
IQueryable<string> someQuery = ...
IList<string> myList = ...
You also specify that myList contains only one element.
string myOneAndOnlyString = myList.Single();
As your list is in-memory, this has to be performed in-memory. But because the list has only one element, this won't take any time.
The query that you request:
IQueryable<string> correctQuery = someQuery
    .Where(item => item.Equals(myOneAndOnlyString))
    .Skip(skipCount)
    .Take(takeCount);
Use SQL Server Profiler to check the generated SQL and see that the request is performed completely in one SQL statement.

Related

Most efficient collection for storing data from LINQ to Entities?

I have read several different sources over the years that indicate that when storing a collection of data, a List<T> is efficient when you want to insert objects, and an IEnumerable<T> is best for enumerating over a collection.
In LINQ-to-Entities, there is the AsEnumerable() function, which returns an IEnumerable<T>, but it will not resolve the SQL created by the LINQ statement until you start enumerating over it.
What if I want to store objects from LINQ to Entities in a collection and then query on that collection later?
Using this strategy causes the SQL to be resolved by adding a WHERE clause and querying each record separately. I specifically don't want to do that because I'm trying to limit network chatter:
var myDataToLookup = context.MyData.AsEnumerable();
foreach (var myOtherDatum in myOtherDataList)
{
    // gets a single record from the database each time
    var myDatum = myDataToLookup.SingleOrDefault(w => w.key == myOtherDatum.key);
}
How do I resolve the SQL upfront so myDataToLookup actually contains the data in memory? I've tried ToArray:
var myDataToLookup = context.MyData.ToArray();
But I recently learned that it actually uses more memory than ToList does:
Is it better to call ToList() or ToArray() in LINQ queries?
Should I use a join instead?
var myCombinedData = from o in myOtherDataList
                     join d in myDataToLookup on o.key equals d.key
                     select new { myOtherData = o, myData = d };
Should I use ToDictionary and store my key as the key to the dictionary? Or am I worrying too much about this?
If you're using LINQ to Entities then you should not worry about whether ToArray is slower than ToList. There is almost no difference between them in terms of performance, and LINQ to Entities itself will be the bottleneck anyway.
Regarding a dictionary: it is a structure optimized for reads by key. There is an additional cost on adding new items, though. So if you will read by key a lot and add new items only rarely, then that's the way to go; a sketch follows below. But to be honest, you probably should not bother at all: if the data size is not big enough, you won't see a difference.
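A minimal sketch of the dictionary approach, reusing the names from the question (assuming key is shared between MyData and myOtherDataList):
// One round trip: materialize everything, then index it by key.
var myDataToLookup = context.MyData.ToDictionary(d => d.key);

foreach (var myOtherDatum in myOtherDataList)
{
    // O(1) lookup per item; TryGetValue also handles missing keys.
    if (myDataToLookup.TryGetValue(myOtherDatum.key, out var myDatum))
    {
        // Do work...
    }
}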
Think of IEnumerable, ICollection and IList/IDictionary as a hierarchy, each one inheriting from the previous one. Arrays add a level of restriction and complexity on top of lists. Simply put, IEnumerable gives you iteration only; ICollection adds counting; and IList then gives richer functionality, including finding, adding and removing elements by index or via lambda expressions. Dictionaries provide efficient access via a key. Arrays are much more static; the sketch below illustrates the ladder.
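For illustration (a sketch of the capabilities, not the exact BCL interface definitions):
IEnumerable<int> e = new[] { 1, 2, 3 };          // iteration only
ICollection<int> c = new List<int> { 1, 2, 3 };  // + Count, Add, Remove
IList<int> l = new List<int> { 1, 2, 3 };        // + access by index: l[0]
IDictionary<string, int> d =
    new Dictionary<string, int> { { "a", 1 } };  // + efficient lookup by key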
So, the answer then depends on your requirements. If it is appropriate to hold the data in memory and you need to frequently re-query it then I usually convert the Entity result to a List. This also loads the data.
If access via a set of keys is paramount then I use a Dictionary.
I cannot remember the last time I used an array, except for infrequent and very specific purposes.
So, not a direct answer, but as your question and the other replies indicate, there isn't a single answer and the solution will be a compromise.
When I code and measure performance and data carried over the network, here is how I look at things based on your example above.
Let's say your result returns 100 records. Your code has now run a query on the server and performed 1 second of processing (I made the number up for the sake of argument).
Then you need to cast it to a list, which is going to be 1 more second of processing. Then you want to find all records that have a value of 1. The code will now loop through the entire list to find the values with 1 and then return you the result. This is, let's say, another 1 second of processing, and it finds 10 records.
Your network is going to carry over 10 records that took 3 seconds to process.
If you move your logic to your data layer and make your query search right away for the records that you want, you can save 2 seconds of processing and still only carry 10 records across the network. The bonus is also that you can just use IEnumerable<T> as a result and not have to cast it to a list, thus eliminating the 1 second of casting to a list and 1 second of iterating through it. A sketch follows below.
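For example (a sketch; MyData and Value are hypothetical names):
// Filtering in memory: the whole table is materialized first,
// then looped through to find the matching records.
var slow = context.MyData.ToList().Where(d => d.Value == 1).ToList();

// Filtering in the data layer: only the matching rows cross the network,
// and the result can stay an IEnumerable<T> without an extra list.
IEnumerable<MyData> fast = context.MyData.Where(d => d.Value == 1);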
I hope this helps answer your question.

How does LINQ actually execute the code to retrieve data from the data source?

I will start working on Xamarin shortly and will be transferring a lot of code from Android Studio's Java to C#.
In Java I am using custom classes which are given arguments, conditions etc., convert them to SQL statements and then load the results into the objects in the project's model.
What I am unsure of is whether LINQ is a better option for filtering such data.
For example, what would happen currently is something along these lines:
List<Customer> customers = (new CustomerDAO()).get_all();
Or, if I have a condition:
List<Customer> customers = (new CustomerDAO()).get(new Condition(CustomerDAO.Code, equals, "code1"));
Now let us assume I have transferred the classes to C# and I wish to do something similar to the second case.
So I will probably write something along the lines of:
var customers = from customer in (new CustomerDAO()).get_all()
                where customer.code.equals("code1")
                select customer;
I know that the query will only be executed when I actually try to access customers, but if I have multiple accesses to customers (let us say that I use 4 foreach loops later on), will the get_all method be called 4 times? Or are the results stored at the first execution?
Also, is it more efficient (time-wise, because memory-wise it probably is not) to just keep the get_all() method and use LINQ to filter the results? Or to use my existing setup, which in effect executes
Select * from Customers where code = 'code1'
and loads the results into objects?
Thanks in advance for any help you can provide
Edit: yes, I do know there is sqlite.net, which pretty much does what my DAOs do (but probably better), and at some point I will probably convert all my objects to use it. I just need to know for the sake of knowing.
if I have multiple accesses to customers (let us say that I use 4 foreach loops later on) will the get_all method be called 4 times? or are the results stored at the first execution?
Each time you enumerate the enumerator (using foreach in your example), the query will re-execute, unless you store the materialized result somewhere. For example, if on the first query you'd do:
var customerSource = new CustomerDAO();
List<Customer> customers = customerSource.get_all()
    .Where(customer => customer.Code.Equals("code1"))
    .ToList();
From then on you'll be working with an in-memory List<Customer> without executing the query over again.
By contrast, if each time you'd do:
var filteredCustomers = customerSource.get_all().Where(customer => customer.Code.Equals("code1"));
foreach (var customer in filteredCustomers)
{
// Do stuff
}
Then for each enumeration you'll be executing the said query over again.
Also is it more efficient (time wise because memory wise it is
probably not) to just keep the get_all() method and use linq to filter
the results? Or use my existing setup which in effect executes
That really depends on your use case. Let's imagine you were using LINQ to EF and the customer table has a million rows: do you really want to bring all of them into memory and only then filter them down to the subset you actually use? It would usually be better to run a fully filtered query, as in the sketch below.
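For example (a sketch, assuming an EF context with a Customers set; names are hypothetical):
// Brings the entire table into memory, then filters: wasteful.
var allThenFilter = context.Customers.ToList().Where(c => c.Code == "code1");

// Filters on the database server; only matching rows are materialized,
// and the materialized list can be reused across loops without re-querying.
var filtered = context.Customers.Where(c => c.Code == "code1").ToList();

foreach (var c in filtered) { /* loop 1 */ }
foreach (var c in filtered) { /* loop 2: no second query */ }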

Entity Framework - Performance in count

I've a little question about performance with Entity Framework.
Something like
using (MyContext context = new MyContext())
{
    Document DocObject = context.Document.Find(_id);
    int GroupCount = context.Document.Where(w => w.Group == DocObject.Group).ToList().Count();
}
takes about 2 seconds in my database (about 30k datasets), while this one
using (MyContext context = new MyContext())
{
    Document DocObject = context.Document.Find(_id);
    int GroupCount = context.Document.Where(w => w.Group == DocObject.Group).Count();
}
takes 0.02 seconds.
When my filter for 10 documents took 20 seconds to run, I checked my code and changed it to not use ToList() before Count().
Any ideas why it needs 2 seconds for this line with the ToList()?
Calling ToList() then Count() will:
execute the whole SELECT FROM WHERE against your database
then materialize all the resulting entities as .Net objects
create a new List<T> object containing all the results
return the result of the Count property of the .Net list you just created
Calling Count() against an IQueryable will:
execute SELECT COUNT FROM WHERE against your database
return an Int32 with the number of rows
Obviously, if you're only interested in the number of items (not the items themselves), then you shouldn't ever call ToList() first, as it will require a lot of resources for nothing; see the side-by-side sketch below.
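Side by side, based on the question's code (a sketch):
// Materializes every matching Document, then counts the in-memory list.
int slow = context.Document.Where(w => w.Group == DocObject.Group).ToList().Count();

// Translates to SELECT COUNT(*) ... WHERE ...; only an int crosses the wire.
int fast = context.Document.Count(w => w.Group == DocObject.Group);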
Yes, ToList() will evaluate the results (retrieving the objects from the database); if you do not use ToList(), the objects aren't retrieved from the database.
LINQ to Entities uses lazy loading by default.
It works something like this:
When you query your underlying DB connection using LINQ to Entities, you get a proxy object back on which you can perform a number of operations (Count being one). This means that you do not get all the data from the DB at once; rather, the objects are retrieved from the DB at the time of evaluation. One way of evaluating the object is by using ToList().
Because ToList() will query the database for the whole objects (it will do a SELECT *, so to say), and then you'll use Count() on the in-memory list with all the records, whereas if you use Count() on the IQueryable (and not on the List), EF will translate it to a simple SELECT COUNT(*) SQL query.
Your first query isn't fully translated to SQL: when you call .ToList().Count(), you are basically saying "download everything, materialize it to POCOs and call the extension method named Count()", which, of course, takes some time.
Your second query, however, is translated to something like select count(*) from Documents where GroupId = @DocObjectGroup, which is much faster to execute, and you aren't materializing anything, just a simple scalar.
Using the extension method Enumerable.ToList() will construct a new List object from the IEnumerable<T> source collection, which means that there is an associated cost to calling ToList().

Using Linq to return the count of items from a query along with its resultset

I am using C# MVC4 with LINQ.
I have used dependency injection for my project, which resulted in me having a separate Models project along with a separate Repository project (and one for testing etc.). All this, no problem.
I moved my queries out of the controllers (old style) and into the repository (new DI style), and injected them. It works fine.
I have a standard LINQ query (pick any example, they are basic enough) which returns a set of items from the database as normal. No problems here either.
My problem is that I want to implement paging, and I thought it would be simple enough to do. Here are my steps:
Take the results of the LINQ query from the repository (injected into the controller) and store them in a var. It looks something like:
var results = _someInjectedCode.GetListById(SomeId);
Before, I was able to do something simple like:
results.Count();
results.Skip(SomeNum).Take(SomeOtherNum);
But now that I want paging, I need to do my Skip/Take something like this:
var results = from xyz in _someInjectedCode.GetListById(SomeId).Skip(SomeNum).Take(SomeOtherNum)
              select new { xyz.id, xyz.fName, xyz.lName, ..... };
The problem with this is that I no longer have access to the total count of items before the list was shortened to its pre-Skip/Take state, unless I run two queries, which means hitting the DB twice.
What is the best way to resolve this issue.
I just do it like this:
var result = (from n in mycollection
where n.someprop == "some value"
select n).ToList();
var count = result.Count;
There are probably other ways, but this is the simplest that I know of.
Thinking about it from a SQL point of view, I can't think of a way in a single normal query to retrieve both the total count and a subset of the data, so I don't think you will be able to do it in LINQ either.
To avoid creating two separate commands, the only thing I can think of is a stored proc that returns two tables (one with just the count, the other with your subset of results). It would still execute two queries, but in a single connection. You'd lose your LINQ, though. So if you want to keep your LINQ query, you might be stuck with making two separate calls.
The other way is to retrieve the entire unpaged result set into memory, then run your Take and Skip against the array, but this is pretty wasteful and probably worse than two calls.
You can either add parameters to your repository interface/class that provide paging and return the count alongside your result, or modify your interfaces to return IQueryable and then apply Count and Skip/Take before the query is compiled and sent for execution; a sketch of the latter follows below.
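A minimal sketch of the IQueryable variant (the repository method name is hypothetical):
// The repository exposes the still-composable query instead of a list.
IQueryable<Item> query = _someInjectedCode.GetQueryableById(SomeId);

int totalCount = query.Count();      // SELECT COUNT(*) runs on the server

var page = query.OrderBy(x => x.Id)  // paging needs a stable order
                .Skip(SomeNum)
                .Take(SomeOtherNum)
                .ToList();           // only one page is materialized

It is still two round trips (one for the count, one for the page), but neither loads the full result set into memory.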

Collecting metadata into table

I have tabular data that passes through a C# program, on which I need to collect some metadata before finishing. The metadata is always counts based on fields of the data, and I need them all grouped by one field in the data. Periodically, I need to add new counts to this collection of metadata.
I've been researching it for a little while, and I think what makes sense is to rework my program to store the data as a DataTable and then run LINQ queries on the table. The problem I'm having is putting the different counts into one table-like structure and then writing that out.
I might run a query like this:
var query01 =
from record in records.AsEnumerable()
group record by record.Field<String>("Association Key") into associationsGroup
select new { AssociationKey = associationsGroup.Key, Count = associationsGroup.Count<DataRow>() };
to get a count of all of the records grouped by the field Association Key. I'm going to want another count, grouped in the same way:
var query02 =
from record in records.AsEnumerable()
where record.Field<String>("Number 9") == "yes"
group record by record.Field<String>("Association Key") into associationsGroup
select new { AssociationKey = associationsGroup.Key, Number9Count = associationsGroup.Count<DataRow>() };
And so on.
I thought about trying to Union-chain the queries, but I was having trouble getting them to union since I'm projecting into anonymous types, and I couldn't figure out how to do it differently to make the union work.
So, how can I collect my metadata into one table-like structure?
They are not going to union because you have different types. Add Number9Count and Count to both anonymous types and try the union again, as sketched below.
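A sketch of that suggestion: pad each projection with the missing count so both anonymous types have the same shape, and the union compiles:
var query01 =
    from record in records.AsEnumerable()
    group record by record.Field<String>("Association Key") into associationsGroup
    select new { AssociationKey = associationsGroup.Key, Count = associationsGroup.Count(), Number9Count = 0 };

var query02 =
    from record in records.AsEnumerable()
    where record.Field<String>("Number 9") == "yes"
    group record by record.Field<String>("Association Key") into associationsGroup
    select new { AssociationKey = associationsGroup.Key, Count = 0, Number9Count = associationsGroup.Count() };

// Identical property names, types and order make the anonymous types match.
var combined = query01.Union(query02).ToList();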
I ended up solving the problem by creating a class that holds the set of records I need as a DataTable. A user can add queries through a method that takes a Func<DataRow, bool> argument. The method constructs the query, supplying that argument as the where clause and maintaining the same grouping and properties in the resulting anonymous-typed object.
When retrieving the results, the class iterates over each stored query and enters the results into a new DataTable; a rough reconstruction is sketched below.
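A rough, hypothetical reconstruction of such a class (all names are assumptions, not the poster's actual code):
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

class MetadataCollector
{
    private readonly DataTable _records;
    private readonly List<Tuple<string, Func<DataRow, bool>>> _queries =
        new List<Tuple<string, Func<DataRow, bool>>>();

    public MetadataCollector(DataTable records) { _records = records; }

    // Every added count keeps the same grouping; only the where clause varies.
    public void AddCount(string columnName, Func<DataRow, bool> predicate)
    {
        _queries.Add(Tuple.Create(columnName, predicate));
    }

    public DataTable GetResults()
    {
        var result = new DataTable();
        result.Columns.Add("AssociationKey", typeof(string));
        foreach (var q in _queries)
            result.Columns.Add(q.Item1, typeof(int));

        // One row per group key, one column per registered count.
        foreach (var g in _records.AsEnumerable()
                                  .GroupBy(r => r.Field<string>("Association Key")))
        {
            var row = result.NewRow();
            row["AssociationKey"] = g.Key;
            foreach (var q in _queries)
                row[q.Item1] = g.Count(q.Item2);
            result.Rows.Add(row);
        }
        return result;
    }
}

Usage would then look something like collector.AddCount("Number9Count", r => r.Field<string>("Number 9") == "yes").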
