Speed improvement in LINQ Where(Array.Contains) - c#

I initially had a method that contained a LINQ query returning int[], which then got used later in a fashion similar to:
var result = something.Where(s => previousarray.Contains(s.field)).ToArray();
This turned out to be horribly slow until the first array was instead kept as the native IQueryable<int>. It now runs very quickly, but I'm wondering how I'd deal with the situation if I were handed an int[] from elsewhere that then had to be used as above.
Is there a way to speed up the query in such cases? Converting to a List doesn't seem to help.

In LINQ to SQL, a Contains will be converted to SELECT ... WHERE field IN (...) and should be relatively fast. In LINQ to Objects, however, it will call ICollection<T>.Contains if the source is an ICollection<T>.
When a LINQ to SQL result is treated as an IEnumerable instead of an IQueryable, you lose the LINQ provider - i.e., any further operations are done in memory and not in the database.
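A minimal sketch of that difference, assuming a hypothetical data context db with a Customers table (the names here are made up):

// Stays IQueryable: the Where is translated to SQL and runs in the database.
IQueryable<Customer> stillInSql = db.Customers.Where(c => c.Age > 30);

// AsEnumerable() switches to LINQ to Objects: the table is streamed back
// and the Where runs in memory, row by row.
IEnumerable<Customer> inMemory = db.Customers.AsEnumerable().Where(c => c.Age > 30);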
As for why it's much slower in memory:
Array.Contains() is an O(n) operation, so
something.Where(s => previousarray.Contains(s.field));
is O(p * s), where p is the size of previousarray and s is the size of something.
HashSet<T>.Contains(), on the other hand, is an O(1) operation on average. If you first build a HashSet from the array, you will see a big improvement, as the whole query becomes O(s) instead of O(p * s).
Example:
var previousSet = new HashSet<int>(previousarray);
var result = something.Where(s => previousSet.Contains(s.field));
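For illustration, a self-contained sketch (the data and sizes are made up; exact timings will vary by machine) that makes the O(p * s) versus O(s) difference visible:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class ContainsBenchmark
{
    static void Main()
    {
        int[] previousarray = Enumerable.Range(0, 10000).ToArray();
        int[] something = Enumerable.Range(0, 100000).ToArray();

        var sw = Stopwatch.StartNew();
        var slow = something.Where(s => previousarray.Contains(s)).ToArray(); // O(p * s)
        Console.WriteLine("Array.Contains:   {0} ms", sw.ElapsedMilliseconds);

        var previousSet = new HashSet<int>(previousarray);
        sw.Restart();
        var fast = something.Where(s => previousSet.Contains(s)).ToArray();   // O(s)
        Console.WriteLine("HashSet.Contains: {0} ms", sw.ElapsedMilliseconds);
    }
}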

Contains on lists, arrays and other plain IEnumerables is an O(n) operation. On a HashSet<T> it is O(1) on average, so you should use one here.

Related

PLINQ slower than actual Linq for this snippet

Below is the code snippet. EF6 is used.
var itemNames = context.cam.AsParallel()
    .Where(x => x.cams == "edsfdf")
    .Select(item => item.fg)
    .FirstOrDefault();
Why is PLINQ slower?
If you look at the signature of .AsParallel(), it takes an IEnumerable<T> rather than an IQueryable<T>. A LINQ query is only converted to a SQL statement while it is kept as an IQueryable; once you enumerate it, it executes the query and returns the records.
So to break down the query you have:
context.cam.AsParallel()
This bit of code will essentially execute SELECT * FROM cam on the database and then start iterating through the results. The results are fed into a ParallelQuery, which in effect loads the entire table into memory.
    .Where(x => x.cams == "edsfdf")
    .Select(item => item.fg)
    .FirstOrDefault()
After this point, all of those operations happen in parallel. A simple string equality comparison is extremely inexpensive compared to the overhead of spinning up many threads and managing locking and concurrency between them (which PLINQ takes care of for you). Weighing the costs and benefits of parallel processing is a complicated topic, but it is usually best reserved for CPU-intensive work.
If you skipped the AsParallel() call, everything remains as an IQueryable all the way through the linq statement, so EntityFramework will send a single SQL command that looks something like SELECT fg FROM cam WHERE cams = 'edsfdf' and return that single result, which SQL Server will optimize to a very fast lookup, especially if there's an index on cams.
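For comparison, a sketch of the same query without AsParallel (same context as in the question):

// Stays IQueryable end to end, so EF translates the whole chain into one
// SQL statement and only a single value crosses the wire.
var itemName = context.cam
    .Where(x => x.cams == "edsfdf")
    .Select(item => item.fg)
    .FirstOrDefault();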
It's slower because you have used parallelism in a bad place. Your Where clause doesn't do heavy-duty work, so it's not a good scenario for PLINQ.
If PLINQ comes with that much associated complexity, then what is the real benefit of its existence?
You don't need a hundred people to find a missing child in a house, because rounding up 100 people takes longer than finding the child yourself. But if you were searching a forest, that overhead would be worth it, because finding a child in a big forest alone takes longer than gathering 100 people to help search!
Parallelism is effective when used at the right time in the right place.
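As a rough sketch of where AsParallel does pay off - in-memory data with genuinely expensive per-item work (ExpensiveScore, items and threshold here are hypothetical):

// CPU-bound work per element: the threading overhead is amortized
// across the real computation instead of dominating it.
var results = items.AsParallel()
    .Select(item => new { Item = item, Score = ExpensiveScore(item) })
    .Where(x => x.Score > threshold)
    .Select(x => x.Item)
    .ToList();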

Most efficient collection for storing data from LINQ to Entities?

I have read several different sources over the years that indicate that when storing a collection of data, a List<T> is efficient when you want to insert objects, and an IEnumerable<T> is best for enumerating over a collection.
In LINQ to Entities, there is the AsEnumerable() method, which returns an IEnumerable<T>, but it will not resolve the SQL created by the LINQ statement until you start enumerating over it.
What if I want to store objects from LINQ to Entities in a collection and then query on that collection later?
Using this strategy causes the SQL to be resolved by adding a WHERE clause and querying each record separately. I specifically don't want to do that because I'm trying to limit network chatter:
var myDataToLookup = context.MyData.AsEnumerable();
foreach (var myOtherDatum in myOtherDataList)
{
    // gets a single record from the database each time
    var myDatum = myDataToLookup.SingleOrDefault(w => w.key == myOtherDatum.key);
}
How do I resolve the SQL upfront so myDataToLookup actually contains the data in memory? I've tried ToArray:
var myDataToLookup = context.MyData.ToArray();
But I recently learned that it actually uses more memory than ToList does:
Is it better to call ToList() or ToArray() in LINQ queries?
Should I use a join instead?
var myCombinedData = from o in myOtherDataList
                     join d in myDataToLookup on o.key equals d.key
                     select new { myOtherData = o, myData = d };
Should I use ToDictionary and store my key as the key to the dictionary? Or am I worrying too much about this?
If you're using LINQ to Entities then you should not worry about whether ToArray is slower than ToList. There is almost no difference between them in terms of performance, and LINQ to Entities itself will be the bottleneck anyway.
Regarding a dictionary: it is a structure optimized for reads by key, with an additional cost on adding new items. So if you will read by key a lot and add new items only rarely, that's the way to go. But to be honest, you probably should not bother at all - if the data size is not big, you won't see a difference.
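For the lookup loop in the question, a sketch using ToDictionary (this assumes key is unique per record; use ToLookup if it is not):

// One round trip materializes the data; each lookup afterwards is O(1) on average.
var myDataByKey = context.MyData.ToDictionary(d => d.key);

foreach (var myOtherDatum in myOtherDataList)
{
    if (myDataByKey.TryGetValue(myOtherDatum.key, out var myDatum))
    {
        // use myDatum - no further database round trips
    }
}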
Think of IEnumerable, ICollection and IList/IDictionary as a hierarchy, each one inheriting from the previous. Simply put, IEnumerable gives you iteration only; ICollection adds counting; and IList gives richer functionality, including finding, adding and removing elements by index or via lambda expressions. Dictionaries provide efficient access via a key. Arrays are much more static, adding a level of restriction and complexity on top of lists.
So, the answer then depends on your requirements. If it is appropriate to hold the data in memory and you need to frequently re-query it then I usually convert the Entity result to a List. This also loads the data.
If access via a set of keys is paramount then I use a Dictionary.
I cannot remember that last time I used an array except for infrequent and very specific purposes.
So, not a direct answer, but as your question and the other replies indicate, there isn't a single answer and the solution will be a compromise.
When I code and measure both performance and the data carried over the network, here is how I look at things, based on your example above.
Let's say your query returns 100 records. Your code has now run a query on the server and performed 1 second of processing (I made the numbers up for the sake of argument).
Then you convert it to a list, which costs 1 more second of processing. Then you want to find all records that have a value of 1. The code now loops through the entire list to find those values and returns the result - say another 1 second of processing, which finds 10 records.
Your network then carries over 10 records that took 3 seconds to produce.
If you move your logic to your data layer and make the query search right away for the records you want, you save 2 seconds of processing and still carry only 10 records across the network. As a bonus, you can keep the result as an IEnumerable<T> and not have to convert it to a list, eliminating the 1 second of list conversion and the 1 second of iterating through the list.
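A sketch of that difference, using a made-up field name against the question's context:

// Filtering in the database: only the ~10 matching rows cross the network,
// and no list conversion is needed.
var matches = context.MyData.Where(d => d.Value == 1);

// Filtering in memory: all 100 rows are fetched first, then scanned.
var all = context.MyData.ToList();
var matchesInMemory = all.Where(d => d.Value == 1).ToList();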
I hope this helps answer your question.

linq orderby.tolist() performance

I have an ordering query on a List that is called many times:
list = list.OrderBy().ToList();
In this code the ToList() call consumes a lot of resources and takes a very long time. How can I speed this up with another ordering method, without converting back to a list every time? Should I use the Sort method on a list or array instead?
First of all, try to sort the list once, and keep it sorted.
To speed things up, you can use Parallel LINQ.
See: http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
A parallel OrderBy() looks like this:
var query = data.AsParallel().Where(x => p(x)).OrderBy(x => k(x)).ToList();
You only need to call ToList() once to get your sorted list; all future actions should use sortedList:
sortedList = list.OrderBy().ToList();
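If the list genuinely has to be re-sorted in place many times, List<T>.Sort avoids allocating a new buffer and a new list on every call (the Key property here is a made-up sort key):

// Sorts the existing list in place; no new list is allocated.
list.Sort((a, b) => a.Key.CompareTo(b.Key));

Unlike OrderBy(...).ToList(), which copies the items into a buffer, sorts the buffer, and then builds a brand-new list, Sort mutates the list you already have. Note that List<T>.Sort is not a stable sort, whereas OrderBy is.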

Converting IEnumerable<T> to List<T> on a LINQ result, huge performance loss

On a LINQ result like this:
var result = from x in Items select x;
List<T> list = result.ToList<T>();
However, the ToList<T> is really slow. Does it make the list mutable, and is that why the conversion is slow?
In most cases I can manage with just my IEnumerable or a ParallelQuery, but now I want to bind the items to a DataGridView, so I need something other than an IEnumerable. Any suggestions on how to gain performance on ToList, or on a replacement for it?
On 10 million records in the IEnumerable, the .ToList<T> takes about 6 seconds.
.ToList() is slow in comparison to what?
If you are comparing
var result = from x in Items select x;
List<T> list = result.ToList<T>();
to
var result = from x in Items select x;
you should note that since the query is evaluated lazily, the first line doesn't do much at all. It doesn't retrieve any records. Deferred execution makes this comparison completely unfair.
It's because LINQ likes to be lazy and do as little work as possible. This line:
var result = from x in Items select x;
despite your choice of name, isn't actually a result; it's just a query object. It doesn't fetch any data.
List<T> list = result.ToList<T>();
Now you've actually requested the result, hence it must fetch the data from the source and make a copy of it. ToList guarantees that a copy is made.
With that in mind, it's hardly surprising that the second line is much slower than the first.
No, it's not creating the list that takes time, it's fetching the data that takes time.
Your first code line doesn't actually fetch the data, it only sets up an IEnumerable that is capable of fetching the data. It's when you call the ToList method that it will actually get all the data, and that is why all the execution time is in the second line.
You should also consider whether having ten million rows in a grid is useful at all. No user is ever going to look through all of them, so there isn't really any point in fetching them all. Perhaps you should offer a way to filter the result before getting any data at all.
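For example, a sketch of filtering and paging on the server before materializing anything (the column names are invented):

// The database does the filtering and returns one page of rows.
var page = Items
    .Where(x => x.Category == selectedCategory)
    .OrderBy(x => x.Id)
    .Skip(pageIndex * pageSize)
    .Take(pageSize)
    .ToList();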
I think it's because of memory reallocations: ToList cannot know the size of the collection beforehand, so it cannot allocate enough storage up front to hold all the items. It therefore has to reallocate the List<T>'s backing array as it grows.
If you can estimate the size of your result set, it will be much faster to preallocate enough elements using the List<T>(int) constructor overload and then add the items manually.
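A sketch of that idea as a small helper (the name and the count estimate are, of course, up to you):

using System.Collections.Generic;

static class ListHelpers
{
    public static List<T> ToListWithCapacity<T>(this IEnumerable<T> source, int expectedCount)
    {
        // Preallocating the backing array avoids the repeated grow-and-copy
        // cycles that List<T> performs when it starts empty.
        var list = new List<T>(expectedCount);
        foreach (var item in source)
            list.Add(item);
        return list;
    }
}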

Does Linq cache results of multiple Froms?

Suppose we have code like this:
IEnumerable<Foo> A = meh();
IEnumerable<Foo> B = meh();
var x = from a in A
        from b in B
        select new { a, b };
Let's also assume that meh returns an IEnumerable which performs a lot of expensive calculations when iterated over. Of course, we can simply cache the calculated results manually by means of
IEnumerable<Foo> A = meh().ToList();
My question is whether this manual caching of A and B is required, or whether the above query caches the results of A and B itself during execution, so that each element gets calculated only once. The difference between 2 * n and n * n calculations can be huge, and I did not find a description of this behavior on MSDN, which is why I'm asking.
Assuming you mean LINQ to Objects, it definitely doesn't do any caching - nor should it, IMO. You're building a query, not the results of a query, if you see what I mean. Apart from anything else, you might want to iterate through a sequence which is larger than can reasonably be held in memory (e.g. iterate over every line in a multi-gigabyte log file). I certainly wouldn't want LINQ to try to cache that for me!
If you want a query to be evaluated and buffered for later quick access, calling ToList() is probably your best approach.
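A sketch of that manual buffering applied to the query above:

// Evaluate B's expensive sequence once, up front.
List<Foo> bufferedB = meh().ToList();

var x = from a in meh()     // A is still enumerated only once
        from b in bufferedB // each pass over B now reads from memory
        select new { a, b };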
Note that it would be perfectly valid for B to depend on A, which is another reason not to cache. For example:
var query = from file in Directory.GetFiles(".", "*.log")
            from line in new LineReader(file)
            ...;
You really don't want LINQ to cache the results of reading the first log file and use the same results for every log file. I suppose it could be possible for the compiler to notice that the second from clause didn't depend on the range variable from the first one - but it could still depend on some side effect or other.
