Why does LINQ not cache enumerations? - c#

So it is my understanding that LINQ does not execute everything immediately; it simply stores the information needed to get at the data. If you do a Where, nothing actually happens to the list; you just get an IEnumerable that has the information it needs to produce the list.
One can 'collapse' this information to an actual list by calling ToList.
Now I am wondering, why would the LINQ team implement it like this? It is pretty easy to add a List at each step (or a Dictionary) to cache the results that have already been calculated, so I guess there must be a good reason.
This can be checked by this code:
var list = Enumerable.Range(1, 10).Where(i => {
    Console.WriteLine("Enumerating: " + i);
    return true;
});

var list2 = list.All(i => {
    return true;
});

var list3 = list.Any(i => {
    return false;
});
If the cache were there, it would only output "Enumerating: i" once for each number; it would get the items from the cache the second time.
Edit: Additional question: why does LINQ not include a cache option, like .Cache(), to cache the result of the previous enumerable?

Because it makes no sense, and if you thought about all the cases where it makes no sense, you would not ask. This is not so much a "does it sometimes make sense" question as an "are there side effects that make it bad" question. Next time you evaluate something like this, think about the negatives:
Memory consumption goes up as you HAVE to cache the results, even if not wanted.
On the next run, the results may be different because the incoming data may have changed. Your simplistic example (Enumerable.Range) has no issue with that, but filtering a list of customers may see them updated in the meantime.
Stuff like that makes it very hard to sensibly take away the choice from the developer. If you want a buffer, make one (easily); a sketch follows below. But the side effects of forcing one on everybody would be bad.
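If you do want that buffer in a specific spot, a small opt-in extension method is easy to write. A minimal sketch (the Cache name and implementation are illustrative, not part of LINQ; it is not thread-safe, and Ix.NET's System.Interactive package ships a similar Memoize operator if you would rather use a library):

using System.Collections.Generic;

public static class EnumerableCacheExtensions
{
    // Buffers items as they are first produced, so later enumerations
    // replay the buffer instead of re-running the source. Not thread-safe.
    public static IEnumerable<T> Cache<T>(this IEnumerable<T> source)
    {
        var buffer = new List<T>();
        var enumerator = source.GetEnumerator();

        IEnumerable<T> Replay()
        {
            int index = 0;
            while (true)
            {
                if (index < buffer.Count)
                {
                    // Already computed on an earlier pass.
                    yield return buffer[index++];
                }
                else if (enumerator.MoveNext())
                {
                    buffer.Add(enumerator.Current);
                    yield return buffer[index++];
                }
                else
                {
                    yield break;
                }
            }
        }

        return Replay();
    }
}

With this, var cached = list.Cache(); followed by the All and Any calls from the question would print each "Enumerating: i" only once, because the second pass is served from the buffer.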

It is pretty easy to add a List at each step
Yes, and very memory intensive. What if the data set contains 2 GB in total and you have to store all of it in memory at once? If you iterate over it and fetch it in parts, you don't have a lot of memory pressure. If you materialize 2 GB into memory, you do, never mind what happens if every step does the same...
You know your code and your specific use case, so only you as a developer can determine when it is useful to split off some iterations to memory. The framework can't know that.
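When you do know, that choice is a one-liner: materialize exactly the step you intend to re-enumerate and keep the rest lazy. A sketch, reusing the question's example:

// Materialize only the step you intend to re-enumerate.
var filtered = Enumerable.Range(1, 10)
    .Where(i =>
    {
        Console.WriteLine("Enumerating: " + i);
        return true;
    })
    .ToList();                            // the filter runs once, here

var all = filtered.All(i => true);        // served from the buffered list
var any = filtered.Any(i => false);       // no second round of "Enumerating" output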

Related

C# Update or insert from a List by some criteria

I have 2 lists, each with a different number of items, and both with 1 property in common that I have to compare. If the value of the property is the same I have to update the DB, but if an item in the first list doesn't have a matching item in the second list, I have to insert it into the DB.
This is what I was trying:
foreach (var rep in prodrep)
{
    foreach (var crm in prodcrm)
    {
        if (rep.VEHI_SERIE.Equals(crm.VEHI_SERIE))
        {
            updateRecord(rep.Data);
        }
        else
        {
            insertRecords(rep.Data);
        }
    }
}
The first problem with this is that it is very slow. The second problem is that obviously the insert statement wouldn't work, but I don't want to do another foreach inside a foreach to verify whether an item exists, because that would take double the time.
How can I make this more efficient?
This is not the most efficient approach, but it should work.
var existing = prodrep
    .Where(rep => prodcrm.Any(crm => rep.VEHI_SERIE.Equals(crm.VEHI_SERIE)))
    .Select(rep => new
    {
        Rep = rep,
        Crm = prodcrm.FirstOrDefault(crm => rep.VEHI_SERIE.Equals(crm.VEHI_SERIE))
    })
    .ToList();

existing.ForEach(mix => updateRecord(mix.Rep.Data, mix.Crm.Id));

prodrep
    .Where(rep => !existing.Any(mix => mix.Rep == rep))
    .ToList()
    .ForEach(rep => insertRecords(rep.Data));
var comparators = prodcrm.Select(i => i.VEHI_SERIE).ToList();

foreach (var rep in prodrep)
{
    if (comparators.Contains(rep.VEHI_SERIE))
    {
        // do something
    }
    else
    {
        // do something else
    }
}
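A variant of the same idea, assuming VEHI_SERIE is a string (or another type with proper value equality): a HashSet makes each membership check O(1) instead of scanning the list, which starts to matter once the lists are large.

// Same approach, with O(1) lookups instead of List.Contains scans.
var serials = new HashSet<string>(prodcrm.Select(i => i.VEHI_SERIE));

foreach (var rep in prodrep)
{
    if (serials.Contains(rep.VEHI_SERIE))
        updateRecord(rep.Data);    // a match exists in prodcrm: update
    else
        insertRecords(rep.Data);   // no match: insert
}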
See Algorithm to optimize nested loops.
It's an interesting read and a cool trick, but not necessarily something that you should apply in every situation.
Also, be careful about answers providing you with LINQ queries. Often they "look cool" because you're not using the word "for", but really they're just hiding those for loops under the hood.
If you're really concerned about performance and the computer can handle it, you can look at the Task Parallel Library. It's not necessarily going to solve all of your problems, because you can be limited by processor/memory, and you could end up making your application slower.
Is this something that a user of your application is going to be doing regularly? If so, can you make it an asynchronous task that they can come back to later, or is it an offline process that they aren't ever going to see? Depending on usage expectations, sometimes the time something takes isn't the end of the world.

Most efficient collection for storing data from LINQ to Entities?

I have read several different sources over the years that indicate that when storing a collection of data, a List<T> is efficient when you want to insert objects, and an IEnumerable<T> is best for enumerating over a collection.
In LINQ-to-Entities, there is the AsEnumerable() function, that will return an IEnumerable<T>, but it will not resolve the SQL created by the LINQ statement until you start enumerating over the list.
What if I want to store objects from LINQ to Entities in a collection and then query on that collection later?
Using this strategy causes the SQL to be resolved by adding a WHERE clause and querying each record separately. I specifically don't want to do that because I'm trying to limit network chatter:
var myDataToLookup = context.MyData.AsEnumerable();

foreach (var myOtherDatum in myOtherDataList)
{
    // gets a single record from the database each time
    var myDatum = myDataToLookup.SingleOrDefault(w => w.key == myOtherDatum.key);
}
How do I resolve the SQL upfront so myDataToLookup actually contains the data in memory? I've tried ToArray:
var myDataToLookup = context.MyData.ToArray();
But I recently learned that it actually uses more memory than ToList does:
Is it better to call ToList() or ToArray() in LINQ queries?
Should I use a join instead?
var myCombinedData = from o in myOtherDataList
                     join d in myDataToLookup on o.key equals d.key
                     select new { myOtherData = o, myData = d };
Should I use ToDictionary and store my key as the key to the dictionary? Or am I worrying too much about this?
If you're using LINQ to Entities then you should not worry if ToArray is slower than ToList. There is almost no difference between them in terms of performance and LINQ to Entities itself will be a bottleneck anyway.
Regarding a dictionary: it is a structure optimized for reads by key. There is an additional cost to adding new items, though. So if you will read by key a lot and add new items not that often, then that's the way to go. But to be honest, you probably should not bother at all: if the data size is not big, you won't see a difference.
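If lookups by key really are the hot path, a minimal sketch of the dictionary route (assuming key is unique in MyData; ToDictionary throws on duplicate keys):

// Run the query once, then do O(1) lookups in memory.
var myDataByKey = context.MyData.ToDictionary(d => d.key);

foreach (var myOtherDatum in myOtherDataList)
{
    if (myDataByKey.TryGetValue(myOtherDatum.key, out var myDatum))
    {
        // work with myDatum
    }
}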
Think of IEnumerable, ICollection and IList/IDictionary as a hierarchy, each one inheriting from the previous one. Arrays add a level of restriction and complexity on top of Lists. Simply put, IEnumerable gives you iteration only. ICollection adds counting, and IList then gives richer functionality, including finding, adding and removing elements by index or via lambda expressions. Dictionaries provide efficient access via a key. Arrays are much more static.
So, the answer then depends on your requirements. If it is appropriate to hold the data in memory and you need to frequently re-query it then I usually convert the Entity result to a List. This also loads the data.
If access via a set of keys is paramount then I use a Dictionary.
I cannot remember that last time I used an array except for infrequent and very specific purposes.
So, not a direct answer, but as your question and the other replies indicate, there isn't a single answer and the solution will be a compromise.
When I code and measure performance and data carried over the network, here is how I look at things based on your example above.
Let's say your result returns 100 records. Your code has now run a query on the server and performed 1 second of processing (I made the number up for the sake of argument).
Then you need to cast it to a list, which is going to be 1 more second of processing. Then you want to find all records that have a value of 1. The code will now loop through the entire list to find the values with 1 and then return you the result. This is, let's say, another 1 second of processing, and it finds 10 records.
Your network is going to carry over 10 records that took 3 seconds to process.
If you move your logic to your data layer and make your query search right away for the records that you want, you can save 2 seconds of processing and still only carry 10 records across the network. The bonus is also that you can just use IEnumerable<T> as a result and not have to cast it to a list, thus eliminating the 1 second of casting to a list and the 1 second of iterating through the list.
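A sketch of that contrast, using a made-up Value field as the thing being filtered on:

// Filter after materializing: every row travels over the network first.
var inMemory = context.MyData.ToList();
var matches = inMemory.Where(d => d.Value == 1).ToList();

// Filter in the data layer: the predicate becomes a SQL WHERE clause and
// only the matching rows come back.
var fromServer = context.MyData.Where(d => d.Value == 1).ToList();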
I hope this helps answer your question.

What is the fastest way to filter a list of strings when making an Intellisense/Autocomplete list?

I'm writing an Intellisense/Autocomplete like the one you find in Visual Studio. It's all fine up until when the list contains probably 2000+ items.
I'm using a simple LINQ statement for doing the filtering:
var filterCollection = from s in listCollection
                       where s.FilterValue.IndexOf(currentWord, StringComparison.OrdinalIgnoreCase) >= 0
                       orderby s.FilterValue
                       select s;
I then assign this collection to a WPF Listbox's ItemSource, and that's the end of it, works fine.
Note that the Listbox is also virtualised, so there will only be at most 7-8 visual elements in memory and in the visual tree.
However, the caveat right now is that when the user types extremely fast in the richtextbox, and on every key up I execute the filtering + binding, there's a semi-race condition, or out-of-sync filtering: the first keystroke's filtering could still be doing its filtering or binding work while the fourth keystroke is doing the same.
I know I could put in a delay before applying the filter, but I'm trying to achieve a seamless filtering much like the one in Visual Studio.
I'm not sure exactly where my problem lies, so I'm attributing it either to IndexOf's string operation, or to the possibility that my list of strings could be organised in some kind of index that would speed up searching.
Any suggestions of code samples are much welcomed.
Thanks.
Latency is not your problem if you have a result set of 2000 items. I'm making some large assumptions here, but you only really need to return 500 items maximum - your user will keep typing to narrow the result set until it is an acceptable size to browse through.
You should optimize the common case (I'm assuming where it will end up with say ~50 items) - if your user is scrolling through a small list of 2000 items, something else is wrong and the interface needs more work.
I would suggest trying to cap your result set at some number of items and seeing if the problem goes away. That is, you might have 5000 to choose from, but try to return no more than say 100, even if more match. Say:
var filterCollection = (from s in listCollection
                        where s.FilterValue.IndexOf(currentWord, StringComparison.OrdinalIgnoreCase) >= 0
                        orderby s.FilterValue
                        select s).Take(100);
If your problem goes away, the slowdown may be caused by too many items being returned for the listbox. I am not sure that the problem will go away, since the ListBox is virtualized, but it's worth a shot. You can also try the same thing, but limiting the result of the filtering to 100 items, before the sort (i.e., orderby) and see if that helps. It's more efficient to do it in this order, anyways:
var filterCollection = (from s in listCollection
                        where s.FilterValue.IndexOf(currentWord, StringComparison.OrdinalIgnoreCase) >= 0
                        select s).Take(100)
                       .OrderBy(s => s.FilterValue);
The bottom line is determining whether the problem is a function of the number of items returned and assigned to filterCollection, of the initial number of items, or both.
I think the problem is that your filter performs O(n) (where n is the total number of items to autocomplete from), that is, it has to go through every item to figure out which ones match. The more items you have, the worse the filter will perform.
Instead of using a list, try using a trie. Tries perform O(m), where m is the number of characters in the string. This means the size of the dataset does not affect the performance of the lookup.
In Promptu (an app launcher I wrote), I use tries in the intellisense/autocomplete. If you want to see an example of tries in action, you can download it and try it out.
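A bare-bones sketch of the idea (prefix matching only, unlike the substring match your IndexOf filter does; case handling, culture and ranking are left out):

using System;
using System.Collections.Generic;

public class Trie
{
    private class Node
    {
        public Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public List<string> Words = new List<string>();   // words whose prefix path passes through this node
    }

    private readonly Node root = new Node();

    public void Add(string word)
    {
        var node = root;
        foreach (var c in word.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new Node();
            node = child;
            node.Words.Add(word);
        }
    }

    // All words starting with the given prefix; the cost depends on the
    // prefix length, not on how many words are stored.
    public IReadOnlyList<string> StartingWith(string prefix)
    {
        var node = root;
        foreach (var c in prefix.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(c, out node))
                return Array.Empty<string>();
        }
        return node.Words;
    }
}

Keeping the word list on every node trades memory for a retrieval that is just a walk down the prefix; a leaner variant stores words only at terminal nodes and collects matches by traversing the subtree.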
A really simple optimization is
if(currentWord.StartsWith(lastWord))
you can filter the list of filtered items returned by the last query (a sketch follows below). That is, unless you reduce the number of items returned by your LINQ query, as suggested by some of the other answers. You could always store what's in the query in a variable and then do the Take(100) afterward, although you'll need to make sure LINQ's lazy execution doesn't bite you in that case.
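A sketch of that incremental filter, where CompletionItem, lastWord, lastResults and myListBox are hypothetical names for the item type, the state kept between keystrokes, and the WPF ListBox:

// Hypothetical fields: string lastWord; List<CompletionItem> lastResults;
IEnumerable<CompletionItem> candidates = listCollection;
if (lastWord != null && currentWord.StartsWith(lastWord, StringComparison.OrdinalIgnoreCase))
    candidates = lastResults;      // already narrowed by the previous keystroke

lastResults = candidates
    .Where(s => s.FilterValue.IndexOf(currentWord, StringComparison.OrdinalIgnoreCase) >= 0)
    .ToList();                     // materialize so the next keystroke can build on it
lastWord = currentWord;

myListBox.ItemsSource = lastResults.OrderBy(s => s.FilterValue);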
On the binding side, rather than replace the entire collection you can use an ObservableCollection and just add/remove items from it. It would be a good idea to invert what your filter returns if you're going to do that, but you'd see a much quicker response and wouldn't see such a big performance hit when the user is typing quickly.
Here's your "race condition".
orderby s.FilterValue
Consider the letter sequence d, o, g.
"d" starts running and will match say 30% of the set. 6000 items must be ordered.
"do" starts running and will match say 6% of the set. 1200 items must be ordered.
"dog" starts running and will match say 0.5% of the set. 100 items must be ordered.
When considering the different workloads of each event, it is no surprise that the last event will finish before the first and second event.
The most correct behavior I can imagine is to prevent the binding for any prior active events when a new event starts. If you can prevent the binding by halting execution on those events, so much the better. Those earlier events are missiles that no longer have targets.
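A sketch of that cancellation, assuming the filtering is moved to a background task; the field and helper names here are made up:

private CancellationTokenSource filterCts;   // shared between keystrokes

private async void OnKeyUp(object sender, KeyEventArgs e)
{
    filterCts?.Cancel();                     // retire work from earlier keystrokes
    filterCts = new CancellationTokenSource();
    var token = filterCts.Token;
    var currentWord = GetCurrentWord();      // hypothetical helper

    try
    {
        var matches = await Task.Run(() =>
            listCollection
                .Where(s =>
                {
                    token.ThrowIfCancellationRequested();   // halt stale filtering promptly
                    return s.FilterValue.IndexOf(currentWord, StringComparison.OrdinalIgnoreCase) >= 0;
                })
                .OrderBy(s => s.FilterValue)
                .ToList(), token);

        if (!token.IsCancellationRequested)
            myListBox.ItemsSource = matches; // bind only if this is still the latest keystroke
    }
    catch (OperationCanceledException)
    {
        // Superseded by a newer keystroke; nothing to bind.
    }
}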

Converting IEnumerable<T> to List<T> on a LINQ result, huge performance loss

On a LINQ result like this:
var result = from x in Items select x;
List<T> list = result.ToList<T>();
However, the ToList<T> is really slow. Does it make the list mutable, and is the conversion therefore slow?
In most cases I can manage with just my IEnumerable or a parallel distinct query, but now I want to bind the items to a DataGridView, so I need something other than IEnumerable. Any suggestions on how I can gain performance on ToList, or on a replacement for it?
On 10 million records in the IEnumerable, the .ToList<T> takes about 6 seconds.
.ToList() is slow in comparison to what?
If you are comparing
var result = from x in Items select x;
List<T> list = result.ToList<T>();
to
var result = from x in Items select x;
you should note that since the query is evaluated lazily, the first line doesn't do much at all. It doesn't retrieve any records. Deferred execution makes this comparison completely unfair.
It's because LINQ likes to be lazy and do as little work as possible. This line:
var result = from x in Items select x;
despite your choice of name, isn't actually a result, it's just a query object. It doesn't fetch any data.
List<T> list = result.ToList<T>();
Now you've actually requested the result, hence it must fetch the data from the source and make a copy of it. ToList guarantees that a copy is made.
With that in mind, it's hardly surprising that the second line is much slower than the first.
No, it's not creating the list that takes time, it's fetching the data that takes time.
Your first code line doesn't actually fetch the data, it only sets up an IEnumerable that is capable of fetching the data. It's when you call the ToList method that it will actually get all the data, and that is why all the execution time is in the second line.
You should also consider if having ten million lines in a grid is useful at all. No user is ever going to look through all the lines, so there isn't really any point in getting them all. Perhaps you should offer a way to filter the result before getting any data at all.
I think it's because of memory reallocations: ToList cannot know the size of the collection beforehand, so it cannot allocate enough storage to keep all items up front. Therefore, it has to reallocate the List<T> as it grows.
If you can estimate the size of your result set, it'll be much faster to preallocate enough elements using the List<T>(int) constructor overload, and then manually add items to it.
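A sketch of that preallocation, keeping the question's List<T> placeholder and using its 10 million figure as the estimate:

// Preallocate the backing array once instead of letting the list grow
// (and copy itself) repeatedly as items are added.
var list = new List<T>(10_000_000);
list.AddRange(result);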

Does Linq cache results of multiple Froms?

Suppose we got a code like this:
IEnumerable<Foo> A = meh();
IEnumerable<Foo> B = meh();

var x = from a in A
        from b in B
        select new { a, b };
Let's also assume that meh returns an IEnumerable which performs a lot of expensive calculations when iterated over. Of course, we can simply cache the calculated results manually by means of
IEnumerable<Foo> A = meh().ToList();
My question is whether this manual caching of A and B is required, or if the above query caches the results of A and B itself during execution, so each line gets calculated only once. The difference between 2 * n and n * n calculations may be huge, and I did not find a description of the behavior on MSDN; that's why I'm asking.
Assuming you mean LINQ to Objects, it definitely doesn't do any caching - nor should it, IMO. You're building a query, not the results of a query, if you see what I mean. Apart from anything else, you might want to iterate through a sequence which is larger than can reasonably be held in memory (e.g. iterate over every line in a multi-gigabyte log file). I certainly wouldn't want LINQ to try to cache that for me!
If you want a query to be evaluated and buffered for later quick access, calling ToList() is probably your best approach.
Note that it would be perfectly valid for B to depend on A, which is another reason not to cache. For example:
var query = from file in Directory.GetFiles("*.log")
            from line in new LineReader(file)
            ...;
You really don't want LINQ to cache the results of reading the first log file and use the same results for every log file. I suppose it could be possible for the compiler to notice that the second from clause didn't depend on the range variable from the first one - but it could still depend on some side effect or other.
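So for the original example, the manual ToList calls are indeed the way to get 2 * n evaluations instead of n * n; a minimal sketch:

// Each expensive sequence is evaluated exactly once, up front.
List<Foo> A = meh().ToList();
List<Foo> B = meh().ToList();

var x = from a in A
        from b in B           // B is still re-enumerated per element of A,
        select new { a, b };  // but now it's just an in-memory list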
