LINQ instead of ForEach on IEnumerable - c#

I have an IEnumerable object which has a field called Data.
Data has values separated by ';'.
I need to split this Data with ';' and assign this into another field in the same IEnumerable which is an array element.
Eg:
IEnumerable<Object> =
[{
Data = "123;345",
DataList = {}
},
{
Data = "32424;87878",
DataList = {}
}]
and continues..
I need to split the Data in each row and assign to DataList in the same row.
Could anyone provide a lambda expression to do this?

Putting performance aside for a moment I'll start with the inline "elegant" (opinion based) solution for the question of a substitute for the traditional ForEach:
collection.ToList().ForEach(item => { item.DataList = item.Data.Split(';'); });
Now, looking at the Reference Source, what actually stands behind ForEach is a simple loop:
for(int i = 0 ; i < _size; i++) {
/* O(1) code */
action(_items[i]);
}
Note that this is not a linq solution. It uses a lambda expression, but ForEach is a method of List<T>, not a function of linq.
Now back to performance. There are two issues.
The less significant one is that for each iteration of the internal loop there is another function call to the passed Action.
The more significant one is that, as described, the object is an IEnumerable<T>. ForEach is a method of List<T>, not of IEnumerable<T>. Therefore there is a call to ToList(), which creates a new collection and runs in O(n), and that is not needed.
I'd just go with the simple straight forward solution of:
foreach(var item in collection)
{
item.DataList = item.Data.Split(';');
}
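One caveat with the loop above, as a minimal sketch: `Row` and `SplitDemo` are hypothetical names standing in for the question's object, which is not shown. If the IEnumerable is itself a deferred query that creates its objects (for example a `Select`), every enumeration builds fresh objects, so the assignment only sticks when the sequence is already materialized.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's object.
public class Row
{
    public string Data { get; set; }
    public string[] DataList { get; set; }
}

public static class SplitDemo
{
    // Splits Data into DataList for every item, in place.
    public static void FillDataLists(IEnumerable<Row> collection)
    {
        foreach (var item in collection)
        {
            item.DataList = item.Data.Split(';');
        }
    }

    public static void Main()
    {
        // Materialized collection: the objects exist once, so updates stick.
        var rows = new List<Row>
        {
            new Row { Data = "123;345" },
            new Row { Data = "32424;87878" }
        };
        FillDataLists(rows);
        Console.WriteLine(rows[0].DataList.Length); // 2

        // Deferred query: each enumeration creates new Row objects,
        // so the update lands on objects that are then thrown away.
        IEnumerable<Row> lazy = new[] { "1;2;3" }.Select(s => new Row { Data = s });
        FillDataLists(lazy);
        Console.WriteLine(lazy.First().DataList == null); // True
    }
}
```

If the source might be lazy, materialize it once (ToList()) and keep working with that list.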
As a general note about linq:
Linq is for manipulations on collections and not really for updating the items enumerated. Here an update operation is needed, so linq is not the best-fitting solution.
Performance - For an in-memory collection, linq will only improve performance when deferred execution is relevant and the non-linq implementation is not lazy. When working with linq-to-X, such as against a database, it is a completely different world and depends very much on the specific linq provider. As a whole, it can help you perform operations in a single DB query that would otherwise, if not pushed down to the database, take several queries.
Worth reading: Is a LINQ statement faster than a 'foreach' loop?

Related

Most efficient collection for storing data from LINQ to Entities?

I have read several different sources over the years that indicate that when storing a collection of data, a List<T> is efficient when you want to insert objects, and an IEnumerable<T> is best for enumerating over a collection.
In LINQ-to-Entities, there is the AsEnumerable() function, that will return an IEnumerable<T>, but it will not resolve the SQL created by the LINQ statement until you start enumerating over the list.
What if I want to store objects from LINQ to Entities in a collection and then query on that collection later?
Using this strategy causes the SQL to be resolved by adding a WHERE clause and querying each record separately. I specifically don't want to do that because I'm trying to limit network chatter:
var myDataToLookup = context.MyData.AsEnumerable();
foreach (var myOtherDatum in myOtherDataList)
{
    // gets singular record from database each time.
    var myDatum = myDataToLookup.SingleOrDefault(w => w.key == myOtherDatum.key);
}
How do I resolve the SQL upfront so myDataToLookup actually contains the data in memory? I've tried ToArray:
var myDataToLookup = context.MyData.ToArray();
But I recently learned that it actually uses more memory than ToList does:
Is it better to call ToList() or ToArray() in LINQ queries?
Should I use a join instead?
var myCombinedData = from o in myOtherDataList
                     join d in myDataToLookup on
                     o.key equals d.key
                     select new { myOtherData = o, myData = d };
Should I use ToDictionary and store my key as the key to the dictionary? Or am I worrying too much about this?
If you're using LINQ to Entities then you should not worry if ToArray is slower than ToList. There is almost no difference between them in terms of performance and LINQ to Entities itself will be a bottleneck anyway.
Regarding a dictionary. It is a structure optimized for reads by keys. There is an additional cost on adding new items though. So, if you will read by key a lot and add new items not that often then that's the way to go. But to be honest - you probably should not bother at all. If data size is not big enough, you won't see a difference.
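The dictionary approach can be sketched as follows. This is an illustration under assumptions: `MyDatum`, its `Key` and `Name` properties, and `LookupDemo` are hypothetical, and the in-memory list stands in for data already loaded from the context.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity type; the real one is not shown in the question.
public class MyDatum
{
    public int Key { get; set; }
    public string Name { get; set; }
}

public static class LookupDemo
{
    // Builds the lookup once: O(n) up front, then O(1) on average per read.
    // Note that ToDictionary throws if two items share a key.
    public static Dictionary<int, MyDatum> BuildLookup(IEnumerable<MyDatum> data)
    {
        return data.ToDictionary(d => d.Key);
    }

    public static void Main()
    {
        // Stand-in for data already materialized from the database.
        var loaded = new List<MyDatum>
        {
            new MyDatum { Key = 1, Name = "one" },
            new MyDatum { Key = 2, Name = "two" }
        };
        var lookup = BuildLookup(loaded);

        foreach (var key in new[] { 2, 3 })
        {
            MyDatum datum;
            // TryGetValue avoids an exception when the key is missing.
            if (lookup.TryGetValue(key, out datum))
                Console.WriteLine(key + ": " + datum.Name);
            else
                Console.WriteLine(key + ": not found");
        }
    }
}
```

Building the dictionary only pays off when there are many lookups; for a handful of reads over a small list, a plain scan is just as good.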
Think of IEnumerable<T>, ICollection<T> and IList<T> as a hierarchy, each one inheriting from the previous one; IDictionary<TKey, TValue> also builds on ICollection<T>. Simply, IEnumerable<T> gives you iteration only. ICollection<T> adds counting plus adding and removing elements, and IList<T> then adds access by index. List<T> itself gives richer functionality, including finding elements via lambda expressions. Dictionaries provide efficient access via a key. Arrays add a level of restriction on top of lists: their size is fixed once they are created.
So, the answer then depends on your requirements. If it is appropriate to hold the data in memory and you need to frequently re-query it then I usually convert the Entity result to a List. This also loads the data.
If access via a set of keys is paramount then I use a Dictionary.
I cannot remember the last time I used an array except for infrequent and very specific purposes.
So, not a direct answer, but as your question and the other replies indicate, there isn't a single answer and the solution will be a compromise.
When I code and measure performance and data carried over the network, here is how I look at things based on your example above.
Let's say your result returns 100 records. Your code has now run a query on the server and performed 1 second of processing (I made the number up for sake of argument).
Then you need to convert it to a list, which is going to be 1 more second of processing. Then you want to find all records that have a value of 1. The code will now loop through the entire list to find the values with 1 and then return the result to you. This is, let's say, another 1 second of processing, and it finds 10 records.
Your network is going to carry over 10 records that took 3 seconds to process.
If you move your logic to your data layer and make your query search right away for the records that you want, you can save 2 seconds of processing and still only carry 10 records across the network. The bonus is that you can just use IEnumerable<T> as the result and not have to convert it to a list, thus eliminating the 1 second of converting to a list and the 1 second of iterating through the list.
I hope this helps answer your question.

Searching a collection effectively in c#

I have an AsyncObservable collection of some class, say "dashboard". Each item in the dashboard collection contains a collection of some other class, say "chart". A chart has various properties such as name, type, etc. I want to search this collection by chart name, type, and so on. Can anybody suggest a searching technique? Currently I search by traversing the whole collection with a foreach and comparing the entered input with each item in the collection (this is not so efficient if the amount of data is large). I want to make it more efficient. I am using C#.
My code is:
foreach (DashBoard item in this.DashBoards)
{
Chart obj1 = item.CurrentCharts.ToList().Find(chart => chart.ChartName.ToUpper().Contains(searchText.ToUpper()));
if (obj1 != null)
{
if (obj1.IsHighlighted != Colors.Wheat)
obj1.IsHighlighted = Colors.Wheat;
item.IsExpanded = true;
flagList.Add(1);
}
else
{
flagList.Add(0);
}
}
You can use the LINQ query.
For example, you can do something like this. If you post your code, we can solve the problem.
Dashboard.SelectMany(q => q.Chart).Where(a => a.Name == "SomeName")
Here is the reference linq question: querying nested collections
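Applied to the code posted in the question, the SelectMany idea might look like the sketch below. The `DashBoard` and `Chart` classes here are minimal hypothetical versions of the asker's types, and `ChartSearch.FindCharts` is an illustrative helper, not an existing API. It is still a linear scan, but it avoids the ToList() copy per dashboard and the two ToUpper() allocations per chart.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal hypothetical versions of the question's classes.
public class Chart
{
    public string ChartName { get; set; }
}

public class DashBoard
{
    public List<Chart> CurrentCharts { get; set; } = new List<Chart>();
}

public static class ChartSearch
{
    // Flattens all charts of all dashboards and keeps those whose name
    // contains searchText, ignoring case.
    public static List<Chart> FindCharts(IEnumerable<DashBoard> dashBoards, string searchText)
    {
        return dashBoards
            .SelectMany(d => d.CurrentCharts)
            .Where(c => c.ChartName.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0)
            .ToList();
    }

    public static void Main()
    {
        var boards = new List<DashBoard>
        {
            new DashBoard { CurrentCharts = { new Chart { ChartName = "Sales Overview" }, new Chart { ChartName = "Latency" } } },
            new DashBoard { CurrentCharts = { new Chart { ChartName = "sales by region" } } }
        };

        Console.WriteLine(FindCharts(boards, "SALES").Count); // 2
    }
}
```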
Edit: Foreach loops or LINQ
The answer is not really clear-cut. There are two sides to any code cost argument: performance and maintainability. The first of these is obvious and quantifiable.
Under the hood LINQ will iterate over the collection, just as foreach will. The difference between LINQ and foreach is that LINQ will defer execution until the iteration begins.
Performance wise take a look at this blog post: http://www.schnieds.com/2009/03/linq-vs-foreach-vs-for-loop-performance.html
In your case:
If the collection is small or medium sized, I would suggest using foreach for better performance.
At the end of the day:
LINQ is more elegant but less efficient most of the time; foreach clutters the code a bit but performs better.
On large collections, or where parallel computing makes sense, I would choose LINQ, as the performance gap will be reduced to a minimum.

Speed improvement in LINQ Where(Array.Contains)

I initially had a method that contained a LINQ query returning int[], which then got used later in a fashion similar to:
var result = something.Where(s => previousarray.Contains(s.field));
This turned out to be horribly slow, until the first array was retrieved as the native IQueryable<int>. It now runs very quickly, but I'm wondering how I'd deal with the situation if I was provided an int[] from elsewhere which then had to be used as above.
Is there a way to speed up the query in such cases? Converting to a List doesn't seem to help.
In LINQ-SQL, a Contains will be converted to a SELECT ... WHERE field IN(...) and should be relatively fast. In LINQ-Objects however, it will call ICollection<T>.Contains if the source is an ICollection<T>.
When a LINQ-SQL result is treated as an IEnumerable instead of an IQueryable, you lose the linq provider - i.e., any further operations will be done in memory and not in the database.
As for why it's much slower in memory:
Array.Contains() is an O(n) operation so
something.Where(s => previousarray.Contains(s.field));
is O(p * s) where p is the size of previousarray and s is the size of something.
HashSet<T>.Contains() on the other hand is an O(1) operation. If you first create a hashset, you will see a big improvement on the .Contains operation as it will be O(s) instead of O(p * s).
Example:
var previousSet = new HashSet<int>(previousarray);
var result = something.Where(s => previousSet.Contains(s.field));
Contains on lists, arrays and other IEnumerables is an O(n) operation. On a HashSet it is approximately O(1). So you should try to use one.

Does "foreach" cause repeated Linq execution?

I've been working for the first time with the Entity Framework in .NET, and have been writing LINQ queries in order to get information from my model. I would like to program in good habits from the beginning, so I've been doing research on the best way to write these queries and get their results. Unfortunately, in browsing Stack Exchange, I seem to have come across two conflicting explanations of how deferred/immediate execution works with LINQ:
A foreach causes the query to be executed in each iteration of the loop:
Demonstrated in question Slow foreach() on a LINQ query - ToList() boosts performance immensely - why is this? , the implication is that "ToList()" needs to be called in order to evaluate the query immediately, as the foreach is evaluating the query on the data source repeatedly, slowing down the operation considerably.
Another example is the question Foreaching through grouped linq results is incredibly slow, any tips? , where the accepted answer also implies that calling "ToList()" on the query will improve performance.
A foreach causes a query to be executed once, and is safe to use with LINQ
Demonstrated in question Does foreach execute the query only once? , the implication is that the foreach causes one enumeration to be established, and will not query the datasource each time.
Continued browsing of the site has turned up many questions where "repeated execution during a foreach loop" is the culprit of the performance concern, and plenty of other answers stating that a foreach will appropriately grab a single query from a datasource, which means that both explanations seem to have validity. If the "ToList()" hypothesis is incorrect (as most of the current answers as of 2013-06-05 1:51 PM EST seem to imply), where does this misconception come from? Is there one of these explanations that is accurate and one that isn't, or are there different circumstances that could cause a LINQ query to evaluate differently?
Edit: In addition to the accepted answer below, I've turned up the following question over on Programmers that very much helped my understanding of query execution, particularly the pitfalls that could result in multiple datasource hits during a loop, which I think will be helpful for others interested in this question: https://softwareengineering.stackexchange.com/questions/178218/for-vs-foreach-vs-linq
In general LINQ uses deferred execution. If you use methods like First() and FirstOrDefault() the query is executed immediately. When you do something like;
foreach(string s in MyObjects.Select(x => x.AStringProp))
The results are retrieved in a streaming manner, meaning one by one. Each time the iterator calls MoveNext the projection is applied to the next object. If you were to have a Where it would first apply the filter, then the projection.
If you do something like;
List<string> names = People.Select(x => x.Name).ToList();
foreach (string name in names)
Then I believe this is a wasteful operation. ToList() will force the query to be executed, enumerating the People list and applying the x => x.Name projection. Afterwards you will enumerate the list again. So unless you have a good reason to have the data in a list (rather than IEnumerable) you're just wasting CPU cycles.
Generally speaking using a LINQ query on the collection you're enumerating with a foreach will not have worse performance than any other similar and practical options.
Also it's worth noting that people implementing LINQ providers are encouraged to make the common methods work as they do in the Microsoft provided providers but they're not required to. If I were to go write a LINQ to HTML or LINQ to My Proprietary Data Format provider there would be no guarantee that it behaves in this manner. Perhaps the nature of the data would make immediate execution the only practical option.
Also, final edit; if you're interested in this Jon Skeet's C# In Depth is very informative and a great read. My answer summarizes a few pages of the book (hopefully with reasonable accuracy) but if you want more details on how LINQ works under the covers, it's a good place to look.
Try this in LINQPad:
void Main()
{
var testList = Enumerable.Range(1,10);
var query = testList.Where(x =>
{
Console.WriteLine(string.Format("Doing where on {0}", x));
return x % 2 == 0;
});
Console.WriteLine("First foreach starting");
foreach(var i in query)
{
Console.WriteLine(string.Format("Foreached where on {0}", i));
}
Console.WriteLine("First foreach ending");
Console.WriteLine("Second foreach starting");
foreach(var i in query)
{
Console.WriteLine(string.Format("Foreached where on {0} for the second time.", i));
}
Console.WriteLine("Second foreach ending");
}
Each time the where delegate is being run we shall see a console output, hence we can see the Linq query being run each time. Now by looking at the console output we see the second foreach loop still causes the "Doing where on" to print, thus showing that the second usage of foreach does in fact cause the where clause to run again...potentially causing a slow down.
First foreach starting
Doing where on 1
Doing where on 2
Foreached where on 2
Doing where on 3
Doing where on 4
Foreached where on 4
Doing where on 5
Doing where on 6
Foreached where on 6
Doing where on 7
Doing where on 8
Foreached where on 8
Doing where on 9
Doing where on 10
Foreached where on 10
First foreach ending
Second foreach starting
Doing where on 1
Doing where on 2
Foreached where on 2 for the second time.
Doing where on 3
Doing where on 4
Foreached where on 4 for the second time.
Doing where on 5
Doing where on 6
Foreached where on 6 for the second time.
Doing where on 7
Doing where on 8
Foreached where on 8 for the second time.
Doing where on 9
Doing where on 10
Foreached where on 10 for the second time.
Second foreach ending
It depends on how the Linq query is being used.
var q = {some linq query here}
while (true)
{
foreach(var item in q)
{
...
}
}
The code above will execute the Linq query multiple times. Not because of the foreach, but because the foreach is inside another loop, so the foreach itself is being executed multiple times.
If all consumers of a linq query use it "carefully" and avoid dumb mistakes such as the nested loops above, then a linq query should not be executed multiple times needlessly.
There are occasions when reducing a linq query to an in-memory result set using ToList() is warranted, but in my opinion ToList() is used far, far too often. ToList() almost always becomes a poison pill whenever large data is involved, because it forces the entire result set (potentially millions of rows) to be pulled into memory and cached, even if the outermost consumer/enumerator only needs 10 rows. Avoid ToList() unless you have a very specific justification and you know your data will never be large.
Sometimes it might be a good idea to "cache" a LINQ query using ToList() or ToArray(), if the query is being accessed multiple times in your code.
But keep in mind that "caching" it still calls a foreach in turn.
So the basic rule for me is:
if a query is simply used in one foreach (and that's it) - then I don't cache the query
if a query is used in a foreach and in some other places in the code - then I cache it in a var using ToList/ToArray
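The effect of that rule can be measured with a small sketch. `CacheDemo` and `CountEvaluations` are illustrative names, and the counter simply records how often the Where predicate runs when the same query is consumed by two foreach loops.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CacheDemo
{
    // Enumerates the same query twice and returns how many times
    // the Where predicate was evaluated in total.
    public static int CountEvaluations(bool cache)
    {
        int evaluations = 0;
        IEnumerable<int> query = Enumerable.Range(1, 10).Where(x =>
        {
            evaluations++;
            return x % 2 == 0;
        });

        // ToList() runs the query once and stores the results.
        IEnumerable<int> source = cache ? query.ToList() : query;

        foreach (var i in source) { }
        foreach (var i in source) { }

        return evaluations;
    }

    public static void Main()
    {
        Console.WriteLine(CountEvaluations(false)); // 20: the query ran twice
        Console.WriteLine(CountEvaluations(true));  // 10: the query ran once
    }
}
```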
foreach, by itself, only runs through its data once. In fact, it specifically runs through it once. You can't look ahead or back, or alter the index the way you can with a for loop.
However, if you have multiple foreachs in your code, all operating on the same LINQ query, you may get the query executed multiple times. This is entirely dependent on the data, though. If you're iterating over a LINQ-based IEnumerable/IQueryable that represents a database query, it will run that query each time. If you're iterating over a List or other collection of objects, it will run through the list each time, but won't hit your database repeatedly.
In other words, this is a property of LINQ, not a property of foreach.
The difference is in the underlying type. As LINQ is built on top of IEnumerable (or IQueryable) the same LINQ operator may have completely different performance characteristics.
A List will always be quick to respond, but it takes an upfront effort to build a list.
An iterator is also IEnumerable and may employ any algorithm every time it fetches the "next" item. This will be faster if you don't actually need to go through the complete set of items.
You can turn any IEnumerable into a list by calling ToList() on it and storing the resulting list in a local variable. This is advisable if
You don't depend on deferred execution.
You have to access more total items than the whole set.
You can pay the upfront cost of retrieving and storing all items.
Even without entities, LINQ gives you deferred execution.
The actual LINQ expression is evaluated only when an iteration is forced.
In that sense, each time you use the LINQ expression it is going to be evaluated.
Now with entities this is still the same, but there is just more functionality at work here.
When the Entity Framework executes the query, it goes to the database and fetches the data. For each row it checks whether its context already tracks an entity with the same key. If not, it sets up the entity in its internal memory model and returns it to you. If it does, by default it returns the entity from the memory model it set up earlier rather than overwriting it with the newly fetched values.
This can make your life easier, but it can also be a pain. For instance, suppose you request all records from a table using a LINQ expression, and the Entity Framework loads all data from the table. If you later evaluate the same LINQ expression on the same context, entities that were already loaded keep the values they had the first time, even if the rows were changed in the database in the meantime.
The Entity Framework is a complicated thing. There are of course ways to make it refresh what it holds in its own memory model and the like.
I suggest reading "Programming Entity Framework" by Julia Lerman. It addresses lots of issues like the one you are having right now.
It will execute the LINQ statement the same number of times no matter if you do .ToList() or not. I have an example here with colored output to the console:
What happens in the code (see code at the bottom):
Create a list of 100 ints (0-99).
Create a LINQ statement that prints every int from the list followed by two * to the console in red color, and then return the int if it's an even number.
Do a foreach on the query, printing out every even number in green color.
Do a foreach on the query.ToList(), printing out every even number in green color.
As you can see in the output below, the number of ints written to the console is the same, meaning the LINQ statement is executed the same number of times.
The difference is in when the statement is executed. As you can see, when you do a foreach on the query (that you have not invoked .ToList() on), the list and the IEnumerable object, returned from the LINQ statement, are enumerated at the same time.
When you cache the list first, they are enumerated separately, but still the same amount of times.
The difference is very important to understand, because if the list is modified after you have defined your LINQ statement, the LINQ statement will operate on the modified list when it is executed (e.g. by .ToList()). BUT if you force execution of the LINQ statement (.ToList()) and then modify the list afterwards, the LINQ statement will NOT work on the modified list.
Here's the output:
Here's my code:
// Main method:
static void Main(string[] args)
{
IEnumerable<int> ints = Enumerable.Range(0, 100);
var query = ints.Where(x =>
{
Console.ForegroundColor = ConsoleColor.Red;
Console.Write($"{x}**, ");
return x % 2 == 0;
});
DoForeach(query, "query");
DoForeach(query, "query.ToList()");
Console.ForegroundColor = ConsoleColor.White;
}
// DoForeach method:
private static void DoForeach(IEnumerable<int> collection, string collectionName)
{
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n--- {0} FOREACH BEGIN: ---", collectionName);
if (collectionName.Contains("query.ToList()"))
collection = collection.ToList();
foreach (var item in collection)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write($"{item}, ");
}
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n--- {0} FOREACH END ---", collectionName);
}
Note about execution time: I did a few timing tests (not enough to post it here though) and I didn't find any consistency in either method being faster than the other (including the execution of .ToList() in the timing). On larger collections, caching the collection first and then iterating it seemed a bit faster, but there was no definitive conclusion from my test.
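The point above about modifying the list after defining the LINQ statement can be sketched like this. `SnapshotDemo` and `CountEvens` are illustrative names; the only difference between the two calls is whether ToList() is applied before the source list changes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SnapshotDemo
{
    // Defines a query over a list, optionally snapshots it with ToList(),
    // then modifies the list and counts what the query (or snapshot) yields.
    public static int CountEvens(bool snapshotFirst)
    {
        var numbers = new List<int> { 1, 2, 3, 4 };
        IEnumerable<int> evens = numbers.Where(x => x % 2 == 0);

        if (snapshotFirst)
        {
            evens = evens.ToList(); // executed now: 2 and 4
        }

        numbers.Add(6); // modify the source after the query was defined

        return evens.Count();
    }

    public static void Main()
    {
        Console.WriteLine(CountEvens(false)); // 3: deferred query sees the added 6
        Console.WriteLine(CountEvens(true));  // 2: the snapshot does not
    }
}
```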

Does foreach execute the query only once?

I have a list of items and a LINQ query over them. Now, with LINQ's deferred execution, would a subsequent foreach loop execute the query only once or for each turn in the loop?
Given this example (Taken from Introduction to LINQ Queries (C#), on MSDN)
// The Three Parts of a LINQ Query:
// 1. Data source.
int[] numbers = new int[7] { 0, 1, 2, 3, 4, 5, 6 };
// 2. Query creation.
// numQuery is an IEnumerable<int>
var numQuery =
from num in numbers
where (num % 2) == 0
select num;
// 3. Query execution.
foreach (int num in numQuery)
{
Console.Write("{0,1} ", num);
}
Or, in other words, would there be any difference if I had:
foreach (int num in numQuery.ToList())
And, would it matter, if the underlying data is not in an array, but in a Database?
Now, with LINQ's deferred execution, would a subsequent foreach loop execute the query only once or for each turn in the loop?
Yes, once for the loop. Actually, it may execute the query less than once - you could abort the looping part way through and the (num % 2) == 0 test wouldn't be performed on any remaining items.
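The "less than once" case can be made visible by counting how often the test runs. This is a small sketch using the numbers from the question; `EarlyExitDemo` and the break condition are illustrative additions.

```csharp
using System;
using System.Linq;

public static class EarlyExitDemo
{
    // Returns how many times the (num % 2) == 0 test ran
    // when the loop is abandoned part way through.
    public static int TestsRunBeforeBreak()
    {
        int tests = 0;
        int[] numbers = { 0, 1, 2, 3, 4, 5, 6 };

        var numQuery = numbers.Where(num =>
        {
            tests++;
            return (num % 2) == 0;
        });

        foreach (int num in numQuery)
        {
            if (num == 2)
            {
                break; // stop early: 3, 4, 5 and 6 are never tested
            }
        }

        return tests;
    }

    public static void Main()
    {
        Console.WriteLine(TestsRunBeforeBreak()); // 3 (tested 0, 1 and 2)
    }
}
```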
Or, in other words, would there be any difference if I had:
foreach (int num in numQuery.ToList())
Two differences:
In the case above, ToList() wastes time and memory, because it first does the same thing as the initial foreach, builds a list from it, and then foreachs that list. The differences will be somewhere between trivial and preventing the code from ever working, depending on the size of the results.
However, in the case where you are going to repeatedly do foreach on the same results, or otherwise use them repeatedly, then while the first foreach only runs the query once, the next foreach runs it again. If the query is expensive, then the ToList() approach (and storing that list) can be a massive saving.
No, it makes no difference. The in expression is evaluated once. More specifically, the foreach construct invokes the GetEnumerator() method on the in expression and repeatedly calls MoveNext() and accesses the Current property in order to traverse the IEnumerable.
OTOH, calling ToList() is redundant. You shouldn't bother calling it.
If the input is a database, the situation is slightly different, since LINQ outputs IQueryable, but I'm pretty sure that foreach still treats it as an IEnumerable (which IQueryable inherits).
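The GetEnumerator()/MoveNext()/Current pattern described above can be written out by hand. This is only a rough sketch of what a foreach over an IEnumerable<int> amounts to, not the exact code the compiler generates; `ForeachExpansion` and `Enumerate` are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ForeachExpansion
{
    // Hand-written equivalent of: foreach (int num in source) seen.Add(num);
    public static List<int> Enumerate(IEnumerable<int> source)
    {
        var seen = new List<int>();

        // GetEnumerator() is called once, on the "in" expression.
        using (IEnumerator<int> e = source.GetEnumerator())
        {
            // Each MoveNext() pulls exactly one more item from the query.
            while (e.MoveNext())
            {
                int num = e.Current;
                seen.Add(num);
            }
        }

        return seen;
    }

    public static void Main()
    {
        var evens = Enumerable.Range(0, 7).Where(n => n % 2 == 0);
        Console.WriteLine(string.Join(" ", Enumerate(evens))); // 0 2 4 6
    }
}
```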
As written, each iteration of the loop would do exactly as much work as it needed to fetch the next result. So the answer would technically be "none of the above". The query would execute "in pieces".
If you use ToList() or any other materialization method (ToArray() etc) then the query will be evaluated once on the spot and subsequent operations (such as iterating over the results) will simply work on a "dumb" list.
If numbers were an IQueryable instead of an IEnumerable -- as it would likely be in a database scenario -- then the above is still close to the truth although not a perfectly accurate description. In particular, on the first attempt to materialize a result the queryable provider would talk to the database and produce a result set; then, rows from this result set would be pulled on each iteration.
The linq query will be executed when it is enumerated (either as the result of a .ToList() call or by doing a foreach over the results).
If you are enumerating the results of the linq query twice, both times will cause it to query the data source (in your example, enumerating the collection), as it is itself only returning an IEnumerable. However, depending on the linq query, it may not always enumerate the entire collection (e.g. .Any() and .First() will stop on the first object, or on the first matching object if there is a .Where()).
The implementation details of a linq provider may differ, so the usual practice when the data source is a database is to call .ToList() straight away. This caches the results of the query, ensures that the query (in the case of EF, L2S or NHibernate) is executed there and then rather than when the collection is enumerated at some later point in the code, and prevents the query from being executed multiple times if the results are enumerated multiple times.
