Searching a collection effectively in C#

I have an AsyncObservableCollection of some class, say "DashBoard". Each item in the dashboard collection contains a collection of some other class, say "Chart". A chart has various properties such as name, type, etc. I want to search this collection by chart name, type, and so on. Can anybody suggest a searching technique? Currently I traverse the whole collection with a foreach and compare the entered input against each item, which is not efficient when the amount of data is large. I want to make it more efficient. I am using C#.
My code is:
foreach (DashBoard item in this.DashBoards)
{
    Chart obj1 = item.CurrentCharts.ToList().Find(chart => chart.ChartName.ToUpper().Contains(searchText.ToUpper()));
    if (obj1 != null)
    {
        if (obj1.IsHighlighted != Colors.Wheat)
            obj1.IsHighlighted = Colors.Wheat;
        item.IsExpanded = true;
        flagList.Add(1);
    }
    else
    {
        flagList.Add(0);
    }
}

You can use a LINQ query. For example, you can do something like this:
Dashboard.SelectMany(q => q.Chart).Where(a => a.Name == "SomeName")
Here is a related LINQ question for reference: querying nested collections
Edit: foreach loops or LINQ?
The answer is not really clear-cut. There are two sides to any code-cost argument: performance and maintainability. The first of these is obvious and quantifiable.
Under the hood, LINQ will iterate over the collection just as foreach will. The difference is that LINQ defers execution until the iteration begins.
Performance-wise, take a look at this blog post: http://www.schnieds.com/2009/03/linq-vs-foreach-vs-for-loop-performance.html
In your case: if the collection is relatively small or medium-sized, I would suggest using foreach for better performance.
At the end of the day: LINQ is more elegant but usually a little less efficient, while foreach clutters the code a bit but performs better. On large collections, or where parallel computing makes sense, I would choose LINQ, as the performance gap shrinks to a minimum.
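To make that suggestion concrete, here is a minimal, self-contained sketch of the SelectMany approach applied to the question's model. The trimmed-down DashBoard/Chart classes and the sample data are assumptions for illustration; using IndexOf with StringComparison.OrdinalIgnoreCase also avoids the two ToUpper() allocations per chart:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the question's DashBoard/Chart classes.
class Chart { public string ChartName { get; set; } }
class DashBoard { public List<Chart> CurrentCharts { get; } = new List<Chart>(); }

static class Program
{
    static void Main()
    {
        var dashboards = new List<DashBoard>
        {
            new DashBoard { CurrentCharts = { new Chart { ChartName = "Sales" },
                                              new Chart { ChartName = "Revenue" } } },
            new DashBoard { CurrentCharts = { new Chart { ChartName = "SalesForecast" } } }
        };

        string searchText = "sales";

        // Flatten all charts across all dashboards, then filter with a
        // case-insensitive substring match.
        var matches = dashboards
            .SelectMany(d => d.CurrentCharts)
            .Where(c => c.ChartName.IndexOf(searchText,
                           StringComparison.OrdinalIgnoreCase) >= 0)
            .ToList();

        foreach (var chart in matches)
            Console.WriteLine(chart.ChartName);
    }
}
```

If the search runs often over large data, you could additionally build a Dictionary or ILookup keyed by chart name once and probe it per search, rather than scanning every chart each time.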

Related

LINQ instead of ForEach on IEnumerable

I have an IEnumerable of objects, each of which has a field called Data.
Data holds values separated by ';'.
I need to split this Data on ';' and assign the result to another field in the same IEnumerable, DataList, which is an array.
Eg:
IEnumerable<Object> =
[{
    Data = "123;345",
    DataList = {}
},
{
    Data = "32424;87878",
    DataList = {}
}]
...and so on.
I need to split the Data in each row and assign it to DataList in the same row.
Could anyone provide a lambda expression to do this?
Putting performance aside for a moment, I'll start with the inline "elegant" (opinion-based) solution to the question of a substitute for the traditional ForEach:
collection.ToList().ForEach(item => { item.DataList = item.Data.Split(';'); });
Now, looking at the ReferenceSource, what actually stands behind ForEach is a simple loop:
for (int i = 0; i < _size; i++) {
    /* O(1) code */
    action(_items[i]);
}
Note that this is not a LINQ solution. It uses a lambda expression, but ForEach is not a LINQ method.
Now back to performance. There are two issues.
The less significant one is that each iteration of the internal loop makes an extra function call to the passed Action.
The more significant one is that, as described, the object is an IEnumerable<T>, while ForEach is a method of List<T>. Therefore there is a call to ToList() that creates a new collection in O(n), which is not needed.
I'd just go with the simple, straightforward solution:
foreach (var item in collection)
{
    item.DataList = item.Data.Split(';');
}
As a general note about LINQ:
LINQ is for querying collections, not really for updating the items enumerated. Here an update operation is needed, so LINQ is not the best-fitting tool.
Performance: for an in-memory collection, LINQ will only improve performance when deferred execution is relevant and the non-LINQ implementation is not lazy. When working with LINQ-to-X, such as a database provider, it is a completely different world and depends very much on the specific LINQ provider. As a whole it can let you perform operations in a single DB query where otherwise, if not implemented in the database, several queries would be needed.
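For completeness, here is a hedged sketch of a non-mutating alternative: instead of updating the enumerated items, project into new objects with Select (the Item class is a hypothetical stand-in for the question's element type):

```csharp
using System;
using System.Linq;

// Hypothetical element type matching the question's Data/DataList shape.
class Item
{
    public string Data { get; set; }
    public string[] DataList { get; set; }
}

static class Program
{
    static void Main()
    {
        var collection = new[]
        {
            new Item { Data = "123;345" },
            new Item { Data = "32424;87878" }
        };

        // Pure-LINQ alternative: build new objects rather than
        // mutating the existing ones in place.
        var projected = collection
            .Select(i => new Item { Data = i.Data, DataList = i.Data.Split(';') })
            .ToList();

        Console.WriteLine(projected[0].DataList[1]); // "345"
    }
}
```

Whether the copy is acceptable depends on whether other code holds references to the original items; if it does, the plain foreach mutation above remains the simpler choice.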
Worth reading: Is a LINQ statement faster than a 'foreach' loop?

Removing from a collection while in a foreach with linq

From what I understand, this seems to not be a safe practice...
I have a foreach loop on my list object that I am stepping through. Inside that foreach loop I am looking up records by an Id. Once I have that new list of records returned by that Id I do some parsing and add them to a new list.
What I would like to do is not step through the same Id more than once. So my thought process would be to remove it from the original list. However, this causes an error... and I understand why.
My question is: is there a safe way to go about this, or should I restructure my thought process a bit? I was wondering if anyone had experience with or thoughts on how to solve this issue.
Here is a little pseudocode:
_myList.ForEach(x =>
{
    List<MyModel> newMyList = _myList.FindAll(y => y.SomeId == x.SomeId).ToList();
    //Here is where I would do some work with newMyList
    //Now I am done... time to remove all records with x.SomeId
    _myList.RemoveAll(y => y.SomeId == x.SomeId);
});
I know that _myList.RemoveAll(y => y.SomeId == x.SomeId); is wrong, but in theory that would kinda be what I would be looking for.
I have also toyed around with the idea of pushing the used SomeId to an idList and then have it check each time, but that seems cumbersome and was wondering if there was a nicer way to handle what I am looking to do.
Sorry if I didn't explain this well. If there are any questions, please feel free to comment and I will answer/make edits where needed.
First off, using ForEach in your example isn't a great idea for these reasons.
You're right to think there are performance downsides to iterating through the full list for each remaining SomeId, but even making the list smaller every time would still require another full iteration of that subset (if it even worked).
As was pointed out in the comments, GroupBy on SomeId organizes the elements into groupings for you, and allows you to efficiently step through each subset for a given SomeId, like so:
_myList.GroupBy(x => x.SomeId)
.Select(g => DoSomethingWithGroupedElements(g));
Jon Skeet has an excellent set of articles about how the Linq extensions could be implemented. I highly recommend checking it out for a better understanding of why this would be more efficient.
First of all, a list being enumerated by a foreach cannot be structurally modified: you can't add or remove elements while the enumeration is in progress. There are a few ways you could handle this situation:
GroupBy
This is the method I would use. You can group your list by the property you want, and iterate through the IGrouping elements formed this way:
var groups = list.GroupBy(x => x.yourProperty);
foreach (var group in groups)
{
    //your code
}
Distinct properties list
You could also save the distinct property values in another list, and cycle through that list instead of the original one:
var propsList = list.Select(x => x.yourProperty).Distinct();
foreach (var prop in propsList)
{
    var tmpList = list.Where(x => x.yourProperty == prop);
    //your code
}
While loop
This will actually do what you originally wanted, but performance may not be optimal:
while (list.Any())
{
    var prop = list.First().yourProperty;
    var tmpList = list.Where(x => x.yourProperty == prop);
    //your code
    list.RemoveAll(x => x.yourProperty == prop);
}

Efficiency of C# Find on 1000+ records

I am trying, essentially, to see whether entities exist in a local context and sort them accordingly. This function seems to be faster than others we have tried (it runs in about 50 seconds for 1000 items), but I am wondering if there is something I can do to improve its efficiency. I believe the Find call is slowing it down significantly, since a simple foreach iteration over 1000 items takes milliseconds and benchmarking shows the bottleneck there. Any ideas would be helpful. Thank you.
Sample code:
foreach (var entity in entities)
{
    var localItem = db.Set<T>().Find(Key);
    if (localItem != null)
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
If this is a database (which from the comments I've gathered that it is...)
You would be better off doing fewer queries.
list1.AddRange(db.Set<T>().Where(x => x.Key == Key));
list2.AddRange(db.Set<T>().Where(x => x.Key != Key));
This would be 2 queries instead of 1000+.
Also be aware of the fact that by adding each one to a List<T>, you're keeping 2 large arrays. So if 1000+ turns into 10000000, you're going to have interesting memory issues.
See this post on my blog for more information: http://www.artisansoftware.blogspot.com/2014/01/synopsis-creating-large-collection-by.html
If I understand correctly, the database seems to be the bottleneck? If you want to (effectively) select data from a database relation whose attribute x should match an ==-criterion, you should consider creating a secondary access path for that attribute (an index structure). Depending on your database system and the value distribution in your table, this might be a hash index (especially good for == checks), a B+-tree (an all-rounder), or whatever your system offers.
However, this only helps if:
you are not simply fetching the full data set once and working with it in your application;
adding (another) index to the relation is not out of the question (e.g. it is worth having for more than a single need);
adding an index would actually be effective, e.g. the attribute you are querying on does not have very few unique values.
I found your answers very helpful, but here is ultimately how I solved the problem. It seemed .Find was the bottleneck.
var tableDictionary = db.Set<T>().ToDictionary(x => x.KeyValue, x => x);
foreach (var entity in entities)
{
    if (tableDictionary.ContainsKey(entity.KeyValue))
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
This ran with 900+ rows in about a tenth of a second, which for our purposes was efficient enough.
Rather than querying the DB for each item, you can just do one query, get all of the data (since you want all of the data from the DB eventually) and you can then group it in memory, which can be done (in this case) about as efficiently as in the database. By creating a lookup of whether or not the key is equal, we can easily get the two groups:
var lookup = db.Set<T>().ToLookup(item => item.Key == Key);
var list1 = lookup[true].ToList();
var list2 = lookup[false].ToList();
(You can use AddRange instead if the lists have previous values that should also be in them.)
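The single-pass partition is easy to see with a toy in-memory example (plain ints stand in for the entities):

```csharp
using System;
using System.Linq;

static class Program
{
    static void Main()
    {
        int key = 3;
        var items = new[] { 1, 2, 3, 3, 4 };

        // One pass over the data: the 'true' bucket holds matches,
        // the 'false' bucket holds everything else.
        var lookup = items.ToLookup(i => i == key);
        var matches = lookup[true].ToList();
        var rest = lookup[false].ToList();

        Console.WriteLine($"{matches.Count} match, {rest.Count} others");
    }
}
```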

Does "foreach" cause repeated Linq execution?

I've been working for the first time with the Entity Framework in .NET, and have been writing LINQ queries in order to get information from my model. I would like to program in good habits from the beginning, so I've been doing research on the best way to write these queries, and get their results. Unfortunately, in browsing Stack Exchange, I've seem to have come across two conflicting explanations in how deferred/immediate execution works with LINQ:
A foreach causes the query to be executed in each iteration of the loop:
Demonstrated in question Slow foreach() on a LINQ query - ToList() boosts performance immensely - why is this? , the implication is that "ToList()" needs to be called in order to evaluate the query immediately, as the foreach is evaluating the query on the data source repeatedly, slowing down the operation considerably.
Another example is the question Foreaching through grouped linq results is incredibly slow, any tips? , where the accepted answer also implies that calling "ToList()" on the query will improve performance.
A foreach causes a query to be executed once, and is safe to use with LINQ
Demonstrated in question Does foreach execute the query only once? , the implication is that the foreach causes one enumeration to be established, and will not query the datasource each time.
Continued browsing of the site has turned up many questions where "repeated execution during a foreach loop" is the culprit of the performance concern, and plenty of other answers stating that a foreach will appropriately grab a single query from a datasource, which means that both explanations seem to have validity. If the "ToList()" hypothesis is incorrect (as most of the current answers as of 2013-06-05 1:51 PM EST seem to imply), where does this misconception come from? Is there one of these explanations that is accurate and one that isn't, or are there different circumstances that could cause a LINQ query to evaluate differently?
Edit: In addition to the accepted answer below, I've turned up the following question over on Programmers that very much helped my understanding of query execution, particularly the pitfalls that could result in multiple datasource hits during a loop, which I think will be helpful for others interested in this question: https://softwareengineering.stackexchange.com/questions/178218/for-vs-foreach-vs-linq
In general, LINQ uses deferred execution. If you use methods like First() and FirstOrDefault(), the query is executed immediately. When you do something like:
foreach(string s in MyObjects.Select(x => x.AStringProp))
The results are retrieved in a streaming manner, meaning one by one. Each time the iterator calls MoveNext the projection is applied to the next object. If you were to have a Where it would first apply the filter, then the projection.
If you do something like;
List<string> names = People.Select(x => x.Name).ToList();
foreach (string name in names)
Then I believe this is a wasteful operation. ToList() will force the query to be executed, enumerating the People list and applying the x => x.Name projection. Afterwards you will enumerate the list again. So unless you have a good reason to have the data in a list (rather than an IEnumerable), you're just wasting CPU cycles.
Generally speaking, using a LINQ query on the collection you're enumerating with a foreach will not perform worse than any other similar and practical option.
Also it's worth noting that people implementing LINQ providers are encouraged to make the common methods work as they do in the Microsoft provided providers but they're not required to. If I were to go write a LINQ to HTML or LINQ to My Proprietary Data Format provider there would be no guarantee that it behaves in this manner. Perhaps the nature of the data would make immediate execution the only practical option.
Also, final edit; if you're interested in this Jon Skeet's C# In Depth is very informative and a great read. My answer summarizes a few pages of the book (hopefully with reasonable accuracy) but if you want more details on how LINQ works under the covers, it's a good place to look.
Try this in LINQPad:
void Main()
{
    var testList = Enumerable.Range(1, 10);
    var query = testList.Where(x =>
    {
        Console.WriteLine(string.Format("Doing where on {0}", x));
        return x % 2 == 0;
    });
    Console.WriteLine("First foreach starting");
    foreach (var i in query)
    {
        Console.WriteLine(string.Format("Foreached where on {0}", i));
    }
    Console.WriteLine("First foreach ending");
    Console.WriteLine("Second foreach starting");
    foreach (var i in query)
    {
        Console.WriteLine(string.Format("Foreached where on {0} for the second time.", i));
    }
    Console.WriteLine("Second foreach ending");
}
Each time the Where delegate runs we see a console output, so we can watch the LINQ query being executed. Looking at the console output, the second foreach loop still causes "Doing where on" to print, showing that the second use of foreach does in fact cause the where clause to run again, potentially causing a slowdown.
First foreach starting
Doing where on 1
Doing where on 2
Foreached where on 2
Doing where on 3
Doing where on 4
Foreached where on 4
Doing where on 5
Doing where on 6
Foreached where on 6
Doing where on 7
Doing where on 8
Foreached where on 8
Doing where on 9
Doing where on 10
Foreached where on 10
First foreach ending
Second foreach starting
Doing where on 1
Doing where on 2
Foreached where on 2 for the second time.
Doing where on 3
Doing where on 4
Foreached where on 4 for the second time.
Doing where on 5
Doing where on 6
Foreached where on 6 for the second time.
Doing where on 7
Doing where on 8
Foreached where on 8 for the second time.
Doing where on 9
Doing where on 10
Foreached where on 10 for the second time.
Second foreach ending
It depends on how the Linq query is being used.
var q = {some linq query here}
while (true)
{
    foreach (var item in q)
    {
        ...
    }
}
The code above will execute the Linq query multiple times. Not because of the foreach, but because the foreach is inside another loop, so the foreach itself is being executed multiple times.
If all consumers of a linq query use it "carefully" and avoid dumb mistakes such as the nested loops above, then a linq query should not be executed multiple times needlessly.
There are occasions when reducing a linq query to an in-memory result set using ToList() are warranted, but in my opinion ToList() is used far, far too often. ToList() almost always becomes a poison pill whenever large data is involved, because it forces the entire result set (potentially millions of rows) to be pulled into memory and cached, even if the outermost consumer/enumerator only needs 10 rows. Avoid ToList() unless you have a very specific justification and you know your data will never be large.
Sometimes it might be a good idea to "cache" a LINQ query using ToList() or ToArray(), if the query is being accessed multiple times in your code.
But keep in mind that "caching" it still calls a foreach in turn.
So the basic rule for me is:
if a query is simply used in one foreach (and that's it), then I don't cache the query;
if a query is used in a foreach and in some other places in the code, then I cache it in a variable using ToList/ToArray.
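That rule of thumb can be demonstrated with a counter inside the Where delegate: the uncached query evaluates once per foreach, while the ToList() copy evaluates only once (a self-contained sketch):

```csharp
using System;
using System.Linq;

static class Program
{
    static void Main()
    {
        int evaluations = 0;
        var query = Enumerable.Range(1, 5)
                              .Where(x => { evaluations++; return x % 2 == 0; });

        // Enumerated twice without caching: the Where delegate runs
        // once per element, per enumeration (5 + 5 = 10 calls).
        foreach (var _ in query) { }
        foreach (var _ in query) { }
        Console.WriteLine(evaluations); // 10

        evaluations = 0;
        var cached = query.ToList(); // evaluates the query exactly once
        foreach (var _ in cached) { }
        foreach (var _ in cached) { }
        Console.WriteLine(evaluations); // 5
    }
}
```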
foreach, by itself, only runs through its data once. In fact, it specifically runs through it once: you can't look ahead or back, or alter the index the way you can with a for loop.
However, if you have multiple foreach loops in your code, all operating on the same LINQ query, the query may be executed multiple times. This is entirely dependent on the data source, though. If you're iterating over a LINQ-based IEnumerable/IQueryable that represents a database query, it will run that query each time. If you're iterating over a List or other collection of objects, it will run through the list each time, but won't hit your database repeatedly.
In other words, this is a property of LINQ, not a property of foreach.
The difference is in the underlying type. As LINQ is built on top of IEnumerable (or IQueryable) the same LINQ operator may have completely different performance characteristics.
A List will always be quick to respond, but it takes an upfront effort to build a list.
An iterator is also IEnumerable and may employ any algorithm every time it fetches the "next" item. This will be faster if you don't actually need to go through the complete set of items.
You can turn any IEnumerable into a list by calling ToList() on it and storing the resulting list in a local variable. This is advisable if:
You don't depend on deferred execution.
You access more items in total than the set contains (i.e. you enumerate it more than once).
You can pay the upfront cost of retrieving and storing all items.
Using LINQ, even without entities, deferred execution is in effect.
Only by forcing an iteration is the actual LINQ expression evaluated.
In that sense, each time you use the LINQ expression it is evaluated again.
Now with entities this is still the same, but there is more functionality at work here.
When Entity Framework sees the expression for the first time, it checks whether it has executed this query already. If not, it goes to the database, fetches the data, sets up its internal memory model, and returns the data to you. If Entity Framework sees that it already fetched the data beforehand, it does not go to the database and instead uses the memory model it set up earlier to return data to you.
This can make your life easier, but it can also be a pain. For instance, suppose you request all records from a table using a LINQ expression: Entity Framework will load all data from the table. If you later evaluate the same LINQ expression, you will get the same result even if records were deleted or added in the meantime.
Entity Framework is a complicated thing. There are of course ways to make it re-execute the query, taking into account the changes in its own memory model and the like.
I suggest reading "Programming Entity Framework" by Julia Lerman. It addresses many issues like the one you are having right now.
It will execute the LINQ statement the same number of times whether or not you call .ToList(). I have an example here with colored output to the console:
What happens in the code (see code at the bottom):
Create a list of 100 ints (0-99).
Create a LINQ statement that prints every int from the list followed by two * to the console in red color, and then return the int if it's an even number.
Do a foreach on the query, printing out every even number in green color.
Do a foreach on the query.ToList(), printing out every even number in green color.
As you can see in the output below, the number of ints written to the console is the same, meaning the LINQ statement is executed the same number of times.
The difference is in when the statement is executed. As you can see, when you do a foreach on the query (that you have not invoked .ToList() on), the list and the IEnumerable object, returned from the LINQ statement, are enumerated at the same time.
When you cache the list first, they are enumerated separately, but still the same amount of times.
The difference is very important to understand, because if the list is modified after you have defined your LINQ statement, the LINQ statement will operate on the modified list when it is executed (e.g. by .ToList()). BUT if you force execution of the LINQ statement (.ToList()) and then modify the list afterwards, the LINQ statement will NOT work on the modified list.
Here's the output (screenshot omitted):
Here's my code:
// Main method:
static void Main(string[] args)
{
    IEnumerable<int> ints = Enumerable.Range(0, 100);
    var query = ints.Where(x =>
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.Write($"{x}**, ");
        return x % 2 == 0;
    });
    DoForeach(query, "query");
    DoForeach(query, "query.ToList()");
    Console.ForegroundColor = ConsoleColor.White;
}

// DoForeach method:
private static void DoForeach(IEnumerable<int> collection, string collectionName)
{
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine("\n--- {0} FOREACH BEGIN: ---", collectionName);
    if (collectionName.Contains("query.ToList()"))
        collection = collection.ToList();
    foreach (var item in collection)
    {
        Console.ForegroundColor = ConsoleColor.Green;
        Console.Write($"{item}, ");
    }
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine("\n--- {0} FOREACH END ---", collectionName);
}
Note about execution time: I did a few timing tests (not enough to post it here though) and I didn't find any consistency in either method being faster than the other (including the execution of .ToList() in the timing). On larger collections, caching the collection first and then iterating it seemed a bit faster, but there was no definitive conclusion from my test.

Large LINQ Grouping query, what's happening behind the scenes

Take the following LINQ query as an example. Please don't comment on the code itself as I've just typed it to help with this question.
The following LINQ query uses a 'group by' and calculates summary information. As you can see, there are numerous calculations being performed on the data, but how efficient is LINQ behind the scenes?
var NinjasGrouped = (from ninja in Ninjas
                     group ninja by new { ninja.NinjaClan, ninja.NinjaRank }
                     into con
                     select new NinjaGroupSummary
                     {
                         NinjaClan = con.Key.NinjaClan,
                         NinjaRank = con.Key.NinjaRank,
                         NumberOfShoes = con.Sum(x => x.Shoes),
                         MaxNinjaAge = con.Max(x => x.NinjaAge),
                         MinNinjaAge = con.Min(x => x.NinjaAge),
                         ComplicatedCalculation = con.Sum(x => x.NinjaGrade) != 0
                             ? con.Sum(x => x.NinjaRedBloodCellCount) / con.Sum(x => x.NinjaDoctorVisits)
                             : 0,
                         ListOfNinjas = con.ToList()
                     }).ToList();
How many times is the list of 'Ninjas' being iterated over in order to calculate each of the values?
Would it be faster to employ a foreach loop to speed up the execution of such a query?
Would adding '.AsParallel()' after Ninjas result in any performance improvements?
Is there a better way of calculating summary information for a List<T>?
Any advice is appreciated, as we use this type of code throughout our software and I would really like to gain a better understanding of what LINQ is doing under the hood (so to speak). Perhaps there is a better way?
Assuming this is a LINQ to Objects query:
Ninjas is only iterated over once; the groups are built up into internal concrete lists, which you're then iterating over multiple times (once per aggregation).
Using a foreach loop almost certainly wouldn't speed things up - you might benefit from cache coherency a bit more (as each time you iterate over a group it'll probably have to fetch data from a higher level cache or main memory) but I very much doubt that it would be significant. The increase in pain in implementing it probably would be significant though :)
Using AsParallel might speed things up - it looks pretty easily parallelizable. Worth a try...
There's not a much better way for LINQ to Objects, to be honest. It would be nice to be able to perform the aggregation as you're grouping, and Reactive Extensions would allow you to do something like that, but for the moment this is probably the simplest approach.
You might want to have a look at the GroupBy post in my Edulinq blog series for more details on a possible implementation.
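If the repeated per-group enumerations ever did show up in a profile, the aggregates could also be computed in one pass with a plain loop and a dictionary. A sketch with a trimmed-down Ninja type (the field names are assumptions; it trades readability for a single iteration):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down versions of the question's types.
class Ninja { public string NinjaClan; public int Shoes; public int NinjaAge; }
class Summary { public int NumberOfShoes; public int MaxNinjaAge = int.MinValue; }

static class Program
{
    static void Main()
    {
        var ninjas = new List<Ninja>
        {
            new Ninja { NinjaClan = "Iga",  Shoes = 2, NinjaAge = 20 },
            new Ninja { NinjaClan = "Iga",  Shoes = 1, NinjaAge = 35 },
            new Ninja { NinjaClan = "Koga", Shoes = 4, NinjaAge = 28 }
        };

        // Single pass over the source; every aggregate is updated as we go,
        // instead of re-enumerating each group once per aggregation.
        var summaries = new Dictionary<string, Summary>();
        foreach (var n in ninjas)
        {
            if (!summaries.TryGetValue(n.NinjaClan, out var s))
                summaries[n.NinjaClan] = s = new Summary();
            s.NumberOfShoes += n.Shoes;
            s.MaxNinjaAge = Math.Max(s.MaxNinjaAge, n.NinjaAge);
        }

        Console.WriteLine($"{summaries["Iga"].NumberOfShoes}, {summaries["Iga"].MaxNinjaAge}");
    }
}
```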
