I am trying to see whether entities exist in a local context and sort them accordingly. This function seems to be faster than others we have tried, but it still runs in about 50 seconds for 1000 items, and I am wondering if there is something I can do to improve its efficiency. I believe the Find here is slowing it down significantly: a simple foreach iteration over 1000 items takes milliseconds, and benchmarking shows the bottleneck is there. Any ideas would be helpful. Thank you.
Sample code:
foreach (var entity in entities)
{
    var localItem = db.Set<T>().Find(Key);
    if (localItem != null)
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
If this is a database (which, from the comments, I gather it is...), you would be better off doing fewer queries.
list1.AddRange(db.Set<T>().Where(x => x.Key == Key));
list2.AddRange(db.Set<T>().Where(x => x.Key != Key));
This would be 2 queries instead of 1000+.
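If the goal is to partition the in-memory entities by whether their key already exists in the database, you can also fetch all matching keys in a single round trip. A minimal sketch, assuming each entity exposes an int Key property (the property name and type are illustrative, not from the original post):

// Pull every key that exists in the database in one query,
// then partition the in-memory entities against that set.
var keys = entities.Select(e => e.Key).ToList();
var existingKeys = new HashSet<int>(
    db.Set<T>().Where(x => keys.Contains(x.Key)).Select(x => x.Key));

var list1 = entities.Where(e => existingKeys.Contains(e.Key)).ToList();  // found in DB
var list2 = entities.Where(e => !existingKeys.Contains(e.Key)).ToList(); // not found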
Also be aware that by adding each one to a List<T>, you're keeping two large collections in memory. So if 1000+ turns into 10,000,000, you're going to have interesting memory issues.
See this post on my blog for more information: http://www.artisansoftware.blogspot.com/2014/01/synopsis-creating-large-collection-by.html
If I understand correctly, the database seems to be the bottleneck? If you want to (effectively) select data from a database relation whose attribute x should match an ==-criterion, you should consider creating a secondary access path for that attribute (an index structure). Depending on your database system and the distribution of values in your table, this might be a hash index (especially good for == checks) or a B+-tree (an all-rounder) or whatever your system offers you.
However, this only helps if...
you don't just fetch the full data set once and live with that in your application;
adding (another) index to the relation is not out of the question (e.g. it may not be worth having one for a single need);
an index would actually be effective - e.g. it won't help if the attribute you are querying on has very few distinct values.
I found your answers very helpful, but here is ultimately how I solved the problem. It seemed .Find was the bottleneck.
var tableDictionary = db.Set<T>().ToDictionary(x => x.KeyValue, x => x);

foreach (var entity in entities)
{
    if (tableDictionary.ContainsKey(entity.KeyValue))
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
This ran with 900+ rows in about a tenth of a second, which for our purposes was efficient enough.
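Since the dictionary's values are never used, a HashSet of just the keys gives the same O(1) lookup with less memory. A sketch under the same assumptions (KeyValue stands in for whatever the real key property is; an int key is assumed):

// Only the keys matter for the existence check, so skip storing the entities.
var keySet = new HashSet<int>(db.Set<T>().Select(x => x.KeyValue));

foreach (var entity in entities)
{
    if (keySet.Contains(entity.KeyValue))
        list1.Add(entity);
    else
        list2.Add(entity);
}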
Rather than querying the DB for each item, you can just do one query, get all of the data (since you want all of the data from the DB eventually) and you can then group it in memory, which can be done (in this case) about as efficiently as in the database. By creating a lookup of whether or not the key is equal, we can easily get the two groups:
var lookup = db.Set<T>().ToLookup(item => item.Key == Key);
var list1 = lookup[true].ToList();
var list2 = lookup[false].ToList();
(You can use AddRange instead if the lists have previous values that should also be in them.)
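For instance, if list1 and list2 were declared earlier and may already hold values, the last two lines become:

// Append to the existing lists instead of replacing them.
list1.AddRange(lookup[true]);
list2.AddRange(lookup[false]);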
Related
I have 2 lists with different numbers of items, and both have 1 parameter in common that I have to compare. If the value of the parameter is the same, I have to update the DB; but if an item in the first list doesn't have a matching item in the second list, I have to insert it into the DB.
This is what I was trying:
foreach (var rep in prodrep)
{
    foreach (var crm in prodcrm)
    {
        if (rep.VEHI_SERIE.Equals(crm.VEHI_SERIE))
        {
            updateRecord(rep.Data);
        }
        else
        {
            insertRecords(rep.Data);
        }
    }
}
The first problem with this is that it is very slow. The second problem is that obviously the insert statement wouldn't work, but I don't want to do another foreach inside a foreach to verify that it doesn't exist, because that would take double the time.
How can I make this more efficient?
This is not as efficient, but it should work:

var existing = prodrep
    .Where(rep => prodcrm.Any(crm => rep.VEHI_SERIE.Equals(crm.VEHI_SERIE)))
    .Select(rep => new { Rep = rep, Crm = prodcrm.FirstOrDefault(crm => rep.VEHI_SERIE.Equals(crm.VEHI_SERIE)) })
    .ToList();

existing.ForEach(mix => updateRecord(mix.Rep.Data, mix.Crm.Id));

prodrep.Where(rep => !existing.Any(mix => mix.Rep == rep))
       .ToList()
       .ForEach(rep => insertRecords(rep.Data));
var comparators = prodcrm.Select(i => i.VEHI_SERIE).ToList();

foreach (var rep in prodrep)
{
    if (comparators.Contains(rep.VEHI_SERIE))
    {
        // do something
    }
    else
    {
        // do something else
    }
}
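One caveat (my addition, not part of the answer above): List<T>.Contains is a linear scan, so the loop above is O(n*m) overall. Swapping the list for a HashSet<string> makes each lookup O(1). A sketch using the question's own updateRecord/insertRecords calls:

// HashSet.Contains is O(1) versus O(n) for List.Contains.
var comparatorSet = new HashSet<string>(prodcrm.Select(i => i.VEHI_SERIE));

foreach (var rep in prodrep)
{
    if (comparatorSet.Contains(rep.VEHI_SERIE))
        updateRecord(rep.Data);
    else
        insertRecords(rep.Data);
}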
See Algorithm to optimize nested loops - it's an interesting read and a cool trick, but not necessarily something you should apply in every situation.
Also, be careful about answers providing you with LINQ queries. They often "look cool" because you're not using the word "for", but really they're just hiding those for loops under the hood.
If you're really concerned about performance and the computer can handle it, you can look at the Task Parallel Library. It's not necessarily going to solve all of your problems, because you can be limited by processor/memory, and you could end up making your application slower.
Is this something that a user of your application is going to be doing regularly? If so, can you make it an asynchronous task that they can come back to later, or is it an offline process that they aren't ever going to see? Depending on usage expectations, sometimes the time something takes isn't the end of the world.
I have inherited a poorly designed database table (no primary key or indexes, oversized nvarchar fields, dates stored as nvarchar, etc.). This table has roughly 350,000 records. I get handed a list of around 2,000 potentially new records at predefined intervals, and I have to insert any of the potentially new records if the database does not already have a matching record.
I initially tried making comparisons in a foreach loop, but it quickly became obvious that there was probably a much more efficient way. After doing some research, I then tried the .Any(), .Contains(), and .Except() methods.
My research leads me to believe that the .Except() method would be the most efficient, but I get out-of-memory errors when trying it. The .Any() and .Contains() methods seem to both take roughly the same time to complete (which is faster than the foreach loop).
The structures of the two lists are identical, and each contains multiple strings. I have a few questions that I have not found satisfying answers to, if you don't mind.
When comparing two lists of objects (made up of several strings), is the .Except() method considered to be the most efficient?
Is there a way to use projection when using the .Except() method? What I would like to accomplish is something like:
List<Data> storedData = db.Data.ToList();
List<Data> incomingData = someDataPreviouslyParsed;

// No projection; this runs out of memory:
var newData = incomingData.Except(storedData).ToList();

// Pseudocode for what I would like to figure out is possible:
// first use projection on db so as to not pull a bunch of irrelevant data
var projectedStoredData = db.Data.Select(x => new { x.field1, x.field2, x.field3 }).ToList();
var newData2 = incomingData.Select(x => new { x.field1, x.field2, x.field3 })
                           .Except(projectedStoredData).ToList();
Using a raw SQL statement in SQL Server Management Studio, the query takes slightly longer than 10 seconds. Using EF, it seems to take in excess of a minute. Is that poorly optimized SQL generated by EF, or is it EF overhead that makes such a difference?
Would raw SQL in EF be a better practice in a situation like this?
Semi-Off-Topic:
When grabbing the data from the database and storing it in the variable storedData, does that eliminate the usefulness of any indexes (should there be any) stored in the table?
I hate to ask so many questions, and I'm sure that many (if not all) of them are quite noobish. However, I have nowhere else to turn, and I have been looking for clear answers all day. Any help is very much appreciated.
UPDATE
After further research, I have found what seems to be a very good solution to this problem. Using EF, I grab the 350,000 records from the database, keeping only the columns I need to create a unique record. I then take that data and convert it to a dictionary, grouping on the kept columns as the key (as can be seen here). This solves the problem of duplicates already existing in the returned data, and gives me something fast to compare my newly parsed data against. The performance increase was very noticeable!
I'm still not sure if this approaches best practice, but I can certainly live with its performance. I have also seen some references to ToLookup() that I may try, to see if there is a performance gain there as well. Nevertheless, here is some code to show what I did:
var storedDataDictionary = storedData
    .GroupBy(k => k.Field1 + k.Field2 + k.Field3 + k.Field4)
    .ToDictionary(g => g.Key, g => g.First());

foreach (var item in parsedData)
{
    if (storedDataDictionary.ContainsKey(item.Field1 + item.Field2 + item.Field3 + item.Field4))
    {
        // duplicateData is a previously defined list
        duplicateData.Add(item);
    }
    else
    {
        // newData is a previously defined list
        newData.Add(item);
    }
}
No reason to use EF for that.
Grab only the columns that are required for you to decide whether to update or insert a record (i.e. those that represent the missing "primary key"). Don't waste memory on the other columns.
Build a HashSet of the existing primary keys (if the primary key is a number, a HashSet<int>; if it consists of multiple fields, combine them into a string).
Check your 2,000 items against the HashSet - that is very fast (see the sketch below).
Update or insert the items with raw SQL.
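A minimal sketch of those steps; Field1..Field4 are placeholders for whatever columns make up the missing "primary key", and the separator guards against ambiguous concatenations:

// Step 2: build a HashSet of composite keys from the existing rows
// (storedData here would be loaded with only the key columns, e.g. via raw SQL).
var existingKeys = new HashSet<string>(
    storedData.Select(x => $"{x.Field1}|{x.Field2}|{x.Field3}|{x.Field4}"));

// Step 3: check the ~2,000 incoming items against the set.
foreach (var item in incomingData)
{
    var key = $"{item.Field1}|{item.Field2}|{item.Field3}|{item.Field4}";
    if (!existingKeys.Contains(key))
        newData.Add(item); // Step 4: insert these with raw SQL afterwards
}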
I suggest you consider doing it in SQL, not C#. You don't say which RDBMS you are using, but you could look at the MERGE statement, e.g. (for SQL Server 2008):
https://technet.microsoft.com/en-us/library/bb522522%28v=sql.105%29.aspx
Broadly, the statement checks whether a record is 'new': if so, you can INSERT it; if not, there are UPDATE and DELETE capabilities, or you can just ignore it.
I have an AsyncObservable collection of some class, say "dashboard". Each item in the dashboard collection contains a collection of some other class, say "chart". Each chart has various properties such as name, type, etc. I want to search this collection by chart name, type, and so on. Can anybody suggest a searching technique? Currently I search by traversing the whole collection with a foreach and comparing the entered input with each item in the collection (this is not efficient when the amount of data is large). I want to make it more efficient. I am using C#.
My code is:
foreach (DashBoard item in this.DashBoards)
{
    Chart obj1 = item.CurrentCharts.ToList().Find(chart => chart.ChartName.ToUpper().Contains(searchText.ToUpper()));
    if (obj1 != null)
    {
        if (obj1.IsHighlighted != Colors.Wheat)
            obj1.IsHighlighted = Colors.Wheat;
        item.IsExpanded = true;
        flagList.Add(1);
    }
    else
    {
        flagList.Add(0);
    }
}
You can use a LINQ query.
For example, you can do something like this. If you post your code, we can solve the problem:
Dashboard.SelectMany(q => q.Chart).Where(a => a.Name == "SomeName")
Here is a related LINQ question for reference: querying nested collections
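Adapted to the model in the question (DashBoards, CurrentCharts, ChartName, and searchText come from the posted code; treat this as a sketch rather than a drop-in replacement):

// Flatten all dashboards into one sequence of charts and filter by name.
var matches = this.DashBoards
    .SelectMany(d => d.CurrentCharts)
    .Where(c => c.ChartName.ToUpper().Contains(searchText.ToUpper()));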
Edit: foreach loops or LINQ?
The answer is not really clear-cut. There are two sides to any code-cost argument: performance and maintainability. The first of these is obvious and quantifiable.
Under the hood, LINQ will iterate over the collection, just as foreach will. The difference between LINQ and foreach is that LINQ will defer execution until the iteration begins.
Performance-wise, take a look at this blog post: http://www.schnieds.com/2009/03/linq-vs-foreach-vs-for-loop-performance.html
In your case:
If the collection is relatively small or medium-sized, I would suggest using foreach for better performance.
At the end of the day:
LINQ is more elegant but less efficient most of the time; foreach clutters the code a bit but performs better.
On large collections, or where parallel computing makes sense, I would choose LINQ, as the performance gap shrinks to a minimum.
I've got some kind of layers in my application. These layers follow a structure like: a company has settlements, settlements have sections, sections have machines, machines produce items, and to produce items the machines need tools, ...
At the very end of this hierarchy there are entries recording how many items could be produced with a specific part of a tool (called cuttingtool). Based on that, a statistic can be calculated. On each layer, the statistic results of the next layer up are added together.
Take a look at this diagram:
On each layer, a statistic is displayed. For example: the user navigates to the second layer (Items). There are 10 items. The user sees a pie chart which displays the costs of each item. These costs are calculated by adding up all the costs of the item's tools (the next layer up). The costs of the tools are calculated by adding up all the costs of the "parts of the tools"...
I know that is a bit complicated, so if there are any questions, just ask me for a more detailed explanation.
Now my problem: to calculate the cost of an item (the same statistic is provided for machines, tools, ... i.e. for each layer in the diagram), I need to get all Lifetimes of the item. So I am using a recursive call to skip all layers between the Item and the Lifetime.
That works quite well, BUT I am using far too many SelectMany LINQ commands. As a result, the performance is extremely bad.
I've thought about joins or procedures (stored in the database) to speed that up, but I am by far not experienced with such database techniques. So I want to ask you: what would you do?
Currently I am using something like that:
public IEnumerable<IHierachyEntity> GetLifetimes(IEnumerable<IHierachyEntity> entities)
{
    if (entities is IEnumerable<Lifetime>)
    {
        return entities;
    }
    else
    {
        return GetLifetimes(entities.SelectMany(x => x.Childs));
    }
}
Since this probably is a pretty fixed hierarchy in the heart of your application I wouldn't mind writing a dedicated piece of code for it. Moreover, writing an efficient generic routine for hierarchical queries is impossible with LINQ to a database backend. The n+1 problem just can't be avoided.
So just do something like this:
public IQueryable<Lifetime> GetLifetimes<T>(IQueryable<T> entities)
{
    var machines = entities as IQueryable<Machine>;
    if (machines != null)
        return machines.SelectMany(m => m.Items)
                       .SelectMany(i => i.Tools)
                       .SelectMany(i => i.Parts)
                       .SelectMany(i => i.Lifetimes);

    var items = entities as IQueryable<Item>;
    if (items != null)
        return items.SelectMany(i => i.Tools)
                    .SelectMany(i => i.Parts)
                    .SelectMany(i => i.Lifetimes);

    var tools = entities as IQueryable<Tool>;
    if (tools != null)
        return tools.SelectMany(i => i.Parts)
                    .SelectMany(i => i.Lifetimes);

    var parts = entities as IQueryable<Part>;
    if (parts != null)
        return parts.SelectMany(i => i.Lifetimes);

    return Enumerable.Empty<Lifetime>().AsQueryable();
}
Repetitive code, yes, but it is crystal clear what happens, and it's probably among the most stable parts of the code. Repetitive code is only a potential problem when continuous maintenance is to be expected.
As far as I understand, you are trying to pull a very long history of your actions. You need to create a routine that updates your statistics as changes happen. There is no "ultimate" solution here; you have to figure it out for your situation. E.g. say I have "in" and "out" stock transactions, and to find the current stock level for all items I would have to go through 20 years of history. To get around that, I can keep monthly summaries and only calculate the changes since the start of the month. Or I can use a database trigger to update my summaries as soon as changes happen (which could be costly performance-wise). Or I can have a service that updates them from time to time (possibly not 100% up to date). In other words, you need a table/class that keeps your aggregated results ready to use.
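As a hedged illustration of the monthly-summary idea (MonthlySummaries, StockTransactions, and their properties are hypothetical names, not from the question):

// Current stock = last saved monthly summary + transactions since that month,
// instead of replaying the full 20-year history.
var summary = db.MonthlySummaries
    .Where(s => s.ItemId == itemId)
    .OrderByDescending(s => s.Month)
    .FirstOrDefault();

var since = summary?.Month ?? DateTime.MinValue;

var delta = db.StockTransactions
    .Where(t => t.ItemId == itemId && t.Date >= since)
    .Sum(t => (int?)t.Quantity) ?? 0; // "in" as positive, "out" as negative

var currentStock = (summary?.Quantity ?? 0) + delta;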
I've been working for the first time with the Entity Framework in .NET, and have been writing LINQ queries in order to get information from my model. I would like to program good habits in from the beginning, so I've been doing research on the best way to write these queries and get their results. Unfortunately, in browsing Stack Exchange, I seem to have come across two conflicting explanations of how deferred/immediate execution works with LINQ:
A foreach causes the query to be executed in each iteration of the loop:
Demonstrated in the question Slow foreach() on a LINQ query - ToList() boosts performance immensely - why is this?, the implication is that ToList() needs to be called in order to evaluate the query immediately, as the foreach is evaluating the query against the data source repeatedly, slowing down the operation considerably.
Another example is the question Foreaching through grouped linq results is incredibly slow, any tips? , where the accepted answer also implies that calling "ToList()" on the query will improve performance.
A foreach causes a query to be executed once, and is safe to use with LINQ
Demonstrated in the question Does foreach execute the query only once?, the implication is that the foreach causes one enumeration to be established, and will not query the data source each time.
Continued browsing of the site has turned up many questions where "repeated execution during a foreach loop" is the culprit of the performance concern, and plenty of other answers stating that a foreach will appropriately grab a single query from a datasource, which means that both explanations seem to have validity. If the "ToList()" hypothesis is incorrect (as most of the current answers as of 2013-06-05 1:51 PM EST seem to imply), where does this misconception come from? Is there one of these explanations that is accurate and one that isn't, or are there different circumstances that could cause a LINQ query to evaluate differently?
Edit: In addition to the accepted answer below, I've turned up the following question over on Programmers that very much helped my understanding of query execution, particularly the pitfalls that could result in multiple data source hits during a loop, which I think will be helpful for others interested in this question: https://softwareengineering.stackexchange.com/questions/178218/for-vs-foreach-vs-linq
In general, LINQ uses deferred execution. If you use methods like First() and FirstOrDefault(), the query is executed immediately. When you do something like:
foreach(string s in MyObjects.Select(x => x.AStringProp))
The results are retrieved in a streaming manner, meaning one by one. Each time the iterator calls MoveNext the projection is applied to the next object. If you were to have a Where it would first apply the filter, then the projection.
If you do something like:
List<string> names = People.Select(x => x.Name).ToList();
foreach (string name in names)
Then I believe this is a wasteful operation. ToList() will force the query to be executed, enumerating the People list and applying the x => x.Name projection. Afterwards you will enumerate the list again. So unless you have a good reason to have the data in a list (rather than an IEnumerable), you're just wasting CPU cycles.
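By contrast, the streaming version applies the projection as the foreach consumes it, with no extra pass and no intermediate list:

// No intermediate list: each Name is projected on demand as MoveNext is called.
foreach (string name in People.Select(x => x.Name))
{
    Console.WriteLine(name);
}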
Generally speaking using a LINQ query on the collection you're enumerating with a foreach will not have worse performance than any other similar and practical options.
Also it's worth noting that people implementing LINQ providers are encouraged to make the common methods work as they do in the Microsoft provided providers but they're not required to. If I were to go write a LINQ to HTML or LINQ to My Proprietary Data Format provider there would be no guarantee that it behaves in this manner. Perhaps the nature of the data would make immediate execution the only practical option.
Also, as a final edit: if you're interested in this, Jon Skeet's C# in Depth is very informative and a great read. My answer summarizes a few pages of the book (hopefully with reasonable accuracy), but if you want more details on how LINQ works under the covers, it's a good place to look.
Try this in LINQPad:
void Main()
{
    var testList = Enumerable.Range(1, 10);
    var query = testList.Where(x =>
    {
        Console.WriteLine(string.Format("Doing where on {0}", x));
        return x % 2 == 0;
    });

    Console.WriteLine("First foreach starting");
    foreach (var i in query)
    {
        Console.WriteLine(string.Format("Foreached where on {0}", i));
    }
    Console.WriteLine("First foreach ending");

    Console.WriteLine("Second foreach starting");
    foreach (var i in query)
    {
        Console.WriteLine(string.Format("Foreached where on {0} for the second time.", i));
    }
    Console.WriteLine("Second foreach ending");
}
Each time the where delegate runs, we see a console output, so we can watch the LINQ query being executed. Looking at the console output, the second foreach loop still causes "Doing where on" to print, showing that the second use of foreach does in fact cause the where clause to run again... potentially causing a slowdown.
First foreach starting
Doing where on 1
Doing where on 2
Foreached where on 2
Doing where on 3
Doing where on 4
Foreached where on 4
Doing where on 5
Doing where on 6
Foreached where on 6
Doing where on 7
Doing where on 8
Foreached where on 8
Doing where on 9
Doing where on 10
Foreached where on 10
First foreach ending
Second foreach starting
Doing where on 1
Doing where on 2
Foreached where on 2 for the second time.
Doing where on 3
Doing where on 4
Foreached where on 4 for the second time.
Doing where on 5
Doing where on 6
Foreached where on 6 for the second time.
Doing where on 7
Doing where on 8
Foreached where on 8 for the second time.
Doing where on 9
Doing where on 10
Foreached where on 10 for the second time.
Second foreach ending
It depends on how the Linq query is being used.
var q = {some linq query here};

while (true)
{
    foreach (var item in q)
    {
        ...
    }
}
The code above will execute the Linq query multiple times. Not because of the foreach, but because the foreach is inside another loop, so the foreach itself is being executed multiple times.
If all consumers of a linq query use it "carefully" and avoid dumb mistakes such as the nested loops above, then a linq query should not be executed multiple times needlessly.
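For the nested-loop case above, materializing once outside the outer loop keeps the query from re-executing, provided the result set is small enough to cache (see the caveat below). A minimal sketch:

// The query runs exactly once, here; the loops then reuse the in-memory list.
var items = q.ToList();

while (true)
{
    foreach (var item in items)
    {
        // ... work with item ...
    }
}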
There are occasions when reducing a LINQ query to an in-memory result set using ToList() is warranted, but in my opinion ToList() is used far, far too often. ToList() almost always becomes a poison pill whenever large data is involved, because it forces the entire result set (potentially millions of rows) to be pulled into memory and cached, even if the outermost consumer/enumerator only needs 10 rows. Avoid ToList() unless you have a very specific justification and you know your data will never be large.
Sometimes it might be a good idea to "cache" a LINQ query using ToList() or ToArray(), if the query is being accessed multiple times in your code.
But keep in mind that "caching" it still enumerates the query once itself: ToList/ToArray iterate the sequence in order to build the collection.
So the basic rule for me is:
if a query is simply used in one foreach (and that's it) - then I don't cache the query
if a query is used in a foreach and in some other places in the code - then I cache it in a var using ToList/ToArray
foreach, by itself, only runs through its data once. In fact, it specifically runs through it once. You can't look ahead or back, or alter the index the way you can with a for loop.
However, if you have multiple foreach loops in your code, all operating on the same LINQ query, you may get the query executed multiple times. This is entirely dependent on the data, though. If you're iterating over a LINQ-based IEnumerable/IQueryable that represents a database query, it will run that query each time. If you're iterating over a List or other collection of objects, it will run through the list each time, but won't hit your database repeatedly.
In other words, this is a property of LINQ, not a property of foreach.
The difference is in the underlying type. As LINQ is built on top of IEnumerable (or IQueryable) the same LINQ operator may have completely different performance characteristics.
A List will always be quick to respond, but it takes an upfront effort to build a list.
An iterator is also IEnumerable and may employ any algorithm every time it fetches the "next" item. This will be faster if you don't actually need to go through the complete set of items.
You can turn any IEnumerable into a list by calling ToList() on it and storing the resulting list in a local variable. This is advisable if
You don't depend on deferred execution.
You access more items in total than the set contains (i.e. you enumerate it more than once).
You can pay the upfront cost of retrieving and storing all items.
Using LINQ, even without entities, deferred execution is in effect.
It is only by forcing an iteration that the actual LINQ expression is evaluated.
In that sense, each time you use the LINQ expression, it is going to be evaluated.
Now with entities this is still the same, but there is just more functionality at work here.
When the Entity Framework sees the expression for the first time, it checks whether it has executed this query already. If not, it will go to the database, fetch the data, set up its internal memory model, and return the data to you. If the Entity Framework sees that it has already fetched the data beforehand, it is not going to go to the database; it will use the memory model it set up earlier to return data to you.
This can make your life easier, but it can also be a pain. For instance, if you request all records from a table using a LINQ expression, the Entity Framework will load all data from the table. If you later evaluate the same LINQ expression, you will get the same result, even if records were deleted or added in the meantime.
The Entity Framework is a complicated thing. There are of course ways to make it re-execute the query, taking into account the changes in its own memory model and the like.
I suggest reading "Programming Entity Framework" by Julia Lerman. It addresses lots of issues like the one you are having right now.
It will execute the LINQ statement the same number of times no matter if you do .ToList() or not. I have an example here with colored output to the console:
What happens in the code (see code at the bottom):
Create a list of 100 ints (0-99).
Create a LINQ statement that prints every int from the list, followed by two *, to the console in red, and keeps the int if it's an even number.
Do a foreach on the query, printing out every even number in green color.
Do a foreach on the query.ToList(), printing out every even number in green color.
As you can see in the output below, the number of ints written to the console is the same, meaning the LINQ statement is executed the same number of times.
The difference is in when the statement is executed. As you can see, when you do a foreach on the query (that you have not invoked .ToList() on), the list and the IEnumerable object, returned from the LINQ statement, are enumerated at the same time.
When you cache the list first, they are enumerated separately, but still the same amount of times.
The difference is very important to understand, because if the list is modified after you have defined your LINQ statement, the LINQ statement will operate on the modified list when it is executed (e.g. by .ToList()). BUT if you force execution of the LINQ statement (.ToList()) and then modify the list afterwards, the LINQ statement will NOT work on the modified list.
Here's the output:
Here's my code:
// Main method:
static void Main(string[] args)
{
    IEnumerable<int> ints = Enumerable.Range(0, 100);

    var query = ints.Where(x =>
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.Write($"{x}**, ");
        return x % 2 == 0;
    });

    DoForeach(query, "query");
    DoForeach(query, "query.ToList()");

    Console.ForegroundColor = ConsoleColor.White;
}

// DoForeach method:
private static void DoForeach(IEnumerable<int> collection, string collectionName)
{
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine("\n--- {0} FOREACH BEGIN: ---", collectionName);

    if (collectionName.Contains("query.ToList()"))
        collection = collection.ToList();

    foreach (var item in collection)
    {
        Console.ForegroundColor = ConsoleColor.Green;
        Console.Write($"{item}, ");
    }

    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine("\n--- {0} FOREACH END ---", collectionName);
}
Note about execution time: I did a few timing tests (not enough to post here, though) and I didn't find any consistency in either method being faster than the other (including the execution of .ToList() in the timing). On larger collections, caching the collection first and then iterating it seemed a bit faster, but there was no definitive conclusion from my test.