After reading a bit about how yield, foreach, LINQ deferred execution and iterators work in C#, I decided to try optimizing an attribute-based validation mechanism inside a small project. The result:
private IEnumerable<string> GetPropertyErrors(PropertyInfo property)
{
    // where Entity is the current object instance
    string propertyValue = property.GetValue(Entity)?.ToString();
    foreach (var attribute in property.GetCustomAttributes().OfType<ValidationAttribute>())
    {
        if (!attribute.IsValid(propertyValue))
        {
            yield return $"Error: {property.Name} {attribute.ErrorMessage}";
        }
    }
}
// inside another method
foreach (string error in GetPropertyErrors(property))
{
    // Some display/insert log operation
}
I find this slow, but that could also be due to reflection or the large number of properties to process.
So my question is: is this an optimal or good use of the lazy-loading mechanic, or am I missing something and just wasting tons of resources?
NOTE: The intention of the code itself is not important; my concern is the use of lazy loading in it.
Lazy loading is not something specific to C# or to Entity Framework. It's a common pattern that lets you defer loading some data. Deferring means not loading it immediately. Some examples of when you need that:
Loading images in a (Word) document. A document may be big and can contain thousands of images. If you load them all when the document is opened, it might take a long time. Nobody wants to sit and watch a loading screen for 30 seconds. The same approach is used in web browsers - resources are not sent with the body of the page; the browser defers loading them.
Loading graphs of objects. These may be objects from a database, file system objects, etc. Loading the full graph might be the equivalent of loading the entire database into memory. How long would that take? Is it efficient? No. If you are building a file system explorer, will you load info about every file in the system before you start using it? It's much faster to load info about the current directory only (and probably its direct children).
Lazy loading doesn't always mean deferring loading until you really need the data. Loading might occur on a background thread before you actually need it; e.g. you might never scroll to the bottom of a web page to see the footer image. Lazy loading only means deferring. And C# enumerators can help you with that. Consider getting the list of files in a directory:
string[] files = Directory.GetFiles("D:");
IEnumerable<string> filesEnumerator = Directory.EnumerateFiles("D:");
The first approach returns an array of files. It means the directory has to get all its files and save their names into the array before you can get even the first file name. It's like loading all the images before you see the document.
The second approach uses an enumerator - it returns files one by one as you ask for the next file name. That means the enumerator is returned immediately, without getting all the files and saving them to some collection, and you can process the files one by one as you need them. Here, getting the file list is deferred.
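To make that concrete, here is a small sketch of my own (the drive letter is just a placeholder; assumes using System.IO and System.Linq) showing that the deferred version can hand you the first item without waiting for the whole scan:
// GetFiles blocks until every file name has been collected into the array.
string[] all = Directory.GetFiles("D:");
string firstFromArray = all[0];

// EnumerateFiles returns immediately; names are produced as you iterate,
// so asking only for the first one doesn't pay for the full scan.
string firstFromEnumerator = Directory.EnumerateFiles("D:").First();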
But you should be careful: if the underlying operation is not deferred, then returning an enumerator gives you no benefit. E.g.:
public IEnumerable<string> EnumerateFiles(string path)
{
    foreach (string file in Directory.GetFiles(path))
        yield return file;
}
Here you use the GetFiles method, which fills an array of file names before returning them, so yielding the files one by one gives you no speed benefit.
Btw, in your case you have exactly the same problem - the GetCustomAttributes extension internally uses the Attribute.GetCustomAttributes method, which returns an array of attributes, so you will not reduce the time it takes to get the first result.
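As an aside (this is my own sketch, not part of the answer above): if the reflection cost itself is what hurts, the attribute lookup can at least be done once per property and cached, since attributes don't change at runtime. The cache class and method names here are invented for illustration:
// requires: using System.Collections.Concurrent; using System.ComponentModel.DataAnnotations;
//           using System.Linq; using System.Reflection;
static class ValidationAttributeCache
{
    // Attributes are fixed at compile time, so caching them per property is safe.
    private static readonly ConcurrentDictionary<PropertyInfo, ValidationAttribute[]> _cache =
        new ConcurrentDictionary<PropertyInfo, ValidationAttribute[]>();

    public static ValidationAttribute[] Get(PropertyInfo property) =>
        _cache.GetOrAdd(property,
            p => p.GetCustomAttributes().OfType<ValidationAttribute>().ToArray());
}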
This isn't quite how the term "lazy loading" is generally used in .NET. "Lazy loading" most often describes something like:
public SomeType SomeValue
{
    get
    {
        if (_backingField == null)
            _backingField = RelativelyLengthyCalculationOrRetrieval();
        return _backingField;
    }
}
As opposed to just having _backingField set when an instance was constructed. Its advantage is that it costs nothing in the cases when SomeValue is never accessed, at the expense of a slightly greater cost when it is. It's therefore advantageous when the chances of SomeValue not being called are relatively high, and generally disadvantageous otherwise with some exceptions (when we might care about how quickly things are done in between instance creation and the first call to SomeValue).
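For reference (an aside of mine, not part of the original answer): .NET also ships System.Lazy<T>, which packages this pattern up and handles thread safety for you. A minimal sketch using the names from the snippet above, where SomeType, the calculation method, and the enclosing class are assumed to exist:
private readonly Lazy<SomeType> _lazyValue;

public MyClass()
{
    // The factory delegate runs at most once, on the first read of SomeValue.
    _lazyValue = new Lazy<SomeType>(RelativelyLengthyCalculationOrRetrieval);
}

public SomeType SomeValue => _lazyValue.Value;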
Here we have deferred execution. It's similar, but not quite the same. When you call GetPropertyErrors(property) rather than receiving a collection of all of the errors you receive an object that can find those errors when asked for them.
It will always save the time taken to get the first such item, because it allows you to act upon it immediately rather than waiting until it has finished processing.
It will always reduce memory use, because it isn't spending memory on a collection.
It will also save time in total, because no time is spent creating a collection.
However, if you need to access the results more than once, a collection will still hold the same results, while the deferred implementation will have to calculate them all again (unlike lazy loading, which computes its result once and stores it for subsequent reuse).
If you're rarely going to want to hit the same set of results, it's generally always a win.
If you're almost always going to want to hit the same set of results, it's generally a lose.
If you are sometimes going to want to hit the same set of results though, you can pass the decision on whether to cache or not up to the caller, with a single use calling GetPropertyErrors() and acting on the results directly, but a repeated use calling ToList() on that and then acting repeatedly on that list.
As such, the approach of not sending a list is the more flexible, allowing the calling code to decide which approach is the more efficient for its particular use of it.
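Roughly, that split looks like this (Log and ShowSummary are placeholder names of mine, just for illustration):
// Single use: act on the deferred sequence directly; nothing is buffered.
foreach (string error in GetPropertyErrors(property))
    Log(error);

// Repeated use: materialise once with ToList(), then reuse the list.
List<string> errors = GetPropertyErrors(property).ToList();
ShowSummary(errors.Count);
foreach (string error in errors)
    Log(error);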
You could also combine it with lazy loading:
// assumes a field on the class to hold the cached results
private IEnumerable<string> _store;

private IEnumerable<string> LazyLoadedEnumerator()
{
    if (_store == null)
        return StoringCalculatingEnumerator();
    return _store;
}

private IEnumerable<string> StoringCalculatingEnumerator()
{
    List<string> store = new List<string>();
    foreach (string str in SomethingThatCalculatesTheseStrings())
    {
        yield return str;
        store.Add(str);
    }
    // note: the cache is only set once a caller has enumerated all the way through
    _store = store;
}
This combination is rarely useful in practice though.
As a rule, start with deferred evaluation as the normal approach and decide further up the call chain whether to store the results or not. An exception is if you can know the size of the results before you begin (you can't here, because you don't know whether an element will be added until you've examined the property). In that case there is the possibility of a performance improvement in just how you create the list, because you can set its capacity ahead of time. This, though, is a micro-optimisation that only applies if you also know you'll always want to work on a list, and it doesn't save that much in the grand scheme of things.
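For completeness, that capacity micro-optimisation looks something like this (a sketch of mine for a hypothetical case where the count is known up front, which, as noted, isn't the case here; 'properties' stands for a PropertyInfo[] you already have):
// Pre-sizing the list avoids its internal array being grown and copied
// repeatedly as items are added.
var names = new List<string>(properties.Length);
foreach (PropertyInfo p in properties)
    names.Add(p.Name);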
Related
I am wondering whether a foreach loop works more slowly if an unstored query result is used as its source instead of a stored array or List.
I mean like this:
foreach (int number in list.OrderBy(x => x.Value))
{
    // DoSomething();
}
Does the loop in this code recalculate the sorting on every iteration or not?
The loop using a stored value:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>;
foreach (int number in list)
{
    // DoSomething();
}
And if it does, which code shows the better performance, storing the value or not?
This is often counter-intuitive, but generally speaking, the option that is best for performance is to wait as long as possible to materialize results into a concrete structure like a list or array. Please keep in mind that this is a generalization, and so there are plenty of cases where it doesn't hold. Nevertheless, the first instinct is better when you avoid creating the list for as long as possible.
To demonstrate with your sample, we have these two options:
var list = tours.OrderBy(x => x.Value).ToList();
foreach (int number in list)
{
    // DoSomething();
}
vs this option:
foreach (int number in list.OrderBy(x => x.Value))
{
    // DoSomething();
}
To understand what is going on here, you need to look at the .OrderBy() extension method. Reading the linked documentation, you'll see it returns an IOrderedEnumerable<TSource> object. With an IOrderedEnumerable, all of the sorting needed for the foreach loop is already finished when you first start iterating over the object (and that, I believe, is the crux of your question: No, it does not re-sort on each iteration). Also note that both samples use the same OrderBy() call. Therefore, both samples have the same problem to solve for ordering the results, and they accomplish it the same way, meaning they take exactly the same amount of time to reach that point in the code.
The difference in the code samples, then, is entirely in using the foreach loop directly vs first calling .ToList(), because in both cases we start from an IOrderedEnumerable. Let's look closely at those differences.
When you call .ToList(), what do you think happens? This method is not magic. There is still code here which must execute in order to produce the list. This code still effectively uses its own foreach loop that you can't see. Additionally, where once you only needed to worry about enough RAM to handle one object at a time, you are now forcing your program to allocate a new block of RAM large enough to hold references for the entire collection. Moving beyond references, you may also potentially need to create new memory allocations for the full objects, if you were reading from a stream or database reader before that really only needed one object in RAM at a time. This is an especially big deal on systems where memory is the primary constraint, which is often the case with web servers, where you may be serving and maintaining session RAM for many, many sessions, but each session only occasionally uses any CPU time to request a new page.
Now I am making one assumption here: that you are working with something that is not already a list. What I mean by this is that the previous paragraphs talked about needing to convert an IOrderedEnumerable into a List, but not about converting a List into some form of IEnumerable. I need to admit that there is some small overhead in creating and operating the state machine that .NET uses to implement those objects. However, I think this is a good assumption. It turns out to be true far more often than we realize. Even in the samples for this question, we're paying this cost regardless, by the simple virtue of calling the OrderBy() function.
In summary, there can be some additional overhead in using a raw IEnumerable vs converting to a List, but there probably isn't. Additionally, you are almost certainly saving yourself some RAM by avoiding the conversions to List whenever possible... potentially a lot of RAM.
Yes and no.
Yes the foreach statement will seem to work slower.
No your program has the same total amount of work to do so you will not be able to measure a difference from the outside.
What you need to focus on is not using a lazy operation (in this case OrderBy) multiple times without a .ToList() or .ToArray(). In this case you are only using it once (the foreach), but it is an easy thing to miss.
Edit: Just to be clear, the as statement in the question will not work as intended, but my answer assumes no .ToList() after OrderBy.
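To illustrate that pitfall (a sketch of my own; Tour, Value and tours are taken from the question):
IEnumerable<Tour> ordered = tours.OrderBy(x => x.Value);   // nothing sorted yet

foreach (Tour t in ordered) { /* ... */ }   // the sort happens here, on first enumeration
foreach (Tour t in ordered) { /* ... */ }   // enumerating again repeats the sort

List<Tour> orderedList = tours.OrderBy(x => x.Value).ToList();
foreach (Tour t in orderedList) { /* ... */ }   // sorted once, stored
foreach (Tour t in orderedList) { /* ... */ }   // no re-sort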
This line won't work as intended:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>; // Returns null.
Instead, you want to store the results this way:
List<Tour> list = tours.OrderBy(x => x.Value).ToList();
And yes, the second option (storing the results) will enumerate much faster as it will skip the sorting operation.
I am trying to optimize my code and was running VS performance monitor on it.
It shows that a simple assignment of a float takes up a major chunk of computing power?? I don't understand how that is possible.
Here is the code for TagData:
public class TagData
{
    public int tf;
    public float tf_idf;
}
So all I am really doing is:
float tag_tfidf = td.tf_idf;
I am confused.
I'll post another theory: it might be the cache miss of the first access to members of td. A memory load takes 100-200 cycles which in this case seems to amount to about 1/3 of the total duration of the method.
Points to test this theory:
Is your data set big? I bet it is.
Are you accessing the TagData objects in random memory order? I bet they are not sequential in memory. This makes the CPU's memory prefetcher ineffective.
Add a new line int dummy = td.tf; before the expensive line. This new line will now be the most expensive line because it will trigger the cache miss. Find some way to do a dummy load operation that the JIT does not optimize out. Maybe add all td.tf values to a local and pass that value to GC.KeepAlive at the end of the method. That should keep the memory load in the JIT-emitted x86.
I might be wrong but contrary to the other theories so far mine is testable.
Try making TagData a struct. That will make all items of term.tags sequential in memory and give you a nice performance boost.
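A sketch of that change (whether term.tags is a TagData[] or a List<TagData> is my assumption):
// As a struct, the elements of a TagData[] or List<TagData> are laid out
// inline, one after another, instead of as separate heap objects reached
// through references - much friendlier to the CPU's prefetcher.
public struct TagData
{
    public int tf;
    public float tf_idf;
}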
Are you using LINQ? If so, LINQ uses lazy enumeration so the first time you access the value you pulled out, it's going to be painful.
If you are using LINQ, call ToList() after your query to only pay the price once.
It also looks like your data structure is suboptimal, but since I don't have access to your source (and probably couldn't help even if I did :) ), I can't tell you what would be better.
EDIT: As commenters have pointed out, LINQ may not be to blame; however my question is based on the fact that both foreach statements are using IEnumerable. The TagData assignment is a pointer to the item in the collection of the IEnumerable (which may or may not have been enumerated yet). The first access of legitimate data is the line that pulls the property from the object. The first time this happens, it may be executing the entire LINQ statement, and since profiling uses the average, it may be off. The same can be said for tagScores (which I'm guessing is database-backed), whose first access is really slow and then speeds up. I wasn't pointing out the solution, just a possible problem, given my understanding of IEnumerable.
See http://odetocode.com/blogs/scott/archive/2008/10/01/lazy-linq-and-enumerable-objects.aspx
As we can see, the line right after the suspicious one takes only 0.6, i.e.
float tag_tfidf = td.tf_idf;//29.6
string tagName =...;//0.6
I suspect this is caused by the excessive number of calls; also note that float is a value type, meaning it is copied by value. So every time you assign it, the runtime creates a new float (Single) struct and initializes it by copying the value from td.tf_idf, which takes a long time.
You can see that string tagName =...; doesn't take much because it is copied by reference.
Edit: As the comments pointed out, I may be wrong in that respect; this might also be a bug in the profiler. Try re-profiling and see if that makes any difference.
Here is my sample code that I am using to fetch data from database:
on DAO layer:
public IEnumerable<IDataRecord> GetDATA(ICommonSearchCriteriaDto commonSearchCriteriaDto)
{
    using(DbContext)
    {
        DbDataReader reader = DbContext.GetReader("ABC_PACKAGE.GET_DATA", oracleParams.ToArray(), CommandType.StoredProcedure);
        while (reader.Read())
        {
            yield return reader;
        }
    }
}
On BO layer I am calling the above method like:
List<IGridDataDto> GridDataDtos = MapMultiple(_costDriversGraphDao.GetGraphData(commonSearchCriteriaDto)).ToList();
on the mapper layer, the MapMultiple method is defined like:
public IGridDataDto MapSingle(IDataRecord dataRecord)
{
    return new GridDataDto
    {
        Code = Convert.ToString(dataRecord["Code"]),
        Name = Convert.ToString(dataRecord["Name"]),
        Type = Convert.ToString(dataRecord["Type"])
    };
}

public IEnumerable<IGridDataDto> MapMultiple(IEnumerable<IDataRecord> dataRecords)
{
    return dataRecords.Select(MapSingle);
}
The above code is working well, but I am wondering about two concerns with it.
How long data reader’s connection will be opened?
When I consider code performance factor only, Is this a good idea to use ‘yield return’ instead of adding record into a list and returning the whole list?
your code doesn't show where you open/close the connection; but the reader here will actually only be open while you are iterating the data. Deferred execution, etc. The only bit of your code that does this is the .ToList(), so it'll be fine. In the more general case, yes: the reader will be open for the amount of time you take to iterate it; if you do a .ToList() that will be minimal; if you do a foreach and (for every item) make an external http request and wait 20 seconds, then yes - it will be open for longer.
Both have their uses; the non-buffered approach is great for huge results that you want to process as a stream, without ever having to load them into a single in-memory list (or even have all of them in memory at a time); returning a list keeps the connection closed quickly, and makes it easy to avoid accidentally using the connection while it already has an open reader, but is not ideal for large results
If you return an iterator block, the caller can decide what is sane; if you always return a list, they don't have much option. A third way (that we do in dapper) is to make the choice theirs; we have an optional bool parameter which defaults to "return a list", but which the caller can change to indicate "return an iterator block"; basically:
bool buffered = true
in the parameters, and:
var data = QueryInternal<T>(...blah...);
return buffered ? data.ToList() : data;
in the implementation. In most cases, returning a list is perfectly reasonable and avoids a lot of problems, hence we make that the default.
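Put together, the shape of it is roughly this (a sketch only; QueryInternal and the exact signature are placeholders of mine, not dapper's actual API; assumes using System.Collections.Generic and System.Linq):
public IEnumerable<T> Query<T>(string sql, object param = null, bool buffered = true)
{
    IEnumerable<T> data = QueryInternal<T>(sql, param);

    // Default: materialise so the reader and connection are released quickly.
    // Callers streaming a huge result can opt out with buffered: false.
    return buffered ? data.ToList() : data;
}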
How long data reader’s connection will be opened?
The connection will remain open until the reader is disposed, which means that it will be open until the iteration is over.
When I consider code performance factor only, Is this a good idea to use yield return instead of adding record into a list and returning the whole list?
This depends on several factors:
If you are not planning to fetch the entire result, yield return will help you save on the amount of data transferred on the network
If you are not planning to convert returned data to objects, or if multiple rows are used to create a single object, yield return will help you save on the memory used at the peak usage point of your program
If you plan to iterate the entire result set over a short period of time, there will be no performance penalties for using yield return. If the iteration is going to last for a significant amount of time on multiple concurrent threads, the limit on open cursors on the RDBMS side may be exceeded.
This answer ignores flaws in the shown implementation and covers the general idea.
It is a tradeoff - it is impossible to tell whether it is a good idea without knowing the constraints of your system: the amount of data you expect to get, the memory consumption you are willing to accept, the expected load on the database, etc.
We currently have a production application that runs as a windows service. Many times this application will end up in a loop that can take several hours to complete. We are using Entity Framework for .net 4.0 for our data access.
I'm looking for confirmation that if we load new data into the system, after this loop is initialized, it will not result in items being added to the loop itself. When the loop is initialized we are looking for data "as of" that moment. Although I'm relatively certain that this will work exactly like using ADO and doing a loop on the data (the loop only cycles through data that was present at the time of initialization), I am looking for confirmation for co-workers.
Thanks in advance for your help.
//update : here's some sample code in c# - question is the same, will the enumeration change if new items are added to the table that EF is querying?
IEnumerable<myobject> myobjects = (from o in db.theobjects where o.id == myID select o);
foreach (myobject obj in myobjects)
{
    // perform action on obj here
}
It depends on your precise implementation.
Once a query has been executed against the database then the results of the query will not change (assuming you aren't using lazy loading). To ensure this you can dispose of the context after retrieving query results--this effectively "cuts the cord" between the retrieved data and that database.
Lazy loading can result in a mix of "initial" and "new" data; however once the data has been retrieved it will become a fixed snapshot and not susceptible to updates.
You mention this is a long running process; which implies that there may be a very large amount of data involved. If you aren't able to fully retrieve all data to be processed (due to memory limitations, or other bottlenecks) then you likely can't ensure that you are working against the original data. The results are not fixed until a query is executed, and any updates prior to query execution will appear in results.
I think your best bet is to change the logic of your application so that when the "loop" logic is determining whether it should do another iteration or exit, you take the opportunity to load the newly added items into the list. See the pseudo code below:
var repo = new Repository();
while (repo.HasMoreItemsToProcess())
{
    var entity = repo.GetNextItem();
}
Let me know if this makes sense.
The easiest way to ensure that this happens - if the data itself isn't too big - is to convert the data you retrieve from the database to a List<>, e.g., something like this (pulled at random from my current project):
var sessionIds = room.Sessions.Select(s => s.SessionId).ToList();
And then iterate through the list, not through the IEnumerable<> that would otherwise be returned. Converting it to a list triggers the enumeration, and then throws all the results into memory.
If there's too much data to fit into memory, and you need to stick with an IEnumerable<>, then the answer to your question depends on various database and connection settings.
I'd take a snapshot of ID's to be processed -- quickly and as a transaction -- then work that list in the fashion you're doing today.
In addition to accomplishing the goal of not changing the sample mid-stream, this also gives you the ability to extend your solution to track status on each item as it's processed. For a long-running process, this can be very helpful for progress reporting restart / retry capabilities, etc.
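A rough outline of that approach (a sketch of mine; the key property ItemId is invented for illustration, and db.theobjects is taken from the question):
// 1. Snapshot the IDs "as of now" in one quick query.
List<int> idsToProcess = db.theobjects
    .Where(o => o.id == myID)
    .Select(o => o.ItemId)
    .ToList();

// 2. Work the fixed list; rows added later are simply not in it,
//    and you can record per-ID status for progress reporting or restart/retry.
foreach (int itemId in idsToProcess)
{
    var item = db.theobjects.Single(o => o.ItemId == itemId);
    // process item
}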
I have a class that has several properties that refer to file/directory locations on the local disk. These values can be dynamic, and I want to ensure that any time they are accessed, I verify that the location exists first, without having to include this code in every method that uses the values.
My question is: does putting this in the getter incur a performance penalty? It would not be called thousands of times in a loop, so that is not a consideration. I just want to make sure I am not doing something that would cause unnecessary bottlenecks.
I know that generally it is not wise to optimize too early, but I would rather have this error checking in place now than have to go back later, remove it from the getter, and add it all over the place.
Clarification:
The files/directories being pointed to by the properties are going to be used by System.Diagnostics.Process. I won't be reading/writing to these files/directories directly; I just want to make sure they exist before I spawn a child process.
Anything that's not a simple lookup or computation should go in a method, not a property. Properties should be conceptually similar to just accessing a field - if there is any additional overhead or chance of failure (and IO - even just checking a file exists - would fail that test on both counts) then properties are not the right choice.
Remember that properties even get called by the debugger when looking at object state.
Your question about what the overhead actually is, and optimising early, becomes irrelevant when looked at from this perspective. Hope this helps.
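A sketch of what that looks like in practice (class and member names are made up; assumes using System.IO):
public class ToolLocations
{
    public string ExePath { get; set; }

    // A method rather than a property: it touches the file system and can fail.
    public string GetVerifiedExePath()
    {
        if (!File.Exists(ExePath))
            throw new FileNotFoundException("Configured executable not found.", ExePath);
        return ExePath;
    }
}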
If you're that worried about performance (and you're right when you say that it's not a good idea to optimize too early), there are ways to mitigate this. If you consider that the expensive operation is the File I/O and you have lots of these going on, you could always look at using something like a Dictionary in your class. Consider this (fairly contrived) sample code:
private Dictionary<string, bool> _directories = new Dictionary<string, bool>();

private void CheckDirectory(string directory, bool create)
{
    // Only hit the file system the first time we see this directory.
    if (!_directories.ContainsKey(directory))
    {
        bool exists = Directory.Exists(directory);
        if (create && !exists)
        {
            Directory.CreateDirectory(directory);
        }
        // Add the directory to the dictionary. The value depends on
        // whether the directory previously existed or the method has been told
        // to create it.
        _directories.Add(directory, create || exists);
    }
}
It's a simple matter later on to add those directories that don't exist by iterating over this dictionary.
It is feasible for the path to exist at the point it is checked but be moved/deleted between that check and the operation on it.
You may already know this and accept the risk, but just so you are aware of it.
If you are going to do it anyway it doesn't matter whether it's in a property or not, just what granularity of checking you do (once per operation or once per group of operations)
If you use the non static FileInfo operations be aware that this object will cache its view on the file system.
This could be a good thing for you as you can control how often the cache is refreshed via the Refresh() method or it may lead to possible bugs in your code.
The usual try it first before worrying about performance recommendation applies but you indicate you are aware of this.
If you are reusing an object, you should consider using the FileInfo class vs the static File class. The static methods of the File class do a possibly unnecessary security check each time.
FileInfo - DirectoryInfo - File - Directory
EDIT:
My answer would still apply. To make sure your file exists you would do something like so in your getter:
if (File.Exists(fName))
    // do stuff
else
    // file doesn't exist
OR
FileInfo fi = new FileInfo(fName);
if (fi.Exists)
    // do stuff
else
    // file doesn't exist
Correct?
What I am saying is that if you are looping through this logic thousands of times, then use the FileInfo instance vs the static File class, because you will get a negative performance impact if you use the static File.Exists method.