Overhead of File/Directory.Exists in getter?

I have a class with several properties that refer to file/directory locations on the local disk. These values can be dynamic, and I want to ensure that any time they are accessed, I first verify that the location exists, without having to include this code in every method that uses the values.
My question is, does putting this in the getter incur a performance penalty? It would not be called thousands of times in a loop, so that is not a consideration. I just want to make sure I am not doing something that would cause unnecessary bottlenecks.
I know that it is generally unwise to optimize too early, but I would rather have this error checking in place now than have to go back later, remove it from the getter and add it all over the place.
Clarification:
The files/directories being pointed to by the properties are going to be used by System.Diagnostics.Process. I won't be reading from or writing to these files/directories directly; I just want to make sure they exist before I spawn a child process.

Anything that's not a simple lookup or computation should go in a method, not a property. Properties should be conceptually similar to just accessing a field - if there is any additional overhead or chance of failure (and IO - even just checking a file exists - would fail that test on both counts) then properties are not the right choice.
Remember that properties even get called by the debugger when looking at object state.
Your question about what the overhead actually is, and optimising early, becomes irrelevant when looked at from this perspective. Hope this helps.
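A minimal sketch of that advice, assuming a hypothetical ToolLauncher class (the names ToolLauncher, ToolPath and GetVerifiedToolPath are illustrative and not from the question). The property stays a plain field-like accessor, and the existence check moves into a method whose name tells callers that it does work and may fail:

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical class used only for illustration.
public class ToolLauncher
{
    // Plain property: no I/O, cannot fail, safe for the debugger to evaluate.
    public string ToolPath { get; set; }

    // The I/O check lives in a method, so callers can see it is not free.
    public string GetVerifiedToolPath()
    {
        if (string.IsNullOrEmpty(ToolPath) || !File.Exists(ToolPath))
            throw new FileNotFoundException("Tool executable not found.", ToolPath);
        return ToolPath;
    }

    public Process Launch(string arguments)
    {
        // The file can still disappear between the check and Process.Start,
        // so Process.Start may throw even after a successful check.
        return Process.Start(GetVerifiedToolPath(), arguments);
    }
}
```

Code that only wants to display or log the path can read ToolPath directly; code that is about to spawn the child process calls GetVerifiedToolPath() or Launch().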

If you're that worried about performance (and you're right when you say that it's not a good idea to optimize too early), there are ways to mitigate this. If you consider that the expensive operation is the File I/O and you have lots of these going on, you could always look at using something like a Dictionary in your class. Consider this (fairly contrived) sample code:
private Dictionary<string, bool> _directories = new Dictionary<string, bool>();
private void CheckDirectory(string directory, bool create)
{
if (!_directories.ContainsKey(directory))
{
bool exists = Directory.Exists(directory);
if (create && !exists)
{
Directory.CreateDirectory(directory);
}
// Add the directory to the dictionary. The value depends on
// whether the directory previously existed or the method has been told
// to create it.
_directories.Add(directory, create || exists);
}
}
It's a simple matter later on to add those directories that don't exist by iterating over this dictionary.

It is feasible for the path to exist at the point it is checked but be moved/deleted between the check and the operation on it.
You may already know this and accept the risk, but just so you are aware of it.
If you are going to do it anyway, it doesn't matter whether it's in a property or not, just what granularity of checking you do (once per operation or once per group of operations).
If you use the non static FileInfo operations be aware that this object will cache its view on the file system.
This could be a good thing for you as you can control how often the cache is refreshed via the Refresh() method or it may lead to possible bugs in your code.
The usual try it first before worrying about performance recommendation applies but you indicate you are aware of this.
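A small sketch of the caching behaviour described above, assuming a throwaway temp file and a hypothetical FileInfoCacheDemo helper. The file is deleted through the static File API, so the FileInfo instance is not told about the change:

```csharp
using System;
using System.IO;

// Hypothetical helper used only for illustration.
public static class FileInfoCacheDemo
{
    // Returns (cached, refreshed): the value of Exists before and after
    // Refresh(), once the file has been deleted behind the FileInfo's back.
    public static Tuple<bool, bool> Run()
    {
        string path = Path.GetTempFileName();   // creates an empty temp file
        var info = new FileInfo(path);

        // First access reads the file system and caches the state.
        if (!info.Exists)
            throw new InvalidOperationException("Temp file was not created.");

        File.Delete(path);                      // deleted via the static API

        bool cached = info.Exists;              // still the cached answer
        info.Refresh();                         // re-read the file system
        bool refreshed = info.Exists;

        return Tuple.Create(cached, refreshed);
    }
}
```

If a long-lived FileInfo is used for the existence check before spawning a process, calling Refresh() immediately before reading Exists avoids acting on a stale answer.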

If you are reusing an object, you should consider using the FileInfo class instead of the static File class. The static methods of the File class perform a possibly unnecessary security check each time.
FileInfo - DirectoryInfo - File - Directory
EDIT:
My answer would still apply. To make sure your file exists you would do something like so in your getter:
if (File.Exists(fName))
//do stuff
else
//file doesn't exist
OR
FileInfo fi = new FileInfo(fName);
if (fi.Exists)
//do stuff
else
//file doesn't exist
Correct?
What I am saying is that if you are looping through this logic thousands of times, then use a FileInfo instance instead of the static File class, because you will get a negative performance impact if you use the static File.Exists method.

Related

Are Private properties of a class called within a Parallel.Foreach body Thread Safe?

I am tasked with writing a system to process result files created by a different process (which I have no control over), and I am trying to modify my code to make use of Parallel.ForEach. The code works fine when just calling a foreach, but I have some concerns about thread safety when using the parallel version. The base question I need answered here is "Is the way I am doing this going to guarantee thread safety?" or is this going to cause everything to go sideways on me?
I have tried to make sure all calls are to instances and have removed every static anything except the initial static void Main. It is my current understanding that this will do a lot toward assuring thread safety.
I have basically the following, edited for brevity
static void Main(string[] args)
{
MyProcess process = new MyProcess();
process.DoThings();
}
And then in the actual process to do stuff I have
public class MyProcess
{
public void DoThings()
{
//Get some list of things
List<Thing> things = getThings();
Parallel.ForEach(things, item => {
//based on some criteria, take actions from MyActionClass
MyActionClass myAct = new MyActionClass(item);
string tempstring = myAct.DoOneThing();
if(somecondition)
{
myAct.DoOtherThing();
}
// ...other similar calls to myAct below here
});
}
}
And over in the MyActionClass I have something like the following:
public class MyActionClass
{
private Thing _thing;
public MyActionClass(Thing item)
{
_thing = item;
}
public string DoOneThing()
{
return _thing.GetSubThings().FirstOrDefault();
}
public void DoOtherThing()
{
_thing.property1 = "Somenewvalue";
}
}
If I can explain this any better I'll try, but I think that's the basics of my needs
EDIT:
Something else I just noticed. If I change the value of a property of the item I'm working with while inside the Parallel.ForEach (in this case, a string value that gets written to a database inside the loop), will that have any effect on the rest of the loop iterations or just the one I'm on? Would it be better to create a new instance of Thing inside the loop to store the item I'm working with in this case?

There is no shared mutable state between actions in the Parallel.ForEach that I can see, so it should be thread-safe, because at most one thread can touch one object at a time.
But as it has been mentioned there is nothing shared that can be seen. It doesn't mean that in the actual code you use everything is as good as it seems here.
Or that nothing will be changed by you or your coworker that will make some state both shared and mutable (in the Thing, for example), and now you start getting difficult to reproduce crashes at best or just plain wrong behaviour at worst that can be left undetected for a long time.
So, perhaps you should try to go fully immutable near threading code?
Perhaps.
Immutability is good, but it is not a silver bullet: it is not always easy to use and implement, and not every task can be reasonably expressed through immutable objects. And even then, that accidental "make shared and mutable" change may still happen, though it is much less likely.
It should at least be considered as a possible option/alternative.
About the EDIT
If I change the value of a property of the item I'm working with while
inside the Parallel.Foreach (in this case, a string value that gets
written to a database inside the loop), will that have any effect on
the rest of the loop iterations or just the one I'm on?
If you change a property and that object is not used anywhere else, and it doesn't rely on some global mutable state (for example, sort of a public static Int32 ChangesCount that increments with each state change), then you should be safe.
a string value that gets written to a database inside the loop - depending on the used data access technology and how you use it, you may be in trouble, because most of them are not designed for multithreaded environment, like EF DbContext, for example. And obviously do not forget that dealing with concurrent access in database is not always easy, though that is a bit away from our original theme.
Would it be better to create a new instance of Thing inside the loop to store the item I'm working with in this case - if there is no risk of external concurrent changes, then it is just unnecessary work. And if there is a chance of other threads (not those of Parallel.ForEach) making changes to the objects that are being persisted, then you already have bigger problems than Parallel.ForEach.
Objects should always have observable consistent state (unlike when half of properties set by one thread, and half by another, while you try to persist that who-knows-what), and if they are used by many threads, then they should be already thread-safe - there should be no way to put them into inconsistent state.
And if they want to be persisted by external code, such objects should probably provide:
Either SyncRoot property to synchronize property reading code.
Or some current state snapshot DTO that is created internally by some thread-safe method like ThingSnapshot Thing.GetCurrentData() { lock() {} }.
Or something more exotic.
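The snapshot option could look roughly like the following sketch. The members Update, Version and the private lock object are assumptions made for illustration; only the names Thing, ThingSnapshot and GetCurrentData come from the text above:

```csharp
using System;

// Immutable copy of a Thing's state at one moment in time.
public sealed class ThingSnapshot
{
    public ThingSnapshot(string property1, int version)
    {
        Property1 = property1;
        Version = version;
    }

    public string Property1 { get; private set; }
    public int Version { get; private set; }
}

public class Thing
{
    private readonly object _sync = new object();
    private string _property1 = "";
    private int _version;

    // All state changes happen under the lock, so no reader can
    // observe a half-updated object.
    public void Update(string value)
    {
        lock (_sync)
        {
            _property1 = value;
            _version++;
        }
    }

    // Persistence code works on the snapshot, never on the live object.
    public ThingSnapshot GetCurrentData()
    {
        lock (_sync)
        {
            return new ThingSnapshot(_property1, _version);
        }
    }
}
```

Code that writes to the database would call GetCurrentData() once and persist the returned object, which cannot change underneath it.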

Understanding lazy loading optimization in C#

After reading a bit about how yield, foreach, LINQ deferred execution and iterators work in C#, I decided to try optimizing an attribute-based validation mechanism inside a small project. The result:
private IEnumerable<string> GetPropertyErrors(PropertyInfo property)
{
// where Entity is the current object instance
string propertyValue = property.GetValue(Entity)?.ToString();
foreach (var attribute in property.GetCustomAttributes().OfType<ValidationAttribute>())
{
if (!attribute.IsValid(propertyValue))
{
yield return $"Error: {property.Name} {attribute.ErrorMessage}";
}
}
}
// inside another method
foreach(string error in GetPropertyErrors(property))
{
// Some display/insert log operation
}
I find this slow but that also could be due to reflection or a large amount of properties to process.
So my question is... Is this optimal or a good use of the lazy loading mechanic? or I'm missing something and just wasting tons of resources.
NOTE: The code intention itself is not important, my concern is the use of lazy loading in it.

Lazy loading is not specific to C# or to Entity Framework. It is a common pattern that lets you defer loading some data, that is, not load it immediately. Some examples of when you need that:
Loading images in a (Word) document. The document may be big and can contain thousands of images. If you load all of them when the document is opened, it might take a long time. Nobody wants to sit and watch a document load for 30 seconds. The same approach is used in web browsers - resources are not sent with the body of the page. The browser defers resource loading.
Loading graphs of objects. These may be objects from a database, file system objects, etc. Loading the full graph might be equal to loading all the database content into memory. How long will it take? Is it efficient? No. If you are building a file system explorer, will you load info about every file in the system before you start using it? It's much faster to load info about the current directory only (and probably its direct children).
Lazy loading does not always mean deferring loading until you really need the data. Loading might occur in a background thread before you really need that data. E.g. you might never scroll to the bottom of a web page to see the footer image. Lazy loading means only deferring. And C# enumerators can help you with that. Consider getting the list of files in a directory:
string[] files = Directory.GetFiles("D:");
IEnumerable<string> filesEnumerator = Directory.EnumerateFiles("D:");
The first approach returns an array of files. It means the directory has to get all its files and save their names to an array before you can get even the first file name. It's like loading all images before you see the document.
The second approach uses an enumerator - it returns files one by one when you ask for the next file name. It means that the enumerator is returned immediately, without getting all files and saving them to some collection. And you can process files one by one when you need to. Here getting the file list is deferred.
But you should be careful. If the underlying operation is not deferred, then returning an enumerator gives you no benefit. E.g.
public IEnumerable<string> EnumerateFiles(string path)
{
foreach(string file in Directory.GetFiles(path))
yield return file;
}
Here you use the GetFiles method, which fills an array of file names before returning them. So yielding files one by one gives you no speed benefit.
By the way, in your case you have exactly the same problem - the GetCustomAttributes extension internally uses the Attribute.GetCustomAttributes method, which returns an array of attributes. So you will not reduce the time to get the first result.

This isn't quite how the term "lazy loading" is generally used in .NET. "Lazy loading" is most often used of something like:
public SomeType SomeValue
{
get
{
if (_backingField == null)
_backingField = RelativelyLengthyCalculationOrRetrieval();
return _backingField;
}
}
As opposed to just having _backingField set when an instance was constructed. Its advantage is that it costs nothing in the cases when SomeValue is never accessed, at the expense of a slightly greater cost when it is. It's therefore advantageous when the chances of SomeValue not being called are relatively high, and generally disadvantageous otherwise with some exceptions (when we might care about how quickly things are done in between instance creation and the first call to SomeValue).
Here we have deferred execution. It's similar, but not quite the same. When you call GetPropertyErrors(property) rather than receiving a collection of all of the errors you receive an object that can find those errors when asked for them.
It will always save the time taken to get the first such item, because it allows you to act upon it immediately rather than waiting until it has finished processing.
It will always reduce memory use, because it isn't spending memory on a collection.
It will also save time in total, because no time is spent creating a collection.
However, if you need to access the results more than once, a collection will still hold the same results, whereas the deferred enumerable will have to calculate them all again (unlike lazy loading, which loads its results and stores them for subsequent reuse).
If you're rarely going to want to hit the same set of results, it's generally always a win.
If you're almost always going to want to hit the same set of results, it's generally a lose.
If you are sometimes going to want to hit the same set of results though, you can pass the decision on whether to cache or not up to the caller, with a single use calling GetPropertyErrors() and acting on the results directly, but a repeated use calling ToList() on that and then acting repeatedly on that list.
As such, the approach of not sending a list is the more flexible, allowing the calling code to decide which approach is the more efficient for its particular use of it.
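To make the trade-off concrete, here is a small sketch that counts how often the iterator body runs. GetErrors is a hypothetical stand-in for GetPropertyErrors, and the Evaluations counter exists only for the demonstration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    // Counts how many items the iterator body has examined.
    public static int Evaluations;

    // Hypothetical stand-in for GetPropertyErrors: yields one message
    // per negative value, computing each only when it is asked for.
    public static IEnumerable<string> GetErrors(IEnumerable<int> values)
    {
        foreach (int value in values)
        {
            Evaluations++;
            if (value < 0)
                yield return "Error: " + value + " is negative";
        }
    }

    public static int[] Run()
    {
        Evaluations = 0;
        IEnumerable<string> deferred = GetErrors(new[] { 1, -2, -3 });
        int afterCall = Evaluations;        // nothing has run yet

        deferred.Count();                   // first pass examines 3 items
        deferred.Count();                   // second pass examines them again
        int afterTwoPasses = Evaluations;

        List<string> cached = deferred.ToList();   // one more pass, then stored
        cached.Count();                     // reads the list, no re-evaluation
        cached.Count();
        int afterCaching = Evaluations;

        return new[] { afterCall, afterTwoPasses, afterCaching };
    }
}
```

Run() returns 0, 6 and 9: calling the iterator method does no work, each enumeration of the deferred sequence repeats the work, and the list produced by ToList() can be reused for free.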
You could also combine it with lazy loading:
private IEnumerable<string> LazyLoadedEnumerator()
{
if (_store == null)
return StoringCalculatingEnumerator();
return _store;
}
private IEnumerable<string> StoringCalculatingEnumerator()
{
List<string> store = new List<string>();
foreach(string str in SomethingThatCalculatesTheseStrings())
{
yield return str;
store.Add(str);
}
_store = store;
}
This combination is rarely useful in practice though.
As a rule, start with deferred evaluation as the normal approach and decide further up the call chain whether to store the results or not. An exception though is if you can know the size of the results before you begin (you can't here because you don't know if an element will be added or not until you've examined the property). In this case there is the possibility of a performance improvement in just how you create that list, because you can set its capacity ahead of time. This though is a micro-optimisation that is only applicable if you also know that you'll also always want to work on a list and doesn't save that much in the grand scheme of things.

Partially thread-safe dictionary

I have a class that maintains a private Dictionary instance that caches some data.
The class writes to the dictionary from multiple threads using a ReaderWriterLockSlim.
I want to expose the dictionary's values outside the class.
What is a thread-safe way of doing that?
Right now, I have the following:
public ReadOnlyCollection<MyClass> Values() {
using (sync.ReadLock())
return new ReadOnlyCollection<MyClass>(cache.Values.ToArray());
}
Is there a way to do this without copying the collection many times?
I'm using .Net 3.5 (not 4.0)

I want to expose the dictionary's values outside the class.
What is a thread-safe way of doing that?
You have three choices.
1) Make a copy of the data, hand out the copy. Pros: no worries about thread safe access to the data. Cons: Client gets a copy of out-of-date data, not fresh up-to-date data. Also, copying is expensive.
2) Hand out an object that locks the underlying collection when it is read from. You'll have to write your own read-only collection that has a reference to the lock of the "parent" collection. Design both objects carefully so that deadlocks are impossible. Pros: "just works" from the client's perspective; they get up-to-date data without having to worry about locking. Cons: More work for you.
3) Punt the problem to the client. Expose the lock, and make it a requirement that clients lock all views on the data themselves before using it. Pros: No work for you. Cons: Way more work for the client, work they might not be willing or able to do. Risk of deadlocks, etc, now become the client's problem, not your problem.

If you want a snapshot of the current state of the dictionary, there's really nothing else you can do with this collection type. This is the same technique used by the ConcurrentDictionary<TKey, TValue>.Values property.
If you don't mind throwing an InvalidOperationException if the collection is modified while you are enumerating it, you could just return cache.Values since it's readonly (and thus can't corrupt the dictionary data).

EDIT: I personally believe the below code is technically answering your question correctly (as in, it provides a way to enumerate over the values in a collection without creating a copy). Some developers far more reputable than I strongly advise against this approach, for reasons they have explained in their edits/comments. In short: This is apparently a bad idea. Therefore I'm leaving the answer but suggesting you not use it.
Unless I'm missing something, I believe you could expose your values as an IEnumerable<MyClass> without needing to copy values by using the yield keyword:
public IEnumerable<MyClass> Values {
get {
using (sync.ReadLock()) {
foreach (MyClass value in cache.Values)
yield return value;
}
}
}
Be aware, however (and I'm guessing you already knew this), that this approach provides lazy evaluation, which means that the Values property as implemented above can not be treated as providing a snapshot.
In other words... well, take a look at this code (I am of course guessing as to some of the details of this class of yours):
var d = new ThreadSafeDictionary<string, string>();
// d is empty right now
IEnumerable<string> values = d.Values;
d.Add("someKey", "someValue");
// if values were a snapshot, this would output nothing...
// but in FACT, since it is lazily evaluated, it will now have
// what is CURRENTLY in d.Values ("someValue")
foreach (string s in values) {
Console.WriteLine(s);
}
So if it's a requirement that this Values property be equivalent to a snapshot of what is in cache at the time the property is accessed, then you're going to have to make a copy.
(begin 280Z28): The following is an example of how someone unfamiliar with the "C# way of doing things" could leave the lock held, because the enumerator is never disposed or run to completion:
IEnumerator<MyClass> enumerator = obj.Values.GetEnumerator();
MyClass first = null;
if (enumerator.MoveNext())
first = enumerator.Current;
(end 280Z28)

Consider another possibility: expose only the ICollection interface, so that Values() can return your own implementation. This implementation would hold only a reference to Dictionary.Values and always take the ReadLock when accessing items.

Should I check whether particular key is present in Dictionary before accessing it?

Should I check whether particular key is present in Dictionary if I am sure it will be added in dictionary by the time I reach the code to access it?
There are two ways I can access the value in dictionary
checking ContainsKey method. If it returns true then I access using indexer [key] of dictionary object.
or
TryGetValue which will return true or false as well as return value through out parameter.
(2nd will perform better than 1st if I want to get value. Benchmark.)
However if I am sure that the function which is accessing global dictionary will surely have the key then should I still check using TryGetValue or without checking I should use indexer[].
Or I should never assume that and always check?

Use the indexer if the key is meant to be present - if it's not present, it will throw an appropriate exception, which is the right behaviour if the absence of the key indicates a bug.
If it's valid for the key not to be present, use TryGetValue instead and react accordingly.
(Also apply Marc's advice about accessing a shared dictionary safely.)
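The two cases can be sketched side by side. LookupDemo and its method names are hypothetical, chosen only to illustrate the distinction:

```csharp
using System;
using System.Collections.Generic;

public static class LookupDemo
{
    // The key is meant to be present: a missing key is a bug, so let the
    // indexer throw KeyNotFoundException and surface it.
    public static string Required(Dictionary<string, string> settings, string key)
    {
        return settings[key];
    }

    // The key may legitimately be absent: one lookup, no exception.
    public static string Optional(Dictionary<string, string> settings, string key, string fallback)
    {
        string value;
        return settings.TryGetValue(key, out value) ? value : fallback;
    }
}
```

Neither method locks, so with a shared dictionary the call would still need to sit inside whatever synchronization the rest of the code uses.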

If the dictionary is global (static/shared), you should be synchronizing access to it (this is important; otherwise you can corrupt it).
Even if your thread is only reading data, it needs to respect the locks of other threads that might be editing it.
However; if you are sure that the item is there, the indexer should be fine:
Foo foo;
lock(syncLock) {
foo = data[key];
}
// use foo...
Otherwise, a useful pattern is to check and add in the same lock:
Foo foo;
lock(syncLock) {
if(!data.TryGetValue(key, out foo)) {
foo = new Foo(key);
data.Add(key, foo);
}
}
// use foo...
Here we only add the item if it wasn't there... but inside the same lock.

Always check. Never say never. I assume your application is not that performance critical that you will have to save the checking time.
TIP: If you decide not to check, at least use Debug.Assert( dict.ContainsKey( key ) ); This will only be compiled when in Debug mode, your release build will not contain it. That way you could at least have the check when debugging.
Still: if possible, just check it :-)
EDIT: There have been some misconceptions here. By "always check" I did not only mean using an if somewhere. Handling an exception properly was also included in this. So, to be more precise: never take anything for granted, expect the unexpected. Check by ContainsKey or handle the potential exception, but do SOMETHING in case the element is not contained.

Personally I'd check that the key is there, regardless of whether or not you are SURE it is. Some may say this check is superfluous and that the dictionary will throw an exception which you can catch, but IMHO you should not rely on that exception. You should check yourself and then either throw your own exception that means something, or return a result object with a success flag and reason inside... the failure mechanism is really implementation dependent.

Surely the answer is "it all depends on the situation". You need to balance the risk that the key will be missing from the dictionary (low for small systems where there is limited access to the data, where you can rely on the order things are done, larger for larger systems, multiple programmers accessing the same data, especially with read/write/delete access, where threads are involved and order cannot be guaranteed or where data originates externally and reading can fail) with the impact of the risk (safety-critical systems, commercial releases or systems that a business will rely on compared with something made for fun, for a one-off job and/or for your use only) and with any requirements for speed, size and laziness.
If I were making a system to control railway signalling I would want to be safe against all possible and impossible errors, and safe from errors in the error-handling and so on (Murphy's 2nd law: "what can't go wrong will go wrong".) If I'm chucking stuff together for fun, even if size and speed are not an issue I will be MUCH more relaxed about stuff like this - I will want to get to the fun stuff.
Of course, sometimes this is the fun stuff in itself.

TryGetValue is the same code as indexing it by key, except the former returns a default value (for the out parameter) where the latter throws an exception. Use TryGetValue and you'll get consistent checks with absolutely no performance loss.
Edit: As Jon said, if you know it will always have the key, then you can index it and let it throw the appropriate exception. However, if you can provide better context information by throwing it yourself with a detailed message, that would be preferable.

There's 2 trains of thought on this from a performance point of view.
1) Avoid exceptions where possible, as exceptions are expensive - i.e. check before you try to retrieve a specific key from the dictionary, whether it exists or not. Better approach in my opinion if there's a fair chance it may not exist. This would prevent fairly common exceptions.
2) If you're confident the item will exist in there 99% of the time, then don't check for its existence before accessing it. The 1% of times when it doesn't exist, an exception will be thrown, but you've saved time for the other 99% of the time by not checking.
What I'm saying is, optimise for the majority if there is a clear one. If there is any real degree of uncertainty about an item existing, then check before retrieving.

If you know that the dictionary normally contains the key, you don't have to check for it before accessing it.
If something would be wrong and the dictionary doesn't contain the items that it should, you can let the dictionary throw the exception. The only reason for checking for the key first would be if you want to take care of this problem situation yourself without getting the exception. Letting the dictionary throw the exception and catch that is however a perfectly valid way of handling the situation.

I think Marc and Jon have it (as usual) pretty well sewn up. Since you also mention performance in your question, it might be worth considering how you lock the dictionary.
The straightforward lock serialises all read access which may not be desirable if read is massively frequent and writes are relatively few. In that case using a ReaderWriterLockSlim might be better. The downside is the code is a little more complex and writes are slightly slower.
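A minimal sketch of that alternative, assuming a hypothetical SharedCache wrapper around the dictionary. Many readers can hold the read lock at once, while a writer gets exclusive access:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical wrapper used only for illustration.
public class SharedCache
{
    private readonly Dictionary<string, int> _data = new Dictionary<string, int>();
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();

    // Readers do not block each other, only writers.
    public bool TryGet(string key, out int value)
    {
        _lock.EnterReadLock();
        try { return _data.TryGetValue(key, out value); }
        finally { _lock.ExitReadLock(); }
    }

    // A writer waits for all readers to leave, then has exclusive access.
    public void Set(string key, int value)
    {
        _lock.EnterWriteLock();
        try { _data[key] = value; }
        finally { _lock.ExitWriteLock(); }
    }
}
```

The try/finally pairs are the extra complexity mentioned above: forgetting to exit a lock on an exception path would block every later writer.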

Is it true I should not do "long running" things in a property accessor?

And if so, why?
and what constitutes "long running"?
Doing magic in a property accessor seems like my prerogative as a class designer. I always thought that is why the designers of C# put those things in there - so I could do what I want.
Of course it's good practice to minimize surprises for users of a class, and so embedding truly long running things - eg, a 10-minute monte carlo analysis - in a method makes sense.
But suppose a prop accessor requires a db read. I already have the db connection open. Would db access code be "acceptable", within the normal expectations, in a property accessor?

Like you mentioned, it's a surprise for the user of the class. People are used to being able to do things like this with properties (contrived example follows:)
foreach (var item in bunchOfItems)
foreach (var slot in someCollection)
slot.Value = item.Value;
This looks very natural, but if item.Value actually is hitting the database every time you access it, it would be a minor disaster, and should be written in a fashion equivalent to this:
foreach (var item in bunchOfItems)
{
var temp = item.Value;
foreach (var slot in someCollection)
slot.Value = temp;
}
Please help steer people using your code away from hidden dangers like this, and put slow things in methods so people know that they're slow.
There are some exceptions, of course. Lazy-loading is fine as long as the lazy load isn't going to take some insanely long amount of time, and sometimes making things properties is really useful for reflection- and data-binding-related reasons, so maybe you'll want to bend this rule. But there's not much sense in violating the convention and violating people's expectations without some specific reason for doing so.

In addition to the good answers already posted, I'll add that the debugger automatically displays the values of properties when you inspect an instance of a class. Do you really want to be debugging your code and have database fetches happening in the debugger every time you inspect your class? Be nice to the future maintainers of your code and don't do that.
Also, this question is extensively discussed in the Framework Design Guidelines; consider picking up a copy.

A db read in a property accessor would be fine - that's actually the whole point of lazy-loading. I think the most important thing would be to document it well so that users of the class understand that there might be a performance hit when accessing that property.

You can do whatever you want, but you should keep the consumers of your API in mind. Accessors and mutators (getters and setters) are expected to be very light weight. With that expectation, developers consuming your API might make frequent and chatty calls to these properties. If you are consuming external resources in your implementation, there might be an unexpected bottleneck.
For consistency sake, it's good to stick with convention for public APIs. If your implementations will be exclusively private, then there's probably no harm (other than an inconsistent approach to solving problems privately versus publicly).

It is just "good practice" not to write property accessors that take a long time to execute.
That's because properties look like fields to the caller, and hence the caller (that is, a user of your API) usually assumes there is nothing more than a "return smth;" behind them.
If you really need some "action" behind the scenes, consider creating a method for that...

I don't see what the problem is with that, as long as you provide XML documentation so that the Intellisense notifies the object's consumer of what they're getting themselves into.
I think this is one of those situations where there is no one right answer. My motto is "Saying always is almost always wrong." You should do what makes the most sense in any given situation without regard to broad generalizations.

A database access in a property getter is fine, but try to limit the amount of times the database is hit through caching the value.
There are many times that people use properties in loops without thinking about the performance, so you have to anticipate this use. Programmers don't always store the value of a property when they are going to use it many times.
Cache the value returned from the database in a private variable, if it is feasible for this piece of data. This way the accesses are usually very quick.
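One way to sketch this caching is with Lazy<T> (available from .NET 4.0). UserProfile, DisplayName and the loader delegate are hypothetical, and the delegate stands in for the real database call; LoadCount exists only to show how often the loader runs:

```csharp
using System;

// Hypothetical class used only for illustration.
public class UserProfile
{
    private readonly Lazy<string> _displayName;

    // Number of times the (expensive) loader has actually run.
    public int LoadCount { get; private set; }

    // loadDisplayName stands in for the real database query.
    public UserProfile(Func<string> loadDisplayName)
    {
        _displayName = new Lazy<string>(() =>
        {
            LoadCount++;
            return loadDisplayName();
        });
    }

    // The first access runs the loader; later accesses return the stored value.
    public string DisplayName
    {
        get { return _displayName.Value; }
    }
}
```

With this shape, a caller that reads DisplayName inside a loop pays for the database read once rather than on every iteration.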

This isn't directly related to your question, but have you considered going with a load once approach in combination with a refresh parameter?
class Example
{
private bool userNameLoaded = false;
private string userName = "";
public string UserName(bool refresh)
{
// Only force a reload when asked; never mark unloaded data as loaded.
if (refresh) userNameLoaded = false;
return UserName();
}
public string UserName()
{
if (!userNameLoaded)
{
/*
userName=SomeDBMethod();
*/
userNameLoaded = true;
}
return userName;
}
}
